ISO 10646 character code

It depends on the physical keyboard what the key cap texts are, and it depends on the keyboard settings whether the keys have the same effect or different effects. The name "Alt Gr" for "right Alt" is short for "alternate graphic", and it's mostly used to create additional characters, whereas left "Alt" is typically used for keyboard access to menus. The last method above could often be called "device dependent" rather than program specific, since the program that performs the conversion might be a keyboard driver.

In that case, normal programs would have all their input from the keyboard processed that way. Such an auxiliary key is often called a dead key, since just pressing it causes nothing; it works only in combination with some other key.

A more official name for a dead key is modifier key. My keyboard has two keys for such purposes. If I try to enter that way a character outside the ISO Latin 1 repertoire, I get just the diacritic as a separate character followed by the normal character, e. Your situation may well be different, in part or entirely. For example, a typical French keyboard has separate keys for those accented letters that are used in French e.
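The dead key behavior described above can be sketched in a few lines. This is only an illustration under assumptions of mine: the table `DEAD_KEYS` and the function `dead_key` are hypothetical names, and a real keyboard driver works with its own tables rather than with Unicode normalization. The sketch composes the diacritic with the next key and falls back to "diacritic followed by the normal character" when the result is outside the ISO Latin 1 repertoire.

```python
import unicodedata

# Hypothetical table: the spacing diacritic a dead key stands for,
# mapped to the corresponding Unicode combining mark.
DEAD_KEYS = {
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u00a8": "\u0308",  # diaeresis
    "~": "\u0303",       # tilde
}


def dead_key(diacritic, base, repertoire="iso-8859-1"):
    """Combine a dead key with the next key press, if the repertoire allows it."""
    composed = unicodedata.normalize("NFC", base + DEAD_KEYS[diacritic])
    if len(composed) != 1:
        # No precomposed character exists at all.
        return diacritic + base
    try:
        composed.encode(repertoire)
    except UnicodeEncodeError:
        # The character exists but is outside the repertoire:
        # emit the diacritic as a separate character, then the letter.
        return diacritic + base
    return composed


inside = dead_key("\u00b4", "e")   # e with acute is in ISO Latin 1
outside = dead_key("\u00b4", "z")  # z with acute is not
print(inside, outside)
```
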

It is often possible to use various "escape" notations for characters. This rather vague term means notations which are afterwards converted to or just displayed as characters according to some specific rules by some programs.

They depend on the markup, programming, or other language in a broad but technical meaning for "language", so that data formats can be included but human languages are excluded. If different languages have similar conventions in this respect, a language designer may have picked up a notation from an existing language, or it might be a coincidence. The phrase "escape notations" or even "escapes" for short is rather widespread, and it reflects the general idea of escaping from the limitations of a character repertoire or device or protocol or something else.

So it's used here, although a name like meta notations might be better. In TeX systems, there are different ways of producing characters, possibly depending on the "packages" used. In the official HTML terminology, such notations are called entity references denoting characters.

It depends on the HTML version which entities are defined, and it depends on the browser which entities are actually supported. This is actually very simple: you specify a character by its index position, number. A most essential point is that for HTML, the "document character set" is completely independent of the encoding of the document!
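As a small illustration of these notations, the following sketch uses Python's standard `html.unescape` function to resolve an entity reference and numeric character references. The sample string is my own. Note that the file itself contains only ASCII characters, and the number in a reference is the same whatever encoding the document is stored in.

```python
import html

# Three notations for the same character, small a with diaeresis.
by_name = html.unescape("&auml;")      # entity reference
by_decimal = html.unescape("&#228;")   # character reference, decimal
by_hex = html.unescape("&#xE4;")       # character reference, hexadecimal

# The document itself needs nothing outside ASCII.
source = "T&#228;m&#228; on testi"
text = html.unescape(source)

# The reference means the same thing however the document is encoded.
for encoding in ("ascii", "utf-8", "utf-16"):
    stored = source.encode(encoding)
    assert html.unescape(stored.decode(encoding)) == text

print(text)
```
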

See Alan J. Flavell's Notes on Internationalization. Although both entity and character references are markup, to be used in markup languages, they are often replaced by the corresponding characters when a user types text on an Internet discussion forum.

This might be a conscious decision by the forum designer, but quite often it is caused unintentionally. As you can see, the notations typically involve some semi-mnemonic name or the code number of the character, in some number system. And there is some method of indicating that the letters or digits are not to be taken as such but as part of a special notation denoting a character. In cases like these, the character itself does not occur in a file such as an HTML document or a C source program.

Instead, the file contains the "escape" notation as a character sequence, which will then be interpreted in a specific way by programs like a Web browser or a C compiler. One can in a sense regard the "escape notations" as encodings used in specific contexts upon specific agreements. There are also "escape notations" which are to be interpreted by human readers directly. The reader is assumed to understand that e. Quite often the purpose is to use ASCII characters only, so that the typing, transmission, and display of the characters is "safe".
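The same idea can be seen in a programming language. The sketch below uses Python's own escape notations as an example (the text above mentions C; Python is just my choice for illustration). The variable `source` holds what would actually be in the file, a plain ASCII character sequence, and the `unicode_escape` codec plays the role of the program that interprets it.

```python
# What the file contains: a backslash, the letter u, and four hex digits,
# all of them ordinary ASCII characters.
source = r"T\u00e4m\u00e4"

# What the interpreting program makes of it.
interpreted = source.encode("ascii").decode("unicode_escape")

print(len(source), len(interpreted))

# Several notations may denote one and the same character.
same = {"\xe4", "\u00e4", "\N{LATIN SMALL LETTER A WITH DIAERESIS}"}
```
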

Such usage is based on special though often implicit conventions and can cause a lot of confusion when there is no mutual agreement on the conventions, especially because there are so many of them.

There is an old proposal by K. They are very short, typically two characters, e. Naturally there's the problem that the reader must know whether e. So the system is primarily for referring to characters (see below), but under suitable circumstances it could also be used for actually writing texts, when the ambiguities can somehow be removed by additional conventions or by context. RFC cannot be regarded as official or widely known, but if you need, for some applications, an "escape scheme", you might consider using those notations instead of reinventing the wheel.

There are also various ways to identify a character when it cannot be used as such or when the appearance of a character is not sufficient identification. This might be regarded as a variant of the "escape notations for human readers" discussed above, but the pragmatic view is different here.

We are not primarily interested in using characters in running text but in specifying which character is being discussed.

Information about encoding

The need for information about encoding

It is hopefully obvious from the preceding discussion that a sequence of octets can be interpreted in a multitude of ways when processed as character data.

By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated.
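To make this concrete, here is a small sketch that interprets the same two octets according to four different encodings. The octet values and the list of encodings are my own picks for illustration; the helper name `interpretations` is hypothetical.

```python
def interpretations(octets, encodings):
    """Decode the same octets under each encoding; return {encoding: text}."""
    return {encoding: octets.decode(encoding) for encoding in encodings}


octets = bytes([0xC3, 0xA4])
result = interpretations(octets, ["utf-8", "iso-8859-1", "cp437", "utf-16-le"])

for encoding, text in result.items():
    # Show how many characters came out, and which ones.
    codes = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{encoding:12} {len(text)} character(s): {codes}")
```

Nothing in the octets themselves says which of the results is the intended one: one encoding sees a single character, another sees two, and a third sees a single character of a completely different kind.
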

Sometimes one can guess the encoding, but data processing and transfer shouldn't be guesswork. Naturally, a sequence of octets could be intended to present other than character data, too.

It could be an image in a bitmap format, or a computer program in binary form, or numeric data in the internal format used in computers. This problem can be handled in different ways in different systems when data is stored and processed within one computer system. For data transmission, a platform-independent method of specifying the general format and the encoding and other relevant information is needed.

Such methods exist, although they are not always used widely enough. People still send each other data without specifying the encoding, and this may cause a lot of harm. Attaching a human-readable note, such as a few words of explanation in an E-mail message body, is better than nothing. But since data is processed by programs which cannot understand such notes, the encoding should be specified in a standardized computer-readable form.

Internet media types, often called MIME media types, can be used to specify a major media type "top level media type", such as text, a subtype such as html, and an encoding such as iso They can be and should be used for specifying the encoding when data is sent over a network, e. The media type concept is defined in RFC The technical term used to denote a character encoding in the Internet media type context is "character set", abbreviated "charset".

This has caused a lot of confusion, since "set" can easily be understood as repertoire! The official registry of "charset" i. I have composed a tabular presentation of the registry , ordered alphabetically by "charset" name and accompanied with some hypertext references. Several character encodings have alternate alias names in the registry. The Content-Type information is an example of information in a header. Headers relate to some data, describing its presentation and other things, but are passed as logically separate from it.
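As a sketch of how a program might read such information, the following uses Python's standard `email.message.Message` class to split a Content-Type header into its parts. The header value, with `iso-8859-1` as the "charset", is an example value I have assumed. The second half illustrates the point about alias names using Python's own codec registry, which is not the official registry but follows the same idea.

```python
import codecs
from email.message import Message

# A header as it might accompany a Web document (assumed example value).
header = Message()
header["Content-Type"] = "text/html; charset=iso-8859-1"

media_type = header.get_content_type()     # major type and subtype
major = header.get_content_maintype()
subtype = header.get_content_subtype()
charset = header.get_content_charset()     # the encoding, despite the name

print(media_type, major, subtype, charset)

# Several names may denote one and the same encoding.
aliases = ["iso-8859-1", "latin1", "latin-1", "ISO_8859-1"]
canonical = {codecs.lookup(name).name for name in aliases}
print(canonical)
```
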

Adequate headers should normally be generated automatically by the software which sends the data such as a program for sending E-mail, or a Web server and interpreted automatically by receiving software such as a program for reading E-mail, or a Web browser. In E-mail messages, headers precede the message body; it depends on the E-mail program whether and how it displays the headers.

For Web documents, a Web server is required to send headers when it delivers a document to a browser or other user agent which has sent a request for the document. In addition to media types and character encodings, MIME addresses several other aspects too. This implies that the sequence of octets becomes longer, and if it is read as an ASCII string, it can be incomprehensible to humans.

But what is gained is robustness in data transfer, since the encoding uses only "safe" ASCII characters which will most probably get through any component in the transfer unmodified. Notice that encoding ISO data this way means that the character code is the one specified by the ISO standard, whereas the character encoding is different from the one specified or at least suggested in that standard.

Since QP only specifies the mapping of a sequence of octets to another sequence of octets, it is a pure encoding and can be applied to any character data, or to any data for that matter.
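A minimal sketch of Quoted-Printable, using Python's standard `quopri` module. The sample text and the choice of ISO 8859-1 as the underlying encoding are assumptions for illustration. The octet sequence gets longer, the result contains ASCII characters only, and decoding gives back exactly the original octets.

```python
import quopri

text = "Tämä on testi"
octets = text.encode("iso-8859-1")         # the underlying encoding (assumed)

encoded = quopri.encodestring(octets)      # octets to "safe" ASCII octets
decoded = quopri.decodestring(encoded)     # and back again

print(encoded.decode("ascii"))             # what a human sees if QP is shown as such
print(len(octets), len(encoded))
```

Since the mapping is from octets to octets, the same two calls would work just as well on data in some other character encoding.
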

Naturally, Quoted-Printable encoding needs to be processed by a program which knows it and can convert it to human-readable form. It looks rather confusing when displayed as such. Roughly speaking, one can expect most E-mail programs to be able to handle QP, but the same does not apply to newsreaders or Web browsers.

Therefore, you should normally use QP in E-mail only. Basically, MIME should let people communicate smoothly without hindrances caused by character code and encoding differences. MIME should handle the necessary conversions automatically and invisibly. For example, when person A sends E-mail to person B, the following should happen: The E-mail program used by A encodes A's message in some particular manner, probably according to some convention which is normal on the system where the program is used such as ISO encoding on a typical modern Unix system.

The program automatically includes information about this encoding into an E-mail header, which is usually invisible both when sending and when reading the message. The message, with the headers, is then delivered, through network connections, to B's system. When B uses his E-mail program (which may be very different from A's) to read the message, the program should automatically pick up the information about the encoding as specified in a header and interpret the message body according to it.

For example, if B is using a Macintosh computer, the program would automatically convert the message into Mac's internal character encoding and only then display it. If the program fails to do such a conversion, strange things will happen.
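The exchange between A and B described above can be sketched with Python's standard `email` package. The function names `send` and `receive`, the addresses, and the choice of ISO 8859-1 are my assumptions; a real mail program does much more. The point is only that the sender labels the encoding in a header and the recipient decodes according to that label, after which a conversion to the local encoding (here Python's `mac_roman` codec, standing in for the Macintosh case) is straightforward.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def send(text, charset):
    """A's program: encode the body and record the encoding in a header."""
    msg = EmailMessage()
    msg["From"] = "a@example.org"   # hypothetical addresses
    msg["To"] = "b@example.org"
    msg["Subject"] = "test"
    msg.set_content(text, charset=charset)
    return msg.as_bytes()           # the octets that travel over the network


def receive(wire):
    """B's program: read the header and decode the body according to it."""
    msg = message_from_bytes(wire, policy=policy.default)
    return msg.get_content_charset(), msg.get_content().rstrip("\n")


wire = send("Tämä on testi", "iso-8859-1")
charset, body = receive(wire)

# Only now is the text converted to the recipient's internal encoding.
mac_octets = body.encode("mac_roman")
print(charset, body)
```
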

Unfortunately, there are deficiencies and errors in software so that users often have to struggle with character code conversion problems, perhaps correcting the actions taken by programs. It takes two to tango, and some more participants to get characters right. This section demonstrates different things which may happen, and do happen, when just one component is faulty, i. Typical minor! If a text in such a language is processed so that a necessary conversion is not applied, or an incorrect conversion is applied, the result might be that e.

Sometimes you might be able to guess what has happened, and perhaps to determine which code conversion should be applied, and apply it more or less "by hand".
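Applying a conversion "by hand" can be sketched as follows. The scenario is an assumption of mine: the data was actually written in a DOS encoding (code page 437) but got interpreted as ISO 8859-1. The hypothetical helper `reinterpret` undoes the wrong interpretation and applies another one, so you can try candidates until something sensible appears.

```python
def reinterpret(shown, assumed, actual):
    """Undo a decoding made with `assumed` and decode as `actual` instead."""
    return shown.encode(assumed).decode(actual)


original = "Tämä on testi"

# What the reader sees when DOS-encoded data is taken to be ISO 8859-1.
shown = original.encode("cp437").decode("iso-8859-1")
print(repr(shown))

# Trial and error over a few candidate encodings.
for candidate in ("cp437", "mac_roman", "utf-8"):
    try:
        print(candidate, repr(reinterpret(shown, "iso-8859-1", candidate)))
    except UnicodeDecodeError:
        print(candidate, "does not fit the data at all")

repaired = reinterpret(shown, "iso-8859-1", "cp437")
```

This works only because ISO 8859-1 maps every octet to some character, so that the wrong interpretation loses no information. With other combinations the damage may be irreversible.
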

To take an example which may have some practical value in itself to people using languages mentioned Assume that you have some text data which is expected to be, say, in German, Swedish or Finnish and which appears to be such text with some characters replaced by oddities in a somewhat systematic way.

Assume further that the program you are using interprets text data according to ISO by default and that the actual data is not accompanied with a suitable indication like a Content-Type header of the encoding, or such an indication is obviously in error. To illustrate what may happen when text is sent in a grossly invalid form, consider the following example. I sent myself E-mail, using Netscape 4. In the mail composition window, I set the encoding to UTF That's Finnish for 'This is a test'.

The second and fourth characters are the letter a with umlaut. Trying to read the mail on my Unix account, using the Pine E-mail program (popular among Unix users), I saw the following (when in "full headers" mode; irrelevant headers omitted here): Interesting, isn't it? Even if the encoding had been a registered one, there would have been no guarantee that my E-mail program would have been able to handle the encoding. In fact, it is difficult to say what Netscape was really doing, since it seems to have encoded incorrectly.

The shifted encoding is based on presenting Unicode values first as bit binary integers, then regrouping the bits and presenting the resulting six-bit groups as octets according to a table specified in RFC in the section on Base See also RFC Whenever text data is sent over a network, the sender and the recipient should have a joint agreement on the character encoding used. In the optimal case, this is handled by the software automatically, but in reality the users need to take some precautions.
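The regrouping of bits in the shifted encoding mentioned above can be tried out as follows. I am assuming here that the encoding in question is UTF-7, that the integers are 16 bits wide, and that the table is the Base64 one; the helper `shifted_block` is a hypothetical name and shows the core step only, not the complete rules. Python's built-in `utf-7` codec is used for comparison.

```python
import base64


def shifted_block(text):
    """Regroup the 16-bit values of `text` into six-bit groups (sketch)."""
    bits = text.encode("utf-16-be")    # each character as a 16-bit integer
    # Base64 maps each six-bit group to one ASCII character;
    # the "=" padding is not used in the shifted encoding.
    return base64.b64encode(bits).rstrip(b"=").decode("ascii")


block = shifted_block("ä")             # U+00E4
print(block)

# The built-in codec marks the start of a shifted sequence with "+".
encoded = "Tämä on testi".encode("utf-7")
decoded = encoded.decode("utf-7")
print(encoded.decode("ascii"))
```
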

Most importantly, make sure that any Internet-related software that you use to send data specifies the encoding correctly in suitable headers. There are two things involved: the header must be there and it must reflect the actual encoding used; and the encoding used must be one that is widely understood by the potential recipients' software.

One must often make compromises as regards the latter aim: you may need to use an encoding which is not yet widely supported to get your message through at all. It is useful to find out how to make your Web browser, newsreader, and E-mail program display the encoding information for the page, article, or message you are reading.

If you use, say, Netscape to send E-mail or to post to Usenet news, make sure it sends the message in a reasonable form. In particular, avoid sending data in a proprietary encoding like the Macintosh encoding or a DOS encoding to a public network. At the very least, if you do that, make sure that the message heading specifies the encoding! There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers.

But when sent to the Internet, data should be converted to a more widely known encoding by the sending program. If you cannot find a way to configure your program to do that, get another program. As regards other forms of transfer of data in digital form, such as diskettes, information about encoding is important, too. The problem is typically handled by guesswork. Often the crucial thing is to know which program was used to generate the data, since the text data might be inside a file in, say, the MS Word format, which can only be read by a suitable version of MS Word or by a program which knows its internal data format.

That format, once recognized, might contain information which specifies the character encoding used in the text data included; or it might not, in which case one has to ask the sender, or make a guess, or use trial and error - viewing the data using different encodings until something sensible appears.

Character code problems are part of a topic called internationalization (jocularly abbreviated as i18n), rather misleadingly, because it mainly revolves around the problems of using various languages and writing systems (scripts). (Typically international communication on the Internet is carried out in English!) It includes difficult questions like text directionality (some languages are written right to left) and requirements to present the same character with different glyphs according to its context.

See W3C pages on internationalization. I originally started writing this document as a tutorial for HTML authors. Later I noticed that this general information is extensive enough to be put into a document of its own.

As regards HTML-specific problems, the document Using national and special characters in HTML summarizes what currently seems to be the best alternative in the general case. I have learned a lot about character set issues from the following people (listed in an order which is roughly chronological by the start of their influence on my understanding of these things): Timo Kiravuo, Alan J.

But any errors in this document I souped up by myself.

An "A" or any other character is something like a Platonic entity: it is the idea of an "A" and not the "A" itself. A character may have a broader range of use than the most literal interpretation of its name might indicate; coded representation, name, and representative glyph need to be taken in context when establishing the semantics of a character.

Date of last revision: Date of last update: Minor modifications Link fixes and This page belongs to section Characters and encodings of the free information site IT and communication by Jukka "Yucca" Korpela. The presentation of some characters in copies of this document may be defective e.

You may wish to compare the presentation of the characters on your browser with the character table presented as a GIF image in the famous ISO Alphabet Soup document. In text only mode, you may wish to use my simple table of ISO Latin 1 which contains the names of the characters.

Naturally, the appearance of characters varies from one font to another. More examples: the Windows character set s In ISO , code positions - are explicitly reserved for control purposes ; they "correspond to bit combinations that do not represent graphic characters". Latin alphabet No. ISO Part 12 has not been defined. What's in a name? Failures to display a character In addition to the fact that the appearance of a character may vary , it is quite possible that some program fails to display a character at all.

Linear text vs.

Compatibility characters

There is a large number of compatibility characters in ISO and Unicode which are variants of other characters.

Typing characters

Just pressing a key?

Program-specific methods for typing characters

Thus, you often need program-specific ways of entering characters from a keyboard, either because there is no key for a character you need or there is but it does not work properly.

Three important examples of such ways: On Windows systems, you can usually (some application programs may override this) produce any character in the Windows character set (naturally, in its Windows encoding) as follows: Press down the left Alt key and keep it down.

Then type, using the separate numeric keypad (not the numbers above the letter keys!) Finally release the Alt key. Notice that the first digit is always 0, since the code values are in the range 32 - decimal. Upon releasing Alt, the character should appear on the screen. This method is often referred to as Alt-0nnn.

If you omit the leading zero, i. For example, Alt would probably insert a graphic character which looks somewhat like a hyphen. There are variations in the behavior of various Windows programs in this area, and using those DOS codes is best avoided. In the Emacs editor (which is popular especially on Unix systems), you can produce any ISO Latin 1 character by typing first control-Q, then its code as a three-digit octal number. This method is often referred to as C-Q-nnn.

From version 23, there are good tools for entering any character in Emacs. Text processing programs often modify user input e. This is often convenient, especially if you can add your own rules for modifications, but it causes unpleasant surprises and problems when you actually meant what you wrote, e.
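Both the Alt-0nnn method and the C-Q-nnn method described above require knowing the code number of the character. A small sketch for looking the numbers up, with the hypothetical helper names `alt_code` and `emacs_code`, and assuming that Python's `cp1252` codec corresponds to the Windows character set meant above:

```python
def alt_code(ch):
    """Key sequence for the Windows Alt method: 0 plus three decimal digits."""
    code = ch.encode("cp1252")[0]       # raises an error if not in the set
    return f"Alt-{code:04d}"


def emacs_code(ch):
    """Key sequence for Emacs: control-Q, then three octal digits."""
    code = ch.encode("iso-8859-1")[0]   # raises an error if not in Latin 1
    return f"C-Q-{code:03o}"


windows_keys = alt_code("ä")
emacs_keys = emacs_code("ä")
print(windows_keys, emacs_keys)
```
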

Programs often process some keyboard key combinations, typically involving the use of an Alt or Alt Gr key or some other "composition key", by converting them to special characters. In fact, even the well-known shift key is a composition key: it is used to modify the meaning of another key, e. Such things are not just "program-specific"; they also depend on the program version and settings (and on the keyboard, of course), and could well be user-modifiable. For example, in order to support the euro sign, various methods have been developed, e.

But this may require a special "euro update", and the key combinations vary even when we consider Microsoft products only. So it would be quite inappropriate to say e.



This usage, and the next two usages below, is usually not directly caused by incorrect implementations but by the human writer; however, it is indirectly caused by them. For some languages e. Yet another surrogate. The original data was actually ISO encoded or something similar e. Windows encoded but during data transfer the most significant bit of each octet was lost. Such things may happen in systems for transferring, or "gatewaying", data from one network to another.
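The loss of the most significant bit described above is easy to simulate. In this sketch the helper name `strip_high_bit` is hypothetical, and ISO 8859-1 is assumed as the original encoding. Each accented letter turns into some unrelated ASCII character, in a systematic way.

```python
def strip_high_bit(octets):
    """Simulate a 7-bit channel: clear the most significant bit of each octet."""
    return bytes(octet & 0x7F for octet in octets)


original = "Tämä on testi".encode("iso-8859-1")
damaged = strip_high_bit(original).decode("ascii")
print(damaged)
```

Here the letter a with umlaut (E4 hexadecimal) becomes the letter d (64 hexadecimal). The mapping is many-to-one, so the original text cannot be recovered mechanically; the reader has to guess from context.
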

Sometimes it might be your terminal that has been configured to "mask out" the most significant bit! It requires further study to determine the actual encoding used, since e.

Note that the PDF file is rather large (over MB); in some contexts it may be preferable to save the file before opening it. For more information concerning code charts and their terms of use, see About the Code Charts. Code charts are presented as arrays of graphic symbols representing the characters, organized in one to sixteen columns of sixteen symbols each. The lower digit of the coded representation is indicated in the left margin, while the remaining upper digits are indicated in the top margin.


