Summary
In August, the SimulTrans Localization Seminar Series featured Andrea Vine, software internationalization consultant, and Bill Hall, Director of Internationalization at SimulTrans.
Together, they discussed Unicode, a character coding system designed to support the interchange, processing, and display of the written texts of the world's diverse languages. The evening program featured both formal presentations and demonstrations of programs running in Windows NT 4.0 and 5.0.
Vine's topics included:
- The history of Unicode
- Basic principles of the character set
- Conformance issues
- What's new in Unicode 2.0 (from 1.1)
Hall addressed Unicode's "practical side," with topics such as:
- Advantages and disadvantages of Unicode
- Unicode realized on Windows NT
- Character conversion
- Character analysis
For more information regarding Unicode, visit the Unicode Consortium's website at http://www.unicode.org.
Andrea Vine
Andrea Vine has been involved in the internationalization and localization of software for more than eight years. She has worked with companies such as Sun Microsystems, Xerox, and Computer Associates. She specializes in internationalized design for software applications. Past projects include e-commerce applications, spreadsheets, project planners, communications interfaces, network servers, and word processors. She is currently internationalizing the Sun Internet Mail Server.
Unicode defined
First, we should examine how Unicode differs from ISO 10646.
ISO 10646 is simply a character set that maps characters to binary code numbers. Unicode, on the other hand, is the 2-byte/4-byte form of that same character set plus character properties, implementation rules, and guidelines.
ISO is an international standards organization, with academic and governmental focus. Unicode remains a private consortium, made up of several commercial entities, plus some academic and user groups.
It should also be noted that ISO and Unicode have been working together cooperatively since 1991 in an effort to ensure that there are no conflicts or discrepancies between the two character sets.
A brief history
The concept of the Unicode character set began in 1987, thanks to Joe Becker from Xerox and Mark Davis from Apple. The following year, Becker, Davis, and Lee Collins (currently of Xerox; formerly of Apple) began investigating the design and soon made the case for Han unification to ANSI and ISO. Unicode is, indeed, based on the historic evolution of the Chinese character set (Han).
Several people from various high tech companies began holding bimonthly meetings in 1989. By the end of 1990, an initial, full-review draft was created.
In 1991, the group became the Unicode Consortium, a non-profit organization incorporated as Unicode, Inc.
Version 1.0 became available to the public for the first time in 1992.
Unicode's basic principles
Briefly, the Unicode character set encompasses the following 10 basic principles:
- 16-bit characters
- Full encoding
- Characters, not glyphs
- Semantics
- Plain text
- Logical order
- Unification
- Dynamic composition
- Equivalent sequence
- Convertibility
Relative to the first principle, the characters are of 2-byte representation and are uniform in width (there's a slight exception to this, which I'll address later).
As noted, Unicode relies upon characters, not glyphs. A character is the smallest component of written language that has semantic value (including phonetic value). A glyph, on the other hand, is the shape representation of a character. For example, the Latin lowercase "a" has many glyphs, one for each typeface, weight, or style in which it is drawn, yet all of them represent the same character.
A character shape based on position or semantic context is considered to be a glyph, such as Arabic characters.
A ligature is considered to be a glyph resulting from a combination of two characters, such as "ﬁ" in some fonts and "æ" as it's used in English (but not as it's used in Norwegian).
As for semantics, it is important to note that each individual character has properties. The character can be numeric, can have spacing properties, and can have directionality. It becomes very important to know the properties of the character when you receive an input stream to render it properly, or when doing comparisons.
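To make the idea of per-character properties concrete, here is a minimal Python sketch that looks up a few of them. It relies on Python's standard `unicodedata` module, which reflects a much later version of the Unicode character database than the 2.0 standard discussed here; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),    # e.g. Lu = uppercase letter
        "numeric": unicodedata.numeric(ch, None),  # None if not numeric
        "bidi": unicodedata.bidirectional(ch),   # directionality class
        "combining": unicodedata.combining(ch),  # 0 for non-combining characters
    }

# A Latin letter, a digit, a Hebrew letter, a combining accent, a fraction.
for ch in ["A", "5", "\u05D0", "\u0301", "\u00BD"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```

A renderer or comparison routine would consult properties like these rather than hard-coding assumptions about particular code values.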
Regarding the use of plain text, there's only enough information to adequately render a character in plain text. There is no additional encoding in Unicode for layout, language, font, style, color, hypertext links, kerning, etc.
Unicode relies upon logical order. Information is stored in the order it is typed in, which may be different from rendering order. The principle of unification refers to the fact that no duplicate encodings of a character occur simply because of varying uses. Use in different languages is just another use (see prior point).
It is also important to note that round trip compatibility is more important than unification. This can get a little fuzzy, for example, when there are several "hyphen" symbols (which one could argue are just varying uses of the same character).
Dynamic composition, also known as combining characters, is the process of following a character with diacritic marks that a rendering engine can then display as one character.
As for equivalent sequence, precomposed characters should be equivalent to their combining counterparts, e.g. è (U+00E8) = e (U+0065) followed by the combining grave accent (U+0300). These are guidelines, however, and are not necessary for conformance.
Java does not consider the two forms in the previous example to be equivalent. And, as a last point regarding equivalent sequence, the order of combining characters can make a difference in equivalency, but not always.
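The equivalence of a precomposed character and its combining sequence can be sketched in Python as follows. This assumes the normalization forms (NFC and NFD) that later versions of the standard define and that Python exposes through `unicodedata.normalize`; they are not part of the 2.0 text summarized here, and `canonically_equal` is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u00E8"   # e with grave as a single code point
combining = "e\u0300"    # e followed by COMBINING GRAVE ACCENT

def canonically_equal(a, b):
    """Compare two strings after decomposing both to a canonical form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A raw code-by-code comparison treats the two spellings as different.
print("raw comparison:      ", precomposed == combining)
# Comparing normalized forms treats them as the same text.
print("canonical comparison:", canonically_equal(precomposed, combining))
```

A program that compares strings code by code, without normalizing first, will behave like the Java example mentioned above.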
As of May 1993, accurate round-trip convertibility is guaranteed between the Unicode Standard and other standards in wide usage. If variant forms are given separate codes within one of the widely used standards, these were preserved in Unicode.
Code point section descriptions
The following list shows how the code space is divided into sections. General scripts come first, followed by Symbols, CJK (Chinese-Japanese-Korean) miscellaneous, CJK ideographs (or whatever you would like to call them), Hangul, Surrogates, Private use, and, lastly, Compatibility and Special.
0000--1FFF - General scripts
- Basic Latin: U+0000--U+007F
- Latin-1 Supplement: U+0080--U+00FF
- Latin Extended-A: U+0100--U+017F
- Latin Extended-B: U+0180--U+024F
- IPA Extensions: U+0250--U+02AF
- Spacing Modifier Letters: U+02B0--U+02FF
- Combining Diacritical Marks: U+0300--U+036F
- Greek: U+0370--U+03FF
- Cyrillic: U+0400--U+04FF
- Armenian: U+0530--U+058F
- Hebrew: U+0590--U+05FF
- Arabic: U+0600--U+06FF
- Devanagari: U+0900--U+097F
- Bengali: U+0980--U+09FF
- Gurmukhi: U+0A00--U+0A7F
- Gujarati: U+0A80--U+0AFF
- Oriya: U+0B00--U+0B7F
- Tamil: U+0B80--U+0BFF
- Telugu: U+0C00--U+0C7F
- Kannada: U+0C80--U+0CFF
- Malayalam: U+0D00--U+0D7F
- Thai: U+0E00--U+0E7F
- Lao: U+0E80--U+0EFF
- Tibetan: U+0F00--U+0FBF
- Georgian: U+10A0--U+10FF
- Hangul Jamo: U+1100--U+11FF
- Latin Extended Additional: U+1E00--U+1EFF
- Greek Extended: U+1F00--U+1FFF
2000--2FFF - Symbols
- General Punctuation: U+2000--U+206F
- Superscripts and Subscripts: U+2070--U+209F
- Currency Symbols: U+20A0--U+20CF
- Combining Marks for Symbols: U+20D0--U+20FF
- Letterlike Symbols: U+2100--U+214F
- Number Forms: U+2150--U+218F
- Arrows: U+2190--U+21FF
- Mathematical Operators: U+2200--U+22FF
- Miscellaneous Technical: U+2300--U+23FF
- Control Pictures: U+2400--U+243F
- Optical Character Recognition: U+2440--U+245F
- Enclosed Alphanumerics: U+2460--U+24FF
- Box Drawing: U+2500--U+257F
- Block Elements: U+2580--U+259F
- Geometric Shapes: U+25A0--U+25FF
- Miscellaneous Symbols: U+2600--U+26FF
- Dingbats: U+2700--U+27BF
3000--33FF - CJK miscellaneous
- CJK Symbols and Punctuation: U+3000--U+303F
- Hiragana: U+3040--U+309F
- Katakana: U+30A0--U+30FF
- Bopomofo: U+3100--U+312F
- Hangul Compatibility Jamo: U+3130--U+318F
- Kanbun: U+3190--U+319F
- Enclosed CJK Letters and Months: U+3200--U+32FF
- CJK Compatibility: U+3300--U+33FF
4E00--9FFF - CJK ideographs
AC00--D7A3 - Hangul
D800--DFFF - Surrogates (other planes)
E000--F8FF - Private use
F900--FFFF - Compatibility and Special
- CJK Compatibility Ideographs: U+F900--U+FAFF
- Alphabetic Presentation Forms: U+FB00--U+FB4F
- Arabic Presentation Forms-A: U+FB50--U+FDFF
- Combining Half Marks: U+FE20--U+FE2F
- CJK Compatibility Forms: U+FE30--U+FE4F
- Small Form Variants: U+FE50--U+FE6F
- Arabic Presentation Forms-B: U+FE70--U+FEFF
- Halfwidth and Fullwidth Forms: U+FF00--U+FFEF
- Specials: U+FEFF, U+FFF0--U+FFFF
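As a rough illustration of how the top-level sections above can be used, the following Python sketch maps a 16-bit code value to its section. The boundaries are copied from the list above (the Unicode 2.0 layout as presented in the talk, not the current standard); the names `AREAS` and `area_of` are invented for this example.

```python
from bisect import bisect_right

# (first, last, label) for each top-level section, in ascending order.
AREAS = [
    (0x0000, 0x1FFF, "General scripts"),
    (0x2000, 0x2FFF, "Symbols"),
    (0x3000, 0x33FF, "CJK miscellaneous"),
    (0x4E00, 0x9FFF, "CJK ideographs"),
    (0xAC00, 0xD7A3, "Hangul"),
    (0xD800, 0xDFFF, "Surrogates"),
    (0xE000, 0xF8FF, "Private use"),
    (0xF900, 0xFFFF, "Compatibility and Special"),
]
_STARTS = [first for first, _, _ in AREAS]

def area_of(code_point):
    """Return the section containing code_point, or a marker for the gaps."""
    i = bisect_right(_STARTS, code_point) - 1
    if i >= 0 and code_point <= AREAS[i][1]:
        return AREAS[i][2]
    return "Unassigned / reserved"

for cp in (0x0041, 0x20AC, 0x3400, 0x4E00, 0xD800, 0xFEFF):
    print(f"U+{cp:04X}: {area_of(cp)}")
```

Values that fall between sections, such as 3400, come back as unassigned, which matches the reserved gaps in this layout.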
A final word (or description)
Unassigned codes are reserved for future use by the Consortium. Again, this will be a fully assigned character set. As a result, you can use the "Private Use" area for implementation dependent encoding.
There are also two very important non-characters, FFFE and FFFF (which I will talk about later). And there are no escape sequences or shift states.
Unicode's choice of characters
Why were some characters chosen and not others? The Unicode people favored characters in current use: living languages over ancient ones (in some cases leaving out an entire language).
The ability to distinguish was also a big issue. My favorite examples are Latin, Greek, and Cyrillic. If you take Latin, Greek, and Cyrillic "A," guess what? It looks like the same character (to me). So in order to preserve these as separate characters, they had to make them separate characters.
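The look-alike letters can be inspected with a short Python sketch. It uses the standard `unicodedata` module (a later character database than 2.0, though these three letters have not changed); `name_of` is a hypothetical helper.

```python
import unicodedata

def name_of(ch):
    """Return the code value and character name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Three capital letters that usually render with the same shape.
look_alikes = ["\u0041", "\u0391", "\u0410"]
for ch in look_alikes:
    print(name_of(ch), "-> lowercase", name_of(ch.lower()))
```

Because they are separate characters, each one keeps its own lowercase mapping (a, α, and а respectively), which would not be possible if the three capitals shared one code.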
Round trip compatibility was also an issue. There exists a Chinese-Japanese-Korean (CJK) compatibility zone with some duplicate characters, such as double width Latin letters.
And even widespread usage became an issue. There's a lot of discussion about the precombined Hangul vs. Thai, but, the truth of the matter is that precombined Hangul is in wide use. As a result, they couldn't take that away.
There are some mistakes, but generally speaking, once a character is assigned, it will not be removed. Following are a few examples:
- Symmetric swapping - 206A, 206B
- Arabic form shaping - 206C, 206D
- Digit shapes - 206E, 206F
- Old Euro currency sign - 20A0
Major consequences
There are some major consequences with 16-bit Unicode. For one, code points can contain '00', the traditional null byte for 8-bit character strings. Likewise, code points can contain the control functions from other encodings, e.g. ASCII '01' - '20'.
Another consequence worth noting is that current C/C++ data type char won't work. Also, character level sequences and incrementation as well as control codes must be handled as 16-bit.
Combining characters
The typing order is based on handwriting order, not dead key or compose key order.
If diacritics occupy the same relative position, then order is significant. The first diacritic should be placed closer to the base character than the second should.
Where order is not significant, one order is equivalent to the other. Essentially, equivalency is based on the (theoretical) rendered element, as a user would perceive equivalency.
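The two cases, marks in different positions versus marks in the same position, can be sketched in Python. This again assumes the canonical ordering and NFD normalization of later Unicode versions as implemented by `unicodedata`; the constant and function names are illustrative only.

```python
import unicodedata

DOT_BELOW = "\u0323"   # attaches below the base (combining class 220)
DOT_ABOVE = "\u0307"   # attaches above the base (combining class 230)
DIAERESIS = "\u0308"   # attaches above the base (combining class 230)
ACUTE = "\u0301"       # attaches above the base (combining class 230)

def canonically_equivalent(a, b):
    """Compare two sequences after canonical decomposition and reordering."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Different positions: the marks do not interact, so either order is equivalent.
print(canonically_equivalent("a" + DOT_BELOW + DOT_ABOVE,
                             "a" + DOT_ABOVE + DOT_BELOW))

# Same position: the first mark sits closer to the base, so order matters.
print(canonically_equivalent("a" + DIAERESIS + ACUTE,
                             "a" + ACUTE + DIAERESIS))
```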
Conformance
To be conformant, you don't have to support all characters or even specific subsets; you can support one character. In fact, John Jenkins of Apple gave an interesting example. You can buy a program called Bell Ringer and, every time you've got 0007, you ring a bell. You just ignore everything else. All they ask is that you don't lose or mangle characters, that you don't use unassigned code points, and that you don't interpret noncharacters or unpaired surrogates as characters. They're trying to be as flexible as possible so that more and more people will implement at least portions of Unicode.
What's new in Unicode 2.0 (from 1.1)
First of all, certain areas have been defined or moved. For example, the Surrogates area has been defined while the Hangul area has been moved.
Codepoint additions and moves involve the Hangul area (which has been increased from 4306 characters to 11,172). There is also the addition of 31 cantillation marks to the Hebrew block, as well as a Tibetan block containing 168 entries.
In terms of character name changes, LIGATURE changed to LETTER in 6 names: 00C6, 00E6, 01E2, 01E3, 01FC, and 01FD. There will be no name changes in the future.
As for semantics, the order of joiner characters for Indic half-forms changed. And symmetric swapping became the normative property mirrored.
Bi-directional behavior
Data stream vs. rendering is a big issue. The data stream is typing order. Rendering runs into complications. For example, in Hebrew, a period becomes a decimal point when the next item in the stream is a number.
Han unification
There's been a tremendous amount of controversy over Han unification. Essentially, the attempt was to combine those Han characters whose differences amount to something like a font style, as in a bold or medium-weight letter. Within the same Han character, the variation is comparable to an "a" in one typeface versus an "a" in another. Between separate Han characters, it would be more like comparing "a" to "u" or "a" to "d".
Extra special characters
There are some extra special characters I'd like to point out, i.e., NUL, FEFF, and FFFF. At the beginning of a file or stream, FEFF is the byte order mark, which distinguishes big-endian from little-endian order; in the middle of a stream, it is ZERO WIDTH NO-BREAK SPACE.
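A minimal Python sketch of the byte order mark check for 16-bit data follows. The function names and the big-endian default for unmarked data are assumptions made for this example, not rules quoted from the talk.

```python
import codecs

def detect_byte_order(data):
    """Inspect the first two bytes of a 16-bit stream for FEFF."""
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE
        return "little-endian"
    return "unknown"

def decode_utf16(data, default="big-endian"):
    """Decode 16-bit data, honoring and removing a leading byte order mark."""
    order = detect_byte_order(data)
    if order == "unknown":
        order = default        # assumption: fall back to a caller-chosen order
    else:
        data = data[2:]        # the mark itself is not part of the text
    codec = "utf-16-be" if order == "big-endian" else "utf-16-le"
    return data.decode(codec)

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(detect_byte_order(sample), repr(decode_utf16(sample)))
```

Only the leading FEFF is stripped; one that appears later in the stream is kept as an ordinary character.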
Additional special characters include 2028 - line separator (supposedly a more determinate form of linefeed '0A'); and 2029 - paragraph separator (a situation similar to line delimiter).
Also noteworthy are spaces (at least 15); dashes and hyphens; right-to-left and left-to-right marks; joiners and non-joiners; FFFD used as replacement character for unknowns, sometimes 001A used for 8-bit and FFFD used for 16-bit; and FFFF sentinel character, used for absolute comparison, end of file, etc.
Encoding and transport protocols
When you think of Unicode as double-byte, keep in mind that Unicode encoding takes several forms. Unicode is a character set with a value associated with each character, but that value can be cast into all sorts of different shapes, and one shape is the value as itself, UCS-2.
UCS-2 is 2-byte Unicode. UCS-2 and UCS-4 are ISO 10646 terms essentially for 16-bit Unicode characters and 32-bit characters (Universal Character Set 2-octet, 4-octet).
The next character set, or character encoding, is UTF-8 (UCS Transformation Format, 8-bit). This was developed as a form that current systems can handle, at least as far as string identification and transport: the 8-bit NUL '00' remains valid as a terminator, and ASCII is left unchanged.
UTF-7 is strictly for primitive Internet transport. It is a 7-bit form, whose use is discouraged in favor of UTF-8.
UTF-16 is a deceptive name or misnomer for the variable length encoding when supporting surrogate characters as well as other Unicode characters, i.e. both 16-bit and 32-bit characters as defined in Unicode. It's actually a multi-byte character encoding alternating between 16-bit and 32-bit.
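To compare the shapes side by side, this Python sketch prints the bytes that three of the encoding forms produce for a few code values. It uses Python's built-in codecs; `encoded_forms` is a hypothetical helper, and U+10000 is used only as the first value that needs a surrogate pair.

```python
def encoded_forms(ch):
    """Return the hex bytes of one character in several encoding forms."""
    return {
        name: ch.encode(name).hex(" ")
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }

# An ASCII letter, the euro sign, and the first code value beyond 16 bits.
for ch in ("A", "\u20AC", "\U00010000"):
    print(f"U+{ord(ch):04X}", encoded_forms(ch))
```

The ASCII letter stays a single unchanged byte in UTF-8, while the last value occupies two 16-bit elements in UTF-16, which is the variable-length behavior described above.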
What's new in 2.1?
Here are a few things you can look forward to in Unicode 2.1:
- New characters such as FFFC OBJECT REPLACEMENT CHARACTER, 20AC EURO SIGN
- Additional property changes - several characters designated with math property, 2 characters removed from alphabetic listing (02BC MODIFIER LETTER APOSTROPHE and 055A ARMENIAN APOSTROPHE), new identifiers
- Bi-directional behavior changes, e.g. 0026 AMPERSAND "&" and 0040 COMMERCIAL AT "@" now have other neutral directionality; 002E FULL STOP and 2007 FIGURE SPACE changed from EUROPEAN NUMBER SEPARATOR to COMMON NUMBER SEPARATOR
- Apostrophe semantics, in terms of letter, e.g. for accent or glottal stop; in terms of punctuation, e.g. true apostrophe as word break; quotation mark.
- Errata - typographical, glyph (about 35, mostly in CJK compatibility), UCS-2 to UTF-7 sample code conversion.
For more information you can visit the Unicode website at http://www.unicode.org/.
You can also pick up a copy of The Unicode Standard, Version 2.0 - ISBN 0-201-48345-9, (see the Unicode website for ordering information).
Two Unicode fonts are also available:
Bitstream Cyberbit at http://www.bitstream.com/ (Cyberbit has been downloadable for free in the past)
Monotype WorldType at http://www.monotype.com/
I am also available for consulting services and can be reached at asv@earthling.net.
Bill Hall
Bill Hall, Director of Internationalization at SimulTrans, has published and spoken widely in forums relating to internationalization. In April he spoke at the Unicode conference in Tokyo on "Unicode and Win32 Internationalization," a topic he addressed in articles in Microsoft Systems Journal and Journal of Win32 Development. After teaching for 15 years at the university level, Bill joined companies such as Novell and Netcom. He holds five degrees, from B.A. to Ph.D., in subjects ranging from computer science and electrical engineering to mathematics.
Unicode's 'practical side'
I'm going to talk about what I call "practical Unicode." Basically, Unicode began as an attempt to represent the world's characters as 16-bit. As you can see, it hasn't quite worked out that way, although the major modern languages of today are very well represented in Unicode.
The idea is to encode the world's languages, including existing character set standards (as Andrea pointed out). In addition to that, many languages and characters that have never been encoded before have also been encoded.
Unicode is an international standard, corresponding to the Basic Multilingual Plane of ISO 10646 (which is the ISO standard). It's also now a component of some operating systems, especially NT and Windows 95 (believe it or not). It's also going to be part of the new Mac operating system.
The general layout
In Unicode's general layout, about 8K is allocated to alphabetic-like scripts. This would include everything from Roman letters through Indic scripts to Georgian, Armenian, Arabic, Hebrew, Greek, and Cyrillic types of things. There's also a section, about 4K, for symbols. There's also a little area called the CJK auxiliary area, which encodes auxiliary characters used in the Chinese-Japanese-Korean code bases.
There's also a reserved space and one for Han, which holds nearly 21,000 Chinese characters (unified, as Andrea mentioned). As a result, you cannot look at a character code and know whether it came from Chinese, Japanese, or Korean. There is a space called the Surrogate Area, which has been described. There is also an area for Hangul. As you know, Korean can be written with a very elegant writing system that uses very few symbols (no more than 40) for vowels and consonants, which can be combined to make syllables. These syllables are then written in a block. Each block is called a Hangul syllable, or a Hangul character.
Basically, each one of these corresponds to some Chinese character, which is the older way of writing. This system of writing is in great use now, but it's taken a long time since its inception in 1446.
As for the Private Use area, you can do anything you want here, as long as you and the parties you exchange data with are in agreement as to what that use is; the Unicode Consortium assigns no meaning to these codes. The Compatibility Zone is for roundtrip mappings to and from older character sets.
Unicode extension mechanism
Unicode can encode a maximum of 64K characters with only 16 bits of information. Some Chinese dictionaries list 70,000 or more characters. And while ancient scripts are missing, they are required for scholarly work.
Extension to 1 million characters has been added by pairing two 16-bit elements. This allows for representation of rare characters in future extensions of Unicode. A 2K region of Unicode, called the Surrogate Area, has been set aside (high surrogate area U+D800 to U+DBFF; low surrogate area U+DC00 to U+DFFF).
A surrogate pair is a sequence of two Unicode values in which the first value is a high surrogate and the second is a low surrogate. The mechanism provides a 1 million-plus character extension. Currently, there are no characters defined there, but not for long.
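The arithmetic behind the pairing can be sketched as follows. Each surrogate area holds 1,024 values, so a pair addresses 1,024 x 1,024 = 1,048,576 extra code values. The offset of 10000 (hex) for the first extended value follows the UTF-16 definition; the function names are invented for this example.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00
FIRST_EXTENDED = 0x10000   # first value that no longer fits in 16 bits

def to_surrogates(code_point):
    """Split an extended code value into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("value does not need a surrogate pair")
    offset = code_point - FIRST_EXTENDED          # 20 significant bits
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return FIRST_EXTENDED + ((high - HIGH_START) << 10) + (low - LOW_START)

high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X} -> {from_surrogates(high, low):X}")
```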
Advantages of Unicode
One advantage of Unicode is that each character has a unique code point, a very powerful feature. This means there's no confusion about how to interpret a code point. (Compare this situation with multibyte character sets where either a character must be tagged with its code page or some state maintained.) This facilitates data exchange and requires no stateful encoding.
Another advantage is that each character is assigned a set of properties. These describe its semantics, direction, case, type, relations to other characters, etc. They also eliminate ambiguity about certain characters and provide for precise character typing.
Disadvantages of Unicode
One disadvantage of Unicode is that it takes more space to store plain text. Each character that comes from a single-byte ISO standard requires an extra 8 bits of space over its single-byte representation. However, compression techniques are available if required. UTF-8 compresses ASCII pretty well, for example: it stores each ASCII character in one byte instead of two.
Transmission of Unicode data can also use more bandwidth. Despite this drawback, Unicode, in the form of UTF-8, is becoming a popular item on the Web.
It also requires significant extensions to tools and libraries. The C library requires new functions with new names that work with Unicode data. This can bifurcate the API set of an operating system; for example, Win32 splits the subset of its API functions that use strings into two parts.
Some misunderstandings
Now to address some misunderstandings about the system.
Misunderstanding #1: Unicode is a two-byte encoding system
The data structure for Unicode is not an array of bytes (C char). Routinely, it is an array of 16-bit elements. So don't think of it as a two-byte system, except possibly when you have to exchange data between little-endian and big-endian systems.
Misunderstanding #2: A single null byte terminates a Unicode string
If Unicode is being stored as a 16-bit array, then the terminator is the NULL Unicode character U+0000.
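The difference between a byte-wise and an element-wise terminator search can be sketched in Python. The function name `wcslen16` is hypothetical (it only mimics the idea of a 16-bit string length routine), and the buffer is assumed to be little-endian.

```python
import struct

def wcslen16(buf, little_endian=True):
    """Count 16-bit elements before the U+0000 terminator."""
    fmt = "<H" if little_endian else ">H"
    for i in range(0, len(buf) - 1, 2):       # step one 16-bit element at a time
        (unit,) = struct.unpack_from(fmt, buf, i)
        if unit == 0:
            return i // 2
    raise ValueError("no 16-bit terminator found")

# "A", U+0100, "B", then the 16-bit terminator, stored little-endian.
buf = "A\u0100B".encode("utf-16-le") + b"\x00\x00"

print("first zero byte at offset:", buf.find(b"\x00"))   # byte-wise scan stops early
print("length in 16-bit elements:", wcslen16(buf))
```

The byte-wise scan stops inside the first character, because the high byte of "A" is zero; only the element-wise scan finds the real end of the string.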
Misunderstanding #3: One character = one Unicode element
In the Surrogate area, two Unicode elements are required for one character. Also, in terms of composite representation, an unlimited number of elements can be combined to make up a character.
Misunderstanding #4: Unicode pointer arithmetic is the same as ASCII
In short, say goodbye to p++ and p--. A character may require more than one Unicode element to represent it. As a result, techniques taken from multibyte processing must be used. Also, Win32 supports composite representations in CharNext and CharPrev.
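Here is a simplified Python sketch of stepping through 16-bit elements one character at a time instead of one element at a time. It is not the Win32 CharNext implementation; `utf16_units` and `char_next` are hypothetical names, and treating only nonspacing and enclosing marks (categories Mn and Me) as part of the preceding character is an assumption of this example.

```python
import unicodedata

def utf16_units(text):
    """Return the 16-bit elements of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[k:k + 2], "little") for k in range(0, len(data), 2)]

def _scalar_at(units, i):
    """Decode one code value at index i; return (value, next_index)."""
    u = units[i]
    if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):
        return 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00), i + 2
    return u, i + 1

def char_next(units, i):
    """Return the index just past the character that starts at units[i]."""
    if i >= len(units):
        return len(units)
    _, i = _scalar_at(units, i)                 # base character (1 or 2 elements)
    while i < len(units):
        cp, j = _scalar_at(units, i)
        if unicodedata.category(chr(cp)) not in ("Mn", "Me"):
            break                               # next base character begins here
        i = j                                   # absorb the combining mark
    return i

units = utf16_units("e\u0300\U0001D11Ex")       # 5 elements, 3 characters
i = 0
while i < len(units):
    j = char_next(units, i)
    print([f"{u:04X}" for u in units[i:j]])
    i = j
```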
Misunderstanding #5: Unicode solves all I18N and L10N problems
We wish. Unicode trades one set of problems for another. However, the new set is much more interesting and offers greater possibilities for worldwide operations.
Unicode realized in Windows NT
Some of the highlights of Unicode in Windows NT include:
- Kernel Mode is mostly Unicode.
- User Mode application can be Unicode or multibyte (which means you can write an application that uses Unicode exclusively).
- GDI text calls assume Unicode.
- NTFS file names are in Unicode.
- True Type fonts are Unicode encoded.
- Resources strings are compiled to Unicode.
There are probably some bugs here and there, but, in general, Unicode works well in the Windows NT 5.0 language kit. I had a strong feeling that I would never live long enough to see Windows perfected, and I probably won't. But it looks like there's a good chance when we see something that works this well.
Partial support in Windows 95/98
If running in Windows 95 or 98, you'll find that resource strings are compiled to Unicode. You'll also see True Type fonts Unicode encoded, and note some Unicode functions such as TextOutW and related calls.
You'll also note Unicode character semantics and sorting.
Character conversions
You may run into a situation where conversions are required between legacy character sets and Unicode. Programs that are Unicode internally but still get data from outside need to perform conversion; these include web browsers and client/server applications.
In terms of conversions from one legacy set to another, pivot through Unicode.
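Pivoting through Unicode can be sketched with Python's built-in codecs: decode the legacy bytes to Unicode, then encode to the target set. The function name `convert` is made up for this example, and the specific code pages shown are only illustrations.

```python
def convert(data, source, target):
    """Convert bytes between two legacy character sets via Unicode."""
    text = data.decode(source)    # legacy set -> Unicode
    return text.encode(target)    # Unicode -> other legacy set

# Japanese text from Shift-JIS to EUC-JP.
sjis = "日本語".encode("shift_jis")
eucjp = convert(sjis, "shift_jis", "euc_jp")
print(sjis.hex(), "->", eucjp.hex())

# A Latin-1 e-acute moved to the old DOS code page 437.
print(convert(b"\xe9", "latin-1", "cp437").hex())
```

With one decoder and one encoder per character set, N sets need 2N mapping tables rather than a separate table for every pair. If the target set lacks a character, the encode step raises an error here instead of silently losing data.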