Summary
In September, the SimulTrans Localization Seminar Series featured Ken Lunde, Manager of CJKV Type Development at Adobe Systems. Lunde discussed cross-locale code conversion, specifically conversions involving Chinese characters in the Chinese, Japanese, Korean, and Vietnamese (CJKV) locales.
Lunde's topics included:
- Code conversion basics
- Character sets versus encodings
- Chinese character relationships
- Advantages of Unicode
Lunde demonstrated how to best handle common characters, simplified and traditional characters, and variant forms. He also addressed code conversion pitfalls.
Lunde shared two code conversion tools: CJKVConv.pl, a tool he developed in Perl which serves as a test vehicle to illustrate cross-locale code conversion issues; and Uniconv, developed by Basis Technology and designed to support a wide variety of character sets and encodings.
Lunde is the author of the forthcoming O'Reilly book CJKV Information Processing, which contains code tables and is ideal for looking up a specific character. He has worked at Adobe in San Jose for seven years and holds a Ph.D. in linguistics from the University of Wisconsin-Madison.
For more information on Lunde's book, visit the publisher's website at http://www.oreilly.com/catalog/cjkvinfo.
Algorithmic code conversion
First, let's examine code conversion basics: what types of code conversion are out there and how they're applied. The first method is algorithmic code conversion. People generally use this method because it's much easier. If you're dealing with a single locale, such as only Japanese or Chinese or Korean, you can usually get away with using mathematical processes to convert between encodings. For example, in the case of Japanese, Shift-JIS, EUC-JP, and ISO-2022-JP are all compatible in the sense that you can use mathematical processes to convert from one to the other.
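As a rough illustration of what such a mathematical process looks like, here is a minimal Python sketch (not from the seminar) that converts EUC-JP to Shift-JIS purely by arithmetic. The function name is hypothetical, and the sketch assumes the input contains only ASCII and two-byte JIS X 0208 characters; anything else is rejected.

```python
def euc_jp_to_shift_jis(data: bytes) -> bytes:
    """Convert EUC-JP bytes to Shift-JIS arithmetically (ASCII + JIS X 0208 only)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 < 0x80:                      # ASCII passes through unchanged
            out.append(b1)
            i += 1
            continue
        if (not 0xA1 <= b1 <= 0xFE or i + 1 >= len(data)
                or not 0xA1 <= data[i + 1] <= 0xFE):
            raise ValueError(f"unsupported byte sequence at offset {i}")
        # EUC-JP is the 7-bit JIS X 0208 code with the high bit set on each byte.
        j1 = b1 - 0x80
        j2 = data[i + 1] - 0x80
        # Shift-JIS packs two JIS rows into each lead byte.
        if j1 % 2:                         # odd row: trail bytes 0x40-0x7E, 0x80-0x9E
            s2 = j2 + 0x1F
            if s2 >= 0x7F:                 # 0x7F is skipped in Shift-JIS
                s2 += 1
        else:                              # even row: trail bytes 0x9F-0xFC
            s2 = j2 + 0x7E
        s1 = (j1 + 1) // 2 + 0x70
        if s1 >= 0xA0:                     # skip the half-width katakana range
            s1 += 0x40
        out += bytes((s1, s2))
        i += 2
    return bytes(out)


sample = "日本語 abc"
converted = euc_jp_to_shift_jis(sample.encode("euc_jp"))
print(converted == sample.encode("shift_jis"))
```

No lookup table is involved: every output byte is computed from the input bytes, which is only possible because the three Japanese encodings share the same underlying character set and ordering.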
Table-driven code conversion
The second type of code conversion is table-driven. It is needed when the ordering of the characters is significantly different between the two encodings. As a result, you cannot use a mathematical process to do the conversion; you must treat every character on a per-case basis to convert from one encoding to another. (This process is also required when dealing with Unicode.)
Surprisingly, table-driven conversion can sometimes be faster than algorithmic because you don't have to apply a mathematical process. For example, if you want to deal with Shift-JIS conversion, the algorithm is far from simple.
Reasons to choose table-driven conversion include:
- Different number of characters
- Different ordering of characters
- Different characters
Chinese character sets
If you're dealing with Chinese, there are two basic Chinese character sets implemented on computers today, GB 2312-80 and Big Five. The latter has roughly twice as many characters, so if you're going to be converting between these two, something will be lost during the conversion process (simply by sheer number of characters). In Big Five, characters in both levels are ordered according to number of strokes; in GB 2312-80, Level 1 characters are ordered according to Pinyin reading and Level 2 characters by radical. These differences can force you to use table-driven conversion.
These character sets also use different types of characters, with GB 2312-80 generally featuring simplified forms and Big Five, traditional forms. Again, you must deal with the problem of how to convert a simplified character into a traditional one, and vice versa. It's not always a one-to-one mapping.
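The ordering difference can be seen with a small Python sketch (an illustration, not part of the seminar). It uses Python's built-in gb2312 and big5 codecs and four characters chosen here because they exist unchanged in both character sets; the helper name is hypothetical.

```python
def order_in(encoding: str, chars: str) -> list:
    """Sort characters by their byte values in the given encoding."""
    return sorted(chars, key=lambda ch: ch.encode(encoding))


chars = "阿一人大"

gb_order = order_in("gb2312", chars)    # Level 1 is ordered by Pinyin reading
big5_order = order_in("big5", chars)    # ordered by number of strokes

print("GB 2312-80:", gb_order)
print("Big Five:  ", big5_order)

# Because the orders disagree, no fixed arithmetic offset can relate the two
# encodings; each character needs its own table entry.
gb_to_big5 = {ch.encode("gb2312"): ch.encode("big5") for ch in chars}
for gb, b5 in gb_to_big5.items():
    print(gb.hex(), "->", b5.hex())
```

A real converter would hold thousands of such entries, plus a policy for characters that exist on only one side.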
GB 2312-80 is the most common Chinese character set found on today's machines. However, more contemporary machines, such as later versions of Windows, can process a new character set, GBK. This set is basically GB 2312-80 with the rest of the Chinese characters found in Unicode thrown in. However, it's implemented in a non-Unicode fashion.
In Taiwan the most common character set is Big Five. It's used on PCs and Macs. Most UNIX environments process another standard, CNS 11643-1992. While the CNS standard, by definition, has seven planes of characters for a total of some 50,000, not all have been implemented. As an aside, all seven planes are included in CJKV Information Processing so if you ever need to find a character, it should be in there.
Hong Kong is an unusual case. Even though it is now officially a part of China, it still uses Big Five and, in fact, the Hong Kong government developed an extension to Big Five, called GCCS, in the early '90s. This extension contains more than 3,000 additional Chinese characters.
Japanese character sets
Japan has its JIS X 0208:1997 standard (which has been around for years). They also have an extended character set, JIS X 0212-1990, which features nearly 6,000 additional characters. The country is working on JIS levels 3 and 4 that will add significantly more characters to the Japanese locale.
Korean character sets
There are now character sets for both North and South Korea. About a year ago, the Korean standards organization changed the designation of all KS standards that deal with character sets from a C to an X. What was once KS C 5601 is now KS X 1001. Likewise, KS C 5657 is now KS X 1002. A similar change took place in Japan in the late '80s, before which all standards that dealt with character sets featured a "C" designation (for example JIS C 6226). In the case of Japan, time has allowed people to grow accustomed to the change, while the Korean designation change still takes some getting used to.
North Korea came out with its own character set standard last year, KPS 9566-97, which is very similar in structure to South Korea's.
Vietnamese character sets
And lastly, Vietnam established character set standards in '93 and '95. The first, TCVN 5773:1993, contains Chinese characters created in Vietnam called chu Nom. The second set, TCVN 6056:1995, consists of genuine Chinese characters called chu Han.
An important distinction
The distinction in CJKV locales between character set and encoding is very important. A good example can be found in the context of the Web and the Internet. When you create a document, HTML or XML or whatever, you want to inform the browser as to what character set and encoding the document uses. This allows the browser to choose the proper font for display. The best charset designators are routinely encoding names because a typical encoding name, if chosen correctly, also tells you the locale. Some companies have apparently chosen to use character set names as charset designators instead.
Character set designations are a poor choice for charset names for another reason: an encoding often implements not just a single character set but several.
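Both points can be illustrated with Python's built-in codecs. This is a sketch added for illustration; the sample characters and the dictionary labels are choices made here, not examples from the seminar.

```python
# One character set (JIS X 0208), three encodings: the bytes differ each time.
text = "日本語"
encoded = {enc: text.encode(enc) for enc in ("shift_jis", "euc_jp", "iso2022_jp")}
for enc, data in encoded.items():
    print(f"{enc:12} {data.hex()}")

# One encoding (EUC-JP), several character sets. The labels below name the
# character set each sample character is drawn from.
samples = {
    "ASCII": "A",
    "JIS X 0208": "あ",
    "JIS X 0201 katakana": "ｱ",   # half-width katakana
    "JIS X 0212": "丂",
}
euc = {name: ch.encode("euc_jp") for name, ch in samples.items()}
for name, data in euc.items():
    print(f"{name:22} {data.hex()}")
```

A designator such as EUC-JP identifies the byte layout, the locale, and implicitly every character set the encoding carries, which a single character set name cannot do.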
Chinese character relationships
Traditional forms are still used in Taiwan and Korea. Japan uses simplified forms, but not as simplified as those used in Mainland China. The relationship between a simplified and a traditional character is not always one-to-one. In many cases, you may have two or more traditional forms that have collapsed into a single simplified form. As a result, when you do this conversion, you must be aware that there may be multiple choices when going from simplified to traditional.
Often you can't make this decision on a per-code point basis. You must look at a higher level to determine what context the character is in because particular words will use a different traditional form. Chinese dictionaries often list multiple traditional forms for a single simplified form and will give examples of words that each of the traditional forms appear in. As you can tell, there is some high-level work to be done.
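A minimal Python sketch of this context-dependent lookup follows. The two tables are tiny, hypothetical stand-ins for a real simplified/traditional database, and the function name is likewise an assumption. The simplified character 发 corresponds to either 發 or 髮 depending on the word it appears in.

```python
# Word-level entries resolve cases where a simplified character has more than
# one traditional form (hypothetical sample data).
WORD_TABLE = {
    "头发": "頭髮",   # "hair" uses 髮
    "发展": "發展",   # "develop" uses 發
}

# Character-level candidates; the first entry is used as the default.
CHAR_TABLE = {
    "发": ["發", "髮"],
    "头": ["頭"],
}


def to_traditional(text: str) -> str:
    """Convert simplified to traditional, preferring word-level matches."""
    longest = max(len(word) for word in WORD_TABLE)
    out = []
    i = 0
    while i < len(text):
        # Try the longest word match first, down to two characters.
        for n in range(longest, 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += n
                break
        else:
            # No word matched: fall back to the per-character default.
            ch = text[i]
            out.append(CHAR_TABLE.get(ch, [ch])[0])
            i += 1
    return "".join(out)


print(to_traditional("头发"))
print(to_traditional("发展"))
```

The per-character default is only a guess; without a matching word entry, the converter cannot know which traditional form is intended.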
The relationship between simplified and traditional characters is often locale specific. In some locales a particular character will be considered a simplified form, while other locales say this is the traditional form. Rather than be universal, the concept is driven by locales (and, in reality, the governments and organizations that define these relationships).
Some characters have multiple forms and, while written somewhat differently, are still considered the same character. These alternate character, or variant, forms feature the same semantics. In some cases a variant character relationship can be used to coerce or force a conversion when the character itself would not normally map. Fortunately a number of characters are common across all locales. When you want to convert these characters, you won't encounter any problems.
Advantages of Unicode
Unicode provides a common representation for all characters or, in other words, serves as an "information interchange" code. If you go from Chinese to Japanese, the Chinese gets converted into Unicode internally, then gets converted into Japanese. When conversion works this way, the number of mapping tables required is 2n, where n is the number of character set and encoding combinations: one table into Unicode and one out of Unicode for each. No matter how many combinations you want to support, simply multiply that number by 2 and that will give you the number of mapping tables required.
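For comparison, converting directly between every pair of encodings would need one table per ordered pair. The short Python sketch below (function names are illustrative) works out both counts.

```python
def pivot_tables(n: int) -> int:
    """Tables needed when every conversion goes through Unicode."""
    return 2 * n            # one table into Unicode, one out, per encoding


def direct_tables(n: int) -> int:
    """Tables needed when every encoding converts directly to every other."""
    return n * (n - 1)      # one table per ordered (source, target) pair


for n in (3, 10, 20):
    print(n, pivot_tables(n), direct_tables(n))
```

With 10 encodings the pivot approach needs 20 tables where the direct approach needs 90, and the gap widens as more encodings are added.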
Mapping tables are available on the Web at ftp://ftp.unicode.org/.
Handling common characters
A number of the characters that are common across character sets are easily converted. For example, if you want to convert Korean into Japanese, as in the following four characters, these are the code points in KS X 1001:1992.
In the process of running the conversion, this information first gets converted into Unicode, then converted into Japanese. It's a relatively simple process.
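In Python, this two-step process can be sketched with the built-in codecs, which use Unicode as their internal representation. The four characters below were chosen for this illustration because they exist in both character sets; they are not the characters from the seminar example, and the function name is hypothetical.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between encodings by way of Unicode."""
    text = data.decode(source)      # source encoding -> Unicode
    return text.encode(target)      # Unicode -> target encoding


korean_bytes = "日本山水".encode("euc_kr")          # KS X 1001 code points
japanese_bytes = convert(korean_bytes, "euc_kr", "euc_jp")

print(korean_bytes.hex())
print(japanese_bytes.hex())
print(japanese_bytes.decode("euc_jp"))
```

The byte values change, but the characters survive intact because each has a mapping on both sides of the Unicode pivot.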
Handling Simplified/Traditional
To achieve this type of code conversion, you first need a database containing correspondences between simplified and traditional forms. Because the relationship between simplified and traditional forms is locale specific, you often have more than one table as a "look-up." Depending upon which language you're dealing with, you may need to invoke a different look-up table to handle those characters.
In the following example, you have the character that means "black." This character gets converted into Unicode, but when it tries to jump to Japanese, the converter recognizes that the mapping table from Unicode into Japanese has no mapping for this particular character. It would map to nothing. Rather than map to nothing, it uses a simplified/traditional database to make a correspondence between this Unicode code point and another one (which can then successfully convert into Japanese).
Although GB 2312-80 contains the simplified characters, this is one of the exceptions where even the simplified Chinese is not simplified to the extent that it was simplified in Japan. This serves as an example of a character whose definition of traditional form is different depending upon the locale.
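The fallback step might be sketched in Python as follows. The one-entry table stands in for a real simplified/traditional database, and the function name is an assumption. Python's shift_jis codec is used as the JIS X 0208 target; this sketch assumes that codec has no mapping for 黑 (U+9ED1) while it does have one for 黒 (U+9ED2).

```python
# Hypothetical one-entry stand-in for a simplified/traditional database.
FALLBACK = {"\u9ED1": "\u9ED2"}   # 黑 -> 黒


def encode_with_fallback(text: str, encoding: str, fallback: dict) -> bytes:
    """Encode text, substituting a related form when a character has no mapping."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            alternative = fallback.get(ch)
            if alternative is None:
                raise                       # nothing to fall back on
            out += alternative.encode(encoding)
    return bytes(out)


gb_bytes = "黑".encode("gb2312")                 # source text in GB 2312-80
unicode_text = gb_bytes.decode("gb2312")         # step 1: into Unicode
sjis_bytes = encode_with_fallback(unicode_text, "shift_jis", FALLBACK)

print(sjis_bytes.decode("shift_jis"))
```

The database is consulted only after the direct mapping fails, so characters that map cleanly are never altered.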
Handling variant forms
In Japanese you have what are called "standard forms." This refers to what most recognize as being the form that people use every day. Then there are "traditional forms," which are traditional in the sense that the government has identified these characters as the official traditional form. If you want to render a character in the traditional way, this is how you write it.
In addition, there are a number of characters that have additional variant forms, or alternate ways of writing the character. While they still represent the same underlying character, they simply provide a different way to express it.
The following example features characters that are in the JIS X 0208:1997 standard, including general variants. The second character in this example is written in the traditional form; however, this character set also features four additional variant forms. If you enter the realm of professional publishing, for example dictionaries, you will encounter some variation of this and some variation of that. In many cases you may have up to two dozen ways of writing this one character. You encounter the same issues when dealing with Chinese.
Handling the compatibility zone
The CJK compatibility zone is specific to Unicode. This zone includes characters simply because the source character sets had them encoded in duplicate. For example, the standard Korean character set, KS X 1001:1992, includes 268 duplicate Chinese characters. If you look at the corresponding code points, you find the same glyph, which makes you wonder, why did they duplicate those? The reason is that in Korean, like other languages, many Chinese characters have multiple readings or pronunciations. When this character set was designed, they decided to encode all of these cases in duplicate.
As a result, we need to apply two different transformations in the internal Unicode to convert it from Korean to Chinese. In the following example, the original KS standard has four distinct code points. When we convert those into Unicode, we again have four distinct code points. However, in the case of the first, second, and fourth instance, they fall into the 0xF9 region, which is in the CJK compatibility zone. As a result, we must first "normalize" to the same code point so that all of these things get normalized to a standard or non-compatibility code point (which is 0x6A02).
As for why some characters are put in the compatibility zone as opposed to the standard area, I'm not sure. It may have been the frequency of the reading or pronunciation that determined which one was more common.
The next step is to convert the character into a simplified form. Because the character doesn't convert into GB 2312-80, we need to determine if it has a simplified form. If so, we convert the character into 0x4E50, which in turn successfully converts into GB 2312-80. You will notice that it does produce a somewhat different looking glyph.
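The two transformations can be sketched in Python. This illustration assumes that Unicode normalization (here NFC, via the standard unicodedata module) is an acceptable way to fold compatibility ideographs onto their standard code points; the one-entry simplification table and the function name are hypothetical.

```python
import unicodedata

# Hypothetical one-entry stand-in for a traditional-to-simplified database.
SIMPLIFIED = {"\u6A02": "\u4E50"}   # 樂 -> 乐


def hanja_to_gb2312(text: str) -> bytes:
    """Normalize compatibility ideographs, simplify, then encode as GB 2312-80."""
    # Step 1: fold compatibility code points onto the standard one.
    normalized = unicodedata.normalize("NFC", text)
    # Step 2: replace traditional forms that have a simplified counterpart.
    simplified = "".join(SIMPLIFIED.get(ch, ch) for ch in normalized)
    # Step 3: encode into the target character set.
    return simplified.encode("gb2312")


# Four distinct Unicode code points for the same character; three of them
# are in the compatibility zone.
source = "\uF914\uF95C\u6A02\uF9BF"
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", source)])
print(hanja_to_gb2312(source).decode("gb2312"))
```

Note that this is lossy: once the four code points are folded together, the distinction between the original Korean readings cannot be recovered from the output.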
Handling unmappable characters
Some characters are simply not available in the target character set plus encoding combination, even as a variant form. The most common occurrence is Korean hangul. You won't find these characters in either Chinese or Japanese character sets; they are unique to Korean character sets and to Unicode.
There are two ways to approach an unmappable character: 1) the graceful way, and 2) the ungraceful way. The latter is to simply remove such characters from the output altogether. However, if you do so, you won't have an indication that a character has been dropped, that it didn't map.
A much better, or graceful, way is to output the characters as some sort of tag. For example, output them as XML character references: &#x followed by four uppercase hex digits that represent the Unicode value, plus a trailing semicolon. If you output these into a file and try to reverse the conversion, the converter can then recognize these character references and convert them back into the original character set and encoding.
Although you can choose any sort of tag you wish, there are advantages to using some sort of standard tag such as this XML character reference since it can be recognized by other software.
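A minimal Python sketch of the graceful approach is shown below. Python's built-in xmlcharrefreplace handler emits decimal references, so this sketch registers a custom error handler to produce the hexadecimal form described above. The handler name "hexcharref" and the function names are assumptions made for this illustration.

```python
import codecs
import re


def hex_charref(error):
    """Replace each unencodable character with an &#xXXXX; reference."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    unmappable = error.object[error.start:error.end]
    replacement = "".join(f"&#x{ord(ch):04X};" for ch in unmappable)
    return replacement, error.end   # resume encoding after the bad span


codecs.register_error("hexcharref", hex_charref)


def to_target(text: str, encoding: str) -> bytes:
    """Encode, tagging unmappable characters instead of dropping them."""
    return text.encode(encoding, errors="hexcharref")


def from_target(data: bytes, encoding: str) -> str:
    """Decode and turn hexadecimal character references back into characters."""
    text = data.decode(encoding)
    return re.sub(r"&#x([0-9A-F]{4,6});",
                  lambda match: chr(int(match.group(1), 16)), text)


tagged = to_target("한글 ABC", "shift_jis")   # hangul has no Shift-JIS mapping
print(tagged)
print(from_target(tagged, "shift_jis"))
```

One caveat of this sketch: the reverse step also expands any character references that were literally present in the original text, so a production converter would need a way to tell the two apart.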
Avoiding code conversion pitfalls
There are numerous cases where you have the same glyph but different (or multiple) semantics. In the following example, you want to convert this particular Japanese character from GB 2312-80 to the JIS X 0208:1997 standard. There are three possibilities, two of which are realistic.
First, you can convert directly into the JIS standard because this Unicode code point has a mapping into Japanese that gives you the same glyph. This leads to the second point. The example illustrates three possible scenarios and how the semantics of the character are interpreted. If this character were interpreted as a radical, you'd want to convert it this way. But in Chinese this is also its own character. You first convert the character from simplified to traditional Chinese. You then convert it from traditional to simplified Japanese (which is more commonly used).
Lastly, the character gets converted from the simplified to traditional form, which again is the Japanese standard. There are three different ways to convert this character into Japanese and it may depend on what sort of context you're dealing with. In most cases it's probably going to be the second character depicted in the example which, as you can see, requires the most work.
Cross-locale code conversion tools
Tools that do this sort of conversion are based on Unicode. One is CJKVConv.pl, which I wrote in Perl. I consider it a test vehicle to illustrate these various code conversion issues. I don't consider it a perfect tool, but it gets you there. It can be found at: ftp://ftp.oreilly.com/pub/examples/nutshell/ujip/perl/cjkvconv.pl.