SimulTrans Localization Blog: SimulTips

Chinese Localization

January 24, 2015 / by the SimulTrans Team

Making up over a quarter of the world's population, Chinese speakers represent an exciting market opportunity. Chinese localization requires cultural and political sensitivity, knowledge of the differences between the Simplified Chinese and Traditional Chinese writing forms and where each is used, and a solid understanding of software engineering issues.

Who Speaks Chinese?

Chinese is spoken by over a quarter of the world's population, and therefore offers a huge potential audience for the internet and other products.

Before discussing Chinese localization, you need to have an understanding of who actually speaks the language.

China itself has the largest Chinese-speaking population, at 1.3 billion people. Taiwan, Republic of China, follows with a population of 22.4 million, and then Singapore with a population of 3.6 million. All the other Chinese-speaking people scattered around the world are estimated to number about 100 million. Indonesia alone has a population of 212 million, and about 20% of those people are ethnic Chinese.

China as a Market

There are a number of factors that make China a promising market.

PC Growth Rate

The growth rate of personal computer use in China last year was 400%. The current PC market in China is about 5% of the U.S. market; however, the projected growth rate is about four times that of the U.S.

Software Acceptance Curve

Geoffrey Moore described the bell curve of software acceptance and usage in his book Crossing the Chasm. Visionaries use software first, followed by early adopters, the early majority, the late majority, and finally the laggards. Right now in China, as well as in Taiwan, the visionary population is very small, and it generally takes about three to six versions of software before you have a product that can go to the early majority. This means that, to date, most software in China is imported. As the Chinese software industry develops, that will become less and less the case. But so far approximately 98% of software sold in China is U.S. software, localized into Chinese.

Lack of Legacy Infrastructure

In addition to the software acceptance curve, the lack of legacy IT infrastructure makes selling into China advantageous. Old systems don't have to be worked around as they do in the U.S. and other Western countries. Cellular phones are a good example. Most of the U.S. still uses analog cell phones. A big reason for this is that Americans were early adopters of cellular telephones. A lot of analog base stations were built, and even though digital is clearly superior to analog, we're still using mostly analog cellular phones. People in China were able to adopt digital phones from the very beginning. CDMA base stations are basically the only base stations in China, so everyone uses digital phones.

Another example of how the lack of an outdated infrastructure makes China an attractive market relates to internet connectivity. The Chinese government recently announced that they will link the top 15 cities together with fiber optic cable. In the U.S. we're hoping to get cable modems into our homes so that data will flow through a fatter hose. In China, most people will have data spewing out of a fire hydrant over the next few years.

Other government initiatives are also underway to build up infrastructure based on new technologies.

Flattening Yield Curve

Most of the software being sold in China is from the U.S., and has proven to be solid. If it weren't a successful product, the developers probably wouldn't have the funds to pay someone like SimulTrans to localize it for China.

Pent Up Demand

As Asia has melted down financially over the past few years, companies have been slowing down or delaying their Chinese IT investments. The need, however, has continued to climb. This means that there may be more demand for IT investment out there than we can handle.

Chinese Localization

There are four ways to classify Chinese characters.

Pictographs

The first classification of characters is pictographs. Latin and Chinese writing actually started off the same way: from hieroglyphs. People saw something that was tangible and they drew a picture of it. Over time that picture became a representation of that specific, tangible item. For example, the Chinese characters for sun (日) and moon (月) look a bit like the sun and a half moon. The character for mountain (山) looks something like a mountain.

Over more time, Latin developed into transliteration, meaning that words are spelled out rather than being pictorially represented. In Chinese it went exactly the opposite. Writing grew more and more pictorial.

Simple Ideographs

Simple ideographs represent ideas or concepts rather than things, and constitute another type of character. Examples include the characters for one, two, and three (一, 二, 三) and for up and down (上, 下).

Compound Ideographs

Compound ideographs are a combination of the previous two character types. For instance, take the character for tree (木). If we add another tree next to it, it becomes woods (林). Add a tree on top of that, and it becomes a forest (森). Or if you take the sun (日) and the moon (月) and put them together, the result (明) means bright or clear.

Phonetic Ideographs

Phonetic ideographs make up about 90% of the characters in Chinese. Phonetic ideographs are similar to compound ideographs, except that a phonetic component is included to indicate the pronunciation. The character for copper (銅), for example, is made up of the ideograph for metal (金) along with the phonetic component "tong" (同). This means that it is pronounced "tong," and since the metal ideograph is included, you know it refers to copper.

Traditional versus Simplified Chinese

As you can imagine, over time things got a little hairy. More and more characters were developed for more and more ideas.

Eventually the number of characters reached staggering levels, and the government decided to create a subset of commonly used characters to be used as a separate method of writing from the traditional methods. The goal was to allow more people to read and write, since the task of learning all the characters would be less time consuming. This resulted in the development of the two writing systems, Simplified and Traditional Chinese. Simplified Chinese is based on traditional, but has fewer character strokes. Simplified Chinese today is used in China and Singapore, whereas Traditional Chinese is used in Hong Kong and in Taiwan.

There are no fully accurate automated methods for converting between Simplified and Traditional Chinese, so much of this work must be done manually. Traditional Chinese has more characters, and many characters mean different things depending on whether they appear in Traditional or Simplified text. You can re-map characters or do a code conversion, but you will still need a translator to adjust the syntax and terminology accurately.
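
To illustrate why a simple character re-mapping is not enough, here is a minimal Python sketch. The mapping table is a tiny, hypothetical sample written for this example (real conversion tables cover thousands of characters), and the function names are assumptions, not part of any real tool. It shows that one Simplified character can correspond to more than one Traditional character, so someone still has to choose based on meaning.

```python
# Hypothetical, deliberately tiny mapping table for illustration only.
SIMPLIFIED_TO_TRADITIONAL = {
    "头": ["頭"],
    "发": ["發", "髮"],  # "to issue" vs. "hair": context decides
    "软": ["軟"],
    "件": ["件"],
}

def convert_candidates(text):
    """Return, for each character, the list of possible Traditional forms."""
    return [SIMPLIFIED_TO_TRADITIONAL.get(ch, [ch]) for ch in text]

def ambiguous_positions(text):
    """Return the indexes where a person or context-aware tool must choose."""
    candidates = convert_candidates(text)
    return [i for i, options in enumerate(candidates) if len(options) > 1]

print(convert_candidates("头发"))   # "hair": the second character is ambiguous
print(ambiguous_positions("头发"))
```

Even when every character maps cleanly, terminology can differ by market. "Software," for example, is commonly 软件 in mainland China and 軟體 in Taiwan, which a character table alone will not catch.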

Locale Specific Issues

Date and Time Format

The format for representing dates in China is year first, then month, and then day. The year can be represented by its last two digits (you can drop the 19). The calendar is Gregorian, and the year, month, and day can be labeled with Chinese characters.

AM and PM are represented by Chinese characters, so they need to be translated.
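
As a rough illustration of the date and time conventions above, the following Python sketch formats a date year first, labels year, month, and day with Chinese characters (年, 月, 日), and uses 上午 and 下午 for AM and PM. The function name is an assumption made for this example; production code would normally rely on the platform's locale data instead.

```python
from datetime import datetime

def format_chinese_datetime(moment):
    """Format a datetime year-month-day with Chinese labels and AM/PM."""
    period = "上午" if moment.hour < 12 else "下午"  # AM / PM
    hour12 = moment.hour % 12 or 12                  # 0 and 12 both display as 12
    date_part = f"{moment.year}年{moment.month}月{moment.day}日"
    return f"{date_part} {period}{hour12}:{moment.minute:02d}"

sample_moment = datetime(2015, 1, 24, 15, 30)
print(format_chinese_datetime(sample_moment))   # 2015年1月24日 下午3:30
print(sample_moment.strftime("%Y-%m-%d"))       # numeric year-first form: 2015-01-24
```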

Currency

The currency indicator in China is RMB together with the yen sign (¥). The thousands separator is a comma. In some locales, such as some European countries, the separator is a period, but China follows the same convention as the U.S. For example: RMB¥ 1,200.
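
A minimal Python sketch of this currency format, assuming a hypothetical helper named format_rmb: Python's format specification provides the comma as a thousands separator directly.

```python
def format_rmb(amount, decimals=0):
    """Format an amount with the RMB indicator and comma thousands separators."""
    return f"RMB¥ {amount:,.{decimals}f}"

print(format_rmb(1200))                   # RMB¥ 1,200
print(format_rmb(1234567.5, decimals=2))  # RMB¥ 1,234,567.50
```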

Name Format

In English the full name format is first name and then last name. In Chinese the full name format is last name and then first name. There is no space between the names, and to make things even more confusing, the first and last names can each be either one or two characters long. You may notice, however, that most Chinese people have only one character in their last name, and in the U.S., many Chinese people have last names of only one syllable. People with two syllables in their last name usually have a two-character last name.

Address Format

The address format is almost exactly the opposite of the U.S. format. Country comes first, then the province and city, then the street address, then last name, first name, and finally the honorific (Mr., Mrs., Sir).
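
The name and address ordering described above can be sketched in Python as follows. The Recipient class, the function names, and the sample data (including the placeholder street 某某路1号) are all hypothetical and exist only to illustrate the ordering.

```python
from dataclasses import dataclass

@dataclass
class Recipient:
    family_name: str
    given_name: str
    honorific: str
    street: str
    city: str
    province: str
    country: str

def format_name_cn(r):
    """Family name first, then given name, with no space between them."""
    return r.family_name + r.given_name

def format_address_cn(r):
    """Largest unit first: country, province, city, street, then name and honorific."""
    location = r.country + r.province + r.city + r.street
    return f"{location} {format_name_cn(r)}{r.honorific}"

# Hypothetical sample data.
sample = Recipient("王", "小明", "先生", "某某路1号", "深圳市", "广东省", "中国")
print(format_name_cn(sample))
print(format_address_cn(sample))
```

Storing the name parts separately, as here, avoids having to guess where a two- to four-character full name splits.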

Political Issues

The next issues to be discussed relate to the Chinese political situation. This is where Chinese localization gets significantly different from traditional localization.

Regional Issues

As most people are aware, there is a stormy history between China and Taiwan. After World War II, there was a civil war in China. During the civil war, the Communist Party took over mainland China and the existing government retreated to Taiwan. There is sensitivity about the issue on both sides; Taiwan thinks of itself as a separate country, whereas China adamantly denies it.

There are regional issues outside of Taiwan as well. For example, Hong Kong is a special administrative region within China. Macau is a Chinese peninsula that was colonized by Portugal, and which returned to China in 1999 with the understanding that it could maintain its free market economic system.

Terminology to Avoid

Because of the regional sensitivities, some commonly used phrases should be avoided, for example: Republic of China, Taiwan independence, Communist Bandit, and Regain possession of the lost mainland. Some of these phrases may seem obviously inappropriate, but believe it or not, a certain very large software company with a very popular word processor went to China a few years ago, and when you highlighted Taiwan and used the thesaurus, these phrases were listed.

Locale Specific Changes

When localizing software for sale in China, there are a few specific items that should be changed:

  • Country list should be changed to Country/Region.
  • Sensitive holidays (like Taiwan Independence) should be removed from calendars.
  • The Taiwan date and currency format should be removed.
  • Sensitive clipart should be omitted (for example, the Taiwan flag).
  • Capital city lists should be changed to Capital/Major cities.
  • Taiwan should be included in maps of China.
  • New country names in the evolving Eastern European region should be used.

Internationalization

People frequently get internationalization and localization confused. In layman's terms, internationalization can be defined as the process of preparing software so that it can be used by more than one culture. Usually you can think of internationalization as the initial process, and localization as happening multiple times, depending on the cultures that you want to localize into. Generally, the greater the level of effort spent on internationalization, the lesser the effort spent on localization.

The basic strategy of internationalization is to separate the data segments of the code from the program segments. User interface components like dialog boxes, text boxes, sounds, toolbars, etc. should be separated into files that are independent of the program code. Doing this allows translators to translate or localize the software without having access to the program code. The program should be able to read this data from any place in the code.

When done right, internationalization significantly minimizes the level of effort in localization.
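
As a minimal sketch of this separation, the Python example below keeps all user-visible strings in per-locale resources and has the program look them up by key. The resource contents, key names, and function names are hypothetical; in a real product each resource would live in its own file that translators edit without touching the code.

```python
import json

# Stand-ins for separate resource files (for example, one JSON file per locale).
RESOURCE_FILES = {
    "en": '{"file_menu": "File", "save_prompt": "Save changes to {name}?"}',
    "zh_CN": '{"file_menu": "文件", "save_prompt": "是否保存对 {name} 的更改？"}',
}

def load_strings(locale, fallback="en"):
    """Load the fallback strings, then overlay the requested locale if present."""
    strings = json.loads(RESOURCE_FILES[fallback])
    if locale in RESOURCE_FILES:
        strings.update(json.loads(RESOURCE_FILES[locale]))
    return strings

def translate(strings, key, **params):
    """Look up a string by key and fill in its named placeholders."""
    return strings[key].format(**params)

zh = load_strings("zh_CN")
print(translate(zh, "file_menu"))
print(translate(zh, "save_prompt", name="report.txt"))
```

Named placeholders such as {name} let translators reorder a sentence freely, which matters because Chinese word order often differs from English.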

Character Set Standards

What is a character set standard? There are over 92,000 Chinese characters right now. It's hard enough to remember our own phone numbers, so memorizing 92,000 characters isn't something that many people can really focus on. Because of this, the Chinese government defined 7,500 characters that every student graduating from middle school needs to know. This is an example of a non-coded character set standard.

A coded character set standard is actually bigger than a non-coded set. This is because a coded character set standard is all the words or characters that need to be displayed on a computer. You can think of a character set standard as a bucket of words or a bucket of characters.

There is no universally accepted character standard for Chinese, such as ASCII for English, but some standard sets are used more than others. Two prevalent character sets in China are GB 2312-80 and GBK, which is a superset of GB 2312-80. The total number of characters defined in GBK is 21,688.
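
The relationship between the two character sets can be checked with Python's built-in gb2312 and gbk codecs. This is only a small sketch: the helper name can_encode is an assumption made for this example. GBK covers everything in GB 2312-80 and adds more characters, including Traditional forms such as 書.

```python
def can_encode(ch, codec):
    """Return True if the character exists in the given coded character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for ch in ("中", "書"):
    print(ch, "GB 2312:", can_encode(ch, "gb2312"), "GBK:", can_encode(ch, "gbk"))

# Characters present in both sets keep the same byte values.
print("中".encode("gb2312") == "中".encode("gbk"))
```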

Encoding Methods

Encoding is the process of taking a character and mapping it to a number so that it can be read by the computer in binary. This is where we get into the world of bits and bytes. Each bit is either 1 or 0, there are eight bits in a byte, and one byte can take 256 different values, so it can represent up to 256 characters.

The best way to think about this is in terms of a matrix. For one byte, there are 256 combinations of 1's and 0's that can occur. ASCII defines only 128 characters, of which 94 are printable graphic characters such as A, B, C, and D. The rest are the space character and 33 control characters, such as delete, shift in, and shift out. ASCII is mapped from 0 to 127. In this mapping, the letter A is 65, which the computer reads as 01000001.

For ASCII, the one-byte method of encoding, which allows for 256 characters, is quite sufficient.
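
The arithmetic here is easy to verify. The short Python sketch below shows the number assigned to the letter A, its eight-bit binary form, and how many values fit in seven bits (ASCII), one byte, and two bytes. The variable names are chosen only for this example.

```python
code = ord("A")              # the number ASCII assigns to the letter A
bits = format(code, "08b")   # the same number written as eight binary digits
print(code, bits)            # 65 01000001

print(chr(0b01000001))       # back from the bit pattern to the character: A

# Number of distinct values in 7 bits (ASCII), 1 byte, and 2 bytes.
print(2 ** 7, 2 ** 8, 2 ** 16)   # 128 256 65536
```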

But as stated earlier, the GBK standard contains 21,688 characters. Obviously 256 isn't going to cover it. The only answer is to encode characters using two bytes rather than one. Using two bytes means that instead of being able to define 256 characters, you can define 65,536. While that is still not enough to map all of the more than 92,000 Chinese characters, it easily covers the GBK standard.

When mapping Chinese characters using two bytes rather than one, the first byte tells you which row of the matrix the character is in, and the second byte tells you which column.
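
As a concrete sketch of the row and column idea, the Python snippet below uses the built-in gb2312 codec. In the common EUC-CN byte form of GB 2312-80, each byte is the row or column number plus 0xA0, so subtracting 0xA0 recovers the position in the matrix. The function name is an assumption made for this example.

```python
def gb2312_row_cell(ch):
    """Return the (row, column) position of a two-byte GB 2312 character."""
    # Unpacking fails for one-byte (ASCII) characters, which have no matrix position.
    lead, trail = ch.encode("gb2312")
    return lead - 0xA0, trail - 0xA0

print("中".encode("gb2312").hex())   # d6d0
print(gb2312_row_cell("中"))         # (54, 48)
```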

Big Five is the character set for Taiwan, and it contains more characters than GB 2312-80, though fewer than GBK.

Deleting and Inserting Double Byte Characters

Although a Chinese character is composed of more than one byte, it still needs to be treated as one character. If your software doesn't handle double byte characters properly, you could delete one byte of the character but leave the other byte so that the character is corrupted. Or if you deleted one byte and then entered a new one, it would actually change the character because it would be a different number. When deleting, both bytes need to be deleted with a single keystroke.

For text insertion, the cursor needs to move two bytes in order to avoid splitting a character.

All of this handling needs to be done at the application level.
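
A minimal Python sketch of character-aware deletion for GBK text follows. It assumes the usual GBK rule that a byte from 0x81 to 0xFE starts a two-byte character and any other byte stands alone; the function names are hypothetical. The same boundary list can be used to move a cursor one whole character at a time.

```python
def gbk_char_boundaries(data):
    """Return the byte offsets at which each character starts in GBK data."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        # A lead byte (0x81-0xFE) means this character occupies two bytes.
        i += 2 if 0x81 <= data[i] <= 0xFE else 1
    return offsets

def delete_last_char(data):
    """Remove the final character, deleting both bytes if it is double-byte."""
    offsets = gbk_char_boundaries(data)
    return data[:offsets[-1]] if offsets else data

buffer = "A中文".encode("gbk")
print(gbk_char_boundaries(buffer))             # [0, 1, 3]
print(delete_last_char(buffer).decode("gbk"))  # A中

# Deleting a single byte instead leaves half a character behind.
print(buffer[:-1])
```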

Searching with Double Byte Characters

Some search programs don't treat two or more bytes as a single unit. For example, the UNIX grep command traditionally searches byte by byte. When a character is entered for a search, the search engine needs to search for both bytes, and must match both bytes at a character boundary before returning a candidate. Also, the search index needs to advance one character, rather than one byte.
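
The following Python sketch shows how a byte-by-byte search can return a false match. It builds a search character from the second byte of 中 and the first byte of 文 in GBK. That byte pair happens to be a valid character of its own, so a byte-level search "finds" it inside 中文 even though the text does not contain it. The variable names are chosen only for this example.

```python
haystack = "中文"                             # GBK bytes: D6 D0 CE C4
needle = bytes([0xD0, 0xCE]).decode("gbk")   # a different, valid character

# Byte-level search: matches across the boundary between the two characters.
byte_hit = needle.encode("gbk") in haystack.encode("gbk")

# Character-level search: compares whole characters, so there is no match.
char_hit = needle in haystack

print(byte_hit, char_hit)   # True False
```

Decoding to characters before searching, as in the second comparison, is the simplest way to guarantee that matches start on a character boundary.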

Encoding Standards

There are a number of different standards with which you can encode. The major encoding standards in China are ISO-2022-CN, ISO-2022-CN-EXT, EUC-CN (which is used on UNIX), and GBK. Luckily, you don't have to choose the encoding standard, because the method is determined by the platform. Microsoft Windows uses code page 936, whose encoding is GBK.

Unicode is becoming more and more prevalent. Under the previous encoding standards, ASCII characters are stored as one byte. Unicode encodings such as UTF-16 store everything, including ASCII, as multiple bytes. If you're going to be localizing into two or more languages, go with Unicode, because one code base can then support multiple localizations. Current versions of Microsoft Windows are Unicode-based in every language, everywhere in the world.

You do need to be concerned with conversions, and you have to know what you'll be doing with the data. Will it be displayed on the screen? Will the data be saved in a file and passed along to someone else? What software will it be used with?

For input and output, you convert the data into whatever encoding is appropriate for its eventual use. There are easy-to-use tools available for every platform to convert into and out of Unicode.

You can use Unicode internally in your applications regardless of what the operating system is. Within your own applications you can be equipped to handle any international character set.
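
As a small sketch of this approach in Python, where strings are Unicode internally, the example below decodes incoming GBK bytes into a string and then encodes that string into whichever encoding the destination needs. The variable names are chosen only for this example.

```python
# Input: bytes arriving in GBK, for example from a file saved on code page 936.
gbk_bytes = "中文".encode("gbk")

# Decode to Unicode at the boundary and work with the string internally.
text = gbk_bytes.decode("gbk")

# Output: encode into whatever the destination expects.
utf8_bytes = text.encode("utf-8")
big5_bytes = text.encode("big5")

print(gbk_bytes.hex(), utf8_bytes.hex(), big5_bytes.hex())

# Under UTF-16, even an ASCII letter occupies two bytes.
utf16_a = "A".encode("utf-16-le")
print(len("A".encode("ascii")), len(utf16_a))   # 1 2
```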

Input Methods

There are many kinds of keyboards in China, but the standard QWERTY layout that we use here is probably the most prevalent. While the keyboards may be the same, the method for entering text is very different. Inputting Chinese text takes place in two stages. First, you enter a few keystrokes, and the computer displays a list of potentially matching characters. Then you choose the one that you want from that list.

This selection process is performed by software called Input Method Editors (IMEs). IMEs are included natively with Windows. There are also a number of Input Method Editors on the market in Taiwan and in China. IMEs can either sit in the bar at the bottom of the screen, or can be separate programs that run on top of or float on your desktop.

Keyboard input can also be done by transliteration, which means spelling out the pronunciation of the Chinese character in Latin letters (Pinyin), or through a native-script input system called Zhuyin. Zhuyin is sort of like an alphabet. You simply spell the words out phonetically using Zhuyin symbols instead of Latin characters.
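
The two-stage input process can be sketched as a toy Python lookup. The candidate table below is a tiny, hypothetical sample keyed by Pinyin spelling, and the function names are assumptions made for this example; real IMEs use large dictionaries and rank candidates by frequency and context.

```python
# Hypothetical candidate table: Latin keystrokes mapped to matching characters.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "zhong": ["中", "种", "重", "钟"],
}

def lookup(keystrokes):
    """Stage 1: return the candidate characters for the typed keystrokes."""
    return CANDIDATES.get(keystrokes, [])

def choose(keystrokes, index):
    """Stage 2: the user picks one candidate from the displayed list."""
    return lookup(keystrokes)[index]

print(lookup("ma"))          # the list shown to the user
print(choose("zhong", 0))    # the user selects the first candidate
```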

Sorting

The default sorting technique for Chinese is phonetic. Alternative techniques include sorting by the number of strokes that are required to draw the character. Another sort method is by radical, which is a central semantic component that each character contains.
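
The difference between these sort orders can be sketched in Python. Two assumptions are made here and should be read as such. First, sorting by GB 2312 byte value is used as a stand-in for phonetic order, because the commonly used characters in that standard are arranged by pronunciation; it is an approximation, not a full collation. Second, the stroke-count table is a tiny hypothetical sample. A production system would normally use a proper collation library.

```python
chars = ["一", "中", "人", "大"]

# Approximate phonetic order (da, ren, yi, zhong) via GB 2312 byte values.
by_pronunciation = sorted(chars, key=lambda ch: ch.encode("gb2312"))

# Stroke-count order, using a small hypothetical lookup table.
STROKES = {"一": 1, "人": 2, "大": 3, "中": 4}
by_strokes = sorted(chars, key=lambda ch: STROKES[ch])

# Plain code point order matches neither of the above.
by_code_point = sorted(chars)

print(by_pronunciation)   # ['大', '人', '一', '中']
print(by_strokes)         # ['一', '人', '大', '中']
print(by_code_point)      # ['一', '中', '人', '大']
```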

Other Special Locale Considerations

There are special locale considerations that you need to make when going into various Chinese speaking markets. For instance, every bill in Taiwan needs to have a government unified invoice stamped on it. Most U.S. software modules don't include this function, so if you're going to create a billing system for sale in Taiwan, a separate module needs to be developed.

Implementation Issues

You also need to consider the hardware and software environments, and the run time environment. If your product runs on a Mac, is the latest version of the Mac OS localized in China? Could your software conflict with local software? You need to confirm that your software doesn't conflict in some way with software that is used locally.

In China there are a lot of independent peripheral manufacturers. Can your software print through their printers? Who will run your help desk? Who will provide bug patches or fixes for the localized version? The best answer to these questions is to have a strong partner in Asia and a detail-oriented globalization firm.

Written by the SimulTrans Team

SimulTrans provides software, document, and website localization services, translating text into over 100 languages. Established in 1984, SimulTrans has enabled thousands of businesses to provide high-quality content to their international customers. Management ownership allows an exclusive focus on customers and quality, as exemplified by ISO 9001 and ISO 17100 certifications. In addition to its headquarters in Mountain View, California, SimulTrans has offices in Boston, Dublin, London, Paris, Bonn, and Tokyo.