 Software Localization for Windows

Software Localization in the Windows Environment

In March 1999, the SimulTrans Localization Seminar Series featured Atsushi Kaneko, Manager of Software Localization Engineering at Autodesk.  With a focus on the Asia Pacific region, Atsushi is responsible for the internationalization of source code, which involves implementing MBCS (multi-byte character set) support and handling platform differences among languages.

Summary

Atsushi's presentation focused on two areas of Windows software localization: the internationalization of source code, and the actual localization of the user interface.

Internationalization of Source Code

Atsushi described how the following four issues must be handled in order to properly internationalize software before localization:

  • All strings in the UI must be extracted into RC files.
  • Country-specific conventions or formats (such as time, date, and currency) must be supported.
  • String buffers must be long enough to allow for text expansion.
  • Support for multi-byte character sets must be provided.

Atsushi also discussed the advantages and disadvantages of using Unicode versus MBCS (Multi Byte Character Set).

Localization of User Interface

Once the software has been properly internationalized, the localization process can begin.  Atsushi's presentation described three methods for bringing localized text back into software.  In addition, he described how localization tools such as Corel Catalyst work, and why they are beneficial.  The discussion concluded with a description of the features that you should look for when selecting a localization tool.

Introduction

Atsushi's presentation gave an overview of software engineering requirements for localizing Windows software.  His goal was to help ease the transition for those new to localization.

The discussion is based on Microsoft Windows 95/98 and Windows NT with Visual C/C++ development environments.

Software Localization is More than Translation

Many people think of software localization as merely translating the strings.  In actuality, you need to prepare the software before starting translation in order to minimize the tasks and resources required for localization.

The goal is to simplify the core development environment to isolate translation and to make the localization work easier and faster.

Two major items were discussed.

1. Internationalization of software source code (especially for Asia Pacific languages) to make localization easier.

This section outlines important programming tips for source code internationalization, and also focuses on Asian Character handling using Visual C/C++.

2. Localization of the Software User Interface.

This section introduces User Interface localization, describes a few basic procedures, and briefly discusses localization tools.

Internationalization of Software Source Code

The goal of internationalization is to NOT have to touch the core component during actual localization work, and to keep localizable components separated.  This reduces the localization effort.  This section describes the areas that you need to concentrate on at the beginning of a project.

Extract Strings into Windows Resource files (RC)

By extracting all strings that need to be translated from the core components, and placing them in resource files, you don't have to touch the core functionality. Translation efforts can therefore focus on translation of messages.
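The idea can be sketched in portable C. This is only an illustration: the in-memory table stands in for the string resources compiled from an RC file, and the names `load_message`, `IDS_FILE_NOT_FOUND`, and `IDS_SAVE_CHANGES` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string IDs, standing in for the IDs of an RC string table. */
enum { IDS_FILE_NOT_FOUND, IDS_SAVE_CHANGES, IDS_COUNT };

/* Stand-in for the compiled string resources.
   Only this table changes when the product is translated. */
const char *const string_table[IDS_COUNT] = {
    "File not found.",
    "Save changes?"
};

/* Core code asks for a message by ID and never embeds the text itself. */
const char *load_message(int id)
{
    assert(id >= 0 && id < IDS_COUNT);
    return string_table[id];
}
```

Because the core code refers to messages only by ID, a translated table can replace the English one without any change to the calling code.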

Ideally, bitmaps or icons shouldn't contain any strings.  Editing of bitmaps or icons requires a great deal of effort.  To address this challenge, text should be displayed on top of a bitmap at runtime, rather than being included as part of the actual graphic or icon.

Plan for Country-Specific Conventions and Formats

This is very important but easily forgotten during the programming stage. Although we usually write the date by order of Month, Day and Year in the United States, the year comes first followed by Month and Day in Japan.  The Windows system provides defaults for these conventions (which are user customizable).  Your application should get the Windows default settings rather than hard coding these conventions in the source code.
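As a rough sketch of the same principle, the example below uses the standard C run-time library rather than the Windows settings themselves: the "%x" conversion of strftime produces the date in whatever order the active locale prescribes. The function name `format_local_date` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Formats a date using the active locale's own date convention ("%x"),
   so the order of month, day, and year is never hard-coded.
   Returns the number of characters written, or 0 if the buffer is too small. */
size_t format_local_date(char *buf, size_t size, int year, int month, int day)
{
    struct tm when = {0};

    assert(buf != NULL);
    when.tm_year = year - 1900;  /* struct tm counts years from 1900 */
    when.tm_mon  = month - 1;    /* and months from 0 */
    when.tm_mday = day;
    return strftime(buf, size, "%x", &when);
}
```

An application would typically call setlocale(LC_TIME, "") once at startup so that "%x" follows the user's settings instead of the default "C" locale.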

The fonts (FontFaceName) that appear in the user interface are also dependent on the country and character set as established by the operating system.  Font selections should therefore also be pulled from the operating system's settings, or should be defined in RC files rather than hardcoded, so that they can be changed during localization.

Make String Buffer Long Enough for Expansion

The length of strings after translation might be longer than the original English strings. If a translated message doesn't fit in the allocated buffer, it can be truncated, or it can overrun the buffer and corrupt memory.

The length of string buffers should therefore be variable or long enough for the expected translation expansion.
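One way to make the buffer length variable is to size it from the text that is actually being combined. The sketch below assumes a hypothetical helper named `build_message`; it is not from the presentation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Builds "<text>: <file_name>" in a buffer sized from the real strings,
   so a longer translation of `text` can never overflow it.
   The caller owns the result and must free() it. Returns NULL on failure. */
char *build_message(const char *text, const char *file_name)
{
    size_t needed;
    char *buffer;

    assert(text != NULL && file_name != NULL);
    needed = strlen(text) + strlen(": ") + strlen(file_name) + 1;
    buffer = malloc(needed);
    if (buffer != NULL)
        snprintf(buffer, needed, "%s: %s", text, file_name);
    return buffer;
}
```

A fixed array such as `char msg[32]` would have been large enough for a short English message but not necessarily for its translation.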

Asian Character (Multi Byte Character Set) Handling

Asian character sets (such as Japanese, Korean, Traditional Chinese, or Simplified Chinese) are referred to as "Multi Byte Character Sets" (MBCS) or "Double Byte Character Sets" (DBCS). Because these languages use so many characters (more than 10,000 Kanji, as compared to 26 letters in the Latin alphabet), the single byte (8-bit) code range, with only 256 possible values, isn't big enough to define them all. The Asian character sets therefore need a code range which is 2 bytes long to express all of them.

Single Byte Character Representation

A (0x41)
This ASCII character (the letter ‘A’) needs only 1 byte.

  Hex digit:   4        1
  Width:       4 bits   4 bits     8 bits total (1 byte)

Double Byte Character Representation

日 (0x93FA in Japanese code page 932)
This Japanese character needs 2 bytes.

  Hex digit:   9        3        F        A
  Width:       4 bits   4 bits   4 bits   4 bits     16 bits total (2 bytes)

Unicode vs. MBCS (Multi Byte Character Set)

There are two methods for enabling proper handling of double byte character sets. The two methods are referred to as MBCS and Unicode.

Unicode

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet.  The Unicode Standard provides the capability to encode all of the characters used for the written languages of the world.  In the form discussed here (the 16-bit encoding used by Windows NT), it provides code points for more than 65,000 characters.  To keep character coding simple and efficient, this encoding assigns each of those characters a unique 16-bit value, and does not use complex modes or escape codes.

Unicode Character Representations

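A (0x0041)
The letter ‘A’ as expressed in Unicode.

  Hex digit:   0        0        4        1
  Width:       4 bits   4 bits   4 bits   4 bits     16 bits total (2 bytes)

日 (0x65E5)
This Japanese character, which is 0x93FA in code page 932, also needs 2 bytes in Unicode, but its Unicode value is different.

  Hex digit:   6        5        E        5
  Width:       4 bits   4 bits   4 bits   4 bits     16 bits total (2 bytes)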

Basically, Unicode doesn't distinguish between single byte characters and double byte characters.  Instead, it expresses even single byte characters as 16-bit (2 byte) values.
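A small sketch of what this uniformity means for code, using plain 16-bit integers to stand for Unicode characters (on Windows the type would be wchar_t). The function name `utf16_units` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts the 16-bit units before the terminating zero.
   Every character in the examples above occupies exactly one unit,
   so no lead byte test is needed while walking the string. */
size_t utf16_units(const uint16_t *text)
{
    size_t count = 0;

    assert(text != NULL);
    while (text[count] != 0)
        count++;
    return count;
}
```

For the array `{ 0x0041, 0x65E5, 0 }`, which holds the letter A followed by one Japanese character, the loop advances by the same amount for both characters.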

Advantage: Handles all characters the same way.

Disadvantage: Not supported by Windows 95/98 API.

MBCS

Methods for dealing with code using MBCS will be discussed extensively in this section.

 Advantage: The same code can be used on Windows NT and 95/98.

Disadvantage: Multi byte characters must be handled differently than single byte characters.

The section below discusses basic MBCS handling of Asian characters.  This does not apply to Unicode based applications, because every Unicode character has the same width and no leading byte detection is needed.

Leading Byte and Trailing Byte

"Leading byte" refers to the first half of a double byte character, and "trailing byte" refers to the second half.

(0x93FA)

  Leading byte:    0x93  (hex digits 9 and 3)
  Trailing byte:   0xFA  (hex digits F and A)

From a programming point of view, the leading byte is the important byte for recognizing multi byte characters in a string.  You can find multi byte characters by scanning from the beginning of a string and testing each byte against the leading byte range, since only double byte characters start with a byte in that range.

The trailing byte range overlaps the values used by single byte characters, so you can't look at only a trailing byte to determine what type of character it belongs to.
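The scan from the beginning of the string can be sketched as follows. The example assumes Japanese code page 932, whose leading bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC, and hard-codes those ranges in a hypothetical helper so that it runs anywhere; a Windows program would ask the system or the run-time library instead.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: code page 932, with leading bytes in 0x81-0x9F and 0xE0-0xFC. */
int is_cp932_lead_byte(unsigned char byte)
{
    return (byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC);
}

/* Counts characters (not bytes) by walking forward from the first byte. */
size_t count_characters(const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    size_t count = 0;

    assert(text != NULL);
    while (*p != '\0') {
        if (is_cp932_lead_byte(*p) && p[1] != '\0')
            p += 2;   /* leading byte: skip its trailing byte too */
        else
            p += 1;   /* single byte character */
        count++;
    }
    return count;
}
```

Note that 0xFA, the trailing byte of 0x93FA, also lies inside the leading byte range, which is why the byte cannot be classified on its own.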

Enabling the MBCS Function (for Visual C/C++)

There are several run-time library functions in Visual C++ that can handle MBCS, such as isleadbyte(), _ismbblead(), _ismbbtrail(), _mbschr(), and _mbsrchr(). To use these routines, however, you must understand the concept of a code page.

Code Pages

A code page is a character set, which includes numbers, punctuation marks, and other glyphs.  Different languages and locales often use different code pages.  For example, ANSI code page 1252 is used for American English, and OEM code page 932 is used for Japanese Kanji and Hiragana characters.

A code page can be represented in a table as a mapping of characters to single byte or double byte values.  Many code pages use the same ASCII characters for the range 0x00-0x7F.

The Microsoft run-time library uses several types of code pages, as described below:

System-default ANSI code page:  At startup, the run-time system automatically sets the multi byte code page to the system-default ANSI code page.  This code page is obtained from the operating system.

Locale code page: The behavior of some run-time routines is dependent on the operating system's locale setting, which includes the locale code page.  By default, all locale-dependent routines in the Microsoft run-time library use the code page that corresponds to the "C" locale. At run-time you can change or query the locale code page in use with a call to setlocale.

To select the locale code page for a particular language, call the setlocale() function.  This function makes the locale-dependent run-time library routines follow the conventions of the chosen language/country, one locale category at a time, as shown in the examples below.

#include <locale.h>

// English code page for all categories (LC_ALL).
setlocale(LC_ALL, "English");

// System default code page for only character-handling functions.
setlocale(LC_CTYPE, "");
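Querying the locale in use, as mentioned above, can be sketched with standard C alone: passing NULL as the locale name asks setlocale to report the current setting without changing it. The wrapper name `current_ctype_locale` is hypothetical.

```c
#include <assert.h>
#include <locale.h>

/* Returns the name of the locale currently used by the
   character-handling routines, without modifying it. */
const char *current_ctype_locale(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);

    assert(name != NULL);
    return name;
}
```

Before any other call to setlocale, a program starts in the "C" locale, so this is the name a freshly started program would see.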

Multi byte code page: The behavior of most multi byte character routines in the run-time library depends on the current multi byte code page setting.  By default, these routines use the system-default code page.  At run-time, you can query the multi byte code page using _getmbcp and change the multi byte code page using _setmbcp.

Most multi byte character routines (the _ismb, _mbs, and _mbc routines) in the Microsoft run-time library recognize double byte character sequences according to the current multi byte code page setting, although some multi byte character routines depend on the locale code page described above.

Because the run-time libraries obtain and use the operating system code page (system default code page), you don't have to set the multi byte code page unless you need to change it.

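For example:

// _setmbcp is declared in <mbctype.h>.
// Instruct _setmbcp to use the system default ANSI code page.
// ERROR_IN_SET_MULTIBYTE_CODEPAGE is assumed to be an application-defined error code.
if (_setmbcp(_MB_CP_ANSI) == -1)
    SetLastError(ERROR_IN_SET_MULTIBYTE_CODEPAGE);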

_MB_CP_ANSI is the instruction to use the ANSI system default code page obtained from operating system at program startup.

IsDBCSLeadByte(), CharNext(), and CharPrev() Functions

These three functions are very important for handling multi byte characters.  Most multi byte character handling errors are caused by pointing to the trailing byte and mistaking it for one of the ASCII characters.  Because part of the trailing byte range overlaps the ASCII code range, it is impossible to identify a trailing byte in a string without knowing the leading byte.  You should always find the leading byte first.  You can then assume that the next byte is a trailing byte.

To find a leading byte, use the IsDBCSLeadByte() function, as described below. You can then either skip the trailing byte yourself or advance the pointer by using CharNext().

IsDBCSLeadByte(): This function tests whether the specified byte is in the leading byte range of the default code page.

If a string might include multi byte characters, use this API to look for the first appearance of a leading byte.  The testing should always start at the beginning of the string.  If you start testing in the middle of a string, or if you test backward, you can't tell whether a specified byte is really a leading byte, even when this function returns TRUE, because a trailing byte can fall in the same range.
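The classic failure is a search for a backslash in a Japanese path name: in code page 932 the character 0x835C has 0x5C (the backslash) as its trailing byte, so a byte-by-byte search stops in the middle of a character. The sketch below shows a forward scan that avoids this. It assumes code page 932 and uses a hypothetical, hard-coded stand-in for IsDBCSLeadByte() so that it can run outside Windows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte(), fixed to code page 932. */
int is_lead_byte(unsigned char byte)
{
    return (byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC);
}

/* Finds the first single byte character equal to `wanted`,
   never matching against the trailing byte of a double byte character.
   Returns NULL if there is no such character. */
const char *find_single_byte(const char *text, char wanted)
{
    const unsigned char *p = (const unsigned char *)text;

    assert(text != NULL);
    while (*p != '\0') {
        if (is_lead_byte(*p) && p[1] != '\0') {
            p += 2;   /* skip the whole double byte character */
            continue;
        }
        if (*p == (unsigned char)wanted)
            return (const char *)p;
        p++;
    }
    return NULL;
}
```

For the bytes 0x83 0x5C 0x5C 0x61, this returns the third byte (the real backslash), whereas strchr() would report the second byte, which is only half of a character.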

CharNext(): This function returns a pointer to the next character in a string.

If the pointer passed in is pointing to the leading byte of a multi byte character, the result is advanced by one multi byte character (16 bits). If it is pointing to a single byte character, the result is advanced by one single byte character (8 bits).

CharPrev(): This function returns a pointer to the preceding character in a string.

It takes both the start of the string and the current position, because the size of the step depends on the character that precedes the pointer. If that character is a multi byte character, the result moves back by one multi byte character (16 bits). If it is a single byte character, the result moves back by one single byte character (8 bits).
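Moving backward safely requires rescanning from the start of the string, since a byte cannot be classified by looking at it in isolation. The following is a minimal sketch of that idea, not the actual Windows implementation; it assumes code page 932, and the names `is_lead_byte` and `cp932_char_prev` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte(), fixed to code page 932. */
int is_lead_byte(unsigned char byte)
{
    return (byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC);
}

/* Returns a pointer to the character that precedes `current`,
   or `start` itself when `current` is already at the start.
   It walks forward from `start`, remembering the last character boundary. */
const char *cp932_char_prev(const char *start, const char *current)
{
    const unsigned char *p = (const unsigned char *)start;
    const unsigned char *end = (const unsigned char *)current;
    const unsigned char *previous = p;

    assert(start != NULL && current != NULL && start <= current);
    while (p < end) {
        previous = p;
        p += (is_lead_byte(*p) && p[1] != '\0') ? 2 : 1;
    }
    return (const char *)previous;
}
```

With the bytes 0x41 0x93 0xFA 0x42, stepping back from the final byte lands on 0x93, the leading byte of the double byte character, rather than on 0xFA.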

Generic-Text Mappings

Microsoft-specific extensions offer generic-text mappings (which are not ANSI compatible). These simplify internationalization by letting the same source code be built for any of three character sets: SBCS, MBCS, and Unicode.

Generic-text mapping switches the data type, routine name, or object name among the three character sets at compile time.  It does this according to a manifest constant that you define with a #define statement.

This is very useful if you need to switch your program from SBCS/MBCS to Unicode.  If you have no need to switch between character set types however, it might unnecessarily complicate your source code to include it.

Examples

The generic-text function _tcschr() maps to _mbschr() if _MBCS is defined in your program, or to wcschr() if _UNICODE is defined.  Otherwise it maps to strchr().

The generic-text data type _TCHAR maps to:

  • Type char if _MBCS is defined
  • Type wchar_t if _UNICODE is defined
  • Type char if neither constant is defined

These generic-text mappings are defined in TCHAR.H, so you need to include it if you want to use them.
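The mechanism behind these mappings can be imitated in standard C. The sketch below is not TCHAR.H itself: the names `DEMO_UNICODE`, `demo_tchar`, `DEMO_T`, and `demo_tcschr` are hypothetical and cover only the single byte and wide character cases, but they show how one manifest constant switches both the data type and the routine at compile time.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical miniature of the generic-text idea. */
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(text) L##text
#define demo_tcschr  wcschr
#else
typedef char demo_tchar;
#define DEMO_T(text) text
#define demo_tcschr  strchr
#endif

/* Written once; compiles against either character type.
   Returns a pointer to the first '.' in the name, or NULL if there is none. */
const demo_tchar *find_extension(const demo_tchar *file_name)
{
    assert(file_name != NULL);
    return demo_tcschr(file_name, DEMO_T('.'));
}
```

Compiling the same file with DEMO_UNICODE defined turns every literal into a wide literal and every call into the wide character routine, without touching find_extension() itself.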

Localization of the Software User Interface

This section discusses the basic localization process and using localization tools for Windows resource file translation.  The discussion assumes that your application is well-internationalized  as described in the previous section.  This means that:

  • All strings (UI) are extracted into RC files.
  • Support is provided for country specific conventions/formats such as Date, Time and Currency.
  • String buffers are long enough to accommodate text expansion during translation.
  • Double byte character sets are supported.

Given this structure, you don't have to think about core code localization anymore, and you can now focus on translation of the RC files.

Localization Build Procedures

Three build scenarios for localized software are described below.

Scenario 1

1. Obtain all resource files.
2. Translate the resource files and put them back into a copy of the English environment.
3. Build the software from scratch, so that the translated resource files are compiled and linked with the object files (OBJ) to create localized executable files (EXE).

Advantage: You can use the original development environment, and you don't have to change the structure. You can just compile with a copy of the original build environment.

Disadvantage: You have to maintain a duplicate development environment and synchronize it with the original environment, which may change throughout the translation process.

Scenario 2

1. Obtain all resource files.
2. Translate the resource files.
3. Compile the translated resource files into binary files (RES).
4. Rebind the binary RES files to executable files (EXE).

Advantage: You don't have to maintain the whole original development environment. You can remove all .CPP or .OBJ files. All you need are the resource files and their associated makefiles.

Disadvantage: You have to modify the makefiles so that they process only the resource files. You also need to implement a procedure for rebinding the RES files into the .EXE.

Scenario 3

1. When developing the software, create "Resource Only RCs", to physically separate the UI and other strings from the core component.
2. Translate the resource files and compile them into binary files (RES).
3. Create "Resource Only DLLs" from the translated resource binary files (RES).
4. Replace the English Resource Only DLL with the translated version in a copy of the English product.

This might be the simplest method, since the localization process doesn't require the development environment.

Advantage: You only need the English installable image (product CD image), resource files, and "makefiles" to create .DLLs. Because the core components and the software User Interface are physically separated, functionality can't be accidentally altered.

Disadvantage: Software may need to be reengineered to structure it properly before localization begins.

Localization Tools

You should now understand the various structure options, so we can focus on editing the resource files.  Strings in resource files can be edited directly and dialogs can be resized in Microsoft Developer Studio. But using localization tools can make this process more efficient.

Localization tools extract strings from resource files and make a simple database of those strings.  The database contains a Translation String Column and an English String Column, so the translator has only to deal with translation.  The tool automatically puts the translated strings back in the RC, replacing the English text. It also provides visual features for resizing buttons or making other adjustments to dialog boxes.

Reasons for Using Localization Tools

There are three main reasons for using localization tools.

1. Separation of Responsibilities

Resource files include not only translatable strings, but also technical information. For example, there might be code for the coordinates and sizing of a dialog button. If you translate strings in the resource files directly, the translator must know about resource coding.  Otherwise, some function codes or keywords might be translated, which could cause functional problems in the software.  It's not easy to find engineers who are also translators, or translators who are also software engineers.  You shouldn't expect both skill sets in one person.

Using localization tools prevents this problem.
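How a tool might separate the translatable text from the technical information on a resource line can be sketched as follows. This is a deliberately simplified illustration with a hypothetical function name; real localization tools parse the full resource grammar. It assumes that a doubled quote ("") inside a string stands for one literal quote.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the first double-quoted string on an RC line into `out`.
   Keywords, IDs, and coordinates outside the quotes are left alone.
   Returns 1 on success, or 0 if the line has no complete string
   or the text does not fit in `out`. */
int extract_rc_text(const char *line, char *out, size_t size)
{
    size_t used = 0;

    assert(line != NULL && out != NULL);
    if (size == 0)
        return 0;
    while (*line != '\0' && *line != '"')
        line++;                      /* skip keywords before the string */
    if (*line == '\0')
        return 0;
    line++;                          /* step past the opening quote */
    for (;;) {
        if (*line == '\0')
            return 0;                /* unterminated string */
        if (*line == '"') {
            if (line[1] != '"')
                break;               /* closing quote */
            line++;                  /* doubled quote: keep one of them */
        }
        if (used + 1 >= size)
            return 0;                /* text does not fit */
        out[used++] = *line++;
    }
    out[used] = '\0';
    return 1;
}
```

Given a line such as `PUSHBUTTON "Cancel", IDCANCEL, 129, 7, 50, 14`, only the word Cancel reaches the translator, while the keyword, the ID, and the coordinates stay untouched.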

2. Glossary Sharing

Glossaries are obviously very important for translation.  Terminology can change throughout the course of a project.  When using a localization tool, updates to glossaries can be automatically reflected throughout the product.  This significantly reduces the translation workload.

3. Simultaneous Development

Companies today want to release localized software simultaneously with, or very close to, the English release.  To do this, you have to start localization work in the early stages of the English software development process.  The localization work has to run concurrently with core software development.  You can't wait to start translation until the core software development is finished.  There could be many updates of the resource files during the localization process, and you need to keep the localized versions up to date with these changes.  If translated file updates can be automated using tools, the localization workload can be significantly reduced.

Selection Criteria/Feature List

The following items are important localization tool functions.  You should evaluate these features when looking for tools.  Comparing the features will help give you a clearer idea about what you can expect from the tool.

  • Multiple resource file management
  • Ability to edit .EXE or .DLL files directly
  • Ability to import from translated resource files
  • Leveraging from previous project resource files
  • Ability to export and import glossaries
  • Visual dialog editing (WYSIWYG)
  • Ability to change "FontFaceName" for dialogs
  • Pseudo Translation
  • Translation validation
  • String tagging (NO Translation, Frozen, Sign Off or Updated)
  • Batch script function

This presentation has hopefully given you a better understanding of internationalization requirements and localization processes.


SimulTrans, L.L.C.
1370 Willow Road; Menlo Park, California 94025  USA; +1-650-605-1300;
info@simultrans.com

© Copyright 1996-2001, SimulTrans, L.L.C.  All rights reserved.