Effective internationalisation planning, from the drawing-board stage onwards, is what makes a piece of software suitable for worldwide markets.
In February 2000, the SimulTrans Localization Seminar Series in Dublin featured Bill Hall of SimulTrans and Tom Garland of Sun Microsystems.
Introduction: Seminar overview
This seminar deals with two areas of internationalisation:
The first part of the seminar, presented by Bill Hall, focuses on internationalisation aspects of Java programming. It uses the DateFormat class to build a Java application that reads the WWV time signal, displays the result in any locale you want, and updates your computer clock. This is an interesting way to teach developers about time and date formatting, native methods, accessing the Internet, and some really arcane facts about time measurement. The presentation is not overly technical, is reasonably short, has some entertaining aspects, and most audiences seem to enjoy it.
Part two of the seminar deals with Solaris. Tom Garland's presentation outlines key internationalisation features of Solaris. It focuses primarily on the extensibility and customisability of the Solaris i18n architecture and provides an overview of the Solaris single binary/single source model. Areas discussed include:
- Unicode Support
- Complex Text Layout
- Multiscript/Multilingual support
- Code conversion
GLOBALISATION SEMINAR SPEAKERS
Bill Hall, Director Internationalisation, SimulTrans, L.L.C.
Bill has been involved in the globalisation industry for 15 years. He has spoken and written widely on the topic of internationalisation for the Unicode Consortium, Boston University's Win-Dev Programs, Microsoft Systems Journal, MultiLingual Communication and Technology, and other forums. Bill has worked previously in Software Internationalisation at companies including Novell and Netcom, and now currently directs SimulTrans' internationalisation program, which has assisted Adaptec, Adobe, Hewlett-Packard, SGI, and SPSS. He also teaches an Internationalisation course at the University of California's Santa Cruz Extension.
Tom Garland, Staff Engineer, Sun Microsystems, Ireland
Tom Garland is a Staff Engineer in the Internationalisation Engineering group at Sun Microsystems. Based in Ireland he is responsible for internationalisation for Europe and the Middle East. Tom has 15 years experience in the computing industry. He started his career as a programmer at the University of Ulster and subsequently at Nixdorf Computer in Dublin. He then joined Lotus Development Ireland as the lead engineer for Unix products. In 1993 Tom joined Sun as Technical Lead in the European Localisation Centre. He then joined a newly formed Internationalisation Engineering group in 1998 as Staff Engineer and currently manages European and Middle East internationalisation projects.
WWV: A Java Demo Program for Reading and Displaying Accurate Current Time while using International Formats
The radio station WWV in Fort Collins, Colorado has been broadcasting extremely accurate time on both long and short wave frequencies for many years. Now these broadcasts are available on the Net. With a simple Java program you can easily access the data. The question is, can you then display the result in international formats?
The March seminar described how the Java date and time classes can help you render the time in any supported locale.
NIST Services
Bill Hall began the presentation with a discussion on how to use Java and NIST time services to present time information to users.
WWV
WWV is a radio station in Ft. Collins, Colorado that broadcasts time signals, and it has been on the air since 1923. It is operated by the National Institute of Standards and Technology (NIST), formerly known as the National Bureau of Standards. Many countries have their own time signals, but WWV is one of the best known. WWV broadcasts at 2.5, 5.0, 10.0, 15.0, and 20 MHz.
WWVH
WWVH in Kauai, Hawaii has been in operation since 1948. It does the same job as WWV, and runs at the same frequencies, except for 20 MHz.
WWVB
WWVB broadcasts at 60 kHz VLF (Very Low Frequency), which has the advantage of traveling down deep into water. The lower the frequency, the deeper into the water that radio waves can go. This is useful because there are a number of vehicles which travel under water and which need accurate time information.
Another advantage of these long waves is that attenuation decreases as the wavelength increases, so the signal also travels a long way along the ground.
Some new clocks and watches synchronize themselves automatically to the time signal, so that you never have to adjust for daylight savings time etc. Synchronization usually takes place around 1:00 a.m. when the reception tends to be good.
Non-Radio Services
Radio broadcasting is only one aspect of the time services that NIST provides. The radio broadcasts came first, followed by a telephone service, which was implemented in 1988. More recently, Internet access was offered. And finally there is a satellite broadcast. Some of these services are described in greater detail below.
Automated Computer Time Service (ACTS)
The telephone service that NIST offers is called Automated Computer Time Service (ACTS). ACTS was the first digital technology that NIST employed. The system handles about 12,000 calls per day.
Here are the two phone numbers for ACTS.
Ft. Collins Server, 303-494-4774
Hawaii Server, 808-335-4271
To use the service, you need a modem, computer, and some downloadable PC-based software. There are versions for Windows and DOS, but Bill was not sure if there is anything for old versions of Windows or for Win32 systems.
ACTS data is sent in a string in the following form: JJJJJ YR-MO-DA HH:MM:SS TT L DUT1 msADV UTC(NIST) OTM
Satellite
The GOES weather satellite operated by NOAA broadcasts a time code at a frequency of about 468 MHz. It is accurate to 100 microseconds. The satellite method, however, requires some kind of digital decoding.
Network Time Service (NTS)
Network Time Service (NTS) is the Internet method. There are servers located in Boulder, CO, Gaithersburg, MD, and Redmond, WA. You can find a list of the IP addresses at www.bldrdoc.gov/timefreq/service/nts.html. You can also find the proper DNS name on the site, and use it directly for the code that you write.
NTS has three protocols, as described below.
Daytime Protocol (RFC-867): This protocol sends an ASCII string similar to the one sent by ACTS. The Daytime protocol is useful for small computers, and you can query it on UDP port 13.
The Daytime protocol is discussed in greater detail later in this summary.
Time Protocol (RFC-868): This is a sort of dumb protocol. It sends out 32 bits of information, which give you the UTC time in seconds since January 1, 1900. Because the value is an unsigned 32-bit count of seconds, it rolls over in the year 2036. You can query it on UDP port 37.
Network Time Protocol (RFC-1305): This protocol is defined in a slightly later RFC, and is typically used by large computers since the software is often part of the operating system. The client software runs continuously as a background task. It provides a 64-bit time stamp containing the seconds since 1/1/1900, and it has a resolution of about 200 picoseconds. You can query it on UDP port 123.
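As an illustration of the Time Protocol description above, the following minimal sketch converts a 32-bit "seconds since 1900" value into a Java instant. The class and method names are hypothetical and are not taken from Bill's program; the only fixed fact it relies on is the 2,208,988,800-second offset between January 1, 1900 and the Java epoch of January 1, 1970.

```java
import java.time.Instant;

final class TimeProtocolDemo {
    // Seconds between 1900-01-01T00:00:00Z (Time Protocol epoch)
    // and 1970-01-01T00:00:00Z (Java epoch).
    static final long SECONDS_1900_TO_1970 = 2_208_988_800L;

    /** Reads four big-endian bytes as an unsigned 32-bit count of seconds. */
    static long toUnsignedSeconds(byte[] fourBytes) {
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (fourBytes[i] & 0xFF);
        }
        return value;
    }

    /** Converts a "seconds since 1900" value to a java.time.Instant. */
    static Instant toInstant(long secondsSince1900) {
        return Instant.ofEpochSecond(secondsSince1900 - SECONDS_1900_TO_1970);
    }

    public static void main(String[] args) {
        // The largest value the 32-bit field can hold.
        byte[] max = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        System.out.println(toInstant(toUnsignedSeconds(max))); // 2036-02-07T06:28:15Z
    }
}
```

The `& 0xFF` mask matters because Java bytes are signed; without it, any byte above 0x7F would corrupt the result.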
Julian Date
The first few characters of the Daytime protocol time string are the last five digits of the Julian date. The concept of Julian date can be rather confusing, since it has nothing to do with the Julian calendar. Rather it's the number of days since January 1, 4713 BC.
Joseph Justus Scaliger in 1583 published De emendatione temporum, which attempted to resolve various historic eras by relating them all to a single system. Rather than use negative numbers, he chose an initial epoch predating any known historical record.
He used three different cycles, which were well known at the time:
- the 28-year solar cycle, through which weekdays and calendar dates move in the Julian calendar,
- the 19-year cycle of Golden Numbers (also called the Metonic cycle, discovered by the Greek Meton but also known to the Chinese), over which the moon (almost) returns to the same calendar dates,
- the 15-year indiction cycle used in Roman taxation.
A given combination of the three cycles occurs every 28*19*15 = 7980 years. Scaliger determined that the most recent year where all three cycles had year 1 was 4713 B.C., so he designated January 1, 4713 B.C. as the first day of the Julian period.
In 1849, the astronomer John F. Herschel published his Outlines of Astronomy and explained his ideas of extending Scaliger's concept to days and fractional days. In this system, Julian Day 0 began at noon of January 1, 4713 B.C., and the Julian Day number represents the number of days since then. Fractional days can be converted to hours, minutes, and seconds thus linking clock time and calendar time into one system. The Julian date thus becomes a convenient reference for computing and converting dates in other calendars, among other uses.
Herschel did himself and his fellow astronomers a favor by making the day begin at noon. Since astronomers do most of their work at night, Herschel's work ensured that they didn't have to deal with date changes while conducting their work.
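The day counting described above can be tried out with the `java.time.temporal.JulianFields` class from the modern Java library (which postdates the seminar). This is only a sketch with hypothetical class and method names. It assumes that a five-digit value such as the one in the Daytime string corresponds to the Modified Julian Day, which is the Julian Day minus 2,400,000.5 and therefore starts at midnight rather than noon.

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

final class JulianDayDemo {
    // Length of Scaliger's Julian period in years: solar * Metonic * indiction.
    static final int JULIAN_PERIOD_YEARS = 28 * 19 * 15;

    /** Julian Day number for a civil date. */
    static long julianDay(LocalDate date) {
        return date.getLong(JulianFields.JULIAN_DAY);
    }

    /** Five-digit Modified Julian Day for a civil date. */
    static long modifiedJulianDay(LocalDate date) {
        return date.getLong(JulianFields.MODIFIED_JULIAN_DAY);
    }

    /** Civil date for a Modified Julian Day value. */
    static LocalDate fromModifiedJulianDay(long mjd) {
        return LocalDate.ofEpochDay(0).with(JulianFields.MODIFIED_JULIAN_DAY, mjd);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2000, 1, 1);
        System.out.println(JULIAN_PERIOD_YEARS);            // 7980
        System.out.println(julianDay(date));                // 2451545
        System.out.println(modifiedJulianDay(date));        // 51544
        System.out.println(fromModifiedJulianDay(51544L));  // 2000-01-01
    }
}
```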
Daytime Protocol Strings
The WWV program queries the Daytime protocol and receives a string back in the following format:
JJJJJ YR-MO-DA HH:MM:SS TT L H msADV UTC(NIST) OTM
The fields of the string are described below.
JJJJJ: The last five digits of the Julian date. This is the number of days since January 1, 4713 B.C. If you add 2.4 million you will get the actual Julian date.
YR-MO-DA: Year, month, and day separated by hyphens. Each field is two-digit, which means that there is a Y2K problem.
HH:MM:SS: Hours, minutes, and seconds separated by colons. This is Greenwich Mean Time, which is essentially Coordinated Universal Time (UTC). The time is kept by an atomic clock, and is corrected periodically by adding or subtracting a second.
TT: Indicates whether the U.S. is on standard time or daylight savings time.
During standard time, the field displays zero.
Otherwise, it displays 50 until you reach the month in which the time changes back.
At that point, it counts down the number of days until the time change.
L: Leap second code, which synchronizes UTC with the Earth's rotation. This field indicates whether a second is added or subtracted at midnight on the last day of the month.
If zero is displayed, there is no correction.
If one is displayed, one second is added.
If two is displayed, one second is subtracted.
So here's a question: if you wanted to calculate the number of milliseconds since 1900, could you take 86,400 (the number of seconds in a day), multiply by the number of days, make the corrections for leap years, and get the right answer? Because of these leap second corrections, you can't.
H: Health digit. This field indicates the status of the server.
If the server is healthy the field displays zero.
If H=1, it means that the server is okay but the time might be off by up to five seconds. In this situation, the state should correct itself to healthy within 10 minutes.
If H=2, it means that the server is okay, but the time is known to be wrong by more than five seconds. Unfortunately, the server can't tell you how much it is off.
If H=4, it means that a hardware or software failure occurred, and that the time error is unknown.
The advantage of this field is that you'll always get the query back as long as the server is able to process it. If you receive indications that there are problems with the time, you can warn your users that they shouldn't synchronize their time based on your time information, since it is not correct.
msADV: This is the value for how much NIST advances the time signal to compensate for network delays. It is normally 50 milliseconds. You'll notice, however, that sometimes the value is larger, so NIST must have some method for adjusting the value.
UTC(NIST): This is the identifier, which indicates that you are receiving Coordinated Universal Time from the National Institute of Standards and Technology.
OTM: On Time Marker. The value of this field is an asterisk. The time values sent by the time code refer to the arrival of the OTM. That is, if the time code says 12:45:45 then that is the time the * arrives. Navigators call this a time hack.
Note: There is a method for calculating the time lag from the server. Bill didn't know the specifics, but basically you have to query the server three times, and then compute the difference so that you can calculate a bit more accurately. NIST provides source code for this function, which you can refine.
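The field layout described above lends itself to a simple whitespace split. The sketch below is not Bill's parser; the class name, field names, and the sample string are hypothetical, and it assumes the nine space-separated fields appear exactly in the order listed (with the health digit H in the sixth position).

```java
final class DaytimeFields {
    final int julian, year, month, day, hour, minute, second;
    final int dstCode, leapCode, health;
    final double msAdvance;

    private DaytimeFields(String[] p) {
        String[] ymd = p[1].split("-");
        String[] hms = p[2].split(":");
        julian = Integer.parseInt(p[0]);
        year = Integer.parseInt(ymd[0]);   // two digits only, as sent
        month = Integer.parseInt(ymd[1]);
        day = Integer.parseInt(ymd[2]);
        hour = Integer.parseInt(hms[0]);
        minute = Integer.parseInt(hms[1]);
        second = Integer.parseInt(hms[2]);
        dstCode = Integer.parseInt(p[3]);
        leapCode = Integer.parseInt(p[4]);
        health = Integer.parseInt(p[5]);
        msAdvance = Double.parseDouble(p[6]);
    }

    /** Splits a Daytime string into its fields; rejects unexpected layouts. */
    static DaytimeFields parse(String line) {
        String[] p = line.trim().split("\\s+");
        if (p.length != 9 || !p[8].equals("*")) {
            throw new IllegalArgumentException("Unexpected Daytime string: " + line);
        }
        return new DaytimeFields(p);
    }

    /** Only a health digit of zero means the time can be trusted. */
    boolean isTrustworthy() {
        return health == 0;
    }

    public static void main(String[] args) {
        // Made-up sample string in the documented layout.
        DaytimeFields f = parse("51599 00-02-25 17:30:45 00 0 0 50.0 UTC(NIST) *");
        System.out.println(f.hour + ":" + f.minute + ":" + f.second
                + " trustworthy=" + f.isTrustworthy());
    }
}
```

Exposing the health digit separately makes it easy for a caller to warn users before synchronizing a clock against a server that reports a problem.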
WWV Program
Bill next demonstrated a program that he wrote. He initially developed the program for use as a teaching demo. His initial goal was to illustrate how to use Java's DateFormat class to create time and date display strings that are culturally correct for a given locale.
The program queries the NIST Daytime server on command. If the query fails, the program displays a test string. Results are returned in a form suitable for the Locale selected from a drop-down list box. If desired, the program will also update the computer's clock based on the time retrieved.
Program Operation
When the program starts up, it loads a list of locales into a dropdown list control by querying the DateFormat class for a list of available Locales. The default selection is based on the user's current locale.
When the "Get Current Time" button is pressed, the program sends a UDP query on port 13 to time.nist.gov for a Daytime string. If the query fails, an internally stored test string is substituted.
The returned Daytime string is parsed for year, month, day, hour, minutes, seconds and stored in a Date class.
If the "Update System Time" box is checked, a Win32 SYSTEMTIME structure is filled out, and a call is made to SetSystemTime.
The Date class is passed to an internal updateDisplay routine. This routine:
- Gets instances of the DateFormat class for a Full date and a Full time for the currently selected locale.
- Computes the display strings using the format member function.
If the user changes the selected locale, new formats are computed and displayed.
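The query-with-fallback step in this flow might look roughly like the sketch below. It is written with standard `java.net` classes rather than the WFC classes Bill used, and every name in it (`DaytimeClient`, `TimeSource`, `TEST_STRING`, the three-second timeout) is an assumption made for illustration. The UDP port and host name follow the description above.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

final class DaytimeClient {
    /** Hypothetical stand-in for the program's internally stored test string. */
    static final String TEST_STRING = "51599 00-02-25 17:30:45 00 0 0 50.0 UTC(NIST) *";

    /** Anything that can produce a Daytime string, such as a network query. */
    interface TimeSource {
        String fetch() throws IOException;
    }

    /** Sends an empty UDP datagram to port 13 and returns the reply as text. */
    static String queryUdp(String host, int timeoutMillis) throws IOException {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(timeoutMillis);
            InetAddress address = InetAddress.getByName(host);
            socket.send(new DatagramPacket(new byte[0], 0, address, 13));
            byte[] buffer = new byte[256];
            DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
            socket.receive(reply); // throws SocketTimeoutException on timeout
            return new String(reply.getData(), 0, reply.getLength(),
                    StandardCharsets.US_ASCII).trim();
        }
    }

    /** Returns the fetched string, or the test string if the query fails. */
    static String fetchOrFallback(TimeSource source) {
        try {
            return source.fetch();
        } catch (IOException e) {
            return TEST_STRING;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "time.nist.gov";
        System.out.println(fetchOrFallback(() -> queryUdp(host, 3000)));
    }
}
```

Separating the fallback logic from the network call keeps the failure path testable without a network connection.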
Program Design
Bill used an interesting approach to build the application. He used Microsoft WFC classes for Container and Controls, but used standard Java classes to obtain Locale lists and Formats. A native Windows method (SetSystemTime) was used to update the computer's clock.
The program was developed using Visual J++, and built using Microsoft J++. It currently runs only on Win32 systems and is completely non-portable. Portability could be addressed, however, if it were rewritten using AWT or Swing.
Coding Features
The Locale List
Bill pointed out two architectural advantages of using his approach for generating the Locale list.
Locale list data is generated by a call to the DateFormat member function getAvailableLocales, which returns an array of Locale objects, one for each locale that is available. You can use the array to load the locale list directly into a WFC Dropdown list control. This is convenient since it is not necessary to keep a copy in memory correlated with the list's sort order.
The second advantage is that this handling allows the list to sort properly, and you don't have to deal with memory issues. This is a common internationalization-coding problem that Bill sees frequently in his work. People don't realize that you can store enough information in the control and have it get sorted correctly without having to store any local data in the program.
When the object is inserted into the list, a user-visible display string is obtained by the control automatically by calling toString, a base class member function of all Java objects.
Unfortunately, the display string returned by Locale.toString is not very informative. It is simply a concatenation of the ISO 639 language abbreviation, the ISO 3166 country abbreviation, and a possible variant string.
For example: the string no_NO_B would be returned for Norway, Bokmål variant.
So how can you make the list more usable? While it might seem natural to extend the Locale class, the Locale class is final, so unfortunately you can't extend it. (This means that you can't create subclasses of it.)
MyLocale Class
There is a useful call in the Locale class, getDisplayName, which returns a usable name.
Bill's solution was to create a new wrapper class, called MyLocale, which holds the Locale as a member variable. He then wrote a toString routine that returns getDisplayName from the wrapped Locale.
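import java.util.Locale;

public class MyLocale extends Object {
    // the wrapped locale, read back when an item is retrieved from the list
    public java.util.Locale locale;

    public MyLocale(Locale loc) {
        super();
        locale = loc;
    }

    // called by the list control to obtain the user-visible string
    public String toString() {
        return locale.getDisplayName();
    }
}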
Using this approach, an informative display string appears for the user.
When you retrieve the locale, you need to pull the whole class reference off, (by calling getItem) rather than just the string. Once you get the whole thing back, you can operate on it as needed.
While this approach was fairly simple, it took a little while to figure out the optimal approach.
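A portable sketch of the same idea, without the WFC Dropdown control, is shown here. The names `LocaleListDemo`, `LocaleItem`, and `availableItems` are hypothetical, and sorting with a `Collator` is an added assumption about how one might order the list by display name; in Bill's program the control did the sorting itself.

```java
import java.text.Collator;
import java.text.DateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LocaleListDemo {
    /** Wraps a Locale so that toString() yields a readable name. */
    static final class LocaleItem {
        final Locale locale;

        LocaleItem(Locale locale) {
            this.locale = locale;
        }

        @Override
        public String toString() {
            return locale.getDisplayName();
        }
    }

    /** Builds one item per locale that DateFormat supports, sorted by name. */
    static List<LocaleItem> availableItems() {
        List<LocaleItem> items = new ArrayList<>();
        for (Locale locale : DateFormat.getAvailableLocales()) {
            items.add(new LocaleItem(locale));
        }
        // Locale-aware ordering of the user-visible strings.
        Collator collator = Collator.getInstance();
        items.sort((a, b) -> collator.compare(a.toString(), b.toString()));
        return items;
    }

    public static void main(String[] args) {
        for (LocaleItem item : availableItems()) {
            // Raw identifier next to the display name, for comparison.
            System.out.println(item.locale + " -> " + item);
        }
    }
}
```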
Updating the Clock
The Windows method for updating the computer's clock uses the Win32 function SetSystemTime. This function requires a SYSTEMTIME structure as a parameter, which has the following format:
- Eight fields, each an unsigned short
- Year (four digit), month, day of week (not required), day
- Hour, minutes, seconds, milliseconds
- SYSTEMTIME values must be in UTC
Here is Bill's code:
public class Win32 {
    /**
     * @dll.struct()
     */
    public static class SYSTEMTIME {
        public short wYear;
        public short wMonth;
        public short wDayOfWeek;
        public short wDay;
        public short wHour;
        public short wMinute;
        public short wSecond;
        public short wMilliseconds;
    }

    /**
     * @dll.import("KERNEL32",auto)
     */
    public static native boolean SetSystemTime(com.ms.win32.SYSTEMTIME lpSystemTime);
}
When you retrieve an item from the list, you get a MyLocale object, so you have to pull the actual Locale out of it by reading its locale member variable.
Calling SetSystemTime from Java requires using a Native Method. This used to be somewhat complicated, but Visual J++ provides a convenient Wizard to translate Win32 APIs and data structures into classes and static member functions that can be used in Java.
if (updatesystime == true) {
    // create instance of SYSTEMTIME
    com.ms.win32.SYSTEMTIME st = new com.ms.win32.SYSTEMTIME();

    // fill out SYSTEMTIME fields
    st.wYear = (short)(year + 1900);
    st.wMonth = (short)mon;
    st.wDay = (short)day;
    st.wDayOfWeek = 0;
    st.wHour = (short)hour;
    st.wMinute = (short)min;
    st.wSecond = (short)sec;
    st.wMilliseconds = 0;

    // call native method
    boolean result = Win32.SetSystemTime(st);

    // make noise if OK
    if (result == true) {
        java.awt.Toolkit tk = java.awt.Toolkit.getDefaultToolkit();
        tk.beep();
    }
}
Formatting Time and Date Displays
Sample code for formatting date and time displays is shown below.
public void updateDisplay() {
    // get the selected locale
    int sel = cbxLocale.getSelectedIndex();
    MyLocale myloc = (MyLocale)cbxLocale.getItem(sel);
    java.util.Locale currentLocale = myloc.locale;

    // Get date instance of DateFormat, get string, insert into control
    DateFormat df = DateFormat.getDateInstance(DateFormat.FULL, currentLocale);
    String d = df.format(currentDate);
    edtFormattedDate.setText(d);

    // Get time instance of DateFormat, get string, insert into control
    DateFormat tf = DateFormat.getTimeInstance(DateFormat.FULL, currentLocale);
    String ts = tf.format(currentDate);
    edtFormattedTime.setText(ts);
}
Bugs and Program Logic Problems
While Bill designed this program as a sample for illustrative purposes, he is willing to make it available for people who might want to use it as a base for their own work.
Anyone who is interested in receiving a copy can contact Bill at BillH@simultrans.com. It should also be posted on ftp://ftp.simultrans.com/hall.
Bill would appreciate receiving back a copy of the program once any fixes are made.
Given that this was a demo program, Bill wanted to point out the logic problems that should be addressed in any production system.
The current version does not handle conversion to local time very well. Bill had some difficulty in knowing when to use Date, Calendar, and TimeZone classes.
Only the date and time strings from the Daytime protocol are being used, and it might be useful to take advantage of other fields. For example if the server is not healthy, the program should do something intelligent.
WWV uses some deprecated routines from the Date class:
Date(int year, int month, int day, int hour, int minute, int seconds)
This is natural for the Daytime protocol, but maybe a production program would be better off using the Network Time Protocol instead.
Since the year is represented by only two characters in the NIST data, Bill's sample program has a Y2K problem because it merely adds 1900 to the year field.
Date.getTimezoneOffset is used to compute the offset for local time.
Adjustments should be made to the program for Daylight Savings Time corrections.
NIST time is UTC.
WWV attempts to use the Java TimeZone class to compute local time before getting display strings. This code is an awkward mix of Date and TimeZone class methods and needs fixing. Bill took a look at the Calendar class, but was not able to determine whether it would be useful for this problem.
Depending on what Windows version you are using, time strings from other locales look as if they are being corrected for geographical location (probably due to the time zone bug of older SDKs).
This behavior is interesting since the program acts like a world clock, but it was not the intended behavior. It goes away on Windows 2000.
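One possible way to address several of these problems together (the deprecated Date constructor, the two-digit year, and the awkward UTC-to-local conversion) is to build the date with a Calendar fixed to UTC and let DateFormat do the time zone conversion. The sketch below is only a suggestion, not Bill's fix. The class name, method names, and in particular the pivot year of 70 used to expand two-digit years are assumptions.

```java
import java.text.DateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class UtcDateDemo {
    // Assumed pivot: two-digit years below 70 are taken to be 20xx.
    static final int PIVOT = 70;

    static int expandYear(int twoDigitYear) {
        return twoDigitYear < PIVOT ? 2000 + twoDigitYear : 1900 + twoDigitYear;
    }

    /** Builds a Date from UTC fields; month is 1-based as in the Daytime string. */
    static Date utcDate(int twoDigitYear, int month, int day,
                        int hour, int minute, int second) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear(); // also zeroes the milliseconds field
        cal.set(expandYear(twoDigitYear), month - 1, day, hour, minute, second);
        return cal.getTime();
    }

    /** Formats the instant for a locale, converted to the given time zone. */
    static String format(Date date, Locale locale, TimeZone zone) {
        DateFormat df = DateFormat.getDateTimeInstance(
                DateFormat.FULL, DateFormat.FULL, locale);
        df.setTimeZone(zone);
        return df.format(date);
    }

    public static void main(String[] args) {
        Date date = utcDate(0, 2, 25, 17, 30, 45);
        System.out.println(format(date, Locale.FRANCE, TimeZone.getTimeZone("Europe/Paris")));
        System.out.println(format(date, Locale.US, TimeZone.getDefault()));
    }
}
```

Because a Date holds an absolute instant, the local-time and daylight savings adjustments are applied only at formatting time, by the time zone set on the DateFormat.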
Solaris Internationalization
The Solaris internationalization (I18N) architecture is the framework within which the Unicode/UTF-8 locales on Solaris systems are created and supported. The locale architecture is based on the single source/binary model. Tom's presentation provided an overview of the Solaris "master" Unicode locale, en_US.UTF-8. Particular emphasis was placed on its multiscript capabilities with regard to ease of use, customizability, and extensibility.
Tom described the flexibility of five key internationalization areas:
- Language Engines for Complex Text Layout Languages
- Code conversion
- Input Methods
- MIME support
- Terminal support
Language Evolution on Solaris
The Solaris single source/binary internationalization framework was introduced in 1992, with the release of Solaris version 2.0. It became available on both SPARC and Intel platforms in version 2.4.
The table below shows the language support evolution since Solaris 2.5.
| Solaris Version | Languages | Locales |
| 2.5 | 17 | 42 |
| 2.6 | 23 | 56 |
| 7 | 37 | 97 |
| 8 | 37 | 123 |
Solaris 8 is due to be released in the second quarter of 2000.
The key point in this table is that the underlying internationalization framework of Solaris has remained generally unchanged, yet it has been able to accommodate a huge increase in support for additional scripts and languages.
Solaris Framework: Overview
The Solaris framework can be divided into several layers as shown below.
The Operating System Layer comprises the Kernel, OS libraries, and utilities (as applications)
The Window System Layer comprises X11 libraries and X input method server (as an application).
The CDE/Motif Layer(s) comprise the Motif library, CDE libraries and CDE desktop applications.
References to these layers are made throughout this summary.
Operating System
When a user executes a program (in other words, when a process, and possibly its threads, calls the setlocale(3C) function), the operating system uses the C library libc to dynamically load the locale data for the requested locale (Locale 1, Locale 2, etc.). The locale data loaded is either that of the default locale or that of a user-defined locale.
The dynamic locale data consists not only of cultural conventions like date/time format, monetary and numeric formatting rules etc, but also includes various other components which will be discussed later.
Users can also dynamically load or unload STREAMS modules that can be pushed into a Stream between a user process and terminal device (either a real terminal or emulator) to do code conversions. For example, you might have a JIS-7 terminal, and you want to connect to a system and use the Japanese EUC locale. In order to do this you should insert a STREAMS module so that it will do code conversion between JIS-7 and Japanese EUC. Information on how to do this is provided with the Solaris documentation.
Key Features
As a POSIX/XPG conforming operating system, the Solaris system's I18N framework is based on the I18N model of POSIX/XPG. In addition, Solaris conforms to various other international, national, and industry standards including ANSI C, X11, Motif, and the Common Desktop Environment (CDE).
The Solaris I18N (and localization) model is based on the single source/binary model. This allows locale-related items (cultural elements, translated messages, and system resources) to be separated from the binary. Locale data is dynamically loaded into memory space when the binary is executed via a call to setlocale(3C).
With the evolution of Solaris language support, Tom's group developed a multilingual/multiscript environment, which is easy to use, customize, and extend. This flexibility is key to the Unicode strategy since it empowers end users and developers to customize their multilingual/multiscript environments. The new UTF-8 locales are based on Unicode 3.0 and are architected so that each one inherits any new multiscript features which are added to the basic I18N framework.
Locale Architecture
As a POSIX/XPG compliant system, Solaris provides the localedef(1) command to allow developers and localization engineers to generate locales. The localedef(1) command reads various definitions from specific files and generates a locale shared object containing locale-specific data such as a collation table for sorting and regular expression handling, character classification tables, cultural convention data, etc.
In addition, each locale may have other components like translated message files. These translated message files are in a binary format so that at run-time, they can be loaded and used more efficiently. (Other locale files include resource files, configuration files, help files, and so on.)
Depending on the locale and its codeset, the locale developer may develop and provide a method shared object containing codeset-specific related functions for use with routines such as mbtowc(3C), wcwidth(3C), mblen(3C) etc. that cannot be generalized and provided from the system. These functions are dynamically linked into the system's C library when setlocale(3C) is invoked.
If a user calls a codeset-related function like mbtowc(3C), the corresponding internal function from the method shared object will be called instead of the default system function.
X Input Method Architecture
Solaris provides two different kinds of input models in X. These are transparent to both end users and developers:
Local Input Method (IM) model
Remote IM model
Since Asian scripts usually require more complicated input processing (e.g. phonetic input, intelligent Kana-to-Kanji conversion, etc.), it is necessary to have a remote IM architecture, which enables a separate process or thread (a.k.a. an Input Method Server) to process input appropriately. Third party companies specialize in input technology and produce highly sophisticated language engines/servers. Sun includes these in Solaris rather than trying to invent its own.
Somewhat simpler input processing can be done by using a local IM model. Almost every Latin script-based locale (including en_US.UTF-8) supports the local IM module. In fact, the en_US.UTF-8 locale can support the local IM and remote IM models simultaneously.
Motif
With regard to presentation of characters, there are two different kinds of scripts: Complex Text Layout (CTL) scripts and non-CTL scripts.
CTL scripts require pre-processing before rendering because they are bi-directional and/or require context-sensitive shaping. For instance, in the case of Arabic languages, bi-directional processing and context-sensitive shaping is required because text proceeds horizontally from right to left and characters can take one of four possible shape forms: Isolated, Initial, Middle, or Final depending on context.
Thai, Indic, Arabic, Hebrew, Farsi, and Yiddish are examples of CTL languages.
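To give a feel for the bi-directional part of this pre-processing, the sketch below uses the `java.text.Bidi` class from the Java library. This is only an analogy in Java, not the Solaris Layout Library, and it covers directional run analysis only; it does not perform the context-sensitive shaping described above. The class and method names are hypothetical.

```java
import java.text.Bidi;

final class BidiDemo {
    /** Analyzes a paragraph, taking the base direction from its first strong character. */
    static Bidi analyze(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        // Latin text followed by three Hebrew letters (alef, bet, gimel).
        String mixed = "abc \u05D0\u05D1\u05D2";
        Bidi bidi = analyze(mixed);
        System.out.println("mixed direction: " + bidi.isMixed());
        for (int run = 0; run < bidi.getRunCount(); run++) {
            // Even levels are left-to-right runs, odd levels right-to-left.
            System.out.println("run " + run
                    + " chars " + bidi.getRunStart(run) + ".." + bidi.getRunLimit(run)
                    + " level " + bidi.getRunLevel(run));
        }
    }
}
```

Text that consists of a single left-to-right run needs none of this, which mirrors the way non-CTL scripts can skip the extra layout step.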
Solaris conforms to the CTL standard X/Open CAE Specification: Portable Layout Services. The Portable Layout Services interfaces are implemented in the "Layout Library" shown in the graphic below.
When m_create_layout(3L) is called (as with setlocale(3C) in the C library), the layout library will dynamically load a language-specific layout engine that will actually do the pre-processing for the CTL script.
Non-CTL scripts, like most Latin and Asian scripts, will not go down this extra pre-processing route but instead go directly to the corresponding output function (e.g. XwcDrawString(3X11)).
This is all implemented within the Motif toolkit and is driven by a special layered library, which is based on XPG4 standards. Sun adheres to open systems standards such as POSIX and XPG (X/Open).
Unicode on Solaris: A Flexible Framework
Tom described the Solaris internationalization framework as flexible, because it is easy to customize and expand. Unicode support provides some of this flexibility.
Unicode locales are standard (POSIX/XPG) Solaris locales based on the UTF-8 encoding. Solaris uses UTF-8 as its presentation format for Unicode. The number of these locales is increasing to fulfill country specific cultural element requirements.
There are many common features among Solaris UTF-8 locales, including:
- Code conversion
- Input methods
- File code form: UTF-8
- Process code form: UCS-4
These common features and others are inherited by all Solaris UTF-8 locales from the "master" UTF-8 locale. This results in a common Unicode foundation for all UTF-8 locales.
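The distinction between file code (UTF-8) and process code (UCS-4) in the list above can be illustrated in Java, where a string's code points play the role of fixed-width UCS-4 values. This is an analogy only; Solaris applications would work with multibyte strings and wchar_t in C. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class FileVsProcessCodeDemo {
    /** File code: variable-length UTF-8 bytes, as stored on disk. */
    static byte[] toFileCode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Process code: one 32-bit value per character. */
    static int[] toProcessCode(String text) {
        return text.codePoints().toArray();
    }

    /** Decodes UTF-8 file code back into a string. */
    static String fromFileCode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A', e-acute, euro sign, and a musical G clef (outside the BMP).
        String text = "A\u00E9\u20AC\uD834\uDD1E";
        System.out.println(toFileCode(text).length);    // 10 bytes: 1 + 2 + 3 + 4
        System.out.println(toProcessCode(text).length); // 4 code points
    }
}
```

The fixed-width form makes indexing and classification of individual characters simple inside a process, while UTF-8 keeps files compact and ASCII-compatible.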
Unicode 3.0 Locale: Overview
Tom described the locale naming convention, by using the locale en_US.UTF-8 as an example.
He pointed out that both of the subfields within the locale adhere to particular naming conventions. The first subfield refers to an ISO standard language indicator followed by a standard for the country or territory, and the second subfield represents the encoding.
The name en_US.UTF-8 therefore indicates that this is an English locale for the United States, using UTF-8 encoding. When using this locale, the menus and dialogs are in English. The date and time format, currency, and sorting methods are all U.S.-specific.
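The naming convention can be made concrete with a small parser. The sketch below is hypothetical (the class `PosixLocaleName` is not part of Solaris or Java) and handles only the language_TERRITORY.codeset form discussed here, ignoring any further suffixes a locale name might carry.

```java
import java.util.Locale;

final class PosixLocaleName {
    final String language;
    final String territory;
    final String codeset;

    private PosixLocaleName(String language, String territory, String codeset) {
        this.language = language;
        this.territory = territory;
        this.codeset = codeset;
    }

    /** Splits a name such as en_US.UTF-8 into its subfields. */
    static PosixLocaleName parse(String name) {
        String rest = name;
        String codeset = "";
        int dot = rest.indexOf('.');
        if (dot >= 0) {
            codeset = rest.substring(dot + 1);
            rest = rest.substring(0, dot);
        }
        String territory = "";
        int underscore = rest.indexOf('_');
        if (underscore >= 0) {
            territory = rest.substring(underscore + 1);
            rest = rest.substring(0, underscore);
        }
        return new PosixLocaleName(rest, territory, codeset);
    }

    /** Maps language and territory to a java.util.Locale; the codeset has no Locale counterpart. */
    Locale toJavaLocale() {
        // Throws IllformedLocaleException for names such as "C" that are not language codes.
        return new Locale.Builder().setLanguage(language).setRegion(territory).build();
    }

    public static void main(String[] args) {
        PosixLocaleName n = parse("en_US.UTF-8");
        System.out.println(n.language + " / " + n.territory + " / " + n.codeset);
        System.out.println(n.toJavaLocale().getDisplayName());
    }
}
```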
Tom referred to en_US.UTF-8 as the Master Unicode locale, because when enhancements are made to it, all other Unicode locales inherit those changes.
Unicode-driven Features at all Levels
Some of the Unicode-driven features that are implemented through the locale are listed below.
- Complex Text Layout (CTL) Support. This refers to support for Arabic, Hebrew, Thai, and Middle Eastern languages.
- Interoperability: Enhanced Code Conversion
- Extended Native Asian Input Methods
- MIME Support. This refers to support for email from systems/platforms using different codesets. A process has been developed for handling this automatically.
- Terminal Support. This refers to support for remote terminals.
Language Engines for CTL
To illustrate the flexibility of the internationalization framework, Tom described a problem with CTL support. Previously, the architecture required one layout engine per CTL locale. For example, the Hebrew locale used one layout engine, Arabic used another, and Thai used a third. This required separate modules. Also, much of the code required to drive the language engine was essentially generic, resulting in duplication of code among language engines. The last problem was that users were not able to customize their language engines in order to apply locale-specific language rules.
The solution was to decouple the language engine logic from language rules and create a generic "Universal Multiscript Layout Engine" or UMLE. The UMLE enables users to run Solaris on their desktop and display several output languages (currently Hebrew, Arabic, and Thai) simultaneously in one window. Additional scripts will be supported in the future.
Language Engine Architecture Improvements
Multiple Language Engine Architecture
The Solaris CTL framework consists of three main components:
Motif library: Uses the layout services of the Layout Library to provide Complex Text Layout capabilities transparently to the toolkit users.
Layout library: Conforms to X/Open CAE Specification -- Portable Layout Services: Context-dependent and Directional Text and contains six internationalization programming interfaces for CTL language support.
Layout engines: Are dynamically loaded/plugged into the Layout Library. The language engine loaded will depend on the locale in which the application program is running.
Three layout engines are shown at the bottom of the graphic, one each for Arabic, Hebrew, and Thai. When a user inputs text, it flows through in what's called the logical form, in other words the form in which it is input. It moves through the layout library, which doesn't do much, but simply acts as a layer between Motif and the layout engines.
As mentioned previously, Arabic is a context-sensitive language, where glyphs or characters change dynamically as you input text. The logic for these changes is all held in the layout engine, so when input goes through it, the engine passes back data to the Layout Library in presentation form, meaning the way it should look on the screen. Calculations are made for how to display the string, the height of the highest character, font metrics, how low the character descends, etc. The data is then passed to the X Library, which renders the character.
Universal Multiscript Language Engine Architecture
The diagram below illustrates the Universal Multiscript Layout Engine flow.
The initial process is the same as illustrated in the previous graphic. The difference comes when the Layout Library passes data in its logical form to the Layout Engine. The language rules have been taken out of the engine and put into a language-specific user-configurable text file. This means that users can tweak the rules to fit their needs, and Sun doesn't have to supply another language or layout engine. Users can go into this file and change the rules themselves without having to change any part of the architecture.
Within the Motif toolkit layer, when any static/dynamic text object widget is created, the toolkit dynamically loads the Layout Library and checks whether the current locale requires CTL processing by calling the m_create_layout(3X) and m_getvalues_layout(3X) functions. When the m_create_layout(3X) function is called, the current locale's language engine is loaded into the Layout Library.
If the current locale does not require CTL processing, the widget skips the CTL pre-processing steps and directly uses the X library's output functions (e.g. XmbDrawString(3X11)).
If the current locale does require CTL processing, however, the widget tries to get the necessary layout values from the layout engine by calling m_getvalues_layout(3X) for future processing. Afterwards, all text rendering goes through an additional step of transformation before actual rendering by using m_transform_layout(3X) or m_wtransform_layout(3X) functions. For actual rendering, X library output functions are used. These include: XmbDrawString(3X11), XDrawString(3X11), XDrawString16(3X11).
UMLE Feature Summary
The graphic below illustrates the UMLE:
Codeset independent methods. These can be loaded either from the UMLE pre-defined method sets or from the user-supplied method shared object.
Unicode Bidi Algorithm. Described in Unicode Technical Report #9 for bidirectionality support.
Table-driven shaping algorithm for bidi and context-sensitive shaping support. This is the table that contains bidi and shaping rules. It is a user configurable flat text file.
Binary bidi and shaping rule table generator utility genlayouttbl(1). This enables users to generate the binary bidi and shaping rule table. The shaping algorithm loads this into the UMLE.
genlayouttbl(1)
genlayouttbl(1) is the utility that accepts user-defined shaping rules and generates a set of binary shaping rule tables that can be loaded and used by the UMLE.
Because the rules file is just a plain-text command language, users can edit it and pass it to genlayouttbl(1), which then produces binary bidi and shaping tables.
Users define shaping rules as follows:
Classify each character of the locale's codeset into a number of types. Characters in a type have a common property. Arabic characters can be categorized into several types. For example, "4-shape type" are characters that have four shapes (isolated, initial, middle, and final forms), "2-shape type" are characters that have two shapes (isolated and final forms), and "1-shape type" are characters that have only one single shape (isolated form or characters from other scripts).
Enumerate all possible state transition combinations by using the types as input for each state transition. For instance:
S0 -- Type1 input -> S1 -- Type2 input -> S2 … -> Sfinal
S0 -- Type2 input -> S2 -- Type3 input -> S5 … -> Sfinal
Once state transition combinations are done, define all possible transformed output text for all final states.
Write definitions into a flat text file by following the genlayouttbl(1) input file format defined in the genlayouttbl(5) man page.
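The classification step can be made concrete with a toy Python sketch. This is not the genlayouttbl(1) input format and not the UMLE algorithm: it is a simplified stand-in that picks a form for each character from the types of its neighbours, using a tiny hand-written classification table. All names (CHAR_TYPES, shape, and so on) are illustrative assumptions, and refinements such as combining marks and ligatures are ignored.

```python
# Character types, following the classification described above.
FOUR_SHAPE = "4-shape"  # isolated, initial, middle, final
TWO_SHAPE = "2-shape"   # isolated, final
ONE_SHAPE = "1-shape"   # isolated only (also used for other scripts)

# Tiny illustrative table; a real table would cover the whole codeset.
CHAR_TYPES = {
    "\u0628": FOUR_SHAPE,  # ARABIC LETTER BEH
    "\u0644": FOUR_SHAPE,  # ARABIC LETTER LAM
    "\u0627": TWO_SHAPE,   # ARABIC LETTER ALEF
    "\u062f": TWO_SHAPE,   # ARABIC LETTER DAL
}


def char_type(ch):
    """Look up a character's type; anything unlisted is 1-shape."""
    return CHAR_TYPES.get(ch, ONE_SHAPE)


def shape(text):
    """Return (character, form) pairs for text given in logical order."""
    result = []
    for i, ch in enumerate(text):
        kind = char_type(ch)
        prev_kind = char_type(text[i - 1]) if i > 0 else ONE_SHAPE
        next_kind = char_type(text[i + 1]) if i + 1 < len(text) else ONE_SHAPE
        # Only a 4-shape letter connects forward to the letter after it,
        # so this letter joins backwards only if the previous one is 4-shape.
        joins_prev = kind != ONE_SHAPE and prev_kind == FOUR_SHAPE
        # This letter connects forward only if it is 4-shape itself and
        # the next letter is able to join backwards.
        joins_next = kind == FOUR_SHAPE and next_kind != ONE_SHAPE
        if joins_prev and joins_next:
            form = "middle"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        result.append((ch, form))
    return result


# BEH + ALEF + BEH, in logical (typing) order.
forms = [form for _, form in shape("\u0628\u0627\u0628")]
print(forms)  # ['initial', 'final', 'isolated']
```

In a table-driven engine the same decisions would be encoded as state transitions in a rules file rather than as if-statements, which is what lets users change the rules without touching the engine.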
UMLE Summary
Tom summarized by saying that UMLE is easy to maintain, extensible, customizable, and it enables display of mixed text in CTL languages. This is really the key added value provided by the latest version of Solaris compared with Solaris 7.
Code Conversions
Sun Microsystems found that they needed to provide a more customizable method for handling code conversion. What they came up with is a user-defined code conversion framework, which consists of three modules:
- Definition file (text based)
- Table builder: geniconvtbl
- Loadable shared object: geniconvtbl.so
User Defined Code Conversion with geniconvtbl
Solaris bundles many codeset conversion tables which are used by both the iconv(1) command and the iconv*() routines. These conversion modules cover a wide range of codesets. For example, they have modules that allow you to convert between ISO 8859-5, which is the ISO Cyrillic codeset standard, and 1251, which is the Russian Windows code page. Users, however, often require customized code conversion functionality (usually related to proprietary requirements).
Iconv modules are essentially fragments of code, which can be compiled and loaded as shared objects by iconv*(). Creating these modules is not a trivial task for end users who may be inexperienced in coding. The user-defined code conversion framework in Solaris provides a mechanism for end users to either extend existing code conversion modules or to create their own. Using geniconvtbl(1) users can define their own code conversions on Solaris. geniconvtbl(1) accepts users' code conversion definitions in a flat text file and generates a code conversion binary table file subsequently used by the geniconvtbl shared object, which is loaded by iconv(3).
The diagram below shows that the code conversion rules have been taken out and put into a user-defined file, so users can actually define code-set mappings themselves. Customized mappings are loaded into the iconv routine via another module.
This means that rather than Sun receiving requests for patches with new iconv modules, customers can make modifications themselves. Man pages describe each of the commands.
Basically, this framework:
Allows users to customize and extend the code conversion capabilities of their Solaris platform easily, effectively, and at minimum cost.
Allows users to be more self-sufficient and less reliant on patches or on future releases of the operating system.
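The idea of a standard conversion plus user-defined adjustments can be sketched in Python. This is only an analogy: it uses Python's built-in codecs rather than iconv or geniconvtbl, and the convert function and its overrides parameter are hypothetical names invented for this illustration.

```python
def convert(data, source, target, overrides=None):
    """Convert bytes from one codeset to another, pivoting through Unicode.

    overrides is an optional user-supplied mapping from a single character
    to its replacement text, applied before encoding to the target codeset.
    """
    text = data.decode(source)
    if overrides:
        text = text.translate(str.maketrans(overrides))
    return text.encode(target)


# "Privet" in Cyrillic, written with escapes: ISO 8859-5 to Windows 1251.
greeting = "\u041f\u0440\u0438\u0432\u0435\u0442"
iso_bytes = greeting.encode("iso8859_5")
win_bytes = convert(iso_bytes, "iso8859_5", "cp1251")
print(win_bytes == greeting.encode("cp1251"))  # True

# A customized mapping: replace the numero sign (U+2116) with plain ASCII.
custom = convert("\u21165".encode("iso8859_5"), "iso8859_5", "cp1251",
                 overrides={"\u2116": "No."})
print(custom)  # b'No.5'
```

The point of the design is the same as in the framework described above: the mapping rules live in data supplied by the user, while the conversion routine itself stays unchanged.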
Input Methods
A third area of internationalization framework flexibility deals with input methods. The challenge was to increase native input method support and provide dynamic switching between input modes. Sun wanted to keep adding to their native input method portfolio, and wanted users to be able to dynamically switch from one input mode to another without changing keyboards or rebooting.
The en_US.UTF-8 locale supports multiple scripts. These are:
- English/European
- Cyrillic
- Greek
- Arabic
- Hebrew
- Thai
- Unicode Hexadecimal and Octal code input methods
- Table lookup input method
- Japanese
- Korean
- Simplified Chinese
- Traditional Chinese
The first part of the solution was to add more input methods, such as those listed below:
- Japanese: ATOK, cs00, Wnn
- T/S Chinese: PinYin, QuanPin, TsangChieh
- Korean: Phonetic, Hangul-to-Hanja conversion
(ATOK refers to a language engine, which is supplied by a third party, which Sun buys and plugs in.)
The second part of the solution was to create a key sequence and mouse driven dynamic input mode switching facility, called the multiscript input schema.
To switch into a certain input mode, you can either:
Type in an input-mode-switch compose-key sequence for each input mode, or
Click the left-most mouse button on the "status area" of your application to open an input mode selection window and select from the listed input modes.
Input mode information is held in a special user configurable "Compose" file.
Multiscript Input Schema
The dynamic input mode switching method allows users to switch from any supported language to any other supported language by a simple keystroke.
For example, you can be working in English and then move to Cyrillic input and output. You can come back to English again with a keystroke or a mouse click, or change to any of the other languages.
The graphic below illustrates the schema.
Pressing a special key sequence or choosing a script via mouse selection on the special "status area" at the base of the client window will immediately switch your input mode to that script. All subsequent input will be in the chosen script (indicated in the above graphic by the arcs for each script). Also, the user's keyboard layout will be automatically switched to a default layout for that script. Users can return to the default script/layout (English in the above case) by either using a special key sequence (<ctrl><space>) or by selecting English on the status area using the mouse.
In addition to the major language categories (Cyrillic, Greek, CJK, and CTL) you will also notice options in the graphic for Unicode Hexadecimal Code (UHC), Unicode Octal Code (UOC) and table look-up (TBL). These are useful if you don't know the actual key sequences for the characters you wish to input.
For example, you might need to generate a particular Unicode character but you only have the Unicode code point. You don't know how to create it through key strokes. Using UHC, you can input the Unicode code point and get the actual character that you want. Or if you want to use Octal, you can select UOC. TBL is a table lookup feature, which pops up a table of characters so that you can choose from it. (This is similar to the Windows feature.)
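What the hexadecimal and octal code input methods do can be shown in a few lines of Python. The helper name char_from_code is made up for this illustration; it simply turns a typed code point into the corresponding character.

```python
def char_from_code(digits, base):
    """Return the character whose code point is written in the given base."""
    code_point = int(digits, base)
    return chr(code_point)


# U+0416 CYRILLIC CAPITAL LETTER ZHE, entered as hexadecimal and as octal.
from_hex = char_from_code("0416", 16)
from_octal = char_from_code("2026", 8)
print(from_hex == from_octal == "\u0416")  # True
```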
Modifying the input options in this way makes things very flexible.
Enhanced MIME Support
The fourth flexible internationalization feature is the MIME support that was added to the mailing system.
The challenge is that people receive e-mail from different platforms using different codesets and these are increasing by the day. Systems must have a transparent way of dealing with the mail, because if they didn't, users would get mail that looked like garbage. They would have to attempt codeset conversions on mail themselves, and then load the mail back up to read it.
Sun wanted to add conversion modules to extend platform interoperability. They needed the modules to work transparently to the user, and they wanted the modules to be extensible.
Their solution to these issues was to supply new geniconvtbl-based codeset conversion modules integrated with CDE/Dtmail, along with a locale conversion library.
As a result of increased coverage in scripts, Solaris 8 Dtmail running in the en_US.UTF-8 locale supports a wide range of MIME character sets such as Shift-JIS, GB2312, BIG5, TIS-620, UTF-16BE, UTF-16LE, and KOI8-U.
This support allows users to view virtually any kind of email encoded in various MIME character sets from any region of the world in a single instance of Dtmail. The decoding of received email is handled by Dtmail, which looks at the MIME character set and content transfer encoding provided with the email.
When sending email, users can specify a MIME character set that is understood by the recipient's mail client; alternatively, they can use the default MIME character set provided by the en_US.UTF-8 locale.
The result is a very rich mailing system on Solaris, driven by the existing code conversion modules as well as the locale conversion library.
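The general mechanism, in which a mail client reads the MIME character set and content transfer encoding and decodes the body accordingly, can be sketched with Python's standard email package. This illustrates MIME handling in general, not Dtmail's implementation; build_message and read_body are hypothetical helper names.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message(text, charset):
    """Build a raw message whose body is tagged with the given charset."""
    msg = EmailMessage()
    msg["Subject"] = "test"
    # The body is encoded in `charset` and transferred as base64.
    msg.set_content(text, charset=charset, cte="base64")
    return msg.as_bytes()


def read_body(raw):
    """Return (declared charset, decoded body text) for a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    # get_content() undoes the transfer encoding and decodes the bytes
    # using the charset named in the Content-Type header.
    return msg.get_content_charset(), msg.get_content()


# A Ukrainian greeting sent as KOI8-U, written with escapes.
raw = build_message("\u041f\u0440\u0438\u0432\u0456\u0442", "koi8-u")
charset, body = read_body(raw)
print(charset)
```

The reader never has to name the codeset: the tag travelling with the message is enough for the receiving side to choose the right conversion.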
Locale Conversion Library
In order to ensure interoperability with Internet mailing conventions, Solaris mail clients need to ensure correct code conversions for incoming and outgoing mail. This can be done by ensuring:
- Correct resolution of incoming MIME-compliant email.
- MIME-compliant tagging of outgoing mail.
The Locale Conversion Library (LCL) is an internal library used by email clients to achieve this.
Previously, each client of the LCL had its own statically linked copy of the library. This model was obviously inefficient. To resolve the problem the dynamically loadable LCL shared object was created. Unlike the old model this object is shared by all email clients rather than statically linked by each client.
The LCL Definition file contains locale-specific information for the LCL shared object. This information is used for converting characters from one representation to another. It is provided as a plain ASCII text file so locale developers or users can change its contents easily. This allows users to customize the behavior of the LCL without changing the library itself.
The model is illustrated below.
Incoming mail goes into the mail client, which in Solaris is called Dtmail. It looks at the MIME tag on the mail and passes it to a shared object. Shared objects can be dynamically linked, which means they can bypass standard APIs. The LCL shared object looks at the LCL definition file to find out whether or not the locale system needs to convert the codeset coming in from a particular sender. The e-mail client will take it, run it through the LCL module, and pass it to the LCL definition file.
As with the other flexible internationalization features, character conversion is definable, but definition is usually done by the localization center rather than the user. Customization is handled by the LCL file, which has a line in it defining the codeset conversion modules that should be run. The required mapping is passed back to the mail client, and the mail client does the conversion based on the information it finds.
Outgoing mail is tagged according to the locale your system is set to, and obviously it can't do any conversion because it doesn't know what machine it's going to. The recipient will always need some sort of mechanism for conversion as well.
Extended Code Conversion Support
The diagram below summarizes all the possible iconv code conversions related to Unicode/ISO/IEC 10646-1. There are many more in addition to these. Please refer to the Solaris documentation for more information.
The various conversions between single-byte codesets and Unicode representations are given below:
- ISO 646 (ASCII), ISO 8859-1 ~ -10, -13, -14, and -15, KOI8-R, KOI8-U <-> UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-8
Conversions among various Unicode representations are given below:
- UCS-4, UCS-4BE, UCS-4LE <-> UCS-2, UCS-2BE, UCS-2LE, UTF-16, UTF-16BE, UTF-16LE
- UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE, UTF-16BE, UTF-16LE <-> UTF-8
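To make the difference between these representations concrete, the following Python sketch encodes one character in several of them. It uses Python's codec names, which is an assumption of this illustration: "utf-32-be" stands in for UCS-4BE, and for characters in the Basic Multilingual Plane "utf-16-be" produces the same bytes as UCS-2BE. The reencode helper is a made-up name.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1 = text.encode("iso8859_1")    # b'\xe9'
utf8 = text.encode("utf-8")          # b'\xc3\xa9'
utf16be = text.encode("utf-16-be")   # b'\x00\xe9'
utf16le = text.encode("utf-16-le")   # b'\xe9\x00'
ucs4be = text.encode("utf-32-be")    # b'\x00\x00\x00\xe9'


def reencode(data, source, target):
    """Convert between two representations by pivoting through Unicode."""
    return data.decode(source).encode(target)


# Single-byte codeset to a Unicode representation, and between two
# Unicode representations.
print(reencode(latin1, "iso8859_1", "utf-8") == utf8)    # True
print(reencode(utf16le, "utf-16-le", "utf-32-be") == ucs4be)  # True
```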
Unicode Support for Terminals
The final area of flexible internationalization support relates to terminals. If you have a terminal that supports UTF-8, then it's very easy to hook it up to Solaris, log in to a UTF-8 locale, and have the characters display properly.
The reason for this is that Solaris provides kernel-based STREAMS modules (called ldterm) that can be pushed on to a stream between the terminal (or emulator) and the user process for proper code conversion. This means that no additional streams modules are required as they had been previously. It's all built in.
For terminals that already support UTF-8 locales no additional streams modules need to be pushed on to the stream since the standard streams terminal line discipline module ldterm has been made codeset independent. Terminals that do not support UTF-8 will require the appropriate utf8/<codeset> streams module to be pushed on to the stream.
Terminals that support only a specific codeset but are logging into a Solaris system under a UTF-8 locale are supported as well through UTF-8-based streams modules.
The main purpose of these modules is to carry out proper code conversion in the case where the terminal's native codeset is different from the codeset of the user's locale on Solaris. For example, you can connect a terminal that has the ISO 8859 Part 1 codeset as the native codeset to a Solaris system and then set the current locale of the session to en_US.UTF-8. In this case, you can push the u8lat1(7M) STREAMS module on to your Stream and achieve proper code conversion between ISO 8859-1 and UTF-8.
The user can just pop and push these modules in and out of the stream very easily by changing the configuration file. When data comes in, it will then be converted according to the configuration settings.
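The kind of work such a conversion module does on the output path can be sketched in user space with Python. This is not the STREAMS interface and not how u8lat1(7M) is implemented; Utf8ToLatin1Filter is a hypothetical class that shows one detail any stream converter has to handle, namely a multibyte UTF-8 character split across two chunks of data.

```python
import codecs


class Utf8ToLatin1Filter:
    """Convert a chunked UTF-8 byte stream to ISO 8859-1 (illustrative)."""

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def write(self, chunk):
        # The incremental decoder holds back an incomplete multibyte
        # sequence until the remaining bytes arrive in a later chunk.
        text = self._decoder.decode(chunk)
        # Characters with no ISO 8859-1 equivalent are replaced by '?'.
        return text.encode("iso8859_1", errors="replace")


stream = Utf8ToLatin1Filter()
print(stream.write(b"caf\xc3"))  # b'caf'  (first half of the e-acute is held back)
print(stream.write(b"\xa9"))     # b'\xe9' (completed character, now in ISO 8859-1)
```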
The key point here is that Solaris provides flexible Unicode support at all levels, right down to the kernel level where these streams modules are deployed.
Conclusion
Tom concluded by reiterating that the Solaris I18N architecture is solidly based on Unicode and will continue to be. That is the Sun strategy. They will continue to support legacy codesets as well, since they aren't going to magically disappear.
Sun internationalization tenets are that the framework will be easy to use, easy to customize, and easy to expand so that work won't need to be done later to accommodate new features. Tom said that they want to empower users to change the system in ways that are useful.
Tom also stated that Unicode support needs to be provided at all levels: the application, toolkit, Windowing system, Motif, OS (for the locale data), and right down to the kernel. (Streams modules are implemented at the kernel level.) It really has to be a top-down approach.
What's Next?
Tom described the areas that his group will be focusing on in the future. He reported that their top priority will be to continue to find ways to make the architecture more extensible and customizable by both end users and developers. They want to support more platforms, expand inter-platform capabilities, and expand the multilingual, multiscript features. For example, they are looking at adding Indic script support.
Another goal is to improve the current I18N architecture by using Distributed Internationalization Services (thin client networks for example). This poses a whole new set of challenges, but would resolve some of the current constraints that POSIX imposes.
With the explosion of the Internet, new challenges and issues need to be addressed, particularly in the areas of e-commerce and consumer/embedded systems. Sun plans to focus much of their future effort on this work as well.
Questions and Answers
Question
Can you clarify how geniconvtbl and your layout engines work, particularly the modules you buy? Are they executable files created with certain character sets? Are they like code pages?
Answer
Geniconvtbl is a shared object which is used for codeset conversion. It's a module that takes a set of mappings defined by the user and is loaded by a routine that actually does the mapping. The geniconvtbl shared object will load up an iconv module, which is supplied on Solaris itself, that has been created by the user. The module will convert from one codeset to another, with some modifications that have been required by the user. For example, they might want a slightly different mapping from the standard. Geniconvtbl is used to create that module.
Question
Can you convert from one language codeset to another language codeset? For example, from Russian to Japanese?
Answer
No, you can't do codeset conversion for different languages. The codesets have to be for the same language. For example, in Russian (and in many languages), you have different codesets for the same language. 1251 is the Windows code page for Russian, but the equivalent on UNIX can be ISO 8859-5, KOI8, or KOI8-R. (KOI8 is the most popular in Russia.)
Question
You mentioned terminal support. Are you talking about handling character conversion and terminal support from one Solaris machine to another Solaris terminal?
Answer
The support Tom described was purely for a dumb terminal, connected to a Solaris box, logging in using a character-only interface, no GUI. That's the configuration.
Question
I know there are other UNIX vendors. How does Solaris with its new UTF-8 support interface with those other versions of UNIX?
Answer
The strategy on Unicode across UNIX vendors is generally the same: implement with UTF-8. You can't implement raw Unicode because raw Unicode has certain characters that UNIX doesn't like. UTF-8 protects those characters by the special encoding that it uses. And so most UNIX vendors will go with UTF-8.
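The answer does not say which characters are meant. A common reading, assumed here, is that 16-bit Unicode text contains byte values such as NUL (0x00) and the slash (0x2F) inside ordinary characters, and that these bytes have special meaning to UNIX string and file name handling. The Python sketch below, with the made-up helper unsafe_bytes, shows that UTF-8 never produces those byte values for non-ASCII characters.

```python
def unsafe_bytes(encoded):
    """Return the NUL and '/' byte values present in an encoded string."""
    return {b for b in encoded if b in (0x00, 0x2F)}


# 'A' followed by U+012F, LATIN SMALL LETTER I WITH OGONEK.
text = "A\u012f"

raw = text.encode("utf-16-be")  # bytes 00 41 01 2F
utf8 = text.encode("utf-8")     # bytes 41 C4 AF

print(sorted(unsafe_bytes(raw)))   # [0, 47]
print(sorted(unsafe_bytes(utf8)))  # []
```

In UTF-8 every byte of a multibyte character is 0x80 or higher, so a NUL or slash byte can only ever mean an actual NUL or slash character.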
Question
What about multilingual networks? By that I mean you have a server and two users connecting from terminals. One is Spanish and the other is French, and they both want to see the actual OS displayed in their own language. How do you handle that? Can you add multiple language support in the servers that cater to that kind of situation?
Answer
Yes. Once a user is running a particular locale, he is tied to that locale until the end of his session. When he logs into a server, he logs in under a particular locale. For example, with X Windows, where you can have an X server and an X display, it will be as if you're on the local server, but it's just that your display is somewhere else. Your input, output and print will be based on that locale, and the user interface will be translated if it is a supported language. Likewise, a different user can log in under another locale and get the same support.