Overview

There are many different formats for software localization, i.e., for resource bundles. The most important file format feature for translation of text elements is to represent key-value pairs where the values are strings. Each format was designed for a certain purpose. Many but not all formats are recognized by translation tools. For localization it is best to use a source format that is optimized for translation, and to convert from it to the platform-specific formats at build time.

This overview concentrates on the formats that are relevant for working with ICU. The examples below show only lists of strings, which is the lowest common denominator for resource bundles.

Recommendation

The most promising long-term approach is to author localizable data in XLIFF format (see the XLIFF (§) section below) and to convert it to native, platform/tool-specific formats at build time. Short-term, due to the lack of ICU tools for XLIFF, either custom tools must be used to convert from some authoring/translation format to Java/ICU formats, or one of the Java/ICU formats needs to be used for authoring and translation.

Java and ICU4J

.properties files

Java PropertyResourceBundle uses runtime-parsed .properties files. They contain key-value pairs where both keys and values are Unicode strings. No other native data types (e.g., integers or binaries) are supported. There is no way to specify a charset, therefore .properties files must be in ISO 8859-1 with \u escape sequences (see the Java native2ascii tool).

Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/PropertyResourceBundle.html

Example: (example_de.properties)
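As a rough illustration of how such key-value pairs are read at runtime, the following Java sketch parses .properties-style text with PropertyResourceBundle. The keys, the values, and the class name PropertiesBundleSketch are hypothetical, and the content is supplied from a string rather than from a real example_de.properties file. The non-ASCII characters are written as \u escapes, which the parser decodes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class PropertiesBundleSketch {
    // Hypothetical file content. The doubled backslashes put a literal
    // backslash into the text, as it would appear in an ISO 8859-1 file.
    static final String CONTENT =
        "hello=Hallo Welt\n" +
        "greeting=Gr\\u00fc\\u00dfe\n";

    // Parses the text as a .properties bundle; Unicode escapes are decoded here.
    static ResourceBundle load() {
        try {
            return new PropertyResourceBundle(new StringReader(CONTENT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = load();
        System.out.println(bundle.getString("hello"));             // Hallo Welt
        System.out.println(bundle.getString("greeting").length()); // 5
    }
}
```

In an application the bundle would normally be located by ResourceBundle.getBundle() from a file on the class path; the string-based loading here only keeps the sketch self-contained.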
.java ListResourceBundle files

Java ListResourceBundle files provide implementation subclasses of the ListResourceBundle abstract base class. They are Java code! Source files are .java files that are compiled as usual with the javac compiler. Syntactic rules of Java apply. As Java source code, they can contain arbitrary Java objects and can be nested. Although the Java compiler allows a charset to be specified on the command line, this is uncommon, and .java resource bundle files are therefore usually encoded in ISO 8859-1 with \u escapes like .properties files.

Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/ListResourceBundle.html

Example: (example_de.java)
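A minimal sketch of what a ListResourceBundle subclass can look like follows. The class names, keys, and values are hypothetical and are not taken from example_de.java. The bundle is written as a nested class only so that the sketch fits in one file. It shows string values next to a non-string value, which .properties files cannot express.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

public class ListBundleSketch {
    // Hypothetical German bundle, nested here to keep the sketch in one file.
    public static class Example_de extends ListResourceBundle {
        private static final Object[][] CONTENTS = {
            {"hello", "Hallo Welt"},
            // Unicode escapes keep the source file ASCII-only.
            {"greeting", "Gr\u00fc\u00dfe"},
            // Values may be arbitrary objects, for example an array.
            {"weekdays", new String[] {"Montag", "Dienstag"}},
        };

        @Override
        protected Object[][] getContents() {
            return CONTENTS;
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new Example_de();
        System.out.println(bundle.getString("hello"));            // Hallo Welt
        System.out.println(bundle.getStringArray("weekdays")[1]); // Dienstag
    }
}
```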
ICU4J can also access the ICU4C resource bundles described in the next section, using the API described in the UResourceBundle documentation.

ICU4C

.txt resource bundles

ICU4C natively uses a plain text source format with a nested structure that was derived from Java ListResourceBundle .java files when the original ICU Java class files were ported to C++. The ICU4C bundle format can of course contain only data, not code, unlike .java files. Resource bundle source files are compiled with the genrb tool into a binary runtime form (.res files) that is portable among platforms with the same charset family (ASCII vs. EBCDIC) and endianness.

Features:
Defined at: icuhtml/design/bnf_rb.txt

To use with ICU4C, see the Resource Bundle APIs section of this userguide.

Example: (de.txt)
ICU4C XML resource bundles

The ICU4C XML resource bundle format was defined simply to express the same capabilities of the .txt and binary ICU4C resource bundles in XML form. However, we have decided to drop the format for lack of use and instead adopt standard XLIFF format for localization. For more information on XLIFF format, see the following section. For examples on using ICU tools to produce and read XLIFF format see the XLIFF Usage (§) section in the resource management chapter.

XLIFF

The XML Localization Interchange File Format (XLIFF) is an emerging industry standard "for the interchange of localization information". Version 1.1 is available (2003-Oct-31), and 1.2 is almost complete (2007-Jan-20).

This is the result of a quick review of XLIFF and may need to be improved.

Features:
Defined at: http://www.oasis-open.org/committees/xliff/

Example: (example.xlf)
For examples on using ICU tools to produce and read XLIFF format see the XLIFF Usage (§) section in the resource management chapter.

DITA

The Darwin Information Typing Architecture (DITA) is "IBM's XML architecture for topic-oriented information". It is a family of XML formats for several types of publications including manuals and resource bundles. It is extensible. For example, subformats can be defined by refining DTDs. One design feature is to provide cross-document references for reuse of existing contents.

For more information see http://www.ibm.com/developerworks/xml/library/x-dita4/index.html

While it is certainly possible to define resource bundle formats via DTDs in the DITA framework, there currently (2002-Nov-27) do not appear to be resource bundle formats actually defined, or tools available specifically for them.

Linux/gettext

The OpenI18N specification requires support for message handling functions (mostly variants of gettext()) as defined in libintl.h. See Tables 3-5 and 3-6 and Annex C in http://www.openi18n.org/docs/html/LI18NUX-2000-amd4.htm

Resource bundles ("portable object files", extension .po) are plain text files with key-value pairs for string values. The format and functions support a simple selection of plural forms by associating integer values (via C language expressions) with indexes of strings.

The msgfmt utility compiles .po files into "message object files" (extension .mo). The charset is determined from the locale ID in LC_CTYPE. There are additional supporting tools for .po files.
Defined at: Annex C of the Li18nux-2000 specification, see above.

Example: (example.po)
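The plural-form selection described for .po files can be sketched in Java. A .po header typically carries a rule such as "Plural-Forms: nplurals=2; plural=n != 1;", where the C expression maps a number to the index of the string to use. The sketch below hard-codes that one rule instead of parsing a header, and the translated strings and the class name PluralSketch are hypothetical.

```java
public class PluralSketch {
    // Hypothetical translated forms, stored in the order of their indexes.
    static final String[] FORMS = {"%d Datei", "%d Dateien"};

    // Java equivalent of the C expression "n != 1" (two forms, as in German).
    static int pluralIndex(long n) {
        return n != 1 ? 1 : 0;
    }

    // Picks the form whose index the rule yields, then inserts the number.
    static String format(long n) {
        return String.format(FORMS[pluralIndex(n)], n);
    }

    public static void main(String[] args) {
        System.out.println(format(1)); // 1 Datei
        System.out.println(format(3)); // 3 Dateien
    }
}
```

Languages with more than two plural forms use a larger nplurals value and a more involved expression; the lookup by index stays the same.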
POSIX/catgets

POSIX (The Open Group specification) defines message catalogs with the catgets() C function and the gencat build-time tool. Message catalogs contain key-value pairs where the keys are integers 1..NL_MSGMAX (see limits.h), and the values are strings. Strings can span multiple lines. The charset is determined from the locale ID in LC_CTYPE.

Defined at: http://www.opengroup.org/onlinepubs/009695399/utilities/gencat.html and http://www.opengroup.org/onlinepubs/009695399/functions/catgets.html

Example: (example.txt)
Windows

Windows uses a number of file formats depending on the language environment -- MSVC 6, Visual Basic, or Visual Studio .NET. The best-known source formats are the .rc Resource and .mc Message file formats. They both get compiled into .res files that are linked into special sections of executables. Source formats can be UTF-16, while compiled strings are (almost) always UTF-16 from .rc files (except for predefined ComboBox strings) and can optionally be UTF-16 from .mc files.

.rc files carry key-value pairs where the keys are usually numeric but can be strings. Values can be strings, string tables, or one of many Windows GUI-specific structured types that compile directly into binary formats that the GUI system interprets at runtime. .rc files can include C #include files for #defined numeric keys.

.mc files contain string values preceded by per-message headers similar to the Linux/gettext() format. There is a special format of messages with positional arguments, with printf-style formatting per argument.

In both .rc and .mc formats, Windows LCID values are defined to be set on the compiled resources.

Developers and translators usually overlook the fact that binary resources are included, and copy them into each translation. This happens even though Windows, like Java and ICU, uses locale ID fallback at runtime.

.rc and .mc files are tightly integrated with Microsoft C/C++, Visual Studio and the Windows platform, but are not used on any other platforms.

A sample Windows .rc file (§) is at the end of this document.

ICU tools

ICU 2.4 provides tools for conversion between resource bundle formats:
There are currently no ICU tools for XLIFF.

Converting de.txt to a ListResourceBundle

The following genrb invocation generates a ListResourceBundle from de.txt (see the example file de.txt above):

genrb -j -b TestName -p com.example de.txt

The -j option causes .java output, -b is an arbitrary bundle name prefix, and -p is an arbitrary package name. "Arbitrary" means "depends on your product" and may be truly arbitrary if the generated .java files are not actually used in a Java application. genrb auto-detects .txt files encoded in Unicode charsets like UTF-8 or UTF-16 if they have a signature byte sequence ("BOM"). The .java output file is in native2ascii format, i.e., it is encoded in US-ASCII with \u escapes.

The output of the above genrb invocation is TestName_de.java:
Converting a ListResourceBundle back to .txt

An ICUListResourceBundle .java file as generated in the previous example can be converted to an ICU4C .txt file with the following steps:
The last step generates a new de.txt in native2ascii format:
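To make the term "native2ascii format" concrete, the following Java sketch replaces every character outside US-ASCII with a \u escape. It is only an approximation under stated assumptions: the class name AsciiEscapeSketch is hypothetical, lowercase hexadecimal digits are a choice made here, and this is not the code that genrb or native2ascii actually runs.

```java
public class AsciiEscapeSketch {
    // Replaces each UTF-16 code unit outside US-ASCII with a six-character
    // escape: backslash, the letter u, and four hexadecimal digits.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the German word with its two non-ASCII letters escaped.
        System.out.println(escape("Gr\u00fc\u00dfe"));
    }
}
```

Because the loop works on UTF-16 code units, a supplementary character comes out as two escapes, one per surrogate.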
Further information
Sample Windows .rc file

This file (winrc.rc) was generated with MSVC 6, using the New Project wizard to generate a simple "Hello World!" application, changing the LCIDs to German, then adding the two example strings as above.