ICU Data

Overview

ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. Additional data can be provided by users, either as customizations of ICU's data or as new data altogether.

This section describes how ICU data is stored and located at run time. It also describes how ICU data can be customized to suit the needs of a particular application.

For simple use of ICU's predefined data, this section on data management can safely be skipped. The data is built into a library that is loaded along with the rest of ICU. No specific action or setup is required of either the application program or the execution environment.

Note ICU for C comes with pre-built data by default. The source data files are included as an "icu*data.zip" file starting in ICU4C 49. Previously, they were not included unless ICU was downloaded from the source repository. Alternatively, the Data Customizer may be used to customize the pre-built data.

ICU and CLDR Data

Most of ICU's data is sourced from CLDR, the Common Locale Data Repository project. Do not file bugs against ICU to request data changes in CLDR; use the CLDR project's own page instead. Because most ICU data files are autogenerated from CLDR, manually editing them is not usually recommended.

Data which is NOT sourced from CLDR includes:
  • Conversion Data
  • Break Iterator Dictionary Data (Thai, CJK, etc.)
  • Break Iterator Rule Data (as of this writing, it is manually kept in sync with the CLDR datasets)
For information on building ICU data from CLDR, see the cldr-icu-readme.

ICU Data Directory

The ICU data directory is the default location for all ICU data. Any requests for data items that do not include an explicit directory path will be resolved to files located in the ICU data directory.

The ICU data directory is determined as follows:

  1. If the application has called the function u_setDataDirectory(), use the directory specified there, otherwise:

  2. If the environment variable ICU_DATA is set, use that, otherwise:

  3. If the C preprocessor variable ICU_DATA_DIR was set at the time ICU was built, use its compiled-in value.

  4. Otherwise, the ICU data directory is an empty string. This is the default behavior for ICU using a shared library for its data and provides the highest data loading performance.
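The lookup order above can be sketched as a small function. This is only a model of the documented precedence, not ICU's implementation: the function name and the convention that a NULL argument means "this source was not set" are assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Model of the documented lookup order only; not ICU's own code.
 * Assumption of this sketch: a NULL argument means "not set".
 */
const char *resolve_icu_data_directory(const char *set_by_app,
                                       const char *env_icu_data,
                                       const char *compiled_in_dir)
{
    if (set_by_app != NULL) {
        return set_by_app;       /* 1. u_setDataDirectory() was called */
    }
    if (env_icu_data != NULL) {
        return env_icu_data;     /* 2. ICU_DATA environment variable */
    }
    if (compiled_in_dir != NULL) {
        return compiled_in_dir;  /* 3. ICU_DATA_DIR fixed when ICU was built */
    }
    return "";                   /* 4. empty string */
}
```

With only the environment variable and the compiled-in value present, the environment variable wins; with nothing set, the result is the empty string described in step 4.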

Note u_setDataDirectory() is not thread-safe. Call it before calling ICU APIs from multiple threads. If you use both u_setDataDirectory() and u_init(), then use u_setDataDirectory() first.

Earlier versions of ICU supported two additional schemes: setting a data directory relative to the location of the ICU shared libraries, and on Windows, taking a location from the registry. These have both been removed to make the behavior more predictable and easier to understand.

The ICU data directory does not need to be set in order to reference the standard built-in ICU data. Applications that just use standard ICU capabilities (converters, locales, collation, etc.) but do not build and reference their own data do not need to specify an ICU data directory.

Multiple-Item ICU Data Directory Values

The ICU data directory string can contain multiple directories as well as .dat path/filenames. They must be separated by the path separator that is used on the platform, for example a semicolon (;) on Windows. Data files will be searched in all directories and .dat package files in the order of the directory string. For details, see the example below.
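As an illustration of how such a multi-item string breaks down, the following sketch splits a data directory string at the platform path separator and tells directories apart from .dat package files. The function names are hypothetical, and treating every item that ends in ".dat" as a package file is an assumption of this sketch rather than a statement about ICU's internals.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the item ends in ".dat", 0 if it is treated as a directory. */
int item_is_dat_file(const char *item, size_t len)
{
    return len >= 4 && strncmp(item + len - 4, ".dat", 4) == 0;
}

/*
 * Splits data_path at each occurrence of sep (';' on Windows, for example)
 * and records the start and length of up to max_items non-empty items,
 * in the order in which they would be searched.
 * Returns the number of items recorded.
 */
int split_data_directory(const char *data_path, char sep,
                         const char *starts[], size_t lengths[], int max_items)
{
    int count = 0;
    const char *p = data_path;

    while (*p != '\0' && count < max_items) {
        const char *end = strchr(p, sep);
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);

        if (len > 0) {
            starts[count] = p;
            lengths[count] = len;
            count++;
        }
        if (end == NULL) {
            break;
        }
        p = end + 1;
    }
    return count;
}
```

For the hypothetical Windows value "c:\icu\data;c:\app\myapp.dat", this yields two items: a directory followed by a package file.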

Default ICU Data

The default ICU data consists of the data needed for the converters, collators, locales, etc. that are provided with ICU. Default data must be present in order for ICU to function.

The default data is most commonly built into a shared library that is installed with the other ICU libraries. Nothing is required of the application for this mechanism to work. ICU provides additional options for loading the default data if more flexibility is required.

Here are the steps followed by ICU to locate its default data. This procedure happens only once per process, at the time an ICU data item is first requested.

  1. If the application has called the function udata_setCommonData(), use the data that was provided. The application specifies the address in memory of an image of an ICU common format data file (either in shared-library format or .dat package file format).

  2. Examine the contents of the default ICU data shared library. If it contains data, use that data. If the data library is empty (that is, a stub library), proceed to the next step. (A data shared library must always be present in order for ICU to successfully link and load. A stub data library is used when the actual ICU common data is to be provided from another source.)

  3. Dynamically load (memory map, typically) a common format (.dat) file containing the default ICU data. Loading is described in the section How Data Loading Works (§). The path to the data is of the form "icudt<version><flag>", where <version> is the two-digit ICU version number, and <flag> is a letter indicating the internal format of the file (see the Sharing ICU Data Between Platforms section (§)).

Once the default ICU data has been located, loading of individual data items proceeds as described in the section How Data Loading Works (§).

Application Data

ICU-based applications can ship and use their own data for localized strings, custom conversion tables, etc. Each data item file must have a package name as a prefix, and this package name must match the basename of a .dat package file, if one is used. The package name must be used in ICU APIs, for example in udata_setAppData() (instead of udata_setCommonData() which is only used for ICU's own data) and in the pathname argument of ures_open().

The only real difference to ICU's own data is that application data cannot be simply loaded by specifying a NULL value for the path arguments of ICU APIs, and application data will not be used by APIs that do not have path/package name arguments at all.

The most important APIs that allow application data to be used are for Resource Bundles, which are most often used for localized strings and other data. There are also functions like ucnv_openPackage() that allow application data to be specified, and the udata.h API can be used to load any data with minimum requirements on the binary format, and without ICU interpreting the contents of the data.

The pkgdata tool, which is used to package the data into various formats (e.g. shared library), has an option (--without-assembly or -w) to not use assembly code when building and packaging the application specific data into a shared library. Building the data with assembly code, which is enabled by default, is faster and more efficient; however, there are some platform specific issues that may arise. The --without-assembly option may be necessary on certain platforms (e.g. Linux) which have trouble properly loading application data when it was built with assembly code and is packaged as a shared library.

Flexibility vs. Installation vs. Performance

There are choices that affect ICU data loading and depend on application requirements.

Data in Shared Libraries/DLLs vs. .dat package files

Building ICU data into shared libraries is the most convenient packaging method because shared libraries (DLLs) are easily found if they are in the same directory as the application libraries, or if they are on the system library path. The application installer usually just copies the ICU shared libraries to the same place. On the other hand, shared libraries are not portable.

Packaging data into .dat files allows them to be shared across platforms, but they must either be loaded by the application and set with udata_setCommonData() or udata_setAppData(), or they must be in a known location that is included in the ICU data directory string. This requires the application installer, or the application itself at runtime, to locate the ICU and/or application data by setting the ICU data directory (see the ICU Data Directory (§) section above) or by loading the data and providing it to one of the udata_setXYZData() functions.

Unlike shared libraries, .dat package files can be taken apart into separate data item files with the decmn ICU tool. This allows post-installation modification of a package file. The gencmn and pkgdata ICU tools can then be used to reassemble the .dat package file.

For more information about .dat package files see the section Sharing ICU Data Between Platforms (§) below.

Data Overriding vs. Loading Performance

If the ICU data directory string is empty, then ICU will not attempt to load data from the file system. It is then only possible to load data from the linked-in shared library or via udata_setCommonData() and udata_setAppData(). This is inflexible but provides the highest performance.

If the ICU data directory string is not empty, then data items are searched in all directories and matching .dat files mentioned before checking in already-loaded package files. This allows overriding of packaged data items with single files after installation but costs some time for filesystem accesses. This is usually done only once per data item; see User Data Caching below.

Single Data Files vs. Packages

Single data files are easy to replace and can override items inside data packages. However, it is usually desirable to reduce the number of files during installation, and package files use less disk space than many small files.

How Data Loading Works

ICU data items are referenced by three names - a path, a name and a type. The following are some examples:

path                       name      type
(none)                     cnvalias  icu
(none)                     cp1252    cnv
(none)                     en        res
(none)                     uprops    icu
c:\some\path\dataLibName   test      dat

Items with no path specified are loaded from the default ICU data.

Application data items include a path, and will be loaded from user data files, not from the ICU default data. For application data, the path argument need not contain an actual directory, but must contain the application data's package name after the last directory separator character (or by itself if there is no directory). If the path argument contains a directory, then it is logically prepended to the ICU data directory string and searched first for data. The path argument can contain at most one directory. (Path separators like semicolon (;) are not handled here.)

Note The ICU data directory string itself may contain multiple directories and path/filenames to .dat package files. See the ICU Data Directory (§) section.

It is recommended to not include the directory in the path argument but to make sure via setting the application data or the ICU data directory string that the data can be located. This simplifies program maintenance and improves robustness.

See the API descriptions for the functions udata_open() and udata_openChoice() for additional information on opening ICU data from within an application.

Data items can exist as individual files, or a number of them can be packaged together in a single file for greater efficiency in loading and convenience of distribution. The combined files are called Common Files.

Based on the supplied path and name, ICU searches several possible locations when opening data. To make things more concrete in the following descriptions, the following values of path, name and type are used:

path = "c:\some\path\dataLibName"
name = "test"
type = "res"

In this case, "dataLibName" is the "package name" part of the path argument, and "c:\some\path\" is the directory part of it.

The search sequence for the data for "test.res" is as follows (the first successful loading attempt wins):

  1. Try to load the file "dataLibName_test.res" from c:\some\path\.

  2. Try to load the file "dataLibName_test.res" from each of the directories in the ICU data directory string.

  3. Try to locate the data package for the package name "dataLibName".

      1. Try to locate the data package in the internal cache.

      2. Try to load the package file "dataLibName.dat" from c:\some\path\.

      3. Try to load the package file "dataLibName.dat" from each of the directories in the ICU data directory string.

The first steps, loading the data item from an individual file, are omitted if no directory is specified in either the path argument or the ICU data directory string.
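The search sequence can be made concrete with a sketch that lists, in order, the file names that would be tried for a given path, name and type. It is a simplified model, not ICU's code: the function names are hypothetical, every entry of the data directory string is assumed to be a plain directory, and the in-memory cache lookup is only marked by a comment because it has no file name.

```c
#include <stdio.h>
#include <string.h>

#define CANDIDATE_LEN 256

/* Stores "<dir><sep if needed><file>" in out[count]; returns the new count. */
static int add_candidate(char out[][CANDIDATE_LEN], int count, int max_out,
                         const char *dir, size_t dir_len, char sep,
                         const char *file)
{
    if (count >= max_out) {
        return count;
    }
    if (dir_len > 0 && dir[dir_len - 1] != sep) {
        snprintf(out[count], CANDIDATE_LEN, "%.*s%c%s",
                 (int)dir_len, dir, sep, file);
    } else {
        snprintf(out[count], CANDIDATE_LEN, "%.*s%s",
                 (int)dir_len, dir, file);
    }
    return count + 1;
}

/* Fills out[] with the files to try, in search order; returns how many. */
int build_search_sequence(const char *path, const char *name, const char *type,
                          const char *data_dirs[], int n_dirs, char sep,
                          char out[][CANDIDATE_LEN], int max_out)
{
    /* The package name follows the last directory separator in path. */
    const char *last_sep = strrchr(path, sep);
    const char *package = (last_sep != NULL) ? last_sep + 1 : path;
    size_t dir_len = (last_sep != NULL) ? (size_t)(last_sep - path) + 1 : 0;
    char item_file[CANDIDATE_LEN];
    char package_file[CANDIDATE_LEN];
    int count = 0;
    int i;

    snprintf(item_file, sizeof item_file, "%s_%s.%s", package, name, type);
    snprintf(package_file, sizeof package_file, "%s.dat", package);

    /* Individual item file: directory from path first, then data directories. */
    if (dir_len > 0) {
        count = add_candidate(out, count, max_out, path, dir_len, sep, item_file);
    }
    for (i = 0; i < n_dirs; i++) {
        count = add_candidate(out, count, max_out, data_dirs[i],
                              strlen(data_dirs[i]), sep, item_file);
    }

    /* Package: the cache is consulted here (not modeled), then the files. */
    if (dir_len > 0) {
        count = add_candidate(out, count, max_out, path, dir_len, sep, package_file);
    }
    for (i = 0; i < n_dirs; i++) {
        count = add_candidate(out, count, max_out, data_dirs[i],
                              strlen(data_dirs[i]), sep, package_file);
    }
    return count;
}
```

For the example values used in this section and a hypothetical data directory d:\icu, the sketch produces four candidates: the two "dataLibName_test.res" locations followed by the two "dataLibName.dat" locations.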

Package files are loaded at most once and then cached. They are identified only by their package name. Whenever a data item is requested from a package and that package has been loaded before, then the cached package is used immediately instead of searching through the filesystem.

Note ICU versions before 2.2 always searched data packages before looking for individual files, which made it impossible to override packaged data items. See the ICU 2.2 download page and the readme for more information about the changes.

User Data Caching

Once loaded, data package files are cached, and stay loaded for the duration of the process. Any requests for data items from an already loaded data package file are routed directly to the cached data. No additional search for loadable files is made.

The user data cache is keyed by the base file name portion of the requested path, with any directory portion stripped off and ignored. Using the previous example, for the path name "c:\some\path\dataLibName", the cache key is "dataLibName". After this is cached, a subsequent request for "dataLibName", no matter what directory path is specified, will resolve to the cached data.

Data can be explicitly added to the cache of common format data by means of the udata_setAppData() function. This function takes as input the path (name) and a pointer to a memory image of a .dat file. The data is added to the cache, causing any subsequent requests for data items from that file name to be routed to the cache.
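The keying rule can be illustrated with a toy cache. This is a sketch under stated assumptions, not ICU's data structure: the struct, the function names and the fixed capacity are hypothetical, and the only behavior modeled is that the directory part of the path is ignored and that an existing entry is never replaced.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PACKAGES 8

struct package_cache {
    const char *keys[MAX_PACKAGES];  /* package names; must outlive the cache */
    const void *data[MAX_PACKAGES];  /* memory images of the packages */
    int count;
};

/* The cache key is whatever follows the last directory separator. */
const char *cache_key(const char *path, char sep)
{
    const char *last_sep = strrchr(path, sep);
    return (last_sep != NULL) ? last_sep + 1 : path;
}

/* Returns the cached image for the package named in path, or NULL. */
const void *cache_find(const struct package_cache *cache,
                       const char *path, char sep)
{
    const char *key = cache_key(path, sep);
    int i;

    for (i = 0; i < cache->count; i++) {
        if (strcmp(cache->keys[i], key) == 0) {
            return cache->data[i];
        }
    }
    return NULL;
}

/*
 * Adds an image under the package name taken from path. An existing entry
 * is kept unchanged. Returns the image that is cached after the call, or
 * NULL if the cache is full.
 */
const void *cache_add(struct package_cache *cache, const char *path,
                      char sep, const void *image)
{
    const void *existing = cache_find(cache, path, sep);

    if (existing != NULL) {
        return existing;
    }
    if (cache->count >= MAX_PACKAGES) {
        return NULL;
    }
    cache->keys[cache->count] = cache_key(path, sep);
    cache->data[cache->count] = image;
    cache->count++;
    return image;
}
```

After adding an image for "c:\some\path\dataLibName", a lookup for "dataLibName" under any other directory returns the same image, which mirrors the behavior described above.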

Only data package files are cached. Separate data files that contain just a single data item are not cached; for these, multiple requests to ICU to open the data will result in multiple requests to the operating system to open the underlying file.

However, most ICU services (Resource Bundles, conversion, etc.) themselves cache loaded data, so that data is usually loaded only once until the end of the process (or until u_cleanup() or ucnv_flushCache() or similar are called.)

There is no mechanism for removing or updating cached data files.

Directory Separator Characters

If a directory separator (generally '/' or '\') is needed in a path parameter, use the form that is native to the platform. The ICU header "putil.h" defines U_FILE_SEP_CHAR appropriately for the platform.

Note On Windows, the directory separator must be '\' for any paths passed to ICU APIs. This is different from native Windows APIs, which generally allow either '/' or '\'.

Sharing ICU Data Between Platforms

ICU's default data is (at the time of this writing) about 8 MB in size. Because it is normally built as a shared library, the file format is specific to each platform (operating system). The data libraries cannot be shared between platforms even though the actual data contents are identical.

By distributing the default data in the form of common format .dat files rather than as shared libraries, a single data file can be shared among multiple platforms. This is beneficial if a single distribution of the application (a CD, for example) includes binaries for many platforms, and the size requirements for replicating the ICU data for each platform are a problem.

ICU common format data files are not completely interchangeable between platforms. The format depends on these properties of the platform:

  1. Byte Ordering (little endian vs. big endian)

  2. Base character set - ASCII or EBCDIC

This means, for example, that ICU data files are interchangeable between Windows and Linux on X86 (both are ASCII little endian), or between Macintosh and Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC and Solaris on X86 (different byte ordering).

The single letter following the version number in the file name of the default ICU data file encodes the properties of the file as follows:

icudt19l.dat   Little Endian, ASCII
icudt19b.dat   Big Endian, ASCII
icudt19e.dat   Big Endian, EBCDIC

(There are no little endian EBCDIC systems. All non-EBCDIC encodings include an invariant subset of ASCII that is sufficient to enable these files to interoperate.)
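The mapping from platform properties to the flag letter can be written down as a short sketch. The function names are hypothetical, and the file name pattern simply follows the three examples listed above; it is not taken from ICU's build scripts.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Returns 'l', 'b' or 'e' for the given platform properties, or '\0' for
 * little endian EBCDIC, a combination that does not occur.
 */
char icu_data_flag(int is_big_endian, int is_ebcdic)
{
    if (is_ebcdic) {
        return is_big_endian ? 'e' : '\0';
    }
    return is_big_endian ? 'b' : 'l';
}

/* Detects the byte order of the machine running this code. */
int host_is_big_endian(void)
{
    const unsigned short probe = 1;
    return *(const unsigned char *)&probe == 0;
}

/*
 * Writes a name such as "icudt19l.dat" into out.
 * Returns 0 on success, -1 if the combination is invalid or out is too small.
 */
int icu_data_file_name(int version, int is_big_endian, int is_ebcdic,
                       char *out, size_t out_size)
{
    char flag = icu_data_flag(is_big_endian, is_ebcdic);
    int written;

    if (flag == '\0') {
        return -1;
    }
    written = snprintf(out, out_size, "icudt%d%c.dat", version, flag);
    return (written > 0 && (size_t)written < out_size) ? 0 : -1;
}
```

Two platforms can share one .dat file exactly when icu_data_flag() returns the same letter for both, which is the interchangeability rule stated above.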

The packaging of the default ICU data as a .dat file rather than as a shared library is requested by using an option in the configure script at build time. Nothing is required at run time; ICU finds and uses whatever form of the data is available.

Note When the ICU data is built in the form of shared libraries, the library names have platform-specific prefixes and suffixes. On Unix-style platforms, all the libraries have the "lib" prefix and one of the usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and suffixes, the library names are the same as the above .dat files.

Customizing ICU's Data Library

ICU includes a standard library of data that is about 16 MB in size. Most of this consists of conversion tables and locale information. The data itself is normally placed into a single shared library.

The ICU data library can be easily customized, either by adding additional converters or locales, or by removing some of the standard ones for the purpose of saving space.

Note ICU for C comes with pre-built data by default. The source data files are included as an "icu*data.zip" file starting in ICU4C 49. Previously, they were not included unless ICU was downloaded from the source repository. Alternatively, the Data Customizer may be used to customize the pre-built data.

ICU can load data from individual data files as well as from its default library, so building a customized library when adding additional data is not strictly necessary. Adding to ICU's library can simplify application installation by eliminating the need to include separate files with an application distribution, and the need to tell ICU where they are installed.

Reducing the size of ICU's data by eliminating unneeded resources can make sense on small systems with limited or no disk, but for desktop or server systems there is no real advantage to trimming. ICU's data is memory mapped into an application's address space, and only those portions of the data actually being used are ever paged in, so there are no significant RAM savings. As for disk space, with the large size of today's hard drives, saving a few MB is not worth the bother.

By default, ICU builds with a large set of converters and with all available locales. This means that any extra items added must be provided by the application developer. There is no extra ICU-supplied data that could be specified.

Details

The converters and resources that ICU builds are listed in the following configuration files. They are only available when building from ICU's source code repository. Normally, the standard ICU distribution does not include these files.

source/data/locales/resfiles.mk      The standard set of locale data resource bundles
source/data/locales/reslocal.mk      User-provided file with additional resource bundles
source/data/coll/colfiles.mk         The standard set of collation data resource bundles
source/data/coll/collocal.mk         User-provided file with additional collation resource bundles
source/data/brkitr/brkfiles.mk       The standard set of break iterator data resource bundles
source/data/brkitr/brklocal.mk       User-provided file with additional break iterator resource bundles
source/data/translit/trnsfiles.mk    The standard set of transliterator resource files
source/data/translit/trnslocal.mk    User-provided file with a set of additional transliterator resource files
source/data/mappings/ucmcore.mk      Core set of conversion tables for MIME/Unix/Windows
source/data/mappings/ucmfiles.mk     Additional, large set of conversion tables for a wide range of uses
source/data/mappings/ucmebcdic.mk    Large set of EBCDIC conversion tables
source/data/mappings/ucmlocal.mk     User-provided file with additional conversion tables
source/data/misc/miscfiles.mk        Miscellaneous data, like timezone information

These files function identically for both Windows and UNIX builds of ICU. ICU will automatically update the list of installed locales returned by uloc_getAvailable() whenever resfiles.mk or reslocal.mk are updated and the ICU data library is rebuilt. These files are only needed while building ICU. If any of these files are removed or renamed, the size of the ICU data library will be reduced.

The optional files reslocal.mk and ucmlocal.mk are not included as part of a standard ICU distribution. Thus these customization files do not need to be merged or updated when updating versions of ICU.

Both reslocal.mk and ucmlocal.mk are makefile includes, so the usual rules for makefiles apply. A line may be continued by ending it with a backslash. Lines beginning with a # are comments. See ucmfiles.mk and resfiles.mk for additional information.

Reducing the Size of ICU's Data: Conversion Tables

The size of the ICU data file in the standard build configuration is about 8 MB. The majority of this is used for conversion tables. ICU comes with so many conversion tables because many ICU users need to support many encodings from many platforms. There are conversion tables for EBCDIC and DOS codepages, for ISO 2022 variants, and for small variations of popular encodings.

Important: ICU provides full internationalization functionality without any conversion table data. The common library contains code to handle several important encodings algorithmically: US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e., US-ASCII, ISO-8859-1, and all Unicode charsets; see source/data/mappings/convrtrs.txt for the current list).

Therefore, the easiest way to reduce the size of ICU's data by a lot (without limitation of I18N support) is to reduce the number of conversion tables that are built into the data file.

The conversion tables are listed for the build process in several makefiles, source/data/mappings/ucm*.mk, roughly grouped by how commonly they are used. If you remove or rename any of these files, then the ICU build will exclude the conversion tables that are listed in that file. Beginning with ICU 2.0, all of these makefiles including the main one are optional. If you remove all of them, then ICU will include only very few conversion tables for "fallback" encodings (see note below).

If you remove or rename all ucm*.mk files, then ICU's data is reduced to about 3.6 MB. If you remove all these files except for ucmcore.mk, then ICU's data is reduced to about 4.7 MB, while keeping support for a core set of common MIME/Unix/Windows encodings.

Note If you remove the conversion table for an encoding that could be a default encoding on one of your platforms, then ICU will not be able to instantiate a default converter. In this case, ICU 2.0 and up will automatically fall back to a "lowest common denominator" and load a converter for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be good enough for converting strings that contain only "ASCII" characters (see the comment about "invariant characters" in utypes.h).

When ICU is built with a reduced set of conversion tables, then some tests will fail that test the behavior of the converters based on known features of some encodings. Also, building the testdata will fail if you remove some conversion tables that are necessary for that (to test non-ASCII/Unicode resource bundle source files, for example). You can ignore these failures. Build with the standard set of conversion tables, if you want to run the tests.

Reducing the Size of ICU's Data: Locale Data

If you need to reduce the size of ICU's data even further, then you need to remove other files or parts of files from the build as well.

There are a number of different subdirectories of 'data' containing locale data split out by section. Each subdirectory has its own .mk file listing the locales which will be built. Subdirectories include lang for language names and curr for currency names.

You can remove data for entire locales by removing their files from source/data/locales/resfiles.mk or the appropriate other .mk file. ICU will then use the data of the parent locale instead, which is root.txt if you remove all resource bundles for a given language and its country/region/variant sublocales. Do not remove root.txt! Also, do not remove a parent locale if child locales exist. For example, do not remove "en" while retaining "en_US".
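To see why a parent locale must stay in place while its children exist, it helps to look at how a parent is derived from a locale ID. The sketch below uses plain truncation at the last underscore and ends at "root". That rule is a simplification assumed here for illustration; the function name is hypothetical and ICU's actual fallback logic is not reproduced.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes the parent of locale into parent, using truncation at the last '_'
 * (a simplifying assumption of this sketch).
 * Returns 1 if a parent was written, 0 if locale is already "root".
 */
int parent_locale(const char *locale, char *parent, size_t parent_size)
{
    const char *last_underscore;

    if (strcmp(locale, "root") == 0) {
        return 0;
    }
    last_underscore = strrchr(locale, '_');
    if (last_underscore == NULL) {
        /* A bare language such as "en" falls back to root. */
        snprintf(parent, parent_size, "root");
    } else {
        snprintf(parent, parent_size, "%.*s",
                 (int)(last_underscore - locale), locale);
    }
    return 1;
}
```

Under this model "en_US" falls back to "en" and "en" falls back to "root", so removing "en" while keeping "en_US" would leave a gap in the chain.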

Reducing the Size of ICU's Data: Collation Data

Collation data (for sorting, searching and alphabetic indexes) is also large, especially the collation data for East Asian languages because they define multiple orderings of tens of thousands of Han characters. You can remove the collation data for those languages by removing references to those locales from the source/data/coll/colfiles.mk file. When you do that, the collation for those languages will fall back to the root collator; that is, you lose language-specific behavior.

A much less radical approach is to keep the collation data tables but remove the tailoring rule strings from which they were built. Those rule strings are rarely used at runtime. For documentation about their use and how to remove them see the section "Building on Existing Locales" in the Collation Customization chapter.

Adding Converters to ICU

The first step is to obtain or create a .ucm (source) mapping data file for the desired converter. A large archive of converter data is maintained by the ICU team at http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/

We will use solaris-eucJP-2.7.ucm, available from the repository mentioned above, as an example.

Build the Converter

Converter source files are compiled into binary converter files (.cnv files) by using the ICU tool makeconv. For the example, you can use this command:

makeconv -v solaris-eucJP-2.7.ucm

Some of the .ucm files from the repository will need additional header information before they can be built. Use the error messages from the makeconv tool, .ucm files for similar converters, and the ICU user guide documentation of .ucm files as a guide when making changes. For the solaris-eucJP-2.7.ucm example, we will borrow the missing header fields from source/data/mappings/ibm-33722_P12A-2000.ucm, which is the standard ICU eucJP converter data.

The ucm file format is described in the "Conversion Data" chapter of this user guide.

After adjustment, the header of the solaris-eucJP-2.7.ucm file contains these items:

<code_set_name>   "solaris-eucJP-2.7"
<subchar>         \x3F
<uconv_class>     "MBCS"

<mb_cur_max>      3
<mb_cur_min>      1

<icu:state>       0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
<icu:state>       a1-fe
<icu:state>       a1-e4
<icu:state>       a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
<icu:state>       a1-fe

The binary converter file produced by the makeconv tool is solaris-eucJP-2.7.cnv

Installation

Copy the new .cnv file to the desired location for use. Set the environment variable ICU_DATA to the directory containing the data, or, alternatively, from within an application, tell ICU the location of the new data with the function u_setDataDirectory() before using the new converter.

If ICU is already obtaining data from files rather than a shared library, install the new file in the same location as the existing ICU data file(s), and don't change/set the environment variable or data directory.

If you do not want to add a converter to ICU's base data, you can also generate a conversion table with makeconv, use pkgdata to generate your own package, and then use ucnv_openPackage() to open a converter with that conversion table from the generated package.

Building the new converter into ICU

The need to install a separate file and inform ICU of the data directory can be avoided by building the new converter into ICU's standard data library. Here is the procedure for doing so:

  1. Move the .ucm file(s) for the converter(s) to be added (solaris-eucJP-2.7.ucm for our example) into the directory source/data/mappings/

  2. Create the file source/data/mappings/ucmlocal.mk, or edit it if it already exists. Add this line:

      UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm

    Any number of converters can be listed. Extend the list to new lines with a backslash at the end of the line. The ucmlocal.mk file is described in more detail in source/data/mappings/ucmfiles.mk. (Even though they use very different build systems, ucmlocal.mk is used for both the Windows and UNIX builds.)

  3. Add the converter name and aliases to source/data/mappings/convrtrs.txt. This will allow your converter to be shown in the list of available converters when you call the ucnv_getAvailableName() function. The file syntax is described within the file.

  4. Rebuild the ICU data.
    For Windows, from MSVC choose the makedata project from the GUI, then build the project.
    For UNIX, "cd icu/source/data; gmake"

When opening an ICU converter (ucnv_open()), the converter name cannot be qualified with a path that indicates the directory or common data file containing the corresponding converter data. The required data must be present either in the main ICU data library or as a separate .cnv file located in the ICU data directory. This is different from opening resources or other types of ICU data, which do allow a path.

Adding Locale Data to ICU's Data

If you have data for a locale that is not included in ICU's standard build, then you can add it to the build in a very similar way as with conversion tables above. The ICU project provides a large number of additional locales in its locale repository on the web. Most of this locale data is derived from the CLDR (Common Locale Data Repository) project.

You need to write a resource bundle file for it with a structure like the existing locale resource bundles (e.g. source/data/locales/ja.txt, ru_RU.txt, kok_IN.txt) and add it by writing a file source/data/locales/reslocal.mk just like above. In this file, define the list of additional resource bundles as

GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...

Starting in ICU 2.2, these added locales are automatically listed by uloc_getAvailable().

ICU Data File Formats

ICU uses several kinds of data files with specific source (plain text) and binary data formats. The following table provides links to descriptions of those formats.

Each ICU data object begins with a header before the actual, specific data. The header consists of a 16-bit header length value, the two "magic" bytes DA 27 and a UDataInfo structure which specifies the data object's endianness, charset family, format, data version, etc.
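A reader for this header might look like the sketch below. The byte offsets follow the UDataInfo declaration in udata.h as recalled here (a 16-bit size, a 16-bit reserved word, then the isBigEndian, charsetFamily, sizeofUChar and reserved bytes, followed by three four-byte fields); treat that layout as an assumption to be checked against the headers of your ICU version. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct icu_header_info {
    uint16_t header_size;       /* total header length in bytes */
    int is_big_endian;          /* nonzero if the data is big endian */
    int charset_family;         /* assumed: 0 = ASCII, 1 = EBCDIC */
    char data_format[5];        /* four format bytes plus a terminator */
    uint8_t format_version[4];
    uint8_t data_version[4];
};

/* Parses the start of a data object; returns 0 on success, -1 on failure. */
int parse_icu_header(const uint8_t *bytes, size_t length,
                     struct icu_header_info *info)
{
    if (length < 24) {
        return -1;                      /* too short to hold the fields read */
    }
    if (bytes[2] != 0xDA || bytes[3] != 0x27) {
        return -1;                      /* magic bytes do not match */
    }

    /* Assumed UDataInfo layout, starting at offset 4. */
    info->is_big_endian = bytes[8];
    info->charset_family = bytes[9];

    /* The 16-bit header length is stored in the data's own byte order. */
    if (info->is_big_endian) {
        info->header_size = (uint16_t)((bytes[0] << 8) | bytes[1]);
    } else {
        info->header_size = (uint16_t)((bytes[1] << 8) | bytes[0]);
    }

    memcpy(info->data_format, bytes + 12, 4);
    info->data_format[4] = '\0';
    memcpy(info->format_version, bytes + 16, 4);
    memcpy(info->data_version, bytes + 20, 4);
    return 0;
}
```

Reading the endianness byte first matters because the header length cannot be decoded correctly until the byte order of the data is known.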

Files | Source format | Binary format | Generator tool

Public Data Files
ICU .dat package files | (list of files provided as input to the icupkg tool, or on the gencmn tool command line) | .dat: source/tools/toolutil/pkg_gencmn.c | icupkg or gencmn
Resource bundles | .txt: icuhtml/design/bnf_rb.txt | .res: source/common/uresdata.h | genrb
Unicode conversion mapping tables | .ucm: Conversion Data chapter | .cnv: source/common/ucnvmbcs.h | makeconv
Conversion (charset) aliases | source/data/mappings/convrtrs.txt: contains format description. The command "uconv -l --canon" will also generate the alias table from the currently used copy of ICU. | cnvalias.icu: source/common/ucnv_io.c | gencnval
Unicode Character Data (Properties; for Java only: hardcoded in C common library) | source/data/unidata/ppucd.txt: Preparsed UCD | uprops.icu: tools/unicode/c/genprops/corepropsbuilder.cpp | genprops
Unicode Character Data (Case mappings; for Java only: hardcoded in C common library) | source/data/unidata/*.txt: Unicode Character Database | ucase.icu: tools/unicode/c/genprops/casepropsbuilder.cpp | genprops
Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library) | source/data/unidata/*.txt: Unicode Character Database | ubidi.icu: tools/unicode/c/genprops/bidipropsbuilder.cpp | genprops
Unicode Character Data (Normalization since ICU 4.4) & custom normalization data | source/data/unidata/norm2/*.txt: Files derived from the Unicode Character Database, or custom data | .nrm: source/common/normalizer2impl.h | gennorm2
Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library) | source/data/unidata/*.txt: Unicode Character Database | unorm.icu: source/common/unormimp.h | gennorm
Unicode Character Data (Character names) | source/data/unidata/UnicodeData.txt: Unicode Character Database | unames.icu: tools/unicode/c/genprops/namespropsbuilder.cpp | genprops
Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8) | UCD Property*Aliases.txt: Unicode Character Database | pnames.icu: source/common/propname.h | genprops
Unicode Character Data (Property [value] aliases before ICU 4.8) | source/data/unidata/Property*Aliases.txt: Unicode Character Database | pnames.icu: source/common/propname.h (ICU 4.6) | genpname
Collation data (root collation & tailorings; ICU 53 & later) | Original data from allkeys_CLDR.txt in CLDR Root Collation Data Files, processed into source/data/unidata/FractionalUCA.txt by tool at unicode.org maintained by Mark Davis (call the Main class with option writeFractionalUCA); source tailorings (text rules) in source/data/coll/*.txt resource bundles: Collation Customization chapter | ucadata.icu & binary tailorings in resource bundles: source/i18n/collationdatareader.h | genuca, genrb
Collation data (UCA, code points to weights; ICU 52 & earlier) | Same as in ICU 53 | ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52) | genuca, genrb
Collation data (Inverse UCA, weights->code points; ICU 52 & earlier) | Processed from FractionalUCA.txt like ICU 52 ucadata.icu | invuca.icu: source/i18n/ucol_imp.h (ICU 52) | genuca
Rule-based break iterator data | .txt: Boundary Analysis chapter | .brk: source/common/rbbidata.h | genbrk
Dictionary-based break iterator data (ICU 50 & later) | .txt: gendict.cpp comments | .dict: see source/common/dictionarydata.h | gendict
Dictionary-based break iterator data (ICU 49 & earlier) | .txt: genctd.cpp comments | .ctd: see CompactTrieHeader in source/common/triedict.cpp | genctd
Rule-based transform (transliterator) data | .txt (in resource bundles): Transform Rule Tutorial chapter | Uses genrb to make binary format | Does not apply
Time zone data (Before ICU 4.4) | source/data/misc/zoneinfo.txt: ftp://elsie.nci.nih.gov/pub/ tzdata<year><rev>.tar.gz | zoneinfo.res (generated by genrb and tzcode tools) | Does not apply
Time zone data (ICU 4.4 and later) | source/data/misc/zoneinfo64.txt: ftp://elsie.nci.nih.gov/pub/ tzdata<year><rev>.tar.gz | zoneinfo64.res (generated by genrb and tzcode tools) | Does not apply
StringPrep profile data | source/data/misc/NamePrepProfile.txt | .spp: source/tools/gensprep/store.c | gensprep
Confusable data | source/data/unidata/confusables.txt, source/data/unidata/confusablesWholeScript.txt | confusables.cfu: source/i18n/uspoof_impl.h | gencfu

Non-File API Binary Data
Converter selector data | none | source/common/ucnvsel.cpp | ucnvsel_open()

Test-Only Data Files
test.icu (for udata API testing) | none (fixed output from gentest when not using -r or -j options) | test.icu: see createData() in source/tools/gentest/gentest.c | gentest

Other Data Structures
BytesTrie (maps byte sequences to 32-bit integers) | (builder API) | BytesTrie design doc, source/common/bytestrie.h | (builder class)
UCharsTrie (C++)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers) | (builder API) | UCharsTrie design doc, source/tools/toolutil/ucharstrie.h | (builder class)

ICU4J Resource Information

Starting with release 2.1, ICU4J includes its own resource information, which is completely independent of the JRE resource information. (Note: in ICU4J 2.8 to 3.4, time zone information depends on the underlying JRE.) The new ICU4J information is equivalent to the information in ICU4C, and many resources are, in fact, the same binary files that ICU4C uses.

By default the ICU4J distribution includes all of the standard resource information. It is located under the directory com/ibm/icu/impl/data. Depending on the service, the data is in different locations and in different formats. Note: This will continue to change from release to release, so clients should not depend on the exact organization of the data in ICU4J.

  1. The primary locale data is under the directory icudt38b, as a set of ".res" files whose names are the locale identifiers. Locale naming is documented in the com.ibm.icu.util.ULocale class, and the use of these names in searching for resources is documented in com.ibm.icu.util.UResourceBundle.

  2. The collation data is under the directory icudt38b/coll, as a set of ".res" files.

  3. The rule-based transliterator data is under the directory icudt38b/translit as a set of ".res" files. (Note: the Han transliterator test data is no longer included in the core icu4j.jar file by default.)

  4. The rule-based number format data is under the directory icudt38b/rbnf as a set of ".res" files.

  5. The break iterator data is directly under the data directory, as a set of ".brk" files, named according to the type of break and the locale where there are locale-specific versions.

  6. The holiday data is under the data directory, as a set of ".class" files, named "HolidayBundle_" followed by the locale ID.

  7. The character property data, as well as assorted normalization data and default Unicode Collation Algorithm (UCA) data, are found under the data directory as a set of ".icu" files.

  8. The character set converter data is under the directory icudt38b, as a set of ".cnv" files. These files are currently included only in icu-charset.jar.

  9. The time zone data is named zoneinfo.res under the directory icudt38b.

Some of the data files alias or otherwise reference data from other data files. One reason for this is that some locale names have changed. For example, he_IL used to be iw_IL. In order to support both names without duplicating the data, one of the resource files refers to the other file's data. In other cases, a file may alias a portion of another file's data in order to save space. Currently ICU4J provides no tool for revealing these dependencies.

Note: Java's Locale class silently converts the language code "he" to "iw" when you construct the Locale (for versions of Java through Java 5). Thus Java cannot be used to locate resources that use the "he" language code. ICU, on the other hand, does not perform this conversion in ULocale, and instead uses aliasing in the locale data to represent the same set of data under different locale ids.

Resource files that use locale ids form a hierarchy with up to four levels: root, language, region (country), and variant. Searches for locale data attempt to match as far down the hierarchy as possible. For example, "he_IL" will match he_IL, but "he_US" will match he (since there is no US variant for he), and "xx_YY" will match root (the default fallback locale), since there is no xx language code in the locale hierarchy. Again, see java.util.ResourceBundle for more information.
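
The matching rule in the previous paragraph can be sketched as follows. This is a simplified illustration, not ICU4J's actual lookup code: it ignores aliasing and any other details of the real resource loader, and the function names fallback_chain and resolve, as well as the set of available bundles, are hypothetical.

```python
def fallback_chain(locale_id):
    """List the bundle names to try, from most specific to root."""
    parts = locale_id.split("_")
    # Drop one trailing subtag at a time: he_IL -> he
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def resolve(locale_id, available):
    """Return the first bundle in the fallback chain that actually exists."""
    for candidate in fallback_chain(locale_id):
        if candidate in available:
            return candidate
    raise LookupError("no bundle found, not even root")


# Hypothetical set of bundles present in the data directory.
AVAILABLE = {"root", "he", "he_IL", "iw", "iw_IL"}
```

With this set, resolve("he_IL", AVAILABLE) finds he_IL directly, resolve("he_US", AVAILABLE) falls back to he, and resolve("xx_YY", AVAILABLE) ends at root, matching the three cases described in the text.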

Currently ICU4J provides no tool for revealing these dependencies between data files, so trimming the data directly in the ICU4J project is a hit-or-miss affair. The key point when you remove data is to make sure to remove all dependencies on that data as well. For example, if you remove he.res, you need to remove he_IL.res, since it is lower in the hierarchy, and you must remove iw.res, since it references he.res, and iw_IL.res, since it depends on it (and also references he_IL.res).
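
Since no tool reveals these dependencies, they have to be collected by hand. Once they are known, working out the full set of files to remove is a simple transitive closure. The sketch below assumes two hand-built, hypothetical maps (PARENTS for the locale hierarchy, ALIASES for files that reference another file's data) and reproduces the he.res example; the function name files_to_remove is also hypothetical.

```python
def files_to_remove(target, parents, aliases):
    """Return target plus every file that depends on it, directly or indirectly."""
    # Invert both maps: dependents[x] is the set of files that need x.
    dependents = {}
    for child, parent in parents.items():
        dependents.setdefault(parent, set()).add(child)
    for alias, real in aliases.items():
        dependents.setdefault(real, set()).add(alias)

    doomed = set()
    stack = [target]
    while stack:
        current = stack.pop()
        if current in doomed:
            continue  # already handled; avoids looping on cyclic references
        doomed.add(current)
        stack.extend(dependents.get(current, ()))
    return sorted(doomed)


# Hand-collected, hypothetical dependency data for the example in the text.
PARENTS = {"he_IL.res": "he.res", "iw_IL.res": "iw.res"}    # child -> parent in hierarchy
ALIASES = {"iw.res": "he.res", "iw_IL.res": "he_IL.res"}    # file -> file it references
```

Here files_to_remove("he.res", PARENTS, ALIASES) yields he.res, he_IL.res, iw.res and iw_IL.res, the same four files named above. The result is only as complete as the maps supplied, so testing the application against the trimmed data remains necessary.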

Unfortunately, the jar tool in the JDK provides no way to remove items from a jar file. Thus you have to extract the resources, remove the ones you don't want, and then create a new jar file with the remaining resources. See the jar tool information for how to do this. Before 'rejarring' the files, be sure to thoroughly test your application with the remaining resources, making sure each required resource is present.
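
The extract, remove and re-create steps can also be scripted. Because a jar file is a zip archive, the sketch below uses Python's standard zipfile module to copy every entry except the unwanted ones into a new archive. This is one possible approach under stated assumptions, not an ICU4J tool: the function name copy_jar_without is hypothetical, the entry paths in the demonstration are illustrative, and the sketch does not account for signed jars or other jar-specific metadata.

```python
import io
import zipfile


def copy_jar_without(src, dst, unwanted):
    """Copy all entries of jar `src` into a new jar `dst`, skipping `unwanted`.

    `src` and `dst` may be file paths or file-like objects.
    Returns the list of entry names that were left out.
    """
    skipped = []
    with zipfile.ZipFile(src) as old, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as new:
        for item in old.infolist():
            if item.filename in unwanted:
                skipped.append(item.filename)
                continue
            # Passing the original ZipInfo keeps the entry name and timestamp.
            new.writestr(item, old.read(item.filename))
    return skipped


# In-memory demonstration with illustrative entry names.
DATA_DIR = "com/ibm/icu/impl/data/icudt38b/"
demo_src = io.BytesIO()
with zipfile.ZipFile(demo_src, "w") as jar:
    jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    jar.writestr(DATA_DIR + "he.res", b"placeholder")
    jar.writestr(DATA_DIR + "ja.res", b"placeholder")
demo_src.seek(0)

demo_dst = io.BytesIO()
demo_skipped = copy_jar_without(demo_src, demo_dst, {DATA_DIR + "he.res"})
demo_dst.seek(0)
demo_remaining = zipfile.ZipFile(demo_dst).namelist()
```

After the demonstration runs, demo_remaining lists the manifest and ja.res but not he.res. For real use, pass the paths of the original and the new jar file instead of the in-memory buffers, and test the application against the new jar before deploying it.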

Using additional resource files with ICU4J

Note: Resource file formats can change across releases of ICU4J!

The format of ICU4J resources is not part of the API. Clients who develop their own resources for use with ICU4J should be prepared to regenerate them when they move to new releases of ICU4J.

We are still developing ICU4J's resource mechanism. Currently it is not possible to mix ICU's new binary .res resources with traditional Java-style .class or .txt resources. We might allow for this in a future release, but since the resource data and format are not formally supported, you run the risk of incompatibilities with future releases of ICU4J.

Resource data in ICU4J is checked in to the repository as a jar file containing the resource binaries, icudata.jar. This means that inspecting the contents of these resources is difficult. They currently are compiled from ICU4C .txt file data. You can view the contents of the ICU4C text resource files to understand the contents of the ICU4J resources.

The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build directory when the 'core' target is built. Building the 'resources' target will force the resources to once again be extracted. Extraction will overwrite any corresponding resource files already in that directory.

Building ICU4J Resources from ICU4C

Requirements

  1. ICU4C

  2. Compilers and tools required for building ICU4C.

  3. J2SE SDK version 5 or above

Procedure

  1. Download and build ICU4C on a Windows or Linux machine. For details, see the ICU4C download and build instructions.

  2. Follow the remaining instructions in $icu4c_root/source/data/icu4j-readme.txt, where $icu4c_root is the root directory of the ICU4C source package.