This chapter discusses the ICU design structure, ICU versioning support, and the introduction of namespaces in C++.
The JDK internationalization components and the ICU components share the same basic architecture with regard to the following:
There are design features in ICU4C that are not in the Java Development Kit (JDK) due
Locale IDs are composed of language, country, and variant information. The language and country fields use ISO standard codes: ISO-639 for languages and ISO-3166 for countries. For example, Italian, Italy, and Euro are designated as: it_IT_EURO.
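The structure of a locale ID can be illustrated with a short sketch. This is not ICU code: LocaleParts and splitLocaleId are hypothetical names used only for illustration, and the sketch assumes the simple underscore-separated form shown above. ICU provides its own locale APIs for real use.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for the three parts of a locale ID (not an ICU type).
struct LocaleParts {
    std::string language;
    std::string country;
    std::string variant;
};

// Split "language_country_variant" on the first two underscores.
// Missing trailing fields are left empty.
LocaleParts splitLocaleId(const std::string& id) {
    const std::string::size_type npos = std::string::npos;
    LocaleParts parts;
    const std::string::size_type first = id.find('_');
    parts.language = id.substr(0, first);
    if (first == npos) return parts;
    const std::string::size_type second = id.find('_', first + 1);
    parts.country = id.substr(first + 1, second == npos ? npos : second - first - 1);
    if (second != npos) parts.variant = id.substr(second + 1);
    return parts;
}

// Usage: the example ID from the text splits into its three fields.
void demoSplitLocaleId() {
    const LocaleParts p = splitLocaleId("it_IT_EURO");
    assert(p.language == "it");
    assert(p.country == "IT");
    assert(p.variant == "EURO");
}
```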
Data-driven services often use resource bundles for locale data. These services map a key to data. The resources are designed not only to manage system locale information but also to manage application-specific or general services data. ICU resource bundles support string, numeric, and binary data types, and the data can be structured into nested arrays and tables.
This results in the following:
The "open and close" model supports multi-threading. It enables ICU users to use the same kind of service for different locales, either in the same thread or in different threads.
For example, a thread can open many collators for different languages, and different threads can use different collators for the same locale simultaneously. Constant data can be shared so that only the current state is allocated for each open service.
The ICU threading model is designed to avoid contention for resources, and enable you to use the services for multiple locales simultaneously within the same thread. The ICU threading model, like the rest of the ICU architecture, is the same model used for the international services in Java™.
If no locale is supplied when a service is opened, ICU uses the default locale. Once a service is open, changing the default locale has no effect on it. Thus, no thread synchronization is needed between the default locale and open services.
When you open a second service for the same locale, another small chunk of memory is used for the state of the service, with pointers to the same shared, read-only data. Thus, the majority of the memory usage is shared. When any service is closed, then the chunk of memory is deallocated. Other connections that point to the same shared data stay valid.
Any number of services, for the same locale or different locales, can be open within the same thread or in different threads.
In recent ICU releases, we have worked to make any service object thread-safe (usable concurrently) as long as all of the threads are using only const APIs: APIs that are declared const in C++, take a const this-like service pointer in C, or are "logically const" in Java. This is an enhancement over the original Java/ICU threading model. (Originally, concurrent use of even only const APIs was not thread-safe.)
However, you cannot use a reference to an open service object in two threads at the same time if either of them calls any non-const API. An individual open service object is not thread-safe for concurrent "writes". Rather, for non-const use, you must use the clone function to create a copy of the service you want and then pass this copy to the second thread. This procedure allows you to use the same service in different threads, but avoids any thread synchronization or deadlock problems.
Some classes also implement the
Clone operations are designed to be much faster than reopening the service with initial parameters and copying the source's state. (With objects in C++ and Java, the clone function is also much safer than trying to recreate a service, since you get the proper subclass.) Once a service is cloned, changes will not affect the original source service, or vice-versa.
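The relationship between shared read-only data and per-object state can be sketched as follows. This is a minimal illustration, not ICU's implementation: SharedRules and SketchCollator are hypothetical types, the default strength of 3 is arbitrary, and the cache shown here is not itself thread-safe.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical read-only locale data, shared by every service opened for a locale.
struct SharedRules {
    std::string locale;
};

// Hypothetical service object: small mutable state plus a pointer to shared data.
class SketchCollator {
public:
    // Opening a service reuses the shared data for that locale if it exists.
    // Note: this cache is unsynchronized; it only illustrates the sharing.
    static SketchCollator open(const std::string& locale) {
        static std::map<std::string, std::shared_ptr<const SharedRules>> cache;
        std::shared_ptr<const SharedRules>& entry = cache[locale];
        if (!entry) entry = std::make_shared<SharedRules>(SharedRules{locale});
        return SketchCollator(entry);
    }
    // Cloning copies only the per-object state; the shared data is not duplicated.
    SketchCollator clone() const { return *this; }

    void setStrength(int s) { strength_ = s; }  // non-const: changes this object only
    int strength() const { return strength_; }
    const SharedRules* rules() const { return rules_.get(); }

private:
    explicit SketchCollator(std::shared_ptr<const SharedRules> r) : rules_(std::move(r)) {}
    std::shared_ptr<const SharedRules> rules_;
    int strength_ = 3;
};

// Usage: a clone can be modified (for example, in another thread)
// without affecting the original, while both point at the same shared data.
void demoClone() {
    SketchCollator original = SketchCollator::open("fr");
    SketchCollator copy = original.clone();
    copy.setStrength(1);
    assert(original.strength() == 3);
    assert(copy.strength() == 1);
    assert(original.rules() == copy.rules());
}
```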
Thus, the normal mode of operation is to:
Typically, the services supplied with ICU cover the vast majority of usages. However, there are circumstances where the service needs to be customized for a new locale. ICU (and Java) enable you to create customized services. For example, you can create a RuleBasedCollator by merging the rules for French and Arabic to get a custom French-Arabic collation sequence. By merging these rules, the pointer does not point to a read-only table that is shared between threads. Instead, the pointer refers to a table that is specific to your particular open service. If you clone the open service, the table is copied. When you close the service, the table is destroyed.
For some services, ICU supplies registration. You can register a customized open service under an ID, which keeps a copy of that service even after you close the original. A client in that thread or in other threads can recreate a copy of the service by opening with that ID.
ICU may cache service instances. Therefore, registration should be done during startup, before opening services by locale ID.
These registrations are not persistent; once your program finishes, ICU flushes all the registrations. While you still might have multiple copies of data tables, it is faster to create a service from a registered ID than it is to create a service from rules.
For services whose IDs are locales, such as collation, the registered IDs must also be locales. For those services (like Transliteration or Timezones) that are cross-locale, the IDs can be any string.
Prospective future enhancements for this model are:
Some application environments operate by setting a per-thread (or per-process) locale ID, and then not passing the locale ID as a parameter during processing. If this usage model were used with ICU in a multi-threaded server, it might result in ICU being requested to constantly open, use, and then close service objects. Instead, it is recommended that the locale ID associated with each client be stored with other per-client data, along with any service objects (such as collators or formatters) that the client might use. If operations involving a single client are short-lived, it might be more efficient to keep a pool of service objects, organized according to locale. Then, if a particular locale's formatter is in high demand, that formatter can be taken from the pool, used, and then returned to the pool.
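A pool of service objects organized by locale might look like the following sketch. SketchFormatter and SketchPool are hypothetical names, not ICU types, and locking is deliberately omitted; a multi-threaded server would need to guard acquire and release.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical service object; stands in for a collator or formatter.
struct SketchFormatter {
    explicit SketchFormatter(const std::string& loc) : locale(loc) {}
    std::string locale;
};

// Pool of idle service objects, keyed by locale ID.
// Locking is omitted in this sketch.
class SketchPool {
public:
    // Hand out an idle object for the locale, or open a new one if none is idle.
    std::unique_ptr<SketchFormatter> acquire(const std::string& locale) {
        std::vector<std::unique_ptr<SketchFormatter>>& idle = idle_[locale];
        if (idle.empty()) {
            ++opened_;
            return std::unique_ptr<SketchFormatter>(new SketchFormatter(locale));
        }
        std::unique_ptr<SketchFormatter> f = std::move(idle.back());
        idle.pop_back();
        return f;
    }
    // Return an object to the pool instead of closing it.
    void release(std::unique_ptr<SketchFormatter> f) {
        const std::string key = f->locale;
        idle_[key].push_back(std::move(f));
    }
    int opened() const { return opened_; }

private:
    std::map<std::string, std::vector<std::unique_ptr<SketchFormatter>>> idle_;
    int opened_ = 0;
};

// Usage: a released object is reused; a new one is opened only when none is idle.
void demoPool() {
    SketchPool pool;
    std::unique_ptr<SketchFormatter> a = pool.acquire("de");
    assert(pool.opened() == 1);
    pool.release(std::move(a));
    std::unique_ptr<SketchFormatter> b = pool.acquire("de");  // reused
    assert(pool.opened() == 1);
    std::unique_ptr<SketchFormatter> c = pool.acquire("de");  // pool empty: open another
    assert(pool.opened() == 2);
    assert(b->locale == "de" && c->locale == "de");
}
```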
ICU4C APIs are designed to allow separate heaps for its libraries vs. the application. This is achieved by providing functions to allocate and release objects owned by ICU4C using only ICU4C library functions. For more details see the Memory Usage section in the Coding Guidelines.
The ICU library does not normally require any explicit initialization prior to use. An
In C++ programs, ICU objects and APIs may safely be used during static initialization of other application-defined classes or objects. There are no order-of-initialization problems between ICU and static objects from other libraries because ICU does not rely on C++ static object initialization for its normal operation.
When an application is terminating, it may optionally call the function u_cleanup(void), which will free any heap storage that has been allocated and held by the ICU library. The main benefit of u_cleanup() occurs when using memory leak checking tools while debugging or testing an application. Without u_cleanup(), memory being held by the ICU library will be reported as leaks.
(On some platforms, the configure option --enable-auto-cleanup (or defining the option UCLN_NO_AUTO_CLEANUP to 0) will add code which automatically cleans up ICU when its shared library is unloaded. See the comments in ucln_imp.h.)
There is one specialized case where extra care is needed to safely initialize ICU. This situation will arise only when ALL of the following conditions occur:
To safely initialize the ICU library when all of the above conditions apply, the application must explicitly arrange for a first-use of ICU from a single thread before the multi-threaded use of ICU begins (see below for basic steps in safely initializing the ICU library). A convenient ICU operation for this purpose is uloc_getDefault(), declared in the header file "unicode/uloc.h".
To maximize portability, this version of ICU uses only the subset of the C++ language that compiles correctly on older C++ compilers, and it provides a usable C interface. Thus, there is no use of the C++ exception mechanism in the code or Application Programming Interface (API).
To communicate errors reliably and support multi-threading, this version uses an error code parameter mechanism. Every function that can fail takes an error-code parameter by reference. This parameter is always the last parameter listed for the function.
The UErrorCode parameter is defined as an enumerated type. Zero represents no error, positive values represent errors, and negative values represent non-error status codes. Macros (U_SUCCESS and U_FAILURE) are provided to check the error code.
The UErrorCode parameter is an input-output parameter. Every function tests the error code before performing any other task and immediately exits if the code already indicates a failure. If the function fails later on, it sets the error code appropriately and exits without performing any other work, except for any cleanup it needs to do. If the function encounters a non-error condition that it wants to signal, such as "encountered an unmapped character" in conversion, the function sets the error code appropriately and continues. Otherwise, the function leaves the error code unchanged.
Generally, only the functions that do not take a UErrorCode parameter, but call functions that do, must declare a variable. Almost all functions that take a UErrorCode parameter,
ICU enables you to call several functions (that take error codes) successively without having to check the error code after each function. Each function usually must check the error code before doing any other processing, since it is supposed to stop immediately after receiving an error code. Propagating the error-code parameter down the call chain saves the programmer from having to declare the parameter in every instance and also mimics the C++ exception protocol more closely.
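The error-code convention described above can be sketched without ICU itself. In this illustration, SketchErrorCode, sketchFailure, sketchSuccess, parseDigit, and sumDigits are all hypothetical stand-ins; they mimic the described behavior (zero for no error, positive for errors, negative for non-error status codes) rather than reproduce ICU's actual definitions.

```cpp
#include <cassert>

// Hypothetical stand-in for UErrorCode: zero is "no error", negative values
// are non-error status codes, positive values are errors.
enum SketchErrorCode {
    SKETCH_UNMAPPED_WARNING = -1,
    SKETCH_ZERO_ERROR = 0,
    SKETCH_ILLEGAL_ARGUMENT = 1
};

// Stand-ins for the U_FAILURE and U_SUCCESS checks.
inline bool sketchFailure(SketchErrorCode c) { return c > SKETCH_ZERO_ERROR; }
inline bool sketchSuccess(SketchErrorCode c) { return c <= SKETCH_ZERO_ERROR; }

// The error code is the last parameter. The function exits immediately
// if the incoming code already indicates a failure.
int parseDigit(char c, SketchErrorCode& status) {
    if (sketchFailure(status)) return 0;
    if (c < '0' || c > '9') {
        status = SKETCH_ILLEGAL_ARGUMENT;
        return 0;
    }
    return c - '0';
}

// Propagates its own error-code parameter down the call chain, so the
// individual calls do not need to be checked one by one.
int sumDigits(const char* s, SketchErrorCode& status) {
    if (sketchFailure(status)) return 0;
    int total = 0;
    for (; *s != '\0'; ++s) total += parseDigit(*s, status);
    return sketchFailure(status) ? 0 : total;
}

// Usage: the caller declares the variable, sets it to "no error", and checks once.
void demoErrorCode() {
    SketchErrorCode status = SKETCH_ZERO_ERROR;
    assert(sumDigits("123", status) == 6);
    assert(sketchSuccess(status));

    assert(sumDigits("1x3", status) == 0);  // 'x' sets the error code
    assert(sketchFailure(status));

    // status still holds the failure, so this call does no work.
    assert(sumDigits("456", status) == 0);
    assert(sketchSuccess(SKETCH_UNMAPPED_WARNING));  // warnings are not failures
}
```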
There are 3 major extensibility elements in ICU:
A resource bundle is a set of <key,value> pairs that provide a mapping from key to value. A given program can have different sets of resource bundles; one set for error messages, one for menus, and so on. However, the program may be organized to combine all of its resource bundles into a single related set.
The set is organized into a tree with "root" at the top, the language at the first level, the country at the second level, and additional variants below these levels. The set must contain a root that has all keys that can be used by the program accessing the resource bundles.
Except for the root, each resource bundle has an immediate parent. For example, if there is a resource bundle "X_Y_Z", then there must be the resource bundles: "X_Y", and "X". Each child resource bundle can omit any <key,value> pair that is identical to its parent's pair. (Such omission is strongly encouraged as it reduces data size and maintenance effort.) It must override any <key,value> pair that is different from its parent's pair. If you have a resource bundle for the locale ID "language_country_variant", you must also have
If a program doesn't find a key in a child resource bundle, it can be assumed that the child has the same value for that key as its parent. The default locale has no effect on this. The particular language used for the root is commonly English, but it depends on the developer's preference. Ideally, the root should contain values that minimize the need for its children to override it.
The default locale is used only when there is not a resource bundle for a given language. For example, there may not be an Italian resource bundle. (This is very different than the case where there is an Italian resource bundle that is missing a particular key.) When a resource bundle is missing, ICU uses the parent unless that parent is the root. The root is an exception because the root language may be completely different than its children. In this case, ICU uses a modified lookup and the default locale. The following are different lookup methods available:
Lookup chain: Searching for a resource bundle.
Lookup chain: Searching for a <key, value> pair after en_US_some-variant has been loaded. ICU does not use the default locale in this case.
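The <key, value> lookup through the parent chain can be sketched as follows. This is an illustration under simplifying assumptions, not ICU's implementation: parentChain, Bundle, and findValue are hypothetical names, bundles are modeled as plain string maps, and the default-locale step used when a whole bundle is missing is not modeled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Build the parent chain for a locale ID by repeatedly dropping the last
// "_field" and ending at "root".
std::vector<std::string> parentChain(std::string id) {
    std::vector<std::string> chain;
    while (!id.empty() && id != "root") {
        chain.push_back(id);
        const std::string::size_type cut = id.rfind('_');
        id = (cut == std::string::npos) ? std::string() : id.substr(0, cut);
    }
    chain.push_back("root");
    return chain;
}

// Hypothetical bundle model: a flat map from key to string value.
typedef std::map<std::string, std::string> Bundle;

// Look for a key in the child first, then in each parent up to the root.
// The default locale plays no part in this search.
const std::string* findValue(const std::map<std::string, Bundle>& bundles,
                             const std::string& localeId,
                             const std::string& key) {
    for (const std::string& name : parentChain(localeId)) {
        const auto bundle = bundles.find(name);
        if (bundle == bundles.end()) continue;
        const auto entry = bundle->second.find(key);
        if (entry != bundle->second.end()) return &entry->second;
    }
    return nullptr;
}

// Usage: the child overrides one key and inherits the other from the root.
void demoLookup() {
    std::map<std::string, Bundle> bundles;
    bundles["root"]["greeting"] = "Hello";
    bundles["root"]["farewell"] = "Goodbye";
    bundles["en"];  // present but empty: inherits everything from the root
    bundles["en_GB"]["farewell"] = "Cheerio";

    assert(parentChain("en_US_POSIX").size() == 4);
    assert(*findValue(bundles, "en_GB", "greeting") == "Hello");
    assert(*findValue(bundles, "en_GB", "farewell") == "Cheerio");
    assert(findValue(bundles, "en_GB", "missing") == nullptr);
}
```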
ICU supports extensive version code and data changes and introduces namespace usage.
Version changes show clients when parts of ICU change. ICU; its components (such as Collator); each resource bundle, including all the locale data resource bundles; and individual tagged items within a resource bundle, have their own version numbers. Version numbers numerically and lexically increase as changes are made.
All version numbers are used in Application Programming Interfaces (APIs) with a UVersionInfo structure. The UVersionInfo structure is an array of four unsigned bytes. These bytes are:
Two UVersionInfo structures may be compared using binary comparison (memcmp) to see which is larger or newer. Version numbers may be different for different services. For instance, do not compare the ICU library version number to the ICU collator version number.
UVersionInfo structures can be converted to and from string representations as dotted integers (such as "188.8.131.52") using the u_versionToString() and u_versionFromString() functions. String representations may omit trailing zeros.
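The four-byte layout, binary comparison, and dotted-integer conversion can be sketched with standard library calls only. SketchVersion, sketchVersionFromString, and sketchVersionToString are hypothetical stand-ins that mimic the behavior described above; they are not the ICU functions, and this sketch does no range checking on field values.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical stand-in for UVersionInfo: an array of four unsigned bytes.
typedef unsigned char SketchVersion[4];

// Parse up to four dot-separated fields; omitted trailing fields become 0.
// Fields above 255 are not range-checked in this sketch.
void sketchVersionFromString(SketchVersion v, const char* s) {
    unsigned int f[4] = {0, 0, 0, 0};
    std::sscanf(s, "%u.%u.%u.%u", &f[0], &f[1], &f[2], &f[3]);
    for (int i = 0; i < 4; ++i) v[i] = static_cast<unsigned char>(f[i]);
}

// Format as dotted integers, omitting trailing zero fields.
std::string sketchVersionToString(const SketchVersion v) {
    int count = 4;
    while (count > 1 && v[count - 1] == 0) --count;
    std::string out;
    for (int i = 0; i < count; ++i) {
        if (i > 0) out += '.';
        out += std::to_string(static_cast<unsigned int>(v[i]));
    }
    return out;
}

// Usage: binary comparison with memcmp orders versions of the same service.
void demoVersion() {
    SketchVersion older, newer;
    sketchVersionFromString(older, "4.8.1");
    sketchVersionFromString(newer, "49.1");
    assert(std::memcmp(older, newer, sizeof(SketchVersion)) < 0);
    assert(sketchVersionToString(older) == "4.8.1");
    assert(sketchVersionToString(newer) == "49.1");
}
```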
The interpretation of version numbers depends on what is being described.
The first version number field contains the ICU release version number, for example 49. Each new version might contain new features, new locale data, and modified behavior. (See below for more information on ICU Binary Compatibility.)
The second field is 1 for the initial release (e.g., 49.1). The second and sometimes third fields are incremented for binary compatible maintenance releases.
(The second field is 0 during development, with milestone numbers in the third field during that time. For example, 49.0.1 for 49 milestone 1.)
In earlier releases, the first two version fields together indicated the ICU release, for example 4.8. The third field was 0 for the initial release, and 1 and higher for binary compatible (bug fixes only) maintenance releases (e.g., 4.8.1). The fourth field was used for updates specific to only one of Java, C++, or ICU-in-Eclipse.
The second version field was even for formal releases ("reference releases") (e.g., 1.6 or 4.8) and odd during their development (unreleased unstable snapshot versions; e.g., 4.7). During development, the third field contained the milestone number (e.g., 4.7.1 for 4.8 milestone 1). For very old ICU code, we published semi-formal “enhancement” releases with odd second-field numbers (e.g., 1.7).
Library filenames and some other internal uses already used a concatenation of the first two fields ("48" for 4.8).
Resource Bundles and Elements
The data stored in resource bundles is tagged with version numbers. A resource bundle can contain a tagged string named "Version" that declares the version number in dotted-integer format. For example,
A resource bundle may omit the "Version" element and thus will inherit a version along the usual chain. For example, if the resource bundle en_US contained no "Version" element, it would inherit "184.108.40.206" from the parent en element. If inheritance passes all the way to the root resource bundle and it contains no "Version" resource, then the resource bundle receives the default version number 0.
Elements within a resource bundle may also contain version numbers. For example:
In this example, the CollationElements data is version 220.127.116.11. This element version is not related to the version of the bundle.
Internally, data files carry format and other version numbers. These version numbers ensure that ICU can use the data file. The interpretation depends entirely on the data file type. Often, the major number in the format version stays the same for backwards-compatible changes to a data file format. The minor format version number is incremented for additions that do not violate the backwards compatibility of the data file.
ICU component version numbers may be found using:
A major new feature in ICU 2.0 is the ability to link to different versions of ICU with the same program. Using this new feature, a program can keep using ICU 1.8 collation, for example, while using ICU 2.0 for other services. ICU now can also be unloaded if needed, to free up resources, and then reloaded when it is needed.
ICU 2.0 introduced the use of a C++ namespace to avoid naming collision between ICU exported symbols and other libraries. All the public ICU C++ classes are defined in the "icu_VersionNumber::" namespace, which is also aliased as namespace "icu". Starting with ICU 2.0, including any public ICU C++ header by default includes a "using namespace icu_VersionNumber" statement. This is for backward compatibility, and should be turned off in favor of explicitly using icu::UnicodeString etc. (see How To Use ICU). (If entry point renaming is turned off, then only the unversioned "icu" namespace is used.)
Starting with ICU 49, ICU4C requires namespace support.
It is sometimes useful to see a dependency chart between the public ICU APIs and ICU libraries. This chart can be useful to people that are new to ICU or to people that want only certain ICU libraries.
Here are some things to realize about the chart.
Starting with ICU 49, the dependencies of code files (.o files compiled from .c/.cpp) are documented in source/test/depstest/dependencies.txt. Adjacent Python code is used to parse this file and to verify that it matches the actual dependencies of the code files.
The dependency list can be used to build subset libraries. In addition, by reducing intra-library dependencies, the code size of statically linked ICU code has been reduced.
ICU APIs, as defined in header and class files, are either "external" or "internal". External APIs are meant to be used by applications, while internal APIs should be used only within ICU. APIs are marked to indicate whether they are external or internal, as follows. Every external API has a lifecycle label, see below.
External ICU4C APIs are
Exception: Layout engine header files are not in a unicode folder, although the public ones are still copied to the include/unicode folder at build/install time. External layout engine APIs are the ones that have lifecycle labels and not an "@internal" label.
External ICU4J APIs are
"System" APIs are external APIs that are intended only for special uses for system-level code, for example u_cleanup(). Normal users should not use them, although they are public and supported. System APIs have a "@system" label in addition to the lifecycle
All APIs that do not fit any of the descriptions above are internal, which means that they are for ICU internal use only and may change at any time without notice. Some of them are member functions of public C++ or Java classes, and are "technically public but logistically internal" for implementation reasons, typically because programming languages don't provide sufficient access control (without clumsy mechanisms). In this case, such APIs have an "@internal" label.
As ICU develops, it adds external APIs - functions, classes, constants, and so on. Occasionally it is also necessary to remove or change external APIs. In order to make this work, we use the following process:
For all API changes (and for significant/controversial/difficult implementation changes), we use proposals to announce and discuss them. A proposal is simply an email to the icu-design mailing list that details what is proposed to be changed, with an expiration date of typically a week. This gives all mailing list members a chance to review upcoming changes, and to discuss them. A proposal often changes significantly as a result of discussion. Most proposals will eventually find consensus among list members; otherwise, the PMC decides what to do. If the addition or change of APIs would affect you, please subscribe to the main icu-design mailing list.
When a new API is added to ICU, it is marked as draft with a "@draft ICU x.y" label in the API documentation, where x.y is the ICU version when the API signature was introduced or last changed. A draft API is not guaranteed to be stable! Although we will not make gratuitous changes, sometimes a draft API turns out to be unsatisfactory in actual practice and may need to be changed or even removed. Changes to "draft" APIs are subject to the proposal process described above.
When a @draft ICU x.y API is changed, it must remain @draft and its version number must be updated.
In ICU4J 3.4.2 and earlier, @draft APIs were also marked with Java's @deprecated tag, so that uses of draft APIs in client code would be flagged by the compiler. These uses of the @deprecated tag were indicated with the comment “This is a draft API and might change in a future release of ICU.” Many clients found this confusing and/or undesirable, so ICU4J 3.4.3 no longer marks draft APIs with the @deprecated tag by default. For clients who prefer the earlier behavior, ICU4J provides an ant build target, 'restoreDeprecated', which will update the source files to use the @deprecated tag. Then clients can just rebuild the ICU4J jar as usual.
When an API is judged to be stable and has not been changed for at least one ICU release, it is relabeled as stable with a "@stable ICU x.y" label in the API documentation. A stable API is expected to be available in this form for a long time. The ICU version x.y indicates the last time the API signature was introduced or changed. The promotion from @draft ICU x.y to @stable ICU x.y must not change the x.y version number.
We occasionally make an exception and allow adding new APIs marked as @stable ICU x.y APIs in the x.y release itself if we believe that they have to be stable. We might do this for enum constants that reflect 1:1 Unicode property aliases and property value aliases, for a Unicode upgrade in the x.y release.
We sometimes "broaden" a @stable API function by changing its signature in a compatible way. For example, in Java, we might change an input parameter from a String to a CharSequence. In this case we keep the @stable label but update the ICU version number to indicate the function signature change.
Even a stable API may eventually need to become deprecated or obsolete. Such APIs are strongly discouraged from use. Typically, an improved API is introduced at the time of deprecation/obsolescence of the old one.
For example, here is how an API might be tagged in various versions:
ICU4C may be configured for use as a system library in an environment where applications that are built with one version of ICU must continue to run without change with later versions of the ICU shared library.
Here are the requirements for enabling binary compatibility for ICU4C:
Stable APIs do not guarantee that the results from every function will always be completely identical between ICU versions (see the Version Numbers section above). Bugs may be fixed. The Unicode character data may change with new versions of the Unicode standard. Locale data may be updated or changed, yielding different results for operations like formatting or collation. Applications that require exact bit-for-bit, bug-for-bug compatibility of ICU results should not rely on ICU release-to-release binary compatibility, but should instead link against a specific version of ICU.
To verify that an application uses only stable APIs, build it with the C preprocessor symbols U_HIDE_DRAFT_API and U_HIDE_DEPRECATED_API defined. This will produce build errors if any draft, deprecated or obsolete APIs are used. An operating system level installation of ICU may set this option permanently.
C APIs only. Only plain C APIs remain compatible across ICU releases. C++ binary compatibility is not supported, primarily because the design of the C++ language and runtime environments presents extreme technical difficulties to doing so. Stable C++ APIs are source compatible, but applications using them must be recompiled when moving between ICU releases.
Function renaming disabled. Function renaming is an ICU feature that allows an application to explicitly link against a specific version of the ICU library, and to continue to use that version even when other ICU versions exist in the runtime environment. This is the exact opposite of release-to-release binary compatibility – instead of being able to transparently change ICU versions, an application is explicitly tied to one specific version.
Function renaming is enabled by default, and must be disabled at ICU build time to enable release to release binary compatibility. To disable renaming, use the configure option
configure --disable-renaming [other configure options]
(Configure options may also be passed to the runConfigureICU script.)
To enable release-to-release binary compatibility, ICU must be built with --disable-renaming, and applications must be built using the headers and libraries that resulted from the --disable-renaming ICU build.
ICU Version 3.0 or Later. Binary compatibility of ICU releases is supported beginning with ICU version 3.0. Older versions of ICU (2.8 and earlier) do not provide for binary compatibility between versions.
This section is intended to aid software developers who are implementing or integrating solutions based on ICU and who may need to consider having multiple versions of ICU running within the same executable (address space) at once. Typically, users of ICU are encouraged to update to the latest stable version. Under certain circumstances, however, behavior from earlier versions is desired, or an application links together code that is already built against a different version of ICU.
The major and minor numbers are the first and second numbers in a version number, separated by a period. For example, in the version numbers 3.4.2.1, 3.4.2, or 3.4, "3" is the major, and "4" is the minor. Normally, ICU employs "symbol renaming", such that the C function names and C++ object names are #defined to contain the major and minor numbers. So, for example, if your application calls the function "ucnv_open()", it will link against "ucnv_open_3_4" if compiled against ICU 3.4, 3.4.2, or even 3.4.2.1. However, if compiled against ICU 3.8, the same code will link against "ucnv_open_3_8". Similarly, UnicodeString is renamed to UnicodeString_3_4, etc. This is normally transparent to the user; however, if you inspect the symbols of the library or your code, you will see the modified symbols.
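The renaming mechanism can be illustrated with a few preprocessor macros. This is a sketch of the general technique, not a copy of ICU's headers: every SKETCH_ macro and the sketch_open function are hypothetical, and the _3_4 suffix is hard-coded here for illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical version suffix, hard-coded for this sketch.
#define SKETCH_SUFFIX _3_4

// Two-level concatenation so that SKETCH_SUFFIX is expanded before pasting.
#define SKETCH_CONCAT2(name, suffix) name##suffix
#define SKETCH_CONCAT(name, suffix) SKETCH_CONCAT2(name, suffix)
#define SKETCH_RENAME(name) SKETCH_CONCAT(name, SKETCH_SUFFIX)

// Two-level stringification, used only to show the renamed symbol.
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x) SKETCH_STR2(x)

// Callers write sketch_open(); the preprocessor rewrites it to sketch_open_3_4.
#define sketch_open SKETCH_RENAME(sketch_open)

// This actually defines a function named sketch_open_3_4.
int sketch_open() { return 34; }

// Usage: the unversioned name in source code resolves to the versioned symbol.
void demoRenaming() {
    assert(std::strcmp(SKETCH_STR(sketch_open), "sketch_open_3_4") == 0);
    assert(sketch_open() == 34);
    assert(sketch_open_3_4() == 34);  // the versioned symbol is what really exists
}
```

Code compiled against a different suffix would refer to a differently named symbol, which is why two renamed versions can coexist in one address space.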
If there are multiple versions of ICU being linked against in one application, it will need to link against all relevant libraries for each version, for example, common, i18n, and data. ICU uses standard library renaming, where, for example, 'libicuuc.so' on one platform will actually be a symbolic link to 'libicuuc.so.3.4'. When multiple ICU versions are used, the application may need to explicitly link against the exact versions of ICU being used.
To disable renaming, build ICU with "--disable-renaming" passed to configure. Or, set the equivalent "#define U_DISABLE_RENAMING 1". Renaming must be disabled both in the ICU build, and in the calling application.
The binary compatibility of the data refers to the resource bundle binary format that contains the locale data, charset conversion tables and other file formats supported by ICU. These binary formats are readable by many versions of ICU. For example, resource bundles written with ICU 3.6 are readable by ICU 3.8.
The structural compatibility of the data refers to the structural contents of the ICU data. The structure of the locale data may change between reference releases, but the keys to reference specific types of data will be the same between maintenance releases. This means that resource keys to access data within resource bundles will work between maintenance releases of a specific reference release. For example, an ICU 3.8 calendar will be able to use ICU 3.8.1 data, and vice versa; however, ICU 3.6 may not be able to read ICU 3.8 locale data. Generally, these keys are not accessible by ICU users because only the ICU implementation uses these resource keys.
The contents of the data library may change between ICU maintenance releases and give you different results due to important updates and bug fixes. An example of an important update would be a timezone rule update for when a country changes when daylight saving time occurs. So the results may be different between maintenance releases.
Starting in ICU4J 3.6, ICU4J stable API classes (marked as @stable) implementing java.io.Serializable support deserialization of serialized objects by ICU4J 3.6 or a newer version of ICU4J. Some classes perform only shallow serialization; therefore, it is not guaranteed that a deserialized object behaves exactly the same as the original object across ICU4J versions. Also, when it is difficult to maintain serialization compatibility in a certain class across different ICU4J versions for technical or other reasons, the ICU project committee may approve the breakage. In such an event, a note explaining the compatibility issue will be posted in the ICU public mailing lists and also documented in the release notes of the new ICU4J version introducing the incompatibility.