Note: This page describes the use of ICU4C Resource Management techniques and APIs. For an overview of the message localization process using ICU, see the related page Localizing with ICU.
A software product that needs to be localized wins or loses depending on how easy it is to change the data or "resources" that affect users. From the simplest point of view, that data is the information presented to the user (such as a translated message) as well as the region-specific ways of doing things (such as sorting). The process of localization will eventually involve translators, and it would be very convenient if localization could be done by translators and experts in the target culture alone. There are several points to keep in mind when designing such a localizable software product.
Keeping Data Separate
Obviously, one does not want to make translators wade through the source code and make changes there; that would be a recipe for disaster. Instead, the translatable data should be kept separately, in a format that allows translators easy access. A separate resource management mechanism is hence required. Applications access data through API calls, which pick the appropriate entries from the resources. Resources are kept in a human-readable and editable format, with optional tools for content editing.
The data should contain all the elements to be localized, including, but not limited to, GUI messages, icons, formatting patterns, and collation rules. A convenient way of keeping binary data should also be provided, since icons often need to differ between cultures.
It is not unlikely that the data will be the same for several regions. Take Spanish-speaking countries, for example: the names of the days and months are the same in both Mexico and Spain. It would be very beneficial to prevent such duplication of data. This can be achieved by structuring resources so that an unsuccessful query into a more specific resource triggers the same query in a more general resource. A convenient way to do this is to use a tree-like structure.
Another way to reduce the data size is to allow linking of resources that are the same for regions that do not stand in a general-to-specific relation to each other.
Sometimes, the exact data for a region is still not available. However, if the data is structured correctly, the user can be presented with similar data. For example, a Spanish-speaking user in Mexico would probably be happier with Spanish than with English captions, even if some of the details for Mexico are not there.
If the data is grouped correctly, the program can automatically find the most suitable data for the situation.
The previous points all lead to a mechanism that stores data separately from the code. Software accesses the data through API calls. Data is organized in a tree-like structure, with the most general region at the root (most commonly, the root region is the native language of the development team). Branches lead to more specialized regions, usually through languages, countries and regions within countries. Data that is already the same at a more general level is not repeated.
Here is an example of such a resource tree structure:
Let us assume that the root resource contains data written by the original implementors and that this data is in English and conforms to the conventions used in the United States. Therefore, the resources for English and for English in the United States would be empty and would take their data from the root resource. If a version for Ireland is required, appropriate overriding changes can be made to the data for English in Ireland. Special variant information could be put into en_US_POSIX if specific legacy formatting were required, or if specific sub-region information were required. When making the version for the German-speaking region, all the German data would be in that resource, with the differences in the Germany and Austria resources.
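The general-to-specific lookup order described above can be sketched in plain C. This is only an illustration, not ICU's implementation: it assumes locale IDs of the form language_COUNTRY_VARIANT and a root resource named root, and the function names parent_locale and fallback_chain are made up for this page. ICU's real fallback logic covers more cases, such as script tags and the default locale.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the next, more general locale ID into `parent`.
 * Returns 1 if a parent exists, 0 if `locale` is already the root. */
static int parent_locale(const char *locale, char *parent, size_t cap)
{
    const char *cut;

    if (locale[0] == '\0' || strcmp(locale, "root") == 0) {
        return 0;
    }
    cut = strrchr(locale, '_');
    if (cut == NULL) {
        /* A bare language such as "en" falls back to the root. */
        snprintf(parent, cap, "root");
    } else {
        /* Drop the last subtag: "en_US_POSIX" becomes "en_US". */
        snprintf(parent, cap, "%.*s", (int)(cut - locale), locale);
    }
    return 1;
}

/* Builds a printable chain such as "en_IE -> en -> root" in `out`. */
static void fallback_chain(const char *locale, char *out, size_t cap)
{
    char current[64];
    char next[64];

    assert(locale != NULL && out != NULL && cap > 0);
    snprintf(current, sizeof current, "%s", locale);
    snprintf(out, cap, "%s", current);
    while (parent_locale(current, next, sizeof next)) {
        size_t used = strlen(out);
        snprintf(out + used, cap - used, " -> %s", next);
        memcpy(current, next, sizeof current);
    }
}
```

Calling fallback_chain with "en_US_POSIX" walks through en_US and en before reaching root, which is the order in which a query that misses in a specific resource would be retried in more general ones.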
It is important to note that some locales have an optional script tag. This is important for multiscript locales, like Uzbek, Azerbaijani, Serbian or Chinese. Even though Chinese uses Han characters, the characters are usually identified as either traditional Chinese (Hant) or simplified Chinese (Hans).
Even if all the data that would go into a certain resource comes from the more general resources, it should be made clear that the particular region is supported by the application. This can be done by having completely empty resources.
ICU bases its resource management model on the ideas presented above. All the resource APIs are concentrated in the resource bundle framework. This framework is closely tied in its functioning to the ICU Locale naming scheme.
ICU provides and relies on a set of locale-specific data in the resource bundle format. If we think that we have correct data for a requested locale, even if all its data comes from a more general locale, we will provide an empty resource bundle. This is reflected in our informational return codes (see the section on APIs). Many ICU frameworks (collation, formatting, etc.) rely on the data stored in resource bundles.
Resource bundles rely on the ICU data framework. For more information on the functioning of ICU data, see the appropriate section.
Users of the ICU library can also use the resource bundle framework to store and retrieve localizable data in their projects.
Resource bundles are collections of resources. Individual resources can contain data or other resources.
An essential part of ICU's resource management framework is the fallback mechanism. It ensures that if the data for the requested locale is missing, an effort will be made to obtain the most usable data. Fallback can happen in two situations: when a resource bundle is opened for a locale, and when an individual resource is requested from an open bundle.
ICU allows and requires that application-specific data be stored apart from the ICU internal data (locale, converter, transformation data, etc.). Application data should be stored in packages. ICU uses the default package (NULL) for its data. All of ICU's build tools provide a means to specify the package for your data. More about how to package application data can be found below.
ICU4C provides both C and C++ APIs for using resource bundles. The core implementation is in C, while the C++ APIs are only a thin wrapper around it. Therefore, the code using C APIs will generally be faster.
Resource bundles use ICU's "open use close" paradigm. In C all the resource bundle operations are done using the UResourceBundle* handle. UResourceBundle* allows access to both resource bundles and individual resources. In C++, class ResourceBundle should be used for both resource bundles and individual resources.
To use the resource bundle framework, you need to include the appropriate header file, unicode/ures.h for C and unicode/resbund.h for C++.
If an operation on a resource bundle fails, an error code will be set. It is important to check the value of the error code. In C, the usual construct is to test the code with the U_SUCCESS(status) macro and branch on the result.
The most common C resource bundle opening API is UResourceBundle* ures_open(const char* package, const char* locale, UErrorCode* status). The first argument specifies the package name, or NULL for the default ICU package. The second argument is the locale for which you want the resource bundle. Special values for the locale are NULL for the default locale and "" (empty string) for the root locale. The third argument should be set to U_ZERO_ERROR before calling the function; on return, it holds the status of the operation. Apart from returning regular errors, it can return two informational/warning codes: U_USING_FALLBACK_WARNING and U_USING_DEFAULT_WARNING. The first code means that the requested resource bundle was not found and that a more general bundle was returned. If you are opening ICU resource bundles, note that this means we do not guarantee that the contents of the opened resource bundle are correct for the requested locale. The situation might be different for application packages. U_USING_DEFAULT_WARNING, however, means that no more general resource bundle was found either and that you were given the resource bundle that is the default for the system or the root resource bundle. This will almost certainly contain wrong data.
There are a couple of other opening APIs. ures_openDirect takes the same arguments as ures_open but will fail if the requested locale is not found. Also, if opening is successful, no fallback will be performed if an individual resource is not found. The second one, ures_openU, takes a UChar* for the package name instead of a char*.
In C++, opening is done through a constructor. There are several constructors. The most notable difference from the C APIs is that the package should be given as a UnicodeString and the locale is passed as a Locale object. There is also a copy constructor and a constructor that takes a C UResourceBundle* handle. The result is a ResourceBundle object. The remarks about informational codes are also valid for the C++ APIs.
In C++, opening would look like this:
After use, resource bundles need to be closed to prevent memory leaks. In C, you should call the void ures_close(UResourceBundle* resB) API. In C++, if you have just used ResourceBundle objects, going out of scope will close the bundles. When using allocated objects, make sure that you call the appropriate delete function.
As already mentioned, resource bundles and resources share the same type. You can close bundles and resources in any order you like. You can invoke ures_close on NULL resource bundles. Therefore, you can always call this API regardless of the success of previous operations.
Once you are in the possession of a valid resource bundle, you can access the resources and data that it holds. The result of accessing operations will be a new resource bundle object. In C, UResourceBundle* handles can be reused by using the fill-in parameter. That saves you from frequent closing and reallocating of resource bundle structures, which can dramatically improve the performance. C++ APIs do not provide means for object reuse. All the C examples in the following sections will use a fill-in parameter.
Resource bundles can contain two main types of resources: complex and simple resources. Complex resources store other resources and can have named or unnamed elements. Tables store named elements, while arrays store unnamed ones. Simple resources contain data which can be string, binary, integer array or a single integer.
There are several ways for accessing data stored in the complex resources. Tables can be accessed using keys, indexes and by iteration. Arrays can be accessed using indexes and by iteration.
In order to be able to distinguish between resources, one needs to know the type of the resource at hand. To find this out, use the UResType ures_getType(UResourceBundle *resourceBundle) API, or the C++ analog UResType getType(void). UResType is an enumeration defined in the unicode/ures.h header file.
To access resources using a key, you can use the UResourceBundle* ures_getByKey(const UResourceBundle *resourceBundle, const char* key, UResourceBundle *fillIn, UErrorCode *status) API. The first argument is the parent resource bundle, which can be either a resource bundle opened using ures_open or similar APIs, or a table resource. The key is always specified using invariant characters. The fill-in parameter can be either NULL or a valid resource bundle handle. If it is NULL, a new resource bundle will be constructed. If you pass an already existing resource bundle, it will be closed and the memory will be reused for the new resource bundle. The status indicator can return U_MISSING_RESOURCE_ERROR, which indicates that no resource with that key exists, or one of the above-mentioned informational codes (U_USING_FALLBACK_WARNING and U_USING_DEFAULT_WARNING), which do not affect the validity of data in the case of resource retrieval.
In C++, the analogous API is ResourceBundle get(const char* key, UErrorCode& status) const.
Trying to retrieve resources by key on any other type of resource than tables will produce a U_RESOURCE_TYPE_MISMATCH error.
Accessing by index requires you to supply the index of the resource that you want to retrieve. The appropriate API is UResourceBundle* ures_getByIndex(const UResourceBundle *resourceBundle, int32_t indexR, UResourceBundle *fillIn, UErrorCode *status). The arguments have the same semantics as for the ures_getByKey API. The only difference is the second argument, which is the index of the resource that you want to retrieve. Indexes start at zero. If an index out of range is specified, U_MISSING_RESOURCE_ERROR is returned. To find the size of a resource, you can use int32_t ures_getSize(UResourceBundle *resourceBundle). The maximum index is the result of this API minus 1.
Accessing a simple resource with index 0 returns the resource itself. This is useful for iterating over all the resources regardless of type.
C++ overloads the get API with ResourceBundle get(int32_t index, UErrorCode& status) const.
If you don't care about the order of the resources and want simple code, you can use the iteration mechanism. To set up iteration over a complex resource, you can simply start iterating using the UResourceBundle* ures_getNextResource(UResourceBundle *resourceBundle, UResourceBundle *fillIn, UErrorCode *status) API. It is advisable, though, to reset the iterator for a resource before starting, in order to ensure that the iteration will indeed start from the beginning, just in case somebody else has already been playing with this resource. To reset the iterator, use the void ures_resetIterator(UResourceBundle *resourceBundle) API. To check whether there are more resources, call UBool ures_hasNext(UResourceBundle *resourceBundle). If you have iterated through the whole resource, NULL will be returned.
C++ provides analogous APIs: ResourceBundle getNext(UErrorCode& status), void resetIterator(void) and UBool hasNext(void).
In order to get to the data in the simple resources, you need to use appropriate APIs according to the type of a simple resource. They are summarized in the tables below. All the pointers returned should be considered pointers to read only data. Using an API on a resource of a wrong type will result in an error.
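Strings: ures_getString in C, getString in C++.
Binary data: ures_getBinary in C, getBinary in C++.
Integers, signed and unsigned: ures_getInt and ures_getUInt in C, getInt and getUInt in C++.
Integer arrays: ures_getIntVector in C, getIntVector in C++.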
Since the vast majority of data stored in resource bundles are strings, ICU's resource bundle framework provides a number of different convenience APIs that directly access strings stored in resources. They are analogous to APIs already discussed, with the difference that they return const UChar* or UnicodeString objects.
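APIs that allow retrieving strings by specifying a key: ures_getStringByKey in C, getStringEx(const char* key, UErrorCode& status) in C++.
APIs that allow retrieving strings by specifying an index: ures_getStringByIndex in C, getStringEx(int32_t index, UErrorCode& status) in C++.
APIs for retrieving strings through iteration: ures_getNextString in C, getNextString in C++.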
The resource bundle framework provides a number of additional APIs that allow you to get more information on the resources you are using. They are summarized in the following tables.
ures_getSize (C), getSize (C++): Gets the number of items in a resource. Simple resources always return size 1.
ures_getType (C), getType (C++): Gets the type of the resource. For a list of resource types, see unicode/ures.h.
ures_getKey (C), getKey (C++): Gets the key of a named resource, or NULL if this resource is a member of an array.
ures_getVersion (C), getVersion (C++): Fills out the version structure for this resource.
ures_getLocale (C), getLocale (C++): Returns the locale this resource is from. This API is going to change, so stay tuned.
Resource bundles are written in their source format. Before using them, they must be compiled to the binary format using the genrb utility. The currently supported source format is a text file. The format is defined in a formal definition file.
This is an example of a resource bundle source file:
The binary format is described in the uresdata.h header file.
The syntax of the resources that can be stored in resource bundles is specified in the following table:
Although specifying the type for some resources can be omitted for backward compatibility reasons, you are strongly encouraged to always specify the type of the resources. As the structure gets more complicated, some combinations of resources that are not typed might produce unexpected results.
String values can contain C/Java-style escape sequences like \t, \r, \n, \xhh, \uhhhh and \U00hhhhhh, consistent with the u_unescape() C API, see the ustring.h API documentation.
A literal backslash (\) in a string value must be doubled (\\) or escaped with \x5C or \u005C.
A literal ASCII double quote (") in a double-quoted string must be escaped with \" or \x22 or \u0022.
You should also escape carriage return (\r) and line feed (\n) as well as control codes, noncharacters, unassigned code points and other default-invisible characters (see the Unicode Default_Ignorable_Code_Point property).
The way to write your resource is to start with a table that has your locale name. The contents of a table are between the curly brackets:
Then you can start adding resources to your bundle. Resources on the first level must be named and we suggest that you specify the type:
The resource bundle format doesn't care about indentation and line breaks. You can continue one string over many lines - you need to have the line break outside of the string:
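Putting the last three points together, a small source file might look like the sketch below. The resource names and values are purely illustrative and not taken from any shipped bundle; the file shows a top-level table named after the locale, typed first-level resources, and one string continued over several lines.

```
// Top-level table named after the locale (here the root locale).
root:table {
    // Named, typed resources on the first level.
    usage:string { "Usage: genrb [Options] files" }
    version:int { 122 }
    errorcodes:array {
        :string { "Invalid argument" }
        :string { "File not found" }
    }
    // One string continued over several lines; the line breaks
    // sit outside the quoted pieces.
    aVeryLongString:string {
        "This string is quite long "
        "and therefore should be "
        "broken into several lines."
    }
}
```

Indentation and line breaks are only there for readability; the adjacent quoted pieces of aVeryLongString form a single string value.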
In order to make your own resource bundle package, you need to perform several steps: write the resource bundle source files, compile them to the binary format with genrb, and package the compiled bundles (for example with pkgdata).
Rolling out your own data takes some practice, especially if you want to package it all together. You might want to take a look at how we package data. Good places to start (apart from ICU's own data, of course) are the source/test/testdata/ and source/samples/ufortune/resources/ directories.
Also, here is a sample Windows batch file that does compiling and packing of several resources:
It is also possible to use the icupkg tool instead of pkgdata to generate .dat data archives. The icupkg tool became available in ICU4C 3.6. If you need the data in a shared or static library, you still need to use the pkgdata tool. For easier maintenance, packaging, installation and application patching, it's recommended that you use .dat data archives.
ICU provides tools that allow for converting resource bundles to and from the XLIFF format. Files in XLIFF format can contain translations of resources. In that case, more than one resulting resource bundle will be constructed.
To produce an XLIFF file from a resource bundle, use the -x option of the genrb tool from ICU4C. Assume that we want to convert a simple resource bundle to the XLIFF format:
To get an XLIFF file, we need to call genrb like this: genrb -x -l en root.txt. Option -x tells genrb to produce an XLIFF file, and option -l specifies the language of the resource. If the language is not specified, genrb will try to deduce the language from the resource name (en, zh, sh). If the resource name is not an ISO language code (root), the default language for the platform will be used. The language will be a source attribute for all the translation units. The XLIFF file produced from the resource above will be named root.xlf and will look like this:
This file can be sent to translators. Using translation tools that support XLIFF, translators will produce one or more translations for this resource. The processed file might look a bit like this:
In order to convert this file to a set of resource bundle files, we need to use ICU4J's com.ibm.icu.dev.tool.localeconverter.XLIFF2ICUConverter class.
The command line for running XLIFF2ICUConverter should specify the file that needs to be converted, sh.xlf in this case. Optionally, you can specify input and output directories as well as the package name. After running this tool, two files will be produced: en.txt and sh.txt. This is what they would look like:
These files can then be used like all the other resource bundle files.