This chapter explains locales, a fundamental concept in ICU. ICU services are parameterized by locale, to allow client code to be written in a locale-independent way, but to deliver culturally correct results.
A locale identifies a specific user community - a group of users who have similar culture and language expectations for human-computer interaction (and the kinds of data they process).
A community is usually understood as the intersection of all users speaking the same language and living in the same country. Furthermore, a community can use more specific conventions. For example, an English/United States/Military locale is separate from the regular English/United States locale since the US military writes times and dates differently than most of the civilian community.
A program should be localized according to the rules specific to the target locale. Many ICU services depend on correct locale identification to function properly.
The locale object in ICU is an identifier that specifies a particular locale and has fields for language, country, and an optional code to specify further variants or subdivisions. These fields also can be represented as a string with the fields separated by an underscore.
In the C++ API, a locale is represented by the Locale class, which provides methods for finding the language, country, and variant components. In the C API, the locale is defined simply by a character string. In the Java API, the locale is represented by ULocale, which is analogous to the Locale class but provides additional support for ICU protocol.
All the locale-sensitive ICU services use the locale information to determine the language and other locale-specific parameters of their function. The list of locale-sensitive services can be found in the Introduction to ICU section. Other parts of the library use the locale as an indicator to customize their behavior.
For example, when the locale-sensitive date format service needs to format a date, it uses the convention appropriate to the current locale. If the locale is English, it uses the word "Monday"; if it is French, it uses the word "lundi".
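The Monday/lundi behavior can be reproduced with a minimal sketch. It assumes the JDK's own `java.time` and `java.util.Locale` classes as a stand-in for the ICU date format service; the class and method names are illustrative only.

```java
import java.time.DayOfWeek;
import java.time.format.TextStyle;
import java.util.Locale;

final class DayNameDemo {
    // Returns the full weekday name as written in the given locale.
    static String dayName(DayOfWeek day, Locale locale) {
        return day.getDisplayName(TextStyle.FULL, locale);
    }

    public static void main(String[] args) {
        // The calling code is identical; only the locale argument differs.
        System.out.println(dayName(DayOfWeek.MONDAY, Locale.ENGLISH)); // Monday
        System.out.println(dayName(DayOfWeek.MONDAY, Locale.FRENCH));  // lundi
    }
}
```

The client code never branches on the language; the locale object carries that decision.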
The locale object also defines the concept of a default locale. The default locale is the one that many programs use when no other locale is specified; it governs the computer's behavior by default and is usually chosen by the user in a control panel window. Because of the locale mechanism, a program does not need to know which locale the user is using, which makes most programming simpler.
Since locale objects can be passed as parameters or stored in variables, the program does not have to know specifically which locales they identify. Many applications enable a user to select a locale. The resulting locale object is passed as a parameter, which then produces the customized behavior for that locale.
A locale provides a means of identifying a specific region for the purposes of internationalization and localization.
A locale consists of one or more pieces of ordered information:
The languages are specified using a two- or three-letter lowercase code for a particular language. For example, Spanish is "es", English is "en" and French is "fr". The two-letter language code uses the ISO-639 standard.
The optional four-letter script code follows the language code. If specified, it should be a valid script code as listed on the Unicode ISO 15924 Registry.
There are often different conventions within the same language. For example, Spanish is spoken in many countries in Central and South America, but the currencies differ from country to country. To allow for these differences among specific geographical, political, or cultural regions, countries are specified by two-letter uppercase codes. For example, "ES" represents Spain and "MX" represents Mexico. The two-letter country code uses the ISO-3166 standard.
Java supports two-letter ISO-3166 country codes as well as three-digit UN M.49 area codes.
Differences may also appear in language conventions used within the same country. For example, the Euro currency is used in several European countries while the individual country's currency is still in circulation. Variations inside a language and country pair are handled by adding a third code, the variant code. The variant code is arbitrary and completely application-specific. ICU adds "_EURO" to its locale designations for locales that support the Euro currency. A variant can consist of any number of underscore-separated parts. For example, "EURO_WIN" is a variant for the Euro currency on a Windows computer.
Another use of the variant code is to designate the collation (sorting order) of a locale. For instance, the "es__TRADITIONAL" locale uses the traditional sorting order, which is different from the default modern sorting of Spanish.
Collation order and currency can be more flexibly specified using keywords instead of variants; see below.
The final element of a locale is an optional list of keywords together with their values. Keywords must be unique. Their order is not significant. Unknown keywords are ignored. The handling of keywords depends on the specific services that utilize them. Currently, the following keywords are recognized:
If any of these keywords is absent, the service requesting it will typically use the rest of the locale specifier in order to determine the appropriate behavior for the locale. The keywords allow a locale specifier to override or refine this default behavior.
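To make the ordering of the fields concrete, here is a rough sketch that splits a locale ID string into language, script, country, variant, and keywords. It is not ICU's parser: the `LocaleIdParts` class is hypothetical, and it assumes that a four-letter second field is a script, that keywords follow an `@` sign as `key=value` pairs separated by `;`, and that the input is already well formed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class LocaleIdParts {
    String language = "", script = "", country = "", variant = "";
    final Map<String, String> keywords = new LinkedHashMap<>();

    // Splits an ID such as "fr_FR@currency=EUR" into its fields (illustrative only).
    static LocaleIdParts parse(String id) {
        LocaleIdParts p = new LocaleIdParts();
        int at = id.indexOf('@');
        String base = at < 0 ? id : id.substring(0, at);
        if (at >= 0) {
            for (String pair : id.substring(at + 1).split(";")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    p.keywords.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        // Limit -1 keeps empty fields, so "es__TRADITIONAL" yields an empty country.
        String[] f = base.split("_", -1);
        int i = 0;
        p.language = f[i++];
        if (i < f.length && f[i].length() == 4) {
            p.script = f[i++];            // assumed: four letters means a script code
        }
        if (i < f.length && f[i].length() <= 3) {
            p.country = f[i++];           // may be empty
        }
        if (i < f.length) {
            // Everything left over is the variant, which may itself contain underscores.
            p.variant = String.join("_", Arrays.copyOfRange(f, i, f.length));
        }
        return p;
    }
}
```

For instance, `LocaleIdParts.parse("es__TRADITIONAL")` yields language `es`, an empty country, and variant `TRADITIONAL`, while `parse("fr_FR@currency=EUR")` yields a single keyword `currency` with value `EUR`.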
Default locales are available to all the objects in a program. If you set a new default locale for one section of code, it can affect the entire program. Application programs should not set the default locale as a way to request an international object. The default locale is set to be the system locale on that platform.
For example, when you set the default locale, the change affects the default behavior of the Collator and NumberFormat instances. When the default locale is not wanted, you can set the desired locale using a factory method supplied with the classes such as Collator::createInstance().
Using the ICU C functions, NULL can be passed for a locale parameter to specify the default locale.
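The advice to request a specific locale from a factory method, rather than change the default, might look like the following sketch. It assumes the JDK's `java.text.NumberFormat` as a stand-in for the ICU class of the same name; `ExplicitLocaleDemo` and `formatFor` are illustrative names.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class ExplicitLocaleDemo {
    // Formats a value for the given locale without touching the default locale.
    static String formatFor(Locale locale, double value) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        Locale before = Locale.getDefault();
        System.out.println(formatFor(Locale.US, 1234.5));      // 1,234.5
        System.out.println(formatFor(Locale.GERMANY, 1234.5)); // 1.234,5
        // The process-wide default is unchanged by either call.
        System.out.println(before.equals(Locale.getDefault())); // true
    }
}
```

Because the locale is passed explicitly, two parts of the same program can format for different users without interfering with each other.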
ICU is implemented as a set of services. One example of a service is the formatting of a numeric value into a string. Another is the sorting of a list of strings. When client code wants to use a service, the first thing it does is request a service object for a given locale. The resulting object is then expected to perform its operations in a way that is culturally correct for the requested locale.
The requested locale is the one specified by the client code when the service object is requested.
A populated locale is one for which ICU has data, or one in which client code has registered a service. If the requested locale is not populated, then ICU will fall back until it reaches a populated locale. The first populated locale it reaches is the valid locale. The valid locale is reachable from the requested locale via zero or more fallback steps.
Locale fallback proceeds as follows:
At any point, if the desired data is found, then the fallback procedure stops. Keywords are not altered during fallback until the default locale is reached, at which point all keywords are replaced by those assigned to the default locale.
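A fallback sequence of this kind can be sketched as plain string truncation. The code below is an assumption-laden illustration, not ICU's implementation: it assumes that each step drops the last underscore-separated field, that the default locale's own sequence is tried once the requested locale is exhausted, and that the IDs carry no keywords. `FallbackChain` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

final class FallbackChain {
    // Lists an ID followed by each shorter ID obtained by dropping its last field.
    static List<String> truncations(String id) {
        List<String> out = new ArrayList<>();
        while (!id.isEmpty()) {
            out.add(id);
            int cut = id.lastIndexOf('_');
            id = cut < 0 ? "" : id.substring(0, cut);
            // Skip the empty field in IDs such as "es__TRADITIONAL".
            while (id.endsWith("_")) {
                id = id.substring(0, id.length() - 1);
            }
        }
        return out;
    }

    // Candidates for the requested locale first, then those of the default locale.
    static List<String> chain(String requested, String defaultId) {
        List<String> out = truncations(requested);
        for (String id : truncations(defaultId)) {
            if (!out.contains(id)) {
                out.add(id);
            }
        }
        return out;
    }
}
```

With a requested locale of `en_US_CALIFORNIA` and a default of `fr_FR`, `chain` returns `en_US_CALIFORNIA`, `en_US`, `en`, `fr_FR`, `fr`; a lookup would stop at the first entry that has the desired data.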
Services request specific resources within the valid locale. If the valid locale directly contains the requested resource, then it is the actual locale. If not, then ICU will fall back until it reaches a locale that does directly contain the requested resource. The first such locale is the actual locale. The actual locale is reachable from the valid locale via zero or more fallback steps.
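The difference between the valid and the actual locale can be illustrated with a toy model. Everything here is hypothetical: the data map (locale ID to the resource names stored directly in that locale), the resource names, and the helper methods are invented for the sketch, and fallback is simplified to dropping the last underscore-separated field.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ValidActualDemo {
    // Simplified fallback step: drop the last underscore-separated field.
    static String parent(String id) {
        int cut = id.lastIndexOf('_');
        return cut < 0 ? "" : id.substring(0, cut);
    }

    // First populated locale reached from the requested one ("" if none is found).
    static String validLocale(String requested, Map<String, Set<String>> data) {
        String id = requested;
        while (!id.isEmpty() && !data.containsKey(id)) {
            id = parent(id);
        }
        return id;
    }

    // First locale, starting at the valid one, that directly contains the resource.
    static String actualLocale(String valid, String resource, Map<String, Set<String>> data) {
        String id = valid;
        while (!id.isEmpty() && !(data.containsKey(id) && data.get(id).contains(resource))) {
            id = parent(id);
        }
        return id;
    }

    // Invented sample data: "en_US" is populated but holds no collation rules of its own.
    static Map<String, Set<String>> sampleData() {
        Map<String, Set<String>> data = new HashMap<>();
        data.put("en", new HashSet<>(Arrays.asList("collation", "numbers")));
        data.put("en_US", new HashSet<>(Arrays.asList("numbers")));
        return data;
    }
}
```

With this sample data, a request for `en_US_CALIFORNIA` has the valid locale `en_US`; the actual locale is `en_US` for the `numbers` resource but `en` for `collation`.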
Client code may wish to know what the valid and actual locales are for a given service object. To support this, ICU services provide the method getLocale(). getLocale() takes an argument specifying whether the actual or valid locale is to be returned.
Some service objects will have an empty or null return from getLocale(). This indicates that the given service object was not created from locale data, or that it has since been modified so that it no longer reflects locale data, typically through alteration of the pattern (but not localized symbol changes -- such changes do not reset the actual and valid locale settings).
Currently, the services that support the getLocale() API are the following classes and their subclasses:
Various services provide the API getFunctionalEquivalent to allow callers to determine the functionally equivalent locale for a requested locale. For example, when instantiating a collator for the locale en_US_CALIFORNIA, the functionally equivalent locale may be en.
Service.getFunctionalEquivalent(A) == Service.getFunctionalEquivalent(B)
implies that the object returned by Service.getInstance(A) will behave equivalently to the object returned by Service.getInstance(B).
Here is a pseudo-code example:
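One way the guarantee above could be put to use is a cache of service objects keyed by functional equivalent. The sketch below is only an illustration in Java: `EquivalentCache` is a hypothetical class, and the two functions passed to it stand in for `Service.getFunctionalEquivalent` and `Service.getInstance`, whose real signatures vary by service.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class EquivalentCache<T> {
    private final Function<String, String> functionalEquivalent; // stand-in for Service.getFunctionalEquivalent
    private final Function<String, T> factory;                   // stand-in for Service.getInstance
    private final Map<String, T> cache = new HashMap<>();

    EquivalentCache(Function<String, String> functionalEquivalent, Function<String, T> factory) {
        this.functionalEquivalent = functionalEquivalent;
        this.factory = factory;
    }

    // Locales with the same functional equivalent share one service object.
    T get(String requestedLocale) {
        String key = functionalEquivalent.apply(requestedLocale);
        // The object is created for the equivalent locale, which by definition
        // behaves the same as one created for the requested locale.
        return cache.computeIfAbsent(key, factory);
    }
}
```

If the equivalence function maps both `en_US_CALIFORNIA` and `en` to `en`, then `get("en_US_CALIFORNIA")` and `get("en")` return the same object, and only one instance is ever built.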
The functional equivalent locale returned by a service has no meaning beyond what is stated above. For example, if the functional equivalent of Greek is Hebrew for collation, that makes no statement about the linguistic relation of the languages -- it only means that the two collators are functionally equivalent.
While two locales with the same functional equivalent are guaranteed to be equivalent, the converse is not true: If two locales are in fact equivalent, they may not return the same result from getFunctionalEquivalent. That is, if the object returned by Service.getInstance(A) behaves equivalently to the object returned by Service.getInstance(B), Service.getFunctionalEquivalent(A) may or may not be equal to Service.getFunctionalEquivalent(B). Take again the example of Greek and Hebrew, with respect to collation. These locales may happen to be functional equivalents (since they each just turn on full normalization), but it may or may not be the case that they return the same functionally equivalent locale. This depends on how the data is structured internally.
The functional equivalent for a locale may change over time. Suppose that Greek were enhanced to change sorting of additional ancient Greek characters. In that case, it would diverge; the functional equivalent of Greek would no longer be Hebrew.
ICU works with ICU format locale IDs. These are strings that obey the following character set and syntax restrictions:
ICU performs two kinds of canonicalizing operations on 'ICU format' locale IDs. Level 1 canonicalization is performed routinely and automatically by ICU API. The recommended procedure for client code using locale IDs from outside sources (e.g., POSIX, user input, etc.) is to pass such "foreign IDs" through level 2 canonicalization before use.
Level 1 canonicalization. This operation performs minor, isolated changes, such as changing "en-us" to "en_US". Level 1 canonicalization is not designed to handle "foreign" locale IDs (POSIX, .NET) but rather IDs that are in ICU format, but which do not have normalized case and delimiters. Level 1 canonicalization is accomplished by the ICU functions uloc_getName, Locale::createFromName, and Locale::Locale. The latter two API exist in both C++ and Java.
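The kind of case and delimiter normalization described for level 1 can be sketched as follows. This is not the code behind uloc_getName; `Level1Sketch.normalize` is a hypothetical helper that assumes a four-letter second field is a script code and leaves anything after `@` untouched.

```java
import java.util.Locale;

final class Level1Sketch {
    // Normalizes delimiters and letter case of the language/script/country/variant part.
    static String normalize(String id) {
        int at = id.indexOf('@');
        String base = at < 0 ? id : id.substring(0, at);
        String rest = at < 0 ? "" : id.substring(at);   // keywords are passed through unchanged
        String[] f = base.replace('-', '_').split("_", -1);
        StringBuilder sb = new StringBuilder(f[0].toLowerCase(Locale.ROOT));
        for (int i = 1; i < f.length; i++) {
            String s = f[i];
            if (i == 1 && s.length() == 4) {
                // Assumed script code: title case, e.g. "latn" becomes "Latn".
                s = s.substring(0, 1).toUpperCase(Locale.ROOT)
                        + s.substring(1).toLowerCase(Locale.ROOT);
            } else {
                // Country and variant fields: upper case.
                s = s.toUpperCase(Locale.ROOT);
            }
            sb.append('_').append(s);
        }
        return sb.append(rest).toString();
    }
}
```

Under these assumptions, `normalize("en-us")` returns `en_US`. Replacing whole elements, as level 2 does, is deliberately out of scope for the sketch.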
Level 2 canonicalization. This operation may make major changes to the ID, possibly replacing entire elements of the ID. An example is changing "fr-fr@EURO" to "fr_FR@currency=EUR". Level 2 canonicalization is designed to translate POSIX and .NET IDs, as well as nonstandard ICU locale IDs. Level 2 is a superset of level 1; every operation performed by level 1 is also performed by level 2. Level 2 canonicalization is performed by uloc_canonicalize and Locale::createCanonical. The latter API exists in both C++ and Java.
Certain other operations are not performed by either level 1 or level 2 canonicalization. These are listed here for completeness.
All API (with a few exceptions) in ICU4C that take a const char* locale parameter can be assumed to automatically perform level 1 canonicalization before using the locale ID to do resource lookup, keyword interpretation, etc. Specifically, the static API getLanguage, getScript, getCountry, and getVariant behave exactly like their non-static counterparts in the class Locale. That is, for any locale ID loc, new Locale(loc).getFoo() == Locale::getFoo(loc), where Foo is one of Language, Script, Country, or Variant.
The Locale constructor (in C++ and Java) taking multiple strings behaves exactly as if those strings were concatenated, with the '_' separator inserted between two adjacent non-empty strings, and the result passed to uloc_getName.
If you are localizing an application to a locale that is not already supported, you need to create your own Locale object. New Locale objects are created using one of the three constructors in this class:
Because a locale object is just an identifier for a region, no validity check is performed. If you want to verify that the particular resources are available for the locale you construct, you must query those resources. For example, you can query the NumberFormat object for the locales it supports using its getAvailableLocales() method.
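Such a query might look like the sketch below. It assumes the JDK's `java.text.NumberFormat` and `java.util.Locale` as stand-ins for the ICU classes; `AvailabilityCheck` is an illustrative name, and the locale `xx_ZZ` is a made-up ID used to show that construction succeeds even when no data exists.

```java
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

final class AvailabilityCheck {
    // True if NumberFormat reports the locale among those it supports.
    static boolean numberFormatSupports(Locale locale) {
        return Arrays.asList(NumberFormat.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        // Constructing a locale never fails, even for an unknown language/country pair.
        Locale madeUp = Locale.forLanguageTag("xx-ZZ");
        System.out.println(numberFormatSupports(Locale.US)); // true
        System.out.println(numberFormatSupports(madeUp));    // false
    }
}
```

The check is per service: a locale supported by one class is not necessarily supported by another.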
New ULocale objects in Java are created using one of the following three constructors in this class:
The locale ID passed in the constructor consists of optional language, script, country, and variant fields in that order, separated by underscores, followed by an optional list of keywords.
In C++, the Locale class provides a number of convenient constants that you can use to create locales. For example, the following refers to a NumberFormat object for the United States:
In C, a locale is described by a string in which the language, country, and variant are joined with an underscore '_'. For example, "en_US" is a locale that is based on the English language in the United States. The following can be used as equivalents to the locale constants:
In Java, the ULocale class provides a number of convenient constants that can be used to create locales.
Locale-sensitive classes have a getAvailableLocales() method that returns all of the locales supported by that class. This method also shows the other methods that get locale information from the resource bundle. For example, the following shows that the NumberFormat class provides three convenience methods for creating a default NumberFormat object:
Locale-sensitive classes in Java also have a getAvailableULocales() method that returns all of the locales supported by that class.
Once you've created a Locale in C++ or a ULocale in Java, you can query the locale for information about itself. The following shows the information you can receive from a locale:
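As a rough illustration of such queries, the sketch below uses the JDK's `java.util.Locale` rather than ICU's Locale or ULocale; the ICU classes offer similarly named accessors, but that correspondence is an assumption here. `LocaleInfoDemo` is an illustrative name.

```java
import java.util.Locale;

final class LocaleInfoDemo {
    // Collects a few of the fields and display names a locale can report.
    static String describe(Locale locale) {
        return locale.getLanguage()                        // ISO 639 two-letter code
                + "|" + locale.getCountry()                // ISO 3166 two-letter code
                + "|" + locale.getISO3Language()           // three-letter language code
                + "|" + locale.getDisplayLanguage(Locale.ENGLISH)
                + "|" + locale.getDisplayCountry(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(describe(Locale.CANADA_FRENCH)); // fr|CA|fra|French|Canada
    }
}
```

The display methods take a second locale that selects the language of the names themselves, which is what makes them suitable for building a locale menu.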
Each class that performs locale-sensitive operations allows you to get all the available objects of that type. You can sift through these objects by language, country, or variant, and use the display names to present a menu to the user. For example, you can create a menu of all the collation objects suitable for a given language.
ICU provides functions to negotiate the best locale to use for an operation, given a user's list of acceptable locales, and the application's list of available locales. For example, a browser sends the web server the HTTP “Accept-Language” header indicating which locales, with a ranking, are acceptable to the user. The server must determine which locale to use when returning content to the user.
Here is an example of selecting an acceptable locale within a CGI application:
Here is an example of selecting an acceptable locale within a Java application:
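A minimal sketch of this negotiation, assuming the JDK's `Locale.LanguageRange` and `Locale.lookup` rather than an ICU API. The class name, the header value, and the list of available locales are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class AcceptLanguageDemo {
    // Picks the best available locale for an Accept-Language value, or the fallback if none fits.
    static Locale negotiate(String acceptLanguage, List<Locale> available, Locale fallback) {
        // parse() orders the ranges by their q-weights, highest first.
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        // lookup() tries each range in turn, truncating it (en-gb, then en) until a match is found.
        Locale best = Locale.lookup(ranges, available);
        return best != null ? best : fallback;
    }

    public static void main(String[] args) {
        List<Locale> available = Arrays.asList(
                Locale.ENGLISH, Locale.US, Locale.FRENCH);  // hypothetical application locales
        // Danish is preferred but unavailable; British English falls back to plain English.
        System.out.println(negotiate("da, en-gb;q=0.8, en;q=0.7", available, Locale.ROOT)); // en
    }
}
```

Passing an explicit fallback keeps the decision about what to serve when nothing matches in the application's hands.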
See Programming for Locale in C, C++ and Java for more information.