This section describes the design principles, architecture and coding conventions of the ICU Collation Service. ICU CollatorTo use the ICU Collation Service, an ICU Collator must first be instantiated. An ICU Collator is a data structure or object that maintains all of the property and state information necessary to define and support the specific collation behavior provided. Examples of properties described in the ICU Collator are the locale, whether normalization is to be performed, and how many levels of collation are to be evaluated. Examples of the state information described in the ICU Collator include the direction of a Collation Element Iterator (forward or backward) and the status of the last API executed. The ICU Collator is instantiated either by referencing a locale or by defining a custom set of rules (a tailoring). The ICU Collation Service uses the paradigm:
ICU Collator instances cannot be shared among threads. You should open them instead, and use a different collator for each separate thread. The safe clone function is supported for cloning collators in a thread-safe fashion.
Collation locale and keyword handlingWhen a collator is created from a locale, the collation service (like all ICU services) must map the requested locale to the localized collation data available to ICU at the time. It does so using the standard ICU locale fallback mechanism. See the fallback section (§) of the locale chapter for more details. If you pass a regular locale in, like "en_US", the collation service first searches with fallback for "collations/default" key. The first such key it finds will have an associated string value; this is the keyword name for the collation that is default for this locale. If the search falls all the way back to the root locale, the collation service will us the "collations/default" key there, which has the value "standard". If there is a locale with a keyword, like "de@collation=phonebook", the collation service searches with fallback for "collations/phonebook". If the search is successful, the collation service uses the string value it finds to instantiate a collator. If the search fails because no such key is present in any of ICU's locale data (e.g., "de@collation=funky"), the service returns a collator implementing UCA and the return UErrorCode is U_USING_DEFAULT_WARNING. Input values for collationCollation
deals with processing strings. ICU generally requires that all the
strings should be in UTF-16 format, and that all the required
conversion should done before ICU functions are used. In the case of
collation, there are APIs that can also take instances of character
iterators (UCharIterator). Theoretically, character iterators can iterate strings Collation ElementsAs discussed in the introduction, there are many possible orderings for sorted text, depending on language and other factors. Ideally, there is a way to describe each ordering as a set of rules for calculating numeric values for each string of text. The collation process then becomes one of simply comparing these numeric values. This essentially describes the way the ICU Collation Service works. To implement a particular sort ordering, first the relationship between each character or character sequence is derived. For example, a Spanish ordering defines the letter sequence "CH" to be between the letters "C" and "D". As also discussed in the introduction, to order strings properly requires that comparison of base letters must be considered separately from comparison of accents. Letter case must also be considered separately from either base letters or accents. Any ordering specification language must provide a way to define the relationships between characters or character sequences on multiple levels. ICU supports this by using "<" to describe a relationship at the primary level, using "<<" to describe a relationship at the secondary level, and using "<<<" to describe a relationship at the tertiary level. Here are some example usages:
A more complete description of the ordering specification symbols and their meanings is provided in the section on Collation Tailoring. Once a sort ordering is defined by specifying the desired relationships between characters and character sequences, ICU can convert these relationships to a series of numerical values (one for each level) that satisfy these same relationships. This series of numeric values, representing the relative weighting of a character or character sequence, is called a Collation Element (CE). A Collation Element is a 32-bit value, consisting of a 16-bit primary, 8-bit secondary, 6-bit tertiary weight and 2 case bits.
The sort weight of a string is represented by the collation elements of its component characters and character sequences. For example, the sort weight of the string "apple" would consist of its component Collation Elements, as shown here:
In this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an 8-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of 05 (hex). String comparison is performed by comparing the collation elements of each string. Each of the primary weights are compared. If a difference is found, that difference determines the relationship between the two strings. If no differences are found, the secondary weights are compared and so forth. With ICU it is possible to specify how many levels should be compared. For some applications, it can be desirable to compare only primary levels or to compare only primary and secondary levels. Sort KeysIf a string is to be compared repeatedly, it can be more efficient to use sort keys. Sort keys are useful in situations where a large amount of data is indexed and frequently searched. The sort key is generated once and used in subsequent comparisons, rather than repeatedly generating the string's Collation Elements. The key comparison is a very efficient and simple binary compare of strings of unsigned bytes. An important property of ICU sort keys is that you can obtain the same results by comparing 2 strings as you do by comparing the sort keys of the 2 strings (provided that the same ordering and related collation attributes are used). ICU sort key is a pre-processed sequence of bytes generated from a Unicode string. The weights for each comparison level are concatenated, separated by a "0x01" byte. The entire sequence is terminated with a 0x00 byte for convenience in C APIs.
Sort key sizeOne of the more important issues when considering using sort keys is the sort key size. Unfortunately, it is very hard to give a fast exact answer to the following question: "What is the maximum size for sort keys generated for strings of size X". This problem is twofold:
If one is to assume the worst case and use too big buffers, a lot of space will be wasted. However, if you use too small buffers, you will lose performance if generated sort keys are longer than supplied buffers too often. A good strategy for this problem would be to manually manage a large buffer for storing sortkeys and keep a list of indices to sort keys in this buffer (see the "large buffers" (§) Collation Example for more details). Here are some rules of a thumb, please do not rely on them. If you are looking at the East Asian locales, you probably want to go with 5 bytes per code unit. For Thai, 3 bytes per code unit should be sufficient. For all the other locales (mostly Latin and Cyrillic), you should be fine with 2 bytes per code unit. These values are based on average lengths of sortkeys generated with tertiary strength - if you need quaternary and identical strength (you should not), add 3 bytes per code unit to each of these. Partial sort keysIn some cases, most notably when implementing radix sorting, it is useful to produce only parts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing parts of sort keys (ucol_nextSortKeyPart API). These sort keys may or may not be compressed; that is, they may or may not be compatible with regular sort keys. Merging sort keysSometimes, it is useful to be able to merge sort keys. One example is having separate sort keys for first and last names. If you need to perform an operation that requires a sort key generated on the whole name, instead of concatenating strings and regenerating sort keys, you should merge the sort keys. The merging is done by merging the corresponding levels while inserting a terminator between merged parts. The reserved sort key byte value for the merge terminator is 0x02. For more details see UCA section 1.6, Merging Sort Keys.
CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is intended to allow concatenating strings like firstName+\uFFFE+lastName to yield the same results as merging their individual sort keys. As of ICU 50, this is not yet fully implemented in ICU. Generating bounds for a sort key (prefix matching)Having sort keys for strings allows for easy creation of bounds - sort keys that are guaranteed to be smaller or larger than any sort key from a give range. For example, if bounds are produced for a sortkey of string "smith", strings between upper and lower bounds with one level would include "Smith", "SMITH", "sMiTh". Two kinds of upper bounds can be generated - the first one will match only strings of equal length, while the second one will match all the strings with the same initial prefix. CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum primary weight, so that for example the string "smith\uFFFF" can be used as the upper bound rather than modifying the sort key for "smith". Collation Element IteratorThe collation element iterator is used for traversing Unicode string collation elements one at a time. It can be used to implement language-sensitive text search algorithms like Boyer-Moore. For most applications, the two API categories, compare and sort key, are sufficient. Most people do not need to manipulate collation elements directly. Example: Consider iterating over "apple" and "äpple". Here are sequences of collation elements:
The resulting CEs are typically masked according to the desired strength, and zero CEs are discarded. In the above example, masking with 0xFFFF0000 produces the results of NULL secondary and tertiary differences. The collator then ignores the NULL differences and declares a match. For more details see the paper "Efficient text searching in Java™: Finding the right string in any language" by Laura Werner ( http://icu-project.org/docs/papers/efficient_text_searching_in_java.html ). Collation AttributesThe ICU Collation Service has a number of attributes whose values can be changed during run time. These attributes affect both the functionality and the performance of the ICU Collation Service. This section describes these attributes and, where possible, their performance impact. Performance indications are only approximate and timings may vary significantly depending on the CPU, compiler, etc. Although string comparison by ICU and comparison of each string's sort key give the same results, attribute settings can impact the execution time of each method differently. To be precise in the discussion of performance, this section refers to the API employed in the measurement. The ucol_strcoll function is the API for string comparison. The ucol_getSortKey function is used to create sort keys.
|
X | FCD | NFC | NFD |
---|---|---|---|
A- ring | Y | Y | |
Angstrom | Y | ||
A + ring | Y | Y | |
A + grave | Y | Y | |
A-ring + grave | Y | ||
A + cedilla + ring | Y | Y | |
A + ring + cedilla | |||
A-ring + cedilla | Y |
With normalization mode turned on, the ucol_strcoll function slows down by 10%. In addition, the time to generate a sort key also increases by about 25%.
Alternate handling
This attribute allows shifting of the variable characters (usually spaces and punctuation, in the UCA also most symbols) from the primary to the quaternary strength level. This is set by using the UCOL_ALTERNATE_HANDLING attribute. For details see UCA: Variable Weighting, LDML: Collation Settings, and “Ignore Punctuation” Options.
UCOL_NON_IGNORABLE (CLDR/ICU default): variable characters are treated as all the other characters
UCOL_SHIFTED (UCA default): all the variable characters will be ignored at the primary, secondary and tertiary levels and their primary strengths will be shifted to the quaternary level.
Case Ordering
Some conventions require
uppercase letters to sort before lowercase ones, while others require
the opposite. This attribute is controlled by the value of the UCOL_CASE_FIRST.
The case difference in the UCA is contained in the tertiary weights
along with other appearance characteristics (like circling of letters).
The case-first attribute allows for emphasizing of the case
property of the letters by reordering the tertiary weights with either
upper-first, and/or lowercase-first. This difference gets the most
significant bit in the weight.
Valid values for this attribute are:
UCOL_OFF (default): leave tertiary weights unaffected
UCOL_LOWER_FIRST: causes lowercase letters and uncased characters to sort before uppercase
UCOL_UPPER_FIRST : causes uppercase letters to sort first
The case-first attribute does not affect the performance substantially.
Case level
When
this attribute is set, an additional level is formed between the
secondary and tertiary levels, known as the Case Level. The case level
is used to distinguish large and small Japanese Kana characters. Case
level could also be used in other situations. for example to
distinguish certain Pinyin characters.
Case level is controlled by UCOL_CASE_LEVEL attribute. Valid values for this attribute are
UCOL_OFF (default): no additional case level
UCOL_ON : adds a case level
Hiragana Quaternary
Hiragana Quaternary can be set to UCOL_ON, in which case Hiragana code points will sort before everything else on the quaternary level. If set to UCOL_OFF Hiragana letters are treated the same as all the other code points. This setting can be changed on run-time using the UCOL_HIRAGANA_QUATERNARY_MODE attribute. You probably won't need to use it.
Variable Top
Variable Top is a boundary which decides whether the code points will be treated as variable (shifted to quaternary level in the shifted mode) or non-ignorable. Special APIs are used for setting of variable top. It can basically be set either to a codepoint or a primary strength value.
Performance
ICU collation is designed to be fast, small and customizable. Several techniques are used to enhance the performance:
Providing optimized processing for Latin characters.
Comparing strings incrementally and stop at the first significant difference.
Tuning to eliminate unnecessary file access or memory allocation.
Providing efficient preflight functions that allows fast sort key size generation.
Using a single, shared copy of UCA in memory for the read-only default sort order. Only small tailoring tables are kept in memory for locale-specific customization.
Compressing sort keys efficiently.
Making the sort order to be data-driven.
In general, the best performance from the ICU Collation Service is expected by doing the following:
After opening a collator, keep and reuse it until done. Do not open new collators for the same sort order. (Note the restriction on multi-threading.)
Follow the best practice guidelines for generating sort key. Do not call ucol_getSortKey twice to first size the key and then allocate the sort key buffer and repeat the call to the function to fill in the buffer.
Use ucol_strcol when comparing two strings one time. If it is necessary to compare strings more than once, create the sort key first and compare the sort keys instead. Generating the sort keys of two strings is about 5-10 times slower than just comparing them directly.
Performance and Storage Implications of Attributes
Most people use the default attributes when comparing strings or when creating sort keys. When they do want to customize the ordering, the most common options are the following :
Attributes | Description |
---|---|
UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED | Used to ignore space and punctuation characters |
UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED and UCOL_STRENGTH == UCOL_QUATERNARY | Used to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences. |
UCOL_CASE_FIRST == UCOL_LOWER_FIRST or UCOL_CASE_FIRST == UCOL_UPPER_FIRST | Used to change the ordering of upper vs. lower case letters (as well as small vs. large kana) |
UCOL_CASE_LEVEL == UCOL_ON and UCOL_STRENGTH == UCOL_PRIMARY | Used to ignore only the accent differences. |
UCOL_NORMALIZATION_MODE == UCOL_ON | Force to always check for normalization. This is used if the input text may not be in FCD form. |
UCOL_FRENCH_COLLATION == UCOL_OFF | This is only useful for languages like French and Catalan that turn this attribute on by default. |
In
String Comparison, most of these options have little or no effect on
performance. The only noticeable one is normalization, which can cost
10%-40% in performance.
For Sort Keys, most of these options either leave the storage alone or reduce it. Shifting can reduce the storage by about 10%-20%; case level + primary-only can decrease it about 20% to 40%. Using no French accents can reduce the storage by about 38% , but only for languages like French and Catalan that turn it on by default. On the other hand, using Shift + Quad can increase the storage by 10%-15%. (The Identical Level also increases the length, but this option is not recommended).
![]() | All of the above numbers are based on tests run on a particular machine, with a particular set of data. (The data for each language is a large number of names in that language in the format <first_name>, <last name>.) The performance and storage may vary, depending on the particular computer, operating system, and data. |
Versioning
Sort keys are often stored on disk for later reuse. A common example is the use of keys to build indexes in databases. When comparing keys, it is important to know that both keys were generated by the same algorithms and weightings. Otherwise, identical strings with keys generated on two different dates, for example, might compare as unequal. Sort keys can be affected by new versions of ICU or its data tables, new sort key formats, or changes to the Collator. Starting with release 1.8.1, ICU provides a versioning mechanism to identify the version information of the following (but not limited to),
The run-time executable
The collation element content
The Unicode/UCA database
The tailoring table
The version information of Collator is a 32-bit integer. If a new version of ICU has changes affecting the content of collation elements, the version information will be changed. In that case, to use the new version of ICU collator will require regenerating any saved or stored sort keys. However, since ICU4C 1.8.1. it is possible to build your program so that it uses more than one version of ICU (only in C/C++, not in Java). Therefore, you could use the current version for the features you need and use the older version for collation.
Programming Examples
See the Collation Examples chapter for an example of how to compare and create sort keys with default locale in C , C++ and Java.