Collator Instantiation
To use the ICU Collation Service, you must instantiate an ICU Collator. The Collator defines the properties and behavior of the sort ordering. The Collator can be referenced repeatedly until all collation activities have been performed. The Collator can then be closed and removed.
Instantiating the Predefined Collators
ICU comes with a large set of predefined collators that are suited for specific locales. Most of the ICU locales have a predefined collator. In the worst case, the default set of rules, which is equivalent to the UCA ordering, is used.
To instantiate a predefined collator, use the APIs ucol_open, createInstance and getInstance for C, C++ and Java code respectively. The C API takes a locale ID (or language tag) string argument, C++ takes a Locale object, and Java takes a Locale or ULocale.
For some languages, multiple collation types are available; for example, "de@collation=phonebook". They can be enumerated via Collator::getKeywordValuesForLocale(). See also the list of available collation tailorings in the online collation demo.
Starting with ICU 54, collation attributes can be specified via locale keywords as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive.
See the LDML Collation spec, Collation Settings, and the data file listing the valid collation keywords and their values. (The deprecated attributes kh/colHiraganaQuaternary and vt/variableTop are not supported.)
For the old locale extension syntax, the data file's alias names are used (first alias, if defined, otherwise the name):
de@collation=phonebook;colCaseLevel=yes;kv=space
For the language tag syntax, the non-alias names are used, and "true" values can be omitted:
de-u-co-phonebk-kc-kv-space
This example demonstrates the instantiation of a collator.
C:
C++:
Java:
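A minimal sketch of instantiating a predefined collator and sorting with it. It uses the JDK's java.text.Collator so that it runs without extra libraries; ICU4J's com.ibm.icu.text.Collator offers a getInstance method of the same shape. The class name PredefinedCollatorDemo and the helper sortFor are hypothetical names chosen for illustration.

```java
import java.text.Collator; // assumption: JDK stand-in for com.ibm.icu.text.Collator
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
final class PredefinedCollatorDemo {

    // Returns a copy of the input sorted with the locale's predefined collator.
    static String[] sortFor(Locale locale, String... words) {
        Collator collator = Collator.getInstance(locale);
        String[] sorted = words.clone();
        Arrays.sort(sorted, collator); // Collator is a Comparator
        return sorted;
    }

    public static void main(String[] args) {
        String[] sorted = sortFor(Locale.ENGLISH, "banana", "Cherry", "apple");
        System.out.println(Arrays.toString(sorted)); // [apple, banana, Cherry]
    }
}
```

A plain String.compareTo sort would place "Cherry" first because uppercase letters have lower code points; the collator orders by letter first and by case only as a tie-breaker.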
Instantiating Collators Using Custom Rules
If the ICU predefined collators are not appropriate for your intended usage, you can define your own set of rules and instantiate a collator that uses them.
This example demonstrates the instantiation of a collator.
C:
C++:
Java:
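A minimal sketch of building a collator from a rule string. It uses the JDK's java.text.RuleBasedCollator as a stand-in for the ICU4J class of the same name. The rule string follows the java.text syntax and is purely illustrative; ICU's own rule syntax is described in the collation customization documentation and may differ. The names CustomRulesDemo, REVERSED_ABC and fromRules are hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator; // assumption: JDK stand-in for the ICU4J class
import java.util.Arrays;

// Hypothetical demo class; not part of ICU or the JDK.
final class CustomRulesDemo {

    // Illustrative rules in java.text syntax: c sorts before b, and b before a.
    static final String REVERSED_ABC = "< c < b < a";

    // Builds a collator from rules, converting the checked exception for brevity.
    static RuleBasedCollator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = fromRules(REVERSED_ABC);
        String[] letters = {"a", "b", "c"};
        Arrays.sort(letters, collator);
        System.out.println(Arrays.toString(letters)); // [c, b, a]
    }
}
```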
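Compare
Two of the most used functions in the ICU collation API, ucol_strcoll and ucol_getSortKey, have counterparts in both the Win32 and ANSI APIs: ucol_strcoll corresponds to strcoll (ANSI) and CompareString (Win32), and ucol_getSortKey corresponds to strxfrm (ANSI) and LCMapString (Win32).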
For more sophisticated usage, such as user-controlled language-sensitive text searching, an iterating interface to collation is provided. Please refer to the section below on CollationElementIterator for more detail.
The ucol_strcoll function compares one pair of strings at a time. Comparing two strings is much faster than calculating sort keys for both of them. However, if comparisons are done repeatedly on a large number of strings, generating and storing sort keys can improve performance. In all other cases (such as a quick sort or bubble sort of a modest list of strings), comparing the strings directly is usually the better choice.
The C API used for comparing two strings is ucol_strcoll. It requires two UChar * strings and their lengths as parameters, as well as a pointer to a valid UCollator instance. The result is a UCollationResult constant, which can be one of UCOL_LESS, UCOL_EQUAL or UCOL_GREATER.
The C++ API offers the method Collator::compare with several overloads. Acceptable input arguments are UChar * pointers with string lengths, or UnicodeString instances. The result is a UCollationResult for the overloads that take a UErrorCode, or a member of the EComparisonResult enum for the older, deprecated overloads.
The Java API provides the method Collator.compare with one overload. Acceptable input arguments are Strings or Objects. The result is an int value, which is less than zero if source is less than target, zero if source and target are equal, or greater than zero if source is greater than target.
There are also several convenience functions and methods returning a boolean value, such as ucol_greater, ucol_greaterOrEqual, ucol_equal (in C), Collator::greater, Collator::greaterOrEqual, Collator::equals (in C++) and Collator.equals (in Java).
Examples
C:
C++:
Java:
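A minimal sketch of comparing two strings and interpreting the int result, again using the JDK's java.text.Collator as a stand-in for ICU4J's Collator. The class CompareDemo and the helper describe are hypothetical; the labels "less", "equal" and "greater" simply mirror the three outcomes described above.

```java
import java.text.Collator; // assumption: JDK stand-in for com.ibm.icu.text.Collator
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
final class CompareDemo {

    // Maps the sign of Collator.compare to a readable label.
    static String describe(Collator collator, String source, String target) {
        int result = collator.compare(source, target);
        if (result < 0) {
            return "less";
        }
        if (result > 0) {
            return "greater";
        }
        return "equal";
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        System.out.println(describe(collator, "apple", "banana")); // less
        System.out.println(describe(collator, "banana", "apple")); // greater
        System.out.println(describe(collator, "pear", "pear"));    // equal
        // Convenience method returning a boolean.
        System.out.println(collator.equals("pear", "pear"));       // true
    }
}
```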
GetSortKey
The C API provides the ucol_getSortKey function, which requires (apart from a pointer to a valid UCollator instance) an original UChar pointer, together with its length. It also requires a pointer to a receiving buffer and its length.
The C++ API provides the Collator::getSortKey method with parameters similar to the C version. It also provides Collator::getCollationKey, which produces a CollationKey object instance (a wrapper around a sort key).
The Java API provides only the Collator.getCollationKey method, which produces a CollationKey object instance (a wrapper around a sort key).
Sort keys are generally only useful in databases or other circumstances where function calls are extremely expensive. See Sortkeys vs Comparison.
Sort Key Features
ICU writes sort keys as sequences of bytes.
Each sort key ends with one 00 byte and does not contain any other 00 byte. The terminating 00 byte is included in the length of the sort key as returned by the API (unlike any other ICU API where terminating NUL bytes or characters are not counted as part of the length).
Sort key byte sequences must be compared with an unsigned-byte comparison, as with strcmp().
Comparing the sort keys of two strings from the same collator yields the same ordering as using the collator to compare the two strings directly.
Sort keys from different collators (different locale or strength or any other attributes/settings) are not comparable.
Sort keys can be "merged" as described in UTS #10 Merging Sort Keys, via ucol_mergeSortkeys().
Any further analysis or parsing of sort keys is not supported. Sort keys will change from one ICU version to another; therefore, if sort keys are stored in a database or other persistent storage, then each upgrade requires their regeneration.
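The ordering guarantee for sort keys can be checked with a short sketch. It uses the JDK's java.text.CollationKey, whose byte layout is not the ICU sort key format described in this section, so only the ordering property is illustrated, not the 00 terminator. The class SortKeyDemo and both helper methods are hypothetical.

```java
import java.text.CollationKey; // assumption: JDK stand-in for the ICU4J class
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
final class SortKeyDemo {

    // Unsigned byte-by-byte comparison; a proper prefix sorts first.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // True if direct comparison, key comparison and raw byte comparison all agree in sign.
    static boolean keysAgreeWithCompare(Collator collator, String s, String t) {
        CollationKey ks = collator.getCollationKey(s);
        CollationKey kt = collator.getCollationKey(t);
        int direct = Integer.signum(collator.compare(s, t));
        int viaKeys = Integer.signum(ks.compareTo(kt));
        int viaBytes = Integer.signum(compareUnsigned(ks.toByteArray(), kt.toByteArray()));
        return direct == viaKeys && direct == viaBytes;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        System.out.println(keysAgreeWithCompare(collator, "apple", "banana"));
        System.out.println(keysAgreeWithCompare(collator, "abc", "ABC"));
    }
}
```

Both keys must come from the same collator instance configuration; keys produced under different locales or strengths are not comparable.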
Implementation notes: (Not supported as permanent constraints on sort keys)
Byte 02 was unique as a merge separator for some versions of ICU before ICU 53. Since ICU 53, 02 is also used in regular collation weights where there is no conflict (to expand the number of available short weights).
Byte 01 has been unique as a level separator. This is not strictly necessary for non-primary levels. (A level's compressible "common" weight as its level separator would yield shorter sort keys.) However, the current implementation of ucol_mergeSortkeys() relies on it. (Also, test code currently examines sort keys for finding the strength of a comparison difference.) This may change in the future, especially if ucol_mergeSortkeys() were to become deprecated.
Level separators are likely to be equivalent to single-byte weights (possibly compressible): Multi-byte level separators would noticeably lengthen sort keys for short strings.
The byte values used in several ICU versions for sort keys and collation elements are documented in the “Special Byte Values” design doc on the ICU site.
Sort Key Output Buffer
ucol_getSortKey() can operate in 'preflighting' mode, which returns the amount of memory needed to store the resulting sort key. This mode is automatically activated if the output buffer size passed is set to zero. Should the sort key become longer than the buffer provided, the function again slips into preflighting mode. In that case, the overall performance is poorer than if the function had been called with a zero-size output buffer. If the size of the sort key returned is greater than the size of the buffer provided, the content of the result buffer is undefined. In that case, the result buffer can be reallocated to the proper size and the sort key generator function called again.
The best way to generate a series of sort keys is to do the following:
void GetSortKeys(const UCollator* coll, const UChar*
Note: Although the API allows you to call ucol_getSortKey with NULL to see what the sort key length is, it is strongly recommended that you NOT determine the length first, then allocate and fill the sort key buffer. Doing so requires twice the processing, since computing the length has to do the same calculation as actually getting the sort key. Instead, the example shown above uses a stack buffer.
Using Iterators for String Comparison
ICU4C's ucol_strcollIter API allows for comparing two strings that are supplied as character iterators (UCharIterator). This is useful when you need to compare differently encoded strings using strcoll. In that case, converting the strings first would probably be wasteful, since strcoll usually gives the result before the whole strings are processed. This API is implemented only as a C function in ICU4C. There are no equivalent C++ or ICU4J functions.
...
Obtaining Partial Sort Keys
When using different sort algorithms, such as radix sort, it is sometimes useful to process strings only as much as needed to feed into the sorting algorithm. For that purpose, ICU provides the ucol_nextSortKeyPart API, which also takes character iterators. This API allows for iterating over subsequent pieces of an uncompressed sort key. Between calls to the API you need to save a 64-bit state. Following is an example of simulating a string compare function using the partial sort key API. Your own usage model will likely look quite different.
static UCollationResult compareUsingPartials(UCollator *coll,
Other Examples
A longer example is presented in the 'Examples' section. Here is an illustration of the usage model.
C:
#define MAX_KEY_SIZE 100
C++:
#define MAX_LIST_LENGTH 5
Java:
String s [] = {
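As a rough illustration of the sort key usage model, the following sketch computes one collation key per string up front and then sorts the keys, so that no collation work is repeated during the sort. It uses the JDK's java.text classes as a stand-in for ICU4J; the class KeySortDemo and the method sortWithKeys are hypothetical.

```java
import java.text.CollationKey; // assumption: JDK stand-in for the ICU4J class
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
final class KeySortDemo {

    // Sorts strings by precomputed collation keys and returns the source strings in order.
    static List<String> sortWithKeys(Collator collator, List<String> input) {
        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            keys.add(collator.getCollationKey(s)); // one key computation per string
        }
        Collections.sort(keys); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> input = Arrays.asList("pear", "Apple", "banana");
        System.out.println(sortWithKeys(collator, input)); // [Apple, banana, pear]
    }
}
```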
CollationElementIterator
A collation element iterator can only be used in one direction. This is established at the time of the first call to retrieve a collation element. Once ucol_next (C), CollationElementIterator::next (C++) or CollationElementIterator.next (Java) has been invoked, ucol_previous (C), CollationElementIterator::previous (C++) or CollationElementIterator.previous (Java) should not be used (and vice versa). The direction can be changed immediately after ucol_first, ucol_last, ucol_reset (in C), CollationElementIterator::first, CollationElementIterator::last, CollationElementIterator::reset (in C++) or CollationElementIterator.first, CollationElementIterator.last, CollationElementIterator.reset (in Java) is called, or when the iterator reaches the end of the string while traversing it.
Once ucol_next reaches the end of the string buffer, it returns UCOL_NULLORDER, and so does every subsequent call to ucol_next. The same applies to ucol_previous.
An example of how iterators are used is the Boyer-Moore search implementation, which can be found in the samples section.
API Example
C:
UCollator *coll = ucol_open("en_US",status);
C++:
UErrorCode status = U_ZERO_ERROR;
Java:
try {
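A minimal sketch of walking collation elements forward until NULLORDER is returned, then walking backward once the end of the string has been reached. It uses the JDK's java.text.CollationElementIterator as a stand-in for the ICU4J class and assumes that the JDK's English collator is a RuleBasedCollator. The class ElementIteratorDemo and the method forwardThenBackward are hypothetical.

```java
import java.text.CollationElementIterator; // assumption: JDK stand-in for the ICU4J class
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
final class ElementIteratorDemo {

    // Returns two lists of primary weights: index 0 read forward, index 1 read backward.
    static List<List<Integer>> forwardThenBackward(String text) {
        // Assumption: the JDK's English collator is rule-based.
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        CollationElementIterator it = collator.getCollationElementIterator(text);

        List<Integer> forward = new ArrayList<>();
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            forward.add(CollationElementIterator.primaryOrder(element));
        }

        // The iterator has reached the end of the string, so the direction may change.
        List<Integer> backward = new ArrayList<>();
        while ((element = it.previous()) != CollationElementIterator.NULLORDER) {
            backward.add(CollationElementIterator.primaryOrder(element));
        }

        List<List<Integer>> result = new ArrayList<>();
        result.add(forward);
        result.add(backward);
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> passes = forwardThenBackward("abc");
        System.out.println("forward:  " + passes.get(0));
        System.out.println("backward: " + passes.get(1));
    }
}
```

For a simple string without contractions or expansions, the backward pass yields the same primary weights as the forward pass in reverse order.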
Setting and Getting Attributes
The general attribute setting APIs are ucol_setAttribute (in C) and Collator::setAttribute (in C++). These APIs take an attribute name and an attribute value. If the name and the value pass a syntax and range check, the property of the collator is changed. If they do not pass the check, the state is not changed and the error code variable is set to an error condition. The Java version does not provide general attribute setting APIs. Instead, each attribute has its own setter API of the form RuleBasedCollator.setATTRIBUTE_NAME(arguments).
The attribute getting APIs are ucol_getAttribute (C) and Collator::getAttribute (C++). Both APIs require an attribute name as an argument and return an attribute value if a valid attribute name was supplied. If a valid attribute name was not supplied, they return an undefined result and set the error code. As with the setter APIs, the Java version provides no generic getter API. Instead, each attribute has its own getter API of the form RuleBasedCollator.getATTRIBUTE_NAME().
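A minimal sketch of the per-attribute setter and getter style in Java, using the strength attribute. It relies on the JDK's java.text.Collator, which offers setStrength and getStrength; the exception thrown for an out-of-range value is the JDK's behavior and is only assumed to be similar in ICU4J. The class AttributeDemo and its methods are hypothetical.

```java
import java.text.Collator; // assumption: JDK stand-in for com.ibm.icu.text.Collator
import java.util.Locale;

// Hypothetical demo class; not part of ICU or the JDK.
final class AttributeDemo {

    // True if the two strings compare as equal at the given strength.
    static boolean equalAtStrength(int strength, String s, String t) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(s, t) == 0;
    }

    // True if an out-of-range strength is rejected and the old value is kept.
    static boolean rejectsInvalidStrength() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        int before = collator.getStrength();
        try {
            collator.setStrength(99); // not a valid strength constant
            return false;
        } catch (IllegalArgumentException e) {
            return collator.getStrength() == before;
        }
    }

    public static void main(String[] args) {
        // Case differences are ignored at primary strength but not at tertiary.
        System.out.println(equalAtStrength(Collator.PRIMARY, "abc", "ABC"));  // true
        System.out.println(equalAtStrength(Collator.TERTIARY, "abc", "ABC")); // false
        System.out.println(rejectsInvalidStrength());                          // true
    }
}
```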
References:
Mark Davis, Ken Whistler: "Unicode Technical Standard #10, Unicode Collation Algorithm" (http://www.unicode.org/unicode/reports/tr10/)
Mark Davis: "ICU Collation Design Document" (http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm)
The Unicode Standard 3.0, chapter 5, "Implementation guidelines" (http://www.unicode.org/unicode/uni2book/ch05.pdf)
Laura Werner: "Efficient text searching in Java: Finding the right string in any language" (http://icu-project.org/docs/papers/efficient_text_searching_in_java.html)
Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization Forms" (http://www.unicode.org/unicode/reports/tr15/)