ICU uses the UCA as the default starting point for ordering. Not all languages have sorting sequences that correspond to the UCA, because the UCA cannot simultaneously encompass the specifics of all the languages currently in use. Therefore, ICU provides a data-driven, flexible, and run-time customizable mechanism called "tailoring". Tailoring overrides the default order of code points and the values of the ICU Collation Service attributes.

Collation Rule

A tailoring is a set of rules. Each rule contains a string of ordered characters that starts with an anchor point or a reset value. The reset value is an absolute point that determines the order of other characters. For example, the rule "&a < g" places "g" after "a", and "a" does not change place. This rule has the following sorting consequences:
Note that only the word that starts with "g" has changed place. All the words sorted after "a" and "A" are sorted after "g".

This is a simple example of a tailoring rule. A tailoring consists of zero or more rules and zero or more options. There must be at least one rule or at least one option. The rule syntax is discussed in more detail in the following sections.

Note that the tailoring rules override the UCA ordering. In addition, if a character is reordered, any other equivalent characters are automatically reordered with it. For example, if the rule "&e<a" is used to reorder "a" in the list, "á" is also greater than "é".

Syntax

The following table summarizes the basic syntax necessary for most usages:
Escaping Rules

Most characters can be used as parts of rules. However, whitespace characters are skipped, and all ASCII characters that are not digits or letters are considered part of the syntax. To use these characters in rules, they must be escaped. Escaping can be done in several ways:
The following examples are other tailorings:

Serbian (Latin) or Croatian: & C < č <<< Č < ć <<< Ć

This rule is needed because the UCA usually treats accents as secondary differences from the base character. The rule ensures that 'č' and 'ć' are treated as base letters.
Serbian (Latin) or Croatian: & Đ < dž <<< Dž <<< DŽ

This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž" is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single letter that is sorted after "Đ" and before "E" ("Đ" sorts as a base letter after "D" in the UCA). Also note the capitalization of the letter "DŽ" in this example. There are three versions, since all three can legally appear in text. The fourth version, "dŽ", is omitted since it does not occur.
Danish: &V <<< w <<< W

The letter 'W' is sorted after 'V', but the difference is treated as tertiary, similar to the difference between 'v' and 'V'.
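The structure of the example rules above (a reset followed by a chain of relations) can be sketched in a few lines of Python. This is a toy reader only, not ICU's parser: the function name `parse_rule` and the strength labels are assumptions made for illustration, and escaping, options, expansions ("/") and prefixes ("|") are deliberately left out.

```python
import re

# Relation operators, listed longest first so that "<<<" is not misread
# as "<" followed by "<<".
_STRENGTHS = {"<": "primary", "<<": "secondary", "<<<": "tertiary", "=": "identical"}
_TOKEN = re.compile(r"(<<<|<<|<|=|&)")


def parse_rule(rule):
    """Split one reset chain into (anchor, [(strength, string), ...]).

    Hypothetical helper for illustration; it handles a single reset only.
    """
    # Whitespace is skipped in rules, so drop it before tokenizing.
    rule = "".join(rule.split())
    parts = [p for p in _TOKEN.split(rule) if p]
    if len(parts) < 2 or parts[0] != "&":
        raise ValueError("rule must start with a reset: '& anchor ...'")
    anchor, rest = parts[1], parts[2:]
    if len(rest) % 2:
        raise ValueError("dangling operator or string in rule")
    relations = []
    for op, text in zip(rest[0::2], rest[1::2]):
        if op not in _STRENGTHS:
            raise ValueError(f"expected a relation operator, got {op!r}")
        relations.append((_STRENGTHS[op], text))
    return anchor, relations


anchor, relations = parse_rule("& C < č <<< Č < ć <<< Ć")
# Strings longer than one character, such as "dž", are contractions.
_, dz = parse_rule("& Đ < dž <<< Dž <<< DŽ")
contractions = [text for _, text in dz if len(text) > 1]
```

In this sketch the Serbian rule yields the anchor "C" with two primary entries ('č', 'ć'), each followed by a tertiary (case) variant, which mirrors how the prose above describes the rule.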
Default Options

The tailoring inherits all the attribute values from the UCA unless they are explicitly redefined in the tailoring. The following table summarizes the option settings. UCA default options are in emphasis.
A tailoring that consists only of options is also a valid tailoring and has the same basic ordering as the UCA. The options that modify this tailoring are described in the following examples:

The Greek tailoring has option settings only: [normalization on]

The Latvian tailoring reorders uppercase and lowercase and uses backward French ordering:
Advanced Syntactical Elements

Several other syntactical elements are needed in more specific situations. These elements are summarized in the following table:
Indirect Positioning of Collation Elements

Since version 2.0, ICU allows for indirect positioning of collation elements. Similar to the option top, these options allow for positioning of the tailoring relative to significant sections of the UCA table. You can use the [before] option to position before these sections.
Not all of the indirect positioning anchors are useful. Most of the 'first' elements should be used with the [before] directive, in order to make sure that your tailoring will sort before an interesting section.

Following are several fragments of real tailorings, illustrating some of the advanced syntactical elements:

Expansion Example:

French: & A << æ/e <<< Æ/E

Letter 'Æ' is treated as a separate letter between 'A' and 'B'. However, the French language requires 'Æ' to be treated as a combination of the letters 'A' and 'E' and to sort as an accent variation of this combination. This is an example of an expansion.
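The effect of such an expansion on sort keys can be illustrated with a toy model. The weights below are invented for the example and are not ICU or UCA values; the table `CE` and the function `sort_key` are hypothetical names, and only the lowercase half of the French rule is modeled.

```python
# Toy collation elements: (primary, secondary, tertiary).
# All weights are invented for illustration.
CE = {
    "a": [(1, 0, 0)],
    "b": [(2, 0, 0)],
    "d": [(4, 0, 0)],
    "e": [(5, 0, 0)],
    "f": [(6, 0, 0)],
    # & a << æ / e : "æ" expands to the elements of "a" + "e", with a
    # secondary difference on the first element.
    "æ": [(1, 1, 0), (5, 0, 0)],
}


def sort_key(text):
    """Build a level-by-level key: all primaries, then all secondaries,
    then all tertiaries."""
    elements = [ce for ch in text for ce in CE[ch]]
    return tuple(tuple(ce[level] for ce in elements) for level in range(3))


words = ["b", "æ", "af", "ae", "ad"]
ordered = sorted(words, key=sort_key)
```

Under these assumed weights, "æ" has the same primary weights as "ae" and differs from it only at the secondary level, so it lands directly after "ae" and before "af" instead of forming a separate letter between "a" and "b".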
Prefix Example:

Prefixes are used in the Japanese tailoring to reduce the number of contractions. A large number of contractions is a performance burden, as their processing is much more complicated than the processing of regular elements. Prefixes should be used only to replace contractions followed by expansions and only if the expansion part is less frequent than the start of the contraction.
This could have been written as a series of contractions followed by expansion:
However, in that case ァ, ァ and ぁ would be treated as contractions. Since the prolonged sound mark (ー) occurs much less frequently than the other letters of Japanese Katakana and Hiragana, it is much more prudent to put the extra processing on it by using prefixes.

Example:

"Reset" always uses only the base character as the insertion point, even if there is an expansion. So the following rule,
is equivalent to
Which produces the following sort order: "JA" "MA" "KA" "KC" "JC" "MC"
The following are the collation elements for these strings with the specified rules:
Tailoring Issues

ICU uses canonical closure. This means that for each code point in Unicode, if the canonically composed form of a tailored string produces different collation elements than the canonically decomposed form, then the canonically composed form is effectively added to the ordering. If 'a' is tailored, for example, all of the accented 'a' characters are also tailored. Canonical closure allows collators to process Unicode strings in the FCD form as well as in NFD.

However, compatibility equivalents are NOT automatically added. If the rule "&b < a" is in the tailoring, and the order of ⓐ (circled a) is important, it should be explicitly tailored.

Redundant tailoring rules are removed, with later rules "winning". The strengths around the removed rules are also fixed.

Example:

The following table summarizes effects of different redundant rules.
If two different reset lists use the same character, it is removed from the first one (see 1 in the table above). If the second character is a reset, the second list is inserted in the first (see 2). If both are resets, then the same thing happens (see 3). Whenever such an insertion occurs, the second strength "postpones" the position (see 4). If there is a "[before N]" on the reset, then the reset character is effectively replaced by the item that would be before it, either in a previous tailoring (if the letter occurs in one; see 5) or in the UCA. The N determines the 'distance' before, based on the strength of the difference (see 6-8). However, this is subject to postponement (see 9), so be careful!

Reset semantics

The reset semantics in ICU 1.8 differ from those of previous ICU releases. Prior to version 1.8, the reset relation modifier was applicable only to the entry immediately following the reset entry. Also, the relation modifier applied to all entries that occurred until the next reset or primary relation. For example,

was equivalent to

Starting with ICU version 1.8, the modifier is equivalent to,

The new semantics produce more intuitive results, especially when the character after the reset is decomposable. Since all rules are converted to NFD before they are interpreted, this can result in contractions that the rule-writer might not be aware of. Expansion propagates only until the next reset or primary relation occurs. For example, with the following rule:

was equivalent to the following prior to ICU 1.8 and in Java,

Starting with 1.8, it is equivalent to,

& a = c / b <<< d / b << e / b <<< f / b < g <<< h

Known Limitations

The following are known limitations of the ICU collation implementation. These are theoretical limitations, however, since there are no known languages for which these limitations are an issue. However, for completeness they should be fixed in a future version after 1.8.1.
The examples given are designed for simplicity in testing, and do not match any real languages.

Expansion

The goal of expansion is to sort as if the expansion text were inserted right after the character. For example, with the rule

The text "...c..." should sort as if it were right after "...ae..." with a tertiary difference. There are a few cases where this is not currently true.

Recursive Expansion

Given the rules

Expansion should sort the text "...c..." as if it were just after "...ae...", and that should also sort as if it were just after "...agi...". This requires that the compilation of expansions be recursive (and check for loops as well!). ICU currently does not do this.
Contractions Spanning Expansions

ICU currently always pre-compiles the expansion into an internal format (a list of one or more collation elements) when the rule is compiled. If there is a contraction that spans the end of the expanded text and the start of the original text, however, that contraction will not match. A test case that illustrates this is:
Since the pre-compiled expansions are a huge performance gain, we will probably keep the implementation the way it is, but in the future allow additional syntax to indicate those few expansions that need to behave as if the text were inserted because of the existence of another contraction. Note that such expansions need to be recursively expanded (as in #1), but rather than at pre-compile time, these need to be done at runtime. While it is possible to automatically detect these cases, it would be better to allow explicit control in case spanning is not desired. An example of such syntax might be something like:

Notes: ICU does handle the case where there is a contraction that is completely inside the expansion. Suppose that someone had the rules:

These do not cause c to sort as if it were ae, nor should they.

Normalization

The goal of normalization is to have all text sort as if it were first normalized (converted into NFD). For performance reasons, the rules are pre-processed so there is no need to perform normalization on strings that are already in the FCD format. The vast majority of strings are in FCD.

Nulls in Contractions

Nulls should not be used in contractions that could invoke normalization.
Contractions Spanning Normalization

The following rule specifies that a grave accent followed by a b is a contraction, and sorts as if it were an e.

On this basis, "...àb..." should sort as if it were just after "...ae...". Because of the preprocessing, however, the contraction will not match if this text is represented with the pre-composed character à, but will match if given the decomposed sequence a + grave accent. The same thing happens if the contraction spans the start of a normalized sequence.
Variable Top

ICU lets you set the top of the variable range. This can be done, for example, to allow you to ignore just spaces, and not punctuation.

Variable Top Exclusion

There is currently a limitation that can cause variable top to exclude more characters than it should. This happens if you not only set variable top, but also tailor a number of characters around it with primary differences. The exact number that you can tailor depends on the internal "gaps" between the characters in the pre-compiled UCA table. Normally there is a gap of one. There are larger gaps between scripts (such as between Latin and Greek), and after certain other special characters. For example, if variable top is set to be at SPACE ('\u0020'), then it works correctly with up to 70 characters also tailored after space. However, if variable top is set to be equal to HYPHEN ('\u2010'), only one other value can be accommodated.
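The basic idea of a variable top can be sketched with a toy primary-weight table. This sketch assumes that variable characters are being ignored at the primary level; the weights, the `PRIMARY` table and the function `primary_key` are all invented for illustration and say nothing about the real UCA table or its gaps.

```python
# Invented primary weights; real UCA weights and their gaps differ.
PRIMARY = {" ": 1, "-": 2, ".": 3, "a": 10, "b": 11, "c": 12}


def primary_key(text, variable_top=None):
    """Return a primary-level key for text.

    Characters whose weight is at or below the weight of variable_top are
    treated as ignorable at this level. With variable_top=None, nothing
    is ignored.
    """
    limit = PRIMARY[variable_top] if variable_top is not None else 0
    return tuple(PRIMARY[ch] for ch in text if PRIMARY[ch] > limit)


# Variable top at SPACE: spaces are ignored, punctuation is not.
space_ignored = primary_key("a b", " ") == primary_key("ab", " ")
hyphen_ignored = primary_key("a-b", " ") == primary_key("ab", " ")
```

With the top set at the space character in this model, "a b" and "ab" get the same primary key while "a-b" does not; raising the top to "." would make the hyphen ignorable as well.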
Case Level/First/Second

In ICU, it is possible to override the tertiary settings programmatically. This is used to change the default case behavior to be all upper first or all lower first. It can also be used for a separate case level, or to ignore all other tertiary differences (such as between circled and non-circled letters, or between half-width and full-width katakana). The case values are derived directly from the Unicode character properties, and not set by the rules.

Mixed Case Contractions

There is currently a limitation that all contractions of multiple characters can only have three special case values: upper, lower, and mixed. All mixed-case contractions are grouped together, and are not affected by the upper first vs. lower first flag.
Cautions

The following are not known rule limitations, but rather cautions.

Resets

Since resets always work on the existing state, the user is required to make sure that the rule entries are in the proper order.
Postpone Insertion

When a reset is used to insert a value X with a certain strength difference after a value Y, X is actually inserted just before the next item of the same strength or higher following Y. Thus, the following are equivalent:
Jamo Tailoring

If Jamo characters are tailored, the code goes through a slow path, which has a significant effect on performance.

Compatibility Decompositions

When tailoring a letter, the customization affects all of its canonical equivalents. That is, if a tailoring rule sorts 'a' after 'e', for example, then "à", "á", ... are also sorted after 'e'. This is not true for compatibility equivalents. If the desired sorting order is for a superscript-a ("ª") to be after "e", it is necessary to specify the rule for that.

Case Differences

Similarly, tailoring "a" to be sorted after "e" does not move "A". If "A" should also be sorted after "e", a specific rule is required for that sorting sequence.

Automatic Expansions

ICU will automatically form expansions whenever a reset is to a multi-character value that is not a contraction. For example, & ab <<< c is equivalent to & a <<< c / b. The user may be unaware of this happening, since it may not be obvious that the reset is to a multi-character value. For example, & à <<< d is equivalent to & a <<< d / `
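The rewriting described under Automatic Expansions can be sketched with Python's standard `unicodedata` module. The helper `expand_reset` is a hypothetical name used only for this illustration; it assumes the reset string is not itself a contraction and ignores escaping, so it is a model of the idea and not of ICU's rule compiler.

```python
import unicodedata


def expand_reset(anchor, relation, target):
    """Rewrite '& anchor REL target' so the reset is to a single character.

    The anchor is first converted to NFD. Everything after its first
    character becomes the expansion part. Assumes the anchor is not a
    contraction.
    """
    if not anchor:
        raise ValueError("reset anchor must not be empty")
    decomposed = unicodedata.normalize("NFD", anchor)
    base, extra = decomposed[0], decomposed[1:]
    if not extra:
        # Single-character reset: nothing to expand.
        return f"& {base} {relation} {target}"
    return f"& {base} {relation} {target} / {extra}"


plain = expand_reset("ab", "<<<", "c")
accented = expand_reset("à", "<<<", "d")
```

Here `plain` is the string "& a <<< c / b". For the pre-composed "à", the expansion part is the combining grave accent U+0300 produced by NFD, which is the hidden multi-character reset the caution above warns about.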