Overview

Normalization is used to convert text to a unique, equivalent form. Software can normalize equivalent strings to one particular sequence, such as normalizing composite character sequences into pre-composed characters. Normalization allows for easier sorting and searching of text.

The ICU normalization APIs support the standard normalization forms, which are described in great detail in Unicode Technical Report #15 (Unicode Normalization Forms) and in the Normalization, Sorting and Searching sections of chapter 5 of the Unicode Standard. ICU also supports related, additional operations. Some of them are described in Unicode Technical Note #5 (Canonical Equivalence in Applications).

New API

ICU 4.4 adds the Normalizer2 API (in Java, C++ and C), replacing almost all of the old Normalizer API. There is a design doc with many details. All of the replaced old API is now implemented as a thin wrapper around the new API. Here is a summary of the differences:
The new API does not replace a few pieces of the old API:
Data File Syntax

The gennorm2 tool accepts one or more .txt files and generates a .nrm binary data file for Normalizer2.getInstance(). For gennorm2 command line options, invoke gennorm2 --help.

gennorm2 starts with no data. If you want to include standard Unicode Normalization data, use the files in {ICU4C}/source/data/unidata/norm2/ . You can modify one of them, or provide it together with one or more additional files that add or remove mappings. Hangul/Jamo data (mappings and ccc=0) are predefined and cannot be modified. Mappings in one text file can override mappings in previous files of the same gennorm2 invocation.

Comments start with #. White space between tokens is ignored. Characters are written as hexadecimal code points. Combining class values are written as decimal numbers.

In each file, each character can have at most one mapping and at most one ccc (canonical combining class) value. A ccc value must not be 0. (ccc=0 is the default.) A two-way mapping must map to a sequence of exactly two characters. A one-way mapping can map to zero, one, two or more characters. Mapping to zero characters removes the original character in normalization.

The generator tool applies the mappings recursively to one another. Groups of mappings that are forbidden by the Unicode Normalization algorithms are reported as errors. For example, if a character has a two-way mapping, then neither of its mapping characters can have a one-way mapping.

    * Unicode 6.1     # Optional Unicode version (since ICU 49; default: uchar.h U_UNICODE_VERSION)
    00AA>0061         # One-way mapping
    0300..0314:230    # ccc for a code point range

It is possible to override mappings from previous source files, including removing a mapping:

    00AA-

Example

    #include "unicode/normalizer2.h"
    #include "unicode/unistr.h"

    using icu::Normalizer2;
    using icu::UnicodeString;

    class NormSample {
    public:
        // ICU service objects should be cached and reused, as usual.
        // Callers should check errorCode after construction and before using the object.
        NormSample(UErrorCode &errorCode)
                : nfkc(*Normalizer2::getInstance(NULL, "nfkc", UNORM2_COMPOSE, errorCode)),
                  // FCD is defined in terms of canonical decompositions, hence the "nfc" data.
                  fcd(*Normalizer2::getInstance(NULL, "nfc", UNORM2_FCD, errorCode)) {}

        // Normalize a string.
        UnicodeString toNFKC(const UnicodeString &s, UErrorCode &errorCode) {
            return nfkc.normalize(s, errorCode);
        }

        // Ensure FCD before processing (like in sort key generation).
        // In practice, almost all strings pass the FCD test, so it might make sense to
        // test for it and only normalize when necessary, rather than always normalizing.
        void processText(const UnicodeString &s, UErrorCode &errorCode) {
            UnicodeString fcdString;
            const UnicodeString *ps;  // points to either s or fcdString
            int32_t spanQCYes=fcd.spanQuickCheckYes(s, errorCode);
            if(U_FAILURE(errorCode)) {
                return;  // report error
            }
            if(spanQCYes==s.length()) {
                ps=&s;  // s is already in FCD
            } else {
                // unnormalized suffix as a read-only alias (does not copy characters)
                UnicodeString unnormalized=s.tempSubString(spanQCYes);
                // set the fcdString to the FCD prefix as a read-only alias
                fcdString.setTo(false, s.getBuffer(), spanQCYes);
                // automatic copy-on-write, and append the FCD'ed suffix
                fcd.normalizeSecondAndAppend(fcdString, unnormalized, errorCode);
                ps=&fcdString;
                if(U_FAILURE(errorCode)) {
                    return;  // report error
                }
            }
            // ... now process the string *ps which is in FCD ...
        }

    private:
        const Normalizer2 &nfkc;
        const Normalizer2 &fcd;
    };
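Returning to the data file syntax: the rules for comments, hexadecimal code points, decimal ccc values and mapping removal can be illustrated with a small line classifier. This is only a sketch in standard C++ and is not the gennorm2 parser. All names in it (LineKind, ParsedLine, parseLine and the helpers) are hypothetical. It recognizes only the line forms that appear in the sample data above, it assumes that a code point range may precede any of the delimiters, and it arbitrarily limits ccc values to three decimal digits.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; gennorm2 does not expose these.
enum class LineKind { Blank, Version, OneWayMapping, RemoveMapping, CombiningClass, Invalid };

struct ParsedLine {
    LineKind kind = LineKind::Blank;
    uint32_t start = 0;             // first code point of the source range
    uint32_t end = 0;               // last code point (== start for a single character)
    std::vector<uint32_t> mapping;  // one-way mapping target; empty means "remove the character"
    int ccc = 0;                    // combining class; set only for CombiningClass lines
};

static std::string trim(const std::string &s) {
    const size_t b = s.find_first_not_of(" \t\r\n");
    if (b == std::string::npos) return "";
    const size_t e = s.find_last_not_of(" \t\r\n");
    return s.substr(b, e - b + 1);
}

// Parses a hexadecimal code point; rejects empty text, non-hex digits and values above U+10FFFF.
static bool parseHex(const std::string &text, uint32_t &out) {
    const std::string t = trim(text);
    if (t.empty() || t.size() > 6) return false;
    uint32_t value = 0;
    for (unsigned char c : t) {
        if (!std::isxdigit(c)) return false;
        const int digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        value = value * 16 + static_cast<uint32_t>(digit);
    }
    if (value > 0x10FFFF) return false;
    out = value;
    return true;
}

ParsedLine parseLine(const std::string &raw) {
    ParsedLine result;
    // Comments start with '#'; surrounding white space is ignored.
    const std::string line = trim(raw.substr(0, raw.find('#')));
    if (line.empty()) return result;  // Blank
    if (line[0] == '*') { result.kind = LineKind::Version; return result; }

    result.kind = LineKind::Invalid;  // until the whole line has been validated
    const size_t delim = line.find_first_of(":>-");
    if (delim == std::string::npos) return result;
    const std::string source = trim(line.substr(0, delim));
    const std::string rest = trim(line.substr(delim + 1));

    // Source: a single code point or a range "start..end".
    const size_t dots = source.find("..");
    if (dots == std::string::npos) {
        if (!parseHex(source, result.start)) return result;
        result.end = result.start;
    } else if (!parseHex(source.substr(0, dots), result.start) ||
               !parseHex(source.substr(dots + 2), result.end) ||
               result.end < result.start) {
        return result;
    }

    switch (line[delim]) {
    case ':': {  // ccc value: decimal and never 0, because ccc=0 is the default
        if (rest.empty() || rest.size() > 3) return result;
        int ccc = 0;
        for (unsigned char c : rest) {
            if (!std::isdigit(c)) return result;
            ccc = ccc * 10 + (c - '0');
        }
        if (ccc == 0) return result;
        result.ccc = ccc;
        result.kind = LineKind::CombiningClass;
        return result;
    }
    case '>': {  // one-way mapping to zero or more code points
        std::istringstream tokens(rest);
        std::string token;
        while (tokens >> token) {
            uint32_t cp = 0;
            if (!parseHex(token, cp)) return result;
            result.mapping.push_back(cp);
        }
        result.kind = LineKind::OneWayMapping;
        return result;
    }
    default:  // '-': removes a mapping set by a previous source file
        assert(line[delim] == '-');
        if (!rest.empty()) return result;
        result.kind = LineKind::RemoveMapping;
        return result;
    }
}
```

With this sketch, parseLine("0300..0314:230") is classified as CombiningClass with the range 0300..0314 and ccc 230, parseLine("00AA-") as RemoveMapping, and a line such as "0300:0" as Invalid because a ccc value must not be 0. Two-way mappings and the cross-file checks that gennorm2 performs are deliberately left out.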
