Overview
Normalization is used to convert text to a unique, equivalent form. Systems can normalize Unicode-encoded text to one particular sequence, such as normalizing composite character sequences into pre-composed characters.
Normalizer allows for easier sorting and searching of text. Normalizer supports the standard normalization forms, which are described in great detail in Unicode Technical Report #15 (Unicode Normalization Forms) and Section 5.7 of the Unicode Standard.
Usage
Normalizer
transforms text into the canonical composed and decomposed forms. In
addition, you can have it perform compatibility decompositions so that
you can treat compatibility characters the same as their equivalents.
Normalizer adds one optional behavior, IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms in not normalizing Korean syllables. This option can be passed to the Normalizer constructors and to the static compose and decompose methods. It is turned off by default.
There are three common usage models for Normalizer:
You can use normalize() to process an entire input string at once.
You can create a Normalizer object and use it to iterate through the normalized form of a string by calling first() and next(). For
example, when you are comparing two strings you want to stop the
comparison as soon as a significant difference is found. This way, you
do not have the overhead of converting an entire string if only the
first characters are important.
You can use setIndex() and getIndex() to perform a random-access iteration.
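The second usage model, stopping a comparison as soon as a significant difference is found, can be sketched as follows. This is a toy model only: it does not use the Normalizer API, its decomposition table covers just three hand-picked characters, and it skips the canonical reordering of combining marks that real normalization performs. The names LazyDecomposer, toyDecomposition, equalWhenDecomposed, and kDone are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// End-of-text marker for this sketch (not a valid Unicode code point).
constexpr char32_t kDone = 0xFFFFFFFF;

// Toy canonical decomposition table: a tiny, hand-picked subset of Unicode.
// Returns an empty string when the character has no entry.
std::u32string toyDecomposition(char32_t c) {
    switch (c) {
        case U'\u00E9': return U"e\u0301";  // e with acute -> e + combining acute
        case U'\u00F1': return U"n\u0303";  // n with tilde -> n + combining tilde
        case U'\u00C5': return U"A\u030A";  // A with ring  -> A + combining ring
        default:        return std::u32string();
    }
}

// Yields the decomposed form of a string one code point at a time,
// decomposing each input character only when it is actually reached.
class LazyDecomposer {
public:
    explicit LazyDecomposer(std::u32string text) : text_(std::move(text)) {}

    char32_t next() {
        if (pendingPos_ < pending_.size()) return pending_[pendingPos_++];
        if (inputPos_ >= text_.size()) return kDone;
        const char32_t c = text_[inputPos_++];
        pending_ = toyDecomposition(c);
        if (pending_.empty()) return c;  // no decomposition: pass through
        pendingPos_ = 1;
        return pending_[0];
    }

    // Number of input code points read so far.
    std::size_t consumed() const { return inputPos_; }

private:
    std::u32string text_;
    std::u32string pending_;       // decomposition of the last input character
    std::size_t pendingPos_ = 0;   // next unread position in pending_
    std::size_t inputPos_ = 0;     // next unread position in text_
};

struct CompareResult {
    bool equal;
    std::size_t consumedFromA;  // input code points of `a` read before stopping
};

// Compares two strings in decomposed form and stops at the first difference.
CompareResult equalWhenDecomposed(const std::u32string& a,
                                  const std::u32string& b) {
    LazyDecomposer ia(a), ib(b);
    for (;;) {
        const char32_t ca = ia.next();
        const char32_t cb = ib.next();
        if (ca != cb) return {false, ia.consumed()};
        if (ca == kDone) return {true, ia.consumed()};
    }
}
```

With this sketch, the pre-composed string "caf\u00E9" and the decomposed string "cafe\u0301" compare equal, while comparing "xyz\u00E9\u00E9\u00E9" with "abc" stops after reading a single input character, so the accented characters are never decomposed.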
Transformation Methods
normalize() Normalizes a string using the given normalization operation.
compose() Composes a string, combining separate Unicode characters into their corresponding user characters.
decompose() Decomposes a string into its separate Unicode characters.
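To make the difference between composing and decomposing concrete, here is a minimal self-contained sketch. It is not the library's implementation and does not call compose() or decompose(): the functions toyCompose and toyDecompose, the ToyPair struct, and the three-entry table are assumptions made up for this illustration, and the sketch handles only a base character followed by a single combining mark.

```cpp
#include <cstddef>
#include <string>

// One toy mapping: a pre-composed character and its two-part decomposition.
struct ToyPair {
    char32_t composed;
    char32_t base;
    char32_t mark;
};

// Hand-picked subset of Unicode canonical decompositions (illustration only).
static const ToyPair kToyPairs[] = {
    {U'\u00E9', U'e', U'\u0301'},  // e with acute
    {U'\u00F1', U'n', U'\u0303'},  // n with tilde
    {U'\u00C5', U'A', U'\u030A'},  // A with ring above
};

// Replaces each known pre-composed character with base + combining mark.
std::u32string toyDecompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        bool replaced = false;
        for (const ToyPair& p : kToyPairs) {
            if (p.composed == c) {
                out += p.base;
                out += p.mark;
                replaced = true;
                break;
            }
        }
        if (!replaced) out += c;
    }
    return out;
}

// Replaces each known base + combining mark pair with the pre-composed character.
std::u32string toyCompose(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool merged = false;
        if (i + 1 < in.size()) {
            for (const ToyPair& p : kToyPairs) {
                if (p.base == in[i] && p.mark == in[i + 1]) {
                    out += p.composed;
                    ++i;  // skip the combining mark that was just merged
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) out += in[i];
    }
    return out;
}
```

In this sketch, toyDecompose turns the four-character string "caf\u00E9" into the five-character string "cafe\u0301", and toyCompose reverses that step. A real normalizer also has to deal with multiple combining marks, their canonical ordering, and compatibility mappings, none of which are modeled here.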
Movement Methods
Normalizer objects behave like iterators and have methods such as setIndex(), next(), and previous(). Note that while setIndex() and getIndex() refer to indices in the underlying Unicode input text, the next() and previous() methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next() and previous() and the indices passed to and returned from setIndex() and getIndex(). It is for this reason that Normalizer does not implement the CharacterIterator interface.
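The mismatch between input indices and output characters can be seen in a small model. The ToyNormalizer class below is hypothetical: it borrows the method names setIndex(), getIndex(), and next() from the text, but its decomposition table, its kToyDone marker, and the exact value getIndex() reports are assumptions of this sketch, not the behavior of the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// End-of-text marker for this sketch (not a valid Unicode code point).
constexpr char32_t kToyDone = 0xFFFFFFFF;

// Toy canonical decomposition table; empty result means "no decomposition".
std::u32string toyDecomposition(char32_t c) {
    switch (c) {
        case U'\u00E9': return U"e\u0301";  // e with acute
        case U'\u00F1': return U"n\u0303";  // n with tilde
        default:        return std::u32string();
    }
}

class ToyNormalizer {
public:
    explicit ToyNormalizer(std::u32string text) : text_(std::move(text)) {}

    // Index into the INPUT text: position of the next unread input character.
    std::size_t getIndex() const { return inputPos_; }

    // Repositions in the INPUT text and discards any buffered output.
    void setIndex(std::size_t inputIndex) {
        assert(inputIndex <= text_.size());
        inputPos_ = inputIndex;
        pending_.clear();
        pendingPos_ = 0;
    }

    // Returns the next character of the decomposed OUTPUT, or kToyDone.
    char32_t next() {
        if (pendingPos_ < pending_.size()) return pending_[pendingPos_++];
        if (inputPos_ >= text_.size()) return kToyDone;
        const char32_t c = text_[inputPos_++];
        pending_ = toyDecomposition(c);
        if (pending_.empty()) return c;
        pendingPos_ = 1;
        return pending_[0];
    }

private:
    std::u32string text_;
    std::u32string pending_;
    std::size_t pendingPos_ = 0;
    std::size_t inputPos_ = 0;
};

// Counts how many characters next() produces for the whole input.
std::size_t outputLength(const std::u32string& text) {
    ToyNormalizer n(text);
    std::size_t count = 0;
    while (n.next() != kToyDone) ++count;
    return count;
}
```

For the three-character input "\u00E9\u00F1z", this model produces five output characters. After the second call to next() the input index is still 1, because both returned characters came from the same input character, so the output position cannot be derived from the input index.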
Programming Examples in C and C++
Programming example for normalizing a string