
Case Mappings

Overview

Case mapping converts characters between their uppercase, lowercase, and title case forms for a given language. Case is a normative property of characters in specific alphabets (e.g. Latin, Greek, Cyrillic, Armenian, and archaic Georgian) whereby characters are considered to be variants of a single letter. ICU refers to these variants, which may differ markedly in shape and size, as uppercase letters (also known as capital or majuscule) and lowercase letters (also known as small or minuscule). Alphabets with case differences are called bicameral, and alphabets without case differences are called unicameral.

Due to the inclusion of certain composite characters for compatibility, such as the Latin capital letter 'DZ' (\u01F1 'Ǳ'), there is a third case called title case. The title case form of a character is used when only the first letter of a word is capitalized, such as the Latin capital letter 'D' with small letter 'z' (\u01F2 'ǲ'). The term "title case" can also refer to words whose first letter is an uppercase or title case letter and whose remaining letters are lowercase. However, not all words in the title of a document, and not all first words of a sentence, are title case words; their use is language dependent. For example, in English, "Taming of the Shrew" is the appropriate capitalization, not "Taming Of The Shrew".
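As a rough illustration of the three case forms, the sketch below uses Python's built-in string methods rather than ICU. Those methods apply the Unicode default case mappings, so the 'DZ' digraph behaves as described above. The helper name case_forms is hypothetical.

```python
# Illustration only: Python's built-in str methods, not ICU APIs.
DZ_UPPER = "\u01F1"  # LATIN CAPITAL LETTER DZ
DZ_TITLE = "\u01F2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
DZ_LOWER = "\u01F3"  # LATIN SMALL LETTER DZ


def case_forms(ch):
    """Return the upper, title, and lower forms of a single character."""
    return {"upper": ch.upper(), "title": ch.title(), "lower": ch.lower()}


if __name__ == "__main__":
    # Show the code point of each case form of the lowercase digraph.
    for kind, form in case_forms(DZ_LOWER).items():
        print(kind, f"U+{ord(form):04X}")

    # Algorithmic title-casing capitalizes every word, which is not the
    # editorial convention for English titles.
    print("taming of the shrew".title())
```

The last line shows why title-casing a whole string is not the same as producing a correctly capitalized English title.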

Note: Although the archaic Georgian script contained uppercase and lowercase pairs, they are rarely used in modern Georgian.

Sample code is available in the ICU source code library at icu/source/samples/ustring/ustring.cpp.

Please refer to The Unicode Standard for more information about case mapping:

  • 3.13 Default Case Algorithms
  • 4.2 Case
  • 5.18 Case Mappings

Simple (Single-Character) Case Mapping

Simple case mapping in ICU is language-independent: it is a generic one-to-one map from a single character to a single character.

A character is considered to have a lowercase, uppercase, or title case equivalent if there is a respective "simple" case mapping specified for the character in the Unicode Character Database (UnicodeData.txt). If a character has no mapping equivalent, the result is the character itself.
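The lookup rule above can be sketched as a small table built from records in UnicodeData.txt format. This is a minimal sketch in Python, not ICU's implementation; the function names and the three-record sample are chosen for illustration. It assumes the standard field layout, in which fields 12, 13, and 14 (counting from 0) hold the simple uppercase, lowercase, and title case mappings, and an empty title case field defaults to the uppercase mapping.

```python
# Three records in UnicodeData.txt format (semicolon-separated fields).
SAMPLE_UNICODE_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
"""


def load_simple_mappings(text):
    """Map each code point to its (upper, lower, title) simple mappings."""
    table = {}
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        # An empty field means "no mapping" and is stored as None.
        table[code_point] = tuple(
            int(field, 16) if field else None for field in fields[12:15]
        )
    return table


def simple_map(table, code_point, kind):
    """Return the simple mapping, or the code point itself if none exists."""
    upper, lower, title = table.get(code_point, (None, None, None))
    if kind == "title" and title is None:
        title = upper  # an empty title case field defaults to uppercase
    mapped = {"upper": upper, "lower": lower, "title": title}[kind]
    return code_point if mapped is None else mapped


if __name__ == "__main__":
    table = load_simple_mappings(SAMPLE_UNICODE_DATA)
    for cp in (0x41, 0x61, 0xDF):
        print(f"U+{cp:04X} upper -> U+{simple_map(table, cp, 'upper'):04X}")
```

Note that U+00DF (sharp s) has no simple uppercase mapping, so the simple mapping returns the character unchanged.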

The APIs provided for simple case mapping, located in uchar.h, handle only single characters of type UChar32 and return only single characters. To convert a string to a specific case without language-specific behavior, use the APIs in either unistr.h or ustring.h with the root locale (an empty string ""). Note that in the ustring.h functions a NULL locale argument selects the default locale, which may itself be language-specific.

Full (Language-Specific) Case Mapping

Case mappings differ between locales. For instance, unlike in English, the uppercase equivalent of the Latin small letter 'i' in Turkish is the Latin capital letter 'I' with dot above (\u0130 'İ').

As with the simple case mapping API, a character is considered to have a lowercase, uppercase, or title case equivalent if there is a respective mapping specified for the character in the Unicode Character Database (UnicodeData.txt, supplemented by SpecialCasing.txt for multi-character and conditional mappings). If a character has no mapping equivalent, the result is the character itself.

To convert a string to a specific case using language-specific rules, use the APIs in ustring.h and unistr.h and pass the intended locale as the locale argument.
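The effect of passing a locale can be sketched as follows. Python's built-in case methods are not locale-aware, so this simplified sketch hand-codes only the dotted and dotless 'i' rule for Turkish and Azerbaijani. The function names and the language parameter are hypothetical and are not ICU's API, and the sketch ignores combining-dot sequences that a complete implementation handles.

```python
# Illustration only: a hand-written Turkic "i" rule on top of Python's
# locale-independent str.upper() and str.lower().
TURKIC_LANGUAGES = {"tr", "az"}

DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def to_upper(text, language=""):
    """Uppercase text; in Turkic languages 'i' maps to dotted capital I."""
    if language in TURKIC_LANGUAGES:
        text = text.replace("i", DOTTED_CAPITAL_I)
    return text.upper()


def to_lower(text, language=""):
    """Lowercase text; in Turkic languages 'I' maps to dotless small i."""
    if language in TURKIC_LANGUAGES:
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


if __name__ == "__main__":
    print(to_upper("istanbul"))        # language-independent result
    print(to_upper("istanbul", "tr"))  # Turkish result
    print(to_lower("ISTANBUL", "tr"))
```

The same input therefore produces different output depending on the language argument, which is the behavior a locale argument selects in the string APIs.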

ICU implements full Unicode string case mappings. In general, case mapping:

  • can change the number of code points and/or code units of a string,
  • is language-sensitive (results may differ depending on language), and
  • is context-sensitive (a character in the input string may map differently depending on surrounding characters).
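Two of these properties can be observed with Python's built-in string methods, which apply the Unicode default full case mappings. This is an illustration, not ICU code, and the helper names are hypothetical. Python's built-ins are not language-sensitive, so that property is not shown here.

```python
# Illustration only: Python's built-in str methods, not ICU APIs.
CAPITAL_SIGMA = "\u03A3"  # GREEK CAPITAL LETTER SIGMA
MEDIAL_SIGMA = "\u03C3"   # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"    # GREEK SMALL LETTER FINAL SIGMA


def length_change(text):
    """Return the code point count before and after uppercasing."""
    return len(text), len(text.upper())


def lowercase_sigmas(text):
    """Lowercase text and return only the sigma forms in the result."""
    return [ch for ch in text.lower() if ch in (MEDIAL_SIGMA, FINAL_SIGMA)]


if __name__ == "__main__":
    # The number of code points can change: sharp s uppercases to "SS".
    print(length_change("\u00DF"))

    # Context sensitivity: the same capital sigma lowercases differently
    # at the start of a word and at the end of a word.
    word = CAPITAL_SIGMA + "\u039F" + CAPITAL_SIGMA
    print([f"U+{ord(ch):04X}" for ch in lowercase_sigmas(word)])
```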

Case Folding

Case folding maps strings to a canonical form in which case differences are erased. With the case folding API, ICU supports fast case-insensitive matching in lookups: once strings are folded, only a binary comparison is required.

The CaseFolding.txt file in the Unicode Character Database is used for performing locale-independent case folding. This text file is generated from the case mappings in the Unicode Character Database, using both the single-character and the multi-character mappings. It maps all characters that have different case forms to a common form. To compare two strings case-insensitively, you can fold each string and then use a binary comparison. There are also functions that compare two strings case-insensitively using the same case folding data.
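The fold-then-compare approach can be sketched with Python's str.casefold(), which applies the locale-independent Unicode full case folding. This stands in for the ICU folding functions and is not ICU code; the helper names are hypothetical.

```python
# Illustration only: Python's str.casefold(), not ICU APIs.
def caseless_equal(left, right):
    """Compare two strings without regard to case by folding both."""
    return left.casefold() == right.casefold()


def build_index(words):
    """Build a lookup table keyed by the folded form of each word."""
    return {word.casefold(): word for word in words}


def lookup(index, query):
    """Find the stored word matching query, ignoring case; None if absent."""
    return index.get(query.casefold())


if __name__ == "__main__":
    # Folding uses multi-character mappings, so sharp s matches "SS".
    print(caseless_equal("Stra\u00DFe", "STRASSE"))

    # Fold the keys once, then every lookup is a plain binary comparison.
    index = build_index(["Stra\u00DFe", "Berlin"])
    print(lookup(index, "BERLIN"))
```

Folding the keys once when the index is built means each later lookup folds only the query.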

Unicode case folding is not context-sensitive. It is also not language-sensitive, although there is a flag for whether to apply special mappings for use with Turkic (Turkish/Azerbaijani) text data.
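The effect of the Turkic flag can be sketched as an optional pre-pass before default folding. In this simplified Python sketch the turkic parameter is a hypothetical stand-in for that flag, and the two special mappings it applies are the Turkic entries for capital 'I' and capital 'I' with dot above; it is not ICU code.

```python
# Illustration only: str.casefold() plus a hand-written Turkic option.
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def fold_case(text, turkic=False):
    """Fold case; with turkic=True, apply the special 'I' mappings first."""
    if turkic:
        text = text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    return text.casefold()


if __name__ == "__main__":
    for sample in ("I", DOTTED_CAPITAL_I):
        default = [f"U+{ord(ch):04X}" for ch in fold_case(sample)]
        special = [f"U+{ord(ch):04X}" for ch in fold_case(sample, turkic=True)]
        print(default, special)
```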

The case folding APIs are declared in:

  1. uchar.h for single character folding

  2. ustring.h and unistr.h for character string folding.
