Overview
Text processing requires that a program
treat text appropriately. If text is exchanged between several systems,
it is important for them to process the text consistently. This is done
by assigning each character, or a range of characters, attributes or
properties used for text processing, and by defining standard
algorithms for at least the basic text operations.
Traditionally, such attributes and algorithms were not well-defined for most character sets, and text processing had to rely on ad-hoc solutions. Over time, standards were created for querying properties of the system codepage. However, the set of these properties was limited, their data was not coordinated among implementations, and standard algorithms were not available.
One of the strengths of Unicode is that it not only defines a very large character set, but also assigns a comprehensive set of properties and usage notes to all characters. It defines standard algorithms for critical text processing, and the data is publicly provided and kept up-to-date. See http://www.unicode.org/ for more information.
Sample code is available in the ICU source code library at icu/source/samples/props/props.cpp. See also the source code for the Unicode browser demo application, which can be used online to browse Unicode characters with their properties.
Unicode Character Database properties in ICU APIs
The following table shows all Unicode Character Database properties (except for purely "extracted" ones and Unihan properties) and the corresponding ICU APIs. Most of the time, ICU4C provides functions in icu/source/common/unicode/uchar.h and ICU4J provides parallel functions in the com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in Java). Most properties are also available via UnicodeSet APIs and patterns.
See the Unicode Character Database itself for comparison. PropertyAliases.txt lists all properties by name and type.
Most properties that use binary, integer, or enumerated values are available via the functions u_hasBinaryProperty and u_getIntPropertyValue, which take UProperty enum constants to select the property. (ICU4J UCharacter member functions do not have the "u_" prefix.) The constant names include the long property name according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property value enum constant names often contain the short property name and the long value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the enumeration result type is also listed in the table.
Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and UnicodeSet and regular expression patterns use the long or short property aliases and property value aliases (see PropertyAliases.txt and PropertyValueAliases.txt).
There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK, for which the APIs do not use a single value but a bit-set (a mask) of zero or more values, with each bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to represent property value aliases for multiple general categories, like "Letters" (which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other words, there are two ICU properties for the same Unicode property: one delivering single values (for per-code point lookup) and the other delivering sets of values (for use with value aliases and UnicodeSet).
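The bit-set idea can be sketched in plain C without linking against ICU. The enum, macro, and function names below (Category, CAT_MASK, CAT_LETTER_MASK, category_in_mask) are hypothetical and abbreviated for illustration; they are not ICU identifiers, and the numeric values are assumptions of this sketch only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, abbreviated category enum for illustration only.
 * ICU's real enum is UCharCategory in uchar.h. */
typedef enum {
    CAT_UNASSIGNED = 0,
    CAT_UPPERCASE_LETTER,
    CAT_LOWERCASE_LETTER,
    CAT_TITLECASE_LETTER,
    CAT_MODIFIER_LETTER,
    CAT_OTHER_LETTER,
    CAT_DECIMAL_DIGIT_NUMBER
} Category;

/* One bit per single-valued category. */
#define CAT_MASK(cat) ((uint32_t)1 << (cat))

/* A multi-category alias such as "Letters" is the union of its members' bits. */
#define CAT_LETTER_MASK \
    (CAT_MASK(CAT_UPPERCASE_LETTER) | CAT_MASK(CAT_LOWERCASE_LETTER) | \
     CAT_MASK(CAT_TITLECASE_LETTER) | CAT_MASK(CAT_MODIFIER_LETTER) | \
     CAT_MASK(CAT_OTHER_LETTER))

/* Tests a single per-code-point value against a set of values. */
static bool category_in_mask(Category cat, uint32_t mask) {
    return (CAT_MASK(cat) & mask) != 0;
}
```

For example, category_in_mask(CAT_LOWERCASE_LETTER, CAT_LETTER_MASK) is true, while the same test with CAT_DECIMAL_DIGIT_NUMBER is false. ICU's uchar.h offers macros along the same lines (U_MASK, U_GET_GC_MASK, U_GC_L_MASK), which would normally be preferred over a hand-rolled version.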
| UCD Name (see PropertyAliases.txt) | Type | Note | ICU4C uchar.h / ICU4J UCharacter | UCD File (.txt) |
| --- | --- | --- | --- | --- |
| Age | Unicode version | (U) | C: u_charAge fills in UVersionInfo; Java: getAge returns a VersionInfo reference | DerivedAge |
| Alphabetic | binary | (U) | u_isUAlphabetic, UCHAR_ALPHABETIC | DerivedCoreProperties |
| ASCII_Hex_Digit | binary | (U) | UCHAR_ASCII_HEX_DIGIT | PropList |
| Bidi_Class | enum UCharDirection | (U) | u_charDirection, UCHAR_BIDI_CLASS | UnicodeData |
| Bidi_Control | binary | (U) | UCHAR_BIDI_CONTROL | PropList |
| Bidi_Mirrored | binary | (U) | u_isMirrored, UCHAR_BIDI_MIRRORED | UnicodeData |
| Bidi_Mirroring_Glyph | code point | | u_charMirror | BidiMirroring |
| Block | enum UBlockCode (growing) | (U) | ublock_getCode, UCHAR_BLOCK | Blocks |
| Canonical_Combining_Class | 0..255 | (U) | u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS | UnicodeData |
| Case_Folding | Unicode string | | u_strFoldCase (ustring.h) | CaseFolding |
| Case_Ignorable | binary | (U) | UCHAR_CASE_IGNORABLE | DerivedCoreProperties |
| Cased | binary | (U) | UCHAR_CASED | DerivedCoreProperties |
| Changes_When_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_CASEFOLDED | DerivedCoreProperties |
| Changes_When_Casemapped | binary | (U) | UCHAR_CHANGES_WHEN_CASEMAPPED | DerivedCoreProperties |
| Changes_When_NFKC_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED | DerivedNormalizationProps |
| Changes_When_Lowercased | binary | (U) | UCHAR_CHANGES_WHEN_LOWERCASED | DerivedCoreProperties |
| Changes_When_Titlecased | binary | (U) | UCHAR_CHANGES_WHEN_TITLECASED | DerivedCoreProperties |
| Changes_When_Uppercased | binary | (U) | UCHAR_CHANGES_WHEN_UPPERCASED | DerivedCoreProperties |
| Composition_Exclusion | binary | (c) | contributes to Full_Composition_Exclusion | CompositionExclusions |
| Dash | binary | (U) | UCHAR_DASH | PropList |
| Decomposition_Mapping | Unicode string | | NFKC Normalizer2::getRawDecomposition() | UnicodeData |
| Decomposition_Type | enum UDecompositionType | (U) | UCHAR_DECOMPOSITION_TYPE | UnicodeData |
| Default_Ignorable_Code_Point | binary | (U) | UCHAR_DEFAULT_IGNORABLE_CODE_POINT | DerivedCoreProperties |
| Deprecated | binary | (U) | UCHAR_DEPRECATED | PropList |
| Diacritic | binary | (U) | UCHAR_DIACRITIC | PropList |
| East_Asian_Width | enum UEastAsianWidth | (U) | UCHAR_EAST_ASIAN_WIDTH | EastAsianWidth |
| Expands_On_NF* | binary | | available via normalization API (normalizer2.h) | DerivedNormalizationProps |
| Extender | binary | (U) | UCHAR_EXTENDER | PropList |
| FC_NFKC_Closure | Unicode string | | u_getFC_NFKC_Closure | DerivedNormalizationProps |
| Full_Composition_Exclusion | binary | (U) | UCHAR_FULL_COMPOSITION_EXCLUSION | DerivedNormalizationProps |
| General_Category | enum (<= 32 values) | (U) | u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK, UCharCategory | UnicodeData |
| Grapheme_Base | binary | (U) | UCHAR_GRAPHEME_BASE | DerivedCoreProperties |
| Grapheme_Cluster_Break | enum UGraphemeClusterBreak | (U) | UCHAR_GRAPHEME_CLUSTER_BREAK | GraphemeBreakProperty |
| Grapheme_Extend | binary | (U) | UCHAR_GRAPHEME_EXTEND | DerivedCoreProperties |
| Grapheme_Link | binary | (U) | UCHAR_GRAPHEME_LINK | DerivedCoreProperties |
| Hangul_Syllable_Type | enum UHangulSyllableType | (U) | UCHAR_HANGUL_SYLLABLE_TYPE | HangulSyllableType |
| Hex_Digit | binary | (U) | UCHAR_HEX_DIGIT | PropList |
| Hyphen | binary | (U) | UCHAR_HYPHEN | PropList |
| ID_Continue | binary | (U) | UCHAR_ID_CONTINUE | DerivedCoreProperties |
| ID_Start | binary | (U) | UCHAR_ID_START | DerivedCoreProperties |
| Ideographic | binary | (U) | UCHAR_IDEOGRAPHIC | PropList |
| IDS_Binary_Operator | binary | (U) | UCHAR_IDS_BINARY_OPERATOR | PropList |
| IDS_Trinary_Operator | binary | (U) | UCHAR_IDS_TRINARY_OPERATOR | PropList |
| Indic_Matra_Category | (enum) | | provisional, not yet supported | IndicMatraCategory |
| Indic_Syllabic_Category | (enum) | | provisional, not yet supported | IndicSyllabicCategory |
| ISO_Comment | ASCII string | | u_getISOComment | UnicodeData |
| Jamo_Short_Name | ASCII string | (c) | contributes to Name | Jamo |
| Join_Control | binary | (U) | UCHAR_JOIN_CONTROL | PropList |
| Joining_Group | enum UJoiningGroup | (U) | UCHAR_JOINING_GROUP | ArabicShaping |
| Joining_Type | enum UJoiningType | (U) | UCHAR_JOINING_TYPE | ArabicShaping |
| Line_Break | enum ULineBreak | (U) | UCHAR_LINE_BREAK | LineBreak |
| Logical_Order_Exception | binary | (U) | UCHAR_LOGICAL_ORDER_EXCEPTION | PropList |
| Lowercase | binary | (U) | u_isULowercase, UCHAR_LOWERCASE | DerivedCoreProperties |
| Lowercase_Mapping | Unicode string + conditions | | available via u_strToLower (ustring.h) | UnicodeData + SpecialCasing |
| Math | binary | (U) | UCHAR_MATH | DerivedCoreProperties |
| Name | ASCII string | (U) | u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData |
| Name_Alias | ASCII string | | u_charName(U_CHAR_NAME_ALIAS) | NameAliases |
| NF*_QuickCheck | enum UNormalizationCheckResult (no/maybe/yes) | (U) | UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h) | DerivedNormalizationProps |
| NFKC_Casefold | Unicode string | | available via normalization API (normalizer2.h "nfkc_cf") | DerivedNormalizationProps |
| Noncharacter_Code_Point | binary | (U) | UCHAR_NONCHARACTER_CODE_POINT, U_IS_UNICODE_NONCHAR (utf.h) | PropList |
| Numeric_Type | enum UNumericType | (U) | UCHAR_NUMERIC_TYPE | UnicodeData |
| Numeric_Value | double | (U) | u_getNumericValue; Java/UnicodeSet: only non-negative integers, no fractions | UnicodeData |
| Other_Alphabetic | binary | (c) | contributes to Alphabetic | PropList |
| Other_Default_Ignorable_Code_Point | binary | (c) | contributes to Default_Ignorable_Code_Point | PropList |
| Other_Grapheme_Extend | binary | (c) | contributes to Grapheme_Extend | PropList |
| Other_Lowercase | binary | (c) | contributes to Lowercase | PropList |
| Other_Math | binary | (c) | contributes to Math | PropList |
| Other_Uppercase | binary | (c) | contributes to Uppercase | PropList |
| Pattern_Syntax | binary | (U) | UCHAR_PATTERN_SYNTAX | PropList |
| Pattern_White_Space | binary | (U) | UCHAR_PATTERN_WHITE_SPACE | PropList |
| Quotation_Mark | binary | (U) | UCHAR_QUOTATION_MARK | PropList |
| Radical | binary | (U) | UCHAR_RADICAL | PropList |
| Script | enum UScriptCode (growing) | (U) | uscript_getCode (uscript.h), UCHAR_SCRIPT | Scripts |
| Script_Extensions (provisional) | list of enum UScriptCode (growing) | (U) | uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS; UnicodeSet [:scx=Arab:] is a superset of [:sc=Arab:] | ScriptExtensions |
| Sentence_Break | enum USentenceBreak | (U) | UCHAR_SENTENCE_BREAK | SentenceBreakProperty |
| Simple_Case_Folding | code point | | u_foldCase | CaseFolding |
| Simple_Lowercase_Mapping | code point | | u_tolower | UnicodeData |
| Simple_Titlecase_Mapping | code point | | u_totitle | UnicodeData |
| Simple_Uppercase_Mapping | code point | | u_toupper | UnicodeData |
| Soft_Dotted | binary | (U) | UCHAR_SOFT_DOTTED | PropList |
| Special_Case_Condition | conditions | | available via u_strToLower etc. (ustring.h) | SpecialCasing |
| STerm | binary | (U) | UCHAR_S_TERM | PropList |
| Terminal_Punctuation | binary | (U) | UCHAR_TERMINAL_PUNCTUATION | PropList |
| Titlecase_Mapping | Unicode string + conditions | | u_strToTitle (ustring.h) | UnicodeData + SpecialCasing |
| Unicode_1_Name | ASCII string | (U) | u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData |
| Unified_Ideograph | binary | (U) | UCHAR_UNIFIED_IDEOGRAPH | PropList |
| Uppercase | binary | (U) | u_isUUppercase, UCHAR_UPPERCASE | DerivedCoreProperties |
| Uppercase_Mapping | Unicode string + conditions | | u_strToUpper (ustring.h) | UnicodeData + SpecialCasing |
| White_Space | binary | (U) | u_isUWhiteSpace, UCHAR_WHITE_SPACE | PropList |
| Word_Break | enum UWordBreakValues | (U) | UCHAR_WORD_BREAK | WordBreakProperty |
| XID_Continue | binary | (U) | UCHAR_XID_CONTINUE | DerivedCoreProperties |
| XID_Start | binary | (U) | UCHAR_XID_START | DerivedCoreProperties |
Notes:
- (c): This property only contributes to "real" properties (mostly "Other_..." properties), so there is no direct support for this property in ICU.
- (U): This property is available via the UnicodeSet APIs and patterns. Any property available in UnicodeSet is also available in regular expressions. Properties which are not available in UnicodeSet are generally those that are not available through a UProperty selector.
Customization
ICU does not provide the means to modify properties at runtime. The properties are provided exactly as specified by a recent version of the Unicode Standard (as published in the Character Database). However, if an application requires custom properties (for example, for Private Use characters), then it is possible to change or add them at build-time. This is done by modifying the Character Database files copied into the ICU source tree at icu/source/data/unidata. For the most common properties, the file to modify is UnicodeData.txt.
To add a character to such a file, a line must be inserted into the file with the format used in that file (see the online documentation on the Unicode site for more information). These files are processed by ICU tools at build time. For example, the genprops tool reads several of the files and writes the binary file uprops.dat, which is then packaged into the common ICU data file. It is important for the operation of those tools that the Unicode character code points of the entries are in ascending order (gaps are allowed). Any available Unicode code point (0 to 10ffff hexadecimal) can be used. Code point values should be written with 4, 5, or 6 hex digits, using the minimum number of digits possible (but no fewer than 4). Note that the Unicode Standard specifies that the 32 code points U+fdd0..U+fdef and the 34 code points U+...fffe and U+...ffff are not characters; therefore, they should not be added to any of the character database files.
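The constraints on added entries (valid range, 4 to 6 hex digits, ascending order, no noncharacters) can be checked before editing a data file. The following is a minimal sketch in plain C; the helper names are hypothetical and are not part of ICU or its build tools. It writes uppercase hex digits on the assumption that this matches the published data files.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: true for code points in 0..10FFFF (hex). */
static bool is_valid_code_point(int32_t c) {
    return c >= 0 && c <= 0x10FFFF;
}

/* Hypothetical helper: true for the 66 noncharacters, that is
 * U+FDD0..U+FDEF plus the last two code points of each plane. */
static bool is_noncharacter(int32_t c) {
    if (!is_valid_code_point(c)) return false;
    return (c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE;
}

/* Writes c with the minimum number of hex digits, but no fewer than 4.
 * Returns the number of digits written, or -1 on invalid input or a
 * buffer that is too small. Uppercase digits are an assumption. */
static int format_code_point(int32_t c, char *buf, size_t size) {
    if (!is_valid_code_point(c)) return -1;
    int n = snprintf(buf, size, "%04" PRIX32, (uint32_t)c);
    return (n >= 4 && (size_t)n < size) ? n : -1;
}

/* Entries must be in ascending code point order; gaps are allowed. */
static bool is_strictly_ascending(const int32_t *cps, size_t count) {
    for (size_t i = 1; i < count; ++i) {
        if (cps[i] <= cps[i - 1]) return false;
    }
    return true;
}
```

With these helpers, a Private Use code point such as U+E000 formats as four digits, a supplementary code point such as U+F0000 as five, and U+FDD0 or U+1FFFE would be rejected by is_noncharacter before being written.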
After modifying one of these files, the ICU data needs to be rebuilt. This is done via tools that are outside the normal ICU build, and the resulting files are checked into the ICU source tree. See the Unicode update changelog for details.
Properties in ICU Rule Syntax
ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic "spaces" to allow for the usage of white space characters outside of the normal ASCII range while still maintaining backward compatibility. See http://www.unicode.org/reports/tr31/#Pattern_Syntax for more information.
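As an illustration of treating Pattern_White_Space as syntactic space, here is a minimal standalone sketch in C. It hard-codes the eleven code points that make up the Pattern_White_Space set, and the function names are hypothetical. Code that already links against ICU would more naturally call u_hasBinaryProperty with UCHAR_PATTERN_WHITE_SPACE instead of keeping its own list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the Pattern_White_Space code points
 * U+0009..U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, U+2029. */
static bool is_pattern_white_space(int32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

/* Hypothetical helper for a rule parser: returns the index of the first
 * code point at or after start that is not Pattern_White_Space. */
static size_t skip_pattern_white_space(const int32_t *text, size_t length,
                                       size_t start) {
    size_t i = start;
    while (i < length && is_pattern_white_space(text[i])) {
        ++i;
    }
    return i;
}
```

Note that this set is narrower than White_Space: for example, U+00A0 NO-BREAK SPACE and U+3000 IDEOGRAPHIC SPACE are not Pattern_White_Space, so a parser built this way treats them as ordinary content.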