
Properties

Overview

Text processing requires that a program treat text appropriately. If text is exchanged between several systems, it is important for them to process the text consistently. This is done by assigning each character, or a range of characters, attributes or properties used for text processing, and by defining standard algorithms for at least the basic text operations.

Traditionally, such attributes and algorithms have not been well-defined for most character sets, and text processing had to rely on ad-hoc solutions. Over time, standards were created for querying properties of the system codepage. However, the set of these properties was limited. Their data was not coordinated among implementations, and standard algorithms were not available.

It is one of the strengths of Unicode that it not only defines a very large character set, but also assigns a comprehensive set of properties and usage notes to all characters. It defines standard algorithms for critical text processing, and the data is publicly provided and kept up-to-date. See http://www.unicode.org/ for more information.

Sample code is available in the ICU source code library at icu/source/samples/props/props.cpp. See also the source code for the Unicode browser demo application, which can be used online to browse Unicode characters with their properties.

Unicode Character Database properties in ICU APIs

The following table shows all Unicode Character Database properties (except for purely "extracted" ones and Unihan properties) and the corresponding ICU APIs. Most of the time, ICU4C provides functions in icu/source/common/unicode/uchar.h and ICU4J provides parallel functions in the com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in Java). Most properties are also available via UnicodeSet APIs and patterns.

See the Unicode Character Database itself for comparison. PropertyAliases.txt lists all properties by name and type.

Most properties that use binary, integer, or enumerated values are available via functions u_hasBinaryProperty and u_getIntPropertyValue which take UProperty enum constants to select the property. (ICU4J UCharacter member functions do not have the "u_" prefix.) The constant names include the long property name according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property value enum constant names often contain the short property name and the long value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the enumeration result type is also listed here.

Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and UnicodeSet and regular expression patterns use the long or short property aliases and property value aliases (see PropertyAliases.txt and PropertyValueAliases.txt).

There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK, for which the APIs do not use a single value but a bit-set (a mask) of zero or more values, with each bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to represent property value aliases for multiple general categories, like "Letters" (which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other words, there are two ICU properties for the same Unicode property: one delivers single values (for per-code point lookup) and the other delivers sets of values (for use with value aliases and UnicodeSet).
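The single-value versus mask distinction can be sketched without ICU itself. In the following minimal sketch, the enum and helper names are hypothetical stand-ins, and the numeric values are assumed to mirror a few UCharCategory constants. uchar.h provides comparable helpers (such as U_MASK and U_GC_L_MASK); check the header for the exact definitions.

```cpp
#include <cstdint>

// Hypothetical stand-ins for a few general category values.
// The numbers are assumed to match ICU's UCharCategory; verify in uchar.h.
enum SketchCategory : int {
    kUnassigned = 0,
    kUppercaseLetter = 1,
    kLowercaseLetter = 2,
    kTitlecaseLetter = 3,
    kModifierLetter = 4,
    kOtherLetter = 5,
    kDecimalDigitNumber = 9
};

// One bit per category value, so several categories fit into one mask.
constexpr std::uint32_t categoryMask(SketchCategory c) {
    return std::uint32_t{1} << static_cast<int>(c);
}

// "Letters" stands for all five letter categories at once.
constexpr std::uint32_t kLetterMask =
    categoryMask(kUppercaseLetter) | categoryMask(kLowercaseLetter) |
    categoryMask(kTitlecaseLetter) | categoryMask(kModifierLetter) |
    categoryMask(kOtherLetter);

// A single category value matches a mask if its bit is set in the mask.
constexpr bool inCategories(SketchCategory c, std::uint32_t mask) {
    return (categoryMask(c) & mask) != 0;
}
```

A per-code point lookup yields one category value; a value alias such as "Letters" maps to a mask, and matching reduces to a single bitwise AND.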

UCD Name (see PropertyAliases.txt) Type   ICU4C uchar.h / ICU4J UCharacter UCD File (.txt)
Age Unicode version (U) C: u_charAge fills in UVersionInfo; Java: getAge returns a VersionInfo reference DerivedAge
Alphabetic binary (U) u_isUAlphabetic, UCHAR_ALPHABETIC DerivedCoreProperties
ASCII_Hex_Digit binary (U) UCHAR_ASCII_HEX_DIGIT PropList
Bidi_Class enum UCharDirection (U) u_charDirection, UCHAR_BIDI_CLASS UnicodeData
Bidi_Control binary (U) UCHAR_BIDI_CONTROL PropList
Bidi_Mirrored binary (U) u_isMirrored, UCHAR_BIDI_MIRRORED UnicodeData
Bidi_Mirroring_Glyph code point   u_charMirror BidiMirroring
Block enum UBlockCode (growing) (U) ublock_getCode, UCHAR_BLOCK Blocks
Canonical_Combining_Class 0..255 (U) u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS UnicodeData
Case_Folding Unicode string   u_strFoldCase (ustring.h) CaseFolding
Case_Ignorable binary (U) UCHAR_CASE_IGNORABLE DerivedCoreProperties
Cased binary (U) UCHAR_CASED DerivedCoreProperties
Changes_When_Casefolded binary (U) UCHAR_CHANGES_WHEN_CASEFOLDED DerivedCoreProperties
Changes_When_Casemapped binary (U) UCHAR_CHANGES_WHEN_CASEMAPPED DerivedCoreProperties
Changes_When_NFKC_Casefolded binary (U) UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED DerivedNormalizationProps
Changes_When_Lowercased binary (U) UCHAR_CHANGES_WHEN_LOWERCASED DerivedCoreProperties
Changes_When_Titlecased binary (U) UCHAR_CHANGES_WHEN_TITLECASED DerivedCoreProperties
Changes_When_Uppercased binary (U) UCHAR_CHANGES_WHEN_UPPERCASED DerivedCoreProperties
Composition_Exclusion binary (c) contributes to Full_Composition_Exclusion CompositionExclusions
Dash binary (U) UCHAR_DASH PropList
Decomposition_Mapping Unicode string   NFKC Normalizer2::getRawDecomposition() UnicodeData
Decomposition_Type enum UDecompositionType (U) UCHAR_DECOMPOSITION_TYPE UnicodeData
Default_Ignorable_Code_Point binary (U) UCHAR_DEFAULT_IGNORABLE_CODE_POINT DerivedCoreProperties
Deprecated binary (U) UCHAR_DEPRECATED PropList
Diacritic binary (U) UCHAR_DIACRITIC PropList
East_Asian_Width enum UEastAsianWidth (U) UCHAR_EAST_ASIAN_WIDTH EastAsianWidth
Expands_On_NF* binary   available via normalization API (normalizer2.h) DerivedNormalizationProps
Extender binary (U) UCHAR_EXTENDER PropList
FC_NFKC_Closure Unicode string   u_getFC_NFKC_Closure DerivedNormalizationProps
Full_Composition_Exclusion binary (U) UCHAR_FULL_COMPOSITION_EXCLUSION DerivedNormalizationProps
General_Category enum (<= 32 values) (U) u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK, UCharCategory UnicodeData
Grapheme_Base binary (U) UCHAR_GRAPHEME_BASE DerivedCoreProperties
Grapheme_Cluster_Break enum UGraphemeClusterBreak (U) UCHAR_GRAPHEME_CLUSTER_BREAK GraphemeBreakProperty
Grapheme_Extend binary (U) UCHAR_GRAPHEME_EXTEND DerivedCoreProperties
Grapheme_Link binary (U) UCHAR_GRAPHEME_LINK DerivedCoreProperties
Hangul_Syllable_Type enum UHangulSyllableType (U) UCHAR_HANGUL_SYLLABLE_TYPE HangulSyllableType
Hex_Digit binary (U) UCHAR_HEX_DIGIT PropList
Hyphen binary (U) UCHAR_HYPHEN PropList
ID_Continue binary (U) UCHAR_ID_CONTINUE DerivedCoreProperties
ID_Start binary (U) UCHAR_ID_START DerivedCoreProperties
Ideographic binary (U) UCHAR_IDEOGRAPHIC PropList
IDS_Binary_Operator binary (U) UCHAR_IDS_BINARY_OPERATOR PropList
IDS_Trinary_Operator binary (U) UCHAR_IDS_TRINARY_OPERATOR PropList
Indic_Matra_Category (enum)   provisional, not yet supported IndicMatraCategory
Indic_Syllabic_Category (enum)   provisional, not yet supported IndicSyllabicCategory
ISO_Comment ASCII string   u_getISOComment UnicodeData
Jamo_Short_Name ASCII string (c) contributes to Name Jamo
Join_Control binary (U) UCHAR_JOIN_CONTROL PropList
Joining_Group enum UJoiningGroup (U) UCHAR_JOINING_GROUP ArabicShaping
Joining_Type enum UJoiningType (U) UCHAR_JOINING_TYPE ArabicShaping
Line_Break enum ULineBreak (U) UCHAR_LINE_BREAK LineBreak
Logical_Order_Exception binary (U) UCHAR_LOGICAL_ORDER_EXCEPTION PropList
Lowercase binary (U) u_isULowercase, UCHAR_LOWERCASE DerivedCoreProperties
Lowercase_Mapping Unicode string + conditions   available via u_strToLower (ustring.h) UnicodeData + SpecialCasing
Math binary (U) UCHAR_MATH DerivedCoreProperties
Name ASCII string (U) u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) UnicodeData
Name_Alias ASCII string   u_charName(U_CHAR_NAME_ALIAS) NameAliases
NF*_QuickCheck enum UNormalizationCheckResult (no/maybe/yes) (U) UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h) DerivedNormalizationProps
NFKC_Casefold Unicode string   available via normalization API (normalizer2.h "nfkc_cf") DerivedNormalizationProps
Noncharacter_Code_Point binary (U) UCHAR_NONCHARACTER_CODE_POINT, U_IS_UNICODE_NONCHAR (utf.h) PropList
Numeric_Type enum UNumericType (U) UCHAR_NUMERIC_TYPE UnicodeData
Numeric_Value double (U) u_getNumericValue (Java/UnicodeSet: only non-negative integers, no fractions) UnicodeData
Other_Alphabetic binary (c) contributes to Alphabetic PropList
Other_Default_Ignorable_Code_Point binary (c) contributes to Default_Ignorable_Code_Point PropList
Other_Grapheme_Extend binary (c) contributes to Grapheme_Extend PropList
Other_Lowercase binary (c) contributes to Lowercase PropList
Other_Math binary (c) contributes to Math PropList
Other_Uppercase binary (c) contributes to Uppercase PropList
Pattern_Syntax binary (U) UCHAR_PATTERN_SYNTAX PropList
Pattern_White_Space binary (U) UCHAR_PATTERN_WHITE_SPACE PropList
Quotation_Mark binary (U) UCHAR_QUOTATION_MARK PropList
Radical binary (U) UCHAR_RADICAL PropList
Script enum UScriptCode (growing) (U) uscript_getCode (uscript.h), UCHAR_SCRIPT Scripts
Script_Extensions (provisional) list of enum UScriptCode (growing) (U) uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS (UnicodeSet [:scx=Arab:] is a superset of [:sc=Arab:]) ScriptExtensions
Sentence_Break enum USentenceBreak (U) UCHAR_SENTENCE_BREAK SentenceBreakProperty
Simple_Case_Folding code point   u_foldCase CaseFolding
Simple_Lowercase_Mapping code point   u_tolower UnicodeData
Simple_Titlecase_Mapping code point   u_totitle UnicodeData
Simple_Uppercase_Mapping code point   u_toupper UnicodeData
Soft_Dotted binary (U) UCHAR_SOFT_DOTTED PropList
Special_Case_Condition conditions   available via u_strToLower etc. (ustring.h) SpecialCasing
STerm binary (U) UCHAR_S_TERM PropList
Terminal_Punctuation binary (U) UCHAR_TERMINAL_PUNCTUATION PropList
Titlecase_Mapping Unicode string + conditions   u_strToTitle (ustring.h) UnicodeData + SpecialCasing
Unicode_1_Name ASCII string (U) u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) UnicodeData
Unified_Ideograph binary (U) UCHAR_UNIFIED_IDEOGRAPH PropList
Uppercase binary (U) u_isUUppercase, UCHAR_UPPERCASE DerivedCoreProperties
Uppercase_Mapping Unicode string + conditions   u_strToUpper (ustring.h) UnicodeData + SpecialCasing
White_Space binary (U) u_isUWhiteSpace, UCHAR_WHITE_SPACE PropList
Word_Break enum UWordBreakValues (U) UCHAR_WORD_BREAK WordBreakProperty
XID_Continue binary (U) UCHAR_XID_CONTINUE DerivedCoreProperties
XID_Start binary (U) UCHAR_XID_START DerivedCoreProperties

Notes:

  1. (c) - This property only contributes to "real" properties (mostly "Other_..." properties), so there is no direct support for this property in ICU.

  2. (U) - This property is available via the UnicodeSet APIs and patterns. Any property available in UnicodeSet is also available in regular expressions. Properties which are not available in UnicodeSet are generally those that are not available through a UProperty selector.

Customization

ICU does not provide the means to modify properties at runtime. The properties are provided exactly as specified by a recent version of the Unicode Standard (as published in the Character Database). However, if an application requires custom properties (for example, for Private Use characters), then it is possible to change or add them at build-time. This is done by modifying the Character Database files copied into the ICU source tree at icu/source/data/unidata. For the most common properties, the file to modify is UnicodeData.txt.

To add a character to such a file, a line must be inserted into the file with the format used in that file (see the online documentation on the Unicode site for more information). These files are processed by ICU tools at build time. For example, the genprops tool reads several of the files and writes the binary file uprops.dat, which is then packaged into the common ICU data file. It is important for the operation of those tools that the Unicode character code points of the entries are in ascending order (gaps are allowed). Any available Unicode code point (0 to 10ffff, hexadecimal) can be used. Code point values should be written with 4, 5, or 6 hex digits, using the minimum number of digits possible (but no fewer than 4). Note that the Unicode Standard specifies that the 32 code points U+fdd0..U+fdef and the 34 code points U+...fffe and U+...ffff are not characters; therefore, they should not be added to any of the character database files.
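The constraints on added entries (ascending order, valid range, no noncharacters, 4 to 6 hex digits) can be checked before the build tools run. The following is a minimal sketch and not part of ICU; all function names are hypothetical. It prints uppercase hex digits on the assumption that this matches the style of UnicodeData.txt.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in each of the 17 planes.
bool isNoncharacter(std::int32_t c) {
    if (c < 0 || c > 0x10FFFF) return false;
    return (c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE;
}

// Counts all noncharacters in the code point range; 32 + 34 are expected.
int countNoncharacters() {
    int n = 0;
    for (std::int32_t c = 0; c <= 0x10FFFF; ++c) {
        if (isNoncharacter(c)) ++n;
    }
    return n;
}

// Formats a code point (0..0x10FFFF) with the minimum number of hex digits,
// but never fewer than 4.
std::string formatCodePoint(std::int32_t c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(c));
    return std::string(buf);
}

// Checks that entries are strictly ascending, in range, and not noncharacters.
bool entriesAreValid(const std::vector<std::int32_t>& codePoints) {
    std::int32_t previous = -1;
    for (std::int32_t c : codePoints) {
        if (c <= previous || c > 0x10FFFF || isNoncharacter(c)) return false;
        previous = c;
    }
    return true;
}
```

For example, `formatCodePoint(0xE000)` gives `E000`, and a list containing U+FDD0 is rejected by `entriesAreValid`.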

After modifying one of these files, the ICU data needs to be rebuilt. This is done via tools that are outside the normal ICU build, and the resulting files are checked into the ICU source tree. See the Unicode update changelog for details.

Properties in ICU Rule Syntax

ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic "spaces" to allow for the usage of white space characters outside of the normal ASCII range while still maintaining backward compatibility. See http://www.unicode.org/reports/tr31/#Pattern_Syntax for more information.
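As an illustration, a rule parser might skip syntactic white space as follows. This is a sketch only: the function names are hypothetical, and the code point list is hard-coded from the Pattern_White_Space set as understood from UAX #31, so that the example has no dependencies. Code that links against ICU would more likely query UCHAR_PATTERN_WHITE_SPACE through u_hasBinaryProperty.

```cpp
#include <cstddef>
#include <string>

// Hard-coded Pattern_White_Space set (assumed from UAX #31 / PropList.txt):
// U+0009..U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, U+2029.
bool isPatternWhiteSpace(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

// Returns the index of the first character at or after pos
// that is not Pattern_White_Space (or rule.size() if there is none).
std::size_t skipPatternWhiteSpace(const std::u32string& rule, std::size_t pos) {
    while (pos < rule.size() && isPatternWhiteSpace(rule[pos])) {
        ++pos;
    }
    return pos;
}
```

Note that U+00A0 (no-break space) is not in this set, so a parser built this way treats it as ordinary rule content rather than as a separator.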
