This section explains how to handle Unicode strings with ICU in C and C++.
Sample code is available in the ICU source code library at icu/source/samples/ustring/ustring.cpp.
Strings are the most common and fundamental form of handling text in software. Logically, and often physically, they contain contiguous arrays (vectors) of basic units. Most of the ICU API functions work directly with simple strings, and where possible, this is preferred.
Sometimes, text needs to be accessed via more powerful and complicated methods. For example, text may be stored in discontiguous chunks in order to deal with frequent modification (like typing) and large amounts of text, or it may not be stored in the internal encoding, or it may have associated attributes like bold or italic styles.
ICU provides multiple text access interfaces which were added over time. If simple strings cannot be used, then consider one of the following: UText, Replaceable (UReplaceable in C), CharacterIterator, or UCharIterator.
The following provides some historical perspective and comparison between the interfaces.
ICU has long provided the CharacterIterator interface for some services. It allows for abstract text access, but has limitations:
The core Java adopted an early version of CharacterIterator; later functionality, like support for supplementary code points, was back-ported from ICU4C to ICU4J to form the UCharacterIterator class.
The UCharIterator C interface was added to allow for incremental normalization and collation in C. It is entirely code unit (UChar)-oriented, uses only post-increment iteration and has a smaller number of overridable methods.
The Replaceable (Java & C++) and UReplaceable (C) interfaces are designed for, and used in, Transliterator. They are random-access interfaces, not iterators.
The UText text access interface was designed as a possible replacement for all previous interfaces listed above, with additional functionality. It allows for high-performance operation through the use of storage-native indexes (for efficient use of non-UTF-16 text) and through accessing multiple characters per function call. Code point iteration is available with functions as well as with C macros, for maximum performance. UText is also writable, mostly patterned after Replaceable. For details see the UText chapter.
In Java, ICU uses the standard String and StringBuffer classes, char, etc. See the Java documentation for details.
Strings in C and C++ are, at the lowest level, arrays of some particular base type. In most cases, the base type is a char, which is an 8-bit byte in modern compilers. Some APIs use a "wide character" type wchar_t that is typically 8, 16, or 32 bits wide and upwards compatible with char. C code passes char * or wchar_t * pointers to the first element of an array. C++ enables you to create a class for encapsulating these kinds of character arrays in handy and safe objects.
The interpretation of the byte or wchar_t values depends on the platform, the compiler, the signed state of both char and wchar_t, and the width of wchar_t. These characteristics are not specified in the language standards. When using internationalized text, the encoding often uses multiple chars for most characters, and wchar_t units that are each wide enough to hold exactly one character code point value. Some APIs, especially in the standard library (stdlib), assume that wchar_t strings use a fixed-width encoding with exactly one character code point per wchar_t.
In order to take advantage of Unicode with its large character repertoire and its well-defined properties, there must be types with consistent definitions and semantics. The Unicode standard defines a default encoding based on 16-bit code units. This is supported in ICU by the definition of the UChar to be an unsigned 16-bit integer type. This is the base type for character arrays for strings in ICU.
With the UTF-16 encoding form, a single Unicode code point is encoded with either one or two 16-bit UChar code units (unambiguously). "Supplementary" code points, which are encoded with pairs of code units, are rare in most texts. The two code units are called "surrogates", and their unit value ranges are distinct from each other and from single-unit value ranges. Code should be generally optimized for the common, single-unit case.
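The one-or-two-unit rule can be sketched in plain C++ without ICU headers. The helper name appendCodePoint is hypothetical (ICU offers the U16_APPEND() macro and UnicodeString::append(UChar32) for this purpose); the sketch assumes the input is a valid code point in the range 0 to 0x10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not an ICU function): append one code point to a
// UTF-16 string as one code unit (BMP) or as a surrogate pair
// (supplementary code point). Assumes 0 <= c <= 0x10FFFF.
inline void appendCodePoint(std::u16string &s, std::int32_t c) {
    if (c <= 0xFFFF) {
        // Common case: a single 16-bit code unit.
        s.push_back(static_cast<char16_t>(c));
    } else {
        // Rare case: split the remaining 20 bits into two 10-bit halves.
        c -= 0x10000;
        s.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));    // lead surrogate
        s.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));  // trail surrogate
    }
}
```

The single-unit branch comes first because it is the common case, in line with the optimization advice above.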
16-bit Unicode strings in internal processing contain sequences of 16-bit code units that may not always be well-formed UTF-16. ICU treats single, unpaired surrogates as surrogate code points, i.e., they are returned in per-code point iteration, they are included in the number of code points of a string, and they are generally treated much like normal, unassigned code points in most APIs. Surrogate code points have Unicode properties although they cannot be assigned an actual character.
ICU string handling functions (including append, substring, etc.) do not automatically protect against producing malformed UTF-16 strings. Most of the time, indexes into strings are naturally at code point boundaries because they result from other functions that always produce such indexes. If necessary, the user can test for proper boundaries by checking the code unit values, or adjust arbitrary indexes to code point boundaries by using the C macros U16_SET_CP_START() and U16_SET_CP_LIMIT() (see utf.h) and the UnicodeString functions getChar32Start() and getChar32Limit().
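As a rough illustration of what adjusting an index to a code point boundary means, the following plain C++ sketch moves an index back when it points into the middle of a surrogate pair. The function name toCodePointStart is hypothetical; in ICU code the U16_SET_CP_START() macro or UnicodeString::getChar32Start() would normally be used instead.

```cpp
#include <cstddef>
#include <string>

inline bool isLeadUnit(char16_t u)  { return (u & 0xFC00) == 0xD800; }
inline bool isTrailUnit(char16_t u) { return (u & 0xFC00) == 0xDC00; }

// Hypothetical helper: if index i points at the trail surrogate of a
// well-formed pair, return the index of the lead surrogate; otherwise
// return i unchanged.
inline std::size_t toCodePointStart(const std::u16string &s, std::size_t i) {
    if (i > 0 && i < s.size() && isTrailUnit(s[i]) && isLeadUnit(s[i - 1])) {
        --i;
    }
    return i;
}
```

An unpaired trail surrogate is left alone, which matches the treatment of unpaired surrogates as code points of their own.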
UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and convenience functions (ustring.h), but only a subset of APIs works with UTF-8 directly as string encoding form.
See the UTF-8 subpage for details about working with UTF-8. Some of the following sections apply to UTF-8 APIs as well; for example sections about handling lengths and overflows.
A Unicode code point is an integer with a value from 0 to 0x10FFFF. ICU 2.4 and later defines the UChar32 type for single code point values as a 32-bit signed integer (int32_t). This allows the use of easily testable negative values as sentinels, to indicate errors, exceptions or "done" conditions. All negative values and positive values greater than 0x10FFFF are illegal as Unicode code points.
ICU 2.2 and earlier defined UChar32 depending on the platform: If the compiler's wchar_t was 32 bits wide, then UChar32 was defined to be the same as wchar_t. Otherwise, it was defined to be an unsigned 32-bit integer. This means that UChar32 was either a signed or unsigned integer type depending on the compiler. This was meant for better interoperability with existing libraries, but was of little use because ICU does not process 32-bit strings — UChar32 is only used for single code points. The platform dependence of UChar32 could cause problems with C++ function overloading.
The compiler's and the runtime character set's codepage encodings are not specified by the C/C++ language standards and are usually not a Unicode encoding form. They typically depend on the settings of the individual system, process, or thread. Therefore, it is not possible to instantiate a Unicode character or string variable directly with C/C++ character or string literals. The only safe way is to use numeric values. This is not an issue for User Interface (UI) strings that are translated. These UI strings are loaded from a resource bundle, which is generated from a text file that can be in Unicode or in any other ICU-provided codepage. The genrb tool generates the binary form of the resource bundle, which contains UTF-16 strings that are ready for direct use.
There is a useful exception to this for program-internal strings and test strings. Within each "family" of character encodings, there is a set of characters that have the same numeric code values. Such characters include Latin letters, the basic digits, the space, and some punctuation. Most of the ASCII graphic characters are invariant characters. The same set, with different but again consistent numeric values, is invariant among almost all EBCDIC codepages. For details, see icu/source/common/unicode/utypes.h. With strings that contain only these invariant characters, it is possible to use efficient ICU constructs to write a C/C++ string literal and use it to initialize Unicode strings.
In some APIs, ICU uses char * strings. This is either for file system paths or for strings that contain invariant characters only (such as locale identifiers). These strings are in the platform-specific encoding of either ASCII or EBCDIC. All other codepage differences do not matter for invariant characters. Such strings are manipulated with C stdlib functions like strcpy().
In some APIs where identifiers are used, ICU uses char * strings with invariant characters. Such strings do not require the full Unicode repertoire and are easier to handle in C and C++ with char * string literals and standard C library functions. Their useful character repertoire is actually smaller than the set of graphic ASCII characters; for details, see utypes.h. Examples of char * identifier uses are converter names, locale IDs, and resource bundle table keys.
There is another, less efficient way to have human-readable Unicode string literals in C and C++ code. ICU provides a small number of functions that allow any Unicode characters to be inserted into a string with escape sequences similar to the one that is used in the C and C++ language. In addition to the familiar \n and \xhh etc., ICU also provides the \uhhhh syntax with four hex digits and the \Uhhhhhhhh syntax with eight hex digits for hexadecimal Unicode code point values. This is very similar to the newer escape sequences used in Java and defined in the latest C and C++ standards. Since ICU is not a compiler extension, the "unescaping" is done at runtime and the backslash itself must be escaped (duplicated) so that the compiler does not attempt to "unescape" the sequence itself.
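To make the runtime "unescaping" step concrete, here is a deliberately simplified plain C++ sketch. It is not ICU's implementation (ICU provides u_unescape() and UnicodeString::unescape() for this); the function name unescapeSketch is hypothetical, it handles only \uhhhh, \Uhhhhhhhh, \n and \\, and it assumes that all other input characters are ASCII.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Value of one hex digit, or -1 if ch is not a hex digit.
inline int hexValue(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Hypothetical, simplified unescaper. Returns false on a malformed escape.
inline bool unescapeSketch(const std::string &in, std::u16string &out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        char ch = in[i];
        if (ch != '\\') {
            // Ordinary character: widen it (ASCII input assumed).
            out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
            continue;
        }
        if (++i == in.size()) return false;          // lone trailing backslash
        char kind = in[i];
        if (kind == 'n')  { out.push_back(u'\n'); continue; }
        if (kind == '\\') { out.push_back(u'\\'); continue; }
        int digits = (kind == 'u') ? 4 : (kind == 'U') ? 8 : 0;
        if (digits == 0) return false;               // unsupported escape
        if (in.size() - i - 1 < static_cast<std::size_t>(digits)) return false;
        std::uint32_t c = 0;
        for (int k = 0; k < digits; ++k) {
            int v = hexValue(in[++i]);
            if (v < 0) return false;
            c = (c << 4) | static_cast<std::uint32_t>(v);
        }
        if (c > 0x10FFFF) return false;              // not a code point
        if (c <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(c));
        } else {
            c -= 0x10000;                            // encode as surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        }
    }
    return true;
}
```

In C++ source, the input would be written with doubled backslashes, for example "a\\u00E4", so that the compiler passes the backslash through to the runtime function.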
The length of a string and all indexes and offsets related to the string are always counted in terms of UChar code units, not in terms of UChar32 code points. (This is the same as in common C library functions that use char * strings with multi-byte encodings.)
Often, a user thinks of a "character" as a complete unit in a language, like an 'Ä', while it may be represented with multiple Unicode code points including a base character and combining marks. (See the Unicode standard for details.) This often requires users to index and pass strings (UnicodeString or UChar *) with multiple code units or code points. It cannot be done with single-integer character types. Indexing of such "characters" is done with the BreakIterator class (in C: ubrk_ functions).
Even with such "higher-level" indexing functions, the actual index values will be expressed in terms of UChar code units. When more than one code unit is used at a time, the index value changes by more than one at a time.
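The difference between code unit counts and code point counts can be illustrated with a short plain C++ sketch. The function name countCodePointsSketch is hypothetical; ICU itself offers u_countChar32() and UnicodeString::countChar32() for this.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string. A lead
// surrogate followed by a trail surrogate counts as one code point;
// every other code unit (including an unpaired surrogate) counts as one.
inline std::int32_t countCodePointsSketch(const std::u16string &s) {
    std::int32_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        bool lead = (s[i] & 0xFC00) == 0xD800;
        if (lead && i + 1 < s.size() && (s[i + 1] & 0xFC00) == 0xDC00) {
            ++i;  // skip the trail surrogate of the pair
        }
    }
    return n;
}
```

For a string holding 'A', U+0308 (combining diaeresis) and U+1F600, the length is 4 code units and the count is 3 code points, while a user would probably see only two "characters". Finding those user-level boundaries is the job of a BreakIterator, which this sketch does not attempt.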
ICU uses signed 32-bit integers (int32_t) for lengths and offsets. Because of internal computations, strings (and arrays in general) are limited to 1G base units or 2G bytes, whichever is smaller.
Strings are either terminated with a NUL character (code point 0, U+0000) or their length is specified. In the latter case, it is possible to have one or more NUL characters inside the string.
Input string arguments are typically passed with two parameters: the (const) UChar * pointer and an int32_t length argument. If the length is -1 then the string must be NUL-terminated and the ICU function will call the u_strlen() function or treat it equivalently. If the input string contains embedded NUL characters, then the length must be specified.
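The (pointer, length) convention for input strings might be resolved inside a function roughly as follows. This is a plain C++ sketch with a hypothetical helper name, resolveLength, and is not taken from ICU's sources.

```cpp
#include <cstdint>

// Hypothetical helper: turn an ICU-style (pointer, length) input pair
// into an actual length in code units.
inline std::int32_t resolveLength(const char16_t *s, std::int32_t length) {
    if (length >= 0) {
        return length;        // explicit length; embedded NULs are allowed
    }
    std::int32_t n = 0;       // length == -1: scan for the terminating NUL
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

With the array {'a', 0, 'b', 0}, passing -1 yields a length of 1 because the scan stops at the first NUL, while passing 3 keeps the embedded NUL as part of the string.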
Output string arguments are typically passed with a destination UChar * pointer and an int32_t capacity argument, and the function returns the length of the output as an int32_t. There is also almost always a UErrorCode argument. Essentially, a UChar array is passed in with its start and the number of available UChars. The array is filled with the output and, if space permits, the output will be NUL-terminated. The length of the output string is returned. In all cases the length of the output string does not include the terminating NUL. This is the same behavior found in most ICU and non-ICU string APIs, for example u_strlen(). The output string may contain NUL characters as part of its actual contents, depending on the input and the operation.
Note that the UErrorCode parameter is used to indicate both errors and warnings (non-errors). The following describes some of the situations in which the UErrorCode will be set to a non-zero value: if the output length is equal to the capacity, the output has been written completely but cannot be NUL-terminated, and the UErrorCode is set to U_STRING_NOT_TERMINATED_WARNING; if the output length is greater than the capacity, the output does not fit and the UErrorCode is set to U_BUFFER_OVERFLOW_ERROR.
Preflighting: The returned length is always the full output length even if the output buffer is too small. It is possible to pass in a capacity of 0 (and an output array pointer of NULL) for "pure preflighting" to determine the necessary output buffer size. Add one to the returned length to leave room for a terminating NUL.
Note that — whether the caller intends to "preflight" or not — if the output length is equal to or greater than the capacity, then the UErrorCode is set to U_STRING_NOT_TERMINATED_WARNING or U_BUFFER_OVERFLOW_ERROR respectively, as described above.
However, "pure preflighting" is very expensive because the operation has to be processed twice — once for calculating the output length, and a second time to actually generate the output. It is much more efficient to always provide an output buffer that is expected to be large enough for most cases, and to reallocate and repeat the operation only when an overflow occurred. (Remember to reset the UErrorCode to U_ZERO_ERROR before calling the function again.) In C/C++, the initial output buffer can be a stack buffer. In case of a reallocation, it may be possible and useful to cache and reuse the new, larger buffer.
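The "stack buffer first, reallocate only on overflow" pattern can be sketched in plain C++ as follows. Everything here is a stand-in: SketchStatus imitates UErrorCode with illustrative values only, and upperAsciiSketch is a hypothetical function that follows the output conventions described above; it is not an ICU API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for UErrorCode values; the numbers are illustrative only
// (zero for success, negative for warnings, positive for errors).
enum SketchStatus {
    SK_OK = 0,
    SK_NOT_TERMINATED_WARNING = -1,
    SK_BUFFER_OVERFLOW_ERROR = 1
};

// Hypothetical ICU-style output function: copies src to dest while
// upper-casing ASCII letters. Always returns the full output length.
inline std::int32_t upperAsciiSketch(char16_t *dest, std::int32_t capacity,
                                     const char16_t *src, std::int32_t srcLength,
                                     SketchStatus *status) {
    if (*status > 0) return 0;  // do nothing if an error is already set
    for (std::int32_t i = 0; i < srcLength && i < capacity; ++i) {
        char16_t u = src[i];
        dest[i] = (u >= u'a' && u <= u'z') ? static_cast<char16_t>(u - 0x20) : u;
    }
    if (srcLength < capacity) {
        dest[srcLength] = 0;                    // room for the NUL
    } else if (srcLength == capacity) {
        *status = SK_NOT_TERMINATED_WARNING;    // complete, but no NUL
    } else {
        *status = SK_BUFFER_OVERFLOW_ERROR;     // output does not fit
    }
    return srcLength;
}

// Caller: try a small stack buffer, then reallocate and repeat on overflow.
inline std::vector<char16_t> upperWithRetry(const char16_t *src, std::int32_t srcLength) {
    char16_t stackBuf[8];
    SketchStatus status = SK_OK;
    std::int32_t len = upperAsciiSketch(stackBuf, 8, src, srcLength, &status);
    if (status != SK_BUFFER_OVERFLOW_ERROR) {
        return std::vector<char16_t>(stackBuf, stackBuf + len);
    }
    std::vector<char16_t> heapBuf(static_cast<std::size_t>(len) + 1);  // +1 for NUL
    status = SK_OK;                             // reset before calling again
    len = upperAsciiSketch(heapBuf.data(), len + 1, src, srcLength, &status);
    heapBuf.resize(static_cast<std::size_t>(len));  // drop the NUL slot
    return heapBuf;
}
```

The operation runs twice only for inputs longer than the stack buffer. Calling upperAsciiSketch with a NULL destination and a capacity of 0 would be the "pure preflighting" variant, which always pays for two passes.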
In C, Unicode strings are similar to standard char * strings. Unicode strings are arrays of UChar and most APIs take a UChar * pointer to the first element and an input length and/or output capacity, see above. ICU has a number of functions that provide the Unicode equivalent of the stdlib functions such as strcpy(), strstr(), etc. Compared with their C standard counterparts, their function names begin with u_. Otherwise, their semantics are equivalent. These functions are defined in icu/source/common/unicode/ustring.h.
Sometimes, Unicode code points need to be accessed in C for iteration, movement forward, or movement backward in a string. A string might also need to be written from code point values. ICU provides a number of macros that are defined in the icu/source/common/unicode/utf.h and utf8.h/utf16.h headers that it includes (utf.h is in turn included with utypes.h).
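Macros for 16-bit Unicode strings have a U16_ prefix. For example, U16_NEXT() reads the code point at an index and advances the index, and U16_SET_CP_START() moves an index back to the start of a code point.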
There are also macros with a U_ prefix for code point range checks (e.g., test for non-character code point), and U8_ macros for 8-bit (UTF-8) strings. See the header files and the API References for more details.
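For readers who want to see what forward code point iteration involves, the following plain C++ function loosely mirrors the behavior of U16_NEXT(). The name nextCodePointSketch is hypothetical, and the -1 return value at the end of the string is an addition of this sketch (the macro itself expects the index to be below the length).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: read the code point starting at index i and advance
// i past it. Returns -1 (a sentinel, not a code point) when i is at the end.
inline std::int32_t nextCodePointSketch(const std::u16string &s, std::size_t &i) {
    if (i >= s.size()) {
        return -1;
    }
    std::int32_t c = s[i++];
    bool lead = (c & 0xFC00) == 0xD800;
    if (lead && i < s.size() && (s[i] & 0xFC00) == 0xDC00) {
        // Combine lead and trail surrogate into a supplementary code point.
        c = 0x10000 + ((c - 0xD800) << 10) + (s[i++] - 0xDC00);
    }
    return c;  // an unpaired surrogate is returned as a surrogate code point
}
```

The negative sentinel works because code point values are held in a signed 32-bit integer, as with UChar32.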
In ICU 2.4, the utf*.h macros have been revamped, improved, simplified, and renamed. The old macros continue to be available. They are in utf_old.h, together with an explanation of the change. utf.h, utf8.h and utf16.h contain the new macros instead. The new macros are intended to be more consistent, more useful, and less confusing. Some macros were simply renamed for consistency with a new naming scheme.
The documentation of the old macros has been removed. If you need it, see a User Guide version from ICU 4.2 or earlier (see the download page).
C Unicode String Literals
There is a pair of macros, U_STRING_DECL and U_STRING_INIT (see ustring.h), that together enable users to instantiate a Unicode string in C (a UChar array) from a C string literal.
With invariant characters, it is also possible to efficiently convert char * strings to and from UChar * strings, using the functions u_charsToUChars() and u_UCharsToChars().
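Why this conversion is cheap can be seen from a plain C++ sketch for ASCII-based platforms, where each invariant character has the same numeric value as its Unicode code point. The function name widenAsciiSketch is hypothetical, and the sketch does not cover EBCDIC platforms, where a table lookup would be needed instead of a simple widening.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, ASCII-only analogue of an invariant-character conversion:
// each char is widened to a 16-bit unit with the same numeric value.
// Returns false if the input contains a byte outside the ASCII range.
inline bool widenAsciiSketch(const char *cs, std::int32_t length, std::u16string &out) {
    out.clear();
    out.reserve(static_cast<std::size_t>(length));
    for (std::int32_t i = 0; i < length; ++i) {
        unsigned char b = static_cast<unsigned char>(cs[i]);
        if (b > 0x7F) {
            return false;  // not ASCII, so certainly not an invariant character
        }
        out.push_back(static_cast<char16_t>(b));
    }
    return true;
}
```

No converter object or table is involved, which is where the efficiency comes from. Note that the real invariant set is smaller than ASCII, so this check is more permissive than ICU's.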
Testing for well-formed UTF-16 strings
It is sometimes useful to test if a 16-bit Unicode string is well-formed UTF-16, that is, that it does not contain unpaired surrogate code units. For a boolean test, call a function like u_strToUTF8() which sets an error code if the input string is malformed. (Provide a zero-capacity destination buffer and treat the buffer overflow error as "is well-formed".) If you need to know the position of the unpaired surrogate, you can iterate through the string with U16_NEXT() and U_IS_SURROGATE().
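A position-reporting check along these lines might look as follows in plain C++. The function name firstUnpairedSurrogate is hypothetical; in ICU code the same loop would typically be written with U16_NEXT() and U_IS_SURROGATE().

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: return the index of the first unpaired surrogate
// code unit, or -1 if the string is well-formed UTF-16.
inline std::int32_t firstUnpairedSurrogate(const std::u16string &s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if ((u & 0xF800) != 0xD800) {
            continue;                          // not a surrogate at all
        }
        bool lead = (u & 0x0400) == 0;
        if (lead && i + 1 < s.size() && (s[i + 1] & 0xFC00) == 0xDC00) {
            ++i;                               // skip the valid pair
            continue;
        }
        return static_cast<std::int32_t>(i);   // lone lead or lone trail
    }
    return -1;
}
```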
UnicodeString is a C++ string class that wraps a UChar array and associated bookkeeping. It provides a rich set of string handling functions.
UnicodeString combines elements of both the Java String and StringBuffer classes. Many UnicodeString functions are named and work similarly to Java String methods but modify the object (UnicodeString is "mutable").
UnicodeString provides functions for random access and use (insert/append/find etc.) of both code units and code points. For each non-iterative string/code point macro in utf.h there is at least one UnicodeString member function. The names of most of these functions contain "32" to indicate the use of a UChar32.
Code point and code unit iteration is provided by the CharacterIterator abstract class and its subclasses. There are concrete iterator implementations for UnicodeString objects and plain UChar arrays.
Most UnicodeString constructors and functions do not have a UErrorCode parameter. Instead, if the construction of a UnicodeString fails, for example when it is constructed from a NULL UChar * pointer, then the UnicodeString object becomes "bogus". This can be tested with the isBogus() function. A UnicodeString can be put into the "bogus" state explicitly with the setToBogus() function. This is different from an empty string (although a "bogus" string also returns TRUE from isEmpty()) and may be used equivalently to NULL in UChar * C APIs (or null references in Java, or NULL values in SQL). A string remains "bogus" until a non-bogus string value is assigned to it. For complete details of the behavior of "bogus" strings see the description of the setToBogus() function.
Some APIs work with the Replaceable abstract class. It defines a simple interface for random access and text modification and is useful for operations on text that may have associated meta-data (e.g., styled text), especially in the Transliterator API. UnicodeString implements Replaceable.
Like in C, there are macros that enable users to instantiate a UnicodeString from a C string literal. One macro, UNICODE_STRING, requires the length of the string as in the C macros; the other one, UNICODE_STRING_SIMPLE, implies a strlen().
It is possible to efficiently convert between invariant-character strings and UnicodeStrings by using constructor, setTo() or extract() overloads that take codepage data (const char *) and specifying an empty string ("") as the codepage name.
The internal buffer of UnicodeString objects is available for direct handling in C (or C-style) APIs that take UChar * arguments. It is possible but usually not necessary to copy the string contents with one of the extract functions. The following describes several direct buffer access methods.
The UnicodeString function getBuffer() const returns a readonly const UChar *. The length of the string is indicated by UnicodeString's length() function. Generally, UnicodeString does not NUL-terminate the contents of its internal buffer. However, it is possible to check for a terminating NUL character if the length of the string is less than the capacity of the buffer. The following expression is an example of such a check, where buffer is the pointer returned by getBuffer(): (s.length()<s.getCapacity() && buffer[s.length()]==0)
An easier way to NUL-terminate the buffer and get a const UChar * pointer to it is the getTerminatedBuffer() function. Unlike getBuffer() const, getTerminatedBuffer() is not a const function because it may have to (reallocate and) modify the buffer to append a terminating NUL. Therefore, use getBuffer() const if you do not need a NUL-terminated buffer.
There is also a pair of functions that allow controlled write access to the buffer of a UnicodeString: UChar *getBuffer(int32_t minCapacity) and releaseBuffer(int32_t newLength). UChar *getBuffer(int32_t minCapacity) provides a writable buffer of at least the requested capacity and returns a pointer to it. The actual capacity of the buffer after the getBuffer(minCapacity) call may be larger than the requested capacity and can be determined with getCapacity().
Once the buffer contents are modified, the buffer must be released with the releaseBuffer(int32_t newLength) function, which sets the new length of the UnicodeString (newLength=-1 can be passed to determine the length of NUL-terminated contents like u_strlen()).
Between the getBuffer(minCapacity) and releaseBuffer(newLength) function calls, the contents of the UnicodeString are unknown and the object behaves like it contains an empty string. A nested getBuffer(minCapacity), getBuffer() const or getTerminatedBuffer() will fail (return NULL) and modifications of the string via UnicodeString member functions will have no effect.
See the UnicodeString API documentation for more information.
There are efficient ways to wrap C-style strings in C++ UnicodeString objects without copying the string contents. In order to use C strings in C++ APIs, the UChar * pointer and length need to be wrapped into a UnicodeString. This can be done efficiently in two ways: with a readonly alias or with a writable alias. The UnicodeString object that is constructed actually uses the UChar * pointer as its internal buffer pointer instead of allocating a new buffer and copying the string contents.
If the original string is a readonly const UChar *, then the UnicodeString must be constructed with a readonly alias. If the original string is a writable (non-const) UChar * and is to be modified (e.g., if the UChar * buffer is an output buffer) then the UnicodeString should be constructed with a writable alias. For more details see the section "Maximizing Performance with the UnicodeString Storage Model" and search the unistr.h header file for "alias".
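UnicodeString uses four storage methods to maximize performance and minimize memory consumption: short strings are stored inside the UnicodeString object itself; longer strings are stored in allocated memory together with a reference counter; a readonly alias uses a caller-supplied buffer that is never modified; and a writable alias uses a caller-supplied buffer for as long as its capacity is sufficient.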
In general, UnicodeString objects have "copy-on-write" semantics. Several objects may share the same string buffer, but a modification only affects the object that is modified itself. This is achieved by copying the string contents if it is not owned exclusively by this one object. Only after that is the object modified.
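The copy-on-write idea can be shown with a minimal plain C++ sketch. It is not how UnicodeString is implemented (UnicodeString manages its own reference counter rather than using std::shared_ptr), the class name CowStringSketch is hypothetical, and the sketch is not thread-safe.

```cpp
#include <memory>
#include <string>
#include <utility>

// Minimal copy-on-write sketch: copies share one buffer until one of
// them is modified.
class CowStringSketch {
public:
    explicit CowStringSketch(std::u16string text)
        : buf_(std::make_shared<std::u16string>(std::move(text))) {}

    const std::u16string &str() const { return *buf_; }

    bool sharesBufferWith(const CowStringSketch &other) const {
        return buf_ == other.buf_;
    }

    void append(char16_t u) {
        if (buf_.use_count() > 1) {
            // The buffer is shared: copy the contents first.
            buf_ = std::make_shared<std::u16string>(*buf_);
        }
        buf_->push_back(u);  // only now modify this object's own buffer
    }

private:
    std::shared_ptr<std::u16string> buf_;
};
```

Copying a CowStringSketch only copies a pointer and bumps a counter. The cost of duplicating the text is paid at the first modification, and only by the object being modified.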
Even though it is fairly efficient to copy UnicodeString objects, it is even more efficient, if possible, to work with references or pointers. Functions that output strings can be faster by appending their results to a UnicodeString that is passed in by reference, compared with returning a UnicodeString object or just setting the local results alone into a string reference.
As mentioned in the overview of this chapter, ICU and most other Unicode-supporting software uses 16-bit Unicode for internal processing. However, there are circumstances where UTF-8 is used instead. This is usually the case for software that does little or no processing of non-ASCII characters, and/or for APIs that predate Unicode, use byte-based strings, and cannot be changed or replaced for various reasons.
A common perception is that UTF-8 has an advantage because it was designed for compatibility with byte-based, ASCII-based systems, although it was designed for string storage (of Unicode characters in Unix file names) rather than for processing performance.
While ICU mostly does not natively use UTF-8 strings, there are many ways to work with UTF-8 strings and ICU. For more information see the newer UTF-8 subpage.
It is even rarer to use UTF-32 for string processing than UTF-8. While 32-bit Unicode is convenient because it is the only fixed-width UTF, there are few or no legacy systems with 32-bit string processing that would benefit from a compatible format, and the memory bandwidth requirements of UTF-32 diminish the performance and handling advantage of the fixed-width format.
Over time, the wchar_t type of some C/C++ compilers became a 32-bit integer, and some C libraries do use it for Unicode processing. However, application software with good Unicode support tends to have little use for the rudimentary Unicode and Internationalization support of the standard C/C++ libraries and often uses custom types (like ICU's) and UTF-16 or UTF-8.
For those systems where 32-bit Unicode strings are used, ICU offers some convenience functions.
Beginning with ICU release 2.0, there are a few changes to the ICU string facilities compared with earlier ICU releases.
Some of the NUL-termination behavior was inconsistent across the ICU API functions. In particular, the following functions used to count the terminating NUL character in their output length (counted one more before ICU 2.0 than now): ucnv_toUChars, ucnv_fromUChars, uloc_getLanguage, uloc_getCountry, uloc_getVariant, uloc_getName, uloc_getDisplayLanguage, uloc_getDisplayCountry, uloc_getDisplayVariant, uloc_getDisplayName
Some functions used to set an overflow error code even when only the terminating NUL did not fit into the output buffer. These functions now set UErrorCode to U_STRING_NOT_TERMINATED_WARNING rather than to U_BUFFER_OVERFLOW_ERROR.
The aliasing UnicodeString constructors and most extract functions have existed for several releases prior to ICU 2.0. There is now an additional extract function with a UErrorCode parameter. Also, the getBuffer, releaseBuffer and getCapacity functions are new to ICU 2.0.
For more information about these changes, please consult the old and new API documentation.