Migration from Standard C and POSIX APIs
The ISO C
and POSIX standards define a number of APIs for string handling and
internationalization in C. They do not support Unicode well because
they were initially designed before Unicode/ISO 10646 were developed,
and the POSIX APIs are also problematic for other internationalization
aspects.
This chapter discusses C/POSIX APIs with their problems, and shows which ICU APIs to use instead.
Note: We use the term "POSIX" to mean the POSIX.1 standard (IEEE Std 1003.1),
which defines system interfaces and headers with relevance for string
handling and internationalization. The XPG3, XPG4, Single Unix
Specification (SUS) and other standards include POSIX.1 as a subset,
adding other specifications that are irrelevant for this topic.
This chapter is not complete yet – more POSIX APIs are expected to be discussed in the future.
Strings and Characters
Character Sets and Encodings
ISO C
The ISO C standard provides two basic character types (char and wchar_t)
and defines strings as arrays of units of these types. The standard
allows nearly arbitrary character and string character sets and
encodings, which was necessary when there was no single character set
that worked everywhere.
For portable C programs, characters and
strings are opaque, i.e., a program cannot assume that any particular
character is represented by any particular code or sequence of codes.
Programs use standard library functions to handle characters and
strings. Only a small set of characters — usually the set of graphic
characters available in US-ASCII — can be reliably accessed via
character and string literals.
Problems
Many
different encodings are used on each platform, making it difficult for
multiple programs and libraries to process the same text.
Programs often need to know the codes of special characters. For
example, code that parses a filename needs to know how the path and
file separators are encoded; this is commonly possible because
filenames deliberately use US-ASCII characters, but any software that
uses non-ASCII characters becomes platform-dependent. It is practically
impossible to provide sophisticated text processing without knowledge
of the character set, its string encoding, and other detailed features.
The C/POSIX standards only provide a very limited set of useful
functions for character and string handling; many functions that are
provided do not work for non-trivial cases.
While the size of the char type is in practice fixed to 8 bits in modern compilers, and its common encodings are reasonably well documented, the size of wchar_t varies between 8/16/32 bits depending on the compiler, and only a few of the string encodings used with it are documented.
See also "What size wchar_t do I need for Unicode?".
A program based on this model must be recompiled for each
platform. Usually, it must be recompiled for each supported language or
family of languages.
The ISO C standard basically requires, by how its standard
functions are defined, that the data type for a single character code
in a large character set is the same as the string base unit type (wchar_t).
This has led to C standard library implementations using Unicode
encodings which are either limited for single-character functions to
only part of Unicode, or suffer from reduced interoperability with most
Unicode-aware software.
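The limitation described in the last point can be illustrated with a minimal sketch in plain C99. It is neither ICU code nor part of any standard library, and the function name encode_utf16 is hypothetical. It shows that a code point above U+FFFF needs two UTF-16 code units, so a 16-bit wchar_t cannot pass such a character to a single-character function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out[] and returns how many were written,
 * or 0 if cp is not a valid Unicode scalar value.
 */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;               /* surrogate code points are not characters */
    }
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;  /* BMP: fits in one 16-bit unit */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;          /* 20 bits remain, split 10 + 10 */
        out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* lead surrogate */
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* trail surrogate */
        return 2;
    }
    return 0;                   /* beyond the Unicode code space */
}
```

For U+00DF ("ß") the function writes one unit. For U+1D11E (musical symbol G clef) it writes the pair D834 DD1E, which a function like towupper() cannot receive as one argument when wchar_t is 16 bits wide.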
ICU
ICU always processes Unicode text. Unicode
covers all languages and allows safe hard coding of character codes, in
addition to providing many standard or recommended algorithms and a lot
of useful character property data. See the Unicode Basics and Strings
chapters, among others.
ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing,
making it fully interoperable with most Unicode-aware software. (See UTF-16 for Processing.)
For ICU4J, this follows naturally because the Java language and the JDK use UTF-16.
ICU uses and/or provides direct access to all of the Unicode properties,
which provide a much finer-grained classification of characters than C/POSIX character classes.
In C/C++ source code character and string literals, ICU uses only
"invariant" characters. They are the subset of graphic ASCII characters
that are almost always encoded with the same byte values on all
systems. (One set of byte values for ASCII-based systems, and another
such set of byte values for EBCDIC systems.) See utypes.h
for the set of "invariant" characters.
With the use of Unicode, the implementation of many of the Unicode standard
algorithms, and its cross-platform availability, ICU provides for
consistent, portable, and reliable text processing.
Case Mappings
ISO C
The standard C functions tolower(), toupper(), etc. take and return one character code each.
Problems
This does not work for German, where the character "ß" (sharp s) uppercases to the two characters "SS". (It "expands".)
It does not work for Greek, where the character "Σ" (capital
sigma) lowercases to either "ς" (small final sigma) or "σ" (small
sigma) depending on whether the capital sigma is the last letter in a
word. (It is context-dependent.)
It does not work for Lithuanian and Turkic languages where a
"combining dot above" character may need to be removed in certain
cases. (It "contracts" and is language- and context-dependent.)
There are a number of other such cases.
There are no standard functions for title-casing strings.
There are no standard functions for case-folding strings.
(Case-folding is used for case-insensitive comparisons; there are
C/POSIX functions for direct, case-insensitive comparisons of pairs of
strings. Case-folding is useful when one string is compared to many
others, or as part of a chain of transformations of a string.)
ICU
Case mappings are operations taking and
returning strings, to support length changes and context dependencies.
Unicode provides algorithms and data for proper case mappings, and ICU
provides APIs for them. (See the API references for various string
functions and for Transforms/Transliteration.)
Character Classes
ISO C
The standard C functions isalpha(), isdigit(),
etc. take a character code each and return boolean values for whether
the character belongs to the current locale's respective character
class.
Problems
Character classes are bound to locales, instead of providing consistent classifications for characters.
The same character may have different classifications depending on the locale and the platform.
There are very few POSIX character classes, and they are
not well defined. For example, there is a class for punctuation
characters but not one for symbols, and the dollar symbol (“$”) may or
may not belong to the punct class depending on the locale, even on the same system.
The standard allows at most two sets of decimal digits: the
digits of the “portable character set” (i.e., those in the ASCII
repertoire) and one more. Some implementations only recognize ASCII
digits in the isdigit() function. However, there are many sets of decimal digits in a multilingual character set like Unicode.
The POSIX standard assumes that each locale definition file
carries the character class data for all relevant characters. With many
locales using overlapping character repertoires, this can lead to a lot
of duplication. For efficiency, many UTF-8 locales define character
classes only for very few characters instead of for all of Unicode. For
example, some de_DE.utf-8 locales only define character classes for
characters used in German, or for the repertoire of ISO 8859-1 – in
other words, for only a tiny fraction of the representable Unicode
repertoire. Processing of text using more than this repertoire is not
possible with such an implementation.
For more about the problems with POSIX character classes in a Unicode context, see Annex C: Compatibility Properties in Unicode Technical Standard #18: Unicode Regular Expressions,
and see the mailing list archives for the unicode list (on unicode.org). See also the ICU design document about C/POSIX character classes.
ICU
ICU provides locale-independent access to all Unicode properties
(except Unihan.txt properties), as well as to the POSIX character classes, via functions defined in uchar.h and in ICU4J's UCharacter class (see API references) as well as via UnicodeSet. The POSIX character classes are implemented according to the recommendations in UTS #18.
The Unicode Character Database (UCD) defines more than 70 character properties.
Their values are designed for the large character set as well as for
real text processing, and they are updated with each version of
Unicode. The UCD is available online, facilitating industry-wide
consistency in the implementation of Unicode properties.
Formatting and Parsing
Currency Formatting
POSIX
The strfmon() function is used to format monetary values. The default format and the
currency display symbol or display name are selected by the LC_MONETARY locale ID. The number formatting can also be controlled with a formatting string resembling what printf() uses.
Problems
Selection of the currency via a locale ID is unreliable: countries change
currencies over time, and the locale data for a particular country may
not be available. This results in using the wrong currency. For
example, an application may assume that a country has switched from a
previous currency to the Euro, but it may run on an OS that predates
the switch.
Using a single locale ID for the whole format makes it very
difficult to format values for multiple currencies with the same number
format (for example, for an exchange rate list or for showing the price
of an item adjusted for several currencies). strfmon() allows the number format to be specified fully, but then the application cannot use a country's default number format.
The set of formattable currencies is limited to those that are available via locale IDs on a particular system.
There does not appear to be a function to parse currency values.
ICU
ICU number formatting APIs have separate,
orthogonal settings for the number format, which can be selected with a
locale ID, and the currency, which is specified with an ISO code. See
the Formatting Numbers chapter for details.