A converter is used to convert from one character encoding to
another. In ICU, the conversion is always between Unicode and another
encoding. A text encoding is a particular mapping from a given
character set definition to the actual bits used to represent the
data.
Unicode provides a single character set
that covers the major languages of the world, and a small number of
machine-friendly encoding forms and schemes to fit the needs of
existing applications and protocols. It is designed for best
interoperability with both ASCII and ISO-8859-1 (the most widely used
character sets) to make it easier for Unicode to be used in almost all
applications and protocols.
Hundreds of encodings have been
developed over the years, each for small groups of languages and for
special purposes. As a result, correctly interpreting, inputting,
sorting, displaying, and storing text depends on knowledge of all the
different character sets and their encodings. Programs have been
written either to handle one encoding at a time, switching between
them, or to convert between external and internal encodings.
There is no single, authoritative source of precise definitions of many of the encodings and their names. However, IANA
is the best source for names, and our Character Set repository is a good source of encoding definitions for each platform.
Transferring text from one machine to another often causes some
loss of information, because different platforms may interpret the
same text differently. For example, Shift-JIS can be
interpreted differently on Windows™ compared to UNIX®. Windows maps
byte value 0x5C to the backslash symbol, while some UNIX machines map
that byte value to the Yen symbol. Another problem arises when a
codepage byte sequence could correspond to either the Unicode Greek
letter mu (U+03BC) or the Unicode micro sign (U+00B5). Some platforms
map that byte sequence to one of these Unicode characters, while other
platforms map it to the other. Fallbacks can partially fix this
problem by mapping
both Unicode characters to the same codepage byte sequence. Even though
some character information is lost, the text is still readable.
ICU's converter API has the following main features:
Unicode surrogate support
Support for all major encodings
Consistent text conversion across all computer platforms
Text data can be streamed (buffered) through the API
Fast text conversion
Supports fallbacks to the codepage
Supports reverse fallbacks to Unicode
Allows callbacks for handling and substituting invalid or unmapped byte sequences
Allows a user to add support for unsupported encodings
This section deals with the processes of converting encodings to and from Unicode.
Use Unicode encodings whenever possible.
Using Unicode encodings, together with Unicode for internal
processing, makes completely globalized systems possible and avoids
the many problems of non-algorithmic conversions. (For a discussion of
such problems, see, for example, "Character Conversions and Mapping
Tables" and the XML Japanese Profile.)
Use UTF-8 and UTF-16.
Use UTF-16BE, SCSU and BOCU-1 as appropriate.
In special environments, other Unicode encodings may be used as
well, such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC,
and CESU-8. (For turning Unicode filenames into ASCII-only filename
strings, the IMAP-mailbox-name encoding can be used.)
Do not exchange text with single/unpaired surrogates.
Use legacy charsets only when absolutely necessary. For best data fidelity:
ISO-8859-1 is relatively unproblematic, if its limited character
repertoire is sufficient, because it is converted trivially (1:1) to
Unicode,
avoiding conversion table problems for its small set of characters. (By
contrast, proper conversion from US-ASCII requires a check for illegal
byte values 0x80..0xff, which is an unnecessary complication for modern
systems with 8-bit bytes. ISO-8859-1 is nearly as ubiquitous for modern
systems as US-ASCII was for 7-bit systems.)
If you need to communicate with a certain platform, then use the
same conversion tables as that platform itself, or at least ones that
are very, very close.
ICU's conversion table repository contains hundreds of Unicode
conversion tables from a number of common vendors and platforms as well
as comparisons between these conversion tables: http://icu-project.org/charts/charset/
Do not trust codepage documentation that is not machine-readable,
such as nice-looking charts: they are usually incomplete and out of
date.
ICU's default build includes about 200 conversion tables. See the ICU Data
chapter for how to add or remove conversion tables and other data.
In ICU, you can (and should) also use APIs that map a charset
name together with a standard/platform name. This allows you to get
different converters for the same ambiguous charset name (like
"Shift-JIS"), depending on the standard or platform specified. See the convrtrs.txt
alias table, the Using Converters
chapter, and the API references.
For data exchange (rather than pure display), turn off fallback mappings: ucnv_setFallback(cnv, FALSE);
For some text formats, especially XML and HTML, it is possible
to set an "escape callback" function that turns unmappable Unicode code
points into corresponding escape sequences, preventing data loss. See
the API references and the ucnv sample code.
Never modify a conversion table. Instead, use existing
ones that match precisely those in systems with which you communicate.
"Modifying" a conversion table in reality just creates a new one, which
makes the whole situation even less manageable.