Conversion

Conversion Overview

A converter transforms text from one character encoding to another. In ICU, one side of the conversion is always Unicode: a converter maps another encoding to Unicode, or Unicode to another encoding. A text encoding is a particular mapping from a given character set definition to the actual bits used to represent the data.
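To make the phrase "mapping from a character set to the actual bits" concrete, here is a minimal sketch in plain C that encodes the same code point in three encodings. It does not use ICU; the function names (`encode_latin1`, `encode_utf8`, `encode_utf16be`) are hypothetical helpers written for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: one byte per character, only U+0000..U+00FF are representable.
 * Returns the number of bytes written, or 0 if cp has no mapping. */
size_t encode_latin1(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp > 0xFF) return 0;
    out[0] = (uint8_t)cp;
    return 1;
}

/* UTF-8: one to four bytes per code point.
 * Returns the number of bytes written, or 0 for surrogates and
 * values above U+10FFFF. */
size_t encode_utf8(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* UTF-16BE: two bytes for BMP code points, four bytes (a surrogate pair)
 * for supplementary code points. Returns 0 for invalid input. */
size_t encode_utf16be(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
    if (cp < 0x10000) {
        out[0] = (uint8_t)(cp >> 8);
        out[1] = (uint8_t)(cp & 0xFF);
        return 2;
    }
    uint32_t v = cp - 0x10000;
    uint16_t lead = (uint16_t)(0xD800 + (v >> 10));
    uint16_t trail = (uint16_t)(0xDC00 + (v & 0x3FF));
    out[0] = (uint8_t)(lead >> 8);
    out[1] = (uint8_t)(lead & 0xFF);
    out[2] = (uint8_t)(trail >> 8);
    out[3] = (uint8_t)(trail & 0xFF);
    return 4;
}
```

For U+00E9 (LATIN SMALL LETTER E WITH ACUTE) these helpers produce the byte E9 in ISO-8859-1, the bytes C3 A9 in UTF-8, and the bytes 00 E9 in UTF-16BE: one character, three different bit patterns.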

Unicode provides a single character set that covers the major languages of the world, and a small number of machine-friendly encoding forms and schemes to fit the needs of existing applications and protocols. It is designed for best interoperability with both ASCII and ISO-8859-1 (the most widely used character sets) to make it easier for Unicode to be used in almost all applications and protocols.

Hundreds of encodings have been developed over the years, each for small groups of languages and for special purposes. As a result, the interpretation of text, input, sorting, display, and storage depends on the knowledge of all the different types of character sets and their encodings. Programs have been written either to handle one encoding at a time and switch between encodings, or to convert between external and internal encodings.

There is no single, authoritative source of precise definitions of many of the encodings and their names. However, IANA is the best source for names, and our Character Set repository is a good source of encoding definitions for each platform.

Transferring text from one machine to another often causes some loss of information, because platforms can interpret the same text differently. For example, Shift-JIS is interpreted differently on Windows™ than on UNIX®: Windows maps byte value 0x5C to the backslash symbol, while some UNIX machines map that byte value to the Yen symbol. Another problem arises when a character in the codepage could be either the Unicode Greek letter mu or the Unicode micro sign. Some platforms map this codepage byte sequence to one Unicode character, while other platforms map it to the other. Fallbacks can partially fix this problem by mapping both Unicode characters to the same codepage byte sequence. Even though some character information is lost, the text is still readable.
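The fallback idea can be sketched with a tiny lookup table. This is an illustration only, not ICU's table format: the `Mapping` struct, the three-entry table, and the functions `from_unicode` and `to_unicode` are hypothetical, and the choice of byte 0xB5 for the micro sign is an assumption borrowed from ISO-8859-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t unicode;   /* Unicode code point (BMP only in this sketch) */
    uint8_t  byte;      /* codepage byte */
    int      fallback;  /* 0 = round-trip mapping, 1 = one-way fallback */
} Mapping;

/* Hypothetical fragment of a single-byte codepage table. */
static const Mapping kTable[] = {
    { 0x0041, 0x41, 0 },  /* LATIN CAPITAL LETTER A, round-trip */
    { 0x00B5, 0xB5, 0 },  /* MICRO SIGN, round-trip */
    { 0x03BC, 0xB5, 1 },  /* GREEK SMALL LETTER MU, fallback to the same byte */
};
static const size_t kTableLen = sizeof kTable / sizeof kTable[0];

/* Unicode to codepage. Fallback entries are used only if use_fallback is
 * nonzero. Returns 1 and stores the byte on success, 0 if unmappable. */
int from_unicode(uint16_t c, int use_fallback, uint8_t *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < kTableLen; i++) {
        if (kTable[i].unicode == c && (use_fallback || !kTable[i].fallback)) {
            *out = kTable[i].byte;
            return 1;
        }
    }
    return 0;
}

/* Codepage to Unicode. Only round-trip entries are consulted, so each byte
 * has exactly one Unicode interpretation. */
int to_unicode(uint8_t b, uint16_t *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < kTableLen; i++) {
        if (kTable[i].byte == b && !kTable[i].fallback) {
            *out = kTable[i].unicode;
            return 1;
        }
    }
    return 0;
}
```

With fallbacks enabled, both U+00B5 and U+03BC convert to byte 0xB5, so the text stays readable. Converting that byte back always yields U+00B5, which is the information loss the paragraph above describes. With fallbacks disabled, U+03BC is reported as unmappable instead.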

ICU's converter API has the following main features:

  1. Unicode surrogate support

  2. Support for all major encodings

  3. Consistent text conversion across all computer platforms

  4. Text data can be streamed (buffered) through the API

  5. Fast text conversion

  6. Supports fallbacks to the codepage

  7. Supports reverse fallbacks to Unicode

  8. Allows callbacks for handling and substituting invalid or unmapped byte sequences

  9. Allows a user to add support for unsupported encodings

This section deals with converting text between other encodings and Unicode, in both directions.

Recommendations

  1. Use Unicode encodings whenever possible. Together with Unicode for internal processing, this makes completely globalized systems possible and avoids the many problems with non-algorithmic conversions. (For a discussion of such problems, see for example "Character Conversions and Mapping Tables" on http://icu-project.org/docs/ and the XML Japanese Profile.)

    1. Use UTF-8 and UTF-16.

    2. Use UTF-16BE, SCSU and BOCU-1 as appropriate.

    3. In special environments, other Unicode encodings may be used as well, such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and CESU-8. (For turning Unicode filenames into ASCII-only filename strings, the IMAP-mailbox-name encoding can be used.)

    4. Do not exchange text with single/unpaired surrogates.

  2. Use legacy charsets only when absolutely necessary. For best data fidelity:

    1. ISO-8859-1 is relatively unproblematic — if its limited character repertoire is sufficient — because it is converted trivially (1:1) to Unicode, avoiding conversion table problems for its small set of characters. (By contrast, proper conversion from US-ASCII requires a check for illegal byte values 0x80..0xff, which is an unnecessary complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)

    2. If you need to communicate with a certain platform, then use the same conversion tables as that platform itself, or at least ones that are very, very close.

    3. ICU's conversion table repository contains hundreds of Unicode conversion tables from a number of common vendors and platforms, as well as comparisons between these conversion tables: http://icu-project.org/charts/charset/

    4. Do not trust codepage documentation that is not machine-readable, for example nice-looking charts: They are usually incomplete and out of date.

    5. ICU's default build includes about 200 conversion tables. See the ICU Data chapter for how to add or remove conversion tables and other data.

    6. In ICU, you can (and should) also use APIs that map a charset name together with a standard/platform name. This allows you to get different converters for the same ambiguous charset name (like "Shift-JIS"), depending on the standard or platform specified. See the convrtrs.txt alias table, the Using Converters chapter and API references.

    7. For data exchange (rather than pure display), turn off fallback mappings: ucnv_setFallback(cnv, FALSE);

    8. For some text formats, especially XML and HTML, it is possible to set an "escape callback" function that turns unmappable Unicode code points into corresponding escape sequences, preventing data loss. See the API references and the ucnv sample code.

    9. Never modify a conversion table. Instead, use existing ones that match precisely those in systems with which you communicate. "Modifying" a conversion table in reality just creates a new one, which makes the whole situation even less manageable.
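Three of the recommendations above can be illustrated without ICU: the trivial 1:1 mapping of ISO-8859-1, the extra range check that US-ASCII needs, and escaping unmappable code points for XML instead of losing them. The following is a minimal sketch in plain C. All function names are hypothetical, the `&#xHHHH;` output form is an assumption chosen for this example, and real applications would use ICU converters and callbacks rather than hand-written loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ISO-8859-1 to UTF-16: every byte value is the code point itself. */
void latin1_to_utf16(const uint8_t *src, size_t n, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* US-ASCII to UTF-16: same mapping, but bytes 0x80..0xFF are illegal.
 * Returns 0 on success, or -1 if an illegal byte was found. */
int ascii_to_utf16(const uint8_t *src, size_t n, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] >= 0x80) {
            return -1;
        }
        dst[i] = src[i];
    }
    return 0;
}

/* UTF-16 to ISO-8859-1 for an XML payload. Code points above U+00FF are
 * written as numeric character references (&#xHHHH;) instead of being lost.
 * Returns the number of bytes written, not counting the terminating NUL,
 * or -1 on an unpaired surrogate or if dst is too small. */
int utf16_to_latin1_xml(const uint16_t *src, size_t n, char *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* Lead surrogate: it must be followed by a trail surrogate. */
            if (i + 1 >= n || src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF) {
                return -1;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;  /* trail surrogate without a preceding lead */
        }
        if (cp <= 0xFF) {
            if (out + 1 >= cap) return -1;  /* need room for byte and NUL */
            dst[out++] = (char)(unsigned char)cp;
        } else {
            int w = snprintf(dst + out, cap - out, "&#x%lX;", (unsigned long)cp);
            if (w < 0 || (size_t)w >= cap - out) return -1;
            out += (size_t)w;
        }
    }
    if (out >= cap) return -1;
    dst[out] = '\0';
    return (int)out;
}
```

The sketch rejects unpaired surrogates rather than passing them through, in line with the advice not to exchange such text. It does not escape markup characters such as `&` or `<`; that remains the job of the XML serializer.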