In a comprehensive conversion library, there are three kinds of codepage converter implementations: converters that use algorithms, converters that use mapping data, and converters that use both.
ICU provides converter implementations for all three groups of codepages. Since ICU always converts to or from Unicode, the purely algorithmic converters are the ones for Unicode encodings (such as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, SCSU, BOCU-1 and UTF-7). Since Unicode is based on US-ASCII and ISO-8859-1 ("ISO Latin-1"), these two encodings also use algorithmic converters for performance reasons.
Most other codepages use simple byte sequences but are not encodings of Unicode. They are converted with generic code using mapping data tables. ICU also supports a few encodings, like ISO-2022 and its variants, that employ an algorithmic structure to switch between a set of codepages. The converters for these encodings are algorithmic but use mapping tables for the embedded codepages.
Character encodings are either stateful or stateless:
This distinction between stateless and stateful encodings is important, because it determines if any available ICU converter implementation is used. The following are some more important considerations related to stateless versus stateful encodings:
The following sections in this chapter discuss the mapping data tables that are used in ICU. For related material, please see:
As stated above, most ICU converters rely on character mapping tables. ICU 1.8 has one single data structure for all character mapping tables, which is used by a generic Multi-Byte Character Set (MBCS) converter implementation. The implementation is flexible enough to handle stateless encodings with the following parameters:
Prior to version 1.8, ICU used more specific, more limited, converter implementations for Single Byte Character Set (SBCS), Double Byte Character Set (DBCS), and the stateful Extended Binary Coded Decimal Interchange Code (EBCDIC) codepages.
Mapping table data is provided in text files. ICU comes with several dozen .ucm files (UniCode Mapping, in icu/source/data/mappings/) that are translated at build time by its makeconv tool (source code in icu/source/tools/makeconv). The makeconv tool writes one binary, memory-mappable .cnv file per .ucm file. The resulting .cnv files are included by default in the common data file for use at runtime.
The format of the .ucm files is similar to the format of the UPMAP files as provided by IBM® in the codepage repository and as used in the uconvdef tool on AIX. UPMAP is a text file that specifies the mapping of a codepage character to and from Unicode.
The format of the .cnv files is ICU-specific. The .cnv file format may change between ICU versions even for the same .ucm files. The .ucm file format may be extended to include more features.
The following sections concentrate on the .ucm file format. The .cnv file format is described in the source code in the file icu/source/common/ucnvmbcs.c and is updated together with the MBCS converter implementation.
Conversion tables can have more than one name: ICU allows multiple names ("aliases") for the same encoding. ICU matches a requested encoding name against the list of names in icu/source/data/mappings/convrtrs.txt; when it finds a match, it opens a converter with the name in the leftmost position of the matching line. Name matching is not case-sensitive and ignores spaces, dashes, and underscores. At build time, the gencnval tool (in icu/source/tools/gencnval) generates a binary form of convrtrs.txt, the cnvalias.icu file ("Converter Aliases data file"), for use at runtime.
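The name matching described above can be sketched as follows. This is only an illustration of the stated rule (case-insensitive, ignoring spaces, dashes, and underscores), not ICU's actual comparison code; the function names and the alias lines are hypothetical and are not taken from convrtrs.txt.

```python
from typing import List, Optional


def normalize_converter_name(name: str) -> str:
    """Lowercase the name and drop spaces, dashes, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


def find_canonical_name(requested: str, alias_lines: List[List[str]]) -> Optional[str]:
    """Return the leftmost name of the first line that contains a matching alias."""
    wanted = normalize_converter_name(requested)
    for names in alias_lines:
        if any(normalize_converter_name(n) == wanted for n in names):
            return names[0]  # the leftmost name identifies the converter
    return None


# Hypothetical alias lines for illustration only.
ALIAS_LINES = [
    ["example-main-1", "EXMAIN1", "ex_main 1"],
    ["example-other", "exother"],
]

print(find_canonical_name("Ex Main-1", ALIAS_LINES))
```

Normalizing both sides once keeps the comparison a plain string equality, which is why the lookup can be a simple linear scan in this sketch.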
.ucm files are line-oriented text files. Empty lines and comments starting with '#' are ignored.
A .ucm file contains two sections:
The header fields are:
The following shows the exact implied state tables for non-MBCS types. A state table may need to be overridden in order to allow supplementary characters (U+10000 and up).
The subchar and subchar1 fields have been known to cause some confusion. The following conditions outline when each is used:
In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U(1-6 hexadecimal digits for the code point)>), a codepage character byte sequence (each byte like \xhh, with 2 hexadecimal digits), and an optional "precision" or "fallback" indicator.
The precision indicator either must be present in all mappings or in none of them. The indicator is a pipe symbol ‘|’ followed by a 0, 1, 2, 3, or 4 that has the following meaning:
Fallback mappings from Unicode typically do not map codes for the same character, but for "similar" ones. This mapping is sometimes done if a character exists in Unicode but not in the codepage. In that case, the Unicode character may be mapped to the codepage code of a similar-looking character for human-readable output. This feature is not useful for text data transmission, especially in markup languages where a Unicode code point can be escaped with its code point value. The ICU API function ucnv_setFallback() controls this fallback behavior.
"Reverse fallbacks" are technically similar, but the same Unicode character can be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.
A subset of the fallback mappings from Unicode is always used at runtime: Those that map private-use Unicode code points. Fallbacks from private-use code points are often introduced as replacements for previous roundtrip mappings for the same pair of codes. These replacements are used when a Unicode version assigns a new character that was previously mapped to that private-use code point. The mapping table is then changed to map the same codepage byte sequence to the new Unicode code point (as a new roundtrip) and the mapping from the old private-use code point to the same codepage code is preserved as a fallback.
A "good one-way" mapping is like a fallback, but ICU always uses "good one-way" mappings at runtime, regardless of the fallback API flag.
The idea is that fallbacks normally lose information, such as mapping from a compatibility variant of a letter to the ASCII version; however, fallbacks from PUA and reverse fallbacks are assumed to be for "the same character", just an older code for it.
Something similar happens with from-Unicode Variation Selector sequences. It is possible to round-trip (|0) either the unadorned character or the sequence with a variation selector, and add a "good one-way" mapping (|4) from the other version. That "good one-way" mapping does not lose much information, and it is used even if the "use fallback" API flag is false. Alternatively, both mappings could be fallbacks (|1) that should be controlled by the "use fallback" attribute.
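As a rough illustration of the mapping line format and of which mappings apply at runtime, the following Python sketch parses a 1:1 mapping line and evaluates the rules discussed above. The function names are hypothetical, the flag labels are this sketch's reading of the surrounding text, and treating |2 (subchar1) lines as "not a usable mapping" is an assumption.

```python
import re

# Labels for the precision indicators, as understood from the discussion above.
PRECISION = {
    0: "roundtrip",
    1: "fallback from Unicode",
    2: "subchar1 mapping",
    3: "reverse fallback to Unicode",
    4: "good one-way from Unicode",
}

LINE_RE = re.compile(
    r"<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-4]))?$"
)


def parse_mapping(line):
    """Parse one 1:1 CHARMAP line into (code_point, bytes, flag or None)."""
    text = line.split("#", 1)[0].strip()  # drop comments and blank lines
    if not text:
        return None
    m = LINE_RE.match(text)
    if m is None:
        raise ValueError("not a 1:1 mapping line: %r" % line)
    data = bytes.fromhex(m.group(2).replace("\\x", ""))
    flag = int(m.group(3)) if m.group(3) is not None else None
    return int(m.group(1), 16), data, flag


def is_private_use(cp):
    """True for the BMP private use area and the two private use planes."""
    return (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def used_from_unicode(flag, cp, use_fallback):
    """Is this mapping applied when converting from Unicode?"""
    if flag in (0, 4):          # roundtrips and good one-way mappings: always
        return True
    if flag == 1:               # fallbacks: API flag, or private-use source
        return use_fallback or is_private_use(cp)
    return False                # |2 and |3 do not produce bytes here


def used_to_unicode(flag):
    """Is this mapping applied when converting to Unicode?"""
    return flag in (0, 3)       # reverse fallbacks are always used
```

For example, `used_from_unicode(1, 0xE000, False)` is true because the source is a private-use code point, while the same flag on a non-private-use code point depends on the fallback setting.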
The conversion to Unicode uses a state machine to achieve the above capabilities with reasonable data file sizes. The state machine information itself is loaded with the conversion data and defines the structure of the codepage, including which byte sequences are valid, unassigned, and illegal. This data cannot be computed (or at least not easily) from the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries that are specific to the ICU makeconv tool. The state tables for SBCS, DBCS, and EBCDIC_STATEFUL are implied, but they can be overridden (see the examples below). State tables are specified with <icu:state> lines in the header section of the .ucm file. Each line defines one state (one row) of the state machine, so the table has as many rows as there are <icu:state> lines. Each row has 256 entries, one for each possible byte value.
The state table lines in the .ucm header conform to the following Extended Backus-Naur Form (EBNF)-like grammar (whitespace is allowed between all tokens):
Each state table row description (that follows the <icu:state>) begins with an optional initial or surrogates keyword and is followed by one or more column entries. For the purpose of codepage state tables, the states=rows in the table are numbered beginning at 0 for the first line in the .ucm file header. The numbers are assigned implicitly by the makeconv tool in order of the <icu:state> lines.
A row may be empty (nothing following the <icu:state>) — that is equivalent to "all illegal" or 0-ff.i and is useful for trail byte states for all-illegal byte sequences.
Each column entry contains at least one hexadecimal byte value or value range; entries are separated by commas. The column entry specifies how to interpret an input byte in the row's state. If neither a next state nor an action is explicitly specified (only the byte range is given), then the byte value terminates the byte sequence, results in a valid mapping to a Unicode BMP character, and resets the state number to 0. The first line with <icu:state> is called state 0.
The next state can be explicitly specified with a separating colon ( : ) followed by the number of the state (=number/index of the row, starting at 0). This specification is mostly used for intermediate byte values (such as bytes that are not the last ones in a sequence). The state machine needs to proceed to the next state and read another byte. In this case, no other action is specified.
If the byte value(s) terminate(s) a byte sequence, then the byte sequence results in the following depending on the action that is announced with a period ( . ) followed by a letter:
If an action is specified without a next state, then the next state number defaults to 0. In other words, a byte value (range) terminates a sequence if there is an action specified for it, or when there is neither an action nor a next state. In the latter case, the byte value defaults to "valid, next state is 0" (equivalent to :0.).
If a byte value is not specified in any column entry of a row, then it is illegal in that state. If a byte value is specified in more than one column entry of the same row, then the last specification is used. This allows you to assign common properties to a wide byte value range and then list a few exceptions, which is easier than specifying mutually exclusive ranges, especially if many of them have the same properties.
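The row syntax described so far can be modelled with a small parser. The following Python sketch is not the makeconv implementation; it expands the text after <icu:state> into 256 cells using the rules above. The cell representation and function name are hypothetical, state numbers are read as hexadecimal like byte values (an assumption), and the action letter s is assumed in addition to the u, i, and p letters that appear in this text.

```python
def parse_state_row(row):
    """Expand the text after <icu:state> into (keyword, 256 cells).

    Each cell is (next_state, kind); kind is "valid", "continue" (more
    bytes follow), or one of the action letters "u", "s", "p", "i".
    """
    cells = [(0, "i")] * 256          # unspecified bytes are illegal
    keyword = None
    entries = [e.strip() for e in row.split(",") if e.strip()]
    if entries and entries[0] in ("initial", "surrogates"):
        keyword = entries.pop(0)
    for entry in entries:
        spec, dot, action = entry.partition(".")
        byte_range, colon, next_text = spec.partition(":")
        first_text, dash, last_text = byte_range.partition("-")
        first = int(first_text.strip(), 16)
        last = int(last_text.strip(), 16) if dash else first
        next_state = int(next_text.strip(), 16) if colon else 0
        if dot:
            kind = action.strip() or "valid"   # "41-fe:1." means valid
        elif colon:
            kind = "continue"                  # next state only: not final
        else:
            kind = "valid"                     # bare range: valid, state 0
        if kind not in ("valid", "continue", "u", "s", "p", "i"):
            raise ValueError("unknown action in entry %r" % entry)
        for byte in range(first, last + 1):    # later entries override
            cells[byte] = (next_state, kind)
    return keyword, cells


keyword, cells = parse_state_row("initial, 0-ff:1.i, 40:1.")
print(keyword, cells[0x40], cells[0x41])
```

Filling the row with illegal cells first and letting later entries overwrite earlier ones mirrors the "wide range followed by a few exceptions" style that the text recommends.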
The optional keyword at the beginning of a state line has the following effect:
When converting to Unicode, the state machine starts in state number 0. In each iteration, the state machine reads one input (codepage) byte and either proceeds to the next state as specified, or treats it as a final byte with the specified action and an optional non-0 next (initial) state. This means that a state table needs to have at least as many state rows as the maximum number of bytes per character, which is the maximum length of any byte sequence.
Exception: For EBCDIC_STATEFUL codepages, double-byte sequences start in state 1, with the SI/SO bytes switching from state 0 to state 1 or from state 1 to state 0. See the default state table below.
ICU 2.8 adds an "extension" data structure to its conversion tables. The new data structure supports a number of new features. When any of the following features is used, all mappings must use a precision indicator.
Before ICU 2.8, only one Unicode code point could be converted to or from one complete codepage byte sequence. The new data structure supports the conversion between multiple Unicode code points and multiple complete codepage byte sequences. (A "complete codepage byte sequence" is a sequence of bytes which is valid according to the state table.)
Simply write more than one Unicode code point on a mapping line, and/or more than one complete codepage byte sequence. Plus signs (+) are optional between code points and between bytes. For example,
<U304B><U309A> \xEC\xB5 |0
and test3.ucm contains
<U101234>+<U50005>+<U60006> \x07+\x00+\x01\x02\x0f+\x09 |0
For more examples see the ICU conversion data and the icu/source/test/testdata/test*.ucm test data files.
ICU 2.8 supports up to 19 UChars on the Unicode side of a mapping and up to 31 bytes on the codepage side.
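A lenient tokenizer is enough to read such m:n lines. The Python sketch below is an illustration only: it accepts optional plus signs, counts UChars as UTF-16 code units (two for supplementary code points), and applies the limits stated above. The function name is hypothetical, and the sketch does not check that the bytes form complete sequences according to a state table.

```python
import re

TOKEN_RE = re.compile(
    r"<U([0-9A-Fa-f]{1,6})>|\\x([0-9A-Fa-f]{2})|\|([0-4])|\+"
)


def parse_mn_mapping(line):
    """Parse an m:n mapping line into (code_points, bytes, flag or None)."""
    text = line.split("#", 1)[0].strip()
    code_points, data, flag = [], bytearray(), None
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected text at column %d" % pos)
        if m.group(1) is not None:
            code_points.append(int(m.group(1), 16))
        elif m.group(2) is not None:
            data.append(int(m.group(2), 16))
        elif m.group(3) is not None:
            flag = int(m.group(3))
        pos = m.end()                 # a bare "+" is simply skipped
    # Supplementary code points take two UTF-16 code units (UChars).
    uchars = sum(2 if cp > 0xFFFF else 1 for cp in code_points)
    if not 1 <= uchars <= 19 or not 1 <= len(data) <= 31:
        raise ValueError("mapping exceeds the supported lengths")
    return code_points, bytes(data), flag


print(parse_mn_mapping(r"<U304B><U309A> \xEC\xB5 |0"))
```

The three supplementary code points in the test3.ucm line above count as six UChars, well within the limit of 19.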
The longest match possible is converted in order to properly handle tables where the source sides of some mappings are prefixes of the source sides of other mappings.
As a side effect, if conversion offsets are written and a potential match crosses buffer boundaries, then some of the initial offsets for the following output may be unknown (-1) because their input was stored in the converter from a previous buffer while looking for a longer match.
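The longest-match rule can be illustrated with a table keyed by tuples of code points. This Python sketch assumes the whole input is available at once, so it does not model the buffering across calls that causes the unknown offsets mentioned above. The function name is hypothetical, and the single-character mapping for U+304B uses made-up bytes; only the two-character mapping comes from the example in this section.

```python
def convert_longest_match(code_points, table):
    """Convert code points to bytes, always taking the longest table match."""
    max_len = max(len(key) for key in table)
    out = bytearray()
    i = 0
    while i < len(code_points):
        # Try the longest candidate first, then shorter prefixes.
        for length in range(min(max_len, len(code_points) - i), 0, -1):
            key = tuple(code_points[i:i + length])
            if key in table:
                out += table[key]
                i += length
                break
        else:
            raise KeyError("no mapping for U+%04X" % code_points[i])
    return bytes(out)


TABLE = {
    (0x304B,): b"\x44\x4a",          # hypothetical bytes for illustration
    (0x304B, 0x309A): b"\xec\xb5",   # mapping quoted in this section
}

print(convert_longest_match([0x304B, 0x309A], TABLE).hex())
```

Because U+304B alone is a prefix of the two-character source, a shortest-match strategy would never reach the second mapping; trying longer candidates first avoids that.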
Conversion tables for SI/SO-stateful (usually EBCDIC_STATEFUL) codepages cannot include mappings with SI or SO bytes or where there are SBCS characters in a multi-character byte sequence. In other words, for these tables there must be exactly one byte in a mapping or else a sequence of one or more DBCS characters.
Physically, a binary conversion table (.cnv) file automatically contains both a traditional "base table" data structure for the 1:1 mappings and a new "extension table" for the m:n mappings if any are encountered in the .ucm file. An extension table can also be requested manually by splitting the CHARMAP into two. The first CHARMAP section will be used for the base table, and the second only for the extension table. Any m:n mappings in the first CHARMAP will be moved to the extension table.
In order to save space for very similar conversion tables, it is possible to create delta .cnv files that contain only an extension table and the name of another .cnv file with a base table. The base file must be split into two CHARMAPs such that the base file's base table does not contain any mappings that contradict any of the delta file's mappings.
The delta (extension-only) file uses only a single CHARMAP section. In addition, it needs a line in the header that both causes building just a delta file and specifies the name of the base file. For example, windows-936-2000.ucm contains
makeconv ignores all mappings for the delta file that are also in the base file's base table. If the two conversion tables are sufficiently similar, then the delta file will contain only a relatively small set of mappings, which results in a small .cnv file. At runtime, both the delta file and its base file are loaded, and the base file's base table is used together with the extension file. The base file works as a standalone file, using its own extension table for its full set of mappings. The base file must be in the same ICU data package as the delta file.
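The relationship between a delta file and its base table can be pictured with plain dictionaries. The Python sketch below is a simplified model, not what makeconv does internally: it treats each table as a from-Unicode dictionary, keeps only the variant mappings that the base table does not already provide, and reports base-table entries that would contradict the variant. All names and sample mappings are hypothetical.

```python
def build_delta(variant, base_table):
    """Keep only the variant mappings that the base table does not already have."""
    return {src: dst for src, dst in variant.items()
            if base_table.get(src) != dst}


def contradictions(variant, base_table):
    """Sources that the base table maps differently from the variant.

    Such mappings would have to live in the base file's extension table
    instead of its base table.
    """
    return {src for src, dst in variant.items()
            if src in base_table and base_table[src] != dst}


# Hypothetical tables: code point -> bytes.
BASE_TABLE = {0x41: b"\x41", 0x3042: b"\x82\xa0", 0xA5: b"\x5c"}
VARIANT = {0x41: b"\x41", 0x3042: b"\x82\xa0", 0x5C: b"\x5c", 0xA5: b"\x81\x91"}

print(sorted(build_delta(VARIANT, BASE_TABLE)))
print(sorted(contradictions(VARIANT, BASE_TABLE)))
```

In this made-up data the delta holds two mappings, and one of them (for U+00A5) contradicts the base table, so the base file would need that mapping moved out of its base table before the variant could be built as a delta.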
The hard part is to split the base file's mappings into base and extension CHARMAPs such that the base table does not overlap with any delta file, while all shared mappings should be in the base table. (The base table data structure is more compact than the extension table data structure.)
ICU provides the ucmkbase tool in the ucmtools collection to do this.
For example, the following illustrates how to use ucmkbase to make a base .ucm file for three Shift-JIS conversion table variants. (ibm-943_P15A-2003.ucm becomes the base.)
After this, the two delta .ucm files only need to get the following line added before the start of their CHARMAPs:
The ICU tools and runtime code handle DBCS-only conversion tables specially, allowing them to be built into delta files with MBCS or EBCDIC_STATEFUL base files without using their single-byte mappings, and without ucmkbase moving the single-byte mappings of the base file into the base file's extension table. See for example ibm-16684_P110-2003.ucm and ibm-1390_P110-2003.ucm.
ICU 2.8 adds support for the specification of which unassigned Unicode code points should be mapped to subchar1 rather than the default subchar. See the discussion of subchar1 above for more details.
The extension table data structure also removes one minor limitation on ICU conversion tables: Fallback mappings to a single byte 00 are now allowed and handled properly. ICU versions before 2.8 could only handle roundtrips to/from 00.
The following shows the exact implied state tables for non-MBCS types. A state table may need to be overridden in order to allow supplementary characters (U+10000 and up).
This single-row state table describes US-ASCII. Byte values from 0 to 0x7f are valid and map to Unicode characters up to U+ffff. Byte values from 0x80 to 0xff are illegal.
This two-row state table describes the Shift-JIS structure, which encodes some characters with one byte each and others with two bytes each. Bytes 0 to 0x7f and 0xa0 to 0xdf are valid single-byte encodings. Bytes 0x81 to 0x9f and 0xe0 to 0xfc are lead bytes. (That is, they are followed by one of the bytes that is specified as valid in state 1.) A byte sequence of 0x85 0x61 is valid, while a single byte of 0x80 or 0xff is illegal. Similarly, a byte sequence of 0x85 0x31 is illegal.
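To make the Shift-JIS description concrete, the following Python sketch builds the two rows from byte ranges and walks a byte string through them. It is a simplified model of the to-Unicode state machine that only classifies sequences and does no mapping. The state 0 ranges come from the description above; the trail byte ranges 0x40-0x7e and 0x80-0xfc in state 1 are an assumption consistent with the examples given, and the helper names are hypothetical.

```python
def make_row(*specs):
    """Build a 256-cell row from (first, last, next_state) specs.

    next_state None marks a final, valid byte; all other bytes are illegal.
    """
    row = ["illegal"] * 256
    for first, last, next_state in specs:
        for byte in range(first, last + 1):
            row[byte] = "final" if next_state is None else next_state
    return row


SJIS = [
    # State 0: single bytes and lead bytes.
    make_row((0x00, 0x7F, None), (0x81, 0x9F, 1),
             (0xA0, 0xDF, None), (0xE0, 0xFC, 1)),
    # State 1: trail bytes (assumed ranges).
    make_row((0x40, 0x7E, None), (0x80, 0xFC, None)),
]


def classify(data, table=SJIS):
    """Split data into byte sequences labelled valid, illegal, or truncated."""
    result, state, start = [], 0, 0
    for i, byte in enumerate(data):
        cell = table[state][byte]
        if cell in ("final", "illegal"):
            label = "valid" if cell == "final" else "illegal"
            result.append((data[start:i + 1], label))
            state, start = 0, i + 1
        else:
            state = cell              # lead byte: read another byte
    if start < len(data):
        result.append((data[start:], "truncated"))
    return result


print(classify(b"\x85\x61"), classify(b"\x85\x31"), classify(b"\x80"))
```

Note that the illegal sequence 0x85 0x31 is reported with both bytes, because the second byte is only found to be illegal after the lead byte has been consumed.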
This fairly complicated state table describes EUC-JP. Valid byte sequences are one, two, or three bytes long. Two-byte sequences have a lead byte of 0x8e and end in state 2, or have lead bytes 0xa1 to 0xfe and end in state 1. Three-byte sequences have a lead byte of 0x8f and continue in state 3. Some final byte value ranges are entirely unassigned, therefore they end in state 4 with an action letter of u for "unassigned" to save significant memory for the code units table. Assigned three-byte sequences end in state 1 like most two-byte sequences.
SBCS default state table:
SBCS by default implies the structure for single-byte, 8-bit codepages.
DBCS default state table:
These are four states — the fourth has an empty line (equivalent to 0-ff.i)! DBCS codepages, by default, are defined with the EBCDIC double-byte structure. Valid sequences are pairs of bytes from 0x41 to 0xfe and the one pair 0x40/0x40 for the double-byte space. The structure is defined such that all illegal byte sequences are always two in length. Therefore, every byte in the initial state is a lead byte.
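The default DBCS structure is simple enough to check without a table. The Python sketch below encodes the rule stated above (pairs of bytes from 0x41 to 0xfe, plus the single pair 0x40/0x40) and always consumes two bytes, so illegal sequences are two bytes long as well. The function names are hypothetical.

```python
def dbcs_pair_is_valid(lead, trail):
    """Apply the default DBCS structure to one byte pair."""
    if lead == 0x40:
        return trail == 0x40          # only the double-byte space
    return 0x41 <= lead <= 0xFE and 0x41 <= trail <= 0xFE


def split_dbcs(data):
    """Split data into two-byte units, each with a validity flag."""
    if len(data) % 2:
        raise ValueError("DBCS data must have an even number of bytes")
    return [(data[i:i + 2], dbcs_pair_is_valid(data[i], data[i + 1]))
            for i in range(0, len(data), 2)]


print(split_dbcs(b"\x40\x40\x41\xfe\x40\x41"))
```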
EBCDIC_STATEFUL default state table:
This is the structure of Mixed Single-byte and Double-byte EBCDIC codepages, which are stateful and use the Shift-In/Shift-Out (SI/SO) bytes 0x0f/0x0e. The initial state 0 is almost the same as for SBCS except for SI and SO. State 1 is also an initial state and is the basis for a state-shifted version of the DBCS structure above. All double-byte sequences return to state 1 and SI switches back to state 0. SI and SO are also allowed in their own states with no effect.
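The shifting behaviour can be sketched without the full state table. The following Python function splits an EBCDIC_STATEFUL byte string into single-byte and double-byte units by tracking SO (0x0e) and SI (0x0f). It is a simplified model under the assumption that every non-shift byte in double-byte mode starts a two-byte unit; it does not validate the pairs, and the function name is hypothetical.

```python
SO, SI = 0x0E, 0x0F   # Shift-Out enters double-byte mode, Shift-In leaves it


def split_ebcdic_stateful(data):
    """Split data into character byte units, dropping the SI/SO bytes."""
    units = []
    double_byte = False
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SO:
            double_byte = True        # no effect if already in that mode
            i += 1
        elif byte == SI:
            double_byte = False
            i += 1
        elif double_byte:
            if i + 2 > len(data):
                raise ValueError("truncated double-byte sequence")
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


print(split_ebcdic_stateful(b"\xc1\x0e\x41\x42\x43\x44\x0f\xc2"))
```

The same byte value therefore means different things depending on the current shift state, which is what makes these codepages stateful.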
An initial state never needs a surrogates designation or .p because the Unicode mapping results for initial states are stored directly in the state table, where each cell provides enough room. The size of a generated .cnv mapping table file depends primarily on the number and distribution of the mappings and on the number of valid multi-byte sequences that the state table allows. Each state table row takes up one kilobyte.
For single-byte codepages, the state table cells contain all to-Unicode mappings. Code point results for multi-byte sequences are stored in an array with enough room for all valid byte sequences. For all byte sequences that end in a surrogates or .p state, two code units are allocated.
If possible, valid state table entries may be changed to .u to reduce the number of valid, assignable sequences and to make the .cnv file smaller. If additional states are necessary, then each additional state itself adds 1kB to the file size, diminishing the file size savings. See the EUC-JP example above.
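The trade-off described here is simple arithmetic and can be estimated roughly. The Python sketch below covers only the to-Unicode side and assumes 16-bit (2-byte) code units, which this text does not state; real .cnv files contain more than this, so the numbers are only indicative. The function names and sample figures are hypothetical.

```python
STATE_ROW_BYTES = 1024   # each state table row takes up one kilobyte
CODE_UNIT_BYTES = 2      # assumption: results are 16-bit code units


def to_unicode_bytes(state_rows, bmp_sequences, pair_sequences):
    """Rough size of the state table plus the multi-byte result array.

    pair_sequences counts valid sequences ending in a surrogates or .p
    state, which get two code units each.
    """
    code_units = bmp_sequences + 2 * pair_sequences
    return state_rows * STATE_ROW_BYTES + code_units * CODE_UNIT_BYTES


def net_saving(removed_sequences, added_states):
    """Bytes saved by marking sequences .u, minus the cost of new states."""
    return (removed_sequences * CODE_UNIT_BYTES
            - added_states * STATE_ROW_BYTES)


print(to_unicode_bytes(2, 1000, 0))
print(net_saving(2000, 1), net_saving(400, 1))
```

Under these assumptions, an extra state pays for itself only when it removes more than 512 single-unit sequences from the result array.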
For codepages with up to two bytes per character, the makeconv tool automatically compacts the state table, if possible, by introducing one more trail byte state. This state replaces valid entries in the original trail state with unassigned entries, and each lead byte entry is changed to use the new state if there are no mappings with that lead byte.
For codepages with up to three or four bytes per character, compaction must be done manually. However, if the verbose option is set on the command line, the makeconv tool will print useful information about unassigned byte sequences.