When designing applications around Unicode characters, it is sometimes necessary to convert between Unicode encodings or between Unicode and legacy text data. The vast majority of modern operating systems support Unicode to some degree, but legacy text data from older systems often needs to be converted to and from Unicode. This conversion can be done with an ICU converter.
ICU provides comprehensive character set conversion services, mapping tables, and implementations for many encodings. Since ICU uses Unicode (UTF-16) internally, all converters convert between UTF-16 (with the endianness according to the current platform) and another encoding. This includes Unicode encodings. In other words, internal text is 16-bit Unicode, while "external text" used as source or target for a conversion is always treated as a byte stream.
ICU converters are available for a wide range of encoding schemes. Most of them are based on mapping table data that is handled by a few generic implementations. Some encodings, especially Unicode encodings, are implemented algorithmically in addition to (or instead of) using mapping tables; many other encoding schemes are partly or entirely table-based.

All ICU converters map only single Unicode code points to and from single codepage code points. ICU converters do not deal directly with combining characters, bidirectional reordering, or Arabic shaping, for example. Such processes, if required, must be handled separately. For example, while the text is in Unicode, the ICU BiDi APIs can be used for bidirectional reordering after a conversion to Unicode or before a conversion from Unicode.
ICU converters are not designed to perform any encoding autodetection. For example, the converters do not autodetect "endianness", any of the six Unicode encoding signatures, or the difference between Shift-JIS and EUC-JP. There are two exceptions: the UTF-16 and UTF-32 converters work according to Unicode's specification of their character encoding schemes, that is, they read the BOM to determine the actual "endianness".
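The BOM handling of the UTF-16 encoding scheme can be observed with Python's stdlib codecs, used here only as an analogy to ICU's UTF-16 converter (the codec names below are Python's, not ICU's):

```python
import codecs

# The generic "utf-16" decoder reads the byte order mark (BOM) to pick
# the byte order, mirroring how a converter for the UTF-16 encoding
# scheme determines "endianness" from the BOM.
be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")   # big-endian stream
le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")   # little-endian stream

print(be.decode("utf-16"))   # hi
print(le.decode("utf-16"))   # hi
```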
The ICU mapping tables mostly come from an IBM® codepage repository. For non-IBM codepages, there is typically an equivalent codepage registered with this repository. However, the textual data format (.ucm files) is generic, and data for other codepage mapping tables can also be added.
ICU has code to determine the default codepage of the system or process. This default codepage can be used to convert char * strings to and from Unicode.
Depending on system design, setup, and APIs, it may not always be possible to find a default codepage that fully works as expected.
If you have a means of detecting a default codepage name that is more appropriate for your application, then you should set that name with ucnv_setDefaultName() as the first ICU function call. This ensures that the internally cached default converter will be instantiated from your preferred name.
Starting in ICU 2.0, when a converter for the default codepage cannot be opened, a fallback default codepage name and converter will be used. On most platforms, this will be US-ASCII. For z/OS (OS/390), ibm-1047,swaplfnl is the default fallback codepage. For AS/400 (iSeries), ibm-37 is the default fallback codepage. This default fallback codepage is used when the operating system is using a non-standard name for a default codepage, or the converter was not packaged with ICU. The feature allows ICU to run in unusual computing environments without completely failing.
A "converter" refers to the C structure UConverter. Converters are cheap to create. Any data that is shared between converters of the same kind (such as the mappings, the name, and the properties) is automatically cached and shared in memory.
Codepages and encoding schemes have been given many names by various vendors and platforms over the years. Vendors have different ways to specify which codepage and encoding are being used. IBM uses a CCSID (Coded Character Set IDentifier). Windows uses a CPID (CodePage IDentifier). Macintosh has a TextEncoding. Many Unix vendors use IANA character set names. Many of these names are aliases to converters within ICU.
In order to help identify which names are recognized by certain platforms, ICU provides several converter alias functions. The complete description of these functions can be found in the ICU API Reference.
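The aliasing idea can be sketched with Python's codec registry, which plays a role similar to ICU's alias table (a rough analogy only; the ICU alias functions themselves are documented in the API Reference):

```python
import codecs

# Several vendor and IANA names resolve to one canonical codec, much as
# many aliases map onto a single ICU converter.
for alias in ("latin-1", "iso8859-1", "cp819", "l1"):
    print(alias, "->", codecs.lookup(alias).name)   # all resolve to iso8859-1
```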
Even though IANA specifies a list of aliases, it usually does not specify the mappings or the actual character set for the aliases. Sometimes vendors map similar glyph variants to different Unicode code points, or assign completely different glyphs to the same codepage code point. Because of these ambiguities, you can sometimes get U_AMBIGUOUS_ALIAS_WARNING for the returned UErrorCode when more than one converter uses the requested alias. This is only a warning, and the results can still be used. This UErrorCode value is just a reminder that you may not get what you expected. The alias functions above can help you to determine which converter you actually wanted.
EBCDIC-based converters have the option to swap the newline and linefeed character mappings. This can be useful when transferring EBCDIC documents between z/OS (MVS, OS/390 and the rest of the zSeries family) and another EBCDIC machine like OS/400 on iSeries. The ",swaplfnl" suffix or UCNV_SWAP_LFNL_OPTION_STRING from ucnv.h can be appended to a converter alias in order to achieve this behavior. You can view other available options in ucnv.h.
You can always skip many of these aliasing and mapping problems by just using Unicode.
There are four ways to create a converter:
Closing a converter frees the memory occupied by that instance of the converter. However, it does not release the larger shared data tables the converter might use. OS-level code may call ucnv_flushCache() to explicitly free memory occupied by unused tables.

Note that a converter is created with a certain type (for instance, ISO-8859-3) which does not change over the life of that object. Converters should be allocated one per thread. They are cheap to create, as the shared data does not need to be reallocated.
This is the typical life cycle of a converter, as shown step-by-step:
A converter cannot be shared between threads at the same time. However, if it is reset it can be used for unrelated chunks of data. For example, use the same converter for converting data from Unicode to ISO-8859-3, and then reset it. Use the same converter for converting data from ISO-8859-3 back into Unicode.
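The open/convert/reset/reuse life cycle can be sketched with a stateful incremental decoder from Python's stdlib codecs (an analogy for a UConverter; ucnv_reset() corresponds to the reset() call below):

```python
import codecs

# One converter instance, reused for unrelated chunks of data.
dec = codecs.getincrementaldecoder("utf-8")()

# First stream: a multi-byte character (U+00E9) split across two chunks;
# the decoder's internal state carries the pending lead byte.
text = dec.decode(b"\xc3") + dec.decode(b"\xa9")
print(repr(text))

# Reset before converting an unrelated stream, discarding pending state.
dec.reset()
print(dec.decode(b"plain ascii"))
```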
If it is necessary to convert a large quantity of data in smaller buffers, use the same converter to convert each buffer. This makes sure that any state is preserved from one chunk to the next. This kind of conversion is known as streaming or buffering, and is discussed in the Buffered Conversion section (§) later in this chapter.
Cloning a converter returns a clone of the converter object along with any internal state that the converter might be storing. Cloning routines must be used with extreme care when using converters for stateful or multibyte encodings. If the converter object is carrying an internal state and the newly created clone is used to convert a new chunk of text, the converter produces incorrect results. Also note that the caller owns the cloned object and must call ucnv_close() to dispose of it. Calling ucnv_reset() before cloning will reset the converter to its initial state.
Converters can be reset explicitly or implicitly. An explicit reset is done by calling ucnv_reset().
The converters are reset implicitly when the conversion functions are called with the "flush" parameter set to "TRUE" and the source is consumed.
Not all characters can be converted from Unicode to other codepages. In most cases, Unicode is a superset of the characters supported by any given codepage.
The default behavior of ICU in this case is to substitute the illegal or unmappable sequence, with the appropriate substitution sequence for that codepage. For example, ISO-8859-1, along with most ASCII-based codepages, has the character 0x1A (Control-Z) as the substitution sequence. When converting from Unicode to ISO-8859-1, any characters which cannot be converted would be replaced by 0x1A's.
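The effect can be demonstrated with Python's stdlib, used as an analogy: note that Python's "replace" handler substitutes '?' when encoding, whereas ICU substitutes the codepage's own substitution sequence (0x1A for ISO-8859-1, as described above).

```python
# 'é' (U+00E9) maps into ISO-8859-1; '€' (U+20AC) does not and is replaced.
s = "caf\u00e9 \u20ac"
print(s.encode("iso-8859-1", "replace"))   # b'caf\xe9 ?'
```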
SubChar1 is sometimes used as substitution character in MBCS conversions. For more information on SubChar1 please see the Conversion Data chapter.
In stateful converters like ISO-2022-JP, if a substitution character has to be written to the target, then an escape/shift sequence to change the state to single byte mode followed by a substitution character is written to the target.
The substitution character can be changed by calling the ucnv_setSubstChars() function with the desired codepage byte sequence. However, this has some limitations: It only allows setting a single character (although the character can consist of multiple bytes), and it may not work properly for some stateful converters (like HZ or ISO 2022 variants) when setting a multi-byte substitution character. (It will work for EBCDIC_STATEFUL ones.) Moreover, for setting a particular character, the caller needs to know the correct byte sequence for that character in the converter's codepage. (For example, a space [U+0020] is encoded as 0x20 in ASCII-based codepages, 0x40 in EBCDIC-based ones, 0x00 0x20 or 0x20 0x00 in UTF-16 depending on the stream's endianness, etc.)
The ucnv_setSubstString() function (new in ICU 3.6) lifts these limitations. It takes a Unicode string and verifies that it can be converted to the codepage without error and that it is not too long (32 bytes as of ICU 3.6). The string can contain zero, one or more characters. An empty string has the effect of using the skip callback. See the Error Callbacks below. Stateful converters are fully supported. The same Unicode string will give equivalent results with all converters that support its conversion.
Internally, ucnv_setSubstString() stores the byte sequence from the test conversion if the converter is stateless, or the Unicode string itself if the converter is stateful. If the Unicode string is stored, then it is converted on the fly during substitution, handling all state transitions.
The function ucnv_getSubstChars() can be used to retrieve the substitution byte sequence if it is the default one, set by ucnv_setSubstChars(), or if ucnv_setSubstString() stored the byte sequence for a stateless converter. The Unicode string set for a stateful converter cannot be retrieved.
In conversion to Unicode, errors are normally due to ill-formed byte sequences: Unused byte values, or lead bytes not followed by trail bytes according to the encoding scheme. Well-formed but unmappable sequences are unusual but possible.
The ICU default behavior is to emit a U+FFFD REPLACEMENT CHARACTER per offending sequence.
If the conversion table .ucm file contains a <subchar1> entry (such as in the ibm-943 table), a U+001A C0 control ("SUB") is emitted for single-byte illegal/unmappable input rather than U+FFFD REPLACEMENT CHARACTER. For details on this behavior look for "001A" in the Conversion Data chapter.
Here are some of the error codes which have significant meaning for conversion:
What actually happens is that an "error callback function" is called at the point where the conversion failure occurred. The function can deal with the failed characters as it sees fit. Possible options at the callback's disposal include ignoring the bad sequence, converting it to a different sequence, and returning an error to the caller. The callback can also consume any data past where the error occurred, whether or not that data would have caused an error. Only one callback is installed at a time, per direction (to or from Unicode).
A number of canned functions are provided by ICU, and an application can write new ones. The "callbacks" are either From Unicode (to codepage), or To Unicode (from codepage). Here is a list of the canned callbacks in ICU:
Here are some examples of how to use callbacks.
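Since the canned callbacks use the ICU C API, here is only a rough sketch of the callback idea using Python's error-handler registry (the handler name and behavior below are invented for illustration; it loosely mimics ICU's UCNV_FROM_U_CALLBACK_ESCAPE):

```python
import codecs

def escape_unmappable(exc):
    """From-Unicode 'callback': replace each unmappable character with a
    \\uXXXX escape and resume conversion after it."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return ("".join("\\u%04X" % ord(c) for c in bad), exc.end)

# Install the handler under a name, analogous to ucnv_setFromUCallBack().
codecs.register_error("escape-sketch", escape_unmappable)
print("price: \u20ac5".encode("ascii", "escape-sketch"))   # b'price: \\u20AC5'
```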
Writing a callback is somewhat involved, and will be covered more completely in a future version of this document. One might look at the source to the provided callbacks as a starting point, and address any further questions to the mailing list.
Basically, a callback, unlike most ICU functions (which expect to be called with U_ZERO_ERROR as the input error code), is called in an exceptional error condition. The callback is a kind of last-ditch effort to rectify the error that occurred before it is returned to the caller. This is why the implementation of the STOP callback is very simple: it leaves the error code set and returns.

An error code such as U_INVALID_CHAR_FOUND is then returned to the user. If the callback determines that no error should be returned to the user, the callback must set the error code to U_ZERO_ERROR. Note that this is a departure from most ICU functions, which are supposed to check the error code and return immediately if it is set.
Unicode has a number of characters that are not by themselves meaningful but assist with line breaking (e.g., U+00AD Soft Hyphen & U+200B Zero Width Space), bi-directional text layout (U+200E Left-To-Right Mark), collation and other algorithms (U+034F Combining Grapheme Joiner), or indicate a preference for a particular glyph variant (U+FE0F Variation Selector 16). These characters are "invisible" by default, that is, they should normally not be shown with a glyph of their own, except in special circumstances. Examples include showing a hyphen when a Soft Hyphen was used for a line break, or modifying the glyph of a character preceding a Variation Selector.
Unicode has a character property to identify such characters, as well as currently-unassigned code points that are intended to be used for similar purposes: Default_Ignorable_Code_Point, or "DI" for short: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]
Most charsets contain few or none of these characters.
Since ICU 54, unmappable default-ignorable code points are skipped by default (ticket #10551).
Older versions of ICU replaced unmappable default-ignorable code points like any other unmappable code points, by a question mark or whatever substitution character is defined for the charset.
For best results, a custom from-Unicode callback can be used to ignore Default_Ignorable_Code_Point characters that cannot be converted, so that they are removed from the charset output rather than replaced by a visible character.
This is a code snippet for use in a custom from-Unicode callback:
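The original snippet is written against ICU's C callback API; the following Python sketch conveys the same idea. The handler name is invented, and since Python's stdlib does not expose the Default_Ignorable_Code_Point property, a few DI examples from the text above stand in for the real set:

```python
import codecs

# Hypothetical stand-in for the DI property; a real implementation would
# query Default_Ignorable_Code_Point (in ICU: u_hasBinaryProperty).
DI_EXAMPLES = {0x00AD, 0x200B, 0x200E, 0x034F, 0xFE0F}

def skip_default_ignorable(exc):
    """From-Unicode 'callback': silently drop unmappable DI characters,
    substitute everything else as usual."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if all(ord(c) in DI_EXAMPLES for c in bad):
        return ("", exc.end)      # remove DI characters from the output
    return ("?", exc.end)         # visible substitution for the rest

codecs.register_error("skip-di-sketch", skip_default_ignorable)
# The Soft Hyphen vanishes; the unmappable '€' is still substituted.
print("co\u00adoperate \u20ac".encode("ascii", "skip-di-sketch"))   # b'cooperate ?'
```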
When a converter is instantiated, it can be used to convert both in the Unicode to Codepage direction, and also in the Codepage to Unicode direction. There are three ways to use the converters, as well as a convenience function which does not require the instantiation of a converter.
Data must be contained entirely within a single string or buffer.
In this mode, the input data is in the specified codepage. With each function call, only the next Unicode code point is converted. This can be the most efficient way to scan for a certain character, or to do other processing one character at a time, because converters are stateful. This works even for multibyte charsets, and for stateful ones such as ISO-2022-JP.
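A rough analogy to this character-at-a-time mode (ucnv_getNextUChar() in ICU), using Python's incremental decoder to show that state carries across a multibyte sequence (the helper name is invented):

```python
import codecs

def next_uchars(data, encoding):
    """Yield one Unicode code point at a time from a byte stream.
    The incremental decoder keeps state, so a double-byte Shift-JIS
    sequence split across byte positions still decodes correctly."""
    dec = codecs.getincrementaldecoder(encoding)()
    for i in range(len(data)):
        yield from dec.decode(data[i:i + 1])

# 0x83 0x41 is one double-byte character (KATAKANA LETTER A) in Shift-JIS.
print(list(next_uchars(b"a\x83\x41b", "shift_jis")))
```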
This is used in situations where a large document may be read in off of disk and processed. Also, many codepages take multiple bytes to encode a character, or have state. These factors make it impossible to convert arbitrary chunks of data without maintaining state across chunks. Even conversion from Unicode may encounter a leading surrogate at the end of one buffer, which needs to be paired with the trailing surrogate in the next buffer.
A basic API principle of the ICU to/from Unicode functions is that they will ALWAYS attempt to consume all of the input (source) data, unless the output buffer is full or some other error occurs. In other words, there is no need to ever test whether all of the source data has been consumed.
The basic loop used with the ICU buffer conversion routines is the same in the to-Unicode and from-Unicode directions. In the following pseudocode, either 'source' (for fromUnicode) or 'target' (for toUnicode) is a buffer of UTF-16 UChars.
The above code optimizes for processing entire chunks of input data. An efficient size for the output buffer can be calculated, in bytes, as UCNV_GET_MAX_BYTES_FOR_STRING(inputLength, ucnv_getMaxCharSize(converter)).
There are two loops used, an outer and an inner. The outer loop fetches input data to keep the source buffer full, and the inner loop 'writes' out data to keep the output buffer empty.
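The outer/inner loop above is written against the ICU C API; as a sketch of the same chunking discipline, Python's incremental encoder shows the role of the 'flush' flag on the final call (a stateful encoding such as ISO-2022-JP makes the need for flushing visible):

```python
import codecs

def encode_chunks(chunks, encoding):
    """Feed source chunks through one stateful converter, flushing only
    on the final call -- the same discipline as the ICU fromUnicode loop."""
    enc = codecs.getincrementalencoder(encoding)()
    out = bytearray()
    for i, chunk in enumerate(chunks):
        final = (i == len(chunks) - 1)     # 'flush' only on the last chunk
        out += enc.encode(chunk, final)
    return bytes(out)

chunks = ["abc ", "\u3053\u3093", "\u306b\u3061\u306f"]
# Chunked output matches converting the whole string at once, because
# state (here, the ISO-2022-JP shift state) is preserved across chunks.
print(encode_chunks(chunks, "iso2022_jp") == "".join(chunks).encode("iso2022_jp"))   # True
```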
Note that while this efficiently handles data on the input side, there are some cases where the size of the output buffer is fixed. For instance, in network applications it is sometimes desirable to fill every output packet completely (not including the last packet in the sequence). The above loop does not ensure that every output buffer is completely full. For example, with a 4-UChar input buffer and a 3-byte output buffer with fromUnicode(), the loop would typically write 3 bytes, then 1, then 3, and so on. If the goal is filling output buffers rather than efficient use of the input data, a slightly different loop can be used.
In such a scenario, the inner write does not occur unless a buffer overflow occurs or 'flush' is true. That is, the write and the resetting of the target and targetLimit pointers happen only if (err == U_BUFFER_OVERFLOW_ERROR || flush == TRUE).
The flush parameter on each conversion call should be set to FALSE, until the conversion call is called for the last time for the buffer. This is because the conversion is stateful. On the last conversion call, the flush parameter should be set to TRUE. More details are mentioned in the API reference in ucnv.h .
Preflighting is the process of asking the conversion API for the size of the target buffer required. (For a more general discussion, see the Preflighting section (§) in the Strings chapter.)
This is accomplished by calling the ucnv_fromUChars and ucnv_toUChars functions.
ICU provides some convenience functions for conversions:
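As a sketch of what a one-shot codepage-to-codepage conversion (such as ICU's ucnv_convert()) does, in Python terms, pivoting through Unicode just as ICU converts via UTF-16 internally:

```python
def convert(data, from_cp, to_cp):
    """One-shot conversion between two byte charsets, pivoting through
    Unicode -- analogous in spirit to ICU's ucnv_convert()."""
    return data.decode(from_cp).encode(to_cp)

print(convert(b"caf\xe9", "iso-8859-1", "utf-8"))   # b'caf\xc3\xa9'
```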
See the ICU Conversion Examples for more information.