Overview
Character set detection is the process of determining the character
set, or encoding, of character data in an unknown format. This is, at
best, an imprecise operation using statistics and heuristics. Because
of this, detection works best if you supply at least a few hundred
bytes of character data that's mostly in a single language. In some
cases, the language can be determined along with the encoding.
Several
different techniques are used for character set detection. For
multi-byte encodings, the sequence of bytes is checked for legal
patterns. The detected characters are also check against a list of
frequently used characters in that encoding. For single byte encodings,
the data is checked against a list of the most commonly occurring three
letter groups for each language that can be written using that
encoding. The detection process can be configured to optionally ignore
html or xml style markup, which can interfere with the detection
process by changing the statistics.
The input data can either be
a Java input stream, or an array of bytes. The output of the detection
process is a list of possible character sets, with the most likely one
first. For simplicity, you can also ask for a Java Reader that will
read the data in the detected encoding.
CharsetMatch
The CharsetMatch
class holds the result of comparing the input data to a particular
encoding. You can use an instance of this class to get the name of the
character set, the language, and how good the match is. You can also
use this class to decode the input data.
To find out how good the match is, you use the getConfidence() method to get a confidence value. This is an integer from 0 to 100. The higher the value, the more confidence there is in the match For example:
CharsetMatch match = ...; int confidence;
confidence = match.getConfidence(); if (confidence < 50 ) { // handle a poor match... } else { // handle a good match... }
|
In C, you can use ucsdet_getConfidence(const UCharsetMatch *ucsm, UErrorCode *status) method to get a confidence value
const UCharsetMatch *ucm; UErrorCode status = U_ZERO_ERROR; int32_t confidence = ucsdet_getConfidence(ucm, &status); if (confidence <50 ){ //handle a poor match... } else { //handle a good match... } |
To get the name of the character set, which can be used as an encoding name in Java, you use the getName() method:
CharsetMatch match = ...; byte characterData[] = ...; String charsetName; String unicodeData;
charsetName = match.getName(); unicodeData = new String(characterData, charsetName);
|
To get the name of the character set in C :
const UCharsetMatch *ucm; UErrorCode status = U_ZERO_ERROR; const char *name = ucsdet_getName(ucm, &status); |
To get the three letter ISO code for the detected language, you use the getLanguage() method. If the language could not be determined, getLanguage() will return null:
CharsetMatch match = ...; String languageCode;
languageCode = match.geLanguae(); if (languageCode != null) { // handle the language code... }
|
The ucsdet_getLanguage(const UCharsetMatch *ucsm, UErrorCode
*status) method can be used in C to get the language code. If the
language could not be determined, the method will return an empty
string.
const UCharsetMatch *ucm; UErrorCode status = U_ZERO_ERROR; const char *language = ucsdet_getLanguage(ucm, &status); |
If you want to get a Java String containing the converted data you can use the getString() method:
CharsetMatch match = ...; String unicodeData;
unicodeData = match.getString();
|
If you want to limit the number of characters in the string, pass the maximum number of characters you want to the getString() method:
CharsetMatch match = ...; String unicodeData;
unicodeData = match.getString(1024);
|
To get a java.io.Reader to read the converted data, use the getReader() method:
CharsetMatch match = ...; Reader reader; StringBuffer sb = new StringBuffer(); char[] buffer = new char[1024]; int bytesRead = 0;
reader = match.getReader(); while ((bytesRead = reader.read(buffer, 0, 1024)) >= 0) { sb.append(buffer, 0, bytesRead); } reader.close();
|
CharsetDetector
The CharsetDetector class does the actual detection. It matches the input data against all character sets, and computes a list of CharsetMatch objects to hold the results. The input data can be supplied as an array of bytes, or as a java.io.InputStream.
To use a CharsetDetector object, first you construct it, and then you set the input data, using the setText() method. Because setting the input data is separate from the construction, it is easy to reuse a CharsetDetector object:
CharsetDetector detector; byte[] byteData = ...; InputStream streamData = ...;
detector = new CharsetDetector();
detector.setText(byteData); // use detector with byte data...
detector.setText(streamData); // use detector with stream data...
|
If you want to know which character set matches your input data with the highest confidence, you can use the detect() method, which will return a CharsetMatch object for the match with the highest confidence:
CharsetDetector detector; CharsetMatch match; byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData); match = detector.detect();
|
If you want to know which character set matches your input
data in C, you can use the ucsdet_detect( UCharsetDetector *csd ,
UErrorCode *status) method.
UCharsetDetector* csd; const UCharsetMatch *ucm; static char buffer[BUFFER_SIZE] = {....}; int32_t inputLength = ... //length of the input text UErrorCode status = U_ZERO_ERROR; ucsdet_setText(csd, buffer, inputLength, &status); ucm = ucsdet_detect(csd, &status); |
If you want to know all of the character sets that could match your input data with a non-zero confidence, you can use the detectAll() method, which will return an array of CharsetMatch objects sorted by confidence, from highest to lowest.:
CharsetDetector detector; CharsetMatch matches[]; byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData); matches = detector.detectAll();
for (int m = 0; m < matches.length; m += 1) { // process this match... }
|
 | Note:The
ucsdet_detectALL( UCharsetDetector *csd , int32_t* matchesFound,
UErrorCode* status) method can be used in C in order to detect all of
the character sets where matchesFound is a pointer to a variable that
will be set to the number of charsets identified that are consistent
with the input data. |
The CharsetDetector class also implements a crude input filter
that can strip out html and xml style tags. If you want to enable the
input filter, which is disabled when you construct a CharsetDetector,
you use the enableInputFilter() method, which takes a boolean. Pass in
true if you want to enable the input filter, and false if you want to
disable it:
CharsetDetector detector; CharsetMatch match; byte[] byteDataWithTags = ...;
detector = new CharsetDetector();
detector.setText(byteDataWithTags); detector.enableInputFilter(true); match = detector.detect();
|
To enable an input filter in C, you can use ucsdet_enableInputFilter( UCharsetDetector* csd, UBool filter) function.
UCharsetDetector* csd; const UCharsetMatch *ucm; static char buffer[BUFFER_SIZE] = {....}; int32_t inputLength = ... //length of the input text UErrorCode status = U_ZERO_ERROR; ucsdet_setText(csd, buffer, inputLength, &status); ucsdet_enableInputFilter( csd, TRUE); ucm = ucsdet_detect(csd, &status);
|
If you have more detailed knowledge about the structure of the input
data, it is better to filter the data yourself before you pass it to CharsetDetector.
For example, you might know that the data is from an html page that
contains CSS styles, which will not be stripped by the input filter.
You can use the inputFilterEnabled() method to see if the input filter is enabled:
CharsetDetector detector; detector = new CharsetDetector();
// do a bunch of stuff with detector // which may or may not enable the input filter...
if (detector.inputFilterEnabled()) { // handle enabled input filter } else { // handle disabled input filter }
|
 | Note:The
ICU4C API provide uscdet_isInputFilterEnabled(const UCharsetDetector*
csd) function to check whether the input filter is enabled. |
The CharsetDetector class
also has two convenience methods that let you detect and convert the
input data in one step: the getReader() and getString() methods:
CharsetDetector detector; byte[] byteData = ...; InputStream streamData = ...; String unicodeData; Reader unicodeReader;
detector = new CharsetDetector();
unicodeData = detector.getString(byteData, null);
unicodeReader = detector.getReader(streamData, null);
|
| Note: the second argument to the getReader()
and getString() methods is a String called declaredEncoding, which is
not currently used. There is also a setDeclaredEncoding() method, which
is also not currently used. |
The following code is equivalent to using the convenience methods:
CharsetDetector detector; CharsetMatch match; byte[] byteData = ...; InputStream streamData = ...; String unicodeData; Reader unicodeReader;
detector = new CharsetDetector();
detector.setText(byteData); match = detector.detect(); unicodeData = match.getString();
detector.setText(streamData); match = detector.detect(); unicodeReader = match.getReader();CharsetDetector
|
Detected Encodings
The following table shows all the encodings that can be detected. You can get this list (without the languages) by calling the getAllDetectableCharsets() method:
| Character Set | Languages |
|---|
UTF-8 |
|
UTF-16BE |
|
UTF-16LE |
|
UTF-32BE |
|
UTF-32LE |
|
Shift_JIS | Japanese |
ISO-2022-JP | Japanese |
ISO-2022-CN | Simplified Chinese |
ISO-2022-KR | Korean |
GB18030 |
|
Big5 | Traditional Chinese |
EUC-JP | Japanese |
EUC-KR | Korean |
ISO-8859-1 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
ISO-8859-2 | Czech, Hungarian, Polish, Romanian |
ISO-8859-5 | Russian |
ISO-8859-6 | Arabic |
ISO-8859-7 | Greek |
ISO-8859-8 | Hebrew |
windows-1251 | Russian |
windows-1256 | Arabic |
KOI8-R | Russian |
ISO-8859-9 | Turkish |