Boundary Analysis

Overview of Text Boundary Analysis

Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:

  1. Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.

  2. Locating the beginning of a word that the user has selected.

  3. Counting characters, words, sentences, or paragraphs.

  4. Determining how far to move the text cursor when the user hits an arrow key (Some characters require more than one position in the text store and some characters in the text store do not display at all).

  5. Making a list of the unique words in a document.

  6. Figuring out if a given range of text contains only whole words.

  7. Capitalizing the first letter of each word.

  8. Locating a particular unit of the text (For example, finding the third word in the document).

The BreakIterator classes were designed to support these kinds of tasks. The BreakIterator objects maintain a location between two characters in the text. This location will always be a text boundary. Clients can move the location forward to the next boundary or backward to the previous boundary. Clients can also check if a particular location within a source text is on a boundary or find the boundary which is before or after a particular location.

Four Types of BreakIterator

ICU BreakIterators can be used to locate the following kinds of text boundaries:

  1. Character Boundary

  2. Word Boundary

  3. Line-break Boundary

  4. Sentence Boundary

Each type of boundary is found in accordance with the rules specified by Unicode Standard Annex #29, Unicode Text Segmentation (http://www.unicode.org/reports/tr29 ) or Unicode Standard Annex #14, Unicode Line Breaking Algorithm (http://www.unicode.org/reports/tr14)


Character Boundary

The character-boundary iterator locates the boundaries according to the rules defined in http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries. These boundaries try to match what a user would think of as a "character"—a basic unit of a writing system for a language—which may be more than just a single Unicode code point. 

The letter Ä, for example, can be represented in Unicode either with a single code-point value or with two code-point values (one representing the A and another representing the umlaut). The character-boundary iterator will treat either representation as a single character.

End-user characters, as described above, are also called grapheme clusters, in an attempt to limit the confusion caused by multiple meanings for the word "character".

Word Boundary

The word-boundary iterator locates the boundaries of words, for purposes such as double click selection or "Find whole words" operations.

Words boundaries are identified according to the rules in http://www.unicode.org/reports/tr29/#Word_Boundaries, supplemented by a word dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for locating word breaks take into account the alphabets and conventions used by different languages.

Here's an example of a sentence, showing the boundary locations that will be identified by a word break iterator:

TODO:

Line-break Boundary

The line-break iterator locates positions that would be appropriate points to wrap lines when displaying the text. The boundary rules are define here: http://www.unicode.org/reports/tr14/

This example shows the differences in the break locations produced by word and line break iterators

TODO:

Sentence Boundary

A sentence-break iterator locates sentence boundaries according to the rules defined here: http://www.unicode.org/reports/tr29/#Sentence_Boundaries


Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires more than rules over character sequences. ICU provides dictionary support for word boundaries in Thai, Chinese, Japanese and Khmer.

Use of the dictionaries is automatic when text in one of the dictionary languages is encountered. There is no separate API, and no extra programming steps required by applications making use of the dictionaries.


Usage

To locate boundaries in a document, create a BreakIterator using the BreakIterator::create***Instance family of methods in C++, or the ubrk_open() function (C). "***" is Character, Word, Line or Sentence, depending on the type of iterator wanted. These factory methods also take a parameter that specifies the locale for the language of the text to be processed.

When creating a BreakIterator, a locale is also specified, and the behavior of the BreakIterator obtained may be specialized in some way for that locale. For ICU 2.6, Break Iterators for the Thai locale will make use of a Thai dictionary for finding word and line boundaries; all other locales will use the default boundary rules.

Applications also may register customized BreakIterators for use in specific locales. Once such a break iterator has been registered, any requests for break iterators for the
locale will return copies of the registered break iterator.

ICU may cache service instances. Therefore, registration should be done during startup, before opening services by locale ID.

In the general-usage-model, applications will use the following basic steps to analyze a piece of text for boundaries:

  1. Create a BreakIterator with the desired behavior

  2. Use the setText() or adoptText() methods to set the iterator to analyze a particular piece of text. Since BreakIterator uses a CharacterIterator to access the text, it can be stored in any form as long as you provide an appropriate CharacterIterator. There is a convenience method for analyzing a UnicodeString, but the user also can analyze part of a UnicodeString by creating a StringCharacterIterator directly.

  3. Locate the desired boundaries using the appropriate combination of first(), last(), next(), previous(), preceding(), and following() methods.

The setText() or the adoptText() method can be called more than once, allowing a single BreakIterator to be reused to analyze different pieces of text. Because the creation of a BreakIterator can be relatively time-consuming, it makes good sense to cache and reuse BreakIterators within an application.

Set the text to be searched using the following:

  1. adoptText(CharacterIterator) sets the BreakIterator to analyze a new piece of text. The new piece of text is specified with a CharacterIterator, which allows BreakIterator to analyze the text for boundaries no matter how it happens to be stored [it always accesses the text through the CharacterIterator]. The BreakIterator takes ownership of the CharacterIterator and will delete it when the process is completed.

  2. setText(UnicodeString) is a shortcut for the adoptText() method. If the text is a UnicodeString, the user can call setText and pass it the string, rather than creating a StringCharacterIterator and passing it to the adoptText() method. This method will create the StringCharacterIterator. To analyze only part of a UnicodeString, on the other hand, create the StringCharacterIterator manually, specify the substring, and then pass it to the adoptText() method.

  3. getText() method returns a const reference to the CharacterIterator that the BreakIterator is using to access the text.

  4. createText() method returns a clone of the CharacterIterator that the BreakIterator is using to access the text. Ownership of the clone is transferred to the caller. (The caller can seek the returned CharacterIterator without affecting the BreakIterator, but if the actual text underlying the iterator is changed, the adoptText() method must be called again to make sure the BreakIterator does not malfunction.)

The iterator always points to a boundary position between two characters. The numerical value of the position, as returned by current() is the zero-based index of the character following the boundary. Thus a position of zero represents a boundary preceding the first character of the text, and a position of one represents a boundary between the first and
second characters.

The first() and last() methods reset the iterator's current position to the beginning or end of the text (the beginning and the end are always considered boundaries). The next() and previous() methods advance the iterator one boundary forward or backward from the current position. If the next() or previous() methods run off the beginning or end of the text, it returns DONE. The current() method returns the current position.

The following() and preceding() methods are used for random access or to reposition the iterator to some arbitrary spot in the middle of the text. Since a BreakIterator always points to a boundary position, the following() and preceding() methods will never set the iterator to point to the position specified by the caller (even if it is, in fact, a boundary position). BreakIterator will, however, set the iterator to the nearest boundary position before or after the specified position. The isBoundary() method returns true or false, based on whether or not the specified position is a boundary position. It does this by calling the preceding() and next() methods, so it also repositions the iterator either at the specified position or the first boundary position after it. If any of these functions is passed an out-or-range offset, it returns DONE and repositions the iterator to the beginning or end of the text.

Reuse

It is slow and wasteful to repeatedly create and destroy a BreakIterator when it is not necessary. For example, do not create a separate BreakIterator for each line in a document that is being word-wrapped. Keep around a single instance of a line BreakIterator and use it whenever a line break iterator is needed.

Accuracy

ICU's break iterators implement the default boundary rules described in the Unicode Consortium Technical Reports 14 and 29 . These are relatively simple boundary rules that can be implemented efficiently, and are sufficient for many purposes and languages. However, some languages and applications will require a more sophisticated linguistic analysis of the text in order to find boundaries with good accuracy. Such an analysis is not directly available from ICU at this time.

Break Iterators based on custom, user-supplied boundary rules can be created and used by applications with requirements that are not met by the standard default boundary rules.

BreakIterator Boundary Analysis Examples

Print out all the word-boundary positions in a UnicodeString:

In C++:

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);


    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}

In C:

void listWordBoundaries(const UChar* s,
                        int32_t len) {
    UBreakIterator* bi;
    int32_t p;
    UErrorCode err = U_ZERO_ERROR;

    bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
    if (U_FAILURE(err)) return;
    p = ubrk_first(bi);
    while (p != UBRK_DONE) {
        printf("Boundary at position %d\n", p);
        p = ubrk_next(bi);
    }
    ubrk_close(bi);
}

Get the boundaries of the word that contains a double-click position:

In C++:

void wordContaining(BreakIterator& wordBrk,
                    int32_t idx,
                    const UnicodeString& s,
                    int32_t& start,
                    int32_t& end) {
    // this function is written to assume that we have an
    // appropriate BreakIterator stored in an object or a
    // global variable somewhere-- When possible, programmers
    // should avoid having the create() and delete calls in
    // a function of this nature.
    if (s.isEmpty())
        return;
    wordBrk.setText(s);

    start = wordBrk.preceding(idx + 1);
    end = wordBrk.next();
    // NOTE: for this and similar operations, use preceding() and next()
    // as shown here, not following() and previous(). preceding() is
    // faster than following() and next() is faster than previous()

    // NOTE: By using preceding(idx + 1) above, we're adopting the convention
    // that if the double-click comes right on top of a word boundary, it
    // selects the word that _begins_ on that boundary (preceding(idx) would
    // instead select the word that _ends_ on that boundary).
}

In C:

void wordContaining(UBreakIterator* wordBrk,
                    int32_t idx,

                    const UChar* s,
                    int32_t sLen,
                    int32_t* start,
                    int32_t* end,
                    UErrorCode* err) {
    if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return;
    }
    ubrk_setText(wordBrk, s, sLen, err);
    if (U_SUCCESS(*err)) {
        *start = ubrk_preceding(wordBrk, idx + 1);
        *end = ubrk_next(wordBrk);
    }
}


Check for Whole Words

Use the following to check if a range of text is a "whole word":

In C++:

UBool isWholeWord(BreakIterator& wordBrk,
                   const UnicodeString& s,
                   int32_t start,
                   int32_t end) {
    if (s.isEmpty())
        return FALSE;

    wordBrk.setText(s);
    if (!wordBrk.isBoundary(start))
        return FALSE;

    return wordBrk.isBoundary(end); }

In C:

UBool isWholeWord(UBreakIterator* wordBrk,
                   const UChar* s,
                   int32_t sLen,
                   int32_t start,
                   int32_t end,
                   UErrorCode* err) {
    UBool result = FALSE;

    if (wordBrk == NULL || s == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return FALSE;
    }
    ubrk_setText(wordBrk, s, sLen, err);

    if (U_SUCCESS(*err)) {
        result = ubrk_isBoundary(wordBrk, start)
                 >> ubrk_isBoundary(wordBrk, end);
    }
    return result;
}

Although users can check for "whole words" using these methods, it is possible to get
better performance (in most cases) with the following algorithm:

bool isWholeWord(BreakIterator *wordBrk,
                   const UnicodeString& s,
                   int32_t start,
                   int32_t end) {
    wordBrk->setText(s);
    if (!wordBrk->isBoundary(start))
        return false;

    int32_t p = wordBrk->current();
    while (p < end)
        p = wordBrk->next();

    return p == end;
}

This algorithm is faster because the next() method is the fastest boundary-detection method in BreakIterator. The following() and isBoundary() method [while it calls following()] is the slowest. Two calls to the isBoundary() method is faster only when the selection range is long and comprises more than roughly four words.

Count the words in a document (C++ only):

int32_t containsLetters(RuleBasedBreakIterator& bi,
                        const UnicodeString& s,
                        int32_t start) {
    bi.setText(s);
    int32_t count = 0;
    while (start != BreakIterator::DONE) {
        int breakType = bi.getRuleStatus();
        if (breakType != UBRK_WORD_NONE) {
            // Exclude spaces, punctuation, and the like.
            //   A status value UBRK_WORD_NONE indicates that the boundary does
            //   not start a word or number.
            //   
            ++count;
        }
        start = bi.next();
    }
    return count;
}

The function getRuleStatus() returns an enum giving additional information on the text preceding the last break position found. Using this value, it is possible to distinguish between numbers, words, words containing kana characters, words containing ideographic characters, and non-word characters, such as spaces or punctuation. The sample uses the break status value to filter out, and not count, boundaries associated with non-word characters.

Word-wrap a document (C++ only):

The sample function below wraps a paragraph so that each line is less than or equal to 72 characters. The function fills in an array passed in by the caller with the starting offsets of
each line in the document. Also, it fills in a second array to track how many trailing white space characters there are in the line. For simplicity, it is assumed that an outside process has already broken the document into paragraphs. For example, it is assumed that every string the function is passed has a single newline at the end only.

int32_t wrapParagraph(const UnicodeString& s,
                   const Locale& locale,
                   int32_t lineStarts[],
                   int32_t trailingwhitespace[],
                   int32_t maxLines,
                   UErrorCode &status) {

    int32_t        numLines = 0;
    int32_t        p, q;
    const int32_t MAX_CHARS_PER_LINE = 72;
    UChar          c;

    BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);


    p = 0;
    while (p < s.length()) {
        // jump ahead in the paragraph by the maximum number of
        // characters that will fit
        q = p + MAX_CHARS_PER_LINE;

        // if this puts us on a white space character, a control character
        // (which includes newlines), or a non-spacing mark, seek forward
        // and stop on the next character that is not any of these things
        // since none of these characters will be visible at the end of a
        // line, we can ignore them for the purposes of figuring out how
        // many characters will fit on the line)
        if (q < s.length()) {
            c = s[q];
            while (q < s.length()
                   && (u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK
            )) {
                ++q;
                c = s[q];
            }
        }

        // then locate the last legal line-break decision at or before
        // the current position ("at or before" is what causes the "+ 1")
        q = bi->preceding(q + 1);

        // if this causes us to wind back to where we started, then the
        // line has no legal line-break positions. Break the line at
        // the maximum number of characters
        if (q == p) {
            p += MAX_CHARS_PER_LINE;
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            ++numLines;
        }
        // otherwise, we got a good line-break position. Record the start of this
        // line (p) and then seek back from the end of this line (q) until you find
        // a non-white space character (same criteria as above) and
        // record the number of white space characters at the end of the
        // line in the other results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;

            for (q--; q > p; q--) {
                c = s[q];
                if (!(u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK)) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q -1;
            p = nextLineStart;
           ++numLines;
        }
        if (numLines >= maxLines) {
            break;
        }
    }
    delete bi;
    return numLines;
}

Most text editors would not break lines based on the number of characters on a line. Even with a monospaced font, there are still many Unicode characters that are not displayed and therefore should be filtered out of the calculation. With a proportional font, character widths are added up until a maximum line width is exceeded or an end of the paragraph marker is reached.

Trailing white space does not need to be counted in the line-width measurement because it does not need to be displayed at the end of a line. The sample code above returns an array of trailing white space values because an external rendering process needs to be able to measure the length of the line (without the trailing white space) to justify the lines. For example, if the text is right-justified, the invisible white space would be drawn outside the margin. The line would actually end with the last visible character.

In either case, the basic principle is to jump ahead in the text to the location where the line would break (without taking word breaks into account). Then, move backwards using the preceding() method to find the last legal breaking position before that location. Iterating straight through the text with next() method will generally be slower.

ICU BreakIterator Data Files

The source code for the ICU break rules for the standard boundary types is located in the directory icu/source/data/brkitr. These files will be built, and the corresponding binary state tables incorporated into ICU's data, by the standard ICU4C build process.

Beginning with ICU 3.0, the same break rule source files and compiled state tables are used for both ICU4C and ICU4J. The state tables are built using ICU4C, and the resulting binary tables are incorporated into ICU4J.

RBBI Rules

ICU locates boundary positions within text by means of rules, which are a variant form of regular expressions. A boundary rule is an expression that matches a section of text - a word or sentence or whatever - that should remain together, with boundaries occurring
between the ranges of matched text. A set of rules consists of a series of regular expressions separated by semicolons; the rules, taken together, define regions of text that are kept together between boundaries. Boundaries occur at the end of text ranges matched by the rules.

Forward, Reverse, Safe Point rules

For each type of boundary, four sets of rules are required, as described in the following table.

Forward Advance (match text) starting from a boundary position and continuing to the next following boundary.
Reverse Starting from a boundary, match backwards, until the preceding boundary position.
Safe Forward Starting from any arbitrary position in the text, match forward to a safe position, which is a position from which the normal Reverse rule will work correctly.
Safe Reverse Starting from any arbitrary position in the text, move backwards to a safe position, which is a position from which the normal Forward rule will work correctly.

All four rules need to be supplied.

Normal next() or previous() operations use the Forward or Reverse rules, respectively, to move directly from one boundary position to another.

The preceding() and following() functions first apply a safe rule, then apply a normal Forward or Reverse rule. The preceding() and following() functions can start from any arbitrary location in the input text.

Note Note: Earlier versions of ICU (prior to 3.0) worked with only a Forward rule and a safe Reverse rule. While the rule builder will still recognize rules written in this form, their use is deprecated and strongly discouraged.

A rule input file is divided into sections, one for each type of rule:

# This shows the general layout of a break rule file
#
# The order of the four sections doesn't matter, so long as they all appear.
#
# Variable definitions can appear anywhere, so long as they are defined before
# their first use in a rule. Variable definitions carry forward across section # boundaries.
#

!!forward;
#   forward rules go here.

!!reverse;
#    Reverse rules go here.

!!safe_forward;
#    Safe Forward rules go here.


!!safe_reverse;
#    Safe Reverse rules go here.

Variables

A set of break rules may define and use variables, which are convenient when subexpressions reappear more than once, or to simplify complex expressions by allowing parts to be separately defined and named. Use of variables within a set of rules has no effect on the efficiency of the resulting break iterator.

!!chain

ICU boundary rules can be written in two ways: chained or non-chained.

With non-chained rules, each rule (regular expression) stands by itself, matching a segment of text between two boundary positions. When moving to the next boundary, the single rule with the longest match defines the boundary position.

This is very much like traditional regular expression behavior.

Non-chained rule matching behavior is the default for ICU break rules.

Chaining allows boundary positions to be determined by an arbitrary number of the boundary rules, applied in an arbitrary sequence. Any character in the text that completes a match for one rule can function as a chaining point, and simultaneously be the beginning character of a match for any other rule. Matching continues in this way until the longest possible match is obtained.

Chaining from one rule to the next can occur at any point that the first rule of the pair matches. The longest match of each individual rule is not required, and if chaining from a shorter match of an intermediate rule results in a longer overall match, that is what will happen.

Chained rules are closer in flavor to the rules definitions in the Unicode Consortium text boundary specifications. Line Break boundaries, in particular, were not really possible to implement accurately with traditional, non-chained regular expression.

!!chain; in a rule file enables rule chaining. !!chain applies to all rule sections, and must appear before the first section.

The !!LBCMNoChain; statement modifies chaining behavior by preventing chaining from one rule to another from occurring on any character whose Line Break property is Combining Mark. This option is subject to change or removal, and should not be used in general. Within ICU, it is used only with the line break rules. We hope to replace it with something more general.


Rule Status Values

Break rules can be tagged with a number, which is called the rule status. After a boundary has been located, the status number of the specific rule that determined the boundary position is available to the application through the function getRuleStatus().

For the predefined word boundary rules, status values are available to distinguish between boundaries associated with words, numbers, and those around spaces or punctuation. Similarly for line break boundaries, status values distinguish between mandatory line endings (new line characters) and break opportunities that are appropriate points for line wrapping. Refer to the ICU API documentation for the C header file ubrk.h or to Java class RuleBasedBreakIterator for a complete list of the predefined boundary classifications.

When creating custom sets of break rules, integer status values can be associated with boundary rules in whatever way will be convenient for the application. There is no need to remain restricted to the predefined values and classifications from the standard rules.

It is possible for a set of break rules to contain more than a single rule that produces some boundary in an input text. In this event, getRuleStatus() will return the numerically largest status value from the matching rules, and the alternate function getRuleStatusVec() will return a vector of the values from all of the matching rules.

In the source form of the break rules, status numbers appear at end of a rule, and are enclosed in {braces}.

Rule Syntax

Here is the syntax for the boundary rules.


Rule Name Rule Values Notes
rules statement+
statement assignment | rule | control
control (“!!forward” | “!!reverse” | “!!safe_forward” | “!!safe_reverse” | “!!chain” | “!!LBCMNoChain”) ';'
assignment variable '=' expr ';' 5
rule '!'? expr ('{'number'}')? ';' 8, 9
number [0-9]+ 1
break-point '/'
expr expr-q | expr '|' expr | expr expr 3
expr-q term | term '*' | term '?' | term '+'
term rule-char | unicode-set | variable | quoted-sequence | '(' expr ')' | break-point
rule-special any printing ascii character except letters or numbers | white-space
rule-char any non-escaped character that is not rule-special | '.' | any escaped character except '\p' or '\P'
variable '$' name-start-char name-char* 7
name-start-char '_' | \p{L}
name-char name-start-char | \p{N}
quoted-sequence ''' (any char except single quote or line terminator or two adjacent single quotes)+ '''
escaped-char See “Character Quoting and Escaping” in the UnicodeSet chapter

Unicode set See UnicodeSet 4
comment unescaped '#' [any char except new-line]* new-line 2
s unescaped \p{Z}, tab, LF, FF, CR, NEL 6
new-line LF, CR, NEL 2

Notes:

  1. The number associated with a rule that actually determined a break position is available to the application after the break has been returned. These numbers are not Perl regular expression repeat counts.

  2. Comments are recognized and removed separately from otherwise parsing the rules. They may appear wherever a space would be allowed (and ignored.)

  3. The implicit concatenation of adjacent terms has higher precedence than the '|' operation. "ab|cd" is interpreted as "(ab)|(cd)", not as "a(b|c)d" or "(((ab)|c)d)"

  4. The syntax for UnicodeSet is defined (and parsed) by the UnicodeSet class. It is not repeated here.

  5. For $variables that will be referenced from inside of a UnicodeSet, the definition must consist only of a Unicode Set. For example, when variable $a is used in a rule like [$a$b$c], then this definition of $a is ok “$a=[:Lu:];” while this one “$a=abcd;” would cause an error when $a was used.

  6. Spaces are allowed nearly anywhere, and are not significant unless escaped. Exceptions to this are noted.

  7. No spaces are allowed within a variable name. The variable name $dictionary is special. If defined, it must be a Unicode Set, the characters of which will trigger the use of word dictionary based boundaries.

  8. A leading '!' on a rule is a deprecated syntax for specifying a reverse rule. Putting reverse rules in the !!reverse section is now preferred.

  9. {nnn} appearing at the end of a rule is a Rule Status number, not a repeat count as it would be with conventional regular expression syntax.

EBNF Syntax used for the RBBI rules syntax description

a? zero or one instance of a
a+ one or more instances of a
a* zero or more instances of a
a | b either a or b, but not both
'a' "a" the literal string between the quotes

Additional Sample Code

C/C++: See icu/source/samples/break/ in the ICU source distribution for code samples showing the use of ICU boundary analysis.

Details about Dictionary-Based Break Iteration

(This section originally from August 2012.)

Certain Unicode characters have a "dictionary" bit set in the break iteration rules, and text made up of these characters cannot be handled by the rules-based break iteration code for lines or words. Rather, they must be handled by a dictionary-based approach. The ICU approach is as follows:

Once the Dictionary bit is detected, the set of characters with that bit is handed off to "dictionary code." This code then inspects the characters more carefully, and splits them by script (Thai, Khmer, Chinese, Japanese, Korean). If text in this script has not yet been handled, it loads the appropriate dictionary from disk, and initializes a specialized "BreakEngine" class for that script.

There are three such specialized classes: Thai, Khmer and CJK.

Thai and Khmer use very similar approaches. They look through a dictionary that is not weighted by word frequency, and attempt to find the longest total "match" that can be made in the text.

For Chinese and Japanese text, on the other hand, we have a unified dictionary (due to the fact that both use some of the same characters, it is difficult to distinguish them) that contains information about word frequencies. The algorithm to match text then uses dynamic programming to find the set of breaks it considers "most likely" based on the frequency of the words created by the breaks. This algorithm could also be used for Thai and Khmer, but we do not have sufficient data to do so. This algorithm could also be used for Korean, but once again we do not have the data to do so.

Code of interest is in source/common/dictbe.{h, cpp}, source/common/brkeng.{h, cpp}, source/common/dictionarydata.{h, cpp}. The dictionaries use the BytesTrie and  UCharsTrie as their data store. The binary form of these dictionaries is produced by the gendict tool, which has source in source/tools/gendict.

In order to add new dictionary implementations, a few changes have to be made. First, you should create a new subclass of DictionaryBreakEngine or LanguageBreakEngine in dictbe.cpp that implements your algorithm. Then, in brkeng.cpp, you should add logic to create this dictionary break engine if we strike the appropriate script - which should only be 3 or so lines of code at the most. Lastly, you should add the correct data file. If your data is to be represented as a .dict file - as is recommended, and in fact required if you don't want to make substantial code changes to the engine loader - you need to simply add a file in the correct format for gendict to the source/data/brkitr directory, and add its name to the list of BRK_DICT_SOURCE in source/data/brkitr/brkfiles.mk. This will cause your dictionary (say, foo.txt) to be added as a UCharsTrie dictionary with the name foo.dict. If you want your dictionary to be a BytesTrie dictionary, you will need to specify a transform within the Makefile. To do so, find the part of source/data/Makefile.in and source/data/makedata.mak that deals with thaidict.dict and khmerdict.dict and add a similar set of lines for your script. Lastly, in source/data/brkitr/root.txt, add a line to the dictionaries {} section of the form:

shortscriptname:process(dependency){“dictionaryname.dict”}

(for example, for Katakana: Kata:process(dependency){“cjdict.dict”} )

Make sure to add appropriate tests for the new implementation.

Comments