Boundary Analysis

Overview of Text Boundary Analysis

Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:

  1. Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.

  2. Locating the beginning of a word that the user has selected.

  3. Counting characters, words, sentences, or paragraphs.

  4. Determining how far to move the text cursor when the user hits an arrow key (Some characters require more than one position in the text store and some characters in the text store do not display at all).

  5. Making a list of the unique words in a document.

  6. Figuring out if a given range of text contains only whole words.

  7. Capitalizing the first letter of each word.

  8. Locating a particular unit of the text (For example, finding the third word in the document).

The BreakIterator classes were designed to support these kinds of tasks. The BreakIterator objects maintain a location between two characters in the text. This location will always be a text boundary. Clients can move the location forward to the next boundary or backward to the previous boundary. Clients can also check if a particular location within a source text is on a boundary or find the boundary which is before or after a particular location.

Four Types of BreakIterator

ICU BreakIterators can be used to locate the following kinds of text boundaries:

  1. Character Boundary

  2. Word Boundary

  3. Line-break Boundary

  4. Sentence Boundary

Each type of boundary is found in accordance with the rules specified by Unicode Standard Annex #29, Unicode Text Segmentation (https://unicode.org/reports/tr29/ ) or Unicode Standard Annex #14, Unicode Line Breaking Algorithm (https://unicode.org/reports/tr14/)


Character Boundary

The character-boundary iterator locates the boundaries according to the rules defined in http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries. These boundaries try to match what a user would think of as a "character"—a basic unit of a writing system for a language—which may be more than just a single Unicode code point. 

The letter Ä, for example, can be represented in Unicode either with a single code-point value or with two code-point values (one representing the A and another representing the umlaut). The character-boundary iterator will treat either representation as a single character.

End-user characters, as described above, are also called grapheme clusters, in an attempt to limit the confusion caused by multiple meanings for the word "character".

Word Boundary

The word-boundary iterator locates the boundaries of words, for purposes such as double click selection or "Find whole words" operations.

Words boundaries are identified according to the rules in https://www.unicode.org/reports/tr29/#Word_Boundaries, supplemented by a word dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for locating word breaks take into account the alphabets and conventions used by different languages.

Here's an example of a sentence, showing the boundary locations that will be identified by a word break iterator:

TODO:

Line-break Boundary

The line-break iterator locates positions that would be appropriate points to wrap lines when displaying the text. The boundary rules are define here: https://www.unicode.org/reports/tr14/

This example shows the differences in the break locations produced by word and line break iterators

TODO:

Sentence Boundary

A sentence-break iterator locates sentence boundaries according to the rules defined here: https://www.unicode.org/reports/tr29/#Sentence_Boundaries


Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires more than rules over character sequences. ICU provides dictionary support for word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese. 

Use of the dictionaries is automatic when text in one of the dictionary languages is encountered. There is no separate API, and no extra programming steps required by applications making use of the dictionaries.


Usage

To locate boundaries in a document, create a BreakIterator using the BreakIterator::create***Instance family of methods in C++, or the ubrk_open() function (C). "***" is Character, Word, Line or Sentence, depending on the type of iterator wanted. These factory methods also take a parameter that specifies the locale for the language of the text to be processed.

When creating a BreakIterator, a locale is also specified, and the behavior of the BreakIterator obtained may be specialized in some way for that locale. For most locales the default break iterator behavior is used. 

Applications also may register customized BreakIterators for use in specific locales. Once such a break iterator has been registered, any requests for break iterators for that locale will return copies of the registered break iterator.

ICU may cache service instances. Therefore, registration should be done during startup, before opening services by locale ID.

In the general-usage-model, applications will use the following basic steps to analyze a piece of text for boundaries:

  1. Create a BreakIterator with the desired behavior

  2. Use the setText() method to set the iterator to analyze a particular piece of text.

  3. Locate the desired boundaries using the appropriate combination of first(), last(), next(), previous(), preceding(), and following() methods.

The setText() method can be called more than once, allowing reuse of a  BreakIterator  on new pieces of text. Because the creation of a BreakIterator can be relatively time-consuming, it makes good sense to reuse them when practical.

The iterator always points to a boundary position between two characters. The numerical value of the position, as returned by current() is the zero-based index of the character following the boundary. Thus a position of zero represents a boundary preceding the first character of the text, and a position of one represents a boundary between the first and second characters.

The first() and last() methods reset the iterator's current position to the beginning or end of the text (the beginning and the end are always considered boundaries). The next() and previous() methods advance the iterator one boundary forward or backward from the current position. If the next() or previous() methods run off the beginning or end of the text, it returns DONE. The current() method returns the current position.

The following() and preceding() methods are used for random access, to move the iterator to an arbitrary position within the text. Since a BreakIterator always points to a boundary position, the following() and preceding() methods will never set the iterator to point to the position specified by the caller (even if it is, in fact, a boundary position). BreakIterator will, however, set the iterator to the nearest boundary position before or after the specified position.

isBoundary() returns true if the specified position is a boundary. 


Thread Safety

Break Iterators are not thread safe. This is inherit in their design - break iterators are stateful, holding a reference to and position in the text, meaning that a single instance cannot operate in parallel on multiple texts.

For concurrent break iteration, each thread must use its own break iterator. These can be obtained by creating separate break iterators of the desired type, or by initially creating a master break iterator and then creating a clone for each thread.

Line Breaking Strictness, a CSS Property

CSS has the concept of "Line Breaking Strictness". This property specifies the strictness of line-breaking rules applied within an element: especially how wrapping interacts with punctuation and symbols.  ICU line break iterators can choose a strictness using locale tags

 en@lb=strict
 ja@lb=strict

 Breaks text using the most stringent set of line-breaking rules.
 en@lb=normal 
 ja@lb=normal
 
 Breaks text using the most common set of line-breaking rules.
 en@lb=loose
 ja@lb=loose

 Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers.


Sentence Break Filters

Sentence breaking can return false positives - an indication that sentence ends in an incorrect position - in the presence of abbreviations. For example, consider the sentence
"In the meantime Mr. Weston arrived with his small ship." Default sentence break shows a false boundary following the "Mr."

ICU includes lists of common abbreviations that can be used to filter, to ignore, these false sentence boundaries. Filtering is enabled by the presence of the "ss" locale tag when creating the break iterator.

 Locale Behavior
 en no filtering
 en@ss=standard Filter based on common English language abbreviations.
 es@ss=standard Filter with common Spanish abbreviations.

Abbreviation lists are available (as of ICU 64) for English, German, Spanish, French, Italian and Portuguese.

Accuracy

ICU's break iterators are based on the default boundary rules described in the Unicode Standard Annexes 14 and 29 . These are relatively simple boundary rules that can be implemented efficiently, and are sufficient for many purposes and languages. However, some languages and applications will require a more sophisticated linguistic analysis of the text in order to find boundaries with good accuracy. Such an analysis is not directly available from ICU at this time.

Break Iterators based on custom, user-supplied boundary rules can be created and used by applications with requirements that are not met by the standard default boundary rules.

BreakIterator Boundary Analysis Examples

Print out all the word-boundary positions in a UnicodeString:

In C++:

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);


    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}

In C:

void listWordBoundaries(const UChar* s,
                        int32_t len) {
    UBreakIterator* bi;
    int32_t p;
    UErrorCode err = U_ZERO_ERROR;

    bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
    if (U_FAILURE(err)) return;
    p = ubrk_first(bi);
    while (p != UBRK_DONE) {
        printf("Boundary at position %d\n", p);
        p = ubrk_next(bi);
    }
    ubrk_close(bi);
}

Get the boundaries of the word that contains a double-click position:

In C++:

void wordContaining(BreakIterator& wordBrk,
                    int32_t idx,
                    const UnicodeString& s,
                    int32_t& start,
                    int32_t& end) {
    // this function is written to assume that we have an
    // appropriate BreakIterator stored in an object or a
    // global variable somewhere-- When possible, programmers
    // should avoid having the create() and delete calls in
    // a function of this nature.
    if (s.isEmpty())
        return;
    wordBrk.setText(s);

    start = wordBrk.preceding(idx + 1);
    end = wordBrk.next();
    // NOTE: for this and similar operations, use preceding() and next()
    // as shown here, not following() and previous(). preceding() is
    // faster than following() and next() is faster than previous()

    // NOTE: By using preceding(idx + 1) above, we're adopting the convention
    // that if the double-click comes right on top of a word boundary, it
    // selects the word that _begins_ on that boundary (preceding(idx) would
    // instead select the word that _ends_ on that boundary).
}

In C:

void wordContaining(UBreakIterator* wordBrk,
                    int32_t idx,

                    const UChar* s,
                    int32_t sLen,
                    int32_t* start,
                    int32_t* end,
                    UErrorCode* err) {
    if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return;
    }
    ubrk_setText(wordBrk, s, sLen, err);
    if (U_SUCCESS(*err)) {
        *start = ubrk_preceding(wordBrk, idx + 1);
        *end = ubrk_next(wordBrk);
    }
}


Check for Whole Words

Use the following to check if a range of text is a "whole word":

In C++:

UBool isWholeWord(BreakIterator& wordBrk,
                   const UnicodeString& s,
                   int32_t start,
                   int32_t end) {
    if (s.isEmpty())
        return FALSE;

    wordBrk.setText(s);
    if (!wordBrk.isBoundary(start))
        return FALSE;

    return wordBrk.isBoundary(end); }

In C:

UBool isWholeWord(UBreakIterator* wordBrk,
                   const UChar* s,
                   int32_t sLen,
                   int32_t start,
                   int32_t end,
                   UErrorCode* err) {
    UBool result = FALSE;

    if (wordBrk == NULL || s == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return FALSE;
    }
    ubrk_setText(wordBrk, s, sLen, err);

    if (U_SUCCESS(*err)) {
        result = ubrk_isBoundary(wordBrk, start)
                 >> ubrk_isBoundary(wordBrk, end);
    }
    return result;
}


Count the words in a document (C++ only):

int32_t containsLetters(RuleBasedBreakIterator& bi,
                        const UnicodeString& s,
                        int32_t start) {
    bi.setText(s);
    int32_t count = 0;
    while (start != BreakIterator::DONE) {
        int breakType = bi.getRuleStatus();
        if (breakType != UBRK_WORD_NONE) {
            // Exclude spaces, punctuation, and the like.
            //   A status value UBRK_WORD_NONE indicates that the boundary does
            //   not start a word or number.
            //   
            ++count;
        }
        start = bi.next();
    }
    return count;
}

The function getRuleStatus() returns an enum giving additional information on the text preceding the last break position found. Using this value, it is possible to distinguish between numbers, words, words containing kana characters, words containing ideographic characters, and non-word characters, such as spaces or punctuation. The sample uses the break status value to filter out, and not count, boundaries associated with non-word characters.

Word-wrap a document (C++ only):

The sample function below wraps a paragraph so that each line is less than or equal to 72 characters. The function fills in an array passed in by the caller with the starting offsets of
each line in the document. Also, it fills in a second array to track how many trailing white space characters there are in the line. For simplicity, it is assumed that an outside process has already broken the document into paragraphs. For example, it is assumed that every string the function is passed has a single newline at the end only.

int32_t wrapParagraph(const UnicodeString& s,
                   const Locale& locale,
                   int32_t lineStarts[],
                   int32_t trailingwhitespace[],
                   int32_t maxLines,
                   UErrorCode &status) {

    int32_t        numLines = 0;
    int32_t        p, q;
    const int32_t MAX_CHARS_PER_LINE = 72;
    UChar          c;

    BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);


    p = 0;
    while (p < s.length()) {
        // jump ahead in the paragraph by the maximum number of
        // characters that will fit
        q = p + MAX_CHARS_PER_LINE;

        // if this puts us on a white space character, a control character
        // (which includes newlines), or a non-spacing mark, seek forward
        // and stop on the next character that is not any of these things
        // since none of these characters will be visible at the end of a
        // line, we can ignore them for the purposes of figuring out how
        // many characters will fit on the line)
        if (q < s.length()) {
            c = s[q];
            while (q < s.length()
                   && (u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK
            )) {
                ++q;
                c = s[q];
            }
        }

        // then locate the last legal line-break decision at or before
        // the current position ("at or before" is what causes the "+ 1")
        q = bi->preceding(q + 1);

        // if this causes us to wind back to where we started, then the
        // line has no legal line-break positions. Break the line at
        // the maximum number of characters
        if (q == p) {
            p += MAX_CHARS_PER_LINE;
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            ++numLines;
        }
        // otherwise, we got a good line-break position. Record the start of this
        // line (p) and then seek back from the end of this line (q) until you find
        // a non-white space character (same criteria as above) and
        // record the number of white space characters at the end of the
        // line in the other results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;

            for (q--; q > p; q--) {
                c = s[q];
                if (!(u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK)) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q -1;
            p = nextLineStart;
           ++numLines;
        }
        if (numLines >= maxLines) {
            break;
        }
    }
    delete bi;
    return numLines;
}

Most text editors would not break lines based on the number of characters on a line. Even with a monospaced font, there are still many Unicode characters that are not displayed and therefore should be filtered out of the calculation. With a proportional font, character widths are added up until a maximum line width is exceeded or an end of the paragraph marker is reached.

Trailing white space does not need to be counted in the line-width measurement because it does not need to be displayed at the end of a line. The sample code above returns an array of trailing white space values because an external rendering process needs to be able to measure the length of the line (without the trailing white space) to justify the lines. For example, if the text is right-justified, the invisible white space would be drawn outside the margin. The line would actually end with the last visible character.

In either case, the basic principle is to jump ahead in the text to the location where the line would break (without taking word breaks into account). Then, move backwards using the preceding() method to find the last legal breaking position before that location. Iterating straight through the text with next() method will generally be slower.

ICU BreakIterator Data Files

The source code for the ICU break rules for the standard boundary types is located in the directory icu4c/source/data/brkitr/rules. These files will be built, and the corresponding binary state tables incorporated into ICU's data, by the standard ICU4C build process.

The dictionary word lists used by word break, and for some languages, line break are in icu4c/source/data/brkitr/dictionaries.

The same data is used by both ICU4C and ICU4J. In the normal ICU build process, the source data is processed into a binary form using ICU4C, and the resulting binary tables are incorporated into ICU4J.


Subpages (1): Break Rules
Comments