Overview of Text Boundary Analysis
Text boundary
analysis is the process of locating linguistic boundaries while
formatting and handling text. Examples of this process include:
-
Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
-
Locating the beginning of a word that the user has selected.
-
Counting characters, words, sentences, or paragraphs.
-
Determining how far to move the text cursor when the user hits
an arrow key (Some characters require more than one position in the
text store and some characters in the text store do not display at all).
-
Making a list of the unique words in a document.
-
Figuring out if a given range of text contains only whole words.
-
Capitalizing the first letter of each word.
-
Locating a particular unit of the text (For example, finding the third word in the document).
The BreakIterator classes were designed to support these kinds of tasks. The BreakIterator
objects maintain a location between two characters in the text. This
location will always be a text boundary. Clients can move the location
forward to the next boundary or backward to the previous boundary.
Clients can also check if a particular location within a source text is
on a boundary or find the boundary which is before or after a
particular location.
Four Types of BreakIterator
ICU BreakIterators can be used to locate the following kinds of text boundaries:
-
Character Boundary
-
Word Boundary
-
Line-break Boundary
-
Sentence Boundary
Each type of boundary is found in accordance with the rules specified by Unicode Standard Annex #29, Text Boundaries (http://www.unicode.org/reports/tr29
)
Character Boundary
The
character-boundary iterator locates the boundaries between
"characters", where "character" is what the end user of an application
would usually expect. For example, the Ä letter can be represented in
Unicode either with a single code-point value or with two code-point
values (one representing the A and another representing the umlaut).
The character-boundary iterator will treat the Ä as a single character
regardless of whether or not it is stored using one code point or two.
In short, the character-boundary iterator is used to identify sequences
that should be treated as single characters from a user's perspective.
End-user
characters, as described above, are also called grapheme clusters, in
an attempt to limit the confusion caused by multiple meanings for the
word "character".
Word Boundary
The word-boundary
iterator locates the boundaries of words, for purposes such as double
click selection or "Find whole words" operations in an editor.
Here's an example of a sentence, showing the boundary locations that will be identified by a word break iterator:
Word boundary locations are found according to these general principles:
-
Words themselves are kept together
-
Numbers are kept together, including any commas, points or currency symbols.
-
Apostrophes or hyphens within a word are kept with the word. They are not broken out separately like most other punctuation
-
Punctuation, spaces and other characters that are not part of a
word or number, are broken out separately, with a boundary before and
after each character.
The rules used for locating word breaks take into account the alphabets and conventions used for different languages.
Locating
word breaks for Thai text presents a special challenge, because there
are no spaces or other identifiable characters separating the words. To
solve the problem of word-breaking Thai text, ICU provides a special
dictionary-based break iterator.
Line-break Boundary
The
line-break iterator locates positions within the text that would be
appropriate points for a text editor to break lines when displaying the
text. Line breaks differ from word breaks in that adjoining punctuation
and trailing white space are kept with the words instead of being
treated as separate "words" on their own (for example, do not wrap a
line before a space).
This example shows the differences in the break locations produced by word and line break iterators
Sentence Boundary
A sentence-break iterator locates sentence boundaries.
The
exact rules used for locating each type of boundary are described in a
pair of documents from the Unicode Consortium. Unicode Standard Annex
14 ( http://www.unicode.org/unicode/reports/tr14/
) gives the rules for locating line boundaries, while technical report 29 "http://www.unicode.org/unicode/reports/tr29/"
) describe character, word and sentence boundaries.
Usage
To locate boundaries in a document, create a BreakIterator using the BreakIterator::create***Instance family of methods in C++, or the ubrk_open() function (C). "***" is Character, Word, Line or Sentence,
depending on the type of iterator wanted. These factory methods also
take a parameter that specifies the locale for the language of the text
to be processed.
When creating a BreakIterator, a locale is also
specified, and the behavior of the BreakIterator obtained may be
specialized in some way for that locale. For ICU 2.6, Break Iterators
for the Thai locale will make use of a Thai dictionary for finding word
and line boundaries; all other locales will use the default boundary
rules.
Applications also may register customized BreakIterators for use in specific locales. Once such a break iterator has been registered, any requests for break iterators for the
locale will return copies of the registered break iterator
In the general-usage-model, applications will use the following basic steps to analyze a piece of text for boundaries:
-
Create a BreakIterator with the desired behavior
-
Use the setText() or adoptText() methods to set the iterator to analyze a particular piece of text. Since BreakIterator
uses a CharacterIterator to access the text, it can be stored in any
form as long as you provide an appropriate CharacterIterator. There is
a convenience method for analyzing a UnicodeString, but the user also can analyze part of a UnicodeString by creating a StringCharacterIterator directly.
-
Locate the desired boundaries using the appropriate combination of first(), last(), next(), previous(), preceding(), and following() methods.
The setText() or the adoptText() method can be called more than once, allowing a single BreakIterator to be reused to analyze different pieces of text. Because the creation of a BreakIterator can be relatively time-consuming, it makes good sense to cache and reuse BreakIterators within an application.
Set the text to be searched using the following:
-
adoptText(CharacterIterator) sets the BreakIterator to analyze a new piece of text. The new piece of text is specified with a CharacterIterator, which allows BreakIterator to analyze the text for boundaries no matter how it happens to be stored [it always accesses the text through the CharacterIterator]. The BreakIterator takes ownership of the CharacterIterator and will delete it when the process is completed.
-
setText(UnicodeString) is a shortcut for the adoptText() method. If the text is a UnicodeString, the user can call setText and pass it the string, rather than creating a StringCharacterIterator and passing it to the adoptText() method. This method will create the StringCharacterIterator. To analyze only part of a UnicodeString, on the other hand, create the StringCharacterIterator manually, specify the substring, and then pass it to the adoptText() method.
-
getText() method returns a const reference to the CharacterIterator that the BreakIterator is using to access the text.
-
createText() method returns a clone of the CharacterIterator that the BreakIterator is using to access the text. Ownership of the clone is transferred to the caller. (The caller can seek the returned CharacterIterator without affecting the BreakIterator, but if the actual text underlying the iterator is changed, the adoptText() method must be called again to make sure the BreakIterator does not malfunction.)
The iterator always points to a boundary position between two characters. The numerical value of the position, as returned by current()
is the zero-based index of the character following the boundary. Thus a
position of zero represents a boundary preceding the first character of
the text, and a position of one represents a boundary between the first
and
second characters.
The first() and last()
methods reset the iterator's current position to the beginning or end
of the text (the beginning and the end are always considered
boundaries). The next() and previous() methods advance the iterator one boundary forward or backward from the current position. If the next() or previous() methods run off the beginning or end of the text, it returns DONE. The current() method returns the current position.
The following() and preceding() methods are used for random access or to reposition the iterator to some arbitrary spot in the middle of the text. Since a BreakIterator always points to a boundary position, the following() and preceding()
methods will never set the iterator to point to the position specified
by the caller (even if it is, in fact, a boundary position).
BreakIterator will, however, set the iterator to the nearest boundary
position before or after the specified position. The isBoundary()
method returns true or false, based on whether or not the specified
position is a boundary position. It does this by calling the preceding() and next()
methods, so it also repositions the iterator either at the specified
position or the first boundary position after it. If any of these
functions is passed an out-or-range offset, it returns DONE and
repositions the iterator to the beginning or end of the text.
Reuse
It is slow and wasteful to repeatedly create and destroy a BreakIterator when it is not necessary. For example, do not create a separate BreakIterator for each line in a document that is being word-wrapped. Keep around a single instance of a line BreakIterator and use it whenever a line break iterator is needed.
Accuracy
ICU's break iterators implement the default boundary rules described in the Unicode Consortium Technical Reports 14
and 29
. These are relatively simple boundary rules that can be implemented
efficiently, and are sufficient for many purposes and languages.
However, some languages and applications will require a more
sophisticated linguistic analysis of the text in order to find
boundaries with good accuracy. Such an analysis is not directly
available from ICU at this time.
Break
Iterators based on custom, user-supplied boundary rules can be created
and used by applications with requirements that are not met by the
standard default boundary rules.
BreakIterator Boundary Analysis Examples
Print out all the word-boundary positions in a UnicodeString:
In C++:
void listWordBoundaries(const UnicodeString& s) { UErrorCode status = U_ZERO_ERROR; BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
bi->setText(s); int32_t p = bi->first(); while (p != BreakIterator::DONE) { printf("Boundary at position %d\n", p); p = bi->next(); } delete bi; }
|
In C:
void listWordBoundaries(const UChar* s, int32_t len) { UBreakIterator* bi; int32_t p; UErrorCode err = U_ZERO_ERROR;
bi = ubrk_open(UBRK_WORD, 0, s, len, &err); if (U_FAILURE(err)) return; p = ubrk_first(bi); while (p != UBRK_DONE) { printf("Boundary at position %d\n", p); p = ubrk_next(bi); } ubrk_close(bi); }
|
Get the boundaries of the word that contains a double-click position:
In C++:
void wordContaining(BreakIterator& wordBrk, int32_t idx, const UnicodeString& s, int32_t& start, int32_t& end) { // this function is written to assume that we have an // appropriate BreakIterator stored in an object or a // global variable somewhere-- When possible, programmers // should avoid having the create() and delete calls in // a function of this nature. if (s.isEmpty()) return; wordBrk.setText(s);
start = wordBrk.preceding(idx + 1); end = wordBrk.next(); // NOTE: for this and similar operations, use preceding() and next() // as shown here, not following() and previous(). preceding() is // faster than following() and next() is faster than previous()
// NOTE: By using preceding(idx + 1) above, we're adopting the convention // that if the double-click comes right on top of a word boundary, it // selects the word that _begins_ on that boundary (preceding(idx) would // instead select the word that _ends_ on that boundary). }
|
In C:
void wordContaining(UBreakIterator* wordBrk, int32_t idx,
const UChar* s, int32_t sLen, int32_t* start, int32_t* end, UErrorCode* err) { if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) { *err = U_ILLEGAL_ARGUMENT_ERROR; return; } ubrk_setText(wordBrk, s, sLen, err); if (U_SUCCESS(*err)) { *start = ubrk_preceding(wordBrk, idx + 1); *end = ubrk_next(wordBrk); } }
|
Check for Whole Words
Use the following to check if a range of text is a "whole word":
In C++:
UBool isWholeWord(BreakIterator& wordBrk, const UnicodeString& s, int32_t start, int32_t end) { if (s.isEmpty()) return FALSE;
wordBrk.setText(s); if (!wordBrk.isBoundary(start)) return FALSE;
return wordBrk.isBoundary(end);
}
|
In C:
UBool isWholeWord(UBreakIterator* wordBrk, const UChar* s, int32_t sLen, int32_t start, int32_t end, UErrorCode* err) { UBool result = FALSE;
if (wordBrk == NULL || s == NULL) { *err = U_ILLEGAL_ARGUMENT_ERROR; return FALSE; } ubrk_setText(wordBrk, s, sLen, err);
if (U_SUCCESS(*err)) { result = ubrk_isBoundary(wordBrk, start) >> ubrk_isBoundary(wordBrk, end); } return result; }
|
Although users can check for "whole words" using these methods, it is possible to get
better performance (in most cases) with the following algorithm:
bool isWholeWord(BreakIterator *wordBrk, const UnicodeString& s, int32_t start, int32_t end) { wordBrk->setText(s); if (!wordBrk->isBoundary(start)) return false;
int32_t p = wordBrk->current(); while (p < end) p = wordBrk->next();
return p == end; }
|
This algorithm is faster because the next() method is the fastest boundary-detection method in BreakIterator. The following() and isBoundary() method [while it calls following()] is the slowest. Two calls to the isBoundary() method is faster only when the selection range is long and comprises more than roughly four words.
Count the words in a document (C++ only):
int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) { bi.setText(s); int32_t count = 0; while (start != BreakIterator::DONE) { int breakType = bi.getRuleStatus(); if (breakType != UBRK_WORD_NONE) { // Exclude spaces, punctuation, and the like. // A status value UBRK_WORD_NONE indicates that the boundary does // not start a word or number. // ++count; } start = bi.next(); } return count; }
|
The function getRuleStatus()
returns an enum giving additional information on the text preceding the
last break position found. Using this value, it is possible to
distinguish between numbers, words, words containing kana characters,
words containing ideographic characters, and non-word characters, such
as spaces or punctuation. The sample uses the break status value to
filter out, and not count, boundaries associated with non-word
characters.
Word-wrap a document (C++ only):
The
sample function below wraps a paragraph so that each line is less than
or equal to 72 characters. The function fills in an array passed in by
the caller with the starting offsets of
each line in the document.
Also, it fills in a second array to track how many trailing white space
characters there are in the line. For simplicity, it is assumed that an
outside process has already broken the document into paragraphs. For
example, it is assumed that every string the function is passed has a
single newline at the end only.
int32_t wrapParagraph(const UnicodeString& s, const Locale& locale, int32_t lineStarts[], int32_t trailingwhitespace[], int32_t maxLines, UErrorCode &status) {
int32_t numLines = 0; int32_t p, q; const int32_t MAX_CHARS_PER_LINE = 72; UChar c;
BreakIterator *bi = BreakIterator::createLineInstance(locale, status); if (U_FAILURE(status)) { delete bi; return 0; } bi->setText(s);
p = 0; while (p < s.length()) { // jump ahead in the paragraph by the maximum number of // characters that will fit q = p + MAX_CHARS_PER_LINE;
// if this puts us on a white space character, a control character // (which includes newlines), or a non-spacing mark, seek forward // and stop on the next character that is not any of these things // since none of these characters will be visible at the end of a // line, we can ignore them for the purposes of figuring out how // many characters will fit on the line) if (q < s.length()) { c = s[q]; while (q < s.length() && (u_isspace(c) || u_charType(c) == U_CONTROL_CHAR || u_charType(c) == U_NON_SPACING_MARK )) { ++q; c = s[q]; } }
// then locate the last legal line-break decision at or before // the current position ("at or before" is what causes the "+ 1") q = bi->preceding(q + 1);
// if this causes us to wind back to where we started, then the // line has no legal line-break positions. Break the line at // the maximum number of characters if (q == p) { p += MAX_CHARS_PER_LINE; lineStarts[numLines] = p; trailingwhitespace[numLines] = 0; ++numLines; } // otherwise, we got a good line-break position. Record the start of this // line (p) and then seek back from the end of this line (q) until you find // a non-white space character (same criteria as above) and // record the number of white space characters at the end of the // line in the other results array else { lineStarts[numLines] = p; int32_t nextLineStart = q;
for (q--; q > p; q--) { c = s[q]; if (!(u_isspace(c) || u_charType(c) == U_CONTROL_CHAR || u_charType(c) == U_NON_SPACING_MARK)) { break; } } trailingwhitespace[numLines] = nextLineStart - q -1; p = nextLineStart; ++numLines; } if (numLines >= maxLines) { break; } } delete bi; return numLines; }
|
Most text editors would not break lines based on the number
of characters on a line. Even with a monospaced font, there are still
many Unicode characters that are not displayed and therefore should be
filtered out of the calculation. With a proportional font, character
widths are added up until a maximum line width is exceeded or an end of
the paragraph marker is reached.
Trailing white space does not
need to be counted in the line-width measurement because it does not
need to be displayed at the end of a line. The sample code above
returns an array of trailing white space values because an external
rendering process needs to be able to measure the length of the line
(without the trailing white space) to justify the lines. For example,
if the text is right-justified, the invisible white space would be
drawn outside the margin. The line would actually end with the last
visible character.
In either case, the basic principle is to jump
ahead in the text to the location where the line would break (without
taking word breaks into account). Then, move backwards using the preceding() method to find the last legal breaking position before that location. Iterating straight through the text with next() method will generally be slower.
ICU BreakIterator Data Files
The
source code for the ICU break rules for the standard boundary types is
located in the directory icu/source/data/brkitr. These files will be
built, and the corresponding binary state tables incorporated into
ICU's data, by the standard ICU4C build process.
Beginning with
ICU 3.0, the same break rule source files and compiled state tables are
used for both ICU4C and ICU4J. The state tables are built using ICU4C,
and the resulting binary tables are incorporated into ICU4J.
RBBI Rules
ICU
locates boundary positions within text by means of rules, which are a
variant form of regular expressions. A boundary rule is an expression
that matches a section of text - a word or sentence or whatever - that
should remain together, with boundaries occurring
between the
ranges of matched text. A set of rules consists of a series of regular
expressions separated by semicolons; the rules, taken together, define
regions of text that are kept together between boundaries. Boundaries
occur at the end of text ranges matched by the rules.
Forward, Reverse, Safe Point rules
For each type of boundary, four sets of rules are required, as described in the following table.
| Forward |
Advance (match text) starting from a boundary position and continuing to the next following boundary. |
| Reverse |
Starting from a boundary, match backwards, until the preceding boundary position. |
| Safe Forward |
Starting from any arbitrary position in the text, match forward to a safe position, which is a position from which the normal Reverse rule will work correctly. |
| Safe Reverse |
Starting from any arbitrary position in the text, move backwards to a safe position, which is a position from which the normal Forward rule will work correctly. |
All four rules need to be supplied.
Normal
next() or previous() operations use the Forward or Reverse rules,
respectively, to move directly from one boundary position to another.
The
preceding() and following() functions first apply a safe rule, then
apply a normal Forward or Reverse rule. The preceding() and following() functions can start from any arbitrary location in the input text.
 |
Note:
Earlier versions of ICU (prior to 3.0) worked with only a Forward rule
and a safe Reverse rule. While the rule builder will still recognize
rules written in this form, their use is deprecated and strongly
discouraged. |
A rule input file is divided into sections, one for each type of rule:
# This shows the general layout of a break rule file # # The order of the four sections doesn't matter, so long as they all appear. # # Variable definitions can appear anywhere, so long as they are defined before # their first use in a rule. Variable definitions carry forward across section
# boundaries. #
!!forward; # forward rules go here.
!!reverse; # Reverse rules go here.
!!safe_forward; # Safe Forward rules go here.
!!safe_reverse; # Safe Reverse rules go here.
|
Variables
A set of break rules may define and
use variables, which are convenient when subexpressions reappear more
than once, or to simplify complex expressions by allowing parts to be
separately defined and named. Use of variables within a set of rules
has no effect on the efficiency of the resulting break iterator.
!!chain
ICU boundary rules can be written in two ways: chained or non-chained.
With
non-chained rules, each rule (regular expression) stands by itself,
matching a segment of text between two boundary positions. When moving
to the next boundary, the single rule with the longest match defines
the boundary position.
This is very much like traditional regular expression behavior.
Non-chained rule matching behavior is the default for ICU break rules.
Chaining
allows boundary positions to be determined by an arbitrary number of
the boundary rules, applied in an arbitrary sequence. Any character in
the text that completes a match for one rule can function as a chaining
point, and simultaneously be the beginning character of a match for any
other rule. Matching continues in this way until the longest possible
match is obtained.
Chaining from one rule to the next can occur
at any point that the first rule of the pair matches. The longest match
of each individual rule is not required, and if chaining from a shorter
match of an intermediate rule results in a longer overall match, that
is what will happen.
Chained rules are closer in flavor to the
rules definitions in the Unicode Consortium text boundary
specifications. Line Break boundaries, in particular, were not really
possible to implement accurately with traditional, non-chained regular
expression.
!!chain; in a rule file enables rule chaining. !!chain applies to all rule sections, and must appear before the first section.
The !!LBCMNoChain;
statement modifies chaining behavior by preventing chaining from one rule
to another from occurring on any character whose Line Break property is
Combining Mark. This option is subject to change or removal, and should
not be used in general. Within ICU, it is used only with the line break
rules. We hope to replace it with something more general.
Rule Status Values
Break rules can be tagged with a number, which is called the rule status.
After a boundary has been located, the status number of the specific
rule that determined the boundary position is available to the
application through the function getRuleStatus().
For
the predefined word boundary rules, status values are available to
distinguish between boundaries associated with words, numbers, and
those around spaces or punctuation. Similarly for line break
boundaries, status values distinguish between mandatory line endings
(new line characters) and break opportunities that are appropriate
points for line wrapping. Refer to the ICU API documentation for the C
header file ubrk.h or to Java class RuleBasedBreakIterator for a complete list of the predefined boundary classifications.
When
creating custom sets of break rules, integer status values can be
associated with boundary rules in whatever way will be convenient for
the application. There is no need to remain restricted to the
predefined values and classifications from the standard rules.
It
is possible for a set of break rules to contain more than a single rule
that produces some boundary in an input text. In this event, getRuleStatus() will return the numerically largest status value from the matching rules, and the alternate function getRuleStatusVec() will return a vector of the values from all of the matching rules.
In the source form of the break rules, status numbers appear at end of a rule, and are enclosed in {braces}.
Rule Syntax
Here is the syntax for the boundary rules.
| Rule Name |
Rule Values |
Notes |
| rules |
statement+ |
|
| statement |
assignment | rule | control |
|
| control |
(“!!forward” | “!!reverse” | “!!safe_forward” | “!!safe_reverse” | “!!chain” | “!!LBCMNoChain”) ';' |
|
| assignment |
variable '=' expr ';' |
5 |
| rule |
'!'? expr ('{'number'}')? ';' |
8, 9 |
| number |
[0-9]+ |
1 |
| break-point |
'/' |
|
| expr |
expr-q | expr '|' expr | expr expr |
3 |
| expr-q |
term | term '*' | term '?' | term '+' |
|
| term |
rule-char | unicode-set | variable | quoted-sequence | '(' expr ')' | break-point |
|
| rule-special |
any printing ascii character except letters or numbers | white-space |
|
| rule-char |
any non-escaped character that is not rule-special | '.' | any escaped character except '\p' or '\P' |
|
| variable |
'$' name-start-char name-char* |
7 |
| name-start-char |
'_' | \p{L} |
|
| name-char |
name-start-char | \p{N} |
|
| quoted-sequence |
''' (any char except single quote or line terminator or two adjacent single quotes)+ ''' |
|
| escaped-char |
See “Character Quoting and Escaping” in the UnicodeSet chapter
|
|
| Unicode set |
See UnicodeSet
|
4 |
| comment |
unescaped '#' [any char except new-line]* new-line |
2 |
| s |
unescaped \p{Z}, tab, LF, FF, CR, NEL |
6 |
| new-line |
LF, CR, NEL |
2 |
Notes:
-
The
number associated with a rule that actually determined a break position
is available to the application after the break has been returned.
These numbers are not Perl regular expression repeat counts.
-
Comments are recognized and removed separately from otherwise
parsing the rules. They may appear wherever a space would be allowed
(and ignored.)
-
The implicit concatenation of adjacent terms has higher precedence than the '|' operation. "ab|cd" is interpreted as "(ab)|(cd)", not as "a(b|c)d" or "(((ab)|c)d)"
-
The syntax for UnicodeSet is defined (and parsed) by the UnicodeSet class. It is not repeated here.
-
For $variables that will be referenced from inside of a
UnicodeSet, the definition must consist only of a Unicode Set. For
example, when variable $a is used in a rule like [$a$b$c], then this definition of $a is ok “$a=[:Lu:];” while this one “$a=abcd;” would cause an error when $a was used.
-
Spaces are allowed nearly anywhere, and are not significant unless escaped. Exceptions to this are noted.
-
No spaces are allowed within a variable name. The variable name $dictionary
is special. If defined, it must be a Unicode Set, the characters of
which will trigger the use of word dictionary based boundaries.
-
A leading '!' on a rule is a deprecated syntax for specifying a reverse rule. Putting reverse rules in the !!reverse section is now preferred.
-
{nnn} appearing at the end of a rule is a Rule Status
number, not a repeat count as it would be with conventional regular
expression syntax.
EBNF Syntax used for the RBBI rules syntax description
| a? |
zero or one instance of a |
| a+ |
one or more instances of a |
| a* |
zero or more instances of a |
| a | b |
either a or b, but not both |
| 'a' "a" |
the literal string between the quotes |
Additional Sample Code
C/C++: See icu/source/samples/break/
in the ICU source distribution for code samples showing the use of ICU boundary analysis.