Overview
ICU's Regular Expressions package provides
applications with the ability to apply regular expression matching to
Unicode string data. The regular expression patterns and behavior are
based on Perl's regular expressions. The C++ programming API for using
ICU regular expressions is loosely based on the JDK 1.4 package
java.util.regex, with some extensions to adapt it for use in a C++
environment. A plain C API is also provided.
The ICU regular
expression API supports operations including testing for a pattern
match, searching for a pattern match, and replacing matched text.
Capture groups allow subranges within an overall match to be
identified, and to appear within replacement text.
A Perl-inspired split() function that breaks a string into fields based on a delimiter pattern is also included.
ICU Regular Expressions conform to
Unicode Technical Standard #18, Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.
A detailed description of regular expression patterns and pattern
matching behavior is not included in this user guide. The best
reference for this topic is the book "Mastering Regular Expressions,
Second Edition" by Jeffrey E. F. Friedl (O'Reilly & Associates,
July 15, 2002). Matching behavior can sometimes be surprising,
and this book is highly recommended for anyone doing significant work
with regular expressions.
Using ICU Regular Expressions
The ICU C++ Regular Expression API includes two classes, RegexPattern and RegexMatcher, that parallel the classes from the Java JDK package java.util.regex. A RegexPattern represents a compiled regular expression while RegexMatcher associates a RegexPattern
and an input string to be matched, and provides API for the various
find, match and replace operations. In most cases, however, only the
class RegexMatcher is needed, and the existence of class RegexPattern can safely be ignored.
The first step in using a regular expression is typically the creation of a RegexMatcher object from the source (string) form of the regular expression.
RegexMatcher
holds a pre-processed (compiled) pattern and a reference to an input
string to be matched, and provides API for the various find, match and
replace operations. RegexMatchers
can be reset and reused with new input, thus avoiding object creation
overhead when performing the same matching operation repeatedly on
different strings.
The following code will create a RegexMatcher from a string containing a regular expression, and then perform a simple find() operation.
    #include <unicode/regex.h>
    UErrorCode status = U_ZERO_ERROR;
    ...
    RegexMatcher *matcher = new RegexMatcher("abc+", 0, status);
    if (U_FAILURE(status)) {
        // Handle any syntax errors in the regular expression here
        ...
    }
    UnicodeString stringToTest = "Find the abc in this string";
    matcher->reset(stringToTest);
    if (matcher->find()) {
        // We found a match.
        int startOfMatch = matcher->start(status);   // string index of start of match.
        ...
    }
Several types of matching tests are available:
| Function | Description |
|---|---|
| matches() | True if the pattern matches the entire string, from the start through to the last character. |
| lookingAt() | True if the pattern matches at the start of the string. The match need not include the entire string. |
| find() | True if the pattern matches somewhere within the string. Successive calls to find() will find additional matches, until the string is exhausted. |
If
additional text is to be checked for a match with the same pattern,
there is no need to create a new matcher object; just reuse the
existing one.
    matcher->reset(anotherString);
    if (matcher->matches(status)) {
        // We have a match with the new string.
    }
Note that matching happens directly in the string supplied by
the application. This reduces the overhead when resetting a matcher to
an absolute minimum – the matcher need only store a reference to the
new string – but it does mean that the application must be careful not
to modify or delete the string while the matcher is holding a reference
to the string.
After finding a match, additional information is
available about the range of the input matched, and the contents of any
capture groups. Note that, for simplicity, any error parameters have
been omitted. See the API reference
for a complete description of the API.
| Function | Description |
|---|---|
| start() | Return the index of the start of the matched region in the input string. |
| end() | Return the index of the first character following the match. |
| group() | Return a UnicodeString containing the text that was matched. |
| start(n) | Return the index of the start of the text matched by the nth capture group. |
| end(n) | Return the index of the first character following the text matched by the nth capture group. |
| group(n) | Return a UnicodeString containing the text that was matched by the nth capture group. |
Regular Expression Metacharacters
| Character | Description |
|---|---|
| \a | Match a BELL, \u0007 |
| \A | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. |
| \b, outside of a [Set] | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. For better word boundaries, see ICU Boundary Analysis. |
| \b, within a [Set] | Match a BACKSPACE, \u0008. |
| \B | Match if the current position is not a word boundary. |
| \cX | Match a control-X character. |
| \d | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.) |
| \D | Match any character that is not a decimal digit. |
| \e | Match an ESCAPE, \u001B. |
| \E | Terminates a \Q ... \E quoted sequence. |
| \f | Match a FORM FEED, \u000C. |
| \G | Match if the current position is at the end of the previous match. |
| \n | Match a LINE FEED, \u000A. |
| \N{UNICODE CHARACTER NAME} | Match the named character. |
| \p{UNICODE PROPERTY NAME} | Match any character with the specified Unicode Property. |
| \P{UNICODE PROPERTY NAME} | Match any character not having the specified Unicode Property. |
| \Q | Quotes all following characters until \E. |
| \r | Match a CARRIAGE RETURN, \u000D. |
| \s | Match a white space character. White space is defined as [\t\n\f\r\p{Z}]. |
| \S | Match a non-white space character. |
| \t | Match a HORIZONTAL TABULATION, \u0009. |
| \uhhhh | Match the character with the hex value hhhh. |
| \Uhhhhhhhh | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. |
| \w | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. |
| \W | Match a non-word character. |
| \x{hhhh} | Match the character with hex value hhhh. From one to six hex digits may be supplied. |
| \xhh | Match the character with two digit hex value hh |
| \X | Match a Grapheme Cluster. |
| \Z | Match if the current position is at the end of input, but before the final line terminator, if one exists. |
| \z | Match if the current position is at the end of input. |
| \n | Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern. |
| \0ooo | Match an Octal character. 'ooo' is from one to three octal digits. 0377 is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references. |
| [pattern] | Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern. |
| . | Match any character. |
| ^ | Match at the beginning of a line. |
| $ | Match at the end of a line. |
| \ | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . / |
Regular Expression Operators
| Operator | Description |
|---|---|
| | | Alternation. A|B matches either A or B. |
| * | Match 0 or more times. Match as many times as possible. |
| + | Match 1 or more times. Match as many times as possible. |
| ? | Match zero or one times. Prefer one. |
| {n} | Match exactly n times |
| {n,} | Match at least n times. Match as many times as possible. |
| {n,m} | Match between n and m times. Match as many times as possible, but not more than m. |
| *? | Match 0 or more times. Match as few times as possible. |
| +? | Match 1 or more times. Match as few times as possible. |
| ?? | Match zero or one times. Prefer zero. |
| {n}? | Match exactly n times |
| {n,}? | Match at least n times, but no more than required for an overall pattern match |
| {n,m}? | Match between n and m times. Match as few times as possible, but not less than n. |
| *+ | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match). |
| ++ | Match 1 or more times. Possessive match. |
| ?+ | Match zero or one times. Possessive match. |
| {n}+ | Match exactly n times |
| {n,}+ | Match at least n times. Possessive Match. |
| {n,m}+ | Match between n and m times. Possessive Match. |
| ( ... ) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. |
| (?: ... ) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. |
| (?> ... ) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>". |
| (?# ... ) | Free-format comment (?# comment ). |
| (?= ... ) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
| (?! ... ) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
| (?<= ... ) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
| (?<! ... ) | Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
| (?ismwx-ismwx: ... ) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. |
| (?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. |
Replacement Text
The
replacement text for find-and-replace operations may contain references
to capture-group text from the find. References are of the form $n, where n is the number of the capture group.
| Character | Description |
|---|---|
| $n | The text of capture group n will be substituted for $n. n must be >= 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $. |
| \ | Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\', but may be used on any other character without bad effects. |
Flag Options
The
following flags control various aspects of regular expression matching.
The flag values may be specified at the time that an expression is
compiled into a RegexPattern object, or they may be specified within
the pattern itself using the (?ismwx-ismwx) pattern options.
Note: The UREGEX_CANON_EQ option is not yet available.
| Flag (pattern) | Flag (API Constant) | Description |
|---|---|---|
|  | UREGEX_CANON_EQ | If set, matching will take the canonical equivalence of characters into account. NOTE: this flag is not yet implemented. |
| i | UREGEX_CASE_INSENSITIVE | If set, matching will take place in a case-insensitive manner. |
| x | UREGEX_COMMENTS | If set, allow use of white space and #comments within patterns. |
| s | UREGEX_DOTALL | If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behaves as a single line terminator, and will match a single "." in a RE pattern. |
| m | UREGEX_MULTILINE | Control the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text. |
| w | UREGEX_UWORD | Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters. |
Using split()
ICU's
split() function is similar in concept to Perl's – it will split a
string into fields, with a regular expression match defining the field
delimiters and the text between the delimiters being the field content
itself.
Suppose you have a string of words separated by spaces:
    UnicodeString s = "dog cat giraffe";
This code will extract the individual words from the string.
    UErrorCode status = U_ZERO_ERROR;
    RegexMatcher m("\\s+", 0, status);
    const int maxWords = 10;
    UnicodeString words[maxWords];
    int numWords = m.split(s, words, maxWords, status);
After the split():
| Variable | Value |
|---|---|
| numWords | 3 |
| words[0] | "dog" |
| words[1] | "cat" |
| words[2] | "giraffe" |
| words[3 to 9] | "" |
The field delimiters, the spaces from the original string, do not appear in the output strings.
Note that, in this example, “words”
is a local, or stack array of actual UnicodeString objects. No heap
allocation is involved in initializing this array of empty strings (C++
is not Java!). Local UnicodeString arrays like this are a very good fit
for use with split(); after extracting the fields, any values that need
to be kept in some more permanent way can be copied to their ultimate
destination.
If the number of fields in a string being split
exceeds the capacity of the destination array, the last destination
string will contain all of the input string data that could not be
split, including any embedded field delimiters. This is similar to
split() in Perl.
If the pattern expression contains capturing
parentheses, the captured data ($1, $2, etc.) will also be saved in the
destination array, interspersed with the fields themselves.
If, in the "dog cat giraffe" example, the pattern had been "(\s+)" instead
of "\s+", split() would have produced five output strings instead of
three. words[1] and words[3] would have been the spaces.
Find and Replace
Description of appendReplacement() and appendTail(). To be added.