OverviewString searching, also known as string matching, is a very important subject in the wider domain of text processing and analysis. Many software applications use the basic string search algorithm in the implementations on most operating systems. With the popularity of Internet, the quantity of available data from different parts of the world has increased dramatically within a short time. Therefore, a string search algorithm that is language-aware has become more important. A bitwise match that uses the u_strstr (C), UnicodeString::indexOf (C++) or String.indexOf (Java) APIs will not yield the correct result specific to a particular language's requirements. The APIs will not yield the correct result because all the issues that are important to language-sensitive collation are also applicable to text searching. The following lists those issues which are applicable to text searching:
ICU String Search ModelThe ICU string search service provides similar APIs to the other text iterating services. Allowing users to specify the starting position and direction within the text string to be searched. For more information, please see Boundary Analysis chapter. The user can locate one or all occurrences of a pattern in a string. For a given collator, a pattern match is located at the offsets <start, end> in a string if the collator finds that the sub-string between the start and end is equal. The string search service supports two different types of canonical match behavior. Let S' be the sub-string of a text string S between the offsets start and end <start, end>.
One restriction is to be noted for option 1. Currently there are no composite characters that consists of a character with combining class greater than 0 before a character with combining class equals to 0. However, if such a character exists in the future, the string search service may not work correctly with option 1 when such characters are encountered. Furthermore, option 1 could generate more than one "encompassing" matches. For example, in Danish, 'å' (\u00e5) and 'aa' are considered equivalent. So the pattern "baad" will match "a--båd--man" (a--b\u00e5d--man) at the start offset at 3 and the end offset 5. However, the start offset can be 1 or 2 and the end offset can be 6 or 7, because "-" (hyphen) is ignorable for a certain collation. The ICU implementation always returns the offsets of the shortest match sub-string. To be more exact, the string search added a "tightest" match condition. In other words, if the pattern matches at offsets <start, end> as well as offsets <start + 1, end>, the offsets <start, end> are not considered a match. Likewise, if the pattern matches at offsets <start, end> as well as offsets <start, end + 1>, the offsets <start, end + 1> are not considered a match. Therefore, when the option 1 is chosen in Danish collator, 'baad' will match in the string "a--båd--man" (a--b\u00e5d--man) ONLY at offsets <3,5>. The default behavior is that described in option 2 above. To obtain the behavior described in option 1, you must set the normalization mode to ON in the collator used for search.
The string search service also supports two varieties of “asymmetric search” as described in Section 8.2 Asymmetric Search in UTS #10 Unicode Collation Collation Algorithm. With asymmetric search, for example, unaccented characters are treated as “wildcards” that may match any character with the same primary weight, this behavior can be applied just to characters in the search pattern, or to characters in both the search pattern and the searched text. With the former behavior, searching with French behavior for 'e' might match 'e', 'è', 'é', 'ê', and so one, while search for 'é' would only match 'é'. Both a locale or collator can be used to specify the language-sensitive rules for searches. When a locale is specified, a collator will be created internally and the StringSearch instance that is created is responsible for the ownership of the collator. All the collation attributes will be considered during the string search operation. However, the users only can set the collator attributes using the collator APIs. Normalization is usually done within collation and the process is outside the scope of the string search service. As in other iterator interfaces, the string search service provides APIs to perform string matching for the first pattern occurrence, immediate next, previous match, and the last pattern occurrence. There are also options to allow for overlapping matching. For example, in English, if the string is "ababab" and the pattern is "abab", overlapping matching produces results of offsets <0, 3> and <2, 5>. Otherwise, the mutually exclusive matching produces the result offset <0, 3> only. To find a whole word match, the user can provide a locale-specific BreakIterator object to a StringSearch instance to correctly locate the word boundaries. For example, if "c" exists in the string "abc", a match is returned. However, the behavior can be overwritten by supplying a word BreakIterator. The minimum unit of match is aligned to an extended grapheme cluster in the ICU string search service implementation defined by UAX #29 Unicode Text Segmentation. Therefore, all matches will begin and end on extended grapheme cluster boundaries. If the given input search pattern starts with non-base character, no matches will be returned.When there are contractions in the collation sequence and the contraction happens to span across the boundary of a match, it is not considered a match. For example, in traditional Spanish where 'ch' is a contraction, the "har" pattern will not match in the string "uno charo". Boundaries that are discontiguous contractions will yield a match result similar to those described above, where the end of the match returned will be one character before the immediate following base letter. In addition, only the first match will be located if a pattern contains only combining marks and the search string contains more than one occurrences of the pattern consecutively. For example, if the user searches for the pattern "´" (\u00b4) in the string "A´´B", (A\u00b4\u00b4B) the result will be offsets <1, 2>. ExampleIn C:
In C++:
In Java:
Performance and Other ImplicationsThe ICU string
search service is designed to be on top of the ICU collation service.
Therefore, all the performance implications that apply to a collator
are also applicable to the string search service. To obtain the best
performance, use the default collator attributes described in the Performance and Storage Implications on Attributes
section (§) in the Collation Service Architecture chapter. In addition, users need to be aware of the following StringSearch specific considerations: Search AlgorithmICU4C releases up to 3.8 used the Boyer-Moore search algorithm in the string search service. There were some known issues in these previous releases (See ICU tickets #5024, #5382, #5420). In ICU4C 4.0, the string search service was updated with the simple linear search algorithm, which locates a match by shifting a cursor in the target text one by one, and these issues were fixed. In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced as a separated API set as a technology preview. The Boyer-Moore searching algorithm is based on automata or combinatorial properties of strings and pre-processes the pattern and known to be much faster than the linear search when search pattern length is longer. According to performance evaluation between these two implementations, the Boyer-Moore search is faster than the linear search when the pattern text is longer than 3 or 4 characters.Change Iterating DirectionThe
ICU string search service provides a set of very dynamic APIs that
allow users to change the iterating direction randomly. For example,
users can search for a particular word going forward by calling the usearch_next (C), StringSearch::next (C++) or StringSearch.next (Java) APIs and then search backwards at any point of the search operation by calling the usearch_previous (C), StringSearch::previous (C++) or StringSearch.previous (Java) APIs. Another way to change the iterating direction is by calling the usearch_reset (C), StringSearch::previous (C++) or StringSearch.previous (Java) APIs. Though the direction change can occur without calling the reset APIs first, this operation comes with a reduction in speed.
Thai and Lao Character BoundariesIn
collation, certain Thai and Lao vowels are swapped with the next
character. For example, the text string "A ขเ" (A \u0e02\u0e40) is
processed internally in collation as Case Level SearchCase level string search is currently done with the strength set to tertiary. When searching with the strength set to primary and the case level attribute turned on, results given may not be correct. The case level attribute is different from tertiary strength in that accents are ignored but case differences are not. Suppose you wanted to search for “A” in the text “ABC\u00C5a”. The match found should be at 0 and 3 if using the case level attribute. However, searching with the case level attribute turned on finds matches at 0, 3, and 4, which includes the lower case 'a'. To ensure that case level differences are not ignored, string search must be done with at least tertiary strength. |