String matching algorithms

# String matching algorithms - deepak garg · 2012-06-11


• String matching algorithms

• Deliverables

String Basics

Naïve String matching Algorithm

Boyer Moore Algorithm

Rabin-Karp Algorithm

Knuth-Morris-Pratt Algorithm

Copyright @ gdeepak.Com® 2 6/11/2012 7:22 PM

• String Basics

A string is a sequence of characters

Examples of strings: C++ program, HTML document, DNA sequence, Digitized image

An alphabet Σ is the set of possible characters for a family of strings

Example of alphabets: ASCII (used by C and C++), Unicode (used by Java), {0, 1}, {A, C, G, T}


• String Basics

Let P be a string of size m

A substring P[i .. j] of P is the contiguous subsequence of P consisting of the characters with ranks between i and j, inclusive

A prefix of P is a substring of the type P[0 .. i]

A suffix of P is a substring of the type P[i .. m - 1]

Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P

Applications: Text editors, Search engines, Biological research
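The substring, prefix, and suffix definitions above can be tried out with Python slices. This is a minimal sketch; the helper names `substring`, `prefixes`, and `suffixes` are illustrative and do not come from the slides. Note that P[i .. j] includes both ranks, whereas a Python slice excludes its end index.

```python
def substring(p, i, j):
    """Return P[i .. j], with both ranks i and j included."""
    return p[i:j + 1]


def prefixes(p):
    """All nonempty prefixes P[0 .. i] for i = 0 .. m - 1."""
    return [substring(p, 0, i) for i in range(len(p))]


def suffixes(p):
    """All nonempty suffixes P[i .. m - 1] for i = 0 .. m - 1."""
    m = len(p)
    return [substring(p, i, m - 1) for i in range(m)]
```

For a three-character string such as "abc", each helper returns three strings, the last prefix and the first suffix both being the whole string.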


• Brute Force String Matching

    Naive-String-Matcher(T, P)
        n ← length[T]
        m ← length[P]
        for s ← 0 to n - m
            do if P[1 .. m] = T[s+1 .. s+m]
                then print "Pattern occurs with shift" s

Worst case O(m*n), for example:

T = aaa … ah

P = aaah

At each shift, compare the pattern with the text window character by character (for example Q with A). On a mismatch, shift the pattern by one position and repeat.
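The brute-force procedure above might be written in Python roughly as follows. This is a sketch rather than the slides' own code: it uses 0-based indexing and collects the shifts in a list instead of printing them, and the function name is an assumption.

```python
def naive_string_matcher(text, pattern):
    """Return every shift s at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try every possible alignment of the pattern against the text.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts
```

On inputs shaped like the worst case above (a long run of a followed by h), almost every alignment needs close to m comparisons before it fails, which is where the O(m*n) bound comes from.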


• Boyer-Moore Algorithm

It uses two heuristics

Looking-glass heuristic: Compare P with T moving backwards

Character-jump heuristic: When a mismatch occurs at T[i] = c

If P contains c, shift P to align the last occurrence of c in P with T[i]

Else, shift P to align P[0] with T[i + 1]
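The character-jump rule depends on knowing where each character last occurs in P. A minimal sketch follows; the names `last_occurrence` and `new_pattern_start` are hypothetical, and treating an absent character as index -1 is a convention chosen here so that both cases of the rule reduce to one formula. The sketch ignores the case where the last occurrence lies to the right of the mismatch position, which a full matcher must handle separately.

```python
def last_occurrence(pattern):
    """Map each character of the pattern to the index of its last occurrence."""
    # Later indices overwrite earlier ones, so the last occurrence wins.
    return {c: k for k, c in enumerate(pattern)}


def new_pattern_start(pattern, i, c):
    """Text index where P[0] lands after a character jump on T[i] = c."""
    last = last_occurrence(pattern).get(c, -1)  # -1 when c is not in P
    # If c is in P, P[last] is aligned with T[i]; otherwise P[0] goes to T[i + 1].
    return i - last
```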


• Boyer Moore Algorithm

• Boyer-Moore runs in time O(nm + s) in the worst case, where s is the size of the alphabet Σ

• Example of worst case: T = aaa … a, P = baaa

• The worst case may occur in images and DNA sequences but is unlikely in English text, so Boyer-Moore (BM) is better suited to English text

Example: T = "a pattern matching algorithm", P = "rithm". (Figure: the pattern is drawn under the text at seven successive alignments, with the 11 character comparisons numbered in order.)

    Character  M  H  T  I  R
    Shift      0  1  2  3  4

• Boyer Moore Algorithm

    Algorithm BoyerMooreMatch(T, P, Σ)
        L ← lastOccurrenceFunction(P, Σ)
        i ← m - 1
        j ← m - 1
        repeat
            if T[i] = P[j]
                if j = 0
                    return i { match at i }
                else
                    i ← i - 1
                    j ← j - 1
            else
                { character-jump }
                l ← L[T[i]]
                i ← i + m - min(j, 1 + l)
                j ← m - 1
        until i > n - 1
        return -1 { no match }
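The pseudocode above translates fairly directly into Python. The sketch below makes a few assumptions of its own: the last-occurrence function is a dictionary that yields -1 for characters absent from P, the `repeat … until` loop becomes a `while` loop so that a pattern longer than the text is handled, and an empty pattern is reported as matching at index 0.

```python
def boyer_moore_match(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # assumption: the empty pattern matches at the start
    # Last-occurrence function L; missing characters are looked up as -1.
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i  # match at i
            # Looking-glass heuristic: keep comparing right to left.
            i -= 1
            j -= 1
        else:
            # Character jump on the mismatched text character.
            l = last.get(text[i], -1)
            i += m - min(j, 1 + l)
            j = m - 1
    return -1  # no match
```

The `min(j, 1 + l)` term guarantees the pattern always moves forward by at least one position, even when the last occurrence of the mismatched character lies to the right of position j.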


• Rabin-Karp Algorithm

• It speeds up testing the pattern for equality against the substrings of the text by using a hash function. A hash function converts every string into a numeric value, called its hash value; e.g. hash("hello") = 5. The algorithm exploits the fact that if two strings are equal, their hash values are also equal.

• There are two problems: different strings can also produce the same hash value, so a hash hit still has to be verified character by character, and there is the extra cost of calculating the hash for each length-m substring of the text.

• Worst case is O(mn)
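One common way to keep the hashing cost down is a rolling hash, which updates the hash of the current window in constant time instead of recomputing it. The slides do not give code for this, so the following is only a sketch under stated assumptions: a polynomial hash over character codes, and `base` and `modulus` defaults picked arbitrarily for illustration.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_003):
    """Return all shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character of a length-m window.
    high = pow(base, m - 1, modulus)
    p_hash = t_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % modulus
        t_hash = (t_hash * base + ord(text[k])) % modulus
    matches = []
    for s in range(n - m + 1):
        # Equal hashes are only a hit; compare the strings to confirm it.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % modulus
    return matches
```

If many windows collide with the pattern's hash, every one of them triggers a full comparison, which is how the O(mn) worst case arises.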


• Rabin Karp Example

Here we are matching the pattern 31415 in the given string. Several windows of the text may produce a hash hit, but only a few of those hits may be valid matches; the rest are spurious.
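The difference between a hash hit and a valid match can be seen by hashing every window with a small modulus. The figure for this example is not available here, so the digit string used in the usage note below and the modulus 13 are assumptions chosen for illustration, as is the helper name `hash_hits`. For clarity the sketch recomputes each window's value instead of rolling the hash.

```python
def hash_hits(text, pattern, modulus=13):
    """Split the windows whose hash equals the pattern's into two lists.

    Returns (valid, spurious): shifts where the window really equals the
    pattern, and shifts where only the hash values agree.
    """
    m = len(pattern)
    target = int(pattern) % modulus
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % modulus == target:
            # A hit: confirm it by comparing the actual digits.
            if window == pattern:
                valid.append(s)
            else:
                spurious.append(s)
    return valid, spurious
```

With the assumed text "2359023141526739921" and pattern "31415", the window 67399 has the same remainder modulo 13 as the pattern without being equal to it, so it is reported as a spurious hit.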


• KMP Algorithm

• It never re-compares a text symbol that has matched a pattern symbol. As a result, the complexity of the searching phase of KMP is O(n).

• The preprocessing phase has a complexity of O(m). Since m ≤ n, the overall complexity is O(n).

• A border of x is a substring that is both a proper prefix and a proper suffix of x. We call its length b the width of the border.

• Let x = abacab. The proper prefixes of x are ε, a, ab, aba, abac, abaca

• The proper suffixes of x are ε, b, ab, cab, acab, bacab

• The borders of x are ε, ab
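The border definition can be checked by brute force: a prefix of width k is a border exactly when it equals the suffix of the same width. A minimal sketch, with the function name `borders` chosen for illustration and the empty string standing in for ε:

```python
def borders(x):
    """Return all borders of x, narrowest first ('' stands for ε)."""
    m = len(x)
    # k ranges over proper widths only, so x itself is never a border.
    return [x[:k] for k in range(m) if x[:k] == x[m - k:]]
```

This takes quadratic time in the worst case and is only meant to make the definition concrete.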


• Border Concept

If s is the widest border of x, the next-widest border r of x is obtained as the widest border of s, and so on down to the empty border ε.
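This property means all borders of x can be listed by repeatedly taking the widest border of the previous one. A small sketch of that chain, using brute-force comparison and hypothetical helper names:

```python
def widest_border(x):
    """Return the widest border of x, or None if x is empty."""
    m = len(x)
    for k in range(m - 1, -1, -1):
        if x[:k] == x[m - k:]:
            return x[:k]
    return None  # only the empty string has no border


def border_chain(x):
    """List the borders of x from widest to narrowest."""
    chain = []
    s = widest_border(x)
    while s is not None:
        chain.append(s)
        # The next-widest border of x is the widest border of s.
        s = widest_border(s)
    return chain
```

For "ababa" the chain is "aba", then "a", then the empty border, which are exactly the borders of "ababa".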


• Border Extension

Let x be a string and a ∈ A a symbol. A border r of x can be extended by a if ra is a border of xa.


• Border calculation

In the pre-processing phase an array b of length m+1 is computed. Each entry b[i] contains the width of the widest border of the prefix of length i of the pattern (i = 0, ..., m). Since the prefix ε of length i = 0 has no border, we set b[0] = -1.


• KMP-Preprocess Algorithm

    void kmpPreprocess()
    {
        int i=0, j=-1;
        b[i]=j;
        while (i<m)
        {
            // fall back to ever narrower borders until one can be extended by p[i]
            while (j>=0 && p[i]!=p[j]) j=b[j];
            i++;
            j++;
            b[i]=j;
        }
    }

• For pattern p = ababaa the widths of the borders in array b have the following values. For instance we have b[5] = 3, since the prefix ababa of length 5 has a border of width 3.

    j:      0  1  2  3  4  5  6
    p[j]:   a  b  a  b  a  a
    b[j]:  -1  0  0  1  2  3  1
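The same preprocessing can be written as a self-contained Python function for experimentation. This is a sketch that passes the pattern as an argument and returns the array instead of relying on the globals `p`, `m`, and `b`; the function name is an assumption.

```python
def kmp_preprocess(p):
    """Return b, where b[i] is the width of the widest border of p[:i]."""
    m = len(p)
    b = [0] * (m + 1)
    i, j = 0, -1
    b[0] = -1  # the empty prefix has no border
    while i < m:
        # Fall back through narrower borders until one extends by p[i].
        while j >= 0 and p[i] != p[j]:
            j = b[j]
        i += 1
        j += 1
        b[i] = j
    return b
```

Calling it on "ababaa" gives b[5] = 3, as stated above for the prefix ababa.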


• Preprocessing

• Conceptually, the pre-processing algorithm could be applied to the concatenated string pt instead of p. If only borders up to a width of m are computed, then a border of width m of some prefix x of pt corresponds to a match of the pattern in t (provided that the border is not self-overlapping)


• KMP Search Algorithm

    void kmpSearch()
    {
        int i=0, j=0;
        while (i<n)
        {
            // on a mismatch, resume at the widest border of the matched prefix
            while (j>=0 && t[i]!=p[j]) j=b[j];
            i++;
            j++;
            if (j==m)
            {
                report(i-j);
                j=b[j];
            }
        }
    }
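For experimentation, the search can be combined with the border computation in one Python function. This is a sketch with its own assumptions: matches are collected in a list rather than passed to `report`, the text and pattern are arguments instead of globals, and an empty pattern yields no matches.

```python
def kmp_search(t, p):
    """Return the start positions of all occurrences of p in t."""
    n, m = len(t), len(p)
    if m == 0:
        return []  # assumption: an empty pattern is not searched for
    # Preprocessing: b[i] is the widest border width of p[:i].
    b = [-1] + [0] * m
    i, j = 0, -1
    while i < m:
        while j >= 0 and p[i] != p[j]:
            j = b[j]
        i += 1
        j += 1
        b[i] = j
    # Search: i only moves forward, so no text symbol is re-matched.
    matches = []
    i = j = 0
    while i < n:
        while j >= 0 and t[i] != p[j]:
            j = b[j]
        i += 1
        j += 1
        if j == m:
            matches.append(i - j)
            j = b[j]  # shift as far as the widest border allows
    return matches
```

Because j falls back to b[m] after a match, overlapping occurrences are also found.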


• KMP Search Algorithm

• When a mismatch occurs at position j in the inner while loop, the widest border of the matching prefix of length j of the pattern is considered. Resuming comparisons at position b[j], the width of the border, yields a shift of the pattern such that the border matches. If a mismatch occurs again, the next-widest border is considered, and so on, until there is no border left or the next symbol matches. Then we have a new matching prefix of the pattern and continue with the outer while loop.

• If all m symbols of the pattern have matched the corresponding text window (j = m), a function report is called to report the match at position i-j. Afterwards, the pattern is shifted as far as its widest border allows.


Example: T = abacaabaccabacabaabb, P = abacab. (Figure: the pattern is drawn under the text at successive alignments, with the 19 character comparisons numbered in order.)

    x      0  1  2  3  4  5
    P[x]   a  b  a  c  a  b
    f(x)   0  0  1  0  1  2

• Question 1

How many nonempty prefixes of the string P = "aaabbaaa" are also suffixes of P?

• Question 2

What is the longest proper prefix of the string cgtacgttcgtacg that is also a suffix of this string?

• Question 3

What is the complexity of the KMP algorithm if we have a main string of length s and we wish to find a pattern of length p?

A) O(s + p)

B) O(p)

C) O(sp)

D) O(s)

