Transcript
Page 1: Exact String Matching Algorithms

Exact String Matching Algorithms

Presented by Dr. Shazzad Hosain

Asst. Prof. EECS, NSU

Page 2: Exact String Matching Algorithms

Exact Matching: What’s the Problem

     1  2  3  4  5  6  7  8  9 10 11 12
T =  b  b  a  b  a  x  a  b  a  b  a  y
P = aba

P occurs in T starting at locations 3, 7, and 9.
Occurrences of P may overlap, as found at locations 7 and 9.

Page 3: Exact String Matching Algorithms

The Naive Method

• The problem is to find whether a pattern P[1..m] occurs within a text T[1..n]
• Let P = abxyabxz and T = xabxyabxyabxz
• Here m = 8 and n = 13

Page 4: Exact String Matching Algorithms

The Naive Method

• If P = aaa and T = aaaaaaaaaa, then m = 3 and n = 10
• In the worst case, exactly m(n-m+1) comparisons are made
• In this case that is 24 comparisons, which is on the order of Θ(mn)
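The count of 24 comparisons for P = aaa and T = aaaaaaaaaa can be checked with a small sketch. The function name `naive_comparisons` is hypothetical and not part of the slides; it uses 0-based C strings and counts every character comparison the naive method makes.

```c
#include <string.h>

/* Hypothetical helper: counts the character comparisons made by the naive
 * method when searching for pat in text (0-based C strings). */
long naive_comparisons(const char *text, const char *pat)
{
    size_t n = strlen(text);   /* length of the text */
    size_t m = strlen(pat);    /* length of the pattern */
    long count = 0;

    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++) {      /* each alignment of pat */
        for (size_t j = 0; j < m; j++) {
            count++;                           /* one character comparison */
            if (text[i + j] != pat[j])
                break;                         /* mismatch: try next alignment */
        }
    }
    return count;
}
```

With a pattern of length 3 and a text of length 10 there are 8 alignments, each costing 3 comparisons when every character matches.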

Page 5: Exact String Matching Algorithms

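The Naive Algorithm

    /* 1-based indexing as on the slides: text[1..n], pat[1..m]. */
    void Report_match_at_position(int pos);   /* assumed to be defined elsewhere */

    void naive_match(const char text[], const char pat[], int n, int m)
    {
        int i, j, k, lim;

        lim = n - m + 1;
        for (i = 1; i <= lim; i++) {           /* search: try each alignment */
            k = i;
            for (j = 1; j <= m && text[k] == pat[j]; j++)
                k++;
            if (j > m)                         /* all m characters matched */
                Report_match_at_position(i);
        }
    }

• The worst-case bound can be reduced to O(m+n)
• For applications with m = 1000 and n = 10,000,000 the improvement is significant.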

Page 6: Exact String Matching Algorithms

The Smart Algorithm

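     1 2 3 4 5 6 7 8
P =  a b x y a b x z

• If you know that the first character of P (namely a) does not occur again in P until position 5 of P, you can shift P further after a mismatch instead of moving it by one position. This skips over three comparisons.
• Reasoning of this sort is the key to shifting by more than one character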

Page 7: Exact String Matching Algorithms

The Smarter Algorithm

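[Figure: alignments of P against T. The first shift skips over three comparisons. After the shift, instead of restarting at the first character of P, the comparison starts further along in P, which skips another three.]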

Page 8: Exact String Matching Algorithms

The Smart Algorithms

• Knuth-Morris-Pratt (KMP) Algorithm
• Boyer-Moore Algorithm
• Reduced run-time to O(n+m)

Additional knowledge requires preprocessing of strings.
Usually P is much shorter than T, so P is preprocessed.

Page 9: Exact String Matching Algorithms

The Preprocessing Approach

• Usually P is preprocessed instead of T
• Sometimes T is preprocessed, e.g. suffix tree
• The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty

• Fundamental preprocessing of P is independent of any particular algorithm

• Each algorithm uses this information

Page 10: Exact String Matching Algorithms

Basic String Definitions/Notations

• Let S be the string
• S[i..j] is the substring of S starting at position i and ending at position j; S[i..j] is empty if i > j
• |S| is the length of the string
• S[1..i] is the prefix of S that ends at position i
• S[i..|S|] is the suffix of S that begins at position i

     1  2  3  4  5  6  7  8  9 10 11 12
S =  b  b  a  b  a  x  a  b  a  b  a  y

Here, |S| = 12
S[3..7] = abaxa
S[1..4] = bbab (prefix)
S[9..12] = abay (suffix)
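The 1-based S[i..j] notation can be mapped onto 0-based C strings as in the sketch below. The helper name `substring_1based` and its buffer-based interface are assumptions made for illustration; they do not come from the slides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies S[i..j] (1-based, inclusive, as on the slide)
 * into out. The result is the empty string if i > j or if the range falls
 * outside S. Returns the length of the copied substring. */
size_t substring_1based(const char *S, size_t i, size_t j,
                        char *out, size_t outsize)
{
    size_t len = strlen(S);
    size_t count = 0;

    if (outsize == 0)
        return 0;
    if (i >= 1 && i <= j && j <= len) {
        count = j - i + 1;
        if (count > outsize - 1)
            count = outsize - 1;       /* truncate to fit the buffer */
        memcpy(out, S + (i - 1), count);
    }
    out[count] = '\0';
    return count;
}
```

A prefix is the call with i = 1, and a suffix is the call with j = |S|.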

Page 11: Exact String Matching Algorithms

Basic String Definitions/Notations

• A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string.

• For any string S, S(i) denotes the ith character of S

Page 12: Exact String Matching Algorithms

Preprocessing

• Goal: To gather the information needed for speeding up the algorithm

• Definitions:
  – Zi: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S
  – Z-box: for any position i > 1 where Zi > 0, the Z-box at i starts at i and ends at i+Zi-1
  – ri: for every i > 1, ri is the right-most endpoint of the Z-boxes that begin at or before i
  – li: for every i > 1, li is the left endpoint of the Z-box that ends at ri

Page 13: Exact String Matching Algorithms

Preprocessing

Zi(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1

     1  2  3  4  5  6  7  8  9 10 11
S =  a  a  b  c  a  a  b  x  a  a  z

Z5(S) = 3 (aabc…aabx…)
Z6(S) = 1 (aa…ab…)
Z7(S) = Z8(S) = 0
Z9(S) = 2 (aab…aaz)

We will use Zi in place of Zi(S)

Z-box: defined for i > 1, where Zi is greater than zero

Figure 1.2: From Gusfield
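The Zi values in this example can be reproduced straight from the definition. The sketch below is a quadratic-time check, not the linear-time Z algorithm; the name `z_from_definition` and the 1-based `Z` array layout are assumptions chosen to match the slide's indexing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills Z[2..n] directly from the definition.
 * S is a 0-based C string; Z is indexed 1-based to match the slide, so it
 * must have room for n + 1 entries. Z[0] and Z[1] are unused and set to 0.
 * Worst-case running time is O(n^2). */
void z_from_definition(const char *S, size_t *Z)
{
    size_t n = strlen(S);

    Z[0] = 0;
    if (n >= 1)
        Z[1] = 0;
    for (size_t i = 2; i <= n; i++) {
        size_t len = 0;
        /* Compare S[i..] (1-based) against the prefix S[1..]. */
        while (i + len <= n && S[i - 1 + len] == S[len])
            len++;
        Z[i] = len;
    }
}
```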

Page 14: Exact String Matching Algorithms

The li and ri of Z-Box

[Figure: Z-boxes drawn over S, with endpoints marked at positions 40, 50, 55, 62, 70, 78, 82, 85, 89, and 95]

ri = the right-most endpoint of the Z-boxes that begin at or before position i.

li = the left end of the Z-box that ends at ri.

r78 = 95, l78 = 78
r82 = 95, l82 = 78
r52 = 50, l52 = 40
r75 = 85, l75 = 70

Page 15: Exact String Matching Algorithms

Preprocessing

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
ri:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
li:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
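The ri and li rows follow mechanically from the Z row. The sketch below scans the Z values once and tracks the Z-box that reaches furthest right. The name `zbox_bounds` and the 1-based array layout are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: derives r[i] and l[i] for i = 2..n from the Z values.
 * All arrays are 1-based with n + 1 entries. r[i] = l[i] = 0 when no Z-box
 * begins at or before position i. */
void zbox_bounds(const size_t *Z, size_t n, size_t *r, size_t *l)
{
    size_t best_r = 0, best_l = 0;

    r[0] = l[0] = 0;
    if (n >= 1)
        r[1] = l[1] = 0;
    for (size_t i = 2; i <= n; i++) {
        /* The Z-box at i spans positions i .. i + Z[i] - 1. */
        if (Z[i] > 0 && i + Z[i] - 1 > best_r) {
            best_r = i + Z[i] - 1;
            best_l = i;
        }
        r[i] = best_r;
        l[i] = best_l;
    }
}
```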

Page 16: Exact String Matching Algorithms

Z-Algorithm

Goal: To calculate Zi for an input string S of length n in linear time

Starting from k = 2, calculate Z2, r2 and l2

For k = 3; k <= n; k++
In iteration k, calculate Zk, rk and lk based on Zj, rj and lj for j = 2,…,k-1

For iteration k, the algorithm only needs rk-1 and lk-1. Thus, there is no need to keep all ri and li. We use r and l to denote rk-1 and lk-1

Page 17: Exact String Matching Algorithms

Z-Algorithm

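In iteration k: (I) if k <= r

[Figure: position k lies inside the Z-box α = S[l..r], and β = S[k..r]. The matching prefix is α' = S[l'..r'] (so l' = 1), with k' inside it and β' = S[k'..r'].]

k' = k-l+1; r' = r-l+1; α = α'; β = β'

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y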

Page 18: Exact String Matching Algorithms

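Z-Algorithm

A) If |γ'| < |β'|, that is, Zk' < r-k+1, then Zk = Zk'

[Figure: γ' is the Z-box at k', γ'' is the matching prefix of S, and γ is the copy of γ' that starts at k. x is the character that follows γ''; y is the character that follows both γ' and γ.]

γ = γ' = γ''; x ≠ y

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3

For example, at k = 13 we have l = 10 and r = 16, so k' = 4 and Z4 = 3 < r-k+1 = 4, which gives Z13 = 3.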

Page 19: Exact String Matching Algorithms

Z-Algorithm

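B) If |γ'| > |β'|, that is, Zk' > r-k+1, then Zk = |β|, i.e., r-k+1

[Figure: β'' is the prefix of γ'' that matches β'. x is the character that follows β' and β''; y is the character that follows β.]

β = β' = β''; γ' = γ''
x ≠ y (because α is a Z-box)

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z:   0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0

For example, at k = 13 we have l = 10 and r = 14, so k' = 4 and Z4 = 3 > r-k+1 = 2, which gives Z13 = 2.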

Page 20: Exact String Matching Algorithms

Z-Algorithm

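C) If |γ'| = |β'|, that is, Zk' = r-k+1, then Zk ≥ |β|, i.e., Zk ≥ r-k+1

[Figure: x is the character that follows β', y is the character that follows β, and z is the character that follows β''.]

β = β' = β''; γ = γ' = γ''
x ≠ y (because α is a Z-box)
z ≠ x (because γ' is a Z-box)
z ? y (the relation between z and y is unknown)

Compare S[r+1..] with S[|β|+1..] until a mismatch occurs. Update Zk, r, and l

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z:   0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0

For example, at k = 13 we have l = 10 and r = 14, so k' = 4 and Z4 = 2 = r-k+1. Comparing S[15..] with S[3..] finds one more match (b) before the mismatch, which gives Z13 = 3, r = 15 and l = 13.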

Page 21: Exact String Matching Algorithms

Z-Algorithm

(II) if k > r

Compare the characters starting at k with those starting at 1.

Update r and l if necessary

Page 22: Exact String Matching Algorithms

Z-Algorithm

Input: Pattern P (n is the length of the input string)
Output: Zi

Z Algorithm

    Calculate Z2, r2 and l2 explicitly by comparisons. Set r = r2 and l = l2
    for k = 3; k <= n; k++
        if k <= r
            if Z[k-l+1] < r-k+1, then Z[k] = Z[k-l+1]
            else if Z[k-l+1] > r-k+1, then Z[k] = r-k+1
            else compare the characters starting at r+1 with those starting at |β|+1, where |β| = r-k+1. Update r and l if necessary
        else
            compare the characters starting at k with those starting at 1. Update r and l if necessary
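The case analysis above can be sketched in C as follows. This is one possible rendering under stated assumptions, not the presenter's code: the names `z_algorithm` and `match_length` are hypothetical, `Z` is kept 1-based to match the slides, and Z2 is handled by the general loop (starting with r = 0) rather than as a separate step.

```c
#include <stddef.h>
#include <string.h>

/* Explicit comparison: returns how many characters of S starting at 1-based
 * position `from` match S starting at 1-based position `against`. */
static size_t match_length(const char *S, size_t n, size_t from, size_t against)
{
    size_t q = 0;
    while (from + q <= n && S[from - 1 + q] == S[against - 1 + q])
        q++;
    return q;
}

/* Sketch of the Z algorithm. S is a 0-based C string; Z is 1-based and needs
 * n + 1 entries. Z[0] and Z[1] are unused and set to 0. */
void z_algorithm(const char *S, size_t *Z)
{
    size_t n = strlen(S);
    size_t l = 0, r = 0;                 /* no Z-box found yet */

    Z[0] = 0;
    if (n >= 1)
        Z[1] = 0;
    for (size_t k = 2; k <= n; k++) {
        if (k > r) {                     /* case (II): compare from k against 1 */
            Z[k] = match_length(S, n, k, 1);
            if (Z[k] > 0) {
                l = k;
                r = k + Z[k] - 1;
            }
        } else {                         /* case (I): k lies inside [l, r] */
            size_t kp = k - l + 1;       /* k' */
            size_t beta = r - k + 1;     /* |beta| */
            if (Z[kp] < beta) {
                Z[k] = Z[kp];            /* case A */
            } else if (Z[kp] > beta) {
                Z[k] = beta;             /* case B */
            } else {                     /* case C: extend the match past r */
                size_t q = match_length(S, n, r + 1, beta + 1);
                Z[k] = beta + q;
                l = k;
                r = r + q;
            }
        }
    }
}
```

Only case (II) and case C perform character comparisons; cases A and B copy a value that is already known.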

Page 23: Exact String Matching Algorithms

Preprocessing

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10

Page 24: Exact String Matching Algorithms

Z-Algorithm

Time complexity
#mismatches <= number of iterations, n (each iteration ends at its first mismatch)
#matches
• Let q be the number of matches at iteration k; then r increases by at least q
• r <= n
• Thus total #matches <= n
T = O(#matches + #mismatches + #iterations) = O(n)

i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:    0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
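The #m and #mis rows can be reproduced by instrumenting the algorithm with two counters. The sketch below assumes the same case structure as the slides; the struct `z_counts` and the function names are hypothetical and exist only for this check.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counters for the analysis on the slide. */
struct z_counts {
    size_t matches;
    size_t mismatches;
};

/* Compares S from 1-based position `from` against S from 1-based position
 * `against`, counting each matching and each mismatching comparison. */
static size_t counted_match(const char *S, size_t n, size_t from,
                            size_t against, struct z_counts *c)
{
    size_t q = 0;
    while (from + q <= n) {
        if (S[from - 1 + q] != S[against - 1 + q]) {
            c->mismatches++;             /* a mismatch ends the iteration */
            break;
        }
        c->matches++;
        q++;
    }
    return q;
}

/* Z algorithm with comparison counting. Z is 1-based with n + 1 entries. */
struct z_counts z_algorithm_counts(const char *S, size_t *Z)
{
    struct z_counts c = {0, 0};
    size_t n = strlen(S);
    size_t l = 0, r = 0;

    Z[0] = 0;
    if (n >= 1)
        Z[1] = 0;
    for (size_t k = 2; k <= n; k++) {
        size_t beta = (k <= r) ? r - k + 1 : 0;
        if (k > r) {                               /* case (II) */
            Z[k] = counted_match(S, n, k, 1, &c);
            if (Z[k] > 0) {
                l = k;
                r = k + Z[k] - 1;
            }
        } else if (Z[k - l + 1] != beta) {         /* cases A and B: no comparisons */
            Z[k] = Z[k - l + 1] < beta ? Z[k - l + 1] : beta;
        } else {                                   /* case C */
            size_t q = counted_match(S, n, r + 1, beta + 1, &c);
            Z[k] = beta + q;
            l = k;
            r += q;
        }
    }
    return c;
}
```

For the 17-character example string, summing the #m row gives 12 matches and summing the #mis row gives 8 mismatches, both within the bound of n.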

Page 25: Exact String Matching Algorithms

Simplest Linear Time Exact Matching Algorithm

Input: Pattern P, Text T
Output: Occurrences of P in T

Algorithm Simplest

    S = P$T, where $ is a character that does not appear in P or T
    For i = 2; i <= |S|; i++
        Calculate Zi
        If Zi = |P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T

Running time: O(|P|+|T|+1) = O(n+m)
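A minimal sketch of this matcher in C is shown below. Several things are assumptions rather than part of the slides: the names `z_values` and `z_match`, the output-array interface, the use of 0-based indexing internally, and the compact form of the Z computation (cases A, B and C are merged into one `min` followed by explicit extension). The caller is assumed to guarantee that '$' occurs in neither P nor T.

```c
#include <stdlib.h>
#include <string.h>

/* Compact Z computation over a 0-based string of length n: Z[i] is the
 * length of the longest substring starting at index i that matches a prefix
 * of S. Z[0] is unused and set to 0. */
static void z_values(const char *S, size_t n, size_t *Z)
{
    size_t l = 0, r = 0;                 /* current Z-box is S[l..r-1] */

    if (n > 0)
        Z[0] = 0;
    for (size_t i = 1; i < n; i++) {
        size_t z = 0;
        if (i < r) {                     /* reuse the value inside the Z-box */
            z = Z[i - l];
            if (z > r - i)
                z = r - i;
        }
        while (i + z < n && S[z] == S[i + z])
            z++;                         /* extend by explicit comparison */
        Z[i] = z;
        if (i + z > r) {
            l = i;
            r = i + z;
        }
    }
}

/* Hypothetical interface: writes up to max_out 1-based start positions of P
 * in T into out and returns the total number of occurrences. Returns 0 if P
 * is empty or if memory allocation fails. */
size_t z_match(const char *P, const char *T, size_t *out, size_t max_out)
{
    size_t m = strlen(P), n = strlen(T);
    size_t total = m + 1 + n, found = 0;

    if (m == 0)
        return 0;
    char *S = malloc(total + 1);
    size_t *Z = malloc(total * sizeof *Z);
    if (S == NULL || Z == NULL) {
        free(S);
        free(Z);
        return 0;
    }
    memcpy(S, P, m);                     /* S = P$T */
    S[m] = '$';
    memcpy(S + m + 1, T, n + 1);         /* also copies the terminating '\0' */
    z_values(S, total, Z);
    for (size_t i = m + 1; i < total; i++) {
        if (Z[i] == m) {
            /* 0-based index i in S is 1-based position i - m in T. */
            if (found < max_out)
                out[found] = i - m;
            found++;
        }
    }
    free(S);
    free(Z);
    return found;
}
```

This version stores Z values for all of S for simplicity, so it uses extra space proportional to |P|+|T|.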

Page 26: Exact String Matching Algorithms

Simplest Linear Time Exact Matching Algorithm

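• Takes only O(|P|) extra space: because of $, no Zi can exceed |P|, so k' always falls inside P and only the Z values for positions in P need to be stored
• Alphabet-independent linear time

[Figure: Z-box diagram for S = P$T, with labels k, l, r, k', l', r', α, β, α', β' and the separator $]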

Page 27: Exact String Matching Algorithms

Reference

• Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms

