Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU


Page 1: Exact String Matching Algorithms

Exact String Matching Algorithms

Presented By Dr. Shazzad Hosain

Asst. Prof. EECS, NSU

Page 2: Exact String Matching Algorithms

Exact Matching: What’s the Problem

pos: 1 2 3 4 5 6 7 8 9 10 11 12
T:   b b a b a x a b a b  a  y
P = aba

P occurs in T starting at locations 3, 7, and 9.
P may overlap, as found at 7 and 9.

Page 3: Exact String Matching Algorithms

The Naive Method

• Problem is to find if a pattern P[1..m] occurs within text T[1..n]

• Let P = abxyabxz and T = xabxyabxyabxz
• Where m = 8 and n = 13

Page 4: Exact String Matching Algorithms

The Naive Method

• If P = aaa and T = aaaaaaaaaa then m = 3, n = 10
• In the worst case exactly m(n-m+1) comparisons are made
• In this case 3 × 8 = 24 comparisons, in the order of Θ(mn)

Page 5: Exact String Matching Algorithms

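The Naive Algorithm

    /* text[1..n] and pat[1..m] are 1-based; index 0 is unused.
       Report_match_at_position() is assumed to be defined elsewhere. */
    void naive_search(const char text[], int n, const char pat[], int m)
    {
        int i, j, k, lim;

        lim = n - m + 1;
        for (i = 1; i <= lim; i++) {      /* try each alignment of pat */
            k = i;
            for (j = 1; j <= m && text[k] == pat[j]; j++)
                k++;
            if (j > m)                    /* all m characters matched */
                Report_match_at_position(i);
        }
    }

• The worst-case bound can be reduced to O(m+n)
• For applications with m = 1000 and n = 10,000,000 the improvement is significant.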
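The comparison counts quoted for the naive method can be checked with a small self-contained sketch. It uses ordinary 0-based C strings rather than the 1-based arrays of the slides, and the function name `naive_match`, the output buffer and the comparison counter are illustrative assumptions, not part of the original deck.

```c
#include <stddef.h>
#include <string.h>

/* Naive exact matching on 0-based C strings.
 * Stores the 1-based start position of each occurrence of pat in text
 * into out[] (at most max_out entries are stored) and returns the
 * number of occurrences found. If comparisons is not NULL, it receives
 * the total number of character comparisons performed. */
size_t naive_match(const char *text, const char *pat,
                   size_t *out, size_t max_out, size_t *comparisons)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t found = 0, count = 0;

    if (m > 0 && m <= n) {
        for (size_t i = 0; i + m <= n; i++) {   /* each alignment of pat */
            size_t j = 0;
            while (j < m) {
                count++;                        /* one character comparison */
                if (text[i + j] != pat[j])
                    break;
                j++;
            }
            if (j == m) {                       /* whole pattern matched */
                if (found < max_out)
                    out[found] = i + 1;         /* report 1-based position */
                found++;
            }
        }
    }
    if (comparisons != NULL)
        *comparisons = count;
    return found;
}
```

Called with T = bbabaxababay and P = aba, this should report positions 3, 7 and 9; with T = aaaaaaaaaa and P = aaa it performs 3 × 8 = 24 comparisons.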

Page 6: Exact String Matching Algorithms

The Smart Algorithm

• Reasoning of this sort is the key to shifting by more than one character

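If you know that the first character of P (namely a) does not occur again in P until position 5 of P, the pattern can be shifted several places at once, which skips over three comparisons.

[Figure: P, with its positions 1 to 8 marked, aligned under T; the larger shift is shown next to the one-place shifts the naive method would make instead]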

Page 7: Exact String Matching Algorithms

The Smarter Algorithm

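[Figure, two panels: (1) one larger shift of P replaces the naive one-place shifts and skips over three comparisons; (2) after that shift, comparison starts further along P instead of at its first character, which skips another three comparisons]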

Page 8: Exact String Matching Algorithms

The Smart Algorithms

• Knuth-Morris-Pratt (KMP) Algorithm
• Boyer-Moore Algorithm
• Reduced run-time to O(n+m)

Additional knowledge requires preprocessing of strings.
Usually P is much shorter than T, so P is preprocessed.

Page 9: Exact String Matching Algorithms

The Preprocessing Approach

• Usually P is preprocessed instead of T
• Sometimes T is preprocessed, e.g. suffix tree
• The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty

• Fundamental preprocessing of P is independent of any particular algorithm

• Each algorithm uses this information

Page 10: Exact String Matching Algorithms

Basic String Definitions/Notations

• Let S be the string
• S[i..j] is the substring of S starting at position i and ending at position j; S[i..j] is empty if i > j

pos: 1 2 3 4 5 6 7 8 9 10 11 12
S:   b b a b a x a b a b  a  y

S[3..7] = abaxa
S[1..4] = bbab (a prefix)

• |S| is the length of the string. Here, |S| = 12
• S[1..i] is the prefix of S that ends at position i
• S[i..|S|] is the suffix of S that begins at position i

S[9..12] = abay (a suffix)

Page 11: Exact String Matching Algorithms

Basic String Definitions/Notations

• A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string.

• For any string S, S(i) denotes the ith character of S
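The 1-based, inclusive S[i..j] notation maps onto C's 0-based strings with an offset of one. The following is a minimal sketch of that mapping; the helper name `substring` and its buffer-based interface are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Copies S[i..j] (1-based, inclusive, as in the slide notation) into
 * dst, which has room for dst_size bytes. Yields the empty string when
 * i > j. Returns 0 on success, -1 if the range lies outside S or dst
 * is too small. */
int substring(const char *s, size_t i, size_t j, char *dst, size_t dst_size)
{
    size_t len = strlen(s);

    if (dst_size == 0)
        return -1;
    if (i > j) {                    /* S[i..j] is empty when i > j */
        dst[0] = '\0';
        return 0;
    }
    if (i < 1 || j > len || j - i + 1 >= dst_size)
        return -1;
    memcpy(dst, s + (i - 1), j - i + 1);    /* position i is index i-1 */
    dst[j - i + 1] = '\0';
    return 0;
}
```

A prefix is the special case i = 1 and a suffix is the special case j = |S|.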

Page 12: Exact String Matching Algorithms


Preprocessing

• Goal: To gather the information needed for speeding up the algorithm

• Definitions:
– Zi: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S
– Z-box: for any position i > 1 where Zi > 0, the Z-box at i starts at i and ends at i+Zi-1
– ri: for every i > 1, ri is the right-most endpoint of the Z-boxes that begin at or before i
– li: for every i > 1, li is the left endpoint of the Z-box that ends at ri

Page 13: Exact String Matching Algorithms

Preprocessing

Zi(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1

pos: 1 2 3 4 5 6 7 8 9 10 11
S:   a a b c a a b x a a  z

Z5(S) = 3 (aabc…aabx…)
Z6(S) = 1 (aa…ab…)
Z7(S) = Z8(S) = 0
Z9(S) = 2 (aab…aaz)

We will use Zi in place of Zi(S)

Z-box: defined for i > 1, where Zi is greater than zero

Figure 1.2: From Gusfield
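The definition of Zi can be evaluated directly, one position at a time, before any linear-time machinery is introduced. The sketch below does exactly that in quadratic worst-case time; the function name `z_naive` and the convention of storing 0 for the undefined Z1 are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Fills z[1..n] with the Z values of s, indexed by the 1-based
 * positions used on the slides; z must have room for n + 1 entries
 * (z[0] is unused). Z1 is not defined on the slides and is stored as 0.
 * Direct use of the definition: O(n^2) comparisons in the worst case. */
void z_naive(const char *s, size_t *z)
{
    size_t n = strlen(s);

    z[0] = 0;
    if (n > 0)
        z[1] = 0;
    for (size_t i = 2; i <= n; i++) {
        size_t len = 0;
        /* s[len] is position len+1 of the prefix,
           s[i - 1 + len] is position i+len of the string */
        while (i - 1 + len < n && s[len] == s[i - 1 + len])
            len++;
        z[i] = len;
    }
}
```

For S = aabcaabxaaz this should reproduce Z5 = 3, Z6 = 1, Z7 = Z8 = 0 and Z9 = 2.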

Page 14: Exact String Matching Algorithms

The li and ri of Z-Box

[Figure: Z-boxes drawn over S, with endpoints marked at positions 40, 50, 55, 62, 70, 78, 82, 85, 89 and 95]

ri = the right-most endpoint of the Z-boxes that begin at or before position i.

li = the left end of the Z-box that ends at ri.

r78 = 95, l78 = 78
r82 = 95, l82 = 78
r52 = 50, l52 = 40
r75 = 85, l75 = 70

Page 15: Exact String Matching Algorithms

Preprocessing

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Zi:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
ri:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
li:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
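Given the Z values, the ri and li rows follow mechanically from their definitions. A minimal sketch, assuming the Z values are already available in a 1-based array; the function name `z_boxes` is hypothetical, and when several Z-boxes end at the same ri it keeps the earliest one.

```c
#include <stddef.h>

/* Given Z values z[1..n] (1-based, z[1] ignored), fills r[1..n] and
 * l[1..n]: r[i] is the right-most endpoint of any Z-box that begins at
 * or before i, and l[i] is the left end of a Z-box ending at r[i].
 * Both are 0 while no Z-box has been seen. */
void z_boxes(const size_t *z, size_t n, size_t *r, size_t *l)
{
    size_t cur_r = 0, cur_l = 0;

    if (n >= 1) {
        r[1] = 0;
        l[1] = 0;
    }
    for (size_t i = 2; i <= n; i++) {
        /* the Z-box at i is S[i..i+Zi-1]; keep it if it reaches further right */
        if (z[i] > 0 && i + z[i] - 1 > cur_r) {
            cur_r = i + z[i] - 1;
            cur_l = i;
        }
        r[i] = cur_r;
        l[i] = cur_l;
    }
}
```

Fed the Z row for S = aabaabcaxaabaabcy, it should reproduce the ri and li rows of the table.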

Page 16: Exact String Matching Algorithms

Z-Algorithm

Goal: To calculate Zi for an input string S in linear time

Starting from k = 2, calculate Z2, r2 and l2

For k = 3, …, n (where n = |S|): in iteration k, calculate Zk, rk and lk based on Zj, rj and lj for j = 2,…,k-1

For iteration k, the algorithm only needs r_{k-1} and l_{k-1}. Thus, there is no need to keep all ri and li. We use r and l to denote r_{k-1} and l_{k-1}

Page 17: Exact String Matching Algorithms

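Z-Algorithm

In iteration k: (I) if k <= r

[Figure: the Z-box α = S[l..r] contains position k, and β = S[k..r]; the matching prefix α' = S[l'..r'] contains position k', and β' = S[k'..r']]

k' = k-l+1; r' = r-l+1; α = α'; β = β'

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y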

Page 18: Exact String Matching Algorithms

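Z-Algorithm

A) If |γ'| < |β'|, that is, Zk' < r-k+1, then Zk = Zk'

γ = γ' = γ''; x ≠ y

[Figure: γ' is the Z-box at k' and lies inside β'; γ'' is the prefix of S that it matches and γ is its copy starting at k; x follows γ'', y follows γ' and γ]

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Zi:   0  1  0  3  1  0  0  1  0  7  1  0  3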

Page 19: Exact String Matching Algorithms

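Z-Algorithm

B) If |γ'| > |β'|, that is, Zk' > r-k+1, then Zk = |β|, i.e., r-k+1

β = β' = β''; γ' = γ''
x ≠ y (because α is a Z-box)

[Figure: Z-box α with β = S[k..r], their copies α' and β', the Z-box γ' at k' extending past β', and the prefixes γ'' and β''; x follows β' and β'', y follows β]

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Zi:   0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0

Here l = 10 and r = 14. At k = 13, k' = 4 and Z4 = 3 > r-k+1 = 2, so Z13 = 2.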

Page 20: Exact String Matching Algorithms

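Z-Algorithm

C) If |γ'| = |β'|, that is, Zk' = r-k+1, then Zk ≥ |β|, i.e., Zk ≥ r-k+1

β = β' = β''; γ = γ' = γ''
x ≠ y (because α is a Z-box)
z ≠ x (because γ' is a Z-box)
z ?? y (not yet known)

Compare S[r+1..] with S[|β|+1..] until a mismatch occurs. Update Zk, r, and l

[Figure: Z-box α with β = S[k..r], their copies α' and β', the Z-box γ' at k' ending exactly at r', and the prefixes γ'' and β''; x follows β', y follows β, z follows β'']

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:    a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Zi:   0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0

Here l = 10 and r = 14. At k = 13, k' = 4 and Z4 = 2 = r-k+1, so S[15..] is compared with S[3..]: one more character matches and Z13 = 3.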

Page 21: Exact String Matching Algorithms

Z-Algorithm

(II) if k > r

Compare the characters starting at k with those starting at 1.

Update r and l if necessary

[Figure: position k lies to the right of the current Z-box S[l..r]]

Page 22: Exact String Matching Algorithms

Z-Algorithm

Input: Pattern P
Output: Zi

Calculate Z2, r2 and l2 directly by comparisons. Set r = r2 and l = l2.

for k = 3; k <= n; k++
    if k <= r
        if Z[k-l+1] < r-k+1 then Z[k] = Z[k-l+1]
        else if Z[k-l+1] > r-k+1 then Z[k] = r-k+1
        else compare the characters starting at r+1 with those starting at |β|+1. Update r and l if necessary
    else
        compare the characters starting at k with those starting at 1. Update r and l if necessary
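The case analysis above can be sketched in C as follows. This is one possible rendering under stated assumptions, not a definitive implementation: positions are kept 1-based to mirror the slides (so character i is s[i-1]), the names `z_algorithm` and `match_len` are illustrative, and iteration k = 2 is handled by the same k > r branch rather than separately.

```c
#include <stddef.h>
#include <string.h>

/* Number of matching characters when s is compared starting at 1-based
 * positions a and b (a < b), stopping at the first mismatch or at the
 * end of the string. */
static size_t match_len(const char *s, size_t n, size_t a, size_t b)
{
    size_t q = 0;
    while (b + q <= n && s[a - 1 + q] == s[b - 1 + q])
        q++;
    return q;
}

/* Fills z[2..n] with the Z values of s (1-based positions; z needs
 * n + 1 entries, and z[0], z[1] are set to 0). */
void z_algorithm(const char *s, size_t *z)
{
    size_t n = strlen(s);
    size_t l = 0, r = 0;        /* current Z-box S[l..r]; r = 0 means none yet */

    z[0] = 0;
    if (n >= 1)
        z[1] = 0;
    for (size_t k = 2; k <= n; k++) {
        if (k > r) {                        /* (II) compare from k against 1 */
            z[k] = match_len(s, n, 1, k);
            if (z[k] > 0) {
                l = k;
                r = k + z[k] - 1;
            }
        } else {                            /* (I) k lies inside S[l..r] */
            size_t kp = k - l + 1;          /* k' */
            size_t beta = r - k + 1;        /* |beta| */
            if (z[kp] < beta) {             /* case A */
                z[k] = z[kp];
            } else if (z[kp] > beta) {      /* case B */
                z[k] = beta;
            } else {                        /* case C: try to extend past r */
                size_t q = match_len(s, n, beta + 1, r + 1);
                z[k] = beta + q;
                if (q > 0) {                /* a new right-most Z-box starts at k */
                    l = k;
                    r = r + q;
                }
            }
        }
    }
}
```

Only the current l and r are kept between iterations, as noted earlier; cases A and B cost no character comparisons at all.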

Page 23: Exact String Matching Algorithms

Preprocessing

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Zi:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:    0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:    0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10

Page 24: Exact String Matching Algorithms

Z-Algorithm

Time complexity
• #mismatches <= number of iterations, n
• #matches
  – Let q be the number of matches at iteration k, then we need to increase r by at least q
  – r <= n
  – Thus total #matches <= n
• T = O(#matches + #mismatches + #iterations) = O(n)

i:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:      a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Zi:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:      0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:      0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:     0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:   0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1

Page 25: Exact String Matching Algorithms

Simplest Linear Time Exact Matching Algorithm

Input: Pattern P, Text T
Output: Occurrences of P in T

Algorithm Simplest
    S = P$T, where $ is a character that does not appear in P or T
    for i = 2; i <= |S|; i++
        Calculate Zi
        If Zi = |P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T

Running time = O(|P|+|T|+1) = O(n+m)
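A minimal sketch of this matcher in C is given below. It is an illustration under assumptions rather than the deck's own code: the names `z_values` and `z_match` are hypothetical, the separator is passed in by the caller (who must guarantee it occurs in neither string), and the Z computation is the common compact 0-based formulation, which merges cases A, B and C into a single "copy, then try to extend" step.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Compact 0-based Z computation: z[i] is the length of the longest
 * substring starting at index i that matches a prefix of s. */
static void z_values(const char *s, size_t n, size_t *z)
{
    size_t l = 0, r = 0;        /* current Z-box is s[l..r-1], r exclusive */

    if (n > 0)
        z[0] = 0;
    for (size_t i = 1; i < n; i++) {
        size_t len = 0;
        if (i < r) {            /* reuse the value at the mirrored position */
            len = z[i - l];
            if (len > r - i)
                len = r - i;
        }
        while (i + len < n && s[len] == s[i + len])
            len++;
        z[i] = len;
        if (i + len > r) {
            l = i;
            r = i + len;
        }
    }
}

/* Finds occurrences of pat in text through S = pat + sep + text.
 * Writes 1-based start positions in text to out[] (at most max_out
 * are stored) and returns the number of occurrences, or (size_t)-1 if
 * memory allocation fails. */
size_t z_match(const char *pat, const char *text, char sep,
               size_t *out, size_t max_out)
{
    size_t m = strlen(pat), n = strlen(text);
    size_t total = m + 1 + n, found = 0;
    char *s;
    size_t *z;

    if (m == 0)
        return 0;
    s = malloc(total + 1);
    z = malloc(total * sizeof *z);
    if (s == NULL || z == NULL) {
        free(s);
        free(z);
        return (size_t)-1;
    }
    memcpy(s, pat, m);
    s[m] = sep;
    memcpy(s + m + 1, text, n + 1);         /* copies the terminator too */

    z_values(s, total, z);
    for (size_t i = m + 1; i < total; i++) {
        if (z[i] == m) {                    /* Zi = |P|: occurrence found */
            if (found < max_out)
                out[found] = i - m;         /* 0-based i in S -> 1-based in T */
            found++;
        }
    }
    free(s);
    free(z);
    return found;
}
```

For simplicity this version stores Z values for all of S, so it uses O(|P|+|T|) extra space. With P = aba and T = bbabaxababay it should report positions 3, 7 and 9.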

Page 26: Exact String Matching Algorithms

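Simplest Linear Time Exact Matching Algorithm

• Takes only O(|P|) extra space: since Zi <= |P| for every i, position k' always falls inside P, so only the Z values of P need to be stored
• Alphabet-independent linear time

[Figure: S = P$T with a Z-box α and β = S[k..r] on the T side of $; their copies α' and β', and position k', lie inside P]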

Page 27: Exact String Matching Algorithms

Reference

• Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms