1 Boyer-Moore Charles Yan 2007

1

Boyer-Moore

Charles Yan, 2007

2

Exact Matching

Boyer-Moore (worst case: linear time; typical: sublinear time)

Aho-Corasick (a set of patterns)

3

Boyer-Moore

Idea 1: Right-to-left comparison

   12345678901234567
T: xpbctbxabpqxctbpq
P:   tpabxab

4

Boyer-Moore

   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab

Idea 2: Bad character rule

R(x): the position of the right-most occurrence of x in P. R(x)=0 if x does not occur. Here R(t)=1 and R(s)=5.
i: the position of the mismatch in P. Here i=3.
k: the corresponding position in T. Here k=5 and T[k]=t.

The bad character rule says P should be shifted right by max{1, i-R(T[k])}. That is, if the right-most occurrence of character T[k] in P is at position j (j<i), then P[j] should be below T[k] after the shift. Here the shift is max{1, 3-1}=2:

P:     tpabsab
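The rule above can be sketched in a few lines of Python. This is only an illustration: the function names `rightmost_occurrence` and `bad_character_shift` are made up for this example, and positions are 1-based as on the slides.

```python
def rightmost_occurrence(P):
    """Return R as a dict: R[x] = 1-based position of the right-most x in P."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos  # later (more to the right) positions overwrite earlier ones
    return R


def bad_character_shift(P, i, bad_char):
    """Shift for a mismatch at 1-based position i of P against bad_char = T[k]."""
    R = rightmost_occurrence(P)
    # R(x) = 0 when x does not occur in P, as on the slide
    return max(1, i - R.get(bad_char, 0))


R = rightmost_occurrence("tpabsab")
print(R["t"], R["s"])                           # 1 5
print(bad_character_shift("tpabsab", 3, "t"))   # 2
```

In a real matcher R would be computed once per pattern rather than on every mismatch; it is recomputed here only to keep the sketch short.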

5

Boyer-Moore

The idea of the bad character rule is to shift P by more than one character when possible.

But it has no effect if j>i, where j=R(T[k]). Unfortunately, it is often the case that j>i. In the example below, i=3 and R(t)=7, so P is shifted by only max{1, 3-7}=1.

   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat

6

Boyer-Moore

Let x=T[k], the mismatched character in T.

Idea 3: The extended bad character rule says P should be shifted right so that the closest x to the left of position i in P is below T[k].

   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:     tpabsat

7

Boyer-Moore

To use the extended bad character rule we need: for each position i of P and for each character x in the alphabet, the position of the closest occurrence of x to the left of i.

Approach 1: Two-dimensional array of size n × |Σ|, where Σ is the alphabet.

Space and time: expensive
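A minimal sketch of Approach 1, assuming the alphabet is passed in as a string of characters. The name `closest_left_table` is hypothetical. The table needs one entry per position and per character, which is why it costs n × |Σ| space and time.

```python
def closest_left_table(P, alphabet):
    """table[i][x] = largest 1-based position j < i with P[j] == x, or 0 if none."""
    n = len(P)
    table = [dict.fromkeys(alphabet, 0)]      # row 0 is unused padding
    table.append(dict.fromkeys(alphabet, 0))  # row 1: nothing lies left of position 1
    for i in range(2, n + 1):
        row = dict(table[i - 1])   # copying a full row costs |alphabet| time
        row[P[i - 2]] = i - 1      # the character at 1-based position i-1
        table.append(row)
    return table


table = closest_left_table("tpabsat", "abpst")
print(table[3]["t"])   # 1
print(table[7]["a"])   # 6
```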

8

Boyer-Moore

Approach 2: Scan P from right to left and, for each character x, maintain a list of the positions where x occurs (in decreasing order).

P: tpabsat
t: 7,1
a: 6,3
…

When P[i] is mismatched with T[k] (let x=T[k]), scan x's list, find the first number (let it be j) that is less than i, and shift P to the right so that P[j] is below T[k].

If no such j is found, then shift P past T[k].

Space and time: linear

   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:     tpabsat
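Approach 2 might look as follows in Python. The names `occurrence_lists` and `extended_bad_character_shift` are assumptions made for this sketch, and positions are 1-based to match the slides.

```python
from collections import defaultdict


def occurrence_lists(P):
    """For each character, its 1-based positions in P in decreasing order."""
    lists = defaultdict(list)
    for pos in range(len(P), 0, -1):      # scan P from right to left
        lists[P[pos - 1]].append(pos)
    return lists


def extended_bad_character_shift(lists, i, bad_char):
    """Shift for a mismatch at 1-based position i of P against bad_char = T[k]."""
    for j in lists.get(bad_char, []):
        if j < i:          # first list entry to the left of i is the closest one
            return i - j   # puts P[j] below T[k]
    return i               # no such j: move P completely past T[k]


lists = occurrence_lists("tpabsat")
print(lists["t"], lists["a"])                       # [7, 1] [6, 3]
print(extended_bad_character_shift(lists, 3, "t"))  # 2
```

The lists together hold exactly n positions, which gives the linear space bound stated above.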

9

Boyer-Moore

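Idea 4: Strong good suffix rule

t is a suffix of P that matches a substring t of T, and x≠y, where x is the character of T just left of t and y is the character of P just left of its suffix t.

t’ is the right-most copy of t in P such that t’ is not a suffix of P and z≠y, where z is the character of P just left of t’.

[Figure: T contains x followed by t; P, aligned below T, ends with y followed by t and contains z followed by t’ further to the left.]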

10

Boyer-Moore

The strong good suffix rule says: (1) if t’ exists, then shift P to the right so that t’ in P is below t in T.

   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:         qcabdabdab

[Figure: T with x followed by t; P before the shift (z t’ … y t) and after the shift, with t’ below t.]

11

Boyer-Moore

The extended bad character rule focuses on characters.
The strong good suffix rule focuses on substrings.

How do we get the information needed for the strong good suffix rule? That is, for a given t, how do we find t’?

12

Boyer-Moore

L’(i): For each i, L’(i) is the largest position less than n such that the substring P[i,…,n] matches a suffix of P[1,…,L’(i)], with the additional requirement that the character preceding that suffix is not equal to the character P[i-1].

If there is no such position, L’(i)=0.
Let t=P[i,…,n]; then L’(i) is the right end-position of t’.

[Figure: T with x followed by t; P before and after the shift, with z, t’, y, t and the positions L’(i), i and n marked.]

T: prstabstubabvqxrst
P: qcabdabdab
   1234567890
L’(9)=4, L’(10)=0, L’(8)=?, L’(7)=?, L’(6)=?

13

Boyer-Moore

Let t=P[i,…,n]; then L’(i) is the right end-position of t’.

Thus, to use the strong good suffix rule, we need to find L’(i) for every i=1,…,n.

For pattern P,

Nj is the length of the longest substring that ends at position j and that is also a suffix of P.

[Figure: P with its suffix t preceded by y, and a copy t’ of t ending at position j preceded by x; t=t’, Nj=|t’|=|t|, x≠y.]

14

Boyer-Moore

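Nj: the length of the longest substring that ends at position j and that is also a suffix of P.

Zi: the length of the longest substring of P that starts at position i and matches a prefix of P.

[Figure: for Nj, P with suffix t and a copy t’ ending at position j; for Zi, P with prefix t and a copy t’ starting at position i; in both cases the characters x and y adjacent to the two copies differ.]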

15

Boyer-Moore

N is the reverse of Z!

P: the pattern

Pr: the string obtained by reversing P

Then Nj(P) = Z(n-j+1)(Pr), that is, the Z value of Pr at position n-j+1.

      1 2 3 4 5 6 7 8 9 0          1 2 3 4 5 6 7 8 9 0
P:    q c a b d a b d a b    Pr:   b a d b a d b a c q
Nj:   0 0 0 2 0 0 5 0 0 0    Zi:   0 0 0 5 0 0 2 0 0 0

[Figure: the Nj picture for P next to the mirrored Zi picture for Pr.]
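The relation between N and Z can be checked with a short Python sketch. `z_array` below is a standard linear-time Z algorithm written for this example (0-based internally), and `n_array` is a hypothetical helper that returns N with a dummy entry at index 0 so that `N[j]` matches the slides' 1-based Nj. Following the table above, Z1 and Nn are taken to be 0.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that matches a prefix of s (0-based)."""
    n = len(s)
    z = [0] * n
    l = r = 0                      # [l, r) is the right-most prefix match found so far
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def n_array(P):
    """N[j] for j = 1..n (index 0 is padding), computed as Z(n-j+1) of the reversed pattern."""
    n = len(P)
    z = z_array(P[::-1])
    # 1-based position n-j+1 of Pr is 0-based index n-j
    return [0] + [z[n - j] for j in range(1, n + 1)]


print(z_array("badbadbacq"))      # [0, 0, 0, 5, 0, 0, 2, 0, 0, 0]
print(n_array("qcabdabdab")[1:])  # [0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
```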

16

Boyer-Moore

For pattern P,

Nj (for j=1,…,n) can be calculated in O(n) time using the Z algorithm.

Why do we need to define Nj?

To use the strong good suffix rule, we need to find L’(i) for every i=1,…,n.

We can get L’(i) from Nj!

[Figure: T with x followed by t; P before and after the shift, with z, t’, y, t and the positions L’(i), i and n marked.]

17

Boyer-Moore

For position i, let t=P[i,…,n].

L’(i) is the largest position j less than n such that Nj=|t|.

[Figure: P with z t’ ending at position L’(i) and y t occupying positions i to n; a further copy t’’ of t is also shown.]

        1 2 3 4 5 6 7 8 9 0          1 2 3 4 5 6 7 8 9 0
P:      q c a b d a b d a b    Pr:   b a d b a d b a c q
Nj:     0 0 0 2 0 0 5 0 0 0    Zi:   0 0 0 5 0 0 2 0 0 0
L’(i):  0 0 0 0 0 7 0 0 4 0

18

Boyer-Moore

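How to obtain L’(i) from Nj in linear time?

Input: Pattern P
Output: L’(i) for i=1,…,n
Algorithm
  Calculate Nj for j=1,…,n based on the Z algorithm
  for i=1; i<=n; i++
    L’(i)=0;
  for j=1; j<n; j++
    if Nj>0
      i=n-Nj+1;
      L’(i)=j;

The test Nj>0 keeps i within 1,…,n: when Nj=0, the index n-Nj+1 would be n+1. Because j increases, the last value written to L’(i) is the largest such j.

[Figure: P with z t’ ending at position j=L’(i) and y t occupying positions i to n.]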
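A minimal Python sketch of this step, assuming N is given as a list with a dummy entry at index 0 so that `N[j]` is the slides' Nj. The function name `big_l_prime` is made up for this example, and the N values are copied from the table for P = qcabdabdab.

```python
def big_l_prime(N):
    """Return L' as a list with padding at index 0, so L[i] is L'(i) for i = 1..n."""
    n = len(N) - 1
    L = [0] * (n + 1)
    for j in range(1, n):          # j < n, as in the definition of L'(i)
        if N[j] > 0:               # N[j] = 0 would give the out-of-range index n+1
            i = n - N[j] + 1
            L[i] = j               # j increases, so the largest j is kept
    return L


N = [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0]   # index 0 is padding
print(big_l_prime(N)[1:])               # [0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
```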

19

Boyer-Moore

The strong good suffix rule says: (1) if t’ exists, then shift P to the right so that t’ in P is below t in T.

   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab           i=9; L’(9)=4
P:         qcabdabdab

[Figure: T with x followed by t; P before the shift with L’(i), i and n marked, and P after the shift with t’ below t.]

20

Boyer-Moore

The strong good suffix rule:

(1) If a mismatch occurs at position i-1 of P and L’(i)>0 (i.e. t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.

(2) What if a mismatch occurs at position i-1 of P and L’(i)=0 (i.e. t’ does not exist)? We can shift P at least like this:

[Figure: T with x followed by t; P before and after the shift, with positions i and n marked.]

21

Boyer-Moore

But we can do more than that!

[Figure: T with x followed by t; P before and after the shift, with positions i and n marked.]

22

Boyer-Moore

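Observation 1: If a string is a prefix of P and is also a suffix of P, then…

[Figure: T with x followed by t; P before the shift, and P after a shift that places such a prefix of P below the matching suffix of t.]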

23

Boyer-Moore

Observation 2: If there is more than one such candidate, then shift P by the least amount.

[Figure: T with x followed by t; P before the shift, and two candidate shifted positions P1 and P2.]

24

Boyer-Moore

The strong good suffix rule: When a mismatch occurs at position i-1 of P

(1) If L’(i)>0 (i.e. t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.

(2) Else, if L’(i)=0 (i.e. t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t.

[Figure: T with x followed by t; P before and after the shift, with positions i and n marked.]

25

Boyer-Moore

l’(i): the length of the longest suffix of P[i,…,n] that is also a prefix of P. If none exists, then l’(i)=0.

l’(i) is the length of the overlap between the unshifted and shifted patterns.

[Figure: T with x followed by t; P before the shift, and candidate shifted positions P1 and P2, with i and the overlap l’(i) marked.]

26

Boyer-Moore

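l’(i) equals the largest j≤|P[i,…,n]| such that Nj=j.

1. If Nj=j, then P[1,…,j] is a prefix of P that is also a suffix of P.

2. We want the largest such j.

[Figure: P with y followed by t starting at position i; shifted copies of P whose prefixes of lengths j1 and j2 overlap the suffix of P, with l’(i) marked.]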

27

Boyer-Moore

l’(i) equals the largest j≤|P[i,…,n]| such that Nj=j.

        1 2 3 4 5 6 7 8 9 0
P:      a b d a b a b d a b
Nj:     0 2 0 0 5 0 2 0 0 0
l’(i):  5 5 5 5 5 5 2 2 2 0

28

Boyer-Moore

How to calculate l’(i) from Nj in linear time?
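One possible answer, sketched in Python: fill l’(i) from i=n down to 1. The only new candidate at step i is j=n-i+1 (the length of P[i,…,n]); if Nj=j it is the largest valid j, otherwise l’(i)=l’(i+1). This is not necessarily the method the slides had in mind. The function name `small_l_prime` is an assumption, N has a dummy entry at index 0, and Nn is taken to be 0 as in the table above.

```python
def small_l_prime(N):
    """Return l' as a list with padding at index 0, so l[i] is l'(i) for i = 1..n."""
    n = len(N) - 1
    l = [0] * (n + 2)              # l[n+1] = 0 acts as a sentinel
    for i in range(n, 0, -1):
        j = n - i + 1              # the only length that becomes allowed at step i
        l[i] = j if N[j] == j else l[i + 1]
    return l[:n + 1]


N = [0, 0, 2, 0, 0, 5, 0, 2, 0, 0, 0]   # Nj for P = abdababdab, index 0 is padding
print(small_l_prime(N)[1:])             # [5, 5, 5, 5, 5, 5, 2, 2, 2, 0]
```

Each position is visited once, so the running time is linear in n.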

29

Boyer-Moore

The strong good suffix rule: When a mismatch occurs at position i-1 of P

(1) If L’(i)>0 (i.e. t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.

(2) Else, if L’(i)=0 (i.e. t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l’(i) positions to the right.

[Figure, case (1): T with x followed by t; P before the shift with L’(i), i and n marked, and P after the shift with t’ below t.]

[Figure, case (2): T with x followed by t; P before and after the shift, with i, n and the overlap l’(i) marked.]
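The two cases can be written as one small Python function. This is a sketch under the slides' convention that the mismatch is at position i-1 and t=P[i,…,n], so it assumes 2≤i≤n. The name `good_suffix_shift` is hypothetical, and the tables carry a dummy entry at index 0.

```python
def good_suffix_shift(i, L_prime, l_prime, n):
    """Shift for a mismatch at position i-1 of P, where t = P[i..n] has matched."""
    if L_prime[i] > 0:             # case (1): t' exists
        return n - L_prime[i]
    return n - l_prime[i]          # case (2): fall back to the prefix/suffix overlap


# Tables for P = qcabdabdab (n = 10); index 0 is padding.
L_prime = [0, 0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
l_prime = [0] * 11                 # no prefix of this P is also a suffix
print(good_suffix_shift(9, L_prime, l_prime, 10))   # 6
print(good_suffix_shift(10, L_prime, l_prime, 10))  # 10
```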

30

Boyer-Moore

What if a match is found? Shift P by one position… but…

Shift P by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, shift P to the right by n-l’(2).

[Figure: T containing y followed by t; P before and after the shift.]

31

Boyer-Moore

The strong good suffix rule: When a mismatch occurs at position i-1 of P

(1) If L’(i)>0 (i.e. t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.

(2) Else, if L’(i)=0 (i.e. t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l’(i) positions to the right.

(3) If a match is found, then shift P to the right by n-l’(2).

[Figure, case (1): T with x followed by t; P before the shift with L’(i), i and n marked, and P after the shift with t’ below t.]

[Figure, case (2): T with x followed by t; P before and after the shift, with i, n and the overlap l’(i) marked.]

32

Boyer-Moore

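The extended bad character rule vs. the strong good suffix rule

   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:    qcabdabdab          (extended bad character rule: shift by 1)
P:         qcabdabdab     (strong good suffix rule: shift by 6)

   123456789012345678
T: prstabstuqabvqxrst
P:   qcabdabdab
P:          qcabdabdab    (extended bad character rule: shift by 7)
P:         qcabdabdab     (strong good suffix rule: shift by 6)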

33

Boyer-Moore

Shift P by the largest amount given by either of the rules. That results in the Boyer-Moore algorithm!

Input: Text T (of length m) and pattern P (of length n)
Output: The occurrences of P in T
Algorithm Boyer-Moore
  Compute L’(i), l’(i), and R(x)
  k=n;
  while (k≤m) do
    i=n;
    h=k;
    while i>0 and P[i]=T[h] do
      i--;
      h--;
    if i=0
      report an occurrence of P in T ending at position k;
      k=k+n-l’(2);
    else
      shift P (increase k) by the maximum amount determined by the extended bad character rule and the good suffix rule.

[Figure: P aligned below T; i is the current position in P, h the corresponding position in T, and k the position in T below the right end of P; t is the part of P already matched.]
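The pseudocode above can be turned into a compact Python sketch. It is an illustration rather than a reference implementation: all function names are made up for this example, positions are 1-based as on the slides (hence the `- 1` when indexing Python strings), and a mismatch at the last position of P (empty t) is assumed to give a good suffix shift of 1, a case the slides do not spell out. Note that in the main loop i is the mismatch position itself, so the rule tables are read at i+1.

```python
from collections import defaultdict


def z_array(s):
    """Standard Z algorithm (0-based); z[0] is left at 0."""
    n, l, r = len(s), 0, 0
    z = [0] * n
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def preprocess(P):
    """Return (L', l', occurrence lists); L' and l' are 1-based with padding."""
    n = len(P)
    z = z_array(P[::-1])
    N = [0] + [z[n - j] for j in range(1, n + 1)]   # Nj(P) = Z(n-j+1)(Pr)
    L = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
    l = [0] * (n + 2)                               # l[n+1] = 0 is a sentinel
    for i in range(n, 0, -1):
        j = n - i + 1
        l[i] = j if N[j] == j else l[i + 1]
    occ = defaultdict(list)                         # positions in decreasing order
    for pos in range(n, 0, -1):
        occ[P[pos - 1]].append(pos)
    return L, l, occ


def boyer_moore(T, P):
    """Return the 1-based end positions k of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    L, l, occ = preprocess(P)
    hits = []
    k = n
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:       # right-to-left comparison
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k)
            k += n - l[2]
        else:
            # Extended bad character rule: closest T[h] to the left of i in P.
            bad = i
            for j in occ.get(T[h - 1], []):
                if j < i:
                    bad = i - j
                    break
            # Strong good suffix rule; the matched suffix is t = P[i+1..n].
            if i == n:
                good = 1                            # assumption: empty t, shift by one
            elif L[i + 1] > 0:
                good = n - L[i + 1]
            else:
                good = n - l[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("abracadabra", "abra"))              # [4, 11]
print(boyer_moore("prstabstubabvqxrst", "qcabdabdab")) # []
```

On the second call the first mismatch already yields a shift of 10 from the good suffix rule, so the search ends after a single alignment.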