
Approximate Boyer-Moore String Matching
Source: SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp. 243-260
J. Tarhio and E. Ukkonen
Advisor: Prof. R. C. T. Lee
Speaker: Kuei-hao Chen


The k mismatches problem
The k differences problem

Definition of the k mismatches problem
Given a pattern string P of length m and a text string T of length n, we would like to find all approximate occurrences of P in T with at most k mismatches.

If k=1, then
[Figure: Text and Pattern aligned, with one mismatching pair of characters a and b.]

Consider the following situation, where a pattern P is aligned with a window W of T and there are already (k+1) mismatches:

[Figure: P aligned under the window W of T, with k+1 mismatches marked.]

Since there are already (k+1) mismatches, we must move the pattern. The following is obvious:

P must be moved to such an extent that there are at most k mismatches between a suffix S of W and a substring S of P.

[Figure: P shifted under T so that a substring S of P is aligned with the suffix S of W, after the k+1 mismatches.]

Our trick is as follows: Consider the (k+1)-suffix of W. There are two cases:

Case 1: There is one character in this (k+1)-suffix which exists in P in such a way as shown below. Move the pattern to match these characters. Note that in such a situation, there are at most k mismatches between the (k+1)-suffix and its corresponding substring in P.

[Figure: a character x in the (k+1)-suffix of W and an occurrence of x in P, shown before the shift and after the shift that aligns them.]

Case 2: No such character exists. Move the pattern so that the k-prefix of P aligns with the k-suffix of W, as shown below. In this situation, again, there are at most k mismatches between the k-suffix of W and the k-prefix of P.

[Figure: the k-prefix of P aligned under the k-suffix of W; the (k+1)-suffix of W is marked on T.]

The generalization of the BM algorithm to the k mismatches problem is natural: for k=0 the generalized algorithm reduces to exact string matching.

Recall that the k mismatches problem asks for finding all occurrences of P in T such that in at most k positions of P, T and P have different characters.

We scan the pattern from right to left until we have found k+1 mismatches (unsuccessful search) or the pattern ends (successful search).

Preprocessing phase for approximate matching

Dk table
For a character a of the alphabet and an end position i with m-k ≤ i ≤ m, Dk[i, a] is the distance from position i back to the rightmost occurrence of a in p1...pi-1. If a does not occur there, Dk[i, a] = m.

Example: Let k=1, m=8, P=GCAGAGAG.

i:  1 2 3 4 5 6 7 8
P:  G C A G A G A G

a:            A  C  G  *
D1[i=8, a]    1  6  2  8
D1[i=7, a]    2  5  1  8

(* stands for any other character.)

Algorithm for preprocessing phase
P = p1p2...pm, T = t1t2...tn
Preprocessing
  For a ∈ Σ Do
    For j = m downto m-k Do
    Begin
      dk[j, a] ← m
      Find the occurrence of character a closest to the left of pj. If it is found, compute the distance between its position and j and store it in dk[j, a].
    End
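The preprocessing step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name build_dk and the dictionary layout are assumptions made here, positions are 1-based as on the slides, and characters missing from the dictionary are taken to have the default value m.

```python
def build_dk(pattern, k):
    """Sketch of the Dk table as a dict keyed by (i, a), with 1-based i.

    dk[(i, a)] is the distance from position i back to the rightmost
    occurrence of a in p1..p(i-1). Pairs that are absent from the dict
    stand for the default value m (a does not occur to the left of i).
    """
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < m")
    dk = {}
    for i in range(m, m - k - 1, -1):      # end positions m, m-1, ..., m-k
        for s in range(1, i):              # s = distance to the left of i
            a = pattern[i - s - 1]         # p(i-s) in 1-based notation
            dk.setdefault((i, a), s)       # keep only the smallest distance
    return dk


if __name__ == "__main__":
    P = "GCAGAGAG"
    table = build_dk(P, 1)
    for i in (8, 7):
        print(i, [table.get((i, a), len(P)) for a in "ACGT"])
```

Looking up a character with table.get((i, a), m) keeps the table small, since only characters that occur in P are stored.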

Algorithm for searching phase
P = p1p2...pm, T = t1t2...tn
Searching
  j = m;
  While j ≤ n+k Do
  Begin
    h = j; i = m; mismatch = 0;
    While i > 0 and mismatch ≤ k Do
    Begin
      d = min(dk[i, th], dk[i-1, th-1]);
      If th ≠ pi Then mismatch = mismatch+1;
      i = i-1; h = h-1
    End of while;
    If mismatch ≤ k Then report match at position j;
    j = j+d
  End of while
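A runnable sketch of the searching phase is given below, under some stated assumptions. The shift d is taken as the minimum of dk over the last k+1 characters of the window and is capped at m-k, which is what the shift of 3 in Example 1 (m=4, k=1) suggests. The loop stops at j ≤ n rather than n+k, so that only windows lying fully inside T are examined. The names build_dk and k_mismatch_search are hypothetical.

```python
def build_dk(pattern, k):
    """Dk table as a dict keyed by (i, a), 1-based i; missing pairs mean m."""
    m = len(pattern)
    dk = {}
    for i in range(m, m - k - 1, -1):
        for s in range(1, i):
            dk.setdefault((i, pattern[i - s - 1]), s)
    return dk


def k_mismatch_search(text, pattern, k):
    """Return the 1-based end positions j of windows with at most k mismatches."""
    n, m = len(text), len(pattern)
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < m")
    dk = build_dk(pattern, k)
    matches = []
    j = m                                   # end of the current window in T
    while j <= n:                           # assumption: stay inside the text
        h, i, mismatch = j, m, 0
        d = m - k                           # assumption: shift is capped at m-k
        while i > 0 and mismatch <= k:      # scan the window right to left
            c = text[h - 1]
            if i >= m - k:                  # last k+1 window characters
                d = min(d, dk.get((i, c), m))
            if c != pattern[i - 1]:
                mismatch += 1
            i -= 1
            h -= 1
        if mismatch <= k:
            matches.append(j)
        j += d
    return matches


if __name__ == "__main__":
    print(k_mismatch_search("TTAACGTAATGCAGCTA", "AGCT", 1))
    print(k_mismatch_search("GCATCGCAGAGAGTATGCAGAGCG", "GCAGAGAG", 1))
```

Because the inner loop only stops after k+1 mismatches, it always visits the last k+1 window positions, so d is fully computed before the shift is applied.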

Complete example for approximate string matching
Example 1:
Let k=1, m=4, n=17

T: TTAACGTAATGCAGCTA
P: AGCT

Example 1 (1/6) to (5/6)
[Figures: successive alignments of P under T.]

Example 1 (6/6)
j ← j + d = 16 + 3 = 19
jump out of while loop

Example 2:
Let k=1, m=8, n=24

T: GCATCGCAGAGAGTATGCAGAGCG
P: GCAGAGAG

Example 2 (1/14) to (14/14)
[Figures: successive alignments of P under T, with text positions 1 to 24 marked.]

Example 2 (5/14)
Report match at position j = 13;
j ← j + d = 13 + 2 = 15

Example 2 (14/14)
Report match at position j = 24;
j ← j + d = 24 + 2 = 26
jump out of while loop

Time complexity
Preprocessing phase: O(m + kc) time and O(kc) space, where c is the alphabet size.
Searching phase: O(mn) time in the worst case.

Definition of the k differences problem
Given a pattern string P of length m and a text string T of length n, we would like to find all approximate occurrences of P in T with edit distance not larger than k.

The basic approach to solving the problem is to compute the edit distance between T(1, i) and P for every i [Ukk85b]: Let Edit be an (m+1) by (n+1) table such that Edit(i, j) is the minimum edit distance between p1p2...pj and any substring of T ending at ti.

Table Edit must be completely evaluated column-by-column in time O(mn).
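As an illustration of this baseline, the sketch below evaluates the table one text position at a time and keeps only the entries Edit(i, m). It assumes unit costs for insertion, deletion and substitution; the function names last_row and approx_ends are made up for this example.

```python
def last_row(pattern, text):
    """Return [Edit(i, m) for i = 0..n].

    Edit(i, m) is the smallest edit distance between the whole pattern
    and any substring of the text that ends at position i.
    """
    m = len(pattern)
    col = list(range(m + 1))          # i = 0: Edit(0, j) = j
    row = [col[m]]
    for ch in text:
        new = [0] * (m + 1)           # Edit(i, 0) = 0: a match may start anywhere
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            new[j] = min(col[j - 1] + cost,   # diagonal arc (match or substitution)
                         col[j] + 1,          # arc that consumes a text character
                         new[j - 1] + 1)      # arc that consumes a pattern character
        col = new
        row.append(col[m])
    return row


def approx_ends(pattern, text, k):
    """1-based end positions i with Edit(i, m) <= k."""
    return [i for i, d in enumerate(last_row(pattern, text)) if i > 0 and d <= k]


if __name__ == "__main__":
    print(last_row("abc", "xabcx"))
    print(approx_ends("GCAGAGAG", "GCATCGCAGAGAGTATGCAGAGCG", 1))
```

Every text position costs O(m) work here, which is the O(mn) bound that the filtering idea on the following slides tries to avoid.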

If we can identify the positions i at which the edit distance must exceed k, we may skip them.

This paper is based upon Rule 7 proposed by Professor Lee.

Rule 7
If k characters in String A do not appear in String B, Distance(A, B) is not smaller than k.
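Rule 7 can be checked on small inputs with the sketch below. It assumes that Distance is the unit-cost edit distance; rule7_lower_bound and edit_distance are names chosen for this illustration only.

```python
def rule7_lower_bound(a, b):
    """Count the characters of a (with multiplicity) that do not occur in b."""
    present = set(b)
    return sum(1 for ch in a if ch not in present)


def edit_distance(a, b):
    """Unit-cost edit distance between a and b (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # delete ca
                           cur[j - 1] + 1))           # insert cb
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    A, B = "GCATC", "GAGAG"
    print(rule7_lower_bound(A, B), edit_distance(A, B))
```

Each character of A that is absent from B has to be deleted or substituted, so the count is a lower bound on the distance and is much cheaper to compute than the distance itself.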

In the scanning phase, we define some terms first.

A diagonal h of Edit, for h=-m, ..., n, consists of all Edit(i, j) such that i-j=h.

For every Edit(i, j), there is a minimizing arc from Edit(i-1, j) to Edit(i, j) if Edit(i, j)=Edit(i-1, j)+1, from Edit(i, j-1) to Edit(i, j) if Edit(i, j)=Edit(i, j-1)+1, and from Edit(i-1, j-1) to Edit(i, j) if Edit(i, j)=Edit(i-1, j-1) where pj=ti or if Edit(i, j)=Edit(i-1, j-1)+1 where pj≠ti. The costs of the arcs are 1, 1, 0 and 1, respectively.

[Figure: the minimizing arcs into Edit(i, j) from Edit(i-1, j), Edit(i, j-1) and Edit(i-1, j-1), labelled Deletion, Insertion and Substitution; the diagonal arc costs 0 when pj=ti and 1 when pj≠ti.]

A minimizing path is any path that consists of minimizing arcs and leads from an entry Edit(i, 0) on the first row of Edit to an entry Edit(h, m) on the last row of Edit. A minimizing path is successful if it leads to an entry Edit(h, m) ≤ k.

Lemma 1: The entries on a successful minimizing path M are contained in k+1 successive diagonals of Edit.
Proof: The path moves to a new diagonal only through an insertion or a deletion. If the path touched more than (k+1) diagonals, it would contain more than k insertions or deletions, and its cost would exceed k. Thus there cannot be more than (k+1) diagonals.

[Figure: the table Edit with the text t1t2...tn along one axis and the pattern p1p2...pm along the other; a successful minimizing path stays inside a band of successive diagonals.]

Example: T: ABCABBA, P: CBABAC
S: C-AB--
P: CBABAC
EDIT(P, S)=3
There are (k+1)=3+1=4 successive diagonals because there are three deletions.

Example: T: BCABDAB, P: CBADB, k=3
S: C-ABDAB
P: CBA-D-B
EDIT(P, S)=3
There are 1+2=3 successive diagonals.

By Lemma 1, for each diagonal d, any successful minimizing path starting at the top of this diagonal will have a bandwidth of 1+k+k=2k+1.

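[Figure: the table Edit with the text t1t2...tn and the pattern p1p2...pm; a path M starting on diagonal h stays within a band of width 2k+1, k diagonals on each side of h.]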

Example: T: ABCABBA, P: CBABAC, k=3
S: C-AB--
P: CBABAC
EDIT(P, S)=3
Result: the successful minimizing path lies entirely within the band of width 2k+1=7 of Edit.

We give the part of the pattern that falls inside this band a name: the k-environment.

For each j=1, ..., m, let the k-environment of the pattern symbol pj be the string Cj=pj-k...pj+k, where pa=ε for a<1 and a>m.

[Figure: the pattern P with the k-environment pj-k ... pj-1 pj pj+1 ... pj+k marked around pj.]

The longest vertical path in any minimizing path has length not greater than 2k+1. We only have to determine whether ti appears in the k environment of pj.

[Figure: the band of width 2k+1 around diagonal h in Edit; ti is compared with the k-environment of pj.]

Given T=ATGCGAGAGAT, P=GCAGAGAGATG, and k=2, we select three characters: t5, t8 and t11.

The 2-environment for t5 is C5=p3p4p5p6p7=AGAGA.
The 2-environment for t8 is C8=p6p7p8p9p10=GAGAT.
The 2-environment for t11 is C11=p9p10p11=ATG.
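The three environments above can be reproduced with a few lines of Python. The helper name k_environment is an assumption of this sketch; j is 1-based, as on the slides, and positions outside the pattern are simply dropped, which corresponds to treating them as the empty string.

```python
def k_environment(pattern, j, k):
    """Return Cj = p(j-k) ... p(j+k) for a 1-based index j, clipped to the pattern."""
    m = len(pattern)
    if not 1 <= j <= m:
        raise ValueError("j must satisfy 1 <= j <= m")
    lo = max(1, j - k)                 # first position of the environment
    hi = min(m, j + k)                 # last position of the environment
    return pattern[lo - 1:hi]          # convert to a 0-based slice


if __name__ == "__main__":
    P = "GCAGAGAGATG"
    for j in (5, 8, 11):
        print(j, k_environment(P, j, 2))
```

Testing whether a text character occurs in Cj is then a plain membership test, for example `"G" in k_environment(P, 5, 2)`.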

[Figure: T and P aligned; t5, t8 and t11 are compared with the 2-environments of p5, p8 and p11.]

We now obtain a stronger version of Rule 7.

Lemma 2: Let a successful minimizing path M go through some entry on a diagonal h of Edit. Then for at most k indexes j, 1 ≤ j ≤ m, character th+j does not occur in the k-environment Cj.

A formal proof can be found in the paper. In the following, we give some intuition for it.

In this case (k=1), although there are two mismatches, deleting the character a that mismatches x gives a perfect match. Thus the edit distance between T and P may still be 1.

[Figure: T and P aligned, with the characters x, y and a, x, b, y marked.]

In this case (k=1), deleting one character in P will not result in a perfect match. Thus, the edit distance between T and P must be larger than 1.

[Figure: T and P aligned, with the characters x, y, a, b and c marked.]

The shift table is based on table Dk. We determine the first diagonal after h, say h+d, where at least one of the characters th+m, th+m-1, ..., th+m-k matches the corresponding character of P. Finally, the maximum of k+1 and d is the length of the shift.

When the algorithm finds a possible occurrence of P in T, the dynamic programming approach is used immediately to compute the alignment.

Algorithm
Input: P = p1p2...pm, T = t1t2...tn and k
Output: All occurrences of P in T
Initially, the start position h of T = 0;
While h+m ≤ n+k do
begin
  i = h+m; j = m; bad = 0;
  While j > k and bad ≤ k do
  begin
    If ti does not occur in Cj then bad = bad+1;
    j = j-1; i = i-1
  end;
  If bad ≤ k then
    W is the sequence from th-k to th+m.
    Use dynamic programming to align W with P.
    Output the alignment result.
  Calculate the shift d = min(Dk[m, tr], Dk[m-1, tr-1], ..., Dk[m-k, tr-k]), where r = h+m;
  h = h+max(k+1, d)
end;
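The filtering test at the heart of this algorithm can be sketched as follows. This is a simplified illustration rather than the full method: it examines every diagonal in turn (a shift of 1) instead of using the Dk shift, it scans all m pattern positions as the slide's worked example does, and it leaves out the dynamic programming verification. The names count_bad and candidate_diagonals are hypothetical.

```python
def count_bad(text, pattern, k, h):
    """Count the indexes j for which t(h+j) does not occur in the k-environment Cj.

    Positions are 1-based as on the slides; h is the diagonal (0-based start
    of the window). The scan runs right to left and stops once bad exceeds k.
    """
    m = len(pattern)
    bad = 0
    for j in range(m, 0, -1):
        lo, hi = max(1, j - k), min(m, j + k)
        environment = pattern[lo - 1:hi]          # Cj, clipped to the pattern
        if text[h + j - 1] not in environment:    # t(h+j)
            bad += 1
            if bad > k:                           # by Lemma 2, no occurrence here
                break
    return bad


def candidate_diagonals(text, pattern, k):
    """Diagonals that pass the filter and still need a dynamic programming check."""
    n, m = len(text), len(pattern)
    return [h for h in range(n - m + 1) if count_bad(text, pattern, k, h) <= k]


if __name__ == "__main__":
    T = "GCATCGCAGAGAGTATGCAGAGCG"
    P = "GCAGAGAG"
    print(count_bad(T, P, 1, 0))
    print(candidate_diagonals(T, P, 1))
```

Only the diagonals returned by candidate_diagonals would be passed on to the dynamic programming step, so the quadratic work is confined to a few windows of the text.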

Complete example for approximate string matching
For example: Let k=1, m=8, n=24

T: GCATCGCAGAGAGTATGCAGAGCG
P: GCAGAGAG

Example (1/15), k=1
t8=A appears in P(7,8)
t7=C does not appear in P(6,8)
t6=G appears in P(5,7)
t5=C does not appear in P(4,6)
bad=2>k. Shifting is needed now.

Example (2/15), k=1
t9=G appears in P(7,8)
t8=A appears in P(6,8)
t7=C does not appear in P(5,7)
t6=G appears in P(4,6)
t5=C does not appear in P(3,5)
bad=2>k. Shifting is needed now.

Example (3/15), k=1
t11=G appears in P(7,8)
t10=A appears in P(6,8)
t9=G appears in P(5,7)
t8=A appears in P(4,6)
t7=C does not appear in P(3,5)
t6=G appears in P(2,4)
t5=C appears in P(1,3)
t4=T does not appear in P(1,2)
bad=2>k. Shifting is needed now.

Example (4/15) to (8/15), k=1
Output:
W= CGCAGAGAGT
P= GCAGAGAG
[Figure: dynamic programming table Edit for W and P.]

Alignment (4/15):
GCAGAGAG
GCAGAGAG

Alignment (5/15):
GCAGAGA-
GCAGAGAG

Alignment (6/15):
-CAGAGAG
GCAGAGAG

Alignment (7/15):
CGCAGAGAG
-GCAGAGAG

Alignment (8/15):
GCAGAGAGT
GCAGAGAG-

Example (9/15), k=1
t15=A appears in P(7,8)
t14=T does not appear in P(6,8)
t13=G appears in P(5,7)
t12=A appears in P(4,6)
t11=G appears in P(3,5)
t10=A appears in P(2,4)
t9=G appears in P(1,3)
t8=A does not appear in P(1,2)
bad=2>k. Shifting is needed now.

Example (10/15), k=1
t16=T does not appear in P(7,8)
t15=A appears in P(6,8)
t14=T does not appear in P(5,7)
bad=2>k. Shifting is needed now.

Example (11/15), k=1
t18=C does not appear in P(7,8)
t17=G appears in P(6,8)
t16=T does not appear in P(5,7)
bad=2>k. Shifting is needed now.

Example (12/15), k=1
t19=A appears in P(7,8)
t18=C does not appear in P(6,8)
t17=G appears in P(5,7)
t16=T does not appear in P(4,6)
bad=2>k. Shifting is needed now.

Example (13/15), k=1
t20=G appears in P(7,8)
t19=A appears in P(6,8)
t18=C does not appear in P(5,7)
t17=G appears in P(4,6)
t16=T does not appear in P(3,5)
bad=2>k. Shifting is needed now.

Example (14/15), k=1
t22=G appears in P(7,8)
t21=A appears in P(6,8)
t20=G appears in P(5,7)
t19=A appears in P(4,6)
t18=C does not appear in P(3,5)
t17=G appears in P(2,4)
t16=T does not appear in P(1,3)
bad=2>k. Shifting is needed now.

Example (15/15), k=1
W= TGCAGAGCG
P= GCAGAGAG
Result:
GCAGAGCG
GCAGAGAG
[Figure: dynamic programming table Edit for W and P.]
jump out of while loop

Time complexity
Preprocessing phase and searching phase: O(mn/k) time and O(|Σ|n) space.

References
[Bae89a] R. Baeza-Yates, Efficient Text Searching. Ph.D. Thesis, Report CS-89-17, University of Waterloo, Computer Science Department, 1989.
[Bae89b] R. Baeza-Yates, String searching algorithms revisited. In: Proceedings of the Workshop on Algorithms and Data Structures (ed. F. Dehne et al.), Lecture Notes in Computer Science 382, Springer-Verlag, Berlin, 1989, pp. 75-96.
[BoM77] R. Boyer and S. Moore, A fast string searching algorithm. Communications of the ACM 20, 1977, pp. 762-772.
[ChL90] W. Chang and E. Lawler, Approximate string matching in sublinear expected time. In: Proceedings of the 31st IEEE Annual Symposium on Foundations of Computer Science, 1990, pp. 116-124.
[Fel65] W. Feller, An Introduction to Probability Theory and Its Applications. Vol. I. John Wiley & Sons, 1965.
[Fel66] W. Feller, An Introduction to Probability Theory and Its Applications. Vol. II. John Wiley & Sons, 1966.
[GaG86] Z. Galil and R. Giancarlo, Improved string matching with k mismatches. SIGACT News, Vol. 17, 1986, pp. 52-54.
[GaG88] Z. Galil and R. Giancarlo, Data structures and algorithms for approximate string matching. Journal of Complexity, Vol. 4, 1988, pp. 33-72.
[GaP89] Z. Galil and K. Park, An improved algorithm for approximate string matching. Proceedings of the 16th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science 372, Springer-Verlag, Berlin, 1989, pp. 394-404.
[GrL89] R. Grossi and F. Luccio, Simple and efficient string matching with k mismatches. Information Processing Letters, Vol. 33, 1989, pp. 113-120.
[Hor80] N. Horspool, Practical fast searching in strings. Software Practice & Experience, Vol. 10, 1980, pp. 501-506.
[JTU90] P. Jokinen, J. Tarhio and E. Ukkonen, A comparison of approximate string matching algorithms. In preparation.
[Kos88] S. R. Kosaraju, Efficient string matching. Extended abstract. Johns Hopkins University, 1988.
[KMP77] D. Knuth, J. Morris and V. Pratt, Fast pattern matching in strings. SIAM Journal on Computing, Vol. 6, 1977, pp. 323-350.
[LaV88] G. Landau and U. Vishkin, Fast string matching with k differences. Journal of Computer and System Sciences, Vol. 37, 1988, pp. 63-78.
[LaV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching. Journal of Algorithms, Vol. 10, 1989, pp. 157-169.
[Sel80] P. Sellers, The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, Vol. 1, 1980, pp. 359-372.
[Ukk85a] E. Ukkonen, Algorithms for approximate string matching. Information and Control, Vol. 64, 1985, pp. 100-118.
[Ukk85b] E. Ukkonen, Finding approximate patterns in strings. Journal of Algorithms, Vol. 6, 1985, pp. 132-137.
[UkW90] E. Ukkonen and D. Wood, Fast approximate string matching with suffix automata. Report A-1990-4, Department of Computer Science, University of Helsinki, 1990.
[WaF75] R. Wagner and M. Fischer, The string-to-string correction problem. Journal of the ACM, Vol. 21, 1975, pp. 168-173.

THANK YOU