
Rules for Approximate String Matching

R.C.T. Lee


Rule 1

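Consider two substrings A1 and A2, each split into a prefix and a suffix as shown below. Here ed(X, Y) denotes the edit distance between X and Y.

A1 = P1 S1

A2 = P2 S2

If ed(A1, A2) ≦ k and S1 = S2, then ed(P1, P2) ≦ k.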
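Rule 1 can be checked on a small case. The following is a minimal sketch; the helper names `ed` and `strip_common_suffix` are illustrative and do not come from the slides. It removes the longest common suffix of two strings and confirms that the remaining prefixes are still within k.

```python
def ed(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def strip_common_suffix(a1: str, a2: str) -> tuple[str, str]:
    """Return (P1, P2) after removing the longest common suffix S1 = S2."""
    n = 0
    while n < len(a1) and n < len(a2) and a1[-1 - n] == a2[-1 - n]:
        n += 1
    return a1[:len(a1) - n], a2[:len(a2) - n]


a1, a2, k = "abcXYZ", "abdXYZ", 1
p1, p2 = strip_common_suffix(a1, a2)
if ed(a1, a2) <= k:
    assert ed(p1, p2) <= k  # Rule 1: the prefixes are also within k
```

In practice this lets a matcher discard an already-matched suffix and continue the comparison on the shorter prefixes only.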


• Rule 1: [AKLLLR2000], [H2005], [HHLS2006], [JB2000], [LV89], [NB99], [NB2000], [S80], [TU93] and [WM92].


Rule 2

If ed(A, B) ≦ k and B has length m, then the length of A must be between m-k and m+k.
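Rule 2 gives a cheap pre-filter: candidates whose length falls outside [m-k, m+k] can be discarded without computing any edit distance. A minimal sketch, with the hypothetical function name `length_filter`:

```python
def length_filter(candidates: list[str], pattern: str, k: int) -> list[str]:
    """Keep only candidates whose length is within k of the pattern length.

    By Rule 2, every discarded candidate has edit distance > k to the pattern.
    Survivors are only *possible* matches and still need a full check.
    """
    m = len(pattern)
    return [c for c in candidates if m - k <= len(c) <= m + k]


survivors = length_filter(["cat", "cart", "cathedral", ""], "cat", 1)
print(survivors)
```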


• Rule 2: [FN2004], [NB99], [NB2000] and [TU93].


Rule 3

If S1 contains S1’ completely and the distance between S1’ and every substring of P is larger than k, then ed(S1, P) > k.
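To apply Rule 3 one needs the smallest distance between a string and any substring of P. The sketch below assumes unit-cost edit distance and uses the standard dynamic program whose first row is all zeros, so that an occurrence may start anywhere in P. The names `min_substring_distance` and `rule3_rejects` are illustrative.

```python
def ed(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def min_substring_distance(piece: str, p: str) -> int:
    """Minimum of ed(piece, x) over all substrings x of p."""
    prev = [0] * (len(p) + 1)          # row 0 is free: a match may start anywhere in p
    for i, c in enumerate(piece, 1):
        cur = [i]
        for j, d in enumerate(p, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c != d)))
        prev = cur
    return min(prev)                   # a match may end anywhere in p


def rule3_rejects(s1_part: str, p: str, k: int) -> bool:
    """True if a part S1' of S1 already proves ed(S1, P) > k."""
    return min_substring_distance(s1_part, p) > k


s1, part, p, k = "abzzzc", "zzz", "abcabc", 2
if rule3_rejects(part, p, k):
    assert ed(s1, p) > k   # consistent with Rule 3
```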


• Rule 3: [ALP2004].


Rule 4

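For any substring S1 in T, if there exists a substring S2 in P to the left of S1 such that ed(S1, S2) ≦ k, and S2 is the rightmost such substring, then move P so that S2 is aligned with S1.

[Figure: T containing S1; P containing S2, shown before and after the move that aligns S2 with S1.]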


• Rule 4: [ALP2004].


Based upon Rule 3 and Rule 2, we have Rule 5

If the window size is (m-k) and there exists a substring S1 in the window such that the distance between S1 and every substring of P is larger than k, then we can safely move P as follows:

[Figure: T with S1 inside the window of length m-k aligned with P, shown before and after P is moved.]


If Rule 5 is not satisfied, it means the following:

For every substring S1 in the window of T, there exists a substring S2 in P such that ed(S1, S2) ≦ k.


Rule 5-1

If Rule 5 is not satisfied, we can only move P by 1 step as follows:

[Figure: T with S1 inside the window of length m-k aligned with P, shown before and after P is moved by 1 step.]


• Rule 5: [HN2005].


Rule 6

For two strings A and B of equal length, Hamming Distance(A, B) ≧ Edit Distance(A, B).
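A short illustration of Rule 6, assuming unit-cost edit distance; the function names are illustrative. Substitutions alone always turn A into B when the lengths are equal, so the Hamming distance can never be smaller than the edit distance, and a rotated string shows how large the gap can be.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def ed(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


a, b = "abcdef", "bcdefa"          # b is a rotated by one position
assert hamming(a, b) >= ed(a, b)   # Rule 6
print(hamming(a, b), ed(a, b))
```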


• Rule 6: [AKLLLR2000], [FN2004] and [TU93].


Rule 7

For strings A and B, if there are k+1 characters in A which do not appear in B, then ed(A, B) > k.

Rule 7-1

Let A and B be two strings. Let there be k+1 characters a1, a2, …, ak+1 in A, where each ai is aligned with bi in B. If no ai appears in B[i-k, i+k], then ed(A, B) > k.
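Both forms of Rule 7 reduce to counting "bad" characters of A. The sketch below is one way to write the two counts; it assumes 0-based positions with A[i] aligned to B[i], and the function names are illustrative.

```python
def chars_missing_from(a: str, b: str) -> int:
    """Rule 7: count the characters of a that occur nowhere in b."""
    alphabet = set(b)
    return sum(1 for c in a if c not in alphabet)


def bad_positions(a: str, b: str, k: int) -> int:
    """Rule 7-1: count positions i where a[i] is absent from b[i-k .. i+k]."""
    count = 0
    for i, c in enumerate(a):
        window = b[max(0, i - k): i + k + 1]   # clamp the window at the left edge
        if c not in window:
            count += 1
    return count


def rule7_rejects(a: str, b: str, k: int) -> bool:
    """True if either count reaches k+1, which implies ed(a, b) > k."""
    return chars_missing_from(a, b) > k or bad_positions(a, b, k) > k


print(rule7_rejects("hello", "world", 1))
```

The windowed count is never smaller than the global count, so Rule 7-1 rejects at least as often as Rule 7.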


• Rule 7: [TU93].


Rule 8

Let there be two strings A and B. Let B be divided into j pieces B1, B2, …, Bj. If ed(A, B) ≦ k, there is at least one piece Bi and a substring Ai in A such that $\mathrm{ed}(A_i, B_i) \le \lfloor k/j \rfloor$.


Rule 8-1

Let A and B be two strings. Let B be divided into j pieces B1, B2, …, Bj. If for every Bi and every substring S of A, $\mathrm{ed}(S, B_i) > \lfloor k/j \rfloor$, then ed(A, B) > k.
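The partition rule can be sketched as a filter. The code below reads the bound in Rule 8 and Rule 8-1 as ⌊k/j⌋, which is an assumption based on the pigeonhole argument (k errors spread over j pieces leave some piece with at most ⌊k/j⌋ of them). Splitting B into nearly equal pieces is also an assumption, and all function names are illustrative.

```python
def min_substring_distance(piece: str, a: str) -> int:
    """Minimum unit-cost edit distance between piece and any substring of a."""
    prev = [0] * (len(a) + 1)          # a match may start anywhere in a
    for i, c in enumerate(piece, 1):
        cur = [i]
        for j, d in enumerate(a, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c != d)))
        prev = cur
    return min(prev)


def split_into_pieces(b: str, j: int) -> list[str]:
    """Split b into j consecutive pieces of nearly equal length."""
    base, extra = divmod(len(b), j)
    pieces, start = [], 0
    for i in range(j):
        end = start + base + (1 if i < extra else 0)
        pieces.append(b[start:end])
        start = end
    return pieces


def rule8_1_rejects(a: str, b: str, k: int, j: int) -> bool:
    """True if no piece of b occurs in a within floor(k/j) errors."""
    threshold = k // j
    return all(min_substring_distance(piece, a) > threshold
               for piece in split_into_pieces(b, j))


print(rule8_1_rejects("xxxxxxxx", "abcdef", 2, 3))
```

With j = k+1 the threshold becomes 0, so the filter only needs exact searches for the pieces.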


Rule 8-2

Let A and B be two strings. Let the lengths of A and B be m+k and m respectively. Let B be divided into j pieces B1, B2, …, Bj. Let AP be a prefix of A.

If for every Bi and every substring S of A, $\mathrm{ed}(S, B_i) > \lfloor k/j \rfloor$, then ed(AP, B) > k.


• Rule 8: [NB99] and [NB2000].


Rule 9

Let A and B be two strings with lengths m+k and m respectively. Let A’ be the prefix of A with length m-k. Let there be j characters a1, a2, …, aj in A’. Let the number of times that ai appears in A’ and B be N(A’, ai) and N(B, ai) respectively. Let Ci = N(A’, ai) - N(B, ai). Let AP be any prefix of A.

If $\sum_{C_i > 0} C_i > k$, then ed(AP, B) > k.
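Rule 9 is a counting filter. The sketch below reads the condition as "the sum of the positive Ci exceeds k", which is an assumption about the original formula; the function names are illustrative. It counts how many characters of the prefix A’ cannot be paired with an equal character of B.

```python
from collections import Counter


def excess_count(a_prefix: str, b: str) -> int:
    """Sum of the positive Ci = N(A', ai) - N(B, ai) over the characters of A'."""
    in_prefix, in_b = Counter(a_prefix), Counter(b)
    # Counter returns 0 for characters that do not occur in b.
    return sum(max(0, n - in_b[ch]) for ch, n in in_prefix.items())


def rule9_rejects(a: str, b: str, k: int) -> bool:
    """True if the prefix of a with length m-k already has more than k excess characters."""
    m = len(b)
    a_prefix = a[:max(0, m - k)]
    return excess_count(a_prefix, b) > k


print(rule9_rejects("bxyzana", "banana", 1))
```

Every excess character must be deleted or substituted in any alignment that covers A’, which is why more than k of them rules out a match for every prefix AP.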


Rule 9-1

Let A and B be two strings with lengths m+k and m respectively. Let there be j characters a1, a2, …, aj in A. Let the number of times that ai appears in A and B be N(A, ai) and N(B, ai) respectively. Let Ci = N(B, ai) - N(A, ai). Let AP be any prefix of A.

If $\sum_{C_i > 0} C_i > k$, then ed(AP, B) > k.


Rule 10

Let P and T be two strings with lengths m and n respectively. If P matches a substring P’ of T at position i, then any substring S of T[i-k, i+m+k] may satisfy ed(S, P) ≦ k.

[Figure: P aligned with P’ in T at position i; the region T[i-k, i+m+k] has length m+2k.]


• Rule 10: [NB99].


Rule 11

Let P and Q be two strings. Let P be divided as follows:

P = P1 P2 … Pn

Let Qi be the substring of Q for which ed(Pi, Qi) is the smallest.

If $\sum_{i=1}^{n} \mathrm{ed}(P_i, Q_i) > k$, then $\mathrm{ed}(P, Q) > k$.
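Rule 11 can be read as a lower bound: the best-match distances of the pieces add up to at most ed(P, Q). The sketch below assumes that reading of the formula and unit-cost edit distance; the function names are illustrative.

```python
def min_substring_distance(piece: str, q: str) -> int:
    """Minimum unit-cost edit distance between piece and any substring of q."""
    prev = [0] * (len(q) + 1)          # a match may start anywhere in q
    for i, c in enumerate(piece, 1):
        cur = [i]
        for j, d in enumerate(q, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c != d)))
        prev = cur
    return min(prev)


def rule11_lower_bound(pieces: list[str], q: str) -> int:
    """Sum over the pieces Pi of the distance to their closest substring Qi of q."""
    return sum(min_substring_distance(p_i, q) for p_i in pieces)


def rule11_rejects(pieces: list[str], q: str, k: int) -> bool:
    """True if the lower bound already exceeds k, so ed(P, Q) > k."""
    return rule11_lower_bound(pieces, q) > k


pieces = ["abz", "zcd"]                # P = "abzzcd" divided into two pieces
print(rule11_lower_bound(pieces, "abcd"), rule11_rejects(pieces, "abcd", 1))
```

Because the bound is a running sum, a scan can stop as soon as the partial sum over the first few pieces exceeds k.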


Application of Rule 11

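[Figure: a window W of T divided into pieces t1, t2, …, tn, each ti paired with a substring Pi of P.]

Pi is the substring of P for which ed(ti, Pi) is the smallest.

If for some n, $\sum_{i=1}^{n} \mathrm{ed}(t_i, P_i) > k$, then $\mathrm{ed}(W, P) > k$.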


• [AKLLLR2000] Text Indexing and Dictionary Matching with One Error, Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N. and Rodeh, M., Journal of Algorithms, Vol. 37, 2000, pp. 309-325.

• [ALP2004] Faster Algorithms for String Matching with k Mismatches, Amir, A., Lewenstein, M. and Porat, E., Journal of Algorithms, Vol. 50, 2004, pp. 257-275.

• [FN2004] Average-Optimal Multiple Approximate String Matching, Kimmo Fredriksson, Gonzalo Navarro, ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, pp. 1-47.


• [GG86] Improved String Matching with k Mismatches, Galil, Z. and Giancarlo, R., SIGACT News, Vol. 17, No. 4, 1986, pp. 52-54.

• [H2005] Bit-parallel approximate string matching algorithms with transposition, Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229.

• [HHLS2006] Approximate String Matching Using Compressed Suffix Arrays, Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249.


• [HN2005] Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyro and Gonzalo Navarro, Algorithmica, Vol 4, No. 3, 2005, pp.203-231.

• [JB2000] Approximate string matching using factor automata, Jan Holub, Borivoj Melichar, Theoretical Computer Science 249, 2000, pp. 305-311.

• [LV86] String Matching with k Mismatches by Using Kangaroo Method, Landau, G.M., and Vishkin, U., Theoret. Comput Sci 43, 1986, pp. 239-249.


• [LV89] Fast Parallel and Serial Approximate String Matching, G. Landau and U. Vishkin, Journal of algorithms, 10, 1989, pp.157-169.

• [NB99] Very fast and simple approximate string matching, G. Navarro and R. Baeza-Yates, Information Processing Letters, Vol. 72, 1999, pp.65-70.

• [NB2000] A Hybrid Indexing Method for Approximate String Matching, Gonzalo Navarro and Ricardo Baeza-Yates, 2000, No. 1, Vol. 1, pp. 205-239.


• [S80] String Matching with Errors, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.

• [TU93] Approximate Boyer-Moore String Matching, J. Tarhio and E. Ukkonen, SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp.243-260.

• [WM92] Fast Text Searching: Allowing Errors, Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91.