Rules for Approximate String Matching

R.C.T. Lee

Rule 1

Consider two strings A1 and A2, each split into a prefix and a suffix as shown below:

A1 = P1 S1

A2 = P2 S2

If ed(A1, A2) ≦ k and S1 = S2, then ed(P1, P2) ≦ k.
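
Rule 1 can be checked on a small example. The sketch below assumes that ed is the unit-cost Levenshtein distance; the names `ed` and `strip_common_suffix` are illustrative and do not come from the slides.

```python
def ed(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def strip_common_suffix(a1: str, a2: str):
    """Return (P1, P2) after removing the longest common suffix S1 = S2."""
    n = 0
    while n < min(len(a1), len(a2)) and a1[-1 - n] == a2[-1 - n]:
        n += 1
    return a1[:len(a1) - n], a2[:len(a2) - n]


a1, a2 = "kitten_xyz", "sitting_xyz"
p1, p2 = strip_common_suffix(a1, a2)   # P1 = "kitten", P2 = "sitting"
k = ed(a1, a2)
# Rule 1: the prefixes are within the same bound k.
print(p1, p2, ed(p1, p2) <= k)
```

In practice this means a shared suffix can be dropped before running the more expensive distance computation.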

• Rule 1: [AKLLLR2000], [H2005], [HHLS2006], [JB2000], [LV89], [NB99], [NB2000], [S80], [TU93] and [WM92].

Rule 2

Let m be the length of B. If ed(A, B) ≦ k, then the length of A must be between m-k and m+k.
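
Rule 2 works as a cheap pre-filter before any distance is computed. A minimal sketch, assuming candidates are plain Python strings; `passes_length_filter` and the sample words are hypothetical.

```python
def passes_length_filter(a: str, m: int, k: int) -> bool:
    """Rule 2: ed(a, b) <= k is only possible if m-k <= len(a) <= m+k, with m = len(b)."""
    return m - k <= len(a) <= m + k


pattern = "survey"          # m = 6
k = 1
candidates = ["sur", "surgery", "surveys", "surveillance"]

# Only candidates whose length lies in [m-k, m+k] = [5, 7] need further checking.
kept = [c for c in candidates if passes_length_filter(c, len(pattern), k)]
print(kept)
```

The filter never discards a true match; it only removes strings that cannot be within k edits because of their length.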

• Rule 2: [FN2004], [NB99], [NB2000] and [TU93].

Rule 3

If S1 contains S1’ completely and the distance between S1’ and every substring of P is larger than k, then ed(S1, P) > k.

(Figure: S1 aligned with P; S1’ lies inside S1.)
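
The condition of Rule 3 needs the smallest distance between S1’ and any substring of P. The sketch below computes it with a Sellers-style dynamic program (first row set to zero so that a match may start anywhere in P). It assumes unit-cost edit distance; the function names and sample strings are illustrative.

```python
def ed(a: str, b: str) -> int:
    """Unit-cost edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def min_ed_to_substring(s: str, p: str) -> int:
    """Smallest ed(s, x) over all substrings x of p."""
    prev = [0] * (len(p) + 1)          # an alignment may start at any position of p
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, cp in enumerate(p, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != cp)))
        prev = cur
    return min(prev)                   # an alignment may end at any position of p


P = "abcabc"
S1 = "abzzzca"
S1_sub = "zzz"                         # S1' is contained in S1
k = 2

if min_ed_to_substring(S1_sub, P) > k:
    # Rule 3 applies: S1 cannot be within k edits of P.
    print("rejected, ed(S1, P) =", ed(S1, P))
```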

• Rule 3: [ALP2004].

Rule 4

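For any substring S1 in T, if there exists a substring S2 in P to the left of S1 such that ed(S1, S2) ≦ k, and S2 is the rightmost such substring, then move P so that S2 is aligned with S1.

(Figure: T with S1; P before the move, with S2 to the left of S1; P after the move, with S2 aligned with S1.)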

• Rule 4: [ALP2004].

Based upon Rule 3 and Rule 2, we have Rule 5

If the window size is (m-k) and there exists a substring S1 in the window such that the distance between S1 and any substring of P is larger than k, then we can safely move P as follows:

(Figure: T with S1 inside the window of size m-k aligned with P, shown before and after P is moved.)

If Rule 5 is not satisfied, it means the following:

For every substring S1 of T in the window, there exists a substring S2 in P such that ed(S1, S2) ≦ k.

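Rule 5-1

If Rule 5 is not satisfied, we can move P by only one step, as follows:

(Figure: T with S1 inside the window of size m-k aligned with P, shown before and after P is moved by one step.)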

• Rule 5: [HN2005].

Rule 6

For strings A and B of equal length, Hamming Distance(A, B) ≧ Edit Distance(A, B).
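
A small example shows how large the gap in Rule 6 can be. This is a sketch assuming unit-cost edit distance; `hamming` and `ed` are illustrative helper names.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def ed(a: str, b: str) -> int:
    """Unit-cost edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


a, b = "abcdef", "bcdefa"      # b is a rotated by one position
print(hamming(a, b), ed(a, b))  # every position mismatches, yet two edits suffice
```

Here the Hamming distance is 6 while the edit distance is 2 (delete the leading "a", append it at the end), so a k-mismatch match is always a k-difference match but not the other way round.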

• Rule 6: [AKLLLR2000], [FN2004] and [TU93].

Rule 7

For strings A and B, if there are k+1 characters in A which do not appear in B, then ed(A, B) > k.
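
Rule 7 can be applied with a single pass over A. The sketch below counts occurrences in A of characters that are absent from B, on the assumption that each such occurrence costs at least one edit; `rule7_rejects` is a hypothetical name.

```python
def rule7_rejects(a: str, b: str, k: int) -> bool:
    """True if at least k+1 character occurrences in a do not appear anywhere in b."""
    chars_of_b = set(b)
    missing = sum(1 for ch in a if ch not in chars_of_b)
    return missing >= k + 1


# 'x', 'y' and 'z' never occur in "abc", so three edits are unavoidable.
print(rule7_rejects("xaybzc", "abc", 2))   # rejected for k = 2
print(rule7_rejects("xaybzc", "abc", 3))   # not rejected for k = 3
```

A False result does not mean that ed(A, B) ≦ k; it only means that this filter cannot rule the pair out.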

Rule 7-1

Let A and B be two strings. Let a1, a2, …, ak+1 be k+1 characters in A, where ai is aligned with bi in B. If no ai appears in B[i-k, i+k], then ed(A, B) > k.

• Rule 7: [TU93].

Rule 8

Let there be two strings A and B. Let B be divided into j pieces B1, B2, …, Bj. If ed(A, B) ≦ k, there is at least one piece Bi and one substring Ai in A such that ed(Ai, Bi) ≦ ⌊k/j⌋.

Rule 8-1

Let A and B be two strings. Let B be divided into j pieces B1, B2, …, Bj. If for every Bi and every substring S of A, ed(S, Bi) > ⌊k/j⌋, then ed(A, B) > k.
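
Rule 8-1 is the basis of partition filtering. The sketch below splits B into j near-equal pieces and rejects A when no piece comes within ⌊k/j⌋ of any substring of A. It assumes unit-cost edit distance and 1 ≦ j ≦ |B|; the function names and the way B is split are illustrative choices.

```python
def min_ed_to_substring(s: str, text: str) -> int:
    """Smallest ed(s, x) over all substrings x of text (Sellers-style DP)."""
    prev = [0] * (len(text) + 1)
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(text, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return min(prev)


def split_pieces(b: str, j: int) -> list:
    """Split b into j contiguous pieces whose lengths differ by at most one."""
    q, r = divmod(len(b), j)
    pieces, start = [], 0
    for i in range(j):
        end = start + q + (1 if i < r else 0)
        pieces.append(b[start:end])
        start = end
    return pieces


def rule8_1_rejects(a: str, b: str, k: int, j: int) -> bool:
    """True if every piece of b is farther than floor(k/j) from all substrings of a."""
    threshold = k // j
    return all(min_ed_to_substring(piece, a) > threshold
               for piece in split_pieces(b, j))


print(rule8_1_rejects("xxxxxxxx", "abcdef", k=2, j=3))  # no piece occurs in A
print(rule8_1_rejects("abzdef", "abcdef", k=1, j=2))    # "def" occurs exactly in A
```

With j = k+1 the threshold is 0, so the filter reduces to exact searching for the pieces.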

Rule 8-2

Let A and B be two strings. Let the lengths of A and B be m+k and m respectively. Let B be divided into j pieces B1, B2, …, Bj. Let AP be a prefix of A.

If for every Bi and every substring S of A, ed(S, Bi) > ⌊k/j⌋, then ed(AP, B) > k.

• Rule 8: [NB99] and [NB2000].

Rule 9

Let A and B be two strings with lengths m+k and m respectively. Let A’ be the prefix of A with length m-k. Let there be j characters a1, a2, …, aj in A’. Let the number of times that ai appears in A’ and B be N(A’, ai) and N(B, ai) respectively. Let Ci = N(A’, ai) - N(B, ai). Let AP be any prefix of A.

If Σ_{Ci>0} Ci > k, then ed(AP, B) > k.

Rule 9-1

Let A and B be two strings with lengths m+k and m respectively. Let there be j characters a1, a2, …, aj in A. Let the number of times that ai appears in A and B be N(A, ai) and N(B, ai) respectively. Let Ci = N(B, ai) - N(A, ai). Let AP be any prefix of A.

If Σ_{Ci>0} Ci > k, then ed(AP, B) > k.

Rule 10

Let P and T be two strings with lengths m and n respectively. If P matches a substring P’ of T at position i, then any substring S of T[i-k, i+m+k] may satisfy ed(S, P) ≦ k.

(Figure: P aligned with P’ at position i of T; the region of T from i-k to i+m+k has length m+2k.)

• Rule 10: [NB99].

Rule 11

Let P and Q be two strings. Let P be divided as follows:

P = P1 P2 … PN

For each i, let Qi be the substring of Q for which ed(Pi, Qi) is the smallest.

If Σ_{i=1..N} ed(Pi, Qi) > k, then ed(P, Q) > k.
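
The sum in Rule 11 is a lower bound on ed(P, Q) and can be computed piece by piece. The sketch below assumes unit-cost edit distance and takes the pieces of P as a list; `rule11_lower_bound` and the sample strings are illustrative.

```python
def min_ed_to_substring(s: str, text: str) -> int:
    """Smallest ed(s, x) over all substrings x of text (Sellers-style DP)."""
    prev = [0] * (len(text) + 1)
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(text, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return min(prev)


def rule11_lower_bound(pieces: list, q: str) -> int:
    """Sum over i of ed(Pi, Qi), where Qi is the best-matching substring of q for Pi."""
    return sum(min_ed_to_substring(piece, q) for piece in pieces)


pieces = ["abc", "def", "ghi"]        # P = "abcdefghi"
Q = "abxdexghx"
k = 2

bound = rule11_lower_bound(pieces, Q)
if bound > k:
    print("ed(P, Q) >", k, "because the per-piece distances already sum to", bound)
```

The loop can stop as soon as the running sum exceeds k, which is how the rule is used on partial sums in a text window.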

Application of Rule 11

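(Figure: a window W of T containing substrings t1, t2, …, tn, and P divided into pieces P1, P2, …, Pn.)

For each i, ti is the substring of W for which ed(ti, Pi) is the smallest.

If for some n, Σ_{i=1..n} ed(ti, Pi) > k, then ed(W, P) > k.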

• [AKLLLR2000] Text Indexing and Dictionary Matching with One Error, Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N. and Rodeh, M., Journal of Algorithms, Vol. 37, 2000, pp. 309-325.

• [ALP2004] Faster Algorithms for String Matching with k Mismatches, Amir, A., Lewenstein, M. and Porat, E., Journal of Algorithms, Vol. 50, 2004, pp. 257-275.

• [FN2004] Average-Optimal Multiple Approximate String Matching, Kimmo Fredriksson and Gonzalo Navarro, ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, pp. 1-47.

• [GG86] Improved String Matching with k Mismatches, Galil, Z. and Giancarlo, R., SIGACT News, Vol. 17, No. 4, 1986, pp. 52-54.

• [H2005] Bit-parallel approximate string matching algorithms with transposition, Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229.

• [HHLS2006] Approximate String Matching Using Compressed Suffix Arrays, Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249.

• [HN2005] Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyrö and Gonzalo Navarro, Algorithmica, Vol. 4, No. 3, 2005, pp. 203-231.

• [JB2000] Approximate string matching using factor automata, Jan Holub and Borivoj Melichar, Theoretical Computer Science, Vol. 249, 2000, pp. 305-311.

• [LV86] String Matching with k Mismatches by Using Kangaroo Method, Landau, G. M. and Vishkin, U., Theoretical Computer Science, Vol. 43, 1986, pp. 239-249.

• [LV89] Fast Parallel and Serial Approximate String Matching, G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10, 1989, pp. 157-169.

• [NB99] Very fast and simple approximate string matching, G. Navarro and R. Baeza-Yates, Information Processing Letters, Vol. 72, 1999, pp. 65-70.

• [NB2000] A Hybrid Indexing Method for Approximate String Matching, Gonzalo Navarro and Ricardo Baeza-Yates, Vol. 1, No. 1, 2000, pp. 205-239.

• [S80] String Matching with Errors, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.

• [TU93] Approximate Boyer-Moore String Matching, J. Tarhio and E. Ukkonen, SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp. 243-260.

• [WM92] Fast Text Searching: Allowing Errors, Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91.
