A Hybrid Indexing Method for Approximate String Matching
Gonzalo Navarro and Ricardo Baeza-Yates, Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239
Advisor: Prof. R. C. T. Lee
Speaker: Y. K. Shieh

The approximate string matching problem is:

Given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.

This paper uses an exhaustive searching mechanism. We open a window T' in T with size m+k (Rule 2) and try to determine whether we can be sure that every prefix T'' of this window T' has ed(T'',P) > k.

If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix T'' of the window T' has ed(T'',P) ≤ k.

We use dynamic programming to compute the edit distance between two strings. A matrix C[0..m, 0..n] is filled, where C[j,i] represents the minimum number of operations needed to match T[1..i] to P[1..j]. This is computed as follows:

C[j,0] = j, C[0,i] = i,
C[j,i] = C[j-1,i-1] if P[j] = T[i], and C[j,i] = 1 + min(C[j-1,i], C[j,i-1], C[j-1,i-1]) otherwise.

C[j,0] and C[0,i] represent the edit distance between a string of length j or i and the empty string.

Example: T = surgery, P = survey, k = 2.
There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.

        s  u  r  g  e  r  y
     0  1  2  3  4  5  6  7
  s  1  0  1  2  3  4  5  6
  u  2  1  0  1  2  3  4  5
  r  3  2  1  0  1  2  3  4
  v  4  3  2  1  1  2  3  4
  e  5  4  3  2  2  1  2  3
  y  6  5  4  3  3  2  2  2
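The matrix above can be reproduced with a few lines of Python. This is only a minimal sketch of the standard edit-distance recurrence; the function names edit_distance_matrix and prefixes_within are hypothetical and do not come from the paper.

```python
def edit_distance_matrix(pattern, text):
    """Fill C[j][i] = edit distance between pattern[:j] and text[:i]."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j              # pattern prefix against the empty string
    for i in range(n + 1):
        C[0][i] = i              # empty string against a text prefix
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


def prefixes_within(pattern, text, k):
    """Return the prefixes of text whose edit distance to pattern is <= k."""
    last_row = edit_distance_matrix(pattern, text)[len(pattern)]
    return [text[:i] for i in range(len(text) + 1) if last_row[i] <= k]
```

Calling prefixes_within("survey", "surgery", 2) reads the answer off the last row of the matrix, which is the row labeled y in the table.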

Let us now see how we can be sure that for a window T' with size m+k, for every prefix T'' of T', ed(T'',P) > k.

We present Lemma 1 of this paper as follows.

Lemma 1. Let T' in T and P be two strings such that ed(T', P) ≤ k. Let P = P1 x1 P2 x2 ... x(j-1) Pj, for strings Pi and xi and for any j ≥ 1. Then, at least one string Pi appears in T' with at most ⌊k/j⌋ errors.

Thus, we always divide the pattern into j pieces. We shall point out how to divide later.

To be more precise, if ed(T',P) ≤ k, there exists a piece Pi in P and a substring T'' in T' such that ed(Pi,T'') ≤ ⌊k/j⌋.

Lemma 1 tells us that if for all Pi in P and every substring b in T', ed(Pi,b) > ⌊k/j⌋, then ed(P,T') > k.

Suppose that there is a window T' with size m+k and, for all Pi in P and for every substring b in T', ed(Pi,b) > ⌊k/j⌋.

Then, we can be sure that for every prefix T'' of T', for all Pi in P and every substring b in T'', ed(Pi,b) > ⌊k/j⌋.

[Figure: the text T, a window T' of T, a prefix T'' of T' containing a substring b, and the pattern P containing a piece Pi.]

Let us define the following condition.

Condition A: For all Pi in P and every substring b in T', ed(Pi, b) > ⌊k/j⌋.

Thus, if Condition A is satisfied, then for every prefix T'' of T', ed(T'',P) > k.

In such a case, we ignore T' and shift P one step to the right.

Question: how can we be sure that Condition A is satisfied?
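To make Condition A concrete, the sketch below tests it directly on one window by dynamic programming. This is a stand-in for illustration only: the paper answers the question with an index on T, not with this scan. The names min_substring_distance and condition_a are hypothetical, and the pieces are assumed to be given as a list.

```python
def min_substring_distance(piece, window):
    """Smallest edit distance between piece and any substring of window.

    Same recurrence as the ordinary edit-distance matrix, except that the
    top row is all zeros, so a match may start anywhere in window.
    """
    col = list(range(len(piece) + 1))    # column for the empty text prefix
    best = col[-1]
    for ch in window:
        new = [0]
        for j in range(1, len(piece) + 1):
            if piece[j - 1] == ch:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j - 1], col[j], new[j - 1]))
        col = new
        best = min(best, col[-1])
    return best


def condition_a(pieces, window, k):
    """True if every piece needs more than floor(k / j) errors everywhere in window."""
    allowed = k // len(pieces)
    return all(min_substring_distance(p, window) > allowed for p in pieces)
```

With the pieces CA and AG and k = 1, the window GGACC satisfies Condition A and can be skipped, while the window GACAC does not, because CA occurs in it exactly.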

The approach:

For each Pi, we generate all possible modified strings Pi' whose distances with Pi are smaller than or equal to k.

After generating all possible modified strings Pi', we may use the suffix tree of T to find all occurrences of Pi', for all i, in T with at most ⌊k/j⌋ errors.

We still have the following questions:
Question 1. How to divide P into j pieces?
Question 2. How to generate all modified Pi's?
Question 3. How to find the occurrences of the Pi's in T with edit distance less than or equal to ⌊k/j⌋?

Question 1: How to divide P into j pieces? It can be proved that an optimal method is to partition P into j pieces with j = (m + k) / log_σ n, where σ is the alphabet size. We get j pieces of P, and the size of every piece is around log_σ n.

Question 2. How to generate all modified Pi's?

The generation of all modified strings whose distances with P are at most k is straightforward. One method can be found in [HHLS2006], which was reported by C. W. Lu.

Another method can be found in [HM2007], reported by L. C. Chen.

In this paper, the authors used the second method mentioned in [HM2007].

We can use non-deterministic finite automata (NFA). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states.

P = abac, k = 2. The finite automaton M accepts Lk(P).

Lk(P) = {aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

[Figure: NFA for P = abac and k = 2, with states 0,0; 1,0; 2,0; 3,0; 4,0 (no error), 1,1; 2,1; 3,1; 4,1 (one error) and 2,2; 3,2; 4,2 (two errors). State 1,0: one matched, no error. State 1,1: one matched, one error. State 2,2: two matched, two errors. State 4,0: four matched, no error.]
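The idea of tracking states "i matched, e errors" can be sketched by simulating an NFA as a set of active states. Note the assumption: the code below is the textbook edit-distance (Levenshtein) automaton, which also allows inserted characters, so the language it accepts need not coincide with the set Lk(P) listed above. The function names step, closure and accepts are hypothetical.

```python
def closure(states, m, k):
    """Add the states reachable by skipping pattern characters (deletions)."""
    result = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in result:
            result.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return result


def step(states, ch, pattern, k):
    """Advance a set of (matched, errors) states on the input character ch."""
    m = len(pattern)
    nxt = set()
    for i, e in states:
        if i < m and pattern[i] == ch:
            nxt.add((i + 1, e))            # match, no new error
        if e < k:
            nxt.add((i, e + 1))            # ch is an extra (inserted) character
            if i < m:
                nxt.add((i + 1, e + 1))    # ch replaces pattern[i]
    return closure(nxt, m, k)


def accepts(pattern, text, k):
    """True if text can be read with all of pattern matched and at most k errors."""
    states = closure({(0, 0)}, len(pattern), k)
    for ch in text:
        states = step(states, ch, pattern, k)
        if not states:
            return False                   # out of active states
    return any(i == len(pattern) for i, _ in states)
```

For example, accepts("abac", "aa", 2) succeeds through the states (0,0), (1,0), (2,1), (3,1), (4,2), that is, by skipping b and c. An empty state set corresponds to the situation the slides call "out of active states".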

Example: recognizing aa, which belongs to Lk(P) for P = abac and k = 2.

[Figure: the same NFA, used to recognize the string aa.]

Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1.

j = (m + k) / log_σ n = (4 + 1) / log_3 17 = 1.9388. Therefore, we partition P into two pieces: P1 = CA and P2 = AG.

According to Lemma 1, at least one piece appears in substrings of T with at most ⌊k/j⌋ = ⌊1/2⌋ = 0 errors. This means that we want to find exact matches of P1 and P2.
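The arithmetic of this step can be checked with a short sketch. Two things here are assumptions rather than statements from the slides: the value of j is rounded to the nearest integer (the slides only show 1.9388 becoming two pieces), and the pieces are taken as consecutive, nearly equal-length substrings with empty separators. The names number_of_pieces and split_pattern are hypothetical.

```python
import math


def number_of_pieces(m, k, n, sigma):
    """j = (m + k) / log_sigma(n), rounded to the nearest integer, at least 1."""
    j = (m + k) / math.log(n, sigma)
    return max(1, round(j))


def split_pattern(pattern, j):
    """Split pattern into j consecutive pieces of nearly equal length."""
    m = len(pattern)
    bounds = [round(i * m / j) for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]
```

With m = 4, k = 1, n = 17 and σ = 3, number_of_pieces returns 2, and split_pattern("CAAG", 2) gives the two pieces CA and AG used in the rest of the example.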

NFA with k = 1 of P1 = CA and NFA with k = 1 of P2 = AG:

[Figure: two NFAs, each with states 0,0; 1,0; 2,0 (zero error) and 1,1; 2,1 (one error). The transitions of the first are labeled C and A, those of the second A and G.]

T = GACACGGACCAAAGCAG. We construct the suffix tree of T.

[Figure: suffix tree of T$, whose leaves are labeled with the starting positions 1 to 17 of the suffixes.]

We only need to consider the suffix tree from the root down to depth 3.

[Figure: the suffix tree of T truncated at depth 3.]
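The role of the truncated tree can be illustrated with a naive depth-limited suffix trie. This is a simplification assumed for the sketch: a real suffix tree is compacted and built in linear time, and the terminator $ is left out here. The names build_suffix_trie and exact_occurrences are hypothetical.

```python
def build_suffix_trie(text, depth):
    """Naive trie of all suffixes of text, truncated at the given depth.

    Each node is a dict mapping a character to a child node; the extra key
    "pos" holds the 1-based start positions of the suffixes passing through.
    """
    root = {"pos": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node.setdefault(ch, {"pos": []})
            node["pos"].append(start + 1)
    return root


def exact_occurrences(trie, piece):
    """Walk down the trie along piece and return the start positions found."""
    node = trie
    for ch in piece:
        if ch not in node:
            return []
        node = node[ch]
    return node["pos"]
```

On T = GACACGGACCAAAGCAG with depth 3, looking up CA returns positions 3, 10 and 15, and looking up AG returns positions 13 and 16.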

We traverse this part of the suffix tree and feed the characters on each path to the NFA of P1 and the NFA of P2 (T = GACACGGACCAAAGCAG, k = 1). After each character, each NFA is in one of three situations: exact match, not exact match (still active), or out of active states.

[Figures: step-by-step traversal of the truncated suffix tree, showing for each step the character read and the status of the NFA of P1 and the NFA of P2.]

We record positions 13 and 16 where AG occurs.

We record positions 3, 10 and 15 where CA occurs.

After we find all probable positions in T, we verify every substring of those positions. The probable positions of T are: 3, 10, 13, 15, 1
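The verification step can be sketched as follows. The window bounds are an assumption of this sketch, not a rule quoted from the slides: if a piece that starts at 0-based offset in P is found at 1-based position pos in T, a match of P with at most k errors must lie between positions pos - offset - k and pos - offset + m + k - 1. The names match_ends and verify are hypothetical.

```python
def match_ends(pattern, text, k):
    """0-based indices i such that some substring of text ending at i
    is within edit distance k of pattern."""
    col = list(range(len(pattern) + 1))
    ends = []
    for i, ch in enumerate(text):
        new = [0]                          # a match may start anywhere
        for j in range(1, len(pattern) + 1):
            if pattern[j - 1] == ch:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j - 1], col[j], new[j - 1]))
        col = new
        if col[-1] <= k:
            ends.append(i)
    return ends


def verify(text, pattern, k, candidates):
    """candidates: (pos, offset) pairs as described above.
    Returns the sorted 1-based end positions of the verified matches."""
    m = len(pattern)
    found = set()
    for pos, offset in candidates:
        lo = max(1, pos - offset - k)                    # leftmost possible start
        hi = min(len(text), pos - offset + m + k - 1)    # rightmost possible end
        window = text[lo - 1:hi]
        found.update(lo + e for e in match_ends(pattern, window, k))
    return sorted(found)
```

For the running example, the candidates would be (3, 0), (10, 0) and (15, 0) for CA, which starts at offset 0 of P, and (13, 2) and (16, 2) for AG, which starts at offset 2. Only the text around these positions is examined by dynamic programming; the rest of T is never touched.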