A Hybrid Indexing Method for Approximate String Matching
Gonzalo Navarro and Ricardo Baeza-Yates, Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239
Advisor: Prof. R. C. T. Lee
Speaker: Y. K. Shieh
The approximate string matching problem is:
Given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.
This paper uses an exhaustive searching mechanism. We open a window $T'$ in T of size m+k (Rule 2) and try to determine whether we can be sure that every prefix $T''$ of this window $T'$ has $ed(T'', P) > k$.
If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix $T''$ of the window $T'$ has $ed(T'', P) \le k$.
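We use dynamic programming to compute the edit distance between two strings. A matrix $C_{0..m,\,0..n}$ is filled, where $C_{j,i}$ is the minimum number of operations needed to match $T_{1..i}$ to $P_{1..j}$. It is computed as follows:
$C_{j,0} = j$, $C_{0,i} = i$,
$C_{j,i} = C_{j-1,i-1}$ if $P_j = T_i$, and $C_{j,i} = 1 + \min(C_{j-1,i},\, C_{j,i-1},\, C_{j-1,i-1})$ otherwise.
$C_{j,0}$ and $C_{0,i}$ represent the edit distance between a string of length j or i and the empty string.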
Example: T = surgery, P = survey, k = 2.
There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.

         s  u  r  g  e  r  y
      0  1  2  3  4  5  6  7
   s  1  0  1  2  3  4  5  6
   u  2  1  0  1  2  3  4  5
   r  3  2  1  0  1  2  3  4
   v  4  3  2  1  1  2  3  4
   e  5  4  3  2  2  1  2  3
   y  6  5  4  3  3  2  2  2
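The matrix for this example can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard unit-cost edit distance (insertion, deletion and substitution each cost 1); the function name `edit_distance_matrix` is only illustrative.

```python
def edit_distance_matrix(pattern: str, text: str) -> list[list[int]]:
    """Return C where C[j][i] = ed(pattern[:j], text[:i])."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j  # distance between a length-j string and the empty string
    for i in range(n + 1):
        C[0][i] = i  # distance between the empty string and a length-i string
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]  # characters agree, no extra cost
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


C = edit_distance_matrix("survey", "surgery")
last_row = C[-1]  # distances between P and every prefix of T
prefix_lengths = [i for i, d in enumerate(last_row) if d <= 2]
print(last_row)        # [6, 5, 4, 3, 3, 2, 2, 2]
print(prefix_lengths)  # [5, 6, 7] -> surge, surger, surgery
```

The last row of the matrix gives the distance between P and each prefix of T, which is exactly what the window test needs.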
Let us now see how we can be sure that, for a window $T'$ of size m+k, every prefix $T''$ of $T'$ has $ed(T'', P) > k$.
We present Lemma 1 of this paper as follows.
Lemma 1
Let $T'$ in T and P be two strings such that $ed(T', P) \le k$. Let $P = P_1 x_1 P_2 x_2 \cdots x_{j-1} P_j$, for strings $P_i$ and $x_i$ and for any $j \ge 1$. Then at least one string $P_i$ appears in $T'$ with at most $\lfloor k/j \rfloor$ errors.
Thus, we always divide the pattern into j pieces. We shall explain how to divide it later.
To be more precise, if $ed(T', P) \le k$, there exist a piece $P_i$ of P and a substring b of $T'$ such that $ed(P_i, b) \le \lfloor k/j \rfloor$.
Lemma 1 tells us that if, for all $P_i$ in P and every substring b of $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$, then $ed(T', P) > k$.
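Suppose that there is a window $T'$ of size m+k such that, for all $P_i$ in P and for every substring b of $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$.
Then we can be sure that for every prefix $T''$ of $T'$, for all $P_i$ in P and every substring b of $T''$, $ed(P_i, b) > \lfloor k/j \rfloor$, because every substring of $T''$ is also a substring of $T'$.
[Figure: the text T, the window $T'$, a prefix $T''$ containing a substring b, and the pattern P with a piece $P_i$.]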
Let us define the following condition.
Condition A: For all $P_i$ in P and every substring b of $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$.
Thus, if Condition A is satisfied, then for every prefix $T''$ of $T'$, $ed(T'', P) > k$.
In such a case, we ignore $T'$ and shift P one step to the right.
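Condition A can be tested directly, if inefficiently, on a single window. The sketch below is only an illustration and not the index-based method of the paper: it assumes a variant of the edit-distance recurrence whose first row is all zeros, so that a match may start anywhere in the window. The names `min_errors_in` and `condition_a` are hypothetical, and the pieces CA and AG are taken from the full example later in these slides.

```python
def min_errors_in(piece: str, window: str) -> int:
    """Smallest ed(piece, b) over all substrings b of window."""
    prev = [0] * (len(window) + 1)  # row 0 is all zeros: b may start anywhere
    for j in range(1, len(piece) + 1):
        cur = [j] + [0] * len(window)
        for i in range(1, len(window) + 1):
            if piece[j - 1] == window[i - 1]:
                cur[i] = prev[i - 1]
            else:
                cur[i] = 1 + min(prev[i], cur[i - 1], prev[i - 1])
        prev = cur
    return min(prev)  # b may end anywhere


def condition_a(pieces: list[str], window: str, k: int) -> bool:
    """True if every piece needs more than floor(k/j) errors in the window."""
    threshold = k // len(pieces)
    return all(min_errors_in(p, window) > threshold for p in pieces)


pieces = ["CA", "AG"]                    # P = CAAG split into j = 2 pieces
print(condition_a(pieces, "GGACC", 1))   # True: the window can be skipped
print(condition_a(pieces, "CCAAA", 1))   # False: CA occurs, so verify
```

When `condition_a` returns True, Lemma 1 guarantees that no prefix of the window matches P with at most k errors, so the window can be ignored.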
Question: how can we be sure that the above condition is satisfied?
The approach:
For each $P_i$, we generate all possible modified strings $P_i'$ whose distances with $P_i$ are smaller than or equal to k.
After generating all possible modified strings $P_i'$, we may use the suffix tree of T to find all occurrences of $P_i'$, for all i, in T with at most $\lfloor k/j \rfloor$ errors.
We still have the following questions:
Question 1. How to divide P into j pieces?
Question 2. How to generate all modified $P_i$'s?
Question 3. How to find the occurrences of the $P_i$'s in T with edit distance less than or equal to $\lfloor k/j \rfloor$?
Question 1: How to divide P into j pieces?
It can be proved that an optimal method is to partition P into j pieces with $j = (m+k)/\log_\sigma n$, where $\sigma$ is the alphabet size. We get j pieces of P, and the size of every piece is around $\log_\sigma n$.
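The partition step can be sketched as follows. The formula for j comes from the slide; rounding j to the nearest integer, capping it at m, and cutting P into contiguous pieces of nearly equal length (so that the separators $x_i$ of Lemma 1 are empty) are assumptions made for this illustration. The function names are hypothetical.

```python
import math


def number_of_pieces(m: int, k: int, n: int, sigma: int) -> int:
    """j = (m + k) / log_sigma(n), rounded to an integer in 1..m (assumed)."""
    j = (m + k) / math.log(n, sigma)
    return max(1, min(m, round(j)))


def split_pattern(pattern: str, j: int) -> list[str]:
    """Cut pattern into j contiguous pieces of nearly equal length."""
    m = len(pattern)
    bounds = [(i * m) // j for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]


# Values from the full example in these slides: m = 4, k = 1, n = 17, sigma = 3.
j = number_of_pieces(4, 1, 17, 3)
print(j)                         # 2
print(split_pattern("CAAG", j))  # ['CA', 'AG']
```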
Question 2. How to generate all modified $P_i$'s?
The generation of all modified strings within the allowed distance of P can be done trivially. One method can be found in [HHLS2006], which was reported by C. W. Lu.
Another method can be found in [HM2007], reported by L. C. Chen.
In this paper, the authors used the second method mentioned in [HM2007].
We can use non-deterministic finite automata (NFA). An NFA is a five-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where Q is a finite set of states, $\Sigma$ is a finite input alphabet, $\delta$ is a mapping from $Q \times (\Sigma \cup \{\varepsilon\})$ into the set of subsets of Q, $q_0 \in Q$ is an initial state, and $F \subseteq Q$ is a set of final states.
P = abac, k = 2. The finite automaton M accepts $L_k(P)$.
$L_k(P)$ = {aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.
[Figure: the NFA M for P = abac and k = 2. Its states are labelled 0,0 to 4,0 (no error), 1,1 to 4,1 (one error) and 2,2 to 4,2 (two errors), with transitions labelled by the pattern characters. The annotations read: "One matched, no error.", "One matched, one error.", "Two matched, two errors.", "Four matched, no error."]
[Figure: the same NFA M for P = abac and k = 2, recognizing the string aa.]
Full example:
T = GACACGGACCAAAGCAG, n = 17
P = CAAG, m = 4
k = 1
$j = (m + k)/\log_\sigma n = (4 + 1)/\log_3 17 = 1.9388$. Therefore, we partition P into two pieces:
$P_1$ = CA
$P_2$ = AG
According to Lemma 1, at least one piece appears in a substring of T with at most $\lfloor k/j \rfloor = \lfloor 1/2 \rfloor = 0$ errors. This means that we want to find exact matches of $P_1$ and $P_2$.
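NFA with k = 1 of $P_1$ = CA and NFA with k = 1 of $P_2$ = AG:
[Figure: two NFAs, each with states 0,0, 1,0, 2,0 (zero error) and 1,1, 2,1 (one error). The NFA of $P_1$ has transitions labelled C, A, A; the NFA of $P_2$ has transitions labelled A, G, G.]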
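The two small automata can be simulated by tracking a set of active states (i, e), where i is the number of pattern characters consumed and e the number of errors so far. The sketch below assumes that the only kind of error is a substitution, which is what the five-state shape of these NFAs suggests; it is not necessarily the construction used in [HM2007]. Under this assumption at most one state is active at a time, but the set-based form is kept so that other error transitions could be added. The names `step` and `run` and the returned status strings are illustrative.

```python
def step(pattern: str, k: int, active: set, ch: str) -> set:
    """Advance every active state (i, e) on the input character ch."""
    nxt = set()
    for i, e in active:
        if i == len(pattern):
            continue                 # pattern fully consumed, no way forward
        if pattern[i] == ch:
            nxt.add((i + 1, e))      # labelled arrow: the character matches
        elif e < k:
            nxt.add((i + 1, e + 1))  # assumed mismatch arrow: one more error
    return nxt


def run(pattern: str, k: int, text: str) -> str:
    """Feed text to the automaton and describe the outcome."""
    active = {(0, 0)}
    for ch in text:
        active = step(pattern, k, active, ch)
        if not active:
            return "out of active states"
    errors = [e for i, e in active if i == len(pattern)]
    if not errors:
        return "no final state reached"
    return "exact match" if min(errors) == 0 else "not exact match"


print(run("CA", 1, "CA"))  # exact match
print(run("CA", 1, "AA"))  # not exact match
print(run("CA", 1, "AG"))  # out of active states
print(run("AG", 1, "AG"))  # exact match
```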
T = GACACGGACCAAAGCAG
We construct the suffix tree of T.
[Figure: the suffix tree of T$, whose leaves are numbered 1 to 17 by the starting positions of the suffixes.]
We only need to consider the tree levels from the root down to depth 3.
[Figure: the suffix tree of T truncated at depth 3.]
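Looking up exact occurrences in the truncated tree can be sketched with a depth-limited suffix trie. This is a naive stand-in for a real suffix tree, assumed here only for illustration: it stores the first three characters of every suffix and, at each node, the 1-based starting positions of the suffixes that pass through it. The function names are hypothetical.

```python
def build_truncated_suffix_trie(text: str, depth: int) -> dict:
    """Insert the first `depth` characters of every suffix of text."""
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []}
            )
            node["positions"].append(start + 1)  # 1-based, as in the slides
    return root


def occurrences(trie: dict, piece: str) -> list[int]:
    """Starting positions of piece in the text, or [] if it does not occur."""
    node = trie
    for ch in piece:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


trie = build_truncated_suffix_trie("GACACGGACCAAAGCAG", 3)
print(occurrences(trie, "CA"))  # [3, 10, 15]
print(occurrences(trie, "AG"))  # [13, 16]
```

A real suffix tree stores the same information in space linear in n; the trie above trades that for brevity.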
The NFA of $P_1$ and the NFA of $P_2$ are run with k = 1 over the truncated suffix tree of T = GACACGGACCAAAGCAG.
[Figure sequence: step-by-step runs of the two NFAs. Each frame shows the current input character and is captioned, for each NFA, with "(exact match)", "(not exact match)" or "Out of active states."]
We record positions 13 and 16, where AG occurs.
We record positions 3, 10 and 15, where CA occurs.
After we find all probable positions in T, we verify every substring of those positions. The probable positions of T are: 3, 10, 13, 15, 1