# A Hybrid Indexing Method for Approximate String Matching

Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: Y. K. Shieh.

• The approximate string matching problem is:

Given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.

• This paper uses an exhaustive searching mechanism. We open a window T′ in T with size m+k (Rule 2) and try to determine whether we can be sure that every prefix T″ of this window T′ has ed(T″,P) > k.

If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix T″ of the window T′ has ed(T″,P) ≤ k.

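• We use dynamic programming to compute the edit distance between two strings. A matrix C_{0..m,0..n} is filled, where C_{j,i} represents the minimum number of operations needed to match T_{1..i} to P_{1..j}. This is computed as follows:

$$C_{j,0} = j, \qquad C_{0,i} = i, \qquad C_{j,i} = \begin{cases} C_{j-1,i-1} & \text{if } P_j = T_i \\ 1 + \min(C_{j-1,i},\, C_{j,i-1},\, C_{j-1,i-1}) & \text{otherwise} \end{cases}$$

C_{j,0} and C_{0,i} represent the edit distance between a string of length j or i and the empty string.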

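• Example: T = surgery, P = survey, k = 2. There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.

|   |   | s | u | r | g | e | r | y |
|---|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| s | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| u | 2 | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| r | 3 | 2 | 1 | 0 | 1 | 2 | 3 | 4 |
| v | 4 | 3 | 2 | 1 | 1 | 2 | 3 | 4 |
| e | 5 | 4 | 3 | 2 | 2 | 1 | 2 | 3 |
| y | 6 | 5 | 4 | 3 | 3 | 2 | 2 | 2 |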
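The surgery/survey example can be reproduced with a few lines of Python. This is a minimal sketch of the standard edit-distance dynamic programming, not code from the paper; the function names `edit_distance_matrix` and `prefixes_within` are hypothetical.

```python
def edit_distance_matrix(pattern, text):
    """Fill C[j][i] = edit distance between pattern[:j] and text[:i]."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j          # delete j pattern characters
    for i in range(n + 1):
        C[0][i] = i          # insert i text characters
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


def prefixes_within(pattern, text, k):
    """Return the prefixes of text whose edit distance to pattern is <= k."""
    last_row = edit_distance_matrix(pattern, text)[-1]
    return [text[:i] for i in range(len(text) + 1) if last_row[i] <= k]


print(prefixes_within("survey", "surgery", 2))
```

The last row of the matrix holds the distance between the whole pattern and each prefix of the text, which is why only that row is inspected.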

• Let us now see how we can be sure that, for a window T′ with size m+k, every prefix T″ of T′ has ed(T″,P) > k.

We present Lemma 1 of this paper as follows.

• Lemma 1. Let T′ in T and P be two strings such that ed(T′, P) ≤ k. Let P = P_1 x_1 P_2 x_2 ... x_{j-1} P_j, for strings P_i and x_i and for any j ≥ 1. Then at least one string P_i appears in T′ with at most ⌊k/j⌋ errors.

Thus, we always divide the pattern into j pieces. We shall point out how to divide later.

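• To be more precise, we may say that if ed(T′,P) ≤ k, there exist a P_i in P and a substring T″ of T′ such that ed(P_i,T″) ≤ ⌊k/j⌋.

• Lemma 1 tells us that if, for all P_i in P and every substring b in T′, ed(P_i,b) > ⌊k/j⌋, then ed(P,T′) > k.

Suppose that there is a window T′ with size m+k and, for all P_i in P and for every substring b in T′, ed(P_i,b) > ⌊k/j⌋.

Then we can be sure that, for every prefix T″ of T′, for all P_i in P and every substring b in T″, ed(P_i,b) > ⌊k/j⌋, because every substring of T″ is also a substring of T′.

[Figure: the text T, the window T′, a prefix T″ of T′ and the pattern P, with a substring b compared against a piece P_i.]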

• Let us define the following condition.

Condition A: For all P_i in P and every substring b in T′, ed(P_i, b) > ⌊k/j⌋.

Thus, if Condition A is satisfied, then for every prefix T″ of T′, ed(T″,P) > k.

In such a case, we ignore T′ and shift P one step to the right.
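The window test can be sketched directly, assuming the error limit per piece is ⌊k/j⌋ as in Lemma 1, where j is the number of pieces. The sketch checks Condition A by brute-force dynamic programming over the window, which is an illustration only; the slides instead use automata and a suffix tree. The names `min_substring_distance` and `condition_a` are hypothetical.

```python
def min_substring_distance(piece, window):
    """Smallest ed(piece, b) over all substrings b of window.

    The first DP row is kept at 0, so a match may start at any position.
    """
    col = list(range(len(piece) + 1))
    best = col[-1]                      # b = empty substring
    for ch in window:
        new = [0]
        for j in range(1, len(piece) + 1):
            if piece[j - 1] == ch:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j], new[j - 1], col[j - 1]))
        col = new
        best = min(best, col[-1])
    return best


def condition_a(pieces, window, k):
    """True if every piece is farther than floor(k/j) from every substring."""
    limit = k // len(pieces)
    return all(min_substring_distance(p, window) > limit for p in pieces)


# Windows of size m + k = 5 for pieces CA and AG with k = 1.
print(condition_a(["CA", "AG"], "CGGAC", 1))   # window can be skipped
print(condition_a(["CA", "AG"], "CCAAA", 1))   # window must be verified
```

When `condition_a` returns True the window is skipped; otherwise it still has to be checked against the whole pattern.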

• Question: how can we be sure that the above condition is satisfied?

The approach:

For each P_i, we generate all possible modified strings P_i′ whose distances from P_i are smaller than or equal to k.

After generating all possible modified strings P_i′, we may use the suffix tree of T to find all occurrences of P_i′, for all i, in T with at most ⌊k/j⌋ errors.

• We still have the following questions:

Question 1. How to divide P into j pieces?

Question 2. How to generate all modified P_i′s?

Question 3. How to find the occurrences of the P_i′s in T with edit distance less than or equal to ⌊k/j⌋?

• Question 1: How to divide P into j pieces? It can be proved that an optimal method is to partition P into j pieces with j = (m+k)/log_σ n, where σ is the alphabet size. We get j pieces of P, and the size of every piece is around log_σ n.

• Question 2. How to generate all modified P_i′s?

The generation of all modified strings whose distances from P are at most k can be done trivially. One method can be found in [HHLS2006], which was reported by C. W. Lu.

Another method can be found in [HM2007], reported by L. C. Chen.

In this paper, the authors used the second method, mentioned in [HM2007].

• We can use non-deterministic finite automata (NFA). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states.
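To make the five-tuple concrete, here is a minimal sketch of simulating an NFA with ε-transitions by tracking the set of active states. The toy automaton at the bottom is a made-up illustration and is not the automaton from the slides; `EPS`, `eps_closure` and `accepts` are hypothetical names.

```python
EPS = ""  # label used here for epsilon transitions (an assumption of this sketch)


def eps_closure(states, delta):
    """All states reachable from `states` using only epsilon transitions."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def accepts(word, delta, q0, finals):
    """Simulate the NFA on `word` and report whether a final state is active."""
    active = eps_closure({q0}, delta)
    for ch in word:
        moved = set()
        for q in active:
            moved |= set(delta.get((q, ch), ()))
        active = eps_closure(moved, delta)
        if not active:
            return False            # out of active states
    return bool(active & finals)


# Toy NFA: reads 'a', then either reads 'b' or skips it through epsilon.
delta = {(0, "a"): {1}, (1, "b"): {2}, (1, EPS): {2}}
print(accepts("ab", delta, 0, {2}), accepts("a", delta, 0, {2}))
```

The early return mirrors the "out of active states" situation: once no state is active, no continuation of the input can be accepted.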

• P = abac, k = 2. The finite automaton M accepts L_k(P).

L_k(P) = {aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

[Figure: NFA M with states (0,0), (1,0), (2,0), (3,0), (4,0) in the top row, (1,1), (2,1), (3,1), (4,1) in the "one error" row and (2,2), (3,2), (4,2) in the "two errors" row, with transitions labeled by the pattern characters a, b, a, c. Annotations: (1,0) is "one matched, no error"; (1,1) is "one matched, one error"; (2,2) is "two matched, two errors"; (4,0) is "four matched, no error".]

• Recognizing aa with the same automaton (P = abac, k = 2).

[Figure: the NFA M above, tracing the input aa.]

• Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1.

• P = CAAG, j = (m + k) / log_σ n = (4 + 1) / log_3 17 = 1.9388. Therefore, we partition P into two pieces: P1 = CA and P2 = AG. According to Lemma 1, at least one piece appears in substrings of T with at most ⌊k/j⌋ = 0 errors. This means that we want to find exact matches of P1 and P2.
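The arithmetic of this step can be checked with a short sketch. It assumes the formula j = (m + k) / log_σ n with σ = 3 as in the example, rounds the result to the nearest integer (the slides only say "two pieces", so the rounding rule is an assumption), and splits the pattern into contiguous pieces of nearly equal length. The function names are hypothetical.

```python
import math


def estimated_piece_count(m, k, n, sigma):
    """j = (m + k) / log_sigma(n), before rounding."""
    return (m + k) / math.log(n, sigma)


def split_into_pieces(pattern, j):
    """Split pattern into j contiguous pieces of nearly equal length."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for idx in range(j):
        size = base + (1 if idx < extra else 0)
        pieces.append(pattern[start:start + size])
        start += size
    return pieces


j_real = estimated_piece_count(m=4, k=1, n=17, sigma=3)
j = max(1, round(j_real))            # rounding rule assumed for illustration
pieces = split_into_pieces("CAAG", j)
error_budget = 1 // j                # floor(k / j) with k = 1
print(j_real, j, pieces, error_budget)
```

With these inputs the two pieces are CA and AG and the per-piece error limit is 0, which matches the slide.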

• NFA with k = 1 of P1 = CA and NFA with k = 1 of P2 = AG:

[Figure: two NFAs, each with states (0,0), (1,0), (2,0) in the "zero error" row and (1,1), (2,1) in the "one error" row. The transitions are labeled C, A, A for P1 and A, G, G for P2.]

• T = GACACGGACCAAAGCAG. We construct the suffix tree of T.

[Figure: suffix tree of T$, with the leaves labeled by the start positions 1 to 17 of the suffixes.]

• We only need to consider the tree levels from the root down to depth 3.

[Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3, with the leaves labeled by start positions.]
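A depth-limited index of this kind can be imitated with a naive trie of all substrings of length at most 3. The sketch below is not a real suffix tree (it is built in O(n · depth) time, has no compressed edges and omits the $ terminator), and the names `build_truncated_suffix_trie` and `occurrences` are hypothetical. It only shows how the start positions of a piece are read off the truncated structure.

```python
def build_truncated_suffix_trie(text, depth):
    """Insert the first `depth` characters of every suffix of text.

    Each node stores the 1-based start positions of the suffixes passing
    through it.
    """
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []}
            )
            node["positions"].append(start + 1)
    return root


def occurrences(trie, piece):
    """Start positions of exact occurrences of piece (len(piece) <= depth)."""
    node = trie
    for ch in piece:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


T = "GACACGGACCAAAGCAG"
trie = build_truncated_suffix_trie(T, depth=3)
print(occurrences(trie, "CA"), occurrences(trie, "AG"))
```

Because suffixes are inserted in order of their start position, each position list comes out sorted.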

• We run the NFA of P1 = CA and the NFA of P2 = AG on the paths of the truncated suffix tree of T = GACACGGACCAAAGCAG, with k = 1. After each character, an NFA is in an exact-match state, in a state that is not an exact match, or out of active states.

[Figures: the two NFAs after each character read from the tree, annotated "(exact match)", "(not exact match)" or "Out of active states."]

• On the path AG, the NFA of P2 reaches an exact match while the NFA of P1 is out of active states. We record positions 13 and 16, where AG occurs.

• On the path CA, the NFA of P1 reaches an exact match while the NFA of P2 is out of active states. We record positions 3, 10 and 15, where CA occurs.

• On the remaining paths, each NFA either ends in a state that is not an exact match or runs out of active states, so no further position is recorded.

• After we find all probable positions in T, we verify every substring of those positions. The probable positions of T are those recorded for the two pieces: 3, 10 and 15 for CA, and 13 and 16 for AG.
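The verification step can be sketched as follows. The slides do not spell out which substring is checked around each probable position, so this sketch assumes a window that starts k characters before the place where the pattern would begin and ends k characters after the place where it would end, and it runs a dynamic programming scan whose first row is kept at 0 so that a match may start anywhere in the window. The names `end_positions` and `verify` are hypothetical, as is the encoding of each candidate as (text position, offset of the piece inside P).

```python
def end_positions(pattern, text, k, offset=0):
    """1-based end positions where some substring matches with <= k errors."""
    col = list(range(len(pattern) + 1))
    ends = []
    for i, ch in enumerate(text, start=1):
        new = [0]                        # a match may start at any position
        for j in range(1, len(pattern) + 1):
            if pattern[j - 1] == ch:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j], new[j - 1], col[j - 1]))
        col = new
        if col[-1] <= k:
            ends.append(offset + i)
    return ends


def verify(text, pattern, k, candidates):
    """Check a window of length at most m + 2k around every candidate."""
    m, found = len(pattern), set()
    for pos, piece_offset in candidates:
        lo = max(1, pos - piece_offset - k)
        hi = min(len(text), pos - piece_offset + m + k - 1)
        found.update(end_positions(pattern, text[lo - 1:hi], k, offset=lo - 1))
    return sorted(found)


T, P = "GACACGGACCAAAGCAG", "CAAG"
# CA is at offset 0 in P and AG at offset 2.
candidates = [(3, 0), (10, 0), (15, 0), (13, 2), (16, 2)]
print(verify(T, P, 1, candidates))
```

On this example the windows around the five candidates report the same end positions as a scan of the whole text, which is the behaviour the filtering step is meant to guarantee.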