
A Hybrid Indexing Method for Approximate String Matching


A Hybrid Indexing Method for Approximate String Matching. Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: Y. K. Shieh.


  • The approximate string matching problem is:

    Given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of the pattern in the text with at most k errors.

  • This paper uses an exhaustive searching mechanism. We open a window $T'$ in T with size m+k (Rule 2) and try to determine whether we are sure that every prefix $T''$ of this window $T'$ has $ed(T'', P) > k$.

    If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix $T''$ of the window $T'$ has $ed(T'', P) \le k$.

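  • We use dynamic programming to compute the edit distance between two strings. A matrix $C_{0..m,\,0..n}$ is filled, where $C_{j,i}$ represents the minimum number of operations needed to match $T_{1..i}$ to $P_{1..j}$. This is computed as follows:

    $C_{j,0} = j$, $C_{0,i} = i$,
    $C_{j,i} = C_{j-1,i-1}$ if $P_j = T_i$, and $C_{j,i} = 1 + \min(C_{j-1,i},\, C_{j,i-1},\, C_{j-1,i-1})$ otherwise.

    $C_{j,0}$ and $C_{0,i}$ represent the edit distance between a string of length j or i and the empty string.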

  • Example: T = surgery, P = survey, k = 2. There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.

           s  u  r  g  e  r  y
        0  1  2  3  4  5  6  7
     s  1  0  1  2  3  4  5  6
     u  2  1  0  1  2  3  4  5
     r  3  2  1  0  1  2  3  4
     v  4  3  2  1  1  2  3  4
     e  5  4  3  2  2  1  2  3
     y  6  5  4  3  3  2  2  2
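
The matrix above can be reproduced with a short Python sketch. The function names `edit_distance_matrix` and `prefixes_within` are illustrative and do not come from the slides; the recurrence is the standard edit distance recurrence, assumed here to be the one the slide refers to.

```python
def edit_distance_matrix(pattern, text):
    """Fill C where C[j][i] = ed(pattern[:j], text[:i])."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j              # distance to the empty string
    for i in range(n + 1):
        C[0][i] = i
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


def prefixes_within(pattern, text, k):
    """Prefixes of text whose edit distance to pattern is at most k."""
    last_row = edit_distance_matrix(pattern, text)[-1]
    return [text[:i] for i in range(len(text) + 1) if last_row[i] <= k]


print(edit_distance_matrix("survey", "surgery")[-1])  # [6, 5, 4, 3, 3, 2, 2, 2]
print(prefixes_within("survey", "surgery", 2))        # ['surge', 'surger', 'surgery']
```

The last row of the matrix holds the distances between the whole pattern and each prefix of the text, which is why reading it off is enough to list the qualifying prefixes.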

  • Let us now see how we can be sure that for a window $T'$ with size m+k, for every prefix $T''$ of $T'$, $ed(T'', P) > k$.

    We present Lemma 1 of this paper as follows.

  • Lemma 1. Let $T'$ in T and P be two strings such that $ed(T', P) \le k$. Let $P = P_1 x_1 P_2 x_2 \ldots x_{j-1} P_j$, for strings $P_i$ and $x_i$ and for any $j \ge 1$. Then, at least one string $P_i$ appears in $T'$ with at most $\lfloor k/j \rfloor$ errors.

    Thus, we always divide the pattern into j pieces. We shall point out how to divide later.

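  • To be more precise, we may say that if $ed(T', P) \le k$, there exists a $P_i$ in P and a substring $T''$ of $T'$ such that $ed(P_i, T'') \le \lfloor k/j \rfloor$.

  • Lemma 1 tells us that if for all $P_i$ in P and every substring b in $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$, then $ed(P, T') > k$.

    Suppose that there is a window $T'$ with size m+k and for all $P_i$ in P and for every substring b in $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$.

    Then, we can be sure that for every prefix $T''$ of $T'$, for all $P_i$ in P and every substring b in $T''$, $ed(P_i, b) > \lfloor k/j \rfloor$.

    [Figure: the text T with the window $T'$ and a prefix $T''$, the pattern P with a piece $P_i$, and a substring b.]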

  • Let us define the following condition.

    Condition A: For all $P_i$ in P and every substring b in $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$.

    Thus, if Condition A is satisfied, then for every prefix $T''$ of $T'$, $ed(T'', P) > k$.

    In such a case, we ignore $T'$ and shift P one step to the right.
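
As an illustration of Condition A, here is a minimal Python sketch that tests it directly on one window. It assumes the pieces are given as a list and uses the textbook approximate-search variant of the dynamic programming matrix (first row set to zero), which is not the index-based method these slides go on to describe. The names `min_substring_distance` and `condition_a` are hypothetical.

```python
def min_substring_distance(piece, window):
    """Smallest ed(piece, b) over all substrings b of window."""
    # Row 0 is all zeros: a match may start at any position of the window.
    prev = [0] * (len(window) + 1)
    for j in range(1, len(piece) + 1):
        cur = [j] + [0] * len(window)
        for i in range(1, len(window) + 1):
            cost = 0 if piece[j - 1] == window[i - 1] else 1
            cur[i] = min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1)
        prev = cur
    return min(prev)


def condition_a(pieces, window, k):
    """True when every piece is more than floor(k/j) errors away
    from every substring of the window, so the window can be skipped."""
    threshold = k // len(pieces)
    return all(min_substring_distance(p, window) > threshold for p in pieces)


# Pieces CA and AG with k = 1 give a threshold of 0 (exact matching).
print(condition_a(["CA", "AG"], "ACGGA", 1))  # True: neither piece occurs
print(condition_a(["CA", "AG"], "GACAC", 1))  # False: CA occurs
```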

  • Question: how can we be sure that the above condition is satisfied?

    The approach:

    For each $P_i$, we generate all possible modified strings $P_i'$ whose distances with $P_i$ are smaller than or equal to k.

    After generating all possible modified $P_i'$, we may use the suffix tree of T to find all occurrences of $P_i'$, for all i, in T with at most $\lfloor k/j \rfloor$ errors.

  • We still have the following questions:

    Question 1. How to divide P into j pieces?

    Question 2. How to generate all modified $P_i$'s?

    Question 3. How to find the occurrences of the $P_i$'s in T with edit distance less than or equal to $\lfloor k/j \rfloor$?

  • Question 1: How to divide P into j pieces? It can be proved that an optimal method is to partition P into j pieces with $j = (m + k) / \log_\sigma n$, where $\sigma$ is the alphabet size. We can get j pieces of P, and the size of every piece is around $\log_\sigma n$.
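
The partition rule can be sketched in Python as follows. The values n = 17, σ = 3, m = 4 and k = 1 are those of the full example later in these slides. Two assumptions are made: the real-valued j is rounded to the nearest integer, and the pieces are contiguous and of nearly equal length (the separators $x_i$ of Lemma 1 are taken to be empty). The function names are illustrative.

```python
import math


def number_of_pieces(m, k, n, sigma):
    """Real-valued j = (m + k) / log_sigma(n)."""
    return (m + k) / math.log(n, sigma)


def split_pattern(pattern, j):
    """Split pattern into j contiguous pieces of nearly equal length."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for idx in range(j):
        length = base + (1 if idx < extra else 0)
        pieces.append(pattern[start:start + length])
        start += length
    return pieces


j_real = number_of_pieces(m=4, k=1, n=17, sigma=3)
j = max(1, round(j_real))          # rounding is an assumption of this sketch
print(round(j_real, 4), j)         # 1.9388 2
print(split_pattern("CAAG", j))    # ['CA', 'AG']
```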

  • Question 2. How to generate all modified $P_i$'s?

    The generation of all modified strings whose distances with P are at most k can be done trivially. One method can be found in [HHLS2006], which was reported by C. W. Lu.

    Another method can be found in [HM2007], reported by L. C. Chen.

    In this paper, the authors used the second method mentioned in [HM2007].

  • We can use non-deterministic finite automata (NFA). An NFA is a five-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where Q is a finite set of states, $\Sigma$ is a finite input alphabet, $\delta$ is a mapping from $Q \times (\Sigma \cup \{\varepsilon\})$ into the set of subsets of Q, $q_0 \in Q$ is an initial state, and $F \subseteq Q$ is a set of final states.

  • P = abac, k = 2. The finite automaton M accepts $L_k(P)$.

    $L_k(P)$ = {aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

    [Figure: the NFA M with twelve states labeled 0,0 to 4,0 (no error), 1,1 to 4,1 (one error) and 2,2 to 4,2 (two errors), with edge labels a, b, a, c. Annotations on the figure: "One matched, no error", "One matched, one error", "Two matched, two errors", "Four matched, no error".]

  • P = abac, k = 2. The same automaton M is used to recognize aa.

    [Figure: the NFA M as on the previous slide, with states 0,0 to 4,2, tracing the input aa.]
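
To make the automaton idea concrete, here is a small Python sketch that simulates an NFA whose states are pairs (characters matched, errors used). It is the standard Levenshtein automaton, which allows substitutions, insertions and deletions; this is an assumption, and the language it accepts may differ from the $L_k(P)$ listed on the slides. The function names are illustrative.

```python
def epsilon_closure(states, m, k):
    """Add states reachable by skipping pattern characters (deletions)."""
    closed = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in closed:
            closed.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return closed


def accepts(pattern, word, k):
    """True if word is within k edit operations of pattern."""
    m = len(pattern)
    active = epsilon_closure({(0, 0)}, m, k)
    for c in word:
        nxt = set()
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))          # match, no new error
            if e < k:
                nxt.add((i, e + 1))          # extra character in word
                if i < m:
                    nxt.add((i + 1, e + 1))  # substitution
        active = epsilon_closure(nxt, m, k)
        if not active:
            return False                     # out of active states
    return any(i == m for i, _ in active)


print(accepts("abac", "aa", 2))    # True
print(accepts("abac", "abac", 2))  # True
print(accepts("abac", "ccc", 2))   # False
```

Tracking the set of active states is what lets the traversal stop early: once the set is empty, no extension of the input can be accepted.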

  • Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1.

  • P = CAAG, $j = (m + k) / \log_\sigma n = (4 + 1) / \log_3 17 = 1.9388$. Therefore, we partition P into two pieces: P1 = CA and P2 = AG. According to Lemma 1, at least one piece appears in substrings of T with at most $\lfloor k/j \rfloor = 0$ errors. This means that we want to find exact matches of P1 and P2.

  • NFA with k = 1 of P1 = CA, and NFA with k = 1 of P2 = AG.

    [Figure: two NFAs, each with states 0,0, 1,0, 2,0 (zero error) and 1,1, 2,1 (one error). Edge labels are C, A, A for P1 and A, G, G for P2.]

  • T = GACACGGACCAAAGCAG. We construct the suffix tree of T.

    [Figure: the suffix tree of T with the terminator $, whose leaves are labeled with the starting positions of the suffixes.]

  • We only need to consider the tree levels from the root down to depth 3.

    [Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3, with leaves labeled by suffix starting positions; positions 1 and 7 share a leaf.]
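
The effect of cutting the tree at depth 3 can be imitated with a plain dictionary. The sketch below is a stand-in for a truncated suffix tree, not a real suffix tree implementation: it maps every string of length at most 3 that begins some suffix of T$ to the 1-based positions where it starts. The name `truncated_suffix_index` is hypothetical.

```python
from collections import defaultdict


def truncated_suffix_index(text, depth):
    """Map each suffix prefix of length <= depth to its 1-based start positions."""
    index = defaultdict(list)
    padded = text + "$"                     # terminator, as in the suffix tree
    for start in range(len(padded)):
        for length in range(1, depth + 1):
            if start + length > len(padded):
                break                       # suffix is shorter than this depth
            index[padded[start:start + length]].append(start + 1)
    return index


T = "GACACGGACCAAAGCAG"
index = truncated_suffix_index(T, 3)
print(index["GAC"])  # [1, 7]
print(index["CA"])   # [3, 10, 15]
print(index["AG"])   # [13, 16]
```

Suffixes 1 and 7 both begin with GAC, which is why they end up in the same truncated leaf.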

  • NFA of P1 and NFA of P2, with T = GACACGGACCAAAGCAG and k = 1.

    [Figure: the NFAs of P1 = CA and P2 = AG, each with states 0,0, 1,0, 2,0, 1,1, 2,1; an additional label A appears on each.]

  • Neither NFA reports an exact match (both are marked "not exact match"). T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label A appears on each.]

  • One NFA is out of active states; the other is marked "not exact match". T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label A appears on the NFA of P2 only.]

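  • One NFA reports an exact match; the other is out of active states. We record positions 13 and 16, where AG occurs. T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label G appears on the NFA of P2 only. Positions 13 and 16 are marked under T.]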

  • T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label C appears on each.]

  • One NFA is out of active states; the other reports an exact match. We record positions 3, 10 and 15, where CA occurs. T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label A appears on the NFA of P1 only.]

  • One NFA is out of active states; the other is marked "not exact match". T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label C appears on the NFA of P1 only.]

  • Neither NFA reports an exact match (both are marked "not exact match"). T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label G appears on each.]

  • T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label G appears on each.]

  • Both NFAs are out of active states. T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2, with no additional labels.]

  • One NFA is out of active states; the other is marked "not exact match". T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label A appears on the NFA of P1 only.]

  • Both NFAs are out of active states. T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2, with no additional labels.]

  • One NFA is out of active states; the other is marked "not exact match". T = GACACGGACCAAAGCAG, k = 1.

    [Figure: the NFAs of P1 and P2; an additional label G appears on the NFA of P2 only.]

  • After we find all probable positions in T, we verify every substring of those positions. The probable positions of T are: 3, 10, 13, 15, 1
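
One possible way to carry out the verification step is sketched below in Python. It assumes that each probable position is the 1-based start of an exact occurrence of a piece, and that the piece's offset inside P is known (0 for P1 = CA, 2 for P2 = AG). The text area that could hold an occurrence of P is then checked with the approximate-search dynamic programming matrix. The names `has_match` and `verify` are hypothetical and not taken from the slides.

```python
def has_match(area, pattern, k):
    """True if some substring of area is within k errors of pattern."""
    prev = [0] * (len(area) + 1)            # a match may start anywhere
    for j in range(1, len(pattern) + 1):
        cur = [j] + [0] * len(area)
        for i in range(1, len(area) + 1):
            cost = 0 if pattern[j - 1] == area[i - 1] else 1
            cur[i] = min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1)
        prev = cur
    return min(prev) <= k


def verify(text, pattern, k, pos, offset):
    """pos: 1-based position of a piece in text; offset: 0-based start
    of that piece inside pattern. Checks the area of size m + 2k."""
    m = len(pattern)
    lo = max(0, (pos - 1) - offset - k)
    hi = min(len(text), (pos - 1) - offset + m + k)
    return has_match(text[lo:hi], pattern, k)


T, P, k = "GACACGGACCAAAGCAG", "CAAG", 1
print(verify(T, P, k, pos=3, offset=0))    # True: CACG has one substitution
print(verify(T, P, k, pos=10, offset=0))   # True: CAAAG has one extra A
print(verify("GGGGGGGG", P, k, pos=3, offset=0))  # False
```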