# Two Different Approximate String Matching Problems and Their Algorithms

Speakers: C. W. Lu and Y. K. Shie. Advisor: Richard Chia-Tung Lee.


• Two different definitions of the approximate string matching problem:

Problem 1: Given a text T of length n, a pattern P of length m and an error bound k, find all substrings of T whose edit distance from P is less than or equal to k.

Problem 2: Given a text T of length n, a pattern P of length m and an error bound k, find all positions i of T such that there exists a suffix of T(1, i) whose edit distance from P is less than or equal to k.

Here T(i, j) denotes the substring of T from position i to position j.

• An example of Problem 1:

Position: 1 2 3 4 5 6 7 8 9 10

T: a b a b c d b c d d

P: abcd, k = 1

Output: T(2, 6) = babcd, T(3, 5) = abc, T(3, 6) = abcd, T(3, 7) = abcdb, T(4, 6) = bcd, T(6, 9) = dbcd, T(7, 9) = bcd

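• An example of Problem 2:

Position: 1 2 3 4 5 6 7 8 9 10

T: a b a b c d b c d d

P: abcd, k = 1

Output: Positions of T: 5, 6, 7 and 9. (Position 5 qualifies because T(3, 5) = abc is one insertion away from abcd.)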

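• Computing the edit distance between two strings X and Y by the dynamic programming method. Let $D_{i,j}$ be the edit distance between $Y(1, i)$ and $X(1, j)$, with $D_{i,0} = i$ and $D_{0,j} = j$. Then

$$D_{i,j} = \min \begin{cases} D_{i-1,j} + 1 & \text{(Delete)} \\ D_{i,j-1} + 1 & \text{(Insert)} \\ D_{i-1,j-1} + \delta_{i,j} & \text{(Substitute)} \end{cases}$$

where $\delta_{i,j} = 0$ if $y_i = x_j$ and $\delta_{i,j} = 1$ otherwise. Let us denote this method by DP1.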

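• Example: We can find the edit distances between all prefixes of Y and all prefixes of X from this table.

| j | X | i = 0 | 1 (a) | 2 (c) | 3 (c) | 4 (g) | 5 (a) | 6 (t) | 7 (g) | 8 (c) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 1 | a | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 2 | a | 2 | 1 | 1 | 2 | 3 | 3 | 4 | 5 | 6 |
| 3 | a | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
| 4 | c | 4 | 3 | 2 | 2 | 3 | 4 | 4 | 5 | 5 |
| 5 | g | 5 | 4 | 3 | 3 | 2 | 3 | 4 | 4 | 5 |
| 6 | a | 6 | 5 | 4 | 4 | 3 | 2 | 3 | 4 | 5 |

The column headings give the position i in Y with its character; the rows give the position j in X with its character.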
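The DP1 table for this example can be reproduced with a few lines of Python. This is a minimal sketch rather than the speakers' own code: the function name `dp1_table` and the row/column layout (rows for X, columns for Y) are assumptions chosen to mirror the slide.

```python
def dp1_table(x, y):
    """Return table[j][i] = edit distance between x[:j] and y[:i]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(y) + 1):
        table[0][i] = i          # empty prefix of x against y[:i]
    for j in range(len(x) + 1):
        table[j][0] = j          # x[:j] against the empty prefix of y
    for j in range(1, len(x) + 1):
        for i in range(1, len(y) + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            table[j][i] = min(table[j - 1][i] + 1,         # skip a character of x
                              table[j][i - 1] + 1,         # skip a character of y
                              table[j - 1][i - 1] + cost)  # match or substitute
    return table


# X and Y as used on the slide.
table = dp1_table("aaacga", "accgatgc")
for row in table:
    print(" ".join(str(value) for value in row))
```

The bottom-right entry, `table[6][8]`, is the edit distance between the two full strings; every other entry is the distance between a pair of prefixes.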

• Problem 1 can be solved by sliding a window of size m+k over T: for each starting position i, DP1 computes the edit distances between P and the prefixes of the window T(i, i+m-1+k), and every prefix whose edit distance is at most k is reported.

• Example: T: a b a b c d b c (positions 1 to 8), P: a b c d, k = 1, window size m+k = 5.

| Window start i | Output |
|---|---|
| 1 | (none) |
| 2 | T(2, 6) = babcd |
| 3 | T(3, 5) = abc, T(3, 6) = abcd, T(3, 7) = abcdb |
| 4 | T(4, 6) = bcd |
| 5 | (none) |
| 6 | (none) |
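The window-by-window procedure can be sketched as follows. The helper names (`edit_distances_to_prefixes`, `problem1_sliding_window`) are hypothetical, and the handling of windows that run past the end of T (they are simply truncated) is an assumption, since the slides do not spell it out.

```python
def edit_distances_to_prefixes(p, w):
    """Last DP1 row: entry `length` is the edit distance between p and w[:length]."""
    prev = list(range(len(w) + 1))      # empty prefix of p against each prefix of w
    for j in range(1, len(p) + 1):
        cur = [j] + [0] * len(w)
        for length in range(1, len(w) + 1):
            cost = 0 if p[j - 1] == w[length - 1] else 1
            cur[length] = min(prev[length] + 1,
                              cur[length - 1] + 1,
                              prev[length - 1] + cost)
        prev = cur
    return prev


def problem1_sliding_window(t, p, k):
    """Report (start, end, substring) with 1-based positions, as on the slides."""
    m = len(p)
    hits = []
    for i in range(1, len(t) + 1):
        window = t[i - 1:i - 1 + m + k]     # truncated near the end of t (assumption)
        dist = edit_distances_to_prefixes(p, window)
        for length in range(1, len(window) + 1):
            if dist[length] <= k:
                hits.append((i, i + length - 1, window[:length]))
    return hits


for start, end, substring in problem1_sliding_window("ababcdbc", "abcd", 1):
    print(f"T({start}, {end}) = {substring}")
```

Each window costs O(m(m+k)) time, so the whole scan costs O(nm(m+k)); this is the exhaustive computation that the filtering algorithms try to avoid.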

• Some algorithms try to avoid this exhaustive computation.

Examples are the Navarro and Baeza-Yates Algorithm [NB2000], the Fredriksson and Navarro Algorithm [FN2004] and the Lu and Lee Algorithm [LL2008]. We shall explain these algorithms later.

• Another approach, which does not use the sliding window, is the Wu and Manber Algorithm [WM92].

Actually, the idea proposed in [NB2000] was already mentioned in [WM92], but only briefly, as if it were trivial.

• To solve Problem 2, we may use another DP algorithm, called DP2, which will be explained as follows.

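• Given strings Y and X, DP2 computes, for every position i of Y, the minimal ED(S, X) where S is a suffix of the substring Y(1, i). DP2 uses the same recurrence as DP1; the only difference is that the boundary row for j = 0 is set to 0 for every i, because S may start at any position of Y.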

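• Example:

| j | X | i = 0 | 1 (a) | 2 (c) | 3 (c) | 4 (g) | 5 (a) | 6 (t) | 7 (g) | 8 (c) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | a | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | a | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 2 |
| 3 | a | 3 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 3 |
| 4 | c | 4 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 2 |
| 5 | g | 5 | 4 | 3 | 3 | 2 | 3 | 4 | 3 | 3 |
| 6 | a | 6 | 5 | 4 | 4 | 3 | 2 | 3 | 4 | 4 |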

• Table(5, 6) = 2 indicates that there is a substring, namely accga, ending at location 5 of Y whose edit distance from X is 2, the smallest among all substrings of Y ending at location 5.


• To find out which substring of Y is our solution, we need to trace back through the table.


• If we set k = 3, then from the above table we can see that locations 4, 5 and 6 are the solutions for Problem 2. If k = 4, the solutions are locations 2 through 8.

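The DP2 example can be checked with a short sketch. The function names `dp2_last_row` and `problem2_positions` are hypothetical; the only change from a plain edit-distance computation is the all-zero first row, which lets a match start at any position of the text.

```python
def dp2_last_row(x, y):
    """Entry i is the minimum ED(S, x) over all suffixes S of y[:i]."""
    prev = [0] * (len(y) + 1)           # DP2: the row for the empty prefix of x is all zeros
    for j in range(1, len(x) + 1):
        cur = [j] + [0] * len(y)
        for i in range(1, len(y) + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            cur[i] = min(prev[i] + 1, cur[i - 1] + 1, prev[i - 1] + cost)
        prev = cur
    return prev


def problem2_positions(t, p, k):
    """1-based end positions i of t where some suffix of t[:i] is within k edits of p."""
    row = dp2_last_row(p, t)
    return [i for i in range(1, len(t) + 1) if row[i] <= k]


print(dp2_last_row("aaacga", "accgatgc"))           # bottom row of the DP2 table
print(problem2_positions("accgatgc", "aaacga", 3))  # locations for k = 3
print(problem2_positions("accgatgc", "aaacga", 4))  # locations for k = 4
```

Only the last row is needed to answer Problem 2, so the sketch keeps two rows at a time; recovering the matching substring itself would require the full table and a traceback.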

• Obviously, Problem 2 can be solved by DP2 in O(mn) time. Note that no sliding window is needed when DP2 is used directly.

Again, this method would be more efficient if some of the positions of T could be skipped. Some algorithms try to do this, for example the Navarro and Baeza-Yates Algorithm [NB99], the Tarhio and Ukkonen Algorithm [TU93] and Z. H. Pan's thesis.

We shall explain them later.

• HN Algorithm (Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyrö and Gonzalo Navarro, Algorithmica, Vol. 41, No. 3, 2005).

For a substring S of T, they use the DP2 method to find the minimum ED(S, P') among all substrings P' of P.

The HN Algorithm solves Problem 1, but it also uses the DP2 method. We will explain this in the next slide.

• For a window of size m-k, if there exists a substring S in this window such that its edit distance with every substring of P is greater than k, we move P to S.

This rule is called Rule 5 by Lee's group.

[Figure: text T, pattern P, and a window of size m-k containing the substring S]

(The HN Algorithm paper has not been reported yet. It is rumored that Mr. Ou-Dee should study and report on it. Obviously, he is busy doing some crazy things.)

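• Example with k = 1 and a window of size m-k. In this case, both the pattern and the text are reversed before DP2 is applied.

[Figure: DP2 table computed on reversed strings (ababcd reversed to dcbaba), with k = 1 and a window of size m-k]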

• Both DP1 and DP2 can be improved by the LV algorithm (Fast Parallel and Serial Approximate String Matching, G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10 (1989), pp. 157-169), which takes O(nk) time.

This algorithm avoids computing the entire DP table.

• Diagonal d is defined as the set of all $D_{i,j}$ with $d = i - j$, where $D_{i,j}$ is the value at (i, j) in the DP table.

[Figure: DP table with Diagonal 0 and Diagonal 2 marked]

• This algorithm is based on the following observations:

1. The values of the elements on the same diagonal are non-decreasing.

2. The value of every element on diagonal d is decided by the elements on diagonals d, d-1 and d+1.

[Figure: $D_{i,j}$ and its neighbors $D_{i-1,j-1}$ (diagonal d, substitution), $D_{i,j-1}$ and $D_{i-1,j}$ (diagonals d+1 and d-1, delete and insert)]

• The values of the elements on the same diagonal are non-decreasing.

Proof: Assume there exists a value $D_{i,j}$ such that $D_{i,j} > D_{i+1,j+1}$. Then we have $D_{i,j} \ge D_{i+1,j+1} + 1$.

The substitution case of the recurrence would give $D_{i+1,j+1} \ge D_{i,j}$, so by the definition of DP we have either $D_{i+1,j+1} = D_{i+1,j} + 1$ or $D_{i+1,j+1} = D_{i,j+1} + 1$. That is, $D_{i+1,j} = D_{i+1,j+1} - 1$ or $D_{i,j+1} = D_{i+1,j+1} - 1$. Thus $D_{i,j} - D_{i+1,j} \ge 2$ or $D_{i,j} - D_{i,j+1} \ge 2$.

Both cases contradict another property: the value of any location in the DP table can be at most 1 larger than that of its neighbors.

• Example of the contradiction: $D_{i,j} = 3$, $D_{i+1,j} = 1$, $D_{i,j+1} = 1$, $D_{i+1,j+1} = 2$.

• Besides, the value of any location in the DP table can be at most 1 larger than that of its neighbors.

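Both properties can be checked numerically on a concrete table. The sketch below rebuilds a plain edit-distance table and tests the two properties on it; the function names are hypothetical, and a check on one example is of course an illustration, not a proof.

```python
def dp1_table(x, y):
    """Return table[j][i] = edit distance between x[:j] and y[:i]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(y) + 1):
        table[0][i] = i
    for j in range(len(x) + 1):
        table[j][0] = j
    for j in range(1, len(x) + 1):
        for i in range(1, len(y) + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            table[j][i] = min(table[j - 1][i] + 1,
                              table[j][i - 1] + 1,
                              table[j - 1][i - 1] + cost)
    return table


def diagonals_non_decreasing(table):
    """True if every step down a diagonal never decreases the value."""
    return all(table[j + 1][i + 1] >= table[j][i]
               for j in range(len(table) - 1)
               for i in range(len(table[0]) - 1))


def neighbours_within_one(table):
    """True if horizontally and vertically adjacent entries differ by at most 1."""
    rows, cols = len(table), len(table[0])
    horizontal = all(abs(table[j][i + 1] - table[j][i]) <= 1
                     for j in range(rows) for i in range(cols - 1))
    vertical = all(abs(table[j + 1][i] - table[j][i]) <= 1
                   for j in range(rows - 1) for i in range(cols))
    return horizontal and vertical


example = dp1_table("aaacga", "accgatgc")
print(diagonals_non_decreasing(example), neighbours_within_one(example))
```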

• Let us consider the following table. Question: Assuming that we have already found all locations (i, j) such that $D_{i,j} = 0$, what is the largest j on diagonal 1 such that $D_{i,j} = 1$?

• Certainly $D_{4,3} \ne 0$ because we have found all 0's. $D_{4,3}$ must be greater than 0 and can be at most 1 larger than $D_{4,2}$. Thus $D_{4,3} = 1$.

• Question: Can $D_{5,4} = 1$? Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$. This step can be done by a lowest common ancestor query, which takes O(1) time [BF2000]. We explain it in the next slide.

• [Table: partially filled DP table for T = gggtctac (i = 1 to 8) and P = gttaa (j = 1 to 5), with diagonal d = 3 marked]

Question: What is the longest common prefix of tac and taa? Answer: It is ta, whose length is 2.

This means that $D_{6,3}$ and $D_{7,4}$ are both 1. We find this longest common prefix by using a suffix tree.

• We concatenate the two strings gggtctac and gttaa and construct the suffix tree of the result, S = gggtctacgttaa, for finding the LCA of S1 and S2, where S1 = taa and S2 = tacgttaa.

• [Figure: suffix tree of S with S1 and S2 marked] The LCA of S1 and S2 gives their longest common prefix, so ta is found.
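The diagonal idea can be sketched in Python for the DP1 case (edit distance between two whole strings, bounded by k). This is an illustrative simplification, not the LV algorithm as published: the suffix-tree/LCA query is replaced by a character-by-character comparison in the hypothetical helper `common_extension`, and the names `bounded_edit_distance`, `prev` and `cur` are assumptions. For each error count e and each diagonal d = i - j, it records the furthest row i reachable with e errors.

```python
def common_extension(t, i, p, j):
    """Length of the longest common prefix of t[i:] and p[j:] (naive stand-in for the LCA query)."""
    length = 0
    while i + length < len(t) and j + length < len(p) and t[i + length] == p[j + length]:
        length += 1
    return length


def bounded_edit_distance(t, p, k):
    """Return the edit distance between t and p if it is at most k, otherwise None."""
    n, m = len(t), len(p)
    target = n - m                      # diagonal that contains the final cell (n, m)
    prev = {}                           # diagonal -> furthest i reached with e-1 errors
    for e in range(k + 1):
        cur = {}
        for d in range(max(-e, -m), min(e, n) + 1):
            if e == 0:
                i = 0
            else:
                candidates = []
                if d in prev:
                    candidates.append(prev[d] + 1)      # substitution, stay on diagonal d
                if d - 1 in prev:
                    candidates.append(prev[d - 1] + 1)  # from diagonal d-1: extra character of t
                if d + 1 in prev:
                    candidates.append(prev[d + 1])      # from diagonal d+1: extra character of p
                i = min(max(candidates), n, m + d)      # stay inside the table
            i += common_extension(t, i, p, i - d)       # slide along matches at no cost
            cur[d] = i
            if d == target and i == n:
                return e
        prev = cur
    return None


print(common_extension("gggtctac", 5, "gttaa", 2))  # tac versus taa, 0-based offsets
print(bounded_edit_distance("accgatgc", "aaacga", 5))
```

With a constant-time longest-common-extension query in place of `common_extension`, each (e, d) pair costs O(1), which is where the saving over filling the whole table comes from.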

• Algorithms using the DP1 method to solve Problem 1.

• A Hybrid Indexing Method for Approximate String Matching, Gonzalo Navarro and Ricardo Baeza-Yates, Journal of Discrete Algorithms, Vol. 1, No. 1, 2000, pp. 205-239. Advisor: Prof. R. C. T. Lee. Speaker: Y. K. Shieh

• We divide P into several pieces. After the pattern is divided, it has the property stated in the following lemma.

Lemma 1: Let A and B be two strings such that $ed(A, B) \le k$. Let $A = A_1 A_2 \cdots A_j$, for any $j \ge 1$. Then at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors.

By the above lemma, when j = k+1, at least one piece of P must appear exactly in any approximate occurrence, so we look for exact occurrences of the pieces of P in T.

• After we find all probable positions in T, we verify every substring of those positions.

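The probable positions of T are: 3, 10, 13, 15 and 16.

We use DP1 with window size m+k to verify whether any approximate match occurs between T and P at the above locations.

P = CAAG, P1 = CA, P2 = AG, k = 1

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T | G | A | C | A | C | G | G | A | C | C | A | A | A | G | C | A | G |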
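The filtering step of this example can be sketched as follows. The helper names `split_pattern` and `probable_positions` are hypothetical, and splitting P into k+1 consecutive pieces of near-equal length is an assumption that happens to reproduce P1 = CA and P2 = AG; the sketch also assumes m is at least k+1 so that no piece is empty.

```python
def split_pattern(p, k):
    """Split p into k+1 consecutive pieces of near-equal length."""
    pieces = k + 1
    base, extra = divmod(len(p), pieces)
    result, start = [], 0
    for index in range(pieces):
        length = base + (1 if index < extra else 0)
        result.append(p[start:start + length])
        start += length
    return result


def probable_positions(t, p, k):
    """1-based positions of t where some piece of p occurs exactly."""
    positions = set()
    for piece in split_pattern(p, k):
        start = t.find(piece)
        while start != -1:
            positions.add(start + 1)            # convert to the slides' 1-based positions
            start = t.find(piece, start + 1)    # continue after this occurrence's start
    return sorted(positions)


T = "GACACGGACCAAAGCAG"
print(split_pattern("CAAG", 1))
print(probable_positions(T, "CAAG", 1))
```

Only the neighborhoods of these positions need to be verified with DP1; the rest of the text cannot contain a match with at most k errors.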

• The algorithm opens windows of size m+k one step at a time. A window that does not contain any piece of P is ignored, which avoids an exhaustive search.

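• The probable positions of T are 3, 10, 13, 15, 16. With i = 1 and window size m+k, DP1 is applied to the window T(1, 5) = GACAC and P = CAAG with k = 1. No approximate matching with k = 1 is found.

|   |   | G | A | C | A | C |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 2 | 3 | 4 |
| A | 2 | 2 | 1 | 2 | 2 | 3 |
| A | 3 | 3 | 2 | 2 | 2 | 3 |
| G | 4 | 3 | 3 | 3 | 3 | 3 |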
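The filter-then-verify loop can be sketched as below. The function names and the choice of opening one window per starting position are assumptions made for illustration; the slides only say that windows of size m+k are opened step by step and that windows without any piece of P are ignored.

```python
def dp1_last_row(p, w):
    """Entry `length` is the edit distance between p and the prefix w[:length]."""
    prev = list(range(len(w) + 1))
    for j in range(1, len(p) + 1):
        cur = [j] + [0] * len(w)
        for length in range(1, len(w) + 1):
            cost = 0 if p[j - 1] == w[length - 1] else 1
            cur[length] = min(prev[length] + 1,
                              cur[length - 1] + 1,
                              prev[length - 1] + cost)
        prev = cur
    return prev


def filtered_search(t, p, k, pieces):
    """Return 1-based (start, end) pairs of substrings of t within k edits of p."""
    m = len(p)
    matches = []
    for i in range(1, len(t) + 1):
        window = t[i - 1:i - 1 + m + k]
        if not any(piece in window for piece in pieces):
            continue                            # no piece of P here: skip the DP entirely
        row = dp1_last_row(p, window)
        matches.extend((i, i + length - 1)
                       for length in range(1, len(window) + 1)
                       if row[length] <= k)
    return matches


T = "GACACGGACCAAAGCAG"
print(dp1_last_row("CAAG", "GACAC"))            # last row of the table for the window at i = 1
print(filtered_search(T, "CAAG", 1, ["CA", "AG"]))
```

The first printed row has no entry of value 1 or less, which is the "no approximate matching found" outcome for the window at i = 1; windows that contain neither CA nor AG never reach the DP step at all.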
