Two Different Approximate String Matching Problems and Their Algorithms
Speakers: C. W. Lu and Y. K. Shie
Advisor: Richard Chia-Tung Lee

Page 1: Two Different Approximate String Matching Problems and Their Algorithms

1

Two Different Approximate String Matching Problems and Their Algorithms

Speakers: C. W. Lu and Y. K. Shie
Advisor: Richard Chia-Tung Lee

Page 2: Two Different Approximate String Matching Problems and Their Algorithms

2

• Two different definitions of the approximate string matching problem:

– Given a text $T = t_1 t_2 \ldots t_n$, a pattern $P = p_1 p_2 \ldots p_m$ and an error bound k, find all the substrings of T whose edit distances with P are less than or equal to k. (Denoted as Problem 1)

– Given a text $T = t_1 t_2 \ldots t_n$, a pattern $P = p_1 p_2 \ldots p_m$ and an error bound k, find all positions i of T such that there exists a suffix of T(1, i) whose edit distance with P is less than or equal to k. (Denoted as Problem 2)

Page 3: Two Different Approximate String Matching Problems and Their Algorithms

3

An example of Problem 1:

T: a b a b c d b c d d
   1 2 3 4 5 6 7 8 9 10

P: abcd

k = 1

Output:

T(2, 6)=babcd T(3, 5)=abc

T(3, 6)=abcd T(3, 7)=abcdb

T(4, 6)=bcd T(6, 9)=dbcd

T(7, 9)=bcd

Page 4: Two Different Approximate String Matching Problems and Their Algorithms

4

An example of Problem 2:

T: a b a b c d b c d d
   1 2 3 4 5 6 7 8 9 10

P: abcd

k = 1

Output:

Positions of T:

5, 6, 7 and 9.

Page 5: Two Different Approximate String Matching Problems and Their Algorithms

5

Computing the edit distance between two strings X and Y by using the dynamic programming method:

$$EDIT[i, j] = \min \begin{cases} EDIT[i-1, j] + 1 & \text{(Delete)} \\ EDIT[i, j-1] + 1 & \text{(Insert)} \\ EDIT[i-1, j-1] + \delta(y[i], x[j]) & \text{(Substitute)} \end{cases}$$

where $\delta(y[i], x[j]) = 0$ if $y[i] = x[j]$ and $\delta(y[i], x[j]) = 1$ otherwise, with $EDIT[i, 0] = i$ and $EDIT[0, j] = j$. Here i indexes Y and j indexes X.

Let us denote this method as DP1.
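The DP1 recurrence can be sketched in Python as follows. This is a minimal illustration, not the speakers' code: the function names are chosen here, and the table is stored as `table[j][i]` with j running over X and i over Y.

```python
def edit_distance_table(x, y):
    """Return the DP1 table, where table[j][i] = ED(x[:j], y[:i])."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(rows):
        table[j][0] = j              # boundary: j characters of X against nothing
    for i in range(cols):
        table[0][i] = i              # boundary: i characters of Y against nothing
    for j in range(1, rows):
        for i in range(1, cols):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            table[j][i] = min(table[j - 1][i] + 1,         # skip a character of X
                              table[j][i - 1] + 1,         # skip a character of Y
                              table[j - 1][i - 1] + cost)  # match or substitute
    return table


def edit_distance(x, y):
    """Edit distance between the complete strings x and y."""
    return edit_distance_table(x, y)[len(x)][len(y)]
```

For instance, `edit_distance("aaacga", "accgatgc")` evaluates the bottom-right cell of the table for X = aaacga and Y = accgatgc.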

Page 6: Two Different Approximate String Matching Problems and Their Algorithms

6

Example:

 i       0  1  2  3  4  5  6  7  8
 j   Y      a  c  c  g  a  t  g  c
 0   X   0  1  2  3  4  5  6  7  8
 1   a   1  0  1  2  3  4  5  6  7
 2   a   2  1  1  2  3  3  4  5  6
 3   a   3  2  2  2  3  3  4  5  6
 4   c   4  3  2  2  3  4  4  5  5
 5   g   5  4  3  3  2  3  4  4  5
 6   a   6  5  4  4  3  2  3  4  5

We can find the edit distances between all prefixes of Y and all prefixes of X from this table.

Page 7: Two Different Approximate String Matching Problems and Their Algorithms

7

• Problem 1 can be solved by computing the edit distances between P and the prefixes of T(i, i+m-1+k) for all 1 ≤ i ≤ n. Each window takes O(m(m+k)) = O(m²) time because k < m, so the total time complexity is O(m²n). (m = size of P and n = size of T)

• That is, for every i, we perform DP1 on T(i, i+m-1+k) and P, for i=1 to n.

• Thus, we open a window of size m+k all the time and slide the window.
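The exhaustive sliding-window method described above can be sketched as follows. The helper names are hypothetical, positions are reported 1-based as on the slides, and windows near the end of T are simply clipped, which is an assumption of this sketch.

```python
def dp1_last_row(p, window):
    """Last row of the DP1 table: entry x is ED(p, window[:x])."""
    prev = list(range(len(window) + 1))      # row for the empty prefix of p
    for j in range(1, len(p) + 1):
        cur = [j] + [0] * len(window)
        for x in range(1, len(window) + 1):
            cost = 0 if p[j - 1] == window[x - 1] else 1
            cur[x] = min(prev[x] + 1, cur[x - 1] + 1, prev[x - 1] + cost)
        prev = cur
    return prev


def problem1_matches(t, p, k):
    """All 1-based (start, end) pairs with ED(T(start, end), p) <= k."""
    m = len(p)
    matches = []
    for i in range(1, len(t) + 1):
        window = t[i - 1:i - 1 + m + k]      # window of size m+k, clipped at the end of T
        row = dp1_last_row(p, window)
        for x in range(1, len(window) + 1):  # every non-empty prefix of the window
            if row[x] <= k:
                matches.append((i, i + x - 1))
    return matches
```

Running `problem1_matches("ababcdbc", "abcd", 1)` walks through the same windows as the step-by-step example on the following slides.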

Page 8: Two Different Approximate String Matching Problems and Their Algorithms

8

T: a b a b c d b c

P: a b c d

k = 1

1 2 3 4 5 6 7 8

m+k

i 0 1 2 3 4 5

j a b a b c

0 P 0 1 2 3 4 5

1 a 1 0 1 2 3 4

2 b 2 1 0 1 2 3

3 c 3 2 1 1 2 2

4 d 4 3 2 2 2 3

Output: ∅ (no match)

Page 9: Two Different Approximate String Matching Problems and Their Algorithms

9

T: a b a b c d b c

P: a b c d

k = 1

1 2 3 4 5 6 7 8

m+k

i 0 1 2 3 4 5

j b a b c d

0 P 0 1 2 3 4 5

1 a 1 1 1 2 3 4

2 b 2 1 2 1 2 3

3 c 3 2 2 2 1 2

4 d 4 3 3 3 2 1

Output: T(2, 6)=babcd

Page 10: Two Different Approximate String Matching Problems and Their Algorithms

10

T: a b a b c d b c

P: a b c d

k = 1

1 2 3 4 5 6 7 8

m+k

i 0 1 2 3 4 5

j a b c d b

0 P 0 1 2 3 4 5

1 a 1 0 1 2 3 4

2 b 2 1 0 1 2 3

3 c 3 2 1 0 1 2

4 d 4 3 2 1 0 1

Output: T(3, 5)=abc, T(3, 6)=abcd, T(3, 7)=abcdb

Page 11: Two Different Approximate String Matching Problems and Their Algorithms

11

T: a b a b c d b c

P: a b c d

k = 1

1 2 3 4 5 6 7 8

m+k

i 0 1 2 3 4 5

j b c d b c

0 P 0 1 2 3 4 5

1 a 1 1 2 3 4 5

2 b 2 1 2 3 3 4

3 c 3 2 1 2 3 3

4 d 4 3 2 1 2 3

Output: T(4, 6)=bcd

Page 12: Two Different Approximate String Matching Problems and Their Algorithms

12

T: a b a b c d b c

P: a b c d

k = 1

1 2 3 4 5 6 7 8

m+k

i 0 1 2 3 4

j c d b c

0 P 0 1 2 3 4

1 a 1 1 2 3 4

2 b 2 2 2 2 3

3 c 3 2 3 3 2

4 d 4 3 2 3 3

Output: ∅ (no match)

Page 13: Two Different Approximate String Matching Problems and Their Algorithms

13

T: a b a b c d b c

P: a b c d

k = 1

1 2 3 4 5 6 7 8

m+k

i 0 1 2 3

j d b c

0 P 0 1 2 3

1 a 1 1 2 3

2 b 2 2 1 2

3 c 3 3 2 1

4 d 4 3 3 2

Output: ∅ (no match)

Page 14: Two Different Approximate String Matching Problems and Their Algorithms

14

• Some algorithms try to avoid exhaustive computing in this way.

• For example, the Navarro and Baeza-Yates Algorithm [NB2000], the Fredriksson and Navarro Algorithm [FN2004], and the Lu and Lee Algorithm [LL2008]. We shall explain those algorithms later.

Page 15: Two Different Approximate String Matching Problems and Their Algorithms

15

• Another approach which does not use this sliding window approach is the Wu and Manber Algorithm [WM92].

• Actually, the idea proposed in [NB2000] was already mentioned in [WM92], but only briefly, as if it were trivial.

Page 16: Two Different Approximate String Matching Problems and Their Algorithms

16

• To solve Problem 2, we may use another DP algorithm, called DP2, which will be explained as follows.

Page 17: Two Different Approximate String Matching Problems and Their Algorithms

17

Given strings Y and X, computing the minimal ED(S, X), where S is a suffix of the substring Y(1, i), for all 1 ≤ i ≤ |Y|:

$$SE[i, j] = \min \begin{cases} SE[i-1, j] + 1 \\ SE[i, j-1] + 1 \\ SE[i-1, j-1] + \delta(y[i], x[j]) \end{cases}$$

where $\delta(y[i], x[j]) = 0$ if $y[i] = x[j]$ and $\delta(y[i], x[j]) = 1$ otherwise, with $SE[0, j] = j$ and $SE[0, 0] = SE[i, 0] = 0$. Here i indexes Y and j indexes X.

Let us denote this method as DP2.

Note that $EDIT[i, 0] = i$ in DP1, whereas $SE[i, 0] = 0$ in DP2.
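A minimal Python sketch of DP2 follows. The only change from DP1 is the boundary SE[i, 0] = 0, which lets a match start at any position of Y. The function names are chosen for this illustration, and only the last row is kept because Problem 2 needs nothing else.

```python
def dp2_last_row(x, y):
    """Entry i is the minimum ED(S, x) over all suffixes S of y[:i]."""
    prev = [0] * (len(y) + 1)            # SE[i, 0] = 0: a match may start anywhere in y
    for j in range(1, len(x) + 1):
        cur = [j] + [0] * len(y)         # SE[0, j] = j
        for i in range(1, len(y) + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            cur[i] = min(prev[i] + 1, cur[i - 1] + 1, prev[i - 1] + cost)
        prev = cur
    return prev


def problem2_positions(t, p, k):
    """1-based positions i of t where some suffix of T(1, i) is within distance k of p."""
    row = dp2_last_row(p, t)
    return [i for i in range(1, len(t) + 1) if row[i] <= k]
```

The whole text is processed in one pass of O(mn) time, with no sliding window.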

Page 18: Two Different Approximate String Matching Problems and Their Algorithms

18

Example:

 i       0  1  2  3  4  5  6  7  8
 j   Y      a  c  c  g  a  t  g  c
 0   X   0  0  0  0  0  0  0  0  0
 1   a   1  0  1  1  1  0  1  1  1
 2   a   2  1  1  2  2  1  1  2  2
 3   a   3  2  2  2  3  2  2  2  3
 4   c   4  3  2  2  3  3  3  3  2
 5   g   5  4  3  3  2  3  4  3  3
 6   a   6  5  4  4  3  2  3  4  4

Page 19: Two Different Approximate String Matching Problems and Their Algorithms

19

In the table of the previous example, the entry at i = 5, j = 6 is 2. It indicates that there is a substring of Y ending at location 5, namely accga, whose edit distance with X is 2, the smallest among all substrings of Y ending at location 5.

Page 20: Two Different Approximate String Matching Problems and Their Algorithms

20


We need to trace back if we want to know which substring of Y is our solution.

Page 21: Two Different Approximate String Matching Problems and Their Algorithms

21

If we set k=3, then from the last row of the DP2 table shown earlier, we can see that locations 4, 5 and 6 are the solutions for Problem 2. If k=4, the solutions are locations 2 to 8.

Page 22: Two Different Approximate String Matching Problems and Their Algorithms

22

• Obviously, Problem 2 can be solved by DP2, and the time complexity is O(mn). Note that no sliding window is needed if DP2 is used directly.

• Again, if some of the positions of T could be ignored, this method would be more efficient. Some algorithms try to do this, for example, the Navarro and Baeza-Yates Algorithm [NB99], the Tarhio and Ukkonen Algorithm [TU93] and the algorithm in Z. H. Pan’s thesis.

• We shall explain them later.

Page 23: Two Different Approximate String Matching Problems and Their Algorithms

23

• HN Algorithm (Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyro and Gonzalo Navarro, Algorithmica, 2005 Vol 4, No 3.)

• For a substring S of T, they use the DP2 method to find the minimum ED(S, P’) among all substrings P’ of P.

• The HN Algorithm solves Problem 1, but it also uses the DP2 method. We will explain this in the next slide.

Page 24: Two Different Approximate String Matching Problems and Their Algorithms

24

For a window of size m-k, if there exists a substring S in this window such that its edit distance with every substring of P is greater than k, we move P to S.

This rule is called Rule 5 by Lee’s group.

[Figure: the text T with a window of size m - k that contains the substring S, and the pattern P aligned below the window.]

HN Algorithm (This paper has not been reported yet. It is rumored Mr. Ou-Dee should study and report this. Obviously, he is busily doing some crazy things.)

Page 25: Two Different Approximate String Matching Problems and Their Algorithms

25

T: a b a e e a a b c d e a d a
   1 2 3 4 5 6 7 8 9 10 11 12 13 14

P: a b a b c d

k=1

The window of size m-k = 5 is T(1, 5) = abaee. In this case, both the pattern and the text are reversed, and DP2 is applied:

         d  c  b  a  b  a
      0  0  0  0  0  0  0
  e   1  1  1  1  1  1  1
  e   2  2  2  2  2  2  2    (> k)
  a   3
  b   4
  a   5

Page 26: Two Different Approximate String Matching Problems and Their Algorithms

26

• Both DP1 and DP2 can be improved by the LV algorithm (G. Landau and U. Vishkin, “Fast Parallel and Serial Approximate String Matching”, Journal of Algorithms, Vol. 10 (1989), pp. 157-169), which runs in O(nk) time.

• This algorithm tries not to do the entire computation of the DP table.

Page 27: Two Different Approximate String Matching Problems and Their Algorithms

27

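• Diagonal d is defined as the set of all $D_{i,j}$ with d = i - j, where $D_{i,j}$ is the value at (i, j) in the DP table.

 i       0  1  2  3
 j          a  b  c
 0       0  0  0  0
 1   b   1  1  0  1
 2   c   2  2  1  0

Diagonal 0 consists of $D_{0,0}$, $D_{1,1}$ and $D_{2,2}$. Diagonal 2 consists of $D_{2,0}$ and $D_{3,1}$.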

Page 28: Two Different Approximate String Matching Problems and Their Algorithms

28

• This algorithm is based on the following observations:

– The values of the elements on the same diagonal are non-decreasing.

– The value of every element on diagonal d is decided by the elements on diagonals d, d-1 and d+1.

[Figure: the cell $D_{i,j}$ on diagonal d together with $D_{i-1,j-1}$, $D_{i-1,j}$ and $D_{i,j-1}$, which lie on diagonals d, d-1 and d+1; the three moves are labelled delete, insert and substitution.]

Page 29: Two Different Approximate String Matching Problems and Their Algorithms

29

– The values of the elements on the same diagonal are non-decreasing.

Proof: Assume there exists a value $D_{i,j}$ such that $D_{i,j} > D_{i+1,j+1}$.

Then, we have $D_{i,j} \ge D_{i+1,j+1} + 1$.

The diagonal case $D_{i+1,j+1} = D_{i,j} + \delta \ge D_{i,j}$ is ruled out by the assumption. By the definition of the DP, we therefore have either $D_{i+1,j+1} = D_{i+1,j} + 1$ or $D_{i+1,j+1} = D_{i,j+1} + 1$. That is, $D_{i+1,j} = D_{i+1,j+1} - 1$ or $D_{i,j+1} = D_{i+1,j+1} - 1$. Thus, $D_{i,j} - D_{i+1,j} \ge 2$ or $D_{i,j} - D_{i,j+1} \ge 2$.

Both cases contradict another property: the value of any location in the DP table can be at most 1 larger than that of its neighbors.
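The two properties used in the proof can be checked numerically on any DP2 table. The sketch below is only a sanity check on one example, not a proof, and its function names are introduced here for illustration.

```python
def dp2_table(x, y):
    """Full DP2 table: table[j][i] with j over x (pattern) and i over y (text)."""
    table = [[0] * (len(y) + 1)]
    for j in range(1, len(x) + 1):
        row = [j] + [0] * len(y)
        for i in range(1, len(y) + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            row[i] = min(table[j - 1][i] + 1, row[i - 1] + 1, table[j - 1][i - 1] + cost)
        table.append(row)
    return table


def diagonals_non_decreasing(table):
    """True if every cell is <= the next cell on the same diagonal."""
    return all(table[j + 1][i + 1] >= table[j][i]
               for j in range(len(table) - 1)
               for i in range(len(table[0]) - 1))


def neighbours_differ_by_at_most_one(table):
    """True if horizontally and vertically adjacent cells differ by at most 1."""
    rows, cols = len(table), len(table[0])
    horizontal = all(abs(table[j][i + 1] - table[j][i]) <= 1
                     for j in range(rows) for i in range(cols - 1))
    vertical = all(abs(table[j + 1][i] - table[j][i]) <= 1
                   for j in range(rows - 1) for i in range(cols))
    return horizontal and vertical
```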

Page 30: Two Different Approximate String Matching Problems and Their Algorithms

30

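[Figure: a 2×2 block of cells with $D_{i,j} = 3$, $D_{i+1,j} = 1$, $D_{i,j+1} = 1$ and $D_{i+1,j+1} = 2$, illustrating the contradiction in the proof.]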

Page 31: Two Different Approximate String Matching Problems and Their Algorithms

31

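• Besides, the value of any location in the DP table can be at most 1 larger than that of its neighbors.

[Figure: the cell $D_{i,j}$ on diagonal d together with $D_{i-1,j-1}$, $D_{i-1,j}$ and $D_{i,j-1}$, which lie on diagonals d, d-1 and d+1; the three moves are labelled delete, insert and substitution.]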

Page 32: Two Different Approximate String Matching Problems and Their Algorithms

32

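• Let us consider the following table.

• Question: Assuming that we have already found all locations (i, j) such that $D_{i,j} = 0$, what is the largest j on diagonal 1 such that $D_{i,j} = 1$?

[Table: the text T = gggtcta along i = 1, ..., 7 and the pattern P = gttc along j = 1, ..., 4. Only the 0 entries are filled in; the entry $D_{4,3}$ on diagonal d = 1 is marked “?”.]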

Page 33: Two Different Approximate String Matching Problems and Their Algorithms

33

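• Let us consider the following table.

• Certainly $D_{4,3} \ne 0$ because we have found all 0’s. $D_{4,3}$ must be greater than 0 and can be at most 1 larger than $D_{4,2}$. Thus $D_{4,3} = 1$.

[Table: the text T = gggtcta along i = 1, ..., 7 and the pattern P = gttc along j = 1, ..., 4. Only the 0 entries are filled in; the entry $D_{4,3}$ on diagonal d = 1 is marked “?”.]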

Page 34: Two Different Approximate String Matching Problems and Their Algorithms

34

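• Question: Can $D_{5,4} = 1$?

– Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$.

[Table: the text T = gggtcta along i = 1, ..., 7 and the pattern P = gttc along j = 1, ..., 4, with the 0 entries and $D_{4,3} = 1$ filled in; the entry $D_{5,4}$ on diagonal d = 1 is marked “?”.]

• This step can be performed by a lowest common ancestor query, which takes O(1) time [BF2000]. We explain it in the next slide.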

Page 35: Two Different Approximate String Matching Problems and Their Algorithms

35

[Table: the text T = gggtctac along i = 1, ..., 8 and the pattern P = gttaa along j = 1, ..., 5, with the known 0 and 1 entries filled in; diagonal d = 3 is highlighted.]

Question: What is the longest common prefix of tac and taa?
Answer: It is ta, whose length is 2.

This means that $D_{6,3}$ and $D_{7,4}$ are both 1. We find this longest common prefix by using a suffix tree.
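The extension step along a diagonal amounts to a longest-common-prefix query. The sketch below answers that query by direct character comparison, which takes time proportional to the prefix length; it is a stand-in for the O(1) suffix-tree/LCA query the slides refer to, and both function names are hypothetical.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def extend_along_diagonal(t, p, i, j):
    """Cells (1-based) after (i, j) on the same diagonal that keep the value of D[i][j].

    The cells (i+1, j+1), (i+2, j+2), ... inherit the value as long as the
    text and pattern characters keep matching.
    """
    length = lcp_length(t[i:], p[j:])    # t[i:] starts at t_{i+1}, p[j:] at p_{j+1}
    return [(i + s, j + s) for s in range(1, length + 1)]
```

With T = gggtctac and P = gttaa, extending from cell (5, 2) compares tac with taa, as in the question above.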

Page 36: Two Different Approximate String Matching Problems and Their Algorithms

36

[Table: the text T = gggtctac along i = 1, ..., 8 and the pattern P = gttaa along j = 1, ..., 5, with the known 0 and 1 entries filled in; diagonal d = 3 is highlighted.]

We concatenate the two strings gggtctac and gttaa into gggtctacgttaa and construct its suffix tree for finding the LCA of S1 and S2, where

S1 = taa

S2 = tacgttaa

Page 37: Two Different Approximate String Matching Problems and Their Algorithms

37

[Figure: the suffix tree of S = gggtctacgttaa$. The lowest common ancestor of the leaves for S1 = taa$ and S2 = tacgttaa$ is reached by the path labelled ta.]

ta is found.

Page 38: Two Different Approximate String Matching Problems and Their Algorithms

38

Algorithms using the DP1 method

to solve Problem 1.

Page 39: Two Different Approximate String Matching Problems and Their Algorithms

39

A Hybrid Indexing Method for Approximate String Matching

Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates

Advisor: Prof. R. C. T. Lee Speaker: Y. K. Shieh

Page 40: Two Different Approximate String Matching Problems and Their Algorithms

40

We divide P into several pieces. After the pattern is divided, it has the property stated in the following Lemma 1.

Lemma 1
Let A and B be two strings such that ed(A, B) ≤ k. Let $A = A_1 x_1 A_2 x_2 \ldots x_{j-1} A_j$, for any j ≥ 1. Then at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors.

By the above lemma, when j = k+1, we want to find whether any piece $P_a$ (1 ≤ a ≤ j) of P appears exactly in T or not.
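The filtering step implied by Lemma 1 with j = k+1 can be sketched as below. How the pattern is cut into pieces is an assumption of this sketch (near-equal lengths, longer pieces first), as are the function names; the exact search uses Python's built-in `str.find` rather than any particular multi-pattern matcher.

```python
def split_pattern(p, pieces):
    """Cut p into `pieces` consecutive parts of near-equal length (longer parts first)."""
    base, extra = divmod(len(p), pieces)
    parts, start = [], 0
    for a in range(pieces):
        length = base + (1 if a < extra else 0)
        parts.append(p[start:start + length])
        start += length
    return parts


def probable_positions(t, p, k):
    """Sorted 1-based positions of t where some piece of p occurs exactly.

    Assumes k + 1 <= len(p), so that no piece is empty.
    """
    positions = set()
    for piece in split_pattern(p, k + 1):
        start = t.find(piece)
        while start != -1:
            positions.add(start + 1)            # convert to 1-based
            start = t.find(piece, start + 1)    # allow overlapping occurrences
    return sorted(positions)
```

Only windows that contain one of these positions need to be verified with DP1.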

Page 41: Two Different Approximate String Matching Problems and Their Algorithms

41

T: G A C A C G G A C C A A A G C A G
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

P: C A A G

k = 1

P1 = CA, P2 = AG

After we find all probable positions in T, we verify the substrings at those positions.

The probable positions of T are: 3, 10, 13, 15 and 16.

We use DP1 with window size m+k to verify whether any approximate match of P occurs in T at the above locations.

Page 42: Two Different Approximate String Matching Problems and Their Algorithms

42

• The algorithm would open windows whose sizes are all equal to m+k step by step. If a window does not contain any piece of P, that window is ignored. Thus it avoids an exhaustive search.

Page 43: Two Different Approximate String Matching Problems and Their Algorithms

43

The probable positions of T are 3, 10, 13, 15, 16

C A A G

G A C A C G G A C C A A A G C A G1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

m+k

G A C A C

0 1 2 3 4 5

C 1 1 2 2 3 4

A 2 2 1 2 2 3

A 3 3 2 2 2 3

G 4 3 3 3 3 3

k = 1

No approximate matching with k=1 found.

T

P

i=1. Window size=m+k

DP1 is used.

Page 44: Two Different Approximate String Matching Problems and Their Algorithms

44

C A A G

G A C A C G G A C C A A A G C A G1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

m+k

A C A C G

0 1 2 3 4 5

C 1 1 1 2 3 4

A 2 1 2 1 2 3

A 3 2 2 2 2 3

G 4 3 3 3 3 2

The probable positions of T are: 3, 10, 13, 15, 16 k = 1

No approximate matching with k=1 found.

T

P

i=2.

DP1 is used.

Page 45: Two Different Approximate String Matching Problems and Their Algorithms

45

C A A G

G A C A C G G A C C A A A G C A G1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

m+k

C A C G G

0 1 2 3 4 5

C 1 0 1 2 3 4

A 2 1 0 1 2 3

A 3 2 1 1 2 3

G 4 3 2 2 1 2

The probable positions of T are: 3, 10, 13, 15, 16

CACG is found.

k = 1

T

P

i=3.

DP1 is used.

Page 46: Two Different Approximate String Matching Problems and Their Algorithms

46

C A A G

G A C A C G G A C C A A A G C A G1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

m+k

The probable positions of T are: 3, 10, 13, 15, 16. k=1.

This window does not include any probable position.Therefore we can ignore this window.

T

P

i=4.

Page 47: Two Different Approximate String Matching Problems and Their Algorithms

47

C A A G

G A C A C G G A C C A A A G C A G1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

m+k

The probable positions of T are: 3, 10, 13, 15, 16.

The window does not include any probable position.Therefore we can shift the window directly.

T

P

i=5.

Page 48: Two Different Approximate String Matching Problems and Their Algorithms

48

C A A G

G A C A C G G A C C A A A G C A G1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

m+k

A A G C A

0 1 2 3 4 5

C 1 1 2 3 3 4

A 2 1 1 2 3 3

A 3 2 1 2 3 3

G 4 3 2 1 2 3

The probable positions of T are: 3, 10, 13, 15, 16 k = 1.

AAG is found.

T

P

i=12.

DP1 is used.

Page 49: Two Different Approximate String Matching Problems and Their Algorithms

49

Average-Optimal Multiple Approximate String Matching

Kimmo Fredriksson and Gonzalo Navarro

ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, pp. 1-47

Advisor: Professor R. C. T. Lee

Speaker: K. W. Liu

Page 50: Two Different Approximate String Matching Problems and Their Algorithms

50

This algorithm uses a checking window.

For a checking window of size m-k, if there exists a substring S in this window such that its edit distance with every substring of P is greater than k, we move P to S. We are using Rule 5 now.

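Our algorithm scans from the right as shown below:

[Figure: the text T with a checking window of size m - k that contains the substring S, and the pattern P aligned below the window.]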

Page 51: Two Different Approximate String Matching Problems and Their Algorithms

51

Note that the way to move the window ensures us that we do not miss anything and the size of the window needs only to be m+k.

Besides, during the checking phase, the window size is m-k.

But, how do we find such an S?

We use a very useful lemma.

Page 52: Two Different Approximate String Matching Problems and Their Algorithms

52

Lemma
Consider strings Q and P. Let Q be divided into $q_1, q_2, \ldots, q_n$ as shown below:

$q_n$ … $q_2$ $q_1$

For each $q_i$, let $p_i$ be the substring of P such that $ED(q_i, p_i)$ is the smallest among all substrings of P.

If $\sum_{i=1}^{n} ED(q_i, p_i) > k$, then $ED(Q, P) > k$.

Page 53: Two Different Approximate String Matching Problems and Their Algorithms

53

Divide a window of T into pieces as shown below.

T: … $t_2$ $t_1$

P: $p_1$ $p_2$

If $\sum_{i=1}^{n} ED(t_i, p_i) > k$, then $\sum_{i=1}^{n} ED(t_i, p'_i) > k$ for any pieces $p'_i$ of P, because each $ED(t_i, p_i)$ is the smallest.

Page 54: Two Different Approximate String Matching Problems and Their Algorithms

54

To apply this lemma, we open a checking window in T with size m-k, according to Rule 5. We now divide this window into substrings with length 2, called 2-grams. Note that for 2-grams, the Hamming distance is equal to edit distance.

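[Figure: the checking window of size m-k in T, divided from the right into pieces t1, t2, …, and the pattern P with the corresponding closest substrings p1, p2, ….]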
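The checking phase described above can be sketched as follows. The minimum distance between a q-gram and all substrings of P is computed here with a small DP2 pass on every call, which is an assumption made for brevity (a table precomputed once for all q-grams would serve the same purpose). The function names and the return convention are hypothetical.

```python
def min_ed_to_substrings(gram, p):
    """Smallest ED(gram, P') over all substrings P' of p (DP2 with gram as the pattern)."""
    prev = [0] * (len(p) + 1)
    for j in range(1, len(gram) + 1):
        cur = [j] + [0] * len(p)
        for i in range(1, len(p) + 1):
            cost = 0 if gram[j - 1] == p[i - 1] else 1
            cur[i] = min(prev[i] + 1, cur[i - 1] + 1, prev[i - 1] + cost)
        prev = cur
    return min(prev)


def scan_window(window, p, k, q=2):
    """Read q-grams of the checking window from right to left.

    Returns (total, start): `total` is the accumulated sum of minimum distances
    and `start` is the 0-based index in `window` of the last q-gram read when the
    sum first exceeds k, or None if the sum never exceeds k.
    """
    total = 0
    end = len(window)
    while end - q >= 0:
        total += min_ed_to_substrings(window[end - q:end], p)
        if total > k:
            return total, end - q        # S = window[end - q:]; P can be moved past it
        end -= q
    return total, None                   # no S found; verification with DP1 is needed
```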

Page 55: Two Different Approximate String Matching Problems and Their Algorithms

55

Example T = ctagggaataatttacaatt P = ttaatatat k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k →

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “gg” and all substrings of P = 2

∴ Σ = 2 > k already. According to Rule 5, we move P after S.

S

Page 56: Two Different Approximate String Matching Problems and Their Algorithms

56

Example T = ctagggaataatttacaatt P = ttaatatat k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k →

Smallest edit distance between “tt” and all substrings of P = 0

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “at” and all substrings of P = 0

Smallest edit distance between “ga” and all substrings of P = 1

∴ Σ = 1 = k

c t a g g g a a t a a t t t a c a a t t

← m-k →

No S is found. We extend the window to size m+k and examine whether there is a prefix of the window whose edit distance with P is smaller than or equal to k by using DP1.

S

Page 57: Two Different Approximate String Matching Problems and Their Algorithms

57

c t a g g g a a t a a t t t a c a a t t

← m-k →

← m+k →

Example T = ctagggaataatttacaatt P = ttaatatat k = 1

No prefix of the extended window whose edit distance with P is smaller than or equal to k can be found. After checking, no matter whether a solution is found or not, we can only move P one step.

Page 58: Two Different Approximate String Matching Problems and Their Algorithms

58

Example T = ctagggaataatttacaatt P = ttaatatat k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k →

c t a g g g a a t a a t t t a c a a t t

← m-k →

Page 59: Two Different Approximate String Matching Problems and Their Algorithms

59

An Approximate String Matching Algorithm Based upon the Candidate

Elimination Method

C. W. Lu Thesis

Page 60: Two Different Approximate String Matching Problems and Their Algorithms

60

Let $S^i_x$ be the substring $t_i t_{i+1} \ldots t_{i+x-1}$ of T, that is, the substring of length x starting at location i.

Example:

T: a a a a c a a c a b a c b a c a a a a a
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

$S^5_3$ = caa.

Page 61: Two Different Approximate String Matching Problems and Their Algorithms

61

[Figure: the text T with a window of size m+k starting at location i; the possible solution end points lie between offsets m-k and m+k.]

• For every location i of T, we only consider the substrings $S^i_{m-k}, S^i_{m-k+1}, \ldots, S^i_{m+k-1}, S^i_{m+k}$.

• We use DP1 to decide whether there exists any prefix of the window whose edit distance with P is smaller than or equal to k.

• The solution size must be between m-k and m+k.

• Therefore, any solution starting from i must end in the range from i+m-1-k to i+m-1+k. The window size is therefore m+k.

Page 62: Two Different Approximate String Matching Problems and Their Algorithms

62

In the following, we shall show that we may determine that no solution can be found in a window.

Page 63: Two Different Approximate String Matching Problems and Their Algorithms

63

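Let $N_c(S)$ be the number of occurrences of the character c in S.

If $N_c(S^i_{m-k}) = y$, then $N_c(S^i_{m-k+1}) \ge y$, $N_c(S^i_{m-k+2}) \ge y$, …, and $N_c(S^i_{m+k}) \ge y$.

Example:

T: a a a a c a a c a b a c b a c a a a a a
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m=9 and k = 2.

Na(T(1, 7)) = 6, Nb(T(1, 7)) = 0, Nc(T(1, 7)) = 1.

[Figure: brackets over T mark the substrings of lengths m-2, m-1, m, m+1 and m+2 starting at location 1.]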

Page 64: Two Different Approximate String Matching Problems and Their Algorithms

64

Lemma 2

Let $C_1$ be the set of all characters c such that $N_c(S^i_{m-k}) > N_c(P)$. If

$$\sum_{c \in C_1} \left( N_c(S^i_{m-k}) - N_c(P) \right) > k,$$

then $ED(S^i_x, P) > k$ for $m-k \le x \le m+k$.

Thus, to use Lemma 2, we use a checking window whose size is m-k.

Page 65: Two Different Approximate String Matching Problems and Their Algorithms

65

Example

T b a b b a b b c a b a c b a c a a a a a

P a c c a c a b c b k = 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m-k

Na(P) = 3, Nb(P) = 2, Nc(P) = 4.

Na(T(1, 7)) = 2, Nb(T(1, 7)) = 5, Nc(T(1, 7)) = 0.

The number of the character b in T(1, 7) is larger than that in P.

Nb(T(1, 7)) – Nb(P) = 5 – 2 = 3 > k.

Thus, the edit distances of all substrings starting at location 1 with P are larger than k.

Page 66: Two Different Approximate String Matching Problems and Their Algorithms

66

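If $N_c(S^i_{m+k}) = y$, then $N_c(S^i_{m-k}) \le y$, $N_c(S^i_{m-k+1}) \le y$, …, and $N_c(S^i_{m+k-1}) \le y$.

Example:

T: a a a a c a a c a b a c b a c a a a a a
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m=9 and k = 2.

Na(T(1, 11)) = 8, Nb(T(1, 11)) = 1, Nc(T(1, 11)) = 2.

[Figure: brackets over T mark the substrings of lengths m-2, m-1, m, m+1 and m+2 starting at location 1.]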

Page 67: Two Different Approximate String Matching Problems and Their Algorithms

67

Lemma 3

Let $C_2$ be the set of all characters c such that $N_c(P) > N_c(S^i_{m+k})$. If

$$\sum_{c \in C_2} \left( N_c(P) - N_c(S^i_{m+k}) \right) > k,$$

then $ED(S^i_x, P) > k$ for $m-k \le x \le m+k$.

To apply this lemma, the checking window size is now m+k.

Page 68: Two Different Approximate String Matching Problems and Their Algorithms

68

Example

T a a a a c a a c a b a c b a c a a a a a

P a c c a c a b c b k = 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m+k

Na(P) = 3, Nb(P) = 2, Nc(P) = 4.

Na(T(1, 11)) = 8, Nb(T(1, 11)) = 1, Nc(T(1, 11)) = 2.

The numbers of the characters b and c in T(1, 11) are smaller than in P.

[Nb(P) – Nb(T(1, 11))] + [Nc(P) – Nc(T(1, 11))]

= (2-1) + (4-2) = 3 > k.

Thus, the edit distances of all substrings starting at location 1 with P are larger than k.

Page 69: Two Different Approximate String Matching Problems and Their Algorithms

69

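Theorem 1

If

$$\sum_{c \in C_1} \left( N_c(S^i_{m-k}) - N_c(P) \right) > k \quad \text{or} \quad \sum_{c \in C_2} \left( N_c(P) - N_c(S^i_{m+k}) \right) > k,$$

where $C_1$ and $C_2$ are defined as in Lemmas 2 and 3, we can eliminate position i of T. That is, we prune all substrings $S^i_x$, for $x \ge 0$, out of consideration.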
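The elimination test of Theorem 1 only needs character counts. A minimal sketch, assuming 1-based positions as on the slides and clipping the windows at the end of T (the function name is chosen here):

```python
from collections import Counter


def can_eliminate(t, p, k, i):
    """True if position i (1-based) of t is pruned by the character-count test."""
    m = len(p)
    p_count = Counter(p)
    short = Counter(t[i - 1:i - 1 + m - k])   # character counts of S^i_{m-k}
    long_ = Counter(t[i - 1:i - 1 + m + k])   # character counts of S^i_{m+k}, clipped
    # Characters that are too frequent even in the shortest candidate (Lemma 2).
    excess = sum(cnt - p_count[c] for c, cnt in short.items() if cnt > p_count[c])
    # Characters that are too rare even in the longest candidate (Lemma 3).
    deficit = sum(cnt - long_[c] for c, cnt in p_count.items() if cnt > long_[c])
    return excess > k or deficit > k
```

Positions that survive this test still have to be verified with DP1, as the examples on the following slides show.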

Page 70: Two Different Approximate String Matching Problems and Their Algorithms

70

Example:

T a a a a c a a c a b a c b a c a a a a a

P a c c a c a b c b k = 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m+k

Na(P) = 3, Na(T(1, 7)) = 6, Na(T(1, 11)) = 8,

Nb(P) = 2, Nb(T(1, 7)) = 0, Nb(T(1, 11)) = 1,

Nc(P) = 4, Nc(T(1, 7)) = 1, Nc(T(1, 11)) = 2.

m-k: Na(T(1, 7))–Na(P) = 6-3 = 3 > k.

m+k: [Nb(P) – Nb(T(1, 11))] + [Nc(P) – Nc(T(1, 11))] =(2-1) + (4-2) = 3 > k.

m-k

Page 71: Two Different Approximate String Matching Problems and Their Algorithms

71

Example:

T a a a a c a a c a b a c b a c a a a a a

P a c c a c a b c b k = 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m+k

Na(P) = 3, Na(T(2, 8)) = 5, Na(T(2, 12)) = 7,

Nb(P) = 2, Nb(T(2, 8)) = 0, Nb(T(2, 12)) = 1,

Nc(P) = 4, Nc(T(2, 8)) = 2, Nc(T(2, 12)) = 3.

m-k: Na(T(2, 8))–Na(P) = 5-3 = 2≦k.

m+k: [Nb(P) – Nb(T(2, 12))] + [Nc(P) – Nc(T(2, 12))] =(2-1) + (4-3) = 2≦k.

m-k

DP1 is to be used now.

Page 72: Two Different Approximate String Matching Problems and Their Algorithms

72

a a a c a a c a b a c

0 1 2 3 4 5 6 7 8 9 10 11

a 1 0 1 2 3 4 5 6 7 8 9 10

c 2 1 1 2 2 3 4 5 6 7 8 9

c 3 2 2 2 2 3 4 4 5 6 7 8

a 4 3 2 2 3 2 3 4 4 5 6 7

c 5 4 3 3 2 3 3 3 4 5 6 6

a 6 5 4 3 3 2 3 4 3 4 5 6

b 7 6 5 4 4 3 3 4 4 3 4 5

c 8 7 6 5 4 4 4 3 4 4 4 4

b 9 8 7 6 5 5 5 4 4 4 5 5

T(2, 12)

P

> k

k = 2

$ED(S^2_x, P) > k$ for $x \ge 0$.

DP1

Page 73: Two Different Approximate String Matching Problems and Their Algorithms

73

Example:

T a a a a c a a c a b a c b a c a a a a a

P a c c a c a b c b k = 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m+k

Na(P) = 3, Na(T(4, 10)) = 4, Na(T(4, 14)) = 6,

Nb(P) = 2, Nb(T(4, 10)) = 1, Nb(T(4, 14)) = 2,

Nc(P) = 4, Nc(T(4, 10)) = 2, Nc(T(4, 14)) = 3.

m-k: Na(T(4, 10))–Na(P) = 4-3 = 1≦k.

m+k: Nc(P)–Nc(T(4, 14)) = 4-3 = 1≦k.

m-k

Page 74: Two Different Approximate String Matching Problems and Their Algorithms

74

a c a a c a b a c b a

0 1 2 3 4 5 6 7 8 9 10 11

a 1 0 1 2 3 4 5 6 7 8 9 10

c 2 1 0 1 2 3 4 5 6 7 8 9

c 3 2 1 1 2 2 3 4 5 6 7 8

a 4 3 2 1 1 2 2 3 4 5 6 7

c 5 4 3 2 2 1 2 3 4 4 5 6

a 6 5 4 3 2 2 1 2 3 4 5 5

b 7 6 5 4 3 3 2 1 2 3 4 5

c 8 7 6 5 4 3 3 2 2 2 3 4

b 9 8 7 6 5 4 4 3 3 3 2 3

T(4, 14)

P

≦ k

k = 2

$ED(S^4_{10}, P) = 2 \le k$.

DP1

Page 75: Two Different Approximate String Matching Problems and Their Algorithms

75

Example:

T a a a a c a a c a b a c b a c a a a a a

P a c c a c a b c b k = 2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

m+km-k

ED(T(4, 13), P) = 2 ≦ k.

Page 76: Two Different Approximate String Matching Problems and Their Algorithms

76

Algorithms using the DP2 method to Solve Problem 2.

Page 77: Two Different Approximate String Matching Problems and Their Algorithms

77

Very fast and simple approximate string matching

Information Processing Letters, 72:65-70, 1999.

G. Navarro and R. Baeza-Yates

Advisor: Prof. R. C. T. Lee

Speaker: H. M. Chen

Page 78: Two Different Approximate String Matching Problems and Their Algorithms

78

Lemma 1 is used again. We divide P into several pieces. After the pattern is divided, it has the property stated in the following Lemma 1.

Lemma 1
Let A and B be two strings such that ed(A, B) ≤ k. Let $A = A_1 x_1 A_2 x_2 \ldots x_{j-1} A_j$, for any j ≥ 1. Then at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors.

By the above lemma, when j = k+1, we want to find whether any piece $P_a$ (1 ≤ a ≤ j) of P appears exactly in T or not.

Page 79: Two Different Approximate String Matching Problems and Their Algorithms

79

• Suppose the window with size m exactly matches with P, we can extend it to the left to i-k and to the right to i+m-1+k. Thus, the approximate solution lies in the window with size m+2k.

• We now use DP2 to decide whether any substring in the window whose edit distance with P is smaller than or equal to k exists.

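[Figure: the window of size m in T aligned with P and extended by k positions on each side; the possible starting points are marked near the left end and the possible ending points near the right end.]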

Page 80: Two Different Approximate String Matching Problems and Their Algorithms

80

• Although this algorithm looks like the NB Algorithm [NB2000], it is actually different from it because we are now solving Problem 2 while NB Algorithm solves Problem 1.

• In the NB Algorithm, DP1 is used. The window size must be m+k and the window is moved step by step.

Page 81: Two Different Approximate String Matching Problems and Their Algorithms

81

•But, in this algorithm, we solve Problem 2. Thus DP2 is used and the examination window size must be m+2k as shown below:

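[Figure: the window of size m in T aligned with P and extended by k positions on each side; the possible starting points are marked near the left end and the possible ending points near the right end.]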

Page 82: Two Different Approximate String Matching Problems and Their Algorithms

82

A full example:
T = GACACTAGCCACACTGATCC
P = ACATCAGCC
k = 1

By Lemma 1, we divide P into j = k+1 = 1+1 = 2 pieces. Therefore, we obtain $P_1$ = ACATC and $P_2$ = AGCC.

Page 83: Two Different Approximate String Matching Problems and Their Algorithms

83

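T: G A C A C T A G C C A C A C T G A T C C
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

P = ACATCAGCC, $P_2$ = AGCC

The piece $P_2$ = AGCC appears exactly at T(7, 10), so P is aligned with T(2, 10). We then open a window T(1, 1+m-1+2k)=T(1,11) and use DP2.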

Page 84: Two Different Approximate String Matching Problems and Their Algorithms

84

G A C A C T A G C C A

0 0 0 0 0 0 0 0 0 0 0 0

A 1 1 0 1 0 1 1 0 1 1 1 0

C 2 2 1 0 1 0 1 1 1 1 1 1

A 3 3 2 1 0 1 1 1 2 2 2 1

T 4 4 3 2 1 1 1 2 2 3 3 2

C 5 5 4 3 2 1 2 2 3 2 3 3

A 6 6 5 4 3 2 2 2 3 3 3 3

G 7 6 6 5 4 3 3 3 2 3 4 4

C 8 7 7 6 5 4 4 4 3 2 3 4

C 9 8 8 7 6 5 5 5 4 3 2 3

From the table, we conclude that there is no solution for k=1.

DP2

Page 85: Two Different Approximate String Matching Problems and Their Algorithms

85

Pan’s Algorithm for Problem 2

• In Pan’s thesis, he does not divide P into k+1 pieces. Instead, he picks k+1 substrings with constant length C. If for a window W, one such substring exactly appears, he then opens a window of size m+2k as shown below.

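[Figure: a piece of the pattern P matches T exactly; around the aligned pattern, a window of size m+2k is opened in T, extending k positions on each side.]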

Page 86: Two Different Approximate String Matching Problems and Their Algorithms

86

• Pan’s algorithm only checks those windows which are opened.

• The checking is done by DP2.

Page 87: Two Different Approximate String Matching Problems and Their Algorithms

87

Approximate Boyer-Moore String Matching

Source : SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp.243-260

J. Tarhio and E. Ukkonen

Advisor: Prof. R. C. T. Lee

Speaker: Kuei-hao Chen

Page 88: Two Different Approximate String Matching Problems and Their Algorithms

88

In the following figure, y in T is located at i and x in T is located at j. But y does not appear within i-k to i+k in P, and x does not appear within j-k to j+k in P.

In this case, it can be seen that deleting any character in P will not result in an exact match. Thus, the edit distance between T and P must be larger than 1. Since k=1, it is impossible to have ED(T,P)≦k.

[Figure: k = 1. The characters y and x of T are located at positions i and j; P is shown with the ranges i-k to i+k and j-k to j+k, in which y and x respectively do not appear.]

Page 89: Two Different Approximate String Matching Problems and Their Algorithms

89

• Suppose a character x of a window of T is located at position i. The range of P from i-k to i+k is called the 2k-range of x.

Page 90: Two Different Approximate String Matching Problems and Their Algorithms

90

In this algorithm, we always open a window with size m and check whether there exist k+1 characters in the window which do not appear in their corresponding 2k-ranges.

If yes, shift the window according to the Shifting Rule, which is explained later.

If no, suppose the window starts at location i. Apply DP2 to the window T(i-k, i+m-1+k) of T with P. After this, shift the window according to the Shifting Rule.
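The check described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `bad_characters` is a hypothetical name, positions are 1-indexed as in the slides, and the scan goes right to left and stops as soon as k+1 characters are found outside their 2k-ranges. The text and pattern are those of the complete example given later in these slides.

```python
def bad_characters(text, pattern, start, k):
    """Scan the window text(start, start+m-1) from right to left.

    Returns the 1-indexed text positions whose character is absent from
    its 2k-range in the pattern, stopping once k+1 of them are found.
    """
    m = len(pattern)
    bad = []
    for j in range(m, 0, -1):                  # j: position inside the window, 1..m
        ch = text[start + j - 2]               # text position start+j-1, 0-indexed here
        lo, hi = max(1, j - k), min(m, j + k)  # 2k-range P(lo, hi), clipped to 1..m
        if ch not in pattern[lo - 1:hi]:
            bad.append(start + j - 1)
            if len(bad) > k:                   # k+1 found: a shift is needed
                break
    return bad


T = "GCATCGCAGAGAGTATGCAGAGCG"
P = "GCAGAGAG"
for window_start in (1, 2, 4, 6):
    found = bad_characters(T, P, window_start, k=1)
    action = "shift" if len(found) > 1 else "run DP2"
    print(window_start, found, action)
```

A result longer than k means the window cannot contain an approximate match, so only the Shifting Rule is applied; otherwise DP2 is run first.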

Page 91: Two Different Approximate String Matching Problems and Their Algorithms

• The Shifting Rule is given in the next slides.

Page 92: Two Different Approximate String Matching Problems and Their Algorithms

Case 1: Some character x in the (k+1)-suffix of the window also occurs in P, as shown below. Move the pattern so that these two occurrences of x are aligned. Note that in such a situation, there are at most k mismatches between the (k+1)-suffix and its corresponding substring in P.

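[Figure: the window of T and P before and after the shift. A character x in the (k+1)-suffix of the window is aligned with an occurrence of x in P.]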

• An x-suffix (prefix) is a suffix (prefix) of size x.

Page 93: Two Different Approximate String Matching Problems and Their Algorithms

[Figure: the Case 1 alignment again, with the starting location i of the window marked in T.]

A subtle point: an approximate solution of our problem may start from a location to the left of i. In fact, it may start from any location between i-k and i+k. A similar argument applies to the ending point. Conclusion: the examination window size must be m+2k.

Page 94: Two Different Approximate String Matching Problems and Their Algorithms

Case 2: No such character exists. Move the pattern so that the k-prefix of P aligns with the k-suffix of W, as shown below. In this situation, again, there are at most k mismatches between the k-suffix of W and the k-prefix of P.

[Figure: P shifted so that its k-prefix lies under the end of the (k+1)-suffix of the window in T.]

Page 95: Two Different Approximate String Matching Problems and Their Algorithms

• We perform preprocessing similar to the bad character rule preprocessing of the BM Algorithm.

• For this algorithm, the checking window size is m and the examination window size is m+2k.

Page 96: Two Different Approximate String Matching Problems and Their Algorithms

Complete example for approximate string matching

For example, let k=1, m=8, n=24.

T: G C A T C G C A G A G A G T A T G C A G A G C G

P: G C A G A G A G

Σ            A  C  G  *
D1[i=8, a]   1  6  2  8
D1[i=7, a]   2  5  1  8

Bad character rule table.
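The slides do not spell out how the table is computed. One reading that reproduces the values above is: for each row i from m-k to m, the entry for character a is the smallest s ≥ 1 with p(i-s) = a, and m if there is no such s (the * column). The Python sketch below follows that assumed definition; `bad_character_table` is a hypothetical name.

```python
def bad_character_table(pattern, k):
    """Build rows i = m-k .. m of the shift table (1-indexed rows).

    Assumed definition: table[i][a] is the smallest s >= 1 with p_(i-s) == a.
    Characters missing from a row take the default value m (the * column).
    """
    m = len(pattern)
    table = {}
    for i in range(m - k, m + 1):
        row = {}
        for s in range(1, i):            # p_(i-s) runs leftwards from p_(i-1) to p_1
            a = pattern[i - s - 1]       # 0-indexed access to p_(i-s)
            row.setdefault(a, s)         # keep only the smallest s for each character
        table[i] = row
    return table


P = "GCAGAGAG"
D = bad_character_table(P, k=1)
m = len(P)
for i in (8, 7):
    print(i, [D[i].get(a, m) for a in "ACG"], m)
```

Looking a character up with `D[i].get(a, m)` gives the * value for any character that does not occur to the left of position i in P.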

Page 97: Two Different Approximate String Matching Problems and Their Algorithms

Example(1/15)

   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
T: G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  G  C  A  G  A  G  C  G
P: G  C  A  G  A  G  A  G

Σ           A  C  G  *
D[i=8, a]   1  6  2  8
D[i=7, a]   2  5  1  8

k=1

t8=A appears in P(7,8)
t7=C does not appear in P(6,8)
t6=G appears in P(5,7)
t5=C does not appear in P(4,6)

Two characters (t7 and t5) do not appear in their 2k-ranges, which is more than k. Shifting is needed now. We examine the (k+1)-suffix, which is a 2-suffix.
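One way to read the Shifting Rule as a computation is sketched below in Python. For each character of the (k+1)-suffix of the window, look up its entry in the table row for its window position, take the smallest value (Case 1), and never shift by more than m-k (Case 2). The cap at m-k and the name `shift` are this sketch's interpretation of the two cases, not something stated on the slides; the table values are copied from the slide.

```python
# Table rows copied from the slide; any other character takes the * value m.
D = {
    8: {"A": 1, "C": 6, "G": 2},
    7: {"A": 2, "C": 5, "G": 1},
}


def shift(text, start, m, k, table):
    """Shift distance for the window text(start, start+m-1), 1-indexed start."""
    best = m - k                              # Case 2: align k-prefix of P with k-suffix of W
    for i in range(m - k, m + 1):             # window positions of the (k+1)-suffix
        ch = text[start + i - 2]              # character at window position i
        best = min(best, table[i].get(ch, m)) # Case 1: align with an occurrence in P
    return best


T = "GCATCGCAGAGAGTATGCAGAGCG"
start = 1
for _ in range(3):
    d = shift(T, start, m=8, k=1, table=D)
    print("window at", start, "shift by", d)
    start += d
```

Starting from the window at location 1, the computed shifts move the window to locations 2, 4 and 6, which matches the window positions implied by the checks in the following example slides.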

Page 98: Two Different Approximate String Matching Problems and Their Algorithms

Example(2/15)

Σ           A  C  G  *
D[i=8, a]   1  6  2  8
D[i=7, a]   2  5  1  8

   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
T: G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  G  C  A  G  A  G  C  G
P:    G  C  A  G  A  G  A  G

k=1

t9=G appears in P(7,8)
t8=A appears in P(6,8)
t7=C does not appear in P(5,7)
t6=G appears in P(4,6)
t5=C does not appear in P(3,5)

Two characters (t7 and t5) do not appear in their 2k-ranges, which is more than k. Shifting is needed now.

Page 99: Two Different Approximate String Matching Problems and Their Algorithms

Example(3/15)

Σ           A  C  G  *
D[i=8, a]   1  6  2  8
D[i=7, a]   2  5  1  8

   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
T: G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  G  C  A  G  A  G  C  G
P:          G  C  A  G  A  G  A  G

k=1

t11=G appears in P(7,8)
t10=A appears in P(6,8)
t9=G appears in P(5,7)
t8=A appears in P(4,6)
t7=C does not appear in P(3,5)
t6=G appears in P(2,4)
t5=C appears in P(1,3)
t4=T does not appear in P(1,2)

Two characters (t7 and t4) do not appear in their 2k-ranges, which is more than k. Shifting is needed now.

Page 100: Two Different Approximate String Matching Problems and Their Algorithms

   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
T: G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  G  C  A  G  A  G  C  G
P:                G  C  A  G  A  G  A  G

Σ           A  C  G  *
D[i=8, a]   1  6  2  8
D[i=7, a]   2  5  1  8

k=1

Fewer than k+1 characters in the window fail to appear in their 2k-ranges, so DP2 is used. The window starts at location 6, so DP2 is applied to T(5, 14) with P:

      C  G  C  A  G  A  G  A  G  T
   0  0  0  0  0  0  0  0  0  0  0
G  1  1  0  1  1  0  1  0  1  0  1
C  2  1  1  0  1  1  1  1  1  1  1
A  3  2  2  1  0  1  1  2  1  2  2
G  4        2  1  0  1  1  2  1  2
A  5           2  1  0  1  1  2  2
G  6              2  1  0  1  1  2
A  7                 2  1  0  1  2
G  8                    2  1  0  1

Cells left blank are not shown on the slide. In the last row, the entries in the columns of t12, t13 and t14 are at most k=1.

Output locations: 12, 13 and 14.
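The DP2 check on the window T(5, 14) can be reproduced with a short Python sketch. It assumes the usual DP2 setup for Problem 2: the row for the empty pattern prefix is all zeros (a match may start anywhere in the window), the column for the empty text prefix is 0, 1, ..., m, and an end position is reported whenever the last row is at most k. The name `dp2_end_positions` is hypothetical, and the table is computed column by column to save space.

```python
def dp2_end_positions(text, pattern, k):
    """Return the 1-indexed positions j of text where some substring
    ending at j has edit distance at most k from pattern."""
    m = len(pattern)
    prev = list(range(m + 1))        # column for the empty text prefix: 0, 1, ..., m
    ends = []
    for j, tc in enumerate(text, start=1):
        cur = [0] * (m + 1)          # top row is 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in the text
                         cur[i - 1] + 1)      # extra character in the pattern
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


T = "GCATCGCAGAGAGTATGCAGAGCG"
P = "GCAGAGAG"
k = 1
window_start = 5                              # examination window T(5, 14)
window = T[window_start - 1:14]
ends = dp2_end_positions(window, P, k)
print([e + window_start - 1 for e in ends])   # positions in T
```

Positions inside the window are converted back to positions of T by adding the window start minus one.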

Page 101: Two Different Approximate String Matching Problems and Their Algorithms

Summary

1. DP1 is used for Problem 1 and DP2 is used for Problem 2.

2. For both problems, algorithms try to avoid exhaustive search.

3. Most algorithms use checking windows to determine which region needs to be examined.

4. The NB Algorithm in [NB99] and Pan’s Algorithm do not use checking windows to determine the regions which need to be examined.

Page 102: Two Different Approximate String Matching Problems and Their Algorithms

5. During the examination phase when DP1 or DP2 is used, there is another window which may be called the examination window.

6. For Problem 1 where DP1 is used, the examination window size is m+k for all algorithms.

7. For Problem 2 where DP2 is used, the examination window size is m+2k for all algorithms.

Page 103: Two Different Approximate String Matching Problems and Their Algorithms

• The examination window size is m+2k when the solution may start and end as shown below:

[Figure: P spans m positions under T, with k extra positions of T on each side; the possible starting points and the possible ending points are marked.]

Page 104: Two Different Approximate String Matching Problems and Their Algorithms

Thank You