
Fast text searching: allowing errors

Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91

Advisor: Prof. R. C. T. Lee

Reporter: Z. H. Pan


Given a text T(1,n), a pattern P(1,m) and an error bound k, our approximate string matching problem is defined as follows:

Find all locations i of T such that the following condition is satisfied: there exists a suffix A of T(1, i) such that d(A,P)≦k, where d(x,y) is the edit distance between x and y.
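The definition above can be transcribed almost literally into code. The following is a minimal brute-force sketch, not the method of the paper: the function names edit_distance and match_positions are illustrative assumptions, and the code simply tries every non-empty suffix of T(1, i) for every i.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(y) + 1))          # distances from "" to y[:j]
    for i, cx in enumerate(x, start=1):
        curr = [i]                          # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # drop cx
                            curr[j - 1] + 1,            # add cy
                            prev[j - 1] + (cx != cy)))  # match or substitute
        prev = curr
    return prev[-1]


def match_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return every 1-based i such that some suffix A of T(1, i) has d(A, P) <= k."""
    hits = []
    for i in range(1, len(text) + 1):
        # The non-empty suffixes of T(1, i) are text[s:i] for s = 0 .. i-1.
        if any(edit_distance(text[s:i], pattern) <= k for s in range(i)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(match_positions("deaabeg", "aabac", 2))
```

This version is far too slow for real texts; it only serves as a reference against which the table-based method on the later slides can be checked.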


Example:

T=deaabeg,

P=aabac and k=2.

Consider i=5.

T(1,5)=deaab.

There exists a suffix A=aab of T(1,5) such that d(A,P)=d(aab,aabac)=2≦k, so i=5 is a solution.


Example:

T=deaabeg, P=aab and k=2.

Consider i=5.

T(1,5)=deaab.

The suffix A=aab of T(1,5) gives d(A,P)=d(aab,aab)=0. Thus we have found a substring aab in T such that d(aab,P)=0.

Consider i=6.

T(1,6)=deaabe.

The suffix A=aabe of T(1,6) gives d(A,P)=d(aabe,aab)=1. Again, we have found a substring aabe in T such that d(aabe,P)=1.


Our approach is based upon the following observation:

Let S be a substring of T. Write S=S1S2 and P=P1P2, where S2 is a suffix of S and P2 is a suffix of P.

If d(S2, P2)=0 and d(S1, P1)≦k, we have d(S, P)≦k.

[Figure: the substring S of T, split into S1 and S2, aligned with P, split into P1 and P2.]


Example:

A=addcd and B=abcd. k=2.

We may decompose A and B as follows: A=add+cd and B=ab+cd.

d(cd,cd)=0 and d(add,ab)=2≦k.

Thus d(A,B)≦2=k.
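The observation can be checked numerically on this example. The sketch below is only an illustration under assumed names (edit_distance and bound_from_common_suffix are not from the slides): it strips the longest common suffix S2=P2 and returns d(S1, P1), which by the observation is an upper bound on d(S, P).

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def bound_from_common_suffix(s: str, p: str) -> int:
    """Split off the longest common suffix (S2 = P2) and return d(S1, P1)."""
    c = 0
    while c < len(s) and c < len(p) and s[-1 - c] == p[-1 - c]:
        c += 1                               # one more matching trailing character
    s1, p1 = s[:len(s) - c], p[:len(p) - c]  # the remaining prefixes S1 and P1
    return edit_distance(s1, p1)


if __name__ == "__main__":
    bound = bound_from_common_suffix("addcd", "abcd")   # d(add, ab)
    print(bound, edit_distance("addcd", "abcd") <= bound)
```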


A Recursive Operation for the Dynamic Programming Approach

Consider T(1,i) and P(1, j).

Case 1: T(i)=P(j). Let B=P(1, j-1), the prefix of P that ends just before P(j). We consider whether there is a suffix A of T(1, i-1) such that d(A, B)≦k.

[Figure: T(i) aligned with P(j); A is a suffix of T(1, i-1) and B=P(1, j-1).]


Case 2: T(i)≠P(j). We consider three cases:

2.1 Let B=P(1, j). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to an insertion: T(i) is an extra character following A.

[Figure: insertion. A is a suffix of T(1, i-1) aligned with B=P(1, j); T(i) is the inserted character.]


Case 2 (continued): T(i)≠P(j).

2.2 Let B=P(1, j-1). There is a suffix A of T(1, i) such that d(A,B)≦k-1. This corresponds to a deletion: P(j) has no counterpart in T.

[Figure: deletion. A is a suffix of T(1, i) aligned with B=P(1, j-1).]


Case 2 (continued): T(i)≠P(j).

2.3 Let B=P(1, j-1). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to a substitution: T(i) replaces P(j).

[Figure: substitution. A is a suffix of T(1, i-1) aligned with B=P(1, j-1); T(i) is aligned with P(j).]
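Case 1 and Cases 2.1 to 2.3 can be combined into one table-filling routine. The following is a minimal sketch, assuming a 0/1 table per error level e=0..k. The function name build_r_tables and the boundary values (column 0 always 1; row 0 equal to 1 only when j≦e) are assumptions for illustration, since the slides do not state the boundaries.

```python
def build_r_tables(text: str, pattern: str, k: int) -> list[list[list[int]]]:
    """R[e][i][j] = 1 iff some suffix of T(1, i) is within e edits of P(1, j)."""
    n, m = len(text), len(pattern)
    R = []
    for e in range(k + 1):
        table = [[0] * (m + 1) for _ in range(n + 1)]
        # Assumed boundaries: the empty pattern prefix always matches,
        # and the empty text matches P(1, j) only by deleting all j characters.
        for i in range(n + 1):
            table[i][0] = 1
        for j in range(1, m + 1):
            table[0][j] = 1 if j <= e else 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if text[i - 1] == pattern[j - 1]:
                    table[i][j] = table[i - 1][j - 1]      # Case 1
                elif e > 0:
                    prev = R[e - 1]                        # table for e-1 errors
                    table[i][j] = (prev[i - 1][j]          # Case 2.1, insertion
                                   | prev[i][j - 1]        # Case 2.2, deletion
                                   | prev[i - 1][j - 1])   # Case 2.3, substitution
        R.append(table)
    return R


def rows_as_strings(table: list[list[int]]) -> list[str]:
    """Render rows i = 1..n as bit strings over j = 1..m."""
    return ["".join(str(bit) for bit in row[1:]) for row in table[1:]]


if __name__ == "__main__":
    R = build_r_tables("aabaacaabacab", "aabac", 1)
    for i, bits in enumerate(rows_as_strings(R[1]), start=1):
        print(i, bits)
```

Positions i with R[k][i][m] equal to 1 are the answers to the matching problem; the work is O(kmn) table entries, each filled in constant time.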


To solve our approximate string matching problem, we start with a table, called Rk[n, m]. Let S=T(1, i).

Rk(i,j)
=1 if there exists a suffix A of S such that d(A, P(1, j))≦k,
=0 otherwise,

where 1≦i≦n and 1≦j≦m.

Example:

T: aabaacaabacab, P: aabac and k=1.

R1[13,5] (row i lists R1(i,1) to R1(i,5); the columns correspond to P=a a b a c):

 i   T(i)  R1(i,1..5)
 1   a     11000
 2   a     11100
 3   b     11110
 4   a     11111
 5   a     11111
 6   c     11101
 7   a     11010
 8   a     11100
 9   b     11110
10   a     11111
11   c     11011
12   a     11001
13   b     11100

Consider i=9, j=4.

S=T(1, 9)=aabaacaab

P(1, 4)=aaba

A=aab

d(A,P(1, 4))=d(aab,aaba)=1

∴ R1(9, 4)=1


Example:

T: aabaacaabacab, P: aabac and k=1.

R1[13,5] (row i lists R1(i,1) to R1(i,5); the columns correspond to P=a a b a c):

 i   T(i)  R1(i,1..5)
 1   a     11000
 2   a     11100
 3   b     11110
 4   a     11111
 5   a     11111
 6   c     11101
 7   a     11010
 8   a     11100
 9   b     11110
10   a     11111
11   c     11011
12   a     11001
13   b     11100

Consider i=13 and j=5.

S=T(1, 13)=aabaacaabacab

P(1, 5)=aabac

There does not exist any suffix A of S such that d(A,P(1, 5))≦1.

∴ R1(13,5)=0


Question: How can we find Rk(i, j)?

Answer: Dynamic Programming.

There are three types of operations in edit distance:

(1) Insertion

(2) Deletion

(3) Substitution

We consider them separately and combine the results later.


Let RIk(i,j), RDk(i,j) and RSk(i,j) denote the Rk(i,j) related to insertion, deletion and substitution respectively.

And let RIk[i,j], RDk[i,j] and RSk[i,j] denote the corresponding tables related to insertion, deletion and substitution respectively.


Consider RIk(i,j) first.

RIk(i,j)

=1 if ti≠pj and Rk-1(i-1,j)=1

or ti= pj and Rk(i-1,j-1)=1,

=0 otherwise.

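[Figure: insertion. T: aabac followed by an extra character b at position i; P: aabac ending at position j; the part of T up to position i-1 is compared with P(1, j).]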


RIk(i,j)

=1 if ti≠pj and Rk-1(i-1,j)=1 or ti= pj and Rk(i-1,j-1)=1

=0 otherwise

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

R0[13,5] and RI1[13,5] (row i lists the entries for j=1 to 5; the columns correspond to P=a a b a c):

 i   ti   R0      RI1
 1   a    10000   10000
 2   a    11000   11000
 3   b    00100   11100
 4   a    10010   11110
 5   a    11000   11010
 6   c    00000   11001
 7   a    10000   11000
 8   a    11000   11000
 9   b    00100   11100
10   a    10010   11110
11   c    00001   10011
12   a    10000   11001
13   b    00000   10100

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RI1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,4)=0

∴ RI1(6,4)=0

(3) When i=11 and j=4.

t11=‘c’≠p4=‘a’, R0(10,4)=1

∴ RI1(11,4)=1


Consider RDk(i,j).

RDk(i,j)

=1 if ti≠pj and Rk-1(i,j-1)=1

or ti= pj and Rk(i-1,j-1)=1,

=0 otherwise.

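[Figure: deletion. T: aabac ending at position i; P: aabac followed by a character b at position j that has no counterpart in T; T(1, i) is compared with P(1, j-1).]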


RDk(i,j)

=1 if ti≠pj and Rk-1(i,j-1)=1 or ti= pj and Rk(i-1,j-1)=1

=0 otherwise

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

R0[13,5] and RD1[13,5] (row i lists the entries for j=1 to 5; the columns correspond to P=a a b a c):

 i   ti   R0      RD1
 1   a    10000   11000
 2   a    11000   11100
 3   b    00100   10110
 4   a    10010   11011
 5   a    11000   11100
 6   c    00000   10000
 7   a    10000   11000
 8   a    11000   11100
 9   b    00100   10110
10   a    10010   11011
11   c    00001   10001
12   a    10000   11000
13   b    00000   10100

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RD1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(6,3)=0

∴ RD1(6,4)=0

(3) When i=3 and j=4.

t3=‘b’≠p4=‘a’, R0(3,3)=1

∴ RD1(3,4)=1


Consider RSk(i,j).

RSk(i,j)

=1 if ti≠pj and Rk-1(i-1,j-1)=1

or ti= pj and Rk(i-1,j-1)=1

=0 otherwise

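[Figure: substitution. Two alignments of T and P, each with aabac before positions i and j: in one, T(i) and P(j) differ (a versus b) and T(i) is substituted for P(j); in the other, T(i)=P(j)=b. In both, T(1, i-1) is compared with P(1, j-1).]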


RSk(i,j)

=1 if ti≠pj and Rk-1(i-1,j-1)=1 or ti= pj and Rk(i-1,j-1)=1

=0 otherwise

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

R0[13,5] and RS1[13,5] (row i lists the entries for j=1 to 5; the columns correspond to P=a a b a c):

 i   ti   R0      RS1
 1   a    10000   10000
 2   a    11000   11000
 3   b    00100   11100
 4   a    10010   11010
 5   a    11000   11001
 6   c    00000   11100
 7   a    10000   11010
 8   a    11000   11000
 9   b    00100   11100
10   a    10010   11010
11   c    00001   11001
12   a    10000   11000
13   b    00000   11100

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RS1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,3)=0

∴ RS1(6,4)=0

(3) When i=5 and j=5.

t5=‘a’≠p5=‘c’, R0(4,4)=1

∴ RS1(5,5)=1


After RIk(i,j), RDk(i,j) and RSk(i,j) have been found, we immediately determine Rk(i,j) by

Rk(i,j)= RIk(i,j) or RDk(i,j) or RSk(i,j).

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

R1[13,5] (row i lists R1(i,1) to R1(i,5); the columns correspond to P=a a b a c):

 i   ti   R1
 1   a    11000
 2   a    11100
 3   b    11110
 4   a    11111
 5   a    11111
 6   c    11101
 7   a    11010
 8   a    11100
 9   b    11110
10   a    11111
11   c    11011
12   a    11001
13   b    11100
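The three recurrences and the final "or" step can be sketched together as follows. This is an illustration under stated assumptions rather than the presenters' code: the name component_tables is hypothetical, the match term uses Rk(i-1,j-1) exactly as written in the formulas for RIk, RDk and RSk, and the boundary values (column 0 always 1; row 0 equal to 1 only when j≦e) are assumed because the slides do not give them.

```python
def component_tables(text: str, pattern: str, k: int):
    """Return (RI, RD, RS, R), each indexed [e][i][j] with 1-based i and j."""
    n, m = len(text), len(pattern)
    RI, RD, RS, R = [], [], [], []
    for e in range(k + 1):
        ri = [[0] * (m + 1) for _ in range(n + 1)]
        rd = [[0] * (m + 1) for _ in range(n + 1)]
        rs = [[0] * (m + 1) for _ in range(n + 1)]
        r = [[0] * (m + 1) for _ in range(n + 1)]
        # Assumed boundaries of the combined table for e errors.
        for i in range(n + 1):
            r[i][0] = 1
        for j in range(1, m + 1):
            r[0][j] = 1 if j <= e else 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if text[i - 1] == pattern[j - 1]:
                    # ti = pj: all three formulas share the term Rk(i-1, j-1).
                    ri[i][j] = rd[i][j] = rs[i][j] = r[i - 1][j - 1]
                elif e > 0:
                    prev = R[e - 1]
                    ri[i][j] = prev[i - 1][j]        # insertion:    Rk-1(i-1, j)
                    rd[i][j] = prev[i][j - 1]        # deletion:     Rk-1(i, j-1)
                    rs[i][j] = prev[i - 1][j - 1]    # substitution: Rk-1(i-1, j-1)
                r[i][j] = ri[i][j] | rd[i][j] | rs[i][j]
        RI.append(ri)
        RD.append(rd)
        RS.append(rs)
        R.append(r)
    return RI, RD, RS, R


if __name__ == "__main__":
    text, pattern, k = "aabaacaabacab", "aabac", 1
    RI, RD, RS, R = component_tables(text, pattern, k)
    ends = [i for i in range(1, len(text) + 1) if R[k][i][len(pattern)]]
    print(ends)  # text positions where an approximate match of P ends
```

The end positions of approximate matches are the rows i whose last entry Rk(i, m) is 1.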


Thank you!