Approximate string matching using factor automata. J. Holub and B. Melichar, Theoretical Computer Science vol. 249, pp. 305-311. Speaker: L. C. Chen. Advisor: R. C. T. Lee


Approximate string matching using factor automata

J. Holub and B. Melichar

Theoretical Computer Science vol. 249, pp. 305-311

Speaker: L. C. Chen

Advisor: R. C. T. Lee


Problem

• The edit distance DL(P, X) between strings P and X is the minimum number of edit operations (substitution, insertion and deletion) needed to convert P into X.

• Given a text T of length n, a pattern P of length m, and an integer k with k ≦ m ≦ n, approximate string matching asks whether T contains a substring X whose edit distance DL(P, X) from the pattern P is at most k.
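The decision problem above can be checked directly with a Sellers-style dynamic program, which is a standard baseline and not the factor-automaton method of the paper. The sketch below is illustrative only; the function name `occurs_within_k` is an assumption made for this example.

```python
def occurs_within_k(text: str, pattern: str, k: int) -> bool:
    """Return True if some substring X of text satisfies DL(pattern, X) <= k."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any substring
    # of the text that ends at the current text position.
    col = list(range(m + 1))
    if col[m] <= k:
        return True  # the empty substring is already within distance k
    for ch in text:
        prev_diag = col[0]  # old value of col[i - 1]
        # col[0] stays 0: a match may start at any text position.
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text symbol (insertion)
                         col[i - 1] + 1)    # skipped pattern symbol (deletion)
            prev_diag = old
        if col[m] <= k:
            return True
    return False


print(occurs_within_k("aabbabd", "bab", 1))
```

This runs in O(mn) time for every query, whereas the automaton approach in the following slides preprocesses the text once.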


An example of Edit Distance

To convert P = abcde into T = bcfeg:

• Delete a: P1 = bcde

• Substitute d with f: P2 = bcfe

• Insert g: T = bcfeg
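The conversion above uses three operations. A minimal sketch of the textbook dynamic program for the edit distance confirms that no shorter conversion exists; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(p: str, x: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning p into x."""
    # prev[j] = distance between the already processed prefix of p and x[:j]
    prev = list(range(len(x) + 1))
    for i, pc in enumerate(p, start=1):
        cur = [i]  # turning p[:i] into the empty string takes i deletions
        for j, xc in enumerate(x, start=1):
            cur.append(min(prev[j - 1] + (pc != xc),  # match or substitution
                           prev[j] + 1,               # delete pc
                           cur[j - 1] + 1))           # insert xc
        prev = cur
    return prev[-1]


print(edit_distance("abcde", "bcfeg"))  # 3
```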


Basic definition

• Fac(T): the set of all substrings (factors) of the text T.

• A nondeterministic finite automaton (NFA) is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states.

• M(Fac(T)): a factor automaton, that is, an automaton that accepts Fac(T).
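For a short text, Fac(T) can simply be enumerated. The sketch below lists the non-empty substrings only, which is how the slides write the set out; the helper name `factors` is an assumption made for this example.

```python
def factors(text: str) -> set[str]:
    """All non-empty substrings (factors) of text."""
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}


fac = factors("aabbabd")
print(len(fac))       # 23
print("bab" in fac)   # True
print("dab" in fac)   # False
```

The set has O(n²) elements in the worst case, which is why an automaton is used to represent it instead.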

Factor automaton

Factor automaton M(Fac(T)): a deterministic finite automaton (DFA) that accepts all substrings of the given text T.

T = aabbabd

Fac(T) = {a, b, d, aa, ab, bb, ba, bd, aab, abb, bba, bab, abd, aabb, abba, bbab, babd, aabba, abbab, bbabd, aabbab, abbabd, aabbabd}

[Figure: factor automaton M(Fac(T)) for T = aabbabd, with states 0 to 9 and transitions labelled a, b and d.]
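One way to obtain a deterministic factor automaton is a subset construction over text positions: a state is the set of positions at which the factor read so far can end. This is a hedged sketch and not necessarily the construction used in the paper; the names `build_factor_dfa` and `accepts` are assumptions made for this example.

```python
def build_factor_dfa(text: str):
    """Return (start, transitions, states) of a DFA accepting every factor of text."""
    n = len(text)
    start = frozenset(range(n + 1))  # a factor may start at any position
    transitions = {}                 # (state, symbol) -> state
    states = {start}
    todo = [start]
    while todo:
        state = todo.pop()
        by_symbol = {}
        for pos in state:
            if pos < n:
                by_symbol.setdefault(text[pos], set()).add(pos + 1)
        for symbol, targets in by_symbol.items():
            nxt = frozenset(targets)
            transitions[(state, symbol)] = nxt
            if nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return start, transitions, states


def accepts(dfa, word: str) -> bool:
    start, transitions, _ = dfa
    state = start
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False  # no transition: word is not a factor
    return True           # every state of a factor automaton is final


dfa = build_factor_dfa("aabbabd")
print(len(dfa[2]))           # 10 states, as many as the figure shows (0 to 9)
print(accepts(dfa, "bbab"))  # True
print(accepts(dfa, "dab"))   # False
```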

[Figure: suffix tree of T = aabbabd, drawn from the node "root" with edges labelled a, b and d.]

A suffix tree can also be used to recognize all substrings of T = aabbabd, that is, the set Fac(T) listed above.
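To illustrate the idea, the sketch below builds an uncompressed suffix trie with nested dictionaries. This is a simplification chosen for brevity: a real suffix tree compresses unary paths and can be built in linear time, while this version takes quadratic time and space. The names `build_suffix_trie` and `is_factor` are assumptions made for this example.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def is_factor(trie: dict, word: str) -> bool:
    """A word is a factor iff it labels a path starting at the root."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aabbabd")
print(is_factor(trie, "abba"))  # True
print(is_factor(trie, "bad"))   # False
```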

P = bab, k = 1. The finite automaton M(Lk(P)) accepts Lk(P), where

$L_k(P) = \{\, Y \mid Y \in \Sigma^*,\ D_L(P, Y) \le k \,\}$

Lk(P) = {ab, bb, ba, aab, bab, dab, bbb, bdb, baa, bad, bbab, bdab, baab, badb}.

[Figure: M(Lk(P)) with states 0,0 1,0 2,0 3,0 on the error-free level and 1,1 2,1 3,1 on the one-error level, transitions labelled a and b. Annotations: state 1,0 is "one matched, 0 errors"; state 1,1 is "one matched, one error"; state 3,0 is "three matched, 0 errors".]
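The automaton can be simulated by tracking the set of active states (i, e), where i is the number of pattern symbols consumed and e the number of errors. The transition rules in this sketch are reconstructed from the figure and from the listed language, so they are assumptions: deletions are ε-moves, and an insertion is allowed only strictly inside the pattern and only with a symbol that differs from the next pattern symbol. Under these rules a string such as babb, although at edit distance 1 from bab, is not accepted, which agrees with the list on the slide. All function names are hypothetical.

```python
from itertools import product


def closure(states, m, k):
    """Add states reachable by epsilon (deletion) moves (i, e) -> (i + 1, e + 1)."""
    result, stack = set(states), list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in result:
            result.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return result


def step(states, ch, pattern, k):
    m, nxt = len(pattern), set()
    for i, e in states:
        if i == m:
            continue                     # assumed: nothing leaves the last column
        if ch == pattern[i]:
            nxt.add((i + 1, e))          # match
        elif e < k:
            nxt.add((i + 1, e + 1))      # substitution
            if i > 0:
                nxt.add((i, e + 1))      # insertion (assumed restriction, see above)
    return closure(nxt, m, k)


def accepts(word, pattern, k):
    m = len(pattern)
    states = closure({(0, 0)}, m, k)
    for ch in word:
        states = step(states, ch, pattern, k)
    return any(i == m for i, _ in states)


def language(pattern, k, alphabet):
    """Enumerate the accepted strings; none can be longer than m + k."""
    words = set()
    for length in range(len(pattern) + k + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if accepts(word, pattern, k):
                words.add(word)
    return words


print(sorted(language("bab", 1, "abd")))
```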

[Figure: M(Lk(P)) for P = bab, k = 1, repeated on three slides, one per recognized string.]

• Recognize ab

• Recognize aab

• Recognize bbab

Definition

• Let $M_1 = (Q_1, \Sigma, \delta_1, q_{01}, F_1)$ and $M_2 = (Q_2, \Sigma, \delta_2, q_{02}, F_2)$. An automaton for the intersection of M1 and M2 is the automaton

$M = (Q_1 \times Q_2, \Sigma, \delta, [q_{01}, q_{02}], F_1 \times F_2)$,

where $\delta([q_1, q_2], a) = [\delta_1(q_1, a), \delta_2(q_2, a)]$, $q_1 \in Q_1$, $q_2 \in Q_2$, and $a \in \Sigma$.
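The definition translates almost directly into code for two deterministic automata. The sketch below builds only the reachable part of Q1 × Q2 and represents each automaton as a hypothetical triple (delta, start, finals), with delta a dictionary from (state, symbol) to state; the two toy automata are made up for illustration.

```python
def intersect(dfa1, dfa2):
    """Product automaton accepting exactly the words accepted by both DFAs."""
    delta1, start1, finals1 = dfa1
    delta2, start2, finals2 = dfa2
    start = (start1, start2)
    delta, finals = {}, set()
    seen, todo = {start}, [start]
    while todo:
        q1, q2 = todo.pop()
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))          # final iff both components are final
        for (p, a), r1 in delta1.items():
            if p != q1 or (q2, a) not in delta2:
                continue                  # both automata must be able to read a
            nxt = (r1, delta2[(q2, a)])
            delta[((q1, q2), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return delta, start, finals


def run(dfa, word):
    delta, state, finals = dfa
    for ch in word:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in finals


# Toy automata over {a, b}: an even number of a's, and words ending in b.
even_a = ({("e", "a"): "o", ("o", "a"): "e", ("e", "b"): "e", ("o", "b"): "o"}, "e", {"e"})
ends_b = ({("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"}, "n", {"y"})
both = intersect(even_a, ends_b)
print(run(both, "aab"))  # True
print(run(both, "ab"))   # False
```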

Intersection of M(Lk(P)) and M(Fac(T)).

T = aabbabd, P = bab, k = 1

[Figure: the intersection automaton, drawn together with M(Fac(T)) and M(Lk(P)). Each state is labelled with a set of states of M(Lk(P)) followed by a state of M(Fac(T)). Labels shown: {0,0;1,1};0, {1,0;2,1};9, {1,1};7, {1,1;2,1};1, {2,0;3,1};5, {1,1;2,1;3,1};4, {1,1;2,1};7, {2,1};2, {3,1};8, {3,1};3, {3,0};6, {2,1};2, {3,1};6. Transitions are labelled a, b and d.]

Solutions: {ba, bab, bb, bbab, aab, ab}. Each solution ends in a state whose label contains 3,0 or 3,1.
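The solution set can be reproduced by walking both automata in lockstep. In the sketch below the factor automaton state is represented by the set of text positions where the current factor can end, and the pattern automaton by its set of active (i, e) states. The transition rules of the pattern automaton are reconstructed from the slides and are therefore assumptions (deletions as ε-moves, insertions only strictly inside the pattern and only with a symbol differing from the next pattern symbol). All names are hypothetical.

```python
def closure(states, m, k):
    """Epsilon (deletion) moves (i, e) -> (i + 1, e + 1)."""
    result, stack = set(states), list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in result:
            result.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return result


def step(states, ch, pattern, k):
    m, nxt = len(pattern), set()
    for i, e in states:
        if i == m:
            continue                  # assumed: nothing leaves the last column
        if ch == pattern[i]:
            nxt.add((i + 1, e))       # match
        elif e < k:
            nxt.add((i + 1, e + 1))   # substitution
            if i > 0:
                nxt.add((i, e + 1))   # insertion (assumed restriction)
    return closure(nxt, m, k)


def approximate_factors(text, pattern, k):
    """Factors of text accepted by the pattern automaton, found via the product."""
    m, n = len(pattern), len(text)
    found = set()
    # Each stack entry: (factor read so far, pattern states, end positions in text)
    stack = [("", closure({(0, 0)}, m, k), frozenset(range(n + 1)))]
    while stack:
        word, active, ends = stack.pop()
        if any(i == m for i, _ in active):
            found.add(word)           # final in both automata
        for ch in {text[p] for p in ends if p < n}:
            nxt_active = step(active, ch, pattern, k)
            if not nxt_active:
                continue              # pattern automaton is stuck
            nxt_ends = frozenset(p + 1 for p in ends if p < n and text[p] == ch)
            stack.append((word + ch, nxt_active, nxt_ends))
    return found


print(sorted(approximate_factors("aabbabd", "bab", 1)))
# ['aab', 'ab', 'ba', 'bab', 'bb', 'bbab']
```

The walk terminates because every step extends the factor by one symbol and no factor is longer than the text.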


Intersection

[Figure: the intersection automaton, repeated on six slides together with the text T = aabbabd and the pattern P = bab, one slide per solution.]

• DL(P, ba) = 1

• DL(P, bab) = 0

• DL(P, bb) = 1

• DL(P, bbab) = 1

• DL(P, aab) = 1

• DL(P, ab) = 1


Lemma

• The number of automaton is always lower than

. !1)2( 1 kk m-k

[Figure: suffix tree of T = aabbabd, drawn from the node "root" with edges labelled a, b and d.]

T = aabbabd, P = bab, k = 1. The finite automaton M(Lk(P)) accepts Lk(P). Lk(P) = {ab, bb, ba, aab, bab, dab, bbb, bdb, baa, bad, bbab, bdab, baab, badb}.


Thank you!