Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein Modified by: Hsing-Yen Ann Date: Nov. 26, 2004


Exact String Matching

Input: T = t1 … tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B

T = A B A B C A A B C A A B C A A B A A…

Answer: {3, 7, 11, …}
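The example above can be checked with a small sketch that compares P against every window of T. The function name `find_occurrences` is an assumption made for illustration; locations are reported 1-based, as on the slide.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return all 1-based locations i of text where pattern appears."""
    n, m = len(text), len(pattern)
    # Compare the pattern against every length-m window of the text.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_occurrences("ABABCAABCAABCAABAA", "ABCAAB"))  # [3, 7, 11]
```

This is the direct O(nm) check, not one of the linear-time exact matching algorithms.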

Approximate String Matching

Idea: Find all text locations where the distance from the pattern is sufficiently small.

Distance metric: Hamming distance

Let S = s1 s2 … sm

R = r1 r2 … rm

Ham(S, R) = the number of locations j where sj ≠ rj

Example: S = ABCABC, R = ABBAAC

Ham(S, R) = 2
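A minimal sketch of the Hamming distance as defined above. The name `ham` mirrors the slide's Ham(S, R); the length check is an added assumption, since the definition only covers strings of equal length.

```python
def ham(s: str, r: str) -> int:
    """Number of locations j where s[j] != r[j] (equal-length strings only)."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, r))


print(ham("ABCABC", "ABBAAC"))  # 2
```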

String Matching with Mismatches

Input: T = t1 … tn

P = p1 … pm

Output: For each i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C

T = A B C A A B C A C…

Writing Ti for the length-m substring of T that starts at location i:

Ham(P, T1) = 2

Ham(P, T2) = 4

Ham(P, T3) = 6

Ham(P, T4) = 2

Output: 2, 4, 6, 2, …
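The output sequence of this example can be reproduced with a short sketch. `mismatch_profile` is a hypothetical name; the returned list holds Ham(P, Ti) for i = 1, 2, … in order.

```python
def mismatch_profile(text: str, pattern: str) -> list[int]:
    """Hamming distance between pattern and every length-m window of text."""
    n, m = len(text), len(pattern)
    return [
        sum(text[i + j] != pattern[j] for j in range(m))
        for i in range(n - m + 1)
    ]


print(mismatch_profile("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
```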

String Matching with k Mismatches

Input: T = t1 … tn, P = p1 … pm

Output: Every i in T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k

Example: k = 2

P = A B B A A C

T = A B C A A B C A C…

Hamming distances: 2, 4, 6, 2, …

Answers: Y, N, N, Y, …

Naïve Algorithm (for counting mismatches or k-mismatches problem)

- Go to each location i of the text and compute the Hamming distance of P and Ti

Running Time: O(nm), n = |T|, m = |P|
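As a sketch of the naïve approach for the k-mismatches variant, the loop below visits every text location and stops comparing as soon as more than k mismatches are seen. The early exit is an illustrative addition and does not change the O(nm) worst case. The name `k_mismatch_naive` is an assumption; locations are 1-based.

```python
def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """Return 1-based locations i with Ham(pattern, T_i) <= k, by direct comparison."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break  # this location already fails; stop comparing
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```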

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Trie

• A tree representing a set of strings.

{ aeef, ad, bbfe, bbfg, c }

[Figure: trie of the set above, one character per edge; the root has children a, b, and c.]

Trie (Cont)

• Assume no string is a prefix of another

Each string corresponds to a leaf.

[Figure: the same trie; each of the five strings ends at a leaf.]
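A trie for the example set can be sketched with nested dictionaries, one character per edge. The names `build_trie` and `leaves` are assumptions; the sketch relies on the slide's assumption that no string is a prefix of another, so every string ends at a leaf (an empty dictionary).

```python
def build_trie(words):
    """Build a trie as nested dicts; each key is the character on an edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Spell out the string for every leaf, in alphabetical order."""
    if not node:
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))  # ['a', 'b', 'c']
print(leaves(trie))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```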

Compressed Trie

• Compress unary nodes, label edges by strings

[Figure: the trie above next to its compressed version, whose edges are labeled a, eef, d, bbf, e, g, and c.]
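The compression step can be sketched on a nested-dictionary trie: every chain of unary nodes is merged into a single edge labeled by a string. Both function names are assumptions, and the sketch again assumes a prefix-free set, so merging chains never hides the end of a string.

```python
def build_trie(words):
    """Build a trie as nested dicts; each key is the character on an edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge chains of unary nodes so that each edge carries a string label."""
    compact = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:  # unary node: absorb its only edge
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        compact[label] = compress(child)
    return compact


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
# {'a': {'eef': {}, 'd': {}}, 'bbf': {'e': {}, 'g': {}}, 'c': {}}
```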

Suffix tree

Suffix tree of string s: a compressed trie of all suffixes of s

Prefix-free: add a special character, say $, at the end of s

Suffix tree (Example)

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$; the root has edges ab, b, and $; the nodes below ab and b each have child edges ab$ and $.]
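The role of the terminator can be illustrated with a short check that the suffix set only becomes prefix-free once $ is appended. The helper names are assumptions, and $ is assumed not to occur in s itself.

```python
def suffixes(s: str) -> list[str]:
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_free(strings) -> bool:
    """True if no string in the collection is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)


print(suffixes("abab$"))                  # ['abab$', 'bab$', 'ab$', 'b$', '$']
print(is_prefix_free(suffixes("abab")))   # False: 'ab' is a prefix of 'abab'
print(is_prefix_free(suffixes("abab$")))  # True
```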

Suffix Tree properties

- Succinct in space: O(n).

- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).

[Figure: suffix tree of abab$ with its leaves labeled by the suffix start locations 1 to 5.]

Exact string matching

s = abab$

Given a pattern P = ab we traverse the tree according to the pattern.

Leaves correspond to locations of appearance! Here: locations 1 and 3.

Prepare tree: O(n) time

Find matches: O(m + occ) time, occ = # of matches

[Figure: suffix tree of abab$ with leaves 1 to 5; the path spelling ab leads to the subtree containing leaves 1 and 3.]
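The traversal can be sketched on an uncompressed suffix trie, which is a simplification chosen here for brevity: it takes O(n²) time and space to build, unlike the O(n) suffix tree on the slides. The names `build_suffix_trie`, `find`, and the `LEAF` marker key are assumptions; locations are 1-based.

```python
LEAF = "leaf"  # marker key; cannot clash with single-character edge keys


def build_suffix_trie(s: str) -> dict:
    """Insert every suffix of s; record the suffix start at the node where it ends."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root


def find(root: dict, pattern: str) -> list[int]:
    """Walk down along pattern, then collect every leaf below that node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == LEAF:
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


root = build_suffix_trie("abab$")
print(find(root, "ab"))  # [1, 3]
```

The final sort is only for readable output; the leaves below the reached node are the answer in any order.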


Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Why?

The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

s = abbaab$

Example suffixes: aab$ and abbaab$

[Figure: suffix tree of abbaab$ with leaves labeled 1 to 7; the leaves of the two example suffixes and their LCA are marked.]
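The quantity that the LCA encodes can be computed directly, as a sanity check on the example: the string spelled from the root to the LCA of two leaves has the same length as the LCP of the two suffixes. The sketch below scans characters instead of using a tree, and the name `lcp_length` is an assumption; suffix locations are 1-based.

```python
def lcp_length(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes of s at 1-based i and j."""
    a, b = i - 1, j - 1
    length = 0
    while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
        length += 1
    return length


s = "abbaab$"
print(lcp_length(s, 4, 1))  # 1: aab$ and abbaab$ share only 'a'
print(lcp_length(s, 5, 1))  # 2: ab$ and abbaab$ share 'ab'
```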

LCA/LCP properties

Preprocessing time: O(n)

Query time: O(1)

Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

[Figure: suffix tree of abbaab$ with leaves labeled 1 to 7.]
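Constant-time LCP queries can also be sketched without a suffix tree. The block below swaps in a different, simpler technique than the LCA algorithms cited above: a suffix array, an LCP array (Kasai's algorithm), and a sparse table for range minima. Queries take O(1), but preprocessing here is not O(n): the suffix sort is naive and the table costs O(n log n). The name `build_lce` is an assumption, and positions are 0-based.

```python
def build_lce(s: str):
    """Return lce(i, j): LCP length of the suffixes of s at 0-based i and j."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix sort, for brevity
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    # Kasai: lcp[r] = LCP length of the suffixes sa[r - 1] and sa[r].
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1

    # Sparse table: table[l][r] = min(lcp[r : r + 2**l]).
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[r], prev[r + span]) for r in range(n - 2 * span + 1)])
        span *= 2

    def lce(i: int, j: int) -> int:
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1  # the answer is min(lcp[lo + 1 .. hi]) in suffix-array order
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lce


lce = build_lce("abbaab$")
print(lce(3, 0))  # 1: aab$ vs abbaab$
print(lce(4, 0))  # 2: ab$ vs abbaab$
```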

The Kangaroo Method (for k-mismatches)

- Create suffix tree for: s = P#T$

- Check P at each location i of T by kangarooing

Example:

P = A B A B A A B A C A B

T = A B B A C A B A B A B C A B B C A B C A …

(i marks the text location where P is currently aligned)

- Finding LCP(s, P0, Ti): length of LCP(s, P0, Ti) = 4

- Kangarooing distance = LCP(s, P0, Ti) + 1 = 5

- Finding LCP(s, P5, Ti+5): length of LCP(s, P5, Ti+5) = 2

- Kangarooing distance = LCP(s, P5, Ti+5) + 1 = 3

- Finding LCP(s, P8, Ti+8): length of LCP(s, P8, Ti+8) = 3, which reaches the end of P

- Next iteration: i = i + 1

The Kangaroo Method (for k-mismatches)

Preprocess:

- Build suffix tree of both P and T: O(n+m) time

- LCA preprocessing: O(n+m) time

Check P at given text location:

- Kangaroo jump till next mismatch: O(k) time

Overall time for naïve approach: O(nk)
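The jumping loop can be sketched as follows. To stay short, the LCP query is a plain character scan (`lce_naive`), which is an assumption standing in for the O(1) suffix tree plus LCA query; with the scan, the sketch shows the control flow but not the O(nk) bound. It also assumes that # and $ do not occur in P or T. All names are hypothetical and locations are 0-based, matching the P0, P5, P8 offsets of the example.

```python
def lce_naive(s: str, a: int, b: int) -> int:
    """LCP length of the suffixes of s at a and b (stand-in for an O(1) query)."""
    length = 0
    while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
        length += 1
    return length


def kangaroo_k_mismatch(text: str, pattern: str, k: int, lce=lce_naive) -> list[int]:
    """Return 0-based locations i where pattern matches text with at most k mismatches."""
    n, m = len(text), len(pattern)
    s = pattern + "#" + text + "$"
    offset = m + 1  # index in s of the first text character
    hits = []
    for i in range(n - m + 1):
        j = 0  # current position in the pattern
        mismatches = 0
        while j < m:
            j += lce(s, j, offset + i + j)  # jump over the matching run
            if j >= m:
                break  # reached the end of the pattern
            mismatches += 1  # position j is a mismatch
            if mismatches > k:
                break
            j += 1  # hop over the mismatch
        if mismatches <= k:
            hits.append(i)
    return hits


P = "ABABAABACAB"
T = "ABBACABABABCABBCABCA"
print(kangaroo_k_mismatch(T, P, 2))  # [7]
```

At location 7 the loop makes the same three jumps as the example: runs of length 4, 2, and 3, separated by two mismatches. Each location needs at most k + 1 LCP queries.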

Faster Algorithms for Four Different Cases

2004/11/22 Hsing-Yen Ann

- Large alphabet: at least 2k different symbols in pattern P. Time: O(n)

- Small alphabet: at most 2√k different symbols in pattern P. Time: O(n √k log m)

- General alphabets, many frequent symbols: at least √k frequent symbols. Time: O(n √k log m)

- General alphabets, few frequent symbols: fewer than √k frequent symbols. Time: O(n √k log m)