String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld

String Matching with k Mismatches

Landau – Vishkin 1986

Galil – Giancarlo 1986

Abrahamson 1987

Amir – Lewenstein – Porat 2000

Exact String Matching

Input: T = t1 … tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B

T = A B A B C A A B C A A B C A A B A A…

Answer: {3, 7, 11, …}

Exact String Matching

Problem: In many applications the matching is not exact:

• Computational Biology

• Musicology

• Text Editing

• Meteorology

• etc.

Need other definitions of string matching!

Approximate String Matching

Idea: Find all text locations where distance from pattern is sufficiently small.

Distance metric: Hamming distance

Let S = s1s2…sm

R = r1r2…rm

Ham(S,R) = the number of locations j where sj ≠ rj

Example:

S = ABCABC

R = ABBAAC

Ham(S,R) = 2
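The definition above can be sketched in a few lines of Python. The function name `hamming` is an illustrative choice, not something taken from the slides.

```python
def hamming(s: str, r: str) -> int:
    """Number of positions j where s[j] != r[j]."""
    if len(s) != len(r):
        # The slides define Ham only for two strings of the same length m.
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, r))


print(hamming("ABCABC", "ABBAAC"))  # 2, as in the example above
```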

String Matching with Mismatches

Input: T = t1 … tn

P = p1 … pm

Output: For each location i of T, Ham(P, Ti), where Ti = ti ti+1 … ti+m-1

Example:

P = A B B A A C

T = A B C A A B C A C…

Ham(P,T1) = 2, Ham(P,T2) = 4, Ham(P,T3) = 6, Ham(P,T4) = 2, …

String Matching with k Mismatches

Input: T = t1 … tn, P = p1 … pm

Output: Every location i of T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k

Example: k = 2

P = A B B A A C

T = A B C A A B C A C…

Hamming distances: 2, 4, 6, 2, …

Within k mismatches: Y, N, N, Y, …

Naïve Algorithm (for the counting-mismatches or k-mismatches problem)

- Go to each location i of the text and compute the Hamming distance of P and Ti

Running Time: O(nm), where n = |T|, m = |P|
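A minimal sketch of the naïve algorithm, assuming 1-based locations as in the slides. The function names are illustrative, and the text used in the usage lines is only the visible prefix of the slide example (the slide's T continues with "…").

```python
def mismatch_counts(text: str, pattern: str) -> list:
    """Ham(P, Ti) for every location i where the pattern fits: O(nm) time."""
    n, m = len(text), len(pattern)
    return [
        sum(text[i + j] != pattern[j] for j in range(m))
        for i in range(n - m + 1)
    ]


def k_mismatch_locations(text: str, pattern: str, k: int) -> list:
    """1-based locations i with Ham(P, Ti) <= k."""
    counts = mismatch_counts(text, pattern)
    return [i + 1 for i, d in enumerate(counts) if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))          # [2, 4, 6, 2]
print(k_mismatch_locations("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```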

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Trie

• A tree representing a set of strings.

[Figure: trie for the set of strings { aeef, ad, bbfe, bbfg, c }]

Trie (Cont)

• Assume no string is a prefix of another

[Figure: the trie for { aeef, ad, bbfe, bbfg, c }]

Each string corresponds to a leaf.
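A trie can be sketched with nested dictionaries, assuming (as stated above) that no string is a prefix of another, so every stored string ends at a leaf. The helper names `build_trie` and `leaves` are illustrative.

```python
def build_trie(strings):
    """Nested-dict trie: each key is one character, each value is a child node."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Strings spelled on root-to-leaf paths, in alphabetical order."""
    if not node:  # an empty dict is a leaf
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```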

Compressed Trie

• Compress unary nodes, label edges by strings

[Figure: the trie for { aeef, ad, bbfe, bbfg, c } next to its compressed form, whose edges are labeled a, bbf, c, eef, d, e, g]
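The compression step can be sketched as follows, again on a nested-dict trie and assuming a prefix-free set of strings. `build_trie` and `compress` are illustrative names; the sketch merges every chain of single-child nodes into one edge labeled by a string.

```python
def build_trie(strings):
    """Nested-dict trie with one character per edge."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a compressed copy in which unary nodes are merged into edge labels."""
    out = {}
    for ch, child in node.items():
        label = ch
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
# {'a': {'eef': {}, 'd': {}}, 'bbf': {'e': {}, 'g': {}}, 'c': {}}
```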

Suffix tree

Suffix tree of string s: a compressed trie of all suffixes of s

To make the set of suffixes prefix-free, add a special character, say $, at the end of s

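Suffix tree (Example)

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$; the edges leaving the root are labeled ab, b and $, and the nodes below ab and b each have two child edges labeled ab$ and $]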
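A short check of why the end marker is needed, using illustrative helper names: without $, some suffix of abab is a prefix of another suffix and would not end at a leaf.

```python
def suffixes(s: str) -> list:
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_free(strings) -> bool:
    """True if no string in the collection is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)


print(suffixes("abab$"))                  # ['abab$', 'bab$', 'ab$', 'b$', '$']
print(is_prefix_free(suffixes("abab")))   # False: 'ab' is a prefix of 'abab'
print(is_prefix_free(suffixes("abab$")))  # True
```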

Suffix Tree properties

- Succinct in space: O(n).

- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).

[Figure: suffix tree of abab$ with leaves numbered 1 to 5 by the starting location of their suffixes]

Exact string matching

[Figure: suffix tree of s = abab$ with leaves numbered 1 to 5]

Given a pattern P = ab, we traverse the tree from the root according to the pattern.

The leaves below the point reached correspond to the locations where P appears: in s = abab$, P = ab appears at locations 1 and 3.

Prepare Tree: O(n) time

Find matches: O(m + occ) time, where occ = # of matches
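The traversal can be illustrated with an uncompressed suffix trie, a deliberate simplification: it takes O(n²) time and space to build, unlike the linear-time suffix tree the slides refer to, but the search works the same way. The names `build_suffix_trie` and `find_occurrences` and the `"leaf"` marker key are assumptions of this sketch.

```python
def build_suffix_trie(s: str) -> dict:
    """Uncompressed suffix trie of s + '$'; each leaf stores its 1-based start."""
    s += "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # "leaf" cannot clash with one-character keys
    return root


def find_occurrences(root: dict, pattern: str) -> list:
    """Walk down along the pattern, then collect every leaf below that point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


tree = build_suffix_trie("abab")
print(find_occurrences(tree, "ab"))  # [1, 3]
```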

Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer lowest common ancestor (LCA) queries on it.

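Why? The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.

[Figure: suffix tree of s = abbaab$ with leaves numbered 1 to 7; the leaves of the suffixes aab$ and abbaab$ are highlighted]

Example: the suffixes aab$ and abbaab$ share the prefix a, which is the string spelled on the path from the root to their LCA.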
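What an LCA query returns can be checked against a direct scan. The sketch below computes the LCP length of two suffixes character by character, which takes time proportional to the answer rather than the O(1) an LCA-preprocessed suffix tree gives. The name `lcp_length` and the 1-based arguments are assumptions of this sketch.

```python
def lcp_length(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes of s that
    start at 1-based locations i and j (plain scan, not an LCA query)."""
    i, j = i - 1, j - 1
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


s = "abbaab$"
print(lcp_length(s, 4, 1))  # 1: 'aab$' and 'abbaab$' share only 'a'
print(lcp_length(s, 5, 1))  # 2: 'ab$' and 'abbaab$' share 'ab'
```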

LCA/LCP properties

Preprocessing time: O(n)

Query time: O(1)

Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

The Kangaroo Method (for k-mismatches)

- Create suffix tree for: s = P#T

- Check P at each location i of T by kangarooing

Example:

P = A B A B A A B A C A B

T = A B B A C A B A B A B C A B B C A B C A …

(i marks the text location currently being checked against P)

- Do up to k+1 LCP queries for every text location: one jump to each of the first k mismatches, plus one more to see whether the rest of P matches

The Kangaroo Method (for k-mismatches)

Preprocess:

Build the suffix tree of P and T together (s = P#T): O(n+m) time

LCA preprocessing: O(n+m) time

Check P at a given text location:

Kangaroo jump to the next mismatch, up to k times: O(k) time

Overall time: O(nk)
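The method can be sketched end to end under one substitution, stated plainly: instead of a suffix tree with LCA preprocessing, the sketch below answers LCP queries with a suffix array, Kasai's LCP array and a sparse table for range minima. Each query is still O(1), so checking one location costs O(k), but the preprocessing here is not linear (the suffix array is built by plain sorting). All function names are illustrative, and the separator # is assumed not to occur in P or T.

```python
def suffix_array(s):
    # Simple construction by sorting suffixes; adequate for a demo only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai: lcp[r] = LCP length of the suffixes ranked r-1 and r.
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        h = max(h - 1, 0)
    return rank, lcp


def sparse_table(a):
    # table[p][i] = min(a[i : i + 2**p])
    table, p = [a[:]], 1
    while (1 << p) <= len(a):
        prev, half = table[-1], 1 << (p - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(len(a) - (1 << p) + 1)])
        p += 1
    return table


def make_lcp_query(s):
    rank, lcp = lcp_array(s, suffix_array(s))
    table = sparse_table(lcp)

    def query(i, j):
        """LCP length of the suffixes of s starting at 0-based i and j."""
        if i == j:
            return len(s) - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1  # the answer is min(lcp[lo+1 .. hi]) in the original ranks
        p = (hi - lo + 1).bit_length() - 1
        return min(table[p][lo], table[p][hi - (1 << p) + 1])

    return query


def k_mismatch(text, pattern, k, sep="#"):
    """1-based locations i of text with Ham(pattern, Ti) <= k."""
    if sep in text or sep in pattern:
        raise ValueError("separator must not occur in the text or the pattern")
    n, m = len(text), len(pattern)
    query = make_lcp_query(pattern + sep + text)
    result = []
    for i in range(n - m + 1):
        j, mismatches = 0, 0  # j = number of pattern characters handled so far
        while j < m and mismatches <= k:
            j += query(j, m + 1 + i + j)  # kangaroo jump over the matching run
            if j < m:                      # stopped at a mismatch: count it, skip it
                mismatches += 1
                j += 1
        if mismatches <= k:
            result.append(i + 1)
    return result


print(k_mismatch("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```

The usage line reuses the visible prefix of the earlier slide example with k = 2; the loop performs at most k+1 LCP queries per location.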