Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Page 1: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Survey: String Matching with k Mismatches

Moshe Lewenstein Bar Ilan University

Page 2: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

String Matching with k Mismatches

Landau – Vishkin 1986

Galil – Giancarlo 1986

Abrahamson 1987

Amir – Lewenstein – Porat 2000

Page 7: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Exact String Matching

Input: T = t_1 … t_n

P = p_1 … p_m

Output: All locations i of T where P appears

Example:

P = A B C A A B

T = A B A B C A A B C A A B C A A B A A…

Answer: {3, 7, 11, …}

Page 8: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Exact String Matching

Problem: Matching not exact in applications of:

• Computational Biology

• Musicology

• Text Editing

• Meteorology

• etc.

Need other definitions of string matching!

Page 9: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Approximate String Matching

Idea: Find all text locations where distance from pattern is sufficiently small.

distance metric: HAMMING DISTANCE

Let S = s1s2…sm

R = r1r2…rm

Ham(S,R) = the number of locations j where s_j ≠ r_j

Example: S = ABCABC R = ABBAAC

Ham(S,R) = 2
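The definition above can be checked with a few lines of Python. This is only a minimal sketch; the function name `hamming` is our own choice, not something from the slides.

```python
def hamming(s, r):
    """Number of positions j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


print(hamming("ABCABC", "ABBAAC"))  # the slide's example: 2
```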

Page 15: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

String Matching with Mismatches

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i of T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Example:

P = A B B A A C

T = A B C A A B C A C…

Ham(P, T_i) for i = 1, 2, 3, 4, …: 2, 4, 6, 2, …

Page 18: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

String Matching with k Mismatches

Input: T = t_1 … t_n, P = p_1 … p_m

Output: Every location i of T such that Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k

Example: k = 2

P = A B B A A C

T = A B C A A B C A C…

Hamming distances: 2, 4, 6, 2, …

At most k mismatches? Y, N, N, Y, …

Page 19: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Naïve Algorithm (for counting mismatches or k-mismatches problem)

- Go to each location i of the text and compute the Hamming distance of P and T_i = t_i … t_{i+m-1}

Running time: O(nm), where n = |T|, m = |P|
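A minimal sketch of the naïve algorithm, run on the example from the earlier slides. The function names are hypothetical; reported locations are 1-based, as in the slides.

```python
def mismatch_counts(text, pattern):
    """Hamming distance between pattern and every length-m window of text: O(nm)."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def k_mismatch_locations(text, pattern, k):
    """1-based text locations where pattern matches with at most k mismatches."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


T = "ABCAABCAC"
P = "ABBAAC"
print(mismatch_counts(T, P))          # [2, 4, 6, 2]
print(k_mismatch_locations(T, P, 2))  # [1, 4]
```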

Page 20: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Page 21: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Trie

• A tree representing a set of strings.

[Figure: trie for the set of strings { aeef, ad, bbfe, bbfg, c }]
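One possible way to represent such a trie is as nested dictionaries, one dictionary per node. This is a sketch under that assumption; the helper names `build_trie` and `leaves` are ours.

```python
def build_trie(words):
    """Nested-dict trie: each key is a character, each value is the child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Spell out the strings that end at leaves (nodes without children)."""
    if not node:
        return [prefix]
    out = []
    for ch in sorted(node):
        out.extend(leaves(node[ch], prefix + ch))
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))  # characters leaving the root: ['a', 'b', 'c']
print(leaves(trie))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```

Because no string in the set is a prefix of another, every string ends at a leaf and `leaves` recovers the whole set.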

Page 22: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Trie (Cont)

• Assume no string is a prefix of another

[Figure: trie for { aeef, ad, bbfe, bbfg, c }]

Each string corresponds to a leaf.

Page 23: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Compressed Trie

• Compress unary nodes, label edges by strings

[Figure: the trie for { aeef, ad, bbfe, bbfg, c } next to its compressed version, whose edges are labeled a, eef, d, bbf, e, g, c]

Page 24: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Suffix tree

Suffix tree of string s: a compressed trie of all suffixes of s

To make the set of suffixes prefix-free, add a special character, say $, at the end of s

Page 25: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Suffix tree (Example)

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$, with edges labeled ab, b, ab$, $]

Page 26: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Suffix Tree properties

- Succinct in space: O(n).

- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).

[Figure: suffix tree of abab$, leaves labeled 1 to 5]

Page 27: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Exact string matching

[Figure: suffix tree of s = abab$, leaves labeled 1 to 5]

Given a pattern P = ab we traverse the tree according to the pattern.

Page 28: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Exact string matching

[Figure: suffix tree of s = abab$, leaves labeled 1 to 5]

Leaves correspond to locations of appearance: P = ab appears in s = abab$ at locations 1 and 3.

Page 29: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Exact string matching

[Figure: suffix tree of s = abab$, leaves labeled 1 to 5]

Prepare tree: O(n) time

Find matches: O(m + occ) time, where occ = # of matches
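The traversal can be tried out on an uncompressed suffix trie, which is far simpler to build than a real suffix tree. Note the assumption: this stand-in takes O(n²) time and space, not the O(n) of the constructions cited in the slides, and the reserved key `"leaf"` and the function names are our own.

```python
def build_suffix_trie(s):
    """Uncompressed trie of all suffixes of s; s should end with a unique terminator.
    Each leaf stores the 1-based start of its suffix under the key 'leaf'."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def occurrences(root, pattern):
    """Walk down along pattern, then collect every leaf below the reached node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


tree = build_suffix_trie("abab$")
print(occurrences(tree, "ab"))  # [1, 3]
print(occurrences(tree, "b"))   # [2, 4]
```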

Page 30: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Page 34: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

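Why?

The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.

[Figure: suffix tree of s = abbaab$, leaves labeled 1 to 7]

Example: in s = abbaab$, the suffixes aab$ (location 4) and abbaab$ (location 1) have LCP a, so the LCA of leaves 4 and 1 is the node reached from the root by a.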

Page 35: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

LCA/LCP properties

[Figure: suffix tree of s = abbaab$, leaves labeled 1 to 7]

Preprocessing time: O(n)

Query time: O(1)

Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

Page 36: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

The Kangaroo Method (for k-mismatches)

- Create suffix tree for: s = P#T

- Check P at each location i of T by kangarooing

Example:

P = A B A B A A B A C A B

T = A B B A C A B A B A B C A B B C A B C A …

[Figure: P aligned at location i of T]

Page 45: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

The Kangaroo Method (for k-mismatches)

Preprocess:

- Build suffix tree of both P and T: O(n+m) time

- LCA preprocessing: O(n+m) time

Check P at given text location:

- Kangaroo jump till next mismatch: O(1) time per jump, at most k+1 jumps, so O(k) time per location

Overall time: O(nk)
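The jumping loop can be sketched as follows. One loud assumption: the helper `lce` below finds the longest common prefix by direct scanning, as a stand-in for the O(1) LCA query on the suffix tree of P#T, so this sketch shows the control flow but does not reach the O(nk) bound. All names are hypothetical, and `#` is assumed to occur in neither string.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:].
    Stand-in: scans directly instead of using an O(1) LCA query."""
    length = 0
    while (i + length < len(s) and j + length < len(s)
           and s[i + length] == s[j + length]):
        length += 1
    return length


def kangaroo_k_mismatch(text, pattern, k):
    """1-based text locations where pattern matches with at most k mismatches."""
    n, m = len(text), len(pattern)
    s = pattern + "#" + text   # '#' is assumed to appear in neither string
    offset = m + 1             # text[i] lives at s[offset + i]
    result = []
    for i in range(n - m + 1):
        j = 0                  # next pattern position to compare
        mismatches = 0
        while True:
            j += lce(s, j, offset + i + j)  # jump over a matching stretch
            if j >= m:
                break                       # reached the end of the pattern
            mismatches += 1                 # landed on a mismatch
            if mismatches > k:
                break                       # give up after k+1 mismatches
            j += 1                          # step past the mismatch
        if mismatches <= k:
            result.append(i + 1)
    return result


print(kangaroo_k_mismatch("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```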

Page 53: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 (a-mask: 1 where P has a)

T = a b b b c c c a a a a b a c b ......

T_not-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ... (not-a mask: 1 where T does not have a)

Multiply P_a and T_not-a to count mismatches (use FFT): for each alignment of P_a against T_not-a, the sum of the products is the number of mismatches at pattern positions holding a.

Page 56: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Boolean Convolutions (FFT) Method

Running time:

- One boolean convolution: O(n log m) time

- # of matches of all symbols: O(n |Σ| log m) time
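A self-contained sketch of the mask-and-multiply idea, with one convolution per pattern symbol. Assumptions to note: the small recursive FFT is our own minimal implementation, and it convolves against the whole text at once (O(n log n) per symbol); the O(n log m) bound on the slide comes from the standard trick of cutting the text into overlapping blocks of length 2m, which is omitted here.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for t in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * t / n) * odd[t]
        out[t] = even[t] + w
        out[t + n // 2] = even[t] - w
    return out


def convolve(x, y):
    """Integer convolution of two 0/1 lists via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    return [round(v.real / size) for v in prod[: len(x) + len(y) - 1]]


def mismatch_counts_fft(text, pattern):
    """Hamming distance at every alignment, one convolution per pattern symbol."""
    n, m = len(text), len(pattern)
    totals = [0] * (n - m + 1)
    for sym in set(pattern):
        p_mask = [1 if c == sym else 0 for c in pattern]  # P_sym
        t_mask = [0 if c == sym else 1 for c in text]     # T_not-sym
        # Reversing the pattern mask turns the convolution into a correlation.
        conv = convolve(t_mask, p_mask[::-1])
        for i in range(n - m + 1):
            totals[i] += conv[i + m - 1]
    return totals


print(mismatch_counts_fft("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
```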

Page 57: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Counting Method

Input: Text: T = t1…tn

Pattern: P = p1…pm

Max # of allowed mismatches: k

Assumption: Each pattern element is distinct

P = a b c d e f g h

T = b g d e f h d c c a b g h h ...

Count matches (instead of mismatches): each text symbol appears at most once in P, so it increments the counter of the single text location at which P would match it.
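Under the distinct-symbols assumption the counting pass is a single scan of the text. A minimal sketch with hypothetical names; it uses 0-based indices internally and treats the slide's visible text prefix as the whole text.

```python
def match_counts_distinct(text, pattern):
    """Matches per alignment, assuming every symbol of pattern is distinct."""
    n, m = len(text), len(pattern)
    if len(set(pattern)) != m:
        raise ValueError("this sketch assumes distinct pattern symbols")
    position = {c: j for j, c in enumerate(pattern)}  # symbol -> its only index in P
    counter = [0] * (n - m + 1)
    for t, c in enumerate(text):
        j = position.get(c)
        if j is None:
            continue             # symbol does not occur in P at all
        start = t - j            # the only alignment where text[t] meets pattern[j]
        if 0 <= start <= n - m:
            counter[start] += 1
    return counter


P = "abcdefgh"
T = "bgdefhdccabghh"
matches = match_counts_distinct(T, P)
mismatches = [len(P) - c for c in matches]  # Hamming distance per alignment
print(matches)
print(mismatches)
```

Each text symbol touches at most one counter, so the scan takes O(n) time.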

Page 58: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

O(n√k log m) Algorithm

Frequent symbol: a symbol that appears at least 2√k times in P.

We distinguish between two cases:

Case 1: At least √k frequent symbols.

Case 2: Fewer than √k frequent symbols.

Case 1: At least √k frequent symbols.

- Consider the first √k frequent symbols.

- For each of them construct a mask for its first 2√k appearances.

Page 61: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask: the first 4 appearances of a in P (positions 1, 3, 6, 9)

c-mask: the first 4 appearances of c in P (positions 4, 5, 7, 10)

T = a b b b c c c a a a a b a c b ......

[Figure: the a-mask aligned under T, with a row of per-location counters being incremented]

Page 62: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Counting Stage:

Run through the text and count occurrences of all marks.

Time: O(n√k).

For location i of T, if counter_i < k then no match at location i.

Why? The total # of elements in all masks is 2√k · √k = 2k.

Important Observations:

1) Sum of all counters ≤ 2√k · n

2) Every location whose counter value is less than k already has more than k errors.

Page 63: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

How many locations remain?

Sum of all counters: ≤ 2n√k

Value of potential matches: ≥ k

# of potential matches: ≤ 2n√k / k = 2n/√k

How do we check these locations?

Use the Kangaroo Method.

Kangaroo Method time: O(k) per location

Overall time: O((n/√k) · k) = O(n√k)
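The Case 1 filter can be sketched end to end. Assumptions, stated loudly: √k and 2√k are rounded up so the masks hold at least 2k marks, the pruning threshold is therefore written as (total marks − k), which equals k when √k is an integer, and surviving candidates are verified by direct comparison as a stand-in for the O(k) kangaroo check. The function name is hypothetical.

```python
import math


def k_mismatch_case1(text, pattern, k):
    """1-based k-mismatch locations via masked counting (Case 1 only)."""
    n, m = len(text), len(pattern)
    per_symbol = math.ceil(2 * math.sqrt(k))  # marked appearances per symbol, about 2*sqrt(k)
    num_symbols = math.ceil(math.sqrt(k))     # frequent symbols used, about sqrt(k)

    positions = {}
    for j, c in enumerate(pattern):
        positions.setdefault(c, []).append(j)
    frequent = [c for c, pos in positions.items() if len(pos) >= per_symbol]
    if len(frequent) < num_symbols:
        raise ValueError("fewer than sqrt(k) frequent symbols: Case 2 applies")

    # Mask of each chosen symbol: its first per_symbol appearances in P.
    masks = {c: positions[c][:per_symbol] for c in frequent[:num_symbols]}
    total_marks = num_symbols * per_symbol    # at least 2k

    # Counting stage: every text symbol increments one counter per mask element.
    counter = [0] * (n - m + 1)
    for t, c in enumerate(text):
        for j in masks.get(c, ()):
            start = t - j
            if 0 <= start <= n - m:
                counter[start] += 1

    # At most k marked positions may mismatch, so a match needs this many hits.
    threshold = total_marks - k
    candidates = [i for i in range(n - m + 1) if counter[i] >= threshold]

    # Verification stage (stand-in for the kangaroo method).
    result = []
    for i in candidates:
        if sum(1 for j in range(m) if text[i + j] != pattern[j]) <= k:
            result.append(i + 1)
    return result


P = "abaccacbacabacc"
T = "abbbcccaaaabacb" + P + "abacbacbacabbccca"  # hypothetical text, not from the slides
print(k_mismatch_case1(T, P, 4))
```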

Page 64: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Case 2: x frequent symbols, x < √k

a) Count all matches of frequent symbols: one boolean convolution per symbol.

Time: O(x n log m) = O(√k n log m)

b) For non-frequent symbols, build full masks.

A non-frequent symbol appears < 2√k times in P ⇒ mask size < 2√k

Count time: O(n√k)

Page 65: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

c) Add results of a) & b) and get total number of matches at every text location.

Time:

a) O(n√k log m)

b) O(n√k)

c) O(n)

So, Case 2 is O(n√k log m)

Overall algorithm time: O(n√k log m)

Page 66: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Additional Points

1. O(n√k log k)

For k < m^{1/3} there is a linear time algorithm: O((n + nk³/m) log k)

2. O(n√(k log k))

Better tradeoff:

Define frequent symbol: appears > √(k log k) times

Page 67: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

O((n + nk³/m) log k) time algorithm

Outline:

1. Find 2k special substrings of pattern.

2. Construct forest data structure combining info of special pattern substrings and text.

3. Use local counting arguments and quick queries to forest data structure to prune candidates.

4. Use kangaroo method to check leftover potential candidates.

Page 68: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

k-Mismatches and Matrix Multiplication

“Or-And” matrix multiplication:

A × B = C, c_ij = ⋁_{k=1..n} (a_ik ∧ b_kj)
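For reference, the definition corresponds to the following cubic-time sketch on Boolean matrices stored as lists of lists. The function name and the sample matrices are our own illustration.

```python
def or_and_product(a, b):
    """C[i][j] = OR over k of (A[i][k] AND B[k][j]), by direct evaluation."""
    inner = len(b)
    return [
        [any(a[i][k] and b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


A = [[True, False], [False, False]]
B = [[False, True], [True, True]]
print(or_and_product(A, B))  # [[False, True], [False, False]]
```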

Pattern all-mismatch problem: Find all text locations where the pattern mismatches at every character.

Indyk: If there is an algorithm faster than O(n√m) for the Pattern all-mismatch problem then there is a new method for solving “Or-And” matrix multiplication faster than O(n³)

Page 69: Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

OPEN PROBLEMS

Hamming Distance in time: O(n log m)

Edit Distance?

Other metrics?