Survey: String Matching with k Mismatches
Moshe Lewenstein Bar Ilan University
String Matching with k Mismatches
Landau – Vishkin 1986
Galil – Giancarlo 1986
Abrahamson 1987
Amir - Lewenstein - Porat 2000
Exact String Matching
Input: T = t1 … tn
P = p1 … pm
Output: All locations i of T where P appears
Example:
P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…
Answer: {3,7,11,..}
Problem: Matching not exact in applications of:
• Computational Biology
• Musicology
• Text Editing
• Meteorology
• etc.
Need other definitions of string matching!
Approximate String Matching
Idea: Find all text locations where distance from pattern is sufficiently small.
distance metric: HAMMING DISTANCE
Let S = s1s2…sm
R = r1r2…rm
Ham(S,R) = the number of locations j where sj ≠ rj
Example: S = ABCABC R = ABBAAC
Ham(S,R) = 2
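The definition above can be sketched in a few lines of Python. The function name `hamming` is only an illustrative choice, and the sketch assumes both strings have the same length, as in the definition.

```python
def hamming(s, r):
    """Number of positions j where s[j] != r[j] (strings of equal length)."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


# The example from the slide: S = ABCABC, R = ABBAAC.
print(hamming("ABCABC", "ABBAAC"))
```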
String Matching with Mismatches
Input: T = t1 … tn
P = p1 … pm
Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)
Example:
P = A B B A A C
T = A B C A A B C A C…
Ham(P,T1) = 2
Ham(P,T2) = 4
Ham(P,T3) = 6
Ham(P,T4) = 2
Answer: 2, 4, 6, 2, …
String Matching with k Mismatches
Input: T = t1 … tn, P = p1 … pm
Output: Every i in T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k
Example: k = 2
P = A B B A A C
T = A B C A A B C A C…
Hamming distances: 2, 4, 6, 2, …
Within k mismatches: Y, N, N, Y, …
Naïve Algorithm (for counting mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and Ti (the length-m substring of T starting at i)
Running Time: O(nm), where n = |T|, m = |P|
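A minimal sketch of the naive method, assuming Python strings for T and P. The function names are hypothetical; reported locations are 1-based to agree with the slides.

```python
def mismatch_counts_naive(text, pattern):
    """Hamming distance of the pattern against every text window, O(nm) time."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def k_mismatch_naive(text, pattern, k):
    """1-based text locations where the pattern has at most k mismatches."""
    counts = mismatch_counts_naive(text, pattern)
    return [i + 1 for i, d in enumerate(counts) if d <= k]


# Example from the slides: P = ABBAAC, T = ABCAABCAC, k = 2.
print(mismatch_counts_naive("ABCAABCAC", "ABBAAC"))
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))
```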
The Kangaroo Method (for k-mismatches)
Landau – Vishkin 1986
Galil – Giancarlo 1986
Trie
• A tree representing a set of strings.
[Figure: trie for the set { aeef, ad, bbfe, bbfg, c }]
Trie (Cont)
• Assume no string is a prefix of another
[Figure: the same trie for { aeef, ad, bbfe, bbfg, c }]
Each string corresponds to a leaf.
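A small sketch of a trie as nested dictionaries, using the string set from the figure. The representation (one dict per node, keyed by character) and the function names are assumptions made for illustration, not something the slides prescribe.

```python
def build_trie(words):
    """Insert every word character by character; each node is a dict of children."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Strings spelled by root-to-leaf paths, in lexicographic order."""
    if not node:                      # no children: this node is a leaf
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))
```

Because no string in the set is a prefix of another, every inserted string ends at its own leaf, so the leaves reproduce the set.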
Compressed Trie
• Compress unary nodes, label edges by strings
[Figure: the trie for { aeef, ad, bbfe, bbfg, c } next to its compressed form, with edge labels a, bbf, c, eef, d, e, g]
Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s
Prefix-free: add a special character, say $, at the end of s
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$
{ $ b$ ab$ bab$ abab$ }
[Figure: suffix tree of abab$ with edge labels ab, ab$, $, b, ab$, $, $]
Suffix Tree properties
- Succinct in space - O(n).
- Can be built in O(n) time. McCreight, Weiner, Ukkonen, Farach-Colton
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]
Exact string matching
s = abab$
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]
Given a pattern P = ab we traverse the tree according to the pattern.
Leaves correspond to locations of appearance! For P = ab these are locations 1 and 3.
Prepare Tree: O(n) time
Find matches: O(m + occ) time, occ = # of matches
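The traversal can be illustrated with a deliberately simplified structure: an uncompressed suffix trie built by naive insertion, which takes O(n²) time and space rather than the O(n) of a real suffix tree. The function names and the use of the key `None` to store a leaf's 1-based suffix start are assumptions of this sketch.

```python
def suffix_trie(s):
    """Uncompressed trie of all suffixes of s (s should end with a unique '$')."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1        # leaf label: 1-based start of the suffix
    return root


def occurrences(root, pattern):
    """Walk down along the pattern, then collect every leaf label below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


print(occurrences(suffix_trie("abab$"), "ab"))
```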
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it
Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes
s = abbaab$
[Figure: suffix tree of s = abbaab$ with leaves numbered 1 to 7; the leaves of the suffixes aab$ and abbaab$ are highlighted, and their LCA spells their longest common prefix, a]
LCA/LCP properties
[Figure: the suffix tree from the previous slides]
Preprocessing time: O(n)
Query Time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993
The Kangaroo Method (for k-mismatches)
- Create suffix tree for: s = P#T
- Check P at each location i of T by kangarooing
Example:
P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …
[Figure: P aligned at text location i; each jump skips a maximal matching stretch and lands on the next mismatch]
The Kangaroo Method (for k-mismatches)
Preprocess:
Build suffix tree of both P and T - O(n+m) time
LCA preprocessing - O(n+m) time
Check P at given text location
Kangaroo jump till next mismatch - O(1) per jump, at most k+1 jumps, so O(k) time per location
Overall time: O(nk)
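A sketch of the kangaroo jumps, under stated assumptions. Instead of a suffix tree with constant-time LCA queries, it swaps in a suffix array built by plain sorting, Kasai's LCP array, and a sparse table for range minima. That gives the same O(1) longest-common-prefix queries but slower preprocessing than the O(n+m) quoted above. The names `build_lce` and `k_mismatch` are hypothetical, and `#` is assumed not to occur in P or T.

```python
def build_lce(s):
    """Return lce(i, j): length of the longest common prefix of s[i:] and s[j:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])   # simple, not linear time
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai: lcp[r] is the LCP of the suffixes at sa[r - 1] and sa[r].
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Sparse table: table[b][r] = min(lcp[r : r + 2**b]).
    table, b = [lcp], 1
    while (1 << b) <= n:
        prev, half = table[-1], 1 << (b - 1)
        table.append([min(prev[r], prev[r + half])
                      for r in range(n - (1 << b) + 1)])
        b += 1

    def lce(i, j):
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1                                  # minimum over lcp[lo..hi]
        b = (hi - lo + 1).bit_length() - 1
        return min(table[b][lo], table[b][hi - (1 << b) + 1])

    return lce


def k_mismatch(text, pattern, k):
    """1-based text locations with at most k mismatches, via kangaroo jumps."""
    n, m = len(text), len(pattern)
    lce = build_lce(pattern + "#" + text)        # s = P#T, as on the slide
    hits = []
    for i in range(n - m + 1):
        pos = errors = 0
        while pos < m and errors <= k:
            pos += lce(pos, m + 1 + i + pos)     # jump over the matching stretch
            if pos < m:                          # landed on a mismatch
                errors += 1
                pos += 1
        if errors <= k:
            hits.append(i + 1)
    return hits


print(k_mismatch("ABCAABCAC", "ABBAAC", 2))
```

Each location costs at most k+1 jumps, which is where the O(nk) total comes from once the queries are constant time.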
Boolean Convolutions (FFT) Method
P = a b a c c a c b a c a b a c c
T = a b b b c c c a a a a b a c b ......
a-mask of P (1 where P has an a):
Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0
not-a mask of T (1 where T has a symbol other than a):
Tnot a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1
Multiply Pa and Tnot a to count mismatches (use FFT)
Running Time: One boolean convolution - O(n log m) time
# of matches of all symbols - O(n|Σ| log m) time
Counting Method
Input: Text: T = t1…tn
Pattern: P = p1…pm
Max # of allowed mismatches: k
Assumption: Each pattern element is distinct
P = a b c d e f g h
T = b g d e f h d c c a b g h h ...
Count matches (instead of mismatches)
[Figure: each text symbol increments the counter of the single text location at which P would match that symbol]
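A sketch of the counting method under the slide's assumption that every pattern symbol is distinct. Each text symbol can then match the pattern at only one alignment, so a single pass over the text fills all counters in O(n) time. The function name is hypothetical; `counter[i]` holds the number of matches at 1-based text location i + 1.

```python
def match_counts_distinct(text, pattern):
    """Matches per text location when all pattern symbols are distinct."""
    n, m = len(text), len(pattern)
    where = {ch: j for j, ch in enumerate(pattern)}   # symbol -> its position in P
    if len(where) != m:
        raise ValueError("pattern symbols must be distinct")
    counter = [0] * (n - m + 1)
    for t, ch in enumerate(text):
        j = where.get(ch)
        # text[t] matches pattern[j] only for the window that starts at t - j
        if j is not None and 0 <= t - j <= n - m:
            counter[t - j] += 1
    return counter


counts = match_counts_distinct("bgdefhdccabghh", "abcdefgh")
print(counts)
```

A location i is a k-mismatch occurrence exactly when m minus its counter is at most k.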
O(n√k log m) Algorithm
Frequent Symbol: a symbol that appears at least 2√k times in P.
We distinguish between two cases:
Case 1: At least √k frequent symbols.
Case 2: Less than √k frequent symbols.
Case 1: At least √k frequent symbols.
- Consider first √k frequent symbols.
- For each of them construct a mask for first 2√k appearances.
Example of Masked Counting
k = 4, 2√k = 4
P = a b a c c a c b a c a b a c c
a-mask: the first 4 appearances of a in P
c-mask: the first 4 appearances of c in P
T = a b b b c c c a a a a b a c b ......
[Figure: the a-mask is aligned with an a of the text (use a-mask) and the counters of the corresponding text locations are incremented]
Counting Stage:
Run through text and count occurrences of all marks.
Time: O(n√k).
For location i of T, if counter i < k then no match at location i.
Why? The total # of elements in all masks is 2√k · √k = 2k.
Important Observations:
1) Sum of all counters ≤ 2√k · n
2) Every counter whose value is less than k already has more than k errors.
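A sketch of the Case 1 counting stage, with two labeled assumptions: √k and 2√k are rounded up to integers, and the pruning threshold is written as (total mask elements) minus k, which equals k when the masks hold exactly 2k elements as on the slide. The function name is hypothetical, and it returns `None` when the pattern falls under Case 2.

```python
import math
from collections import Counter


def masked_candidates(text, pattern, k):
    """1-based locations that survive masked counting, or None in Case 2."""
    n, m = len(text), len(pattern)
    per_mask = math.ceil(2 * math.sqrt(k))    # appearances kept per mask
    num_masks = math.ceil(math.sqrt(k))       # frequent symbols used
    freq = Counter(pattern)
    frequent = [c for c in dict.fromkeys(pattern) if freq[c] >= per_mask]
    if len(frequent) < num_masks:
        return None                           # fewer frequent symbols: Case 2
    masks = {}
    for c in frequent[:num_masks]:
        positions = [j for j, ch in enumerate(pattern) if ch == c]
        masks[c] = positions[:per_mask]       # first appearances only
    counter = [0] * (n - m + 1)
    for t, ch in enumerate(text):
        for j in masks.get(ch, ()):           # at most per_mask increments
            if 0 <= t - j <= n - m:
                counter[t - j] += 1
    total = num_masks * per_mask              # at least 2k mask elements
    # At most k of the masked positions may mismatch at a true occurrence.
    return [i + 1 for i in range(n - m + 1) if counter[i] >= total - k]


pattern = "abaccacbacabacc"
text = "abbbcccaaaabacb" + pattern + "abbbcccaaaabacb"
print(masked_candidates(text, pattern, 4))
```

The surviving locations are only candidates; each still has to be verified, for example with kangaroo jumps.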
How many locations remain?
Sum of all counters: ≤ 2n√k
Value of potential matches: ≥ k
# of potential matches: ≤ 2n√k / k = 2n/√k
How do we check these locations?
Use the Kangaroo Method.
Kangaroo Method Time: O(k) per location
Overall Time: O((n/√k) · k) = O(n√k)
Case 2: x frequent symbols, x < √k
a) Count all matches of frequent symbols - one boolean convolution per symbol.
Time: O(x n log m) = O(n√k log m)
b) For non-frequent symbols, build full masks.
Symbol non-frequent ⇒ appears < 2√k times in P ⇒ mask size < 2√k
Count time: O(n√k)
c) Add results of a) & b) and get total number of matches at every text location.
Time: a) O(n√k log m) b) O(n√k) c) O(n)
So, Case 2 is O(n√k log m)
Overall Algo. Time: O(n√k log m)
Additional Points
1. O(n√k log k)
2. O(n√(k log k))
Better tradeoff:
Define frequent symbol > √(k log k)
For k < m^(1/3) there is a linear time algorithm - O((n + nk³/m) log k)
O((n + nk³/m) log k) time algorithm
Outline:
1. Find 2k special substrings of pattern.
2. Construct forest data structure combining info of special pattern substrings and text.
3. Use local counting arguments and quick queries to forest data structure to prune candidates.
4. Use kangaroo method to check leftover potential candidates.
k-Mismatches and Matrix Multiplication
“Or-And” matrix multiplication:
AxB = C, cij = ∨ (k = 1..n) (aik ∧ bkj)
Pattern all-mismatch problem: Find all text locations where the pattern mismatches at every character.
Indyk: If there is an algorithm faster than O(n√m) for the Pattern all-mismatch problem then there is a new method for solving “Or-And” matrix multiplication faster than O(n³)
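For reference, the "Or-And" product defined above can be written directly from its definition. This is only the cubic-time schoolbook computation, with a hypothetical function name and 0/1 lists of lists assumed as the matrix representation; it does not show the reduction to the pattern all-mismatch problem.

```python
def or_and_product(a, b):
    """c[i][j] = OR over k of (a[i][k] AND b[k][j]) for 0/1 matrices."""
    inner = len(b)
    return [
        [int(any(a[i][k] and b[k][j] for k in range(inner)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


print(or_and_product([[1, 0], [0, 0]], [[0, 1], [1, 1]]))
```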
OPEN PROBLEMS
Hamming Distance in time: O(n log m)
Edit Distance?
Other metrics?