CS5263 Bioinformatics
Lecture 17
Exact String Matching Algorithms
Boyer – Moore algorithm
• Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule
Boyer – Moore algorithm
• Right to left comparison
Skip some chars without missing any occurrence.
Extended bad character rule
char   Positions in P
a      6, 3
b      7, 4
p      2
t      1
x      5

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab   (after the shift)

Mismatch at i = 5 in P (x against T(k) = a). Find the occurrence of T(k) in P that is closest to the left of i (position 3), and shift P to align it with T(k).
i = 5, so the shift is 5 – 3 = 2.
Preprocessing: O(n).
Restart the comparison at the right end of P.
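The table lookup above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code; the function names `bad_char_table` and `bad_char_shift` are made up for the example, and positions are 1-based as on the slide.

```python
from collections import defaultdict

def bad_char_table(pattern):
    """Map each char to its 1-based positions in pattern, rightmost first."""
    table = defaultdict(list)
    for pos in range(len(pattern), 0, -1):
        table[pattern[pos - 1]].append(pos)
    return table

def bad_char_shift(table, i, text_char):
    """Shift for a mismatch at 1-based pattern position i against text_char."""
    for pos in table.get(text_char, []):
        if pos < i:
            return i - pos   # align the closest occurrence left of i with T(k)
    return i                 # no occurrence left of i: move P past T(k)

table = bad_char_table("tpabxab")
print(bad_char_shift(table, 5, "a"))   # → 2, as in the example above
```

Scanning the position list makes a lookup proportional to the number of occurrences skipped, which is the price of the extended rule over the simple one.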
(Strong) good suffix rule
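In preprocessing: for any suffix t of P, find the rightmost copy t’ of t in P such that the char to the left of t’ ≠ the char to the left of t.
[Figure: T contains x followed by t; P contains z t’ and, further right, y t, with z ≠ y. After the shift, z t’ in P is aligned with x t in T.]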
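A brute-force sketch of this preprocessing step, for illustration only: it is cubic in the pattern length, whereas the lecture later obtains the same values in linear time from Z-values. The function name `strong_good_suffix` is hypothetical, and a copy that starts at position 1 (no char to its left) is assumed to qualify.

```python
def strong_good_suffix(pattern):
    """For each 1-based i >= 2, give the end position of the rightmost copy t'
    of the suffix t = P[i..n] (other than t itself) whose left neighbour
    differs from P[i-1]; 0 if there is no such copy."""
    n = len(pattern)
    ends = {}
    for i in range(2, n + 1):
        t = pattern[i - 1:]
        ends[i] = 0
        for end in range(n - 1, len(t) - 1, -1):   # rightmost candidates first
            start = end - len(t) + 1               # 1-based start of the candidate
            same_text = pattern[start - 1:end] == t
            left_differs = start == 1 or pattern[start - 2] != pattern[i - 2]
            if same_text and left_differs:
                ends[i] = end
                break
    return ends

ends = strong_good_suffix("abcab")
print(len("abcab") - ends[4])   # → 3: shift when the suffix "ab" matched and "c" did not
```

When `ends[i]` is 0 the full rule falls back to the longest prefix of P that matches a suffix of t; that case is left out of this sketch.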
Example preprocessing
qcabdabdab
Bad char rule (where to shift depends on T):
char   Positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1

Good suffix rule (does not depend on T):
q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
0 0 0 0 2 0 0 2 0 0

dabcab
dabdabcabdab
Tricky case
Pattern: abcab
a b c a b
0 0 0 1 0
* ^ ^
T: x y a a b c a b
shift = 4 – 1 = 3
a b c a b
N N 0 N N
c
b
c
b
i-L
Example preprocessing
qcabdabdab
Bad char rule (where to shift depends on T):
char   Positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1

Good suffix rule (does not depend on T):
q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 3 0 0 3 0

dabcab
dabdabcabdab
Example preprocessing
qcabdabdab
Bad char rule (where to shift depends on T):
char   Positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1

Good suffix rule (does not depend on T):
q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
N N N N 2 N N 2 N N

dabcab
dabdabcabdab
Algorithm KMP: Basic idea
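In pre-processing: for each position i in P, find the longest proper suffix t of P[1..i] that matches a prefix t’ of P, such that the chars immediately following t and t’ (y and z in the figure) differ.
For each i, let Sp’(i) = length(t).
[Figure: T contains t followed by x; P contains the prefix t’ and the suffix t ending at position i. After a mismatch against x, P is shifted so that t’ lines up with the copy of t in T.]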
Failure link
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If a char in T fails to match at pos 6 of P, re-compare it with the char at pos 3 of P (Sp’(5) = 2, so the first two chars of P are already known to match).
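The failure-link idea can be sketched as follows. The `sp_prime` helper is a deliberately naive (cubic) computation of Sp’, written only to mirror the definition; the names `sp_prime` and `kmp_search` are illustrative and not taken from the slides.

```python
def sp_prime(pattern):
    """Brute-force Sp'(i); result[i-1] holds the value for 1-based position i."""
    n = len(pattern)
    result = [0] * n
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):              # longest proper suffix first
            if pattern[i - length:i] == pattern[:length]:
                # chars after t and t' must differ (no constraint at the end of P)
                if i == n or pattern[i] != pattern[length]:
                    result[i - 1] = length
                    break
    return result

def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    sp = sp_prime(pattern)
    n = len(pattern)
    hits, j = [], 0                  # j = number of pattern chars currently matched
    for pos, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = sp[j - 1]            # follow the failure link, re-compare the same char
        if ch == pattern[j]:
            j += 1
        if j == n:
            hits.append(pos - n + 2)
            j = sp[n - 1]
    return hits

print(sp_prime("aataac"))                  # → [0, 1, 0, 0, 2, 0]
print(kmp_search("aataataac", "aataac"))   # → [4]
```

In the second call the "t" at text position 6 fails against pos 6 of P and is re-compared with pos 3, exactly the step described above.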
FSA
P: aataac
[Figure: states 0 to 6 in a chain, with transitions labeled a, a, t, a, a, c, plus extra transitions labeled a and t.]
All other input goes to state 0
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If the next char in T is t, we go to state 3
Tricky case
Pattern: abcab
a b c a b
0 0 0 0 2
a b bc a
c
Failure link
FSA
dummy
How to actually do pre-processing?
• Similar pre-processing for KMP and B-M
– Find matches between a suffix and a prefix
– Both can be done in linear time
– P is usually short, so even a more expensive pre-processing may result in a gain overall
[Figure: for KMP, a prefix t’ of P and its copy t ending at position i; for B-M, a suffix t of P and its earlier copy t’ ending at position j.]
For each i, find a j. Similar to DP. Start from i = 2.
Fundamental pre-processing
• Zi: length of the longest substring starting at i that matches a prefix of P
– i.e. t = t’, x ≠ y, Zi = |t|
– With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.
P: a a b c a a b x a a z
Z: 0 1 0 0 3 1 0 0 2 1 0
[Figure: the prefix t’ = P[1..Zi] followed by x, and its copy t = P[i..i+Zi-1] followed by y.]
• How to compute Z-values in linear time?
Computing Z in Linear time
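We have already computed all Z-values up to k-1 and need to compute Zk. We also know the starting and ending points, l and r, of the previous match (the one that reaches furthest to the right).
[Figure: P with the prefix t’ = P[1..r-l+1] and its copy t = P[l..r]; k lies between l and r, and k-l+1 is the corresponding position inside t’.]
We know that t = t’, therefore the Z-value at k-l+1 may be helpful to us.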
Computing Z in Linear time
• No char inside the box is compared twice. At most one mismatch per iteration.
• Therefore, O(n).
Case 1: The previous r is smaller than k, i.e. no previous match extends beyond k. Do explicit comparison starting at k.
Case 2: Zk-l+1 < r-k+1. Zk = Zk-l+1. No comparison is needed.
Case 3: Zk-l+1 >= r-k+1. Zk >= r-k+1. Comparison starts from r+1.
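The three cases can be sketched as below. This is a minimal version under two stated assumptions: indices are 1-based as on the slides (the string is padded with one unused char), and equality of Zk-l+1 and r-k+1 is handled together with Case 3, because the match may then extend past r. The function name `z_values` is illustrative.

```python
def z_values(p):
    """Return [Z1, ..., Zn] for pattern p, with Z1 = 0 by convention."""
    n = len(p)
    s = " " + p                  # pad so that s[1] is the first char of p
    z = [0] * (n + 1)
    l = r = 0                    # start and end of the match reaching furthest right
    for k in range(2, n + 1):
        if k > r:                # Case 1: explicit comparison from k
            length = 0
            while k + length <= n and s[1 + length] == s[k + length]:
                length += 1
            z[k] = length
            if length > 0:
                l, r = k, k + length - 1
        elif z[k - l + 1] < r - k + 1:   # Case 2: copy the earlier value
            z[k] = z[k - l + 1]
        else:                    # Case 3: P[k..r] matches, extend from r + 1
            q = r + 1
            while q <= n and s[q] == s[q - k + 1]:
                q += 1
            z[k] = q - k
            l, r = k, q - 1
    return z[1:]

print(z_values("aabcaabxaaz"))   # → [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Every explicit comparison that succeeds moves r to the right, and each iteration has at most one failed comparison, which gives the O(n) bound.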
Z-preprocessing for B-M and KMP
• Both KMP and B-M preprocessing can be done in O(n)
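[Figure: Zj is the length of the substring t = P[j..j+Zj-1] that matches the prefix t’ = P[1..Zj]; with i = j+Zj-1, KMP sees t as a suffix of P[1..i].]
For each j from n down to 2 with Zj > 0: sp’(j+Zj-1) = Zj. Going downward lets the smallest j, i.e. the longest match, win.
For B-M, use Z backwards (compute the Z-values on the reversed pattern).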
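The mapping from Z-values to Sp’ for KMP can be sketched as follows. To keep the block self-contained the Z-values come from a simple quadratic helper (`naive_z`, a hypothetical name); any linear-time Z computation could be substituted.

```python
def naive_z(p):
    """Quadratic Z-values, 0-based: z[k] is the match length at index k, z[0] = 0."""
    z = [0] * len(p)
    for k in range(1, len(p)):
        while k + z[k] < len(p) and p[z[k]] == p[k + z[k]]:
            z[k] += 1
    return z

def sp_prime_from_z(z):
    """z[j-1] holds Zj; returns a list with Sp'(i) at index i-1."""
    n = len(z)
    sp = [0] * n
    for j in range(n, 1, -1):        # j = n down to 2, so the smallest j is written last
        if z[j - 1] > 0:
            i = j + z[j - 1] - 1     # the match starting at j ends at i
            sp[i - 1] = z[j - 1]
    return sp

print(sp_prime_from_z(naive_z("aataac")))   # → [0, 1, 0, 0, 2, 0]
```

The condition that the chars after t and t’ differ comes for free: a Z-match ends precisely where the next chars disagree or P runs out.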
Keyword tree for spell checking
• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of word
• Common prefixes only need to be compared once.
[Figure: keyword tree for the words potato, poetry, pottery, science, school, with leaves numbered 1 to 5.]
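A keyword tree can be sketched with nested dictionaries. The word list below is an assumption, read off the figure residue for this slide, and the names `build_keyword_tree` and `lookup` are illustrative only.

```python
def build_keyword_tree(words):
    """Insert each word char by char; shared prefixes share nodes."""
    root = {}
    for number, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = number           # mark the end of a word with its index
    return root

def lookup(root, word):
    """Return the word's index if it is in the tree, else None. O(len(word))."""
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

# Assumed word list, for illustration only.
tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(lookup(tree, "pottery"))   # → 3
print(lookup(tree, "pot"))       # → None (a prefix, not a stored word)
```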
Aho-Corasick algorithm
• Generalizing KMP
• Create failure links
• Basis of the fgrep algorithm
• Given the following patterns:
– potato
– tattoo
– theater
– other
Failure link
[Figure: keyword tree for potato, tattoo, theater, other with failure links; nodes numbered 0 to 4.]
potterisapersonwhomakespottery
O(n) preprocessing and O(m+k) searching, where n is the total length of the patterns, m is the length of T, and k is the number of occurrences.
Can create a FSA similarly. Requires more space, and preprocessing time depends on alphabet size.
A problem with failure link
• Patterns: {potato, other, pot}
[Figure: keyword tree for potato, other, pot with failure links; nodes numbered 0 to 3.]
A problem with failure link for multiple patterns
• Patterns: {potato, other, pot, the, he, era}
[Figure: keyword tree for potato, other, pot, the, he, era with failure links; nodes numbered 0 to 5.]
T: potherarac
Output link
• Patterns: {potato, other, pot, the}
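[Figure: keyword tree for potato, other, pot, the with failure links and output links; nodes numbered 0 to 5.]
T: potherarac
Failure link: taken when a mismatch occurs.
Output link: always taken when present, to report patterns that end at the current position; the search then returns to the current node.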
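A compact sketch of the search with failure links. One simplification is assumed here: instead of storing explicit output links, each node keeps the merged list of patterns reachable through its failure chain, which plays the same role for small pattern sets. The function names are illustrative.

```python
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]            # node 0 is the root
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())              # depth-1 nodes fail to the root
    while queue:                                 # breadth-first, shallow nodes first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # stands in for the output link
            queue.append(nxt)
    return goto, fail, out

def search(text, patterns):
    """Return (1-based start, pattern) pairs in the order they are found."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]                  # mismatch: follow the failure link
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits

print(search("potherarac", ["potato", "other", "pot", "the"]))
# → [(1, 'pot'), (3, 'the'), (2, 'other')]
```

Note that "the" is reported while the search is still inside "other"; without the merged output list that occurrence would be missed.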
Suffix Tree
• All algorithms we talked about so far preprocess pattern(s)
– Karp-Rabin: small pattern, small alphabet
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T
– Fixed T, varying P
• Suffix tree: basically a keyword tree of all suffixes
Suffix tree
• T: xabxac
• Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c
[Figure: suffix tree of xabxac with leaves 1 to 6; internal nodes at the branching paths xa and a.]
Naïve construction: O(m^2), inserting the suffixes one by one as in an Aho-Corasick keyword tree.
Smarter: O(m). Very technical, with a big constant factor.
Create an internal node only when there is a branch
Suffix tree implementation
• Explicitly labeling seq end
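• T: xabxa becomes T: xabxa$
[Figure: left, the suffix tree of xabxa, where only suffixes 1 to 3 end at leaves; right, the suffix tree of xabxa$, where every suffix 1 to 5 ends at its own leaf.]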
Suffix tree implementation
• Implicitly labeling edges
• T: xabxa$
[Figure: the suffix tree of xabxa$ drawn twice; in the second copy each edge label is replaced by a pair of positions in T, such as 1:2, 2:2 and 3:$.]
Suffix links
• Similar to failure link in a keyword tree
• Only link internal nodes having branches
[Figure: a suffix link connects the internal node whose path label is xα to the internal node whose path label is α.]
Suffix tree construction
T: acatgacatt...
   1234567890...
[Figure sequence: the suffix tree of T is built one suffix at a time, creating leaves 1 to 10. Leaf edges carry open-ended labels such as 1:$, 2:$, 4:$ and 5:$, and internal nodes are created for shared paths such as a, ca, t and cat.]
• With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-comparing “cat”
ST Application: pattern matching
• Find all occurrences of P=xa in T
– Find node v in the ST that matches P
– Traverse the subtree rooted at v to get the locations
T: xabxac
[Figure: suffix tree of xabxac with leaves 1 to 6; the node v for xa has leaves 1 and 4 below it.]
O(m) to construct ST (large constant factor)
O(n) to find v – linear in the length of P instead of T!
O(k) to get all leaves, where k is the number of occurrences.
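The two steps (find v, then collect the leaves below v) can be illustrated on a naive suffix trie. This sketch is an uncompressed trie built in O(m^2), not the O(m) suffix tree from the lecture, so collecting leaves is not O(k) here; the function names and the "leaf" key are made up for the example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1           # 1-based start of this suffix
    return root

def occurrences(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:                     # step 1: walk down along P to find v
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:                           # step 2: collect every leaf below v
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)

trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))   # → [1, 4]
```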
ST application: repeats finding
• Genome contains many repeated DNA sequences
• Repeat sequence length: varies from 1 nucleotide to a whole gene
– Highly repetitive DNA in some non-coding regions
  • 6 to 10bp x 100,000 to 1,000,000 times
– Genes may have multiple copies (50 to 10,000)
Find longest repeated substring
• Do a tree traversal, compute the lengths of labels at each node
• O(m)
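[Figure: internal nodes annotated with the lengths of their path labels, e.g. L = 4, L = 8, L = 9; the deepest internal node gives the longest repeated substring.]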
Repeats finding
• Find all repeats that are at least k residues long and appear at least p times in the seq
– Phase 1: top-down, count lengths of labels at each node
– Phase 2: bottom-up, count # of leaves descended from each internal node
• Each node is annotated with (L, N)
• For each node with L >= k and N >= p, print all leaves
• O(m) to traverse tree
Repeats finding
• Find repeats with at least 3 bases and 2 occurrences
– cat
– acat
– aca
T: acatgacatt
   1234567890
[Figure: suffix tree of acatgacatt$ with leaves 1 to 10.]
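The (L, N) bookkeeping can be sketched on a naive suffix trie, where L is simply the depth of a node and N the number of leaves below it. This is an O(m^2) illustration rather than the O(m) suffix-tree traversal, and `find_repeats` is a hypothetical name.

```python
def find_repeats(text, k, p):
    """Return all substrings of length >= k that occur at least p times."""
    root = {}
    for start in range(len(text)):         # naive suffix trie of text + '$'
        node = root
        for ch in text[start:] + "$":
            node = node.setdefault(ch, {})
    repeats = []

    def count_leaves(node, label):
        if not node:                       # an empty dict is a leaf (after '$')
            return 1
        n = sum(count_leaves(child, label + ch) for ch, child in node.items())
        if len(label) >= k and n >= p:     # L >= k and N >= p
            repeats.append(label)
        return n

    count_leaves(root, "")
    return sorted(repeats)

print(find_repeats("acatgacatt", 3, 2))   # → ['aca', 'acat', 'cat']
```

The recursion depth equals the text length plus one, so this form is only suitable for short inputs.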
Repeats finding
1. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j]
2. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i+k+1] != S[j+k+1]
3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j], and S[i+k+1] != S[j+k+1]
Example: acatgacatt
1. aca
2. cat
3. acat
Repeats finding
• How to find maximal repeats?
– A right-maximal repeat whose occurrences have different left chars
T: acatgacatt
   1234567890
[Figure: suffix tree of acatgacatt$ with leaves 1 to 10, each leaf annotated with its left char.]
Left char = [] g c c a a
ST application: word enumeration
• Find all k-mers that occur at least p times
– Compute (L, N) for each node
– Find nodes v with L>=k, and L(parent)<k, and N>=p
– Traverse sub-tree rooted at v to get the locations
[Figure: a parent node with L<k above a node v with L>=k and N>=p; the k-mer ends at depth L = k on the edge into v.]
This can be used in many applications. For example, to find words that appear frequently in a genome or a document.
Joint Suffix Tree
• Build a ST for two or more strings
• Two strings S1 and S2
• S* = S1 & S2
• Build a suffix tree for S* in time O(|S1| + |S2|)
• The separator will only appear in the edge ending in a leaf
• S1 = abcd
• S2 = abca
• S* = abcd&abca$
[Figure: suffix tree of S*; each leaf is labeled (string, position): 1,1 1,2 1,3 1,4 and 2,1 2,2 2,3 2,4.]
To Simplify
• We don’t really need to do anything, since all edge labels were implicit.
• The right hand side is more convenient to look at
[Figure: left, the suffix tree of S* with leaf edges such as d&abca, where the part from & onward is marked useless; right, the same tree with leaf edges cut at the separator, leaves labeled 1,1 to 2,4.]
Application of JST
• Longest common substring
– For each internal node v, keep a bit vector B[2]
– B[1] = 1 if a leaf below v is a suffix of S1 (B[2] likewise for S2)
– Find all internal nodes with B[1] = B[2] = 1
– Report one with the longest label
– Can be extended to k sequences. Just use a longer bit vector.
[Figure: joint suffix tree of S1 = abcd and S2 = abca with leaves 1,1 to 2,4.]
O(m), m the total seq length
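The bit-vector idea can be sketched on a naive joint suffix trie. Two simplifications are assumed: the trie is uncompressed (O(m^2) rather than O(m)), and the bits are set on every node a suffix passes through instead of being propagated up from the leaves, which yields the same vectors. The function name is illustrative.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring shared by s1 and s2 ('' if there is none)."""
    root = {"bits": 0}
    for index, s in enumerate((s1, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {"bits": 0})
                node["bits"] |= 1 << index      # a suffix of string `index` passes here
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node.items():
            if ch != "bits" and child["bits"] == 0b11:   # B[1] = B[2] = 1
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best

print(longest_common_substring("abcd", "abca"))   # → abc
```

Extending this to K strings only needs a wider bit mask and a test on the number of set bits.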
Application of JST
• Given K strings, find all substrings with L>=l, that appear in at least d strings
• Exact motif finding problem
• Build a joint suffix tree with all strings
S* = S1 & S2 % S3 * S4 @ S5 ! S6 + S7
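– Use a unique end char for each string
– Not really necessary if caution is taken in construction
[Figure: a node with L >= k whose children have B = 1010 and B = 0011, so its own B = 1010 | 0011 = 1011 and |B| = 3; its parent has L < k. Leaves are labeled 1,x 3,x 3,x 4,x.]
O(mK), m the total seq length. The factor K is the cost of a “bitwise or” of two bit vectors.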
Many other applications
• Reproduce the behavior of Aho-Corasick
• DNA finger printing
– A database of people’s DNA sequence
– Given a short DNA, which person is it from?
• Recognizing DNA contamination
• Indexing sequence databases
• …
• Catch
– Large constant factor for space requirement (15-40 bytes per base for DNA)
– Large constant factor for construction
– Suffix array: trade off time for space
Summary
• One T, one P
– Boyer-Moore is the choice
– KMP works but not the best
• One T, many P
– Aho-Corasick
– Suffix Tree
• One fixed T, many varying P
– Suffix tree
• Two or more T’s
– Suffix tree, joint suffix tree, suffix array
Alphabet independent
Alphabet dependent