8/2/2019 Erice2005
Suffix tree and suffix array techniques for pattern analysis in strings
Esko Ukkonen
Univ Helsinki
Erice School 30 Oct 2005
Algorithms for combinatorial string matching?
deep beauty? +-
shallow beauty? +
applications? ++
intensive algorithmic miniatures
sources of new problems: text processing, DNA, music, …
Analysis of a string of symbols
T = h a t t i v a t t i (text)
P = a t t (pattern)
Find the occurrences of P in T: h a t t i v a t t i
Pattern synthesis: #(t) = 4, #(atti) = 2, #(t****t) = 2
Pattern finding & synthesis problems
T = t_1 t_2 … t_n, P = p_1 p_2 … p_n, strings of symbols in a finite alphabet
Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast (static text, any given pattern P)
Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
The suffix tree Tree(T) of T
data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
linear size: |Tree(T)| = O(|T|)
can be constructed in linear time O(|T|)
has myriad virtues (A. Apostolico)
is well-known: 366 000 Google hits
Suffix trie and suffix tree
[Figure: Trie(abaab) and Tree(abaab) drawn side by side]
Trie(T) can be large
|Trie(T)| = O(|T|^2)
bad example: T = a^n b^n
Trie(T) can be seen as a DFA: language accepted = the suffixes of T
minimize the DFA => directed acyclic word graph (DAWG)
Tree(T) is of linear size
only the internal branching nodes and the leaves represented explicitly
edges labeled by substrings of T
v = node(α) if the path from root to v spells α
one-to-one correspondence of leaves and suffixes
|T| leaves, hence < |T| internal nodes
|Tree(T)| = O(|T| + size(edge labels))
Tree(hattivatti)
Suffixes: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i
[Figure: suffix tree of hattivatti with edges labeled by substrings]
Tree(hattivatti)
[Figure: the same suffix tree of hattivatti]
substring labels of edges represented as pairs of pointers
Tree(hattivatti)
[Figure: the same suffix tree with edge labels given as pointer pairs such as 6,10 and 2,5 and 4,5 and 3,3; leaves numbered 1 to 10 by suffix start position]
Tree(T) is a full-text index
[Figure: the path for P in Tree(T) ends above a subtree with leaves 31, 8, …]
P occurs in T at locations 8, 31, …
P occurs in T <=> P is a prefix of some suffix of T <=> Path for P exists in Tree(T)
All occurrences of P in time O(|P| + #occ)
Find att from Tree(hattivatti)
[Figure: the path spelling att in Tree(hattivatti) ends above the leaves 2 and 7]
Linear time construction of Tree(T)
[Figure: the suffixes of hattivatti]
Weiner (1973): "algorithm of the year"
McCreight (1976)
on-line algorithm (Ukkonen 1992)
On-line construction of Trie(T)
T = t_1 t_2 … t_n $
P_i = t_1 t_2 … t_i, the i:th prefix of T
on-line idea: update Trie(P_i) to Trie(P_{i+1}) => very simple construction
Trie(abaab)
[Figure: Trie(a), Trie(ab), Trie(aba)]
chain of links connects the end points of current suffixes
Trie(abaab)
[Figure: Trie(abaa)]
Add next symbol = b
From here on b-arc already exists
What happens in Trie(P_i) => Trie(P_{i+1})?
[Figure: the suffix link chain before and after the update, with new nodes, new a_i-arcs and new suffix links]
From here on the a_i-arc exists already => stop updating here
What happens in Trie(P_i) => Trie(P_{i+1})?
time: O(size of Trie(T))
suffix links: slink(node(aα)) = node(α)
On-line procedure for suffix trie
1. Create Trie(t_1): nodes root and v, an arc son(root, t_1) = v, and suffix links slink(v) := root and slink(root) := root
2. for i := 2 to n do begin
3.   v_{i-1} := leaf of Trie(t_1…t_{i-1}) for string t_1…t_{i-1} (i.e., the deepest leaf)
4.   v := v_{i-1}; v' := 0
5.   while node v has no outgoing arc for t_i do begin
6.     Create a new node v'' and an arc son(v, t_i) = v''
7.     if v' ≠ 0 then slink(v') := v''
8.     v := slink(v); v' := v'' end
9.   for the node v'' such that v'' = son(v, t_i) do
       if v'' = v' then slink(v') := root else slink(v') := v'' end
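The on-line trie procedure above can be sketched in Python roughly as follows. This is a minimal illustration, not the slides' own code: the names TrieNode, build_suffix_trie, trie_size and occurs are hypothetical, the construction starts from an empty trie rather than from Trie(t_1), and no end marker $ is appended.

```python
class TrieNode:
    """Node of the suffix trie; slink is the suffix link."""
    def __init__(self):
        self.children = {}
        self.slink = None


def build_suffix_trie(text):
    """Build Trie(text) on-line, one symbol at a time, using suffix links."""
    root = TrieNode()
    root.slink = root                 # slink(root) := root, as on the slide
    deepest = root                    # node for the whole prefix read so far
    for ch in text:
        v = deepest
        prev_new = None               # plays the role of v' in the pseudocode
        while ch not in v.children:   # follow the suffix link chain
            new = TrieNode()
            v.children[ch] = new
            if prev_new is not None:
                prev_new.slink = new
            prev_new = new
            v = v.slink
        if prev_new is not None:      # link the last node created in this round
            target = v.children[ch]
            prev_new.slink = root if target is prev_new else target
        deepest = deepest.children[ch]
    return root


def trie_size(root):
    """Number of nodes, root included."""
    count, stack = 0, [root]
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(node.children.values())
    return count


def occurs(root, pattern):
    """True if pattern is a substring of the indexed text."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True
```

For a string of the form a^n b^n the node count grows as (n+1)^2, which illustrates why the trie itself is not a linear-size index.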
Suffix trees on-line
compacted version of the on-line trie construction: simulate the construction on the linear size tree instead of the trie => time O(|T|)
all trie nodes are conceptually still needed => implicit and real nodes
Implicit and real nodes
Pair (v, α) is an implicit node in Tree(T) if v is a node of Tree and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v, α) is a real node (= v).
Let v = node(α') in Tree(T). Then implicit node (v, α) represents node(α'α) of Trie(T)
Implicit node
[Figure: implicit node (v, α) located inside an arc that leaves the real node v]
Suffix links and open arcs
[Figure: root, a node v reached via symbol a, its suffix link slink(v), and a leaf w at the end of an open arc]
label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T
Big picture
suffix link path traversed: total work O(n)
new arcs and nodes created: total work O(size(Tree(T)))
On-line procedure for suffix tree
Input: string T = t_1 t_2 … t_n $
Output: Tree(T)
Notation: son(v, α) = w iff there is an arc from v to w with label α
son(v, ε) = v
Function Canonize(v, α):
  while son(v, α') ≠ 0 where α = α'α'', |α'| > 0 do
    v := son(v, α'); α := α''
  return (v, α)
Suffix-tree on-line: main procedure
Create Tree(t_1); slink(root) := root
(v, α) := (root, ε) /* (v, α) is the start node */
for i := 2 to n+1 do
  v' := 0
  while there is no arc from v with label prefix α t_i do
    if α ≠ ε then /* divide the arc w = son(v, αη) into two */
      son(v, α) := v''; son(v'', t_i) := v'''; son(v'', η) := w
    else
      son(v, t_i) := v'''; v'' := v
    if v' ≠ 0 then slink(v') := v''
    v' := v''; v := slink(v); (v, α) := Canonize(v, α)
  if v' ≠ 0 then slink(v') := v
  (v, α) := Canonize(v, α t_i) /* (v, α) = start node of the next round */
The actual time and space
|Tree(T)| is about 20|T| in practice
brute-force construction is O(|T| log |T|) for random strings as the average depth of internal nodes is O(log |T|)
difference between linear and brute-force constructions not necessarily large (Giegerich & Kurtz)
truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003)
alphabet independent linear time (Farach 1997)
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
Suffix array: example
suffix array = lexicographic order of the suffixes
Suffixes in text order (start positions 1 to 10): hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i

SA   suffix
11   (empty suffix)
7    atti
2    attivatti
1    hattivatti
10   i
5    ivatti
9    ti
4    tivatti
8    tti
3    ttivatti
6    vatti
Suffix array
suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
space requirement: 5|T|
practitioners like suffix arrays (simplicity, space efficiency)
theoreticians like suffix trees (explicit structure)
Pattern search from suffix array
[Figure: the suffixes of hattivatti in text order and in sorted order with their start positions 11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
att: binary search
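The binary search over the sorted suffixes can be sketched as below. This is only an illustration under stated assumptions: build_suffix_array and find_occurrences are hypothetical names, the array is built by naive sorting of the suffixes (not by any of the linear time constructions), positions are 0-based, and the empty suffix is left out.

```python
def build_suffix_array(text):
    """Naive construction: sort the start positions by the suffixes they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Start positions (0-based, sorted) of pattern, by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

With text hattivatti and pattern att the search narrows to the two neighbouring suffixes atti and attivatti, i.e. start positions 7 and 2 in the 1-based numbering of the slides.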
Recent suffix array constructions
Manber & Myers (1990): O(|T| log |T|)
linear time via suffix tree
January / June 2003: direct linear time construction of suffix array
- Kim, Sim, Park, Park (CPM03)
- Kärkkäinen & Sanders (ICALP03)
- Ko & Aluru (CPM03)
Kärkkäinen-Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.
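The three steps can be sketched in Python as follows. This is a simplified illustration and not the published implementation: comparison sorting replaces the radix sorts, so the sketch does not run in linear time; the names skew_suffix_array and suffix_array are hypothetical; the input is assumed to consist of positive integers with 0 reserved as padding; and the empty suffix (position n) is not reported.

```python
def skew_suffix_array(s):
    """Suffix array of s, a list of positive integers (0 is used as padding)."""
    n = len(s)
    if n < 2:
        return list(range(n))
    t = list(s) + [0, 0, 0]
    # Sample positions i mod 3 != 0, B1 first and then B2. When n mod 3 == 1 a
    # dummy position n is added so that the last B1 triple ends in padding.
    b1 = list(range(1, n + (1 if n % 3 == 1 else 0), 3))
    b2 = list(range(2, n, 3))
    c = b1 + b2
    # Step 1: name the triples by rank; recurse if the names are not unique.
    triples = sorted({tuple(t[i:i + 3]) for i in c})
    names = {tr: k + 1 for k, tr in enumerate(triples)}
    r = [names[tuple(t[i:i + 3])] for i in c]
    if len(names) < len(c):
        sa_r = skew_suffix_array(r)
    else:
        sa_r = sorted(range(len(r)), key=r.__getitem__)
    rank = [0] * (n + 3)                 # rank 0 = beyond the end of the text
    for k, idx in enumerate(sa_r):
        rank[c[idx]] = k + 1
    sample = [c[idx] for idx in sa_r]
    # Step 2: sort the non-sample suffixes by (first symbol, rank of the rest).
    nonsample = sorted(range(0, n, 3), key=lambda i: (t[i], rank[i + 1]))

    # Step 3: merge, comparing a sample suffix i with a non-sample suffix j.
    def sample_first(i, j):
        if i % 3 == 1:
            return (t[i], rank[i + 1]) <= (t[j], rank[j + 1])
        return (t[i], t[i + 1], rank[i + 2]) <= (t[j], t[j + 1], rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample) and b < len(nonsample):
        if sample_first(sample[a], nonsample[b]):
            out.append(sample[a])
            a += 1
        else:
            out.append(nonsample[b])
            b += 1
    out += sample[a:] + nonsample[b:]
    return [i for i in out if i < n]     # drop the dummy position, if any


def suffix_array(text):
    """Convenience wrapper for strings without NUL characters."""
    return skew_suffix_array([ord(ch) for ch in text])
```

Swapping the calls to sorted for radix sorts is what brings the running time down to linear; the recursion and the merge comparisons stay the same.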
Running example
i      = 0 1 2 3 4 5 6 7 8 9 10 11
T[0,n) = y a b b a d a b b a d  o  0 0
SA     = (12,1,6,4,9,3,8,2,7,5,10,11,0)
Step 0: Construct a sample
for k = 0,1,2: B_k = {i ∈ [0,n] | i mod 3 = k}
C = B_1 ∪ B_2: sample positions
S_C: sample suffixes
Example: B_1 = {1,4,7,10}, B_2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
Step 1 (cont.)
once the sample suffixes are sorted, assign a rank to each: rank(S_i) = the rank of S_i in S_C; rank(S_{n+1}) = rank(S_{n+2}) = 0
Example:
R = [abb][ada][bba][do0][bba][dab][bad][o00]
R' = (1,2,4,6,4,5,3,7)
SA_R' = (8,0,1,6,4,2,5,3,7)
i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
rank(S_i)  -  1  4  -  2  6  -  5  3  -  7  8  -  0  0
Step 2: Sort nonsample suffixes
for each non-sample S_i ∈ S_{B_0} (note that rank(S_{i+1}) is always defined for i ∈ B_0):
S_i ≤ S_j ⟺ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
radix sort the pairs (t_i, rank(S_{i+1})).
Example: S_12 < S_6 < S_9 < S_3 < S_0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
Step 3: Merge
merge the two sorted sets of suffixes using a standard comparison-based merging:
to compare S_i ∈ S_C with S_j ∈ S_{B_0}, distinguish two cases:
i ∈ B_1: S_i ≤ S_j ⟺ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
i ∈ B_2: S_i ≤ S_j ⟺ (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))
note that the ranks are defined in all cases!
S_1 < S_6 as (a,4) < (a,5) and S_3 < S_8 as (b,a,6) < (b,a,7)
b a -> -> b
d_ID(A,B) = 5, d_Levenshtein(A,B) = 4
mutation costs => probabilistic modeling
evaluation by dynamic programming => alignment
A\B     s  t  o  c  k  h  o  l  m
     0  1  2  3  4  5  6  7  8  9
t    1  2  1  2  3  4  5  6  7  8
u    2  3  2  3  4  5  6  7  8  9
k    3  4  3  4  5  4  5  6  7  8
h    4  5  4  5  6  5  4  5  6  7
o    5  6  5  4  5  6  5  4  5  6
l    6  7  6  5  6  7  6  5  4  5
m    7  8  7  6  7  8  7  6  5  4
a    8  9  8  7  8  9  8  7  6  5
d_{i,j} = min(if a_i = b_j then d_{i-1,j-1} else ∞, d_{i-1,j} + 1, d_{i,j-1} + 1)
d_ID(A,B) is the bottom-right entry
optimal alignment by trace-back
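The recurrence can be evaluated with a few lines of Python. The function name edit_distance and the substitution_cost parameter are hypothetical: with the default infinite substitution cost the function computes the insertion/deletion distance of the table, and with cost 1 it gives the unit-cost Levenshtein distance.

```python
def edit_distance(a, b, substitution_cost=float("inf")):
    """Fill the table d[i][j] = distance between a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = 0 if a[i - 1] == b[j - 1] else substitution_cost
            d[i][j] = min(d[i - 1][j - 1] + step,   # match or substitution
                          d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1)          # insertion
    return d[len(a)][len(b)]
```

For A = tukholma and B = stockholm this reproduces the two distances quoted on the slides.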
Search problem
find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') small
dyn progr with small modification: O(mn)
lots of (practical) improvement tricks
[Figure: pattern P aligned against an approximate occurrence P' inside T]
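One common form of the "small modification" is to let the first row of the table be all zeros, so that an occurrence may start anywhere in T. The sketch below assumes this reading and assumes unit-cost Levenshtein distance; approximate_end_positions is a hypothetical name, and end positions are reported 1-based.

```python
def approximate_end_positions(pattern, text, k):
    """End positions j (1-based) where some substring of text ending at j
    is within edit distance k of pattern. Time O(mn), space O(m)."""
    m = len(pattern)
    col = list(range(m + 1))             # column for the empty text prefix
    hits = []
    for j, symbol in enumerate(text, start=1):
        new = [0] * (m + 1)              # new[0] = 0: free start in the text
        for i in range(1, m + 1):
            new[i] = min(col[i - 1] + (pattern[i - 1] != symbol),  # (mis)match
                         col[i] + 1,                               # skip text symbol
                         new[i - 1] + 1)                           # skip pattern symbol
        col = new
        if col[m] <= k:
            hits.append(j)
    return hits
```

With k = 0 the function degenerates to exact matching, which gives a quick sanity check against the att example.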
Index for approximate searching?
dynamic programming: P x Tree(T) with backtracking
[Figure: P matched against the paths of Tree(T)]
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
Gapped motifs
a##bc
ababbbcccbcaaabca
a##bc
Gapped motifs of T
gapped pattern: P in (A U {#})*
gap symbol # matches any symbol in A
example: aa##bb#b
L(P) = occurrence locations of P in T
P is called a motif of T if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q.
Problem: find occurrence count |L(P)| for all gapped motifs P of T
a^n b a^n has exponentially many motifs!
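The definitions of L(P) and of a motif with quorum q can be checked directly by brute force. The sketch below is only an illustration of the definitions, not of an efficient algorithm; the names occurrence_locations and is_motif are hypothetical, and locations are 1-based start positions.

```python
def occurrence_locations(pattern, text, gap="#"):
    """L(P): 1-based start positions where the gapped pattern matches text."""
    m = len(pattern)
    return [i + 1
            for i in range(len(text) - m + 1)
            if all(p == gap or p == c for p, c in zip(pattern, text[i:i + m]))]


def is_motif(pattern, text, q=2):
    """True if the pattern occurs at least q times in text (quorum q)."""
    return len(occurrence_locations(pattern, text)) >= q
```

For instance, a##bc occurs twice in ababbbcccbcaaabca, so it is a motif of that string.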
Maximal motifs
motif P' is the maximal version of motif P if P' has the largest possible number of non-# symbols among motifs that have the same set of occurrence locations as P
every motif has a unique maximal version
unfortunately still an exponential number of different maximal motifs
Blocks of maximal motifs
aaa##b##ba has blocks aaa, b, ba
Lemma: Maximal 1-block motifs ⟺ (branching) nodes of Tree(T)
Thm: Each block of a maximal motif of T is a maximal substring motif of T. Hence there are O(n) different strings that can be used as a block of a maximal motif.
=> There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].
Counting 2-block maximal motifs
Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).
c.f., Arimura et al (2000), Apostolico et al (2004), …
Algorithm (very simple)
[Figure: 2-block motif (X,d,Y): block X and block Y at distance d]
for each maximal substring motif X                                   O(n)
  for each distance d = 1,2,…                                        O(n)
    mark the leaves of Tree(T) that correspond to locations L(X) + d
    for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in Tree(T)   O(n)
    the occurrence count of motif (X,d,Y) is h(Y)
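The counting idea of the inner steps can be illustrated without a suffix tree. In the sketch below, sets of start positions stand in for the marked leaves of Tree(T), so it shows what h(Y) counts but not the O(n) subtree computation; the names locations and two_block_count are hypothetical, and d is assumed to be the distance from the start of X to the start of Y.

```python
def locations(block, text):
    """L(block): set of 1-based start positions of block in text."""
    return {i + 1
            for i in range(len(text) - len(block) + 1)
            if text.startswith(block, i)}


def two_block_count(text, x, d, y):
    """Occurrence count of the 2-block motif (x, d, y)."""
    marked = {i + d for i in locations(x, text)}    # the "marked leaves" L(X) + d
    return len(marked & locations(y, text))         # marked positions where y starts
```

With X = a, d = 3 and Y = bc this counts the occurrences of the gapped pattern a##bc.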
Counting 2-block maximal motifs (cont)
Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).
flexible gaps: x*y, where * = gap of any length
Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).
k-block case?
Conclusion
suffix structures provide very efficient algorithmic tools for finding and learning potentially interesting patterns in strings
General case
Q1: Given q and W, does T have a motif with at least W non-gap symbols and at least q occurrences?
In the k-block case, is O(n^(2k-1)) (or even better) time possible?
related work: H. Arimura & al, A. Apostolico & al, M-F. Sagot, L. Parida, N. Pisanti, …