String algorithms
Last time:
• Knuth-Morris-Pratt algorithm
- find pattern P in string T
- preprocess P (construct Fail array)
- search time: O(n+m), n=|T|, m=|P|
Today:
• Data structures for strings
- trie
- suffix tree & generalized suffix tree
- suffix array
• Many applications
- e.g. computing the longest common substring
Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree
• Edges labeled with letters
• Leaves correspond to input strings
[Figure: trie for { had, held, help, hi }]
Trie data structure
• Cannot represent both a string T and a longer string TT' that has T as a prefix (T would not end at a leaf)
• Solution: append a special symbol $ to the end of each word
[Figure: adding h to the trie for { had, held, help, hi } as h$]
[Figure: trie for { had$, held$, help$, hi$, h$ }]
Trie data structure
• Application: storing dictionaries
• Space: O(m1+...+mk), mi = length of string Ti
• Looking up a word of length n: O(n)
- assuming the alphabet has constant size
• Alternative to binary trees
- pros and cons (see Wikipedia)
- in a binary tree, all nodes store items, not just leaves, and edges are unlabeled
[Figure: trie for { had$, held$, help$, hi$, h$ } next to a binary tree storing had, h, held, help, hi]
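The trie slides above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names TrieNode, build_trie and contains are made up here, and children are kept in a dictionary, so each step costs one expected constant-time lookup.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode


def build_trie(words, terminator="$"):
    """Insert every word, with the terminator appended, into a trie."""
    root = TrieNode()
    for word in words:
        node = root
        for letter in word + terminator:
            node = node.children.setdefault(letter, TrieNode())
    return root


def contains(root, word, terminator="$"):
    """Follow one edge per letter of word + terminator."""
    node = root
    for letter in word + terminator:
        node = node.children.get(letter)
        if node is None:
            return False
    return True


trie = build_trie(["had", "held", "help", "hi", "h"])
print(contains(trie, "help"))  # True
print(contains(trie, "hel"))   # False: "hel" is only a prefix of stored words
print(contains(trie, "h"))     # True: "h$" ends in its own leaf
```

Because of the appended $, the word h and its extensions had, held, help, hi can be stored together, which is the problem the terminator is meant to solve.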
Tries for storing suffixes
• This lecture: use tries for storing the set of suffixes of string T
• Example: T = cacao with $ appended; suffixes: cacao$, acao$, cao$, ao$, o$, $
[Figure: trie of the suffixes of cacao$]
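A small sketch of a suffix trie, assuming nested dictionaries as nodes (the function names are illustrative only). It also counts nodes, which shows why the plain trie is wasteful: it can have on the order of n^2 nodes.

```python
def build_suffix_trie(text, terminator="$"):
    """Nested-dict trie holding every suffix of text + terminator."""
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at node (including node)."""
    return 1 + sum(count_nodes(child) for child in node.values())


suffix_trie = build_suffix_trie("cacao")
print(sorted(suffix_trie))       # ['$', 'a', 'c', 'o']
print(count_nodes(suffix_trie))  # 19
```

Every node other than the root corresponds to one distinct substring of cacao$, so the node count is the number of distinct substrings plus one.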
Reducing space requirements
• Trick 1: Delete non-branching nodes, merge labels (compact trie)
[Figure: compact trie of the suffixes of cacao$, with edge labels a, ca, cao$, o$ and $]
Reducing space requirements
• # leaves: n+1
• # internal nodes: at most n
• # edges: at most 2n
• Trick 2: Store edge labels via two pointers into the string (begin & end)
=> O(n) storage
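Both tricks can be combined in a short sketch. Note the assumptions: this inserts suffixes one at a time and splits edges on a mismatch, which takes quadratic time in the worst case; it is not one of the linear-time constructions. The names Node, build_suffix_tree and edge_labels are hypothetical, and the text is assumed not to contain the terminator.

```python
class Node:
    def __init__(self, begin=0, end=0):
        self.begin = begin   # edge into this node is labeled text[begin:end]
        self.end = end
        self.children = {}   # first letter of the child's edge -> Node


def build_suffix_tree(text, terminator="$"):
    """Naive compact suffix tree; returns (text + terminator, root)."""
    text += terminator
    root = Node()
    for start in range(len(text)):
        node, i = root, start
        while True:
            child = node.children.get(text[i])
            if child is None:                 # no edge starts with text[i]
                node.children[text[i]] = Node(i, len(text))
                break
            j = child.begin                   # walk along the edge label
            while j < child.end and text[j] == text[i]:
                i += 1
                j += 1
            if j == child.end:                # whole label matched, descend
                node = child
                continue
            middle = Node(child.begin, j)     # mismatch: split the edge at j
            node.children[text[child.begin]] = middle
            child.begin = j
            middle.children[text[j]] = child
            middle.children[text[i]] = Node(i, len(text))
            break
    return text, root


def edge_labels(text, node):
    """Edge labels in depth-first order, children sorted by first letter."""
    labels = []
    for letter in sorted(node.children):
        child = node.children[letter]
        labels.append(text[child.begin:child.end])
        labels.extend(edge_labels(text, child))
    return labels


text, root = build_suffix_tree("cacao")
print(edge_labels(text, root))
# ['$', 'a', 'cao$', 'o$', 'ca', 'cao$', 'o$', 'o$']
```

Each node stores only two integers for its label, so the tree for cacao$ uses 8 edges (at most 2n = 10) regardless of how long the labels are.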
Suffix trees
• The compact trie of all suffixes of T is called the suffix tree for string T [Weiner’73]
• D. Knuth: “algorithm of the year 1973”
• Can be constructed in linear time
- e.g. [McCreight’76], [Ukkonen’95]
- algorithms are complicated [will not be covered]
• Used for all sorts of string problems
Suffix trees for string matching
Does text T contain pattern P?
• Construct suffix tree for T in O(|T|) time
• Query takes O(|P|) time!
• same as Knuth-Morris-Pratt, but...
• Scenario: text T is fixed, patterns P1,...,Pk come one at a time
- T is very long
• Time: O(|T| + |P1| + |P2| + ... + |Pk|)
• Knuth-Morris-Pratt can be generalized to multiple patterns (Aho-Corasick algorithm). Same complexity as above, but requires knowledge of P1,...,Pk in advance
Getting extra information
Task: Get all occurrences of P in T (give a list of start positions)
• Can be done in O(|P|+c) time, where c is the # of occurrences
- Locate the node corresponding to P
- Traverse all leaves of the subtree rooted at this node
Task: Get # of occurrences of P in T
• Can be done in O(|P|) time
- With O(|T|) preprocessing (store at each node the # of leaves in its subtree)
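The two queries above can be illustrated with a simplified sketch. To keep it short it uses an uncompressed suffix trie instead of a compact suffix tree, so it needs quadratic space and the leaf traversal is not O(c); the query logic (locate the node for P, then read its leaf count or collect its leaves) is the same. All names are illustrative, and positions are 0-based.

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.start = None    # on leaves: start position of the suffix
        self.leaf_count = 0  # number of leaves in this node's subtree


def build(text, terminator="$"):
    """Suffix trie of text + terminator with leaf counts filled in."""
    text += terminator
    root = Node()
    for start in range(len(text)):
        node = root
        node.leaf_count += 1
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
            node.leaf_count += 1     # this suffix's leaf lies below node
        node.start = start
    return root


def locate(root, pattern):
    """Node reached by spelling pattern from the root, or None."""
    node = root
    for letter in pattern:
        node = node.children.get(letter)
        if node is None:
            return None
    return node


def count_occurrences(root, pattern):
    node = locate(root, pattern)
    return node.leaf_count if node is not None else 0


def occurrences(root, pattern):
    node = locate(root, pattern)
    if node is None:
        return []
    found, stack = [], [node]
    while stack:                     # visit the whole subtree below P
        node = stack.pop()
        if node.start is not None:
            found.append(node.start)
        stack.extend(node.children.values())
    return sorted(found)


root = build("cacao")
print(count_occurrences(root, "ca"))  # 2
print(occurrences(root, "ca"))        # [0, 2]
print(occurrences(root, "oc"))        # []
```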
Generalized suffix tree
• Input: a set of strings {T1,...,Tk}
• Generalized suffix tree: tree containing all suffixes of all strings
• Tree for { abba, bac }:
- suffixes of abba$1: abba$1, bba$1, ba$1, a$1, $1
- suffixes of bac$2: bac$2, ac$2, c$2, $2
[Figure: generalized suffix tree for { abba, bac }]
Construction of generalized suffix tree for {X, Y}
• Construct suffix tree for X$1Y$2
• Edges leading to red leaves (suffixes that start in X) are labeled “...$1Y$2”; delete the trailing “Y$2” so that each label ends at $1
[Figure: suffix tree for abba$1bac$2, before and after truncating the labels of edges leading to red leaves]
Construction of generalized suffix tree {T1,...,Tk}
• Can use the same trick
- construct suffix tree for T1 $1 ... Tk $k
• Linear runtime: O(|T1|+...+|Tk|)
• In practice, modify the algorithm (e.g. Ukkonen’s algorithm) to get slightly better performance
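Computing the longest common substring of {T1,T2}
• Leaves are red for suffixes of T1 and blue for suffixes of T2
• Mark each internal node with red (blue) if it has at least one red (blue) child
• Nodes with both colors correspond to common substrings; the deepest such node (longest string from the root) gives the longest common substring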
[Figure: generalized suffix tree for { abba, bac } with nodes marked red and blue]
Computing the longest common substring of {T1,...,TK}
Task: For each r = 2,...,K compute the length of the longest substring contained in at least r strings
1. Construct generalized suffix tree for {T1,...,TK}
2. Mark each node v with color k=1,...,K if its subtree has a leaf with that color
3. Compute c(v) = # of colors at v
- Naive computation: O(nK), n=|T1|+...+|TK|. Can also be done in O(n)
String corresponding to v is contained in exactly c(v) input strings
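A sketch of the three steps for K strings, under the same simplifying assumption as before: an uncompressed trie of all suffixes stands in for the generalized suffix tree, so space is quadratic. The function name longest_shared_lengths is hypothetical, and colors are stored as a set per node, which corresponds to the naive O(nK)-style computation of c(v).

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.colors = set()  # indices of strings containing this substring


def longest_shared_lengths(strings):
    """Map r -> length of the longest substring in at least r strings."""
    root = Node()
    for color, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for letter in text[start:]:
                node = node.children.setdefault(letter, Node())
                node.colors.add(color)
    best = {r: 0 for r in range(2, len(strings) + 1)}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for child in node.children.values():
            c = len(child.colors)            # c(v) for the child
            for r in range(2, c + 1):
                best[r] = max(best[r], depth + 1)
            stack.append((child, depth + 1))
    return best


print(longest_shared_lengths(["abba", "bac", "cab"]))  # {2: 2, 3: 1}
```

In the example, ba and ab each occur in two of the strings, while only the single letters a and b occur in all three.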
Suffix array for text T[1..n]
Suffix array: stores suffixes of string T in lexicographic order
• SA[i] = id of the i’th smallest suffix
• Number k corresponds to suffix T[k+1..n]
• Lexicographic order:
- X < XY for any non-empty Y (a proper prefix comes first)
- Xa... < Xb... if a < b (a, b are letters)
Example: T = mississippi (n = 11), SA[0..n]:
i    SA[i]  suffix
0    11     (empty)
1    10     i
2    7      ippi
3    4      issippi
4    1      ississippi
5    0      mississippi
6    9      pi
7    8      ppi
8    6      sippi
9    3      sissippi
10   5      ssippi
11   2      ssissippi
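The definition can be checked with a naive sketch that simply sorts the suffixes. This is only meant to illustrate what SA contains; it is far from the linear-time constructions, since each comparison may scan a whole suffix. The function name suffix_array is an assumption, and Python's 0-based slice text[k:] plays the role of T[k+1..n].

```python
def suffix_array(text):
    """SA[0..n]: ids k of the suffixes text[k:], in lexicographic order."""
    return sorted(range(len(text) + 1), key=lambda k: text[k:])


sa = suffix_array("mississippi")
print(sa)  # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

Python's string comparison already follows the two rules above: a proper prefix is smaller than its extension, and otherwise the first differing letter decides.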
Pattern search from a suffix array
Task: Find pattern P = iss in text T = mississippi (m = |P|, n = |T|)
• All occurrences appear contiguously in the array
• Computing start & end indexes: binary search
• Worst-case: O(m log n)
• In practice, often O(m + log n)
• Can be improved to O(m + log n) worst-case
- need to store extra 2n numbers
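The binary search can be sketched as follows. This is the plain O(m log n) variant, without the extra 2n numbers needed for the improved bound; the suffix array is built by naive sorting, and the names suffix_array and find_occurrences are illustrative. Positions are 0-based.

```python
def suffix_array(text):
    """Naive construction: sort suffix ids by the suffixes themselves."""
    return sorted(range(len(text) + 1), key=lambda k: text[k:])


def find_occurrences(text, sa, pattern):
    """Start positions of pattern in text, via two binary searches on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix >= P
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose m-prefix > P
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])          # the contiguous block of matches


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "iss"))  # [1, 4]
```

Each comparison looks at no more than m letters, and each search makes O(log n) comparisons, which gives the O(m log n) worst case stated above.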
Construction of suffix arrays
• SA[0..n] can be constructed in linear time
- from a suffix tree
- or directly [Kim, Sim, Park, Park’03], [Kärkkäinen & Sanders’03], [Ko & Aluru’03]
Summary
• Suffix trees & suffix arrays: two powerful data structures for strings
• Space requirements:
- suffix trees: linear but with a large constant, e.g. 20 |T|
- suffix arrays: more efficient, e.g. 5 |T|
• Suffix arrays are more suited to large alphabets
• Practitioners like suffix arrays (simplicity, space efficiency)
• Theoreticians like suffix trees (explicit structure)
Further information on string algorithms
• Dan Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”, 1997
• Slides mentioning more recent results: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt