Erice2005

8/2/2019 Erice2005


Suffix tree and suffix array techniques for pattern analysis in strings

Esko Ukkonen

Univ Helsinki, Erice School, 30 Oct 2005


Algorithms for combinatorial string matching?

deep beauty? +-

shallow beauty? +

applications? ++ intensive algorithmic miniatures

sources of new problems:

text processing, DNA, music,


Analysis of a string of symbols

T = h a t t i v a t t i text

P = a t t pattern

Find the occurrences of P in T: h a t t i v a t t i

Pattern synthesis: #(t) = 4, #(atti) = 2, #(t****t) = 2
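The counts on this slide can be checked with a brute-force scan. The sketch below is only an illustration: the function name `count_occurrences` is made up here, and `*` is assumed to match any single symbol, as in #(t****t).

```python
def count_occurrences(text, pattern, wildcard="*"):
    """Count (possibly overlapping) occurrences of pattern in text.

    The wildcard symbol is assumed to match any single symbol.
    """
    m = len(pattern)
    return sum(
        all(p == wildcard or p == c for p, c in zip(pattern, text[i:i + m]))
        for i in range(len(text) - m + 1)
    )


T = "hattivatti"
for pat in ("t", "atti", "t****t"):
    print(pat, count_occurrences(T, pat))
```

This takes O(|T||P|) time per pattern, which is what the index structures of the following slides avoid.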


Pattern finding & synthesis problems

T = t1t2...tn, P = p1p2...pm, strings of symbols in a finite alphabet

Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast; static text, any given pattern P

Pattern synthesis problem: Learn from T new patterns that occur surprisingly often

What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps,


1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs


The suffix tree Tree(T) of T

data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T

linear size: |Tree(T)| = O(|T|)

can be constructed in linear time O(|T|)

has myriad virtues (A. Apostolico)

is well-known: 366 000 Google hits


Suffix trie and suffix tree

[Figure: Trie(abaab) and Tree(abaab) side by side]


Trie(T) can be large

|Trie(T)| = O(|T|^2)

bad example: T = a^n b^n

Trie(T) can be seen as a DFA: language accepted = the suffixes of T

minimize the DFA => directed acyclic word graph (DAWG)
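The quadratic growth for T = a^n b^n can be observed directly by inserting every suffix into a trie and counting nodes. This is a minimal sketch; the helper name `suffix_trie_size` and the nested-dict representation are choices made here, not part of the slides.

```python
def suffix_trie_size(text):
    """Number of nodes (root included) of the uncompacted suffix trie of text."""
    root = {}
    nodes = 1
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node:      # new trie node for a substring not seen before
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


for n in (1, 2, 4, 8):
    t = "a" * n + "b" * n
    print(len(t), suffix_trie_size(t))
```

Each node corresponds to a distinct substring, and a^n b^n has n^2 + 2n distinct non-empty substrings, so the trie has (n+1)^2 nodes for a text of length 2n.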


Tree(T) is of linear size

only the internal branching nodes and the leaves represented explicitly

edges labeled by substrings of T

v = node(α) if the path from root to v spells α

one-to-one correspondence of leaves and suffixes

|T| leaves, hence < |T| internal nodes

|Tree(T)| = O(|T| + size(edge labels))


Tree(hattivatti)

Suffixes: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i

[Figure: suffix tree of hattivatti, edges labeled by substrings]


substring labels of edges represented as pairs of pointers

[Figure: Tree(hattivatti) with edge labels given as pointer pairs into T, e.g. 6,10 for vatti, 2,5 for atti, 4,5 for ti, 3,3 for t]


Tree(T) is a full-text index

[Figure: path for P in Tree(T), leading to leaves 31 and 8]

P occurs in T at locations 8, 31, ...

P occurs in T <=> P is a prefix of some suffix of T <=> Path for P exists in Tree(T)

All occurrences of P in time O(|P| + #occ)
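The equivalence "P occurs in T <=> P is a prefix of some suffix of T" can be tried out with a deliberately naive index. The sketch below uses an uncompacted suffix trie that stores, at every node, the start positions of the suffixes passing through it; this takes quadratic space, unlike Tree(T), and the names `build_index` and `find` are hypothetical.

```python
def build_index(text):
    """Naive suffix trie; every node keeps the 1-based start positions
    of all suffixes whose path passes through it (quadratic space)."""
    root = {"pos": [], "next": {}}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["next"].setdefault(ch, {"pos": [], "next": {}})
            node["pos"].append(start + 1)
    return root


def find(index, pattern):
    """Follow the path spelling pattern; its node lists all occurrences."""
    node = index
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return []               # no path for pattern => no occurrence
    return node["pos"]


index = build_index("hattivatti")
print(find(index, "att"))
```

The search walks |P| arcs and then reads off the answer, which mirrors the O(|P| + #occ) bound; the suffix tree gets the same behavior in linear space by collecting the leaves below the end of the path.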

Find att from Tree(hattivatti)

[Figure: the path spelling att in Tree(hattivatti) leads to the leaves for suffixes 2 (attivatti) and 7 (atti)]


Linear time construction of Tree(T)

Weiner (1973), 'algorithm of the year'

McCreight (1976)

on-line algorithm (Ukkonen 1992)


On-line construction of Trie(T)

T = t1t2...tn$

Pi = t1t2...ti, the i:th prefix of T

on-line idea: update Trie(Pi) to Trie(Pi+1) => very simple construction


Trie(abaab)

[Figure: Trie(a), Trie(ab), Trie(aba)]

chain of links connects the end points of current suffixes

[Figure: Trie(abaa)]

Add next symbol = b

From here on b-arc already exists


What happens in Trie(Pi) => Trie(Pi+1) ?

[Figure: Trie(Pi) before and after adding a_i: new nodes and new suffix links]

From here on the a_i-arc exists already => stop updating here


time: O(size of Trie(T))

suffix links: slink(node(aα)) = node(α)


On-line procedure for suffix trie

1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root

2. for i := 2 to n do begin

3.   v_{i-1} := leaf of Trie(t1...t_{i-1}) for string t1...t_{i-1} (i.e., the deepest leaf)

4.   v := v_{i-1}; v' := 0

5.   while node v has no outgoing arc for ti do begin

6.     Create a new node v'' and an arc son(v, ti) = v''

7.     if v' ≠ 0 then slink(v') := v''

8.     v := slink(v); v' := v'' end

9.   for the node v'' such that v'' = son(v, ti) do
       if v'' = v' then slink(v') := root else slink(v') := v''
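The numbered procedure can be sketched in Python as follows. This is an illustrative transcription, not the author's code: the `Node` class, the function names, and the choice to start from a bare root (so that the first symbol is handled by the same loop) are assumptions made here.

```python
class Node:
    def __init__(self):
        self.son = {}       # outgoing arcs: symbol -> child node
        self.slink = None   # suffix link


def build_suffix_trie(text):
    """On-line construction: after reading i symbols the trie is Trie(t1...ti)."""
    root = Node()
    root.slink = root
    deepest = root                        # leaf for the whole prefix read so far
    for ch in text:
        v, prev, first = deepest, None, None
        while ch not in v.son:            # walk the suffix link path
            new = Node()
            v.son[ch] = new
            if prev is not None:
                prev.slink = new          # link the previously created node
            else:
                first = new               # child of the old deepest leaf
            prev = new
            v = v.slink
        w = v.son[ch]                     # from here on the ch-arc exists
        prev.slink = root if w is prev else w
        deepest = first
    return root


def count_nodes(root):
    """Count all trie nodes with an explicit stack."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.son.values())
    return total


def spells(root, s):
    """True if some path from the root spells s."""
    node = root
    for ch in s:
        if ch not in node.son:
            return False
        node = node.son[ch]
    return True


trie = build_suffix_trie("abaab")
print(count_nodes(trie), spells(trie, "aab"), spells(trie, "bb"))
```

The work per round is proportional to the number of nodes created, which is why the total time is O(size of Trie(T)).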


Suffix trees on-line

compacted version of the on-line trie construction: simulate the construction on the linear size tree instead of the trie => time O(|T|)

all trie nodes are conceptually still needed => implicit and real nodes


Implicit and real nodes

Pair (v, α) is an implicit node in Tree(T) if v is a node of Tree(T) and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v, α) is a real node (= v).

Let v = node(α') in Tree(T). Then implicit node (v, α) represents node(α'α) of Trie(T)

Implicit node

[Figure: real node v and an implicit node (v, α) on an arc leaving v]

Suffix links and open arcs

[Figure: node v, its suffix link slink(v), and an arc leading to a leaf w]

label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T


Big picture

suffix link path traversed: total work O(n)

new arcs and nodes created: total work O(size(Tree(T)))


On-line procedure for suffix tree

Input: string T = t1t2...tn$

Output: Tree(T)

Notation: son(v, α) = w iff there is an arc from v to w with label α

son(v, ε) = v

Function Canonize(v, α):

  while son(v, α') ≠ 0 where α = α'α'', |α'| > 0 do
    v := son(v, α'); α := α''

  return (v, α)


Suffix-tree on-line: main procedure

Create Tree(t1); slink(root) := root

(v, α) := (root, ε) /* (v, α) is the start node */

for i := 2 to n+1 do

  v' := 0

  while there is no arc from v with label prefix αti do

    if α ≠ ε then /* divide the arc w = son(v, αη) into two */

      son(v, α) := v''; son(v'', ti) := v'''; son(v'', η) := w

    else

      son(v, ti) := v'''; v'' := v

    if v' ≠ 0 then slink(v') := v''

    v' := v''; v := slink(v); (v, α) := Canonize(v, α)

  if v' ≠ 0 then slink(v') := v

  (v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */
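For experimentation, here is a compact Python sketch of the on-line suffix tree construction. It is not a line-by-line transcription of the slide: it uses the common "active point" bookkeeping (`active_node`, `active_edge`, `active_len`, `remainder`) in place of the reference pair (v, α) and Canonize, and all class and function names are assumptions made here. Open arcs [i,*] are represented by `end = None`.

```python
class Node:
    def __init__(self, start, end):
        self.start = start      # label of the arc into this node is text[start:end]
        self.end = end          # None = open arc [start, *] of a leaf
        self.children = {}
        self.slink = None       # suffix link (None is treated as root)


def build_suffix_tree(text):
    """On-line construction; text should end with a unique symbol such as '$'."""
    root = Node(-1, -1)
    active_node, active_edge, active_len, remainder = root, 0, 0, 0
    for pos, ch in enumerate(text):
        last_new = None
        remainder += 1
        while remainder > 0:
            if active_len == 0:
                active_edge = pos
            c = text[active_edge]
            if c not in active_node.children:
                active_node.children[c] = Node(pos, None)        # new leaf
                if last_new is not None:
                    last_new.slink = active_node
                    last_new = None
            else:
                nxt = active_node.children[c]
                edge_len = (pos + 1 if nxt.end is None else nxt.end) - nxt.start
                if active_len >= edge_len:                        # canonize
                    active_edge += edge_len
                    active_len -= edge_len
                    active_node = nxt
                    continue
                if text[nxt.start + active_len] == ch:            # arc exists: stop
                    if last_new is not None and active_node is not root:
                        last_new.slink = active_node
                    active_len += 1
                    break
                split = Node(nxt.start, nxt.start + active_len)   # divide the arc
                active_node.children[c] = split
                split.children[ch] = Node(pos, None)
                nxt.start += active_len
                split.children[text[nxt.start]] = nxt
                if last_new is not None:
                    last_new.slink = split
                last_new = split
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = pos - remainder + 1
            elif active_node is not root:
                active_node = active_node.slink or root
    return root


def leaves_and_internal(root, n):
    """Return (sorted 0-based suffix starts of the leaves, number of internal nodes)."""
    starts, internal, stack = [], 0, [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.children:
            internal += 1
        else:
            starts.append(n - depth)      # a leaf at string depth d is suffix n - d
        for child in node.children.values():
            end = n if child.end is None else child.end
            stack.append((child, depth + end - child.start))
    return sorted(starts), internal


T = "hattivatti$"
print(leaves_and_internal(build_suffix_tree(T), len(T)))
```

With the terminator included, the tree has one leaf per suffix and fewer internal nodes than leaves, matching the size claim "|T| leaves, hence < |T| internal nodes".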


The actual time and space

|Tree(T)| is about 20|T| in practice

brute-force construction is O(|T| log|T|) for random strings as the average depth of internal nodes is O(log|T|)

difference between linear and brute-force constructions not necessarily large (Giegerich & Kurtz)

truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003)

alphabet independent linear time (Farach 1997)


1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs


Suffix array: example

suffix array = lexicographic order of the suffixes

T = hattivatti (positions 1..10; 11 is the empty suffix)

SA    suffix
11    (empty)
 7    atti
 2    attivatti
 1    hattivatti
10    i
 5    ivatti
 9    ti
 4    tivatti
 8    tti
 3    ttivatti
 6    vatti


Suffix array

suffix array SA(T) = an array giving the lexicographic order of the suffixes of T

space requirement: 5|T|

practitioners like suffix arrays (simplicity, space efficiency)

theoreticians like suffix trees (explicit structure)


Pattern search from suffix array

[Figure: the suffix array of hattivatti; binary search for att locates the adjacent entries atti (7) and attivatti (2)]
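A minimal sketch of the binary search follows. The suffix array is built here by plain sorting of the suffixes (simple but not linear time), and the function names are hypothetical. Positions are reported 1-based to match the slides.

```python
def suffix_array(text):
    """0-based suffix array by direct sorting of the suffixes (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_search(text, sa, pattern):
    """All 1-based occurrences of pattern, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[k] + 1 for k in range(first, lo))


T = "hattivatti"
SA = suffix_array(T)
print([i + 1 for i in SA])
print(sa_search(T, SA, "att"))
```

Each comparison inspects at most |P| symbols, so this plain version runs in O(|P| log|T| + #occ) time.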


Recent suffix array constructions

Manber & Myers (1990): O(|T| log|T|)

linear time via suffix tree

January / June 2003: direct linear time construction of suffix array

- Kim, Sim, Park, Park (CPM03)

- Kärkkäinen & Sanders (ICALP03)

- Ko & Aluru (CPM03)


Kärkkäinen-Sanders algorithm

1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.

2. Construct the suffix array of the remaining suffixes using the result of the first step.

3. Merge the two suffix arrays into one.
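Why the recursion on a string of two thirds the length still gives linear time can be seen from the recurrence below. It is a sketch under the assumption that all non-recursive work in steps 1 to 3 (sorting and merging) costs at most cn for some constant c; rounding of 2n/3 is ignored.

```latex
T(n) \;=\; T\!\left(\tfrac{2n}{3}\right) + cn
\;\le\; cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
\;=\; 3cn \;=\; O(n)
```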


Running example

position:  0  1  2  3  4  5  6  7  8  9 10 11
T[0,n) =   y  a  b  b  a  d  a  b  b  a  d  o  0  0

SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)


Step 0: Construct a sample

for k = 0,1,2: Bk = {i ∈ [0,n] | i mod 3 = k}

C = B1 ∪ B2, the sample positions

SC = the sample suffixes

Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
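The sample for the running example (n = 12) can be reproduced with a few lines; the function name `sample_positions` is made up for this sketch.

```python
def sample_positions(n):
    """Return (B0, B1, B2, C) for positions 0..n, with C = B1 followed by B2."""
    b = [[i for i in range(n + 1) if i % 3 == k] for k in range(3)]
    return b[0], b[1], b[2], b[1] + b[2]


B0, B1, B2, C = sample_positions(len("yabbadabbado"))
print(B1, B2, C)
```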


Step 1 (cont.)

once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0

Example: R = [abb][ada][bba][do0][bba][dab][bad][o00]

R = (1,2,4,6,4,5,3,7)

SAR = (8,0,1,6,4,2,5,3,7)

rank(Si) = - 1 4 - 2 6 - 5 3 - 7 8 - 0 0 (for i = 0,...,14; - marks the non-sample positions)
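The numbers of this example can be recomputed as follows. This is a sketch with two simplifications that are not part of the algorithm: the triples are named by ordinary sorting instead of radix sort, and the recursive call on R is replaced by a naive suffix sort. The character '0' stands in for the padding symbol, and the function name is hypothetical.

```python
def sample_ranks(text):
    """Return (R, SAR, rank) for the sample suffixes of text."""
    n = len(text)
    padded = text + "000"            # '0' plays the role of the padding symbol
    sample = [i for i in range(n + 1) if i % 3 == 1] + \
             [i for i in range(n + 1) if i % 3 == 2]
    triples = [padded[i:i + 3] for i in sample]
    # name each triple by its rank among the distinct triples
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    R = [names[t] for t in triples]
    # stand-in for the recursive call: sort the suffixes of R followed by a 0
    R0 = R + [0]
    SAR = sorted(range(len(R0)), key=lambda k: R0[k:])
    # SAR[0] is the terminator; the rest gives the order of the sample suffixes
    rank = {sample[k]: r for r, k in enumerate(SAR[1:], start=1)}
    rank[n + 1] = rank[n + 2] = 0
    return R, SAR, rank


R, SAR, rank = sample_ranks("yabbadabbado")
print(R)
print(SAR)
print([rank.get(i, "-") for i in range(15)])
```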


Step 2: Sort nonsample suffixes

for each non-sample Si ∈ SB0 (note that rank(Si+1) is always defined for i ∈ B0):

Si ≤ Sj <=> (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))

radix sort the pairs (ti, rank(Si+1)).

Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
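Step 2 on the running example, as a sketch: the ranks are copied from the Step 1 example, Python's built-in sort is used where the slide calls for a radix sort, and '0' again stands for the padding symbol. The function name is an assumption.

```python
def sort_nonsample(text, rank):
    """Sort the suffixes starting at positions i with i mod 3 = 0 by the
    pairs (t_i, rank(S_{i+1})); rank maps sample positions to ranks."""
    n = len(text)
    padded = text + "00"
    b0 = [i for i in range(n + 1) if i % 3 == 0]
    return sorted(b0, key=lambda i: (padded[i], rank[i + 1]))


# ranks of the sample suffixes from the Step 1 example
RANK = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8, 13: 0, 14: 0}
print(sort_nonsample("yabbadabbado", RANK))
```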


Step 3: Merge

merge the two sorted sets of suffixes using a standard comparison-based merging:

to compare Si ∈ SC with Sj ∈ SB0, distinguish two cases:

i ∈ B1: Si ≤ Sj <=> (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))

i ∈ B2: Si ≤ Sj <=> (ti, ti+1, rank(Si+2)) ≤ (tj, tj+1, rank(Sj+2))

note that the ranks are defined in all cases!

S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)
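The merge step can be replayed on the running example. In this sketch the two sorted lists and the ranks are taken from the earlier examples rather than computed; the names `merge_suffixes` and `sample_first` are hypothetical, and '0' stands for the padding symbol.

```python
def merge_suffixes(text, rank, sorted_sample, sorted_nonsample):
    """Merge the sorted sample (i mod 3 != 0) and non-sample (i mod 3 = 0)
    suffixes using the pair/triple comparisons of Step 3."""
    t = text + "000"

    def sample_first(i, j):
        if i % 3 == 1:
            return (t[i], rank[i + 1]) <= (t[j], rank[j + 1])
        return (t[i], t[i + 1], rank[i + 2]) <= (t[j], t[j + 1], rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sorted_sample) and b < len(sorted_nonsample):
        if sample_first(sorted_sample[a], sorted_nonsample[b]):
            out.append(sorted_sample[a])
            a += 1
        else:
            out.append(sorted_nonsample[b])
            b += 1
    return out + sorted_sample[a:] + sorted_nonsample[b:]


RANK = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8, 13: 0, 14: 0}
SC = [1, 4, 8, 2, 7, 5, 10, 11]      # sample suffixes in rank order
SB0 = [12, 6, 9, 3, 0]               # result of Step 2
print(merge_suffixes("yabbadabbado", RANK, SC, SB0))
```

Every comparison touches at most two symbols and one rank, so the merge is linear in the number of suffixes.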

dID(A,B) = 5, dLevenshtein(A,B) = 4

mutation costs => probabilistic modeling

evaluation by dynamic programming => alignment


A\B      s  t  o  c  k  h  o  l  m
      0  1  2  3  4  5  6  7  8  9
t     1  2  1  2  3  4  5  6  7  8
u     2  3  2  3  4  5  6  7  8  9
k     3  4  3  4  5  4  5  6  7  8
h     4  5  4  5  6  5  4  5  6  7
o     5  6  5  4  5  6  5  4  5  6
l     6  7  6  5  6  7  6  5  4  5
m     7  8  7  6  7  8  7  6  5  4
a     8  9  8  7  8  9  8  7  6  5

d_{i,j} = min( d_{i-1,j-1} if a_i = b_j else ∞, d_{i-1,j} + 1, d_{i,j-1} + 1 )

dID(A,B) is the bottom-right entry; optimal alignment by trace-back
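The table and both distances for A = tukholma, B = stockholm can be recomputed with the recurrence. In this sketch a flag switches between the insertion/deletion distance of the table and the Levenshtein distance, where a mismatch on the diagonal costs 1 instead of being forbidden; the function name and the flag are assumptions.

```python
def distance_table(a, b, allow_substitution):
    """Dynamic programming table d[i][j] for prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete i symbols
    for j in range(n + 1):
        d[0][j] = j                      # insert j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                diag = d[i - 1][j - 1]
            elif allow_substitution:
                diag = d[i - 1][j - 1] + 1
            else:
                diag = float("inf")      # the "else ∞" branch of the recurrence
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


A, B = "tukholma", "stockholm"
print(distance_table(A, B, allow_substitution=False)[-1][-1])   # indel distance
print(distance_table(A, B, allow_substitution=True)[-1][-1])    # Levenshtein distance
```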


Search problem

find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') small

dyn progr with small modification: O(mn)

lots of (practical) improvement tricks

[Figure: P aligned against a substring P' of T]
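One standard reading of the "small modification" is to let an occurrence start anywhere in T by initializing the row of the empty pattern prefix to 0. The sketch below assumes this reading and uses Levenshtein costs; the function name and the threshold parameter `k` are made up for the illustration.

```python
def approx_end_positions(pattern, text, k):
    """1-based end positions j such that some substring of text ending at j
    is within Levenshtein distance k of pattern. Time O(mn), space O(m)."""
    m = len(pattern)
    prev = list(range(m + 1))            # column for the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        cur = [0]                        # d(0, j) = 0: a match may start anywhere
        for i in range(1, m + 1):
            cur.append(min(prev[i - 1] + (pattern[i - 1] != c),
                           prev[i] + 1,
                           cur[i - 1] + 1))
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


print(approx_end_positions("att", "hattivatti", 0))
print(approx_end_positions("att", "hattivatti", 1))
```

With k = 0 this degenerates to exact matching, which gives a quick sanity check of the recurrence.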


Index for approximate searching?

dynamic programming: P x Tree(T) with backtracking

[Figure: P evaluated against the paths of Tree(T)]


1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs


Gapped motifs

a##bc

ababbbcccbcaaabca

[Figure: the two occurrences of a##bc aligned under the text]


Gapped motifs of T

gapped pattern: P in (A ∪ {#})*

gap symbol # matches any symbol in A; example: aa##bb#b

L(P) = occurrence locations of P in T

P is called a motif of T if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q.

Problem: find occurrence count |L(P)| for all gapped motifs P of T

a^n b a^n has exponentially many motifs!
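The occurrence list L(P) of a single gapped pattern is easy to compute by brute force, which helps when checking small examples by hand. The function names below are hypothetical and positions are 0-based.

```python
def locations(pattern, text, gap="#"):
    """L(P): 0-based start positions where the gapped pattern occurs in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if all(p == gap or p == text[i + k] for k, p in enumerate(pattern))]


def is_motif(pattern, text, quorum=2):
    """True if the pattern has at least `quorum` occurrences in text."""
    return len(locations(pattern, text)) >= quorum


T = "ababbbcccbcaaabca"
print(locations("a##bc", T), is_motif("a##bc", T))
```

Enumerating all gapped patterns this way is hopeless in general, since their number can be exponential; the following slides restrict attention to maximal motifs.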


Maximal motifs

motif P' is the maximal version of motif P if P' has the largest possible number of non-# symbols among motifs that have the same set of occurrence locations as P

every motif has a unique maximal version

unfortunately still exponential number of different maximal motifs


Blocks of maximal motifs

aaa##b##ba has blocks aaa, b, ba

Lemma: Maximal 1-block motifs <=> (branching) nodes of Tree(T)

Thm: Each block of a maximal motif of T is a maximal substring motif of T. Hence there are O(n) different strings that can be used as a block of a maximal motif.

=> There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].


Counting 2-block maximal motifs

Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

c.f., Arimura et al (2000), Apostolico et al (2004),

Algorithm (very simple)



[Figure: 2-block motif (X,d,Y): block X, and block Y starting at distance d from X]

for each maximal substring motif X   [O(n)]

  for each distance d = 1,2,...   [O(n)]

    mark the leaves of Tree(T) that correspond to locations L(X) + d

    for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in Tree(T)   [O(n)]

    the occurrence count of motif (X,d,Y) is h(Y)
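The marking idea can be illustrated without a suffix tree. In this sketch the leaves in the subtree of Y are replaced by the set of occurrence positions of Y, computed by brute force, and a single triple (X, d, Y) is counted rather than all of them; d is assumed to be measured from the start of X to the start of Y, and the function names are hypothetical.

```python
def occurrences(text, s):
    """0-based start positions of the substring s in text."""
    return {i for i in range(len(text) - len(s) + 1) if text.startswith(s, i)}


def two_block_count(text, x, d, y):
    """Occurrence count of the 2-block motif (x, d, y)."""
    marked = {p + d for p in occurrences(text, x)}     # locations L(X) + d
    return len(marked & occurrences(text, y))          # h(Y): marked occurrences of Y


T = "ababbbcccbcaaabca"
print(two_block_count(T, "a", 3, "bc"))    # the motif a##bc
```

In the suffix tree version the marks for one pair (X, d) are shared by all candidate blocks Y, and the counts h(Y) for all Y come from a single bottom-up pass over Tree(T); this is the source of the O(n) cost of the innermost step.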


Counting 2-block maximal motifs (cont)

Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

flexible gaps: x*y, where * = gap of any length

Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).

k-block case?


Conclusion

suffix structures provide very efficient algorithmic tools for finding and learning potentially interesting patterns in strings


General case

Q1: Given q and W, has T a motif with at least W non-gap symbols and at least q occurrences?

In k-block case, is O(n^(2k-1)) (or even better) time possible?

related work: H. Arimura & al, A. Apostolico & al, M-F. Sagot, L. Parida, N. Pisanti,
