Erice2005


  • 8/2/2019 Erice2005


    Suffix tree and suffix array techniques for pattern analysis in strings

    Esko Ukkonen

    Univ. Helsinki, Erice School, 30 Oct 2005


    Algorithms for combinatorial string matching?

    deep beauty? +-

    shallow beauty? +

    applications? ++

    intensive algorithmic miniatures

    sources of new problems:

    text processing, DNA, music, ...


    Analysis of a string of symbols

    T = h a t t i v a t t i   (text)

    P = a t t   (pattern)

    Find the occurrences of P in T: h a t t i v a t t i

    Pattern synthesis: #(t) = 4, #(atti) = 2, #(t****t) = 2
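The counts on this slide can be checked with a brute-force sketch. It is not an index structure, only a regular-expression scan; the function name `occurrences` and the choice of `*` as the single-symbol wildcard are assumptions made for this illustration.

```python
import re

def occurrences(pattern, text, wildcard="*"):
    """Return 1-based start positions of pattern in text.

    Each wildcard character matches exactly one arbitrary symbol.
    """
    regex = "".join("." if c == wildcard else re.escape(c) for c in pattern)
    # A lookahead keeps matches zero-width, so overlapping occurrences are found.
    return [m.start() + 1 for m in re.finditer(f"(?=({regex}))", text)]

T = "hattivatti"
print(occurrences("att", T))                 # positions of P = att
print(len(occurrences("t", T)),              # #(t)
      len(occurrences("atti", T)),           # #(atti)
      len(occurrences("t****t", T)))         # #(t****t)
```

The scan takes time proportional to |T| per pattern, which is exactly what the indexing problem on the next slide tries to avoid.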


    Pattern finding & synthesis problems

    T = t_1 t_2 ... t_n, P = p_1 p_2 ... p_m, strings of symbols in a finite alphabet

    Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast

    static text, any given pattern P

    Pattern synthesis problem: Learn from T new patterns that occur surprisingly often

    What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, ...


    1. Suffix tree

    2. Suffix array

    3. Some applications

    4. Finding motifs


    The suffix tree Tree(T) of T

    data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T

    linear size: |Tree(T)| = O(|T|)

    can be constructed in linear time O(|T|)

    has "myriad virtues" (A. Apostolico)

    is well-known: 366 000 Google hits


    Suffix trie and suffix tree

    [Figure: Trie(abaab), one symbol per arc, next to Tree(abaab), its compacted version with substring-labeled arcs]


    Trie(T) can be large

    |Trie(T)| = O(|T|^2)

    bad example: T = a^n b^n

    Trie(T) can be seen as a DFA: language accepted = the suffixes of T

    minimize the DFA => directed acyclic word graph (DAWG)
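The quadratic growth for T = a^n b^n can be observed directly. The sketch below relies on the fact that Trie(T) has one node per distinct substring of T plus the root; `trie_size` is a hypothetical helper name and the computation is brute force, meant only for small inputs.

```python
def trie_size(T):
    """Number of nodes of Trie(T): one per distinct substring, plus the root."""
    substrings = {T[i:j] for i in range(len(T)) for j in range(i + 1, len(T) + 1)}
    return len(substrings) + 1

for n in (1, 2, 4, 8, 16):
    T = "a" * n + "b" * n
    # |T| = 2n, while the trie has (n + 1)^2 nodes
    print(len(T), trie_size(T))
```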


    Tree(T) is of linear size

    only the internal branching nodes and the leaves represented explicitly

    edges labeled by substrings of T

    v = node(α) if the path from root to v spells α

    one-to-one correspondence of leaves and suffixes

    |T| leaves, hence < |T| internal nodes

    |Tree(T)| = O(|T| + size(edge labels))


    Tree(hattivatti)

    [Figure: the suffix tree of hattivatti, drawn next to its ten suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]

    substring labels of edges represented as pairs of pointers


    Tree(hattivatti)

    [Figure: the same tree with leaves numbered 1..10 by suffix start position and edge labels given as pointer pairs into T, e.g. 6,10 (vatti), 2,5 (atti), 4,5 (ti), 3,3 (t)]


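    Tree(T) is full-text index

    [Figure: the path for P in Tree(T) leads to a subtree whose leaves 31, 8, ... are the occurrences of P]

    P occurs in T at locations 8, 31, ...

    P occurs in T <=> P is a prefix of some suffix of T <=> path for P exists in Tree(T)

    All occurrences of P in time O(|P| + #occ)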
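The equivalence "P occurs in T <=> path for P exists" can be illustrated with an uncompacted suffix trie. This is a simplified stand-in for Tree(T): it uses quadratic space and stores the occurrence lists at every node instead of collecting them from the leaves. The names `build_index` and `find` are hypothetical.

```python
def build_index(T):
    """Suffix trie in which every node stores the 1-based start positions
    of the suffixes passing through it (quadratic space, for illustration)."""
    root = {"sons": {}, "starts": []}
    for start in range(len(T)):
        node = root
        for c in T[start:]:
            node = node["sons"].setdefault(c, {"sons": {}, "starts": []})
            node["starts"].append(start + 1)
    return root

def find(index, P):
    """Follow the path spelling P; an existing path means P occurs in T."""
    node = index
    for c in P:
        node = node["sons"].get(c)
        if node is None:
            return []          # no path for P, hence no occurrence
    return node["starts"]

index = build_index("hattivatti")
print(find(index, "att"))
```

Once the index is built, a query touches only |P| nodes, independent of |T|.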


    Find att from Tree(hattivatti)

    [Figure: Tree(hattivatti) with the path spelling att highlighted; the leaves below it, 2 and 7, are the occurrences of att]


    Linear time construction of Tree(T)

    [Figure: the ten suffixes of hattivatti]

    Weiner (1973), "algorithm of the year"

    McCreight (1976)

    on-line algorithm (Ukkonen 1992)


    On-line construction of Trie(T)

    T = t_1 t_2 ... t_n $

    P_i = t_1 t_2 ... t_i, the i-th prefix of T

    on-line idea: update Trie(P_i) to Trie(P_{i+1}) => very simple construction


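    Trie(abaab)

    [Figure: Trie(a), Trie(ab) and Trie(aba), the first steps of the on-line construction; the end points of the current suffixes abaa, baa, aa, a are marked]

    chain of links connects the end points of current suffixes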


    Trie(abaab)

    [Figure: Trie(abaa), with its suffix links, being extended to Trie(abaab)]

    Add next symbol = b

    From here on b-arc already exists


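    What happens in Trie(P_i) => Trie(P_{i+1}) ?

    [Figure: the trie before and after adding symbol a_i: a new a_i-arc and a new node are added at each end point along the suffix-link chain, and the new nodes are connected by new suffix links]

    From here on the a_i-arc exists already => stop updating here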


    What happens in Trie(P_i) => Trie(P_{i+1}) ?

    time: O(size of Trie(T))

    suffix links: slink(node(aα)) = node(α)


    On-line procedure for suffix trie

    1. Create Trie(t_1): nodes root and v, an arc son(root, t_1) = v, and suffix links slink(v) := root and slink(root) := root

    2. for i := 2 to n do begin

    3.   v_{i-1} := leaf of Trie(t_1 ... t_{i-1}) for string t_1 ... t_{i-1} (i.e., the deepest leaf)

    4.   v := v_{i-1}; v' := 0

    5.   while node v has no outgoing arc for t_i do begin

    6.     Create a new node v'' and an arc son(v, t_i) = v''

    7.     if v' ≠ 0 then slink(v') := v''

    8.     v := slink(v); v' := v'' end

    9.   for the node v'' such that v'' = son(v, t_i) do
           if v'' = v' then slink(v') := root else slink(v') := v''
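The procedure above translates almost line by line into Python. In this sketch the class `TrieNode`, the function names, and the variables `prev` and `first_new` are illustrative choices, not names from the slides; `T` is assumed to be non-empty.

```python
class TrieNode:
    def __init__(self):
        self.son = {}      # symbol -> child node
        self.slink = None  # suffix link

def online_suffix_trie(T):
    """Build Trie(T) for a non-empty string T, one symbol at a time."""
    root = TrieNode()
    root.slink = root
    deepest = root.son[T[0]] = TrieNode()   # step 1: Trie(t_1)
    deepest.slink = root
    for c in T[1:]:                          # step 2
        v, prev, first_new = deepest, None, None
        while c not in v.son:                # step 5
            new = v.son[c] = TrieNode()      # step 6
            if prev is None:
                first_new = new              # becomes the next deepest leaf
            else:
                prev.slink = new             # step 7
            v, prev = v.slink, new           # step 8
        target = v.son[c]                    # step 9
        prev.slink = root if target is prev else target
        deepest = first_new
    return root

def spelled_strings(node, prefix=""):
    """All strings spelled by root-to-node paths (the root spells '')."""
    yield prefix
    for c, child in node.son.items():
        yield from spelled_strings(child, prefix + c)

root = online_suffix_trie("abaab")
print(sorted(spelled_strings(root)))
```

Every node of the result spells a distinct substring of T, so the work is proportional to the size of the trie, as stated on the previous slide.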


    Suffix trees on-line

    compacted version of the on-line trie construction: simulate the construction on the linear size tree instead of the trie => time O(|T|)

    all trie nodes are conceptually still needed => implicit and real nodes


    Implicit and real nodes

    Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree(T) and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v,α) is a real node (= v).

    Let v = node(β) in Tree(T). Then implicit node (v,α) represents node(βα) of Trie(T)


    Implicit node

    [Figure: a real node v and an implicit node (v,α) located on an arc leaving v]


    Suffix links and open arcs

    [Figure: the root, a node v reached by a path starting with symbol a, its suffix link slink(v), and a leaf w]

    label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T


    Big picture

    suffix link path traversed: total work O(n)

    new arcs and nodes created: total work O(size(Tree(T)))


    On-line procedure for suffix tree

    Input: string T = t_1 t_2 ... t_n $

    Output: Tree(T)

    Notation: son(v,α) = w iff there is an arc from v to w with label α

    son(v,ε) = v

    Function Canonize(v,α):

      while son(v,α') ≠ 0 where α = α'α'', |α'| > 0 do
        v := son(v,α'); α := α''

      return (v,α)


    Suffix-tree on-line: main procedure

    Create Tree(t_1); slink(root) := root

    (v,α) := (root,ε) /* (v,α) is the start node */

    for i := 2 to n+1 do

      v' := 0

      while there is no arc from v with label prefix αt_i do

        if α ≠ ε then /* divide the arc w = son(v,αη) into two */

          son(v,α) := v''; son(v'',t_i) := v'''; son(v'',η) := w

        else

          son(v,t_i) := v'''; v'' := v

        if v' ≠ 0 then slink(v') := v''

        v' := v''; v := slink(v); (v,α) := Canonize(v,α)

      if v' ≠ 0 then slink(v') := v

      (v,α) := Canonize(v,αt_i) /* (v,α) = start node of the next round */
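A runnable sketch of the on-line construction is given below. It follows the published formulation of Ukkonen's algorithm, which differs from the slide notation in two ways that should be read as assumptions of this sketch: an auxiliary `bottom` node sits above the root (instead of slink(root) := root), and the implicit node (v,α) is represented by a node `s` plus a pointer pair `(k, p)` into T. Open arcs simply end at the last position of the whole text. All identifiers are illustrative.

```python
class Node:
    def __init__(self):
        self.edges = {}    # first symbol -> (k, p, child); label is T[k..p] inclusive
        self.slink = None

def build_suffix_tree(T):
    """Suffix tree of T; T should end with a unique terminator such as '$'."""
    n = len(T)
    root, bottom = Node(), Node()
    root.slink = bottom    # bottom behaves as if it had an arc to root for every symbol

    def canonize(s, k, p):
        # Move s down until it is the closest real node above the point (s, T[k..p]).
        while k <= p:
            if s is bottom:
                s, k = root, k + 1
                continue
            k2, p2, child = s.edges[T[k]]
            if p2 - k2 > p - k:
                break
            k += p2 - k2 + 1
            s = child
        return s, k

    def test_and_split(s, k, p, t):
        # Is the point already followed by t? If not, make the point a real node.
        if k <= p:
            k2, p2, child = s.edges[T[k]]
            if t == T[k2 + p - k + 1]:
                return True, s
            r = Node()
            s.edges[T[k2]] = (k2, k2 + p - k, r)
            r.edges[T[k2 + p - k + 1]] = (k2 + p - k + 1, p2, child)
            return False, r
        if s is bottom or t in s.edges:
            return True, s
        return False, s

    s, k = root, 0
    for i in range(n):
        oldr = root
        end_point, r = test_and_split(s, k, i - 1, T[i])
        while not end_point:
            r.edges[T[i]] = (i, n - 1, Node())      # new leaf on an open arc
            if oldr is not root:
                oldr.slink = r
            oldr = r
            s, k = canonize(s.slink, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, T[i])
        if oldr is not root:
            oldr.slink = s
        s, k = canonize(s, k, i)
    return root

def count_leaves(node):
    if not node.edges:
        return 1
    return sum(count_leaves(child) for _, _, child in node.edges.values())

def contains(root, T, P):
    """True iff the path for P exists in the tree built for T."""
    node, j = root, 0
    while j < len(P):
        if P[j] not in node.edges:
            return False
        k, p, child = node.edges[P[j]]
        m = min(p - k + 1, len(P) - j)
        if T[k:k + m] != P[j:j + m]:
            return False
        j, node = j + m, child
    return True

T = "hattivatti$"
tree = build_suffix_tree(T)
print(count_leaves(tree), contains(tree, T, "att"), contains(tree, T, "tta"))
```

With the terminator in place there is one leaf per suffix, and substring queries follow at most |P| symbols.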


    The actual time and space

    |Tree(T)| is about 20|T| in practice

    brute-force construction is O(|T| log|T|) for random strings as the average depth of internal nodes is O(log|T|)

    difference between linear and brute-force constructions not necessarily large (Giegerich & Kurtz)

    truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003)

    alphabet independent linear time (Farach 1997)


    1. Suffix tree

    2. Suffix array

    3. Some applications

    4. Finding motifs


    Suffix array: example

    suffix array = lexicographic order of the suffixes

    SA    suffix
    11    (empty suffix)
    7     atti
    2     attivatti
    1     hattivatti
    10    i
    5     ivatti
    9     ti
    4     tivatti
    8     tti
    3     ttivatti
    6     vatti


    Suffix array

    suffix array SA(T) = an array giving the lexicographic order of the suffixes of T

    space requirement: 5|T|

    practitioners like suffix arrays (simplicity, space efficiency)

    theoreticians like suffix trees (explicit structure)


    Pattern search from suffix array

    [Figure: the sorted suffixes of hattivatti with their start positions; binary search for att locates the contiguous block atti (7), attivatti (2)]
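The binary search can be sketched as follows. The suffix array is built here by plain sorting, which is not one of the efficient constructions discussed next; the function names are hypothetical and positions are 1-based as on the slides.

```python
def suffix_array(T):
    """1-based suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def search(T, SA, P):
    """All 1-based occurrences of P, found by two binary searches over SA."""
    def prefix(i):
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                       # first suffix whose |P|-prefix is >= P
        mid = (lo + hi) // 2
        if prefix(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(SA)
    while lo < hi:                       # first suffix whose |P|-prefix is > P
        mid = (lo + hi) // 2
        if prefix(SA[mid]) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])

T = "hattivatti"
SA = suffix_array(T)
print(SA)
print(search(T, SA, "att"))
```

Each comparison inspects at most |P| symbols, so a query costs O(|P| log|T|) plus the size of the answer.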


    Recent suffix array constructions

    Manber & Myers (1990): O(|T| log|T|)

    linear time via suffix tree

    January / June 2003: direct linear time construction of suffix array

    - Kim, Sim, Park, Park (CPM03)

    - Kärkkäinen & Sanders (ICALP03)

    - Ko & Aluru (CPM03)


    Kärkkäinen-Sanders algorithm

    1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.

    2. Construct the suffix array of the remaining suffixes using the result of the first step.

    3. Merge the two suffix arrays into one.


    Running example

    i      = 0 1 2 3 4 5 6 7 8 9 10 11
    T[0,n) = y a b b a d a b b a d  o  0 0 ...

    SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)


    Step 0: Construct a sample

    for k = 0,1,2: B_k = {i in [0,n] | i mod 3 = k}

    C = B_1 U B_2: sample positions

    S_C: sample suffixes

    Example: B_1 = {1,4,7,10}, B_2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}


    Step 1 (cont.)

    once the sample suffixes are sorted, assign a rank to each: rank(S_i) = the rank of S_i in S_C; rank(S_{n+1}) = rank(S_{n+2}) = 0

    Example: R = [abb][ada][bba][do0][bba][dab][bad][o00]

    R' = (1,2,4,6,4,5,3,7)

    SA_R' = (8,0,1,6,4,2,5,3,7)

    i         = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
    rank(S_i) = - 1 4 - 2 6 - 5 3 -  7  8  -  0  0
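Steps 0 and 1 can be reproduced on the running example with the sketch below. Two simplifications are assumptions of this illustration: the triples are named with a comparison sort instead of a radix sort, and the recursive call on R' is replaced by a plain sort of its suffixes. The text is assumed not to contain the padding character `\x00`.

```python
def sample_ranks(T):
    """Return (C, R, rank) for the sample suffixes of T, with 0-based positions."""
    n = len(T)
    padded = T + "\x00" * 3                        # the 0 0 ... padding of the slides
    B1 = [i for i in range(n + 1) if i % 3 == 1]
    B2 = [i for i in range(n + 1) if i % 3 == 2]
    C = B1 + B2                                    # sample positions
    triples = [padded[i:i + 3] for i in C]         # the string R of triples
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    R = [names[t] for t in triples]                # R': triples replaced by their names
    # Stand-in for the recursive suffix array construction of R'.
    order = sorted(range(len(R)), key=lambda j: R[j:])
    rank = {C[j]: r for r, j in enumerate(order, start=1)}
    return C, R, rank

C, R, rank = sample_ranks("yabbadabbado")
print(C)
print(R)
print([rank.get(i, "-") for i in range(13)])
```

In the real algorithm the sort of the suffixes of R' is the recursive step on a string of two thirds the length; everything else in this function is linear-time work once radix sort is used.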


    Step 2: Sort nonsample suffixes

    for each non-sample suffix S_i in S_B0 (note that rank(S_{i+1}) is always defined for i in B_0):

    S_i ≤ S_j <=> (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))

    radix sort the pairs (t_i, rank(S_{i+1})).

    Example: S_12 < S_6 < S_9 < S_3 < S_0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)


    Step 3: Merge

    merge the two sorted sets of suffixes using a standard comparison-based merging:

    to compare S_i in S_C with S_j in S_B0, distinguish two cases:

    i in B_1: S_i ≤ S_j <=> (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))

    i in B_2: S_i ≤ S_j <=> (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))

    note that the ranks are defined in all cases!

    S_1 < S_6 as (a,4) < (a,5) and S_3 < S_8 as (b,a,6) < (b,a,7)
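Steps 2 and 3 can be sketched in the same spirit. As an explicit shortcut, the ranks of the sample suffixes are obtained here by directly sorting them (instead of the recursion of Step 1), and Step 2 uses a comparison sort rather than a radix sort; only the pair and triple comparisons follow the slides. The function name is hypothetical and positions are 0-based.

```python
def suffix_array_by_merging(T):
    """One non-recursive level of the sample/non-sample scheme."""
    n = len(T)
    C = [i for i in range(n + 1) if i % 3 != 0]     # sample positions
    B0 = [i for i in range(n + 1) if i % 3 == 0]    # non-sample positions

    # Stand-in for Step 1: sort the sample suffixes directly.
    SC = sorted(C, key=lambda i: T[i:])
    rank = {i: r for r, i in enumerate(SC, start=1)}
    rank[n + 1] = rank[n + 2] = 0

    def t(i):
        return T[i] if i < n else ""                # "" plays the role of the 0 padding

    # Step 2: sort non-sample suffixes by the pairs (t_i, rank(S_{i+1})).
    SB0 = sorted(B0, key=lambda i: (t(i), rank[i + 1]))

    # Step 3: merge, comparing S_i (sample) with S_j (non-sample).
    def less(i, j):
        if i % 3 == 1:
            return (t(i), rank[i + 1]) < (t(j), rank[j + 1])
        return (t(i), t(i + 1), rank[i + 2]) < (t(j), t(j + 1), rank[j + 2])

    sa, a, b = [], 0, 0
    while a < len(SC) and b < len(SB0):
        if less(SC[a], SB0[b]):
            sa.append(SC[a])
            a += 1
        else:
            sa.append(SB0[b])
            b += 1
    return sa + SC[a:] + SB0[b:]

print(suffix_array_by_merging("yabbadabbado"))
```

The result includes position n (the empty suffix), matching the SA of the running example.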

    d_ID(A,B) = 5, d_Levenshtein(A,B) = 4

    mutation costs => probabilistic modeling

    evaluation by dynamic programming => alignment


    A\B      s  t  o  c  k  h  o  l  m
          0  1  2  3  4  5  6  7  8  9
    t     1  2  1  2  3  4  5  6  7  8
    u     2  3  2  3  4  5  6  7  8  9
    k     3  4  3  4  5  4  5  6  7  8
    h     4  5  4  5  6  5  4  5  6  7
    o     5  6  5  4  5  6  5  4  5  6
    l     6  7  6  5  6  7  6  5  4  5
    m     7  8  7  6  7  8  7  6  5  4
    a     8  9  8  7  8  9  8  7  6  5

    d_{i,j} = min(if a_i = b_j then d_{i-1,j-1} else ∞, d_{i-1,j} + 1, d_{i,j-1} + 1)

    d_ID(A,B) = 5 (bottom-right entry)

    optimal alignment by trace-back
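The recurrence can be evaluated directly; the sketch below fills the whole table for A = tukholma and B = stockholm, using a border d_{i,0} = i, d_{0,j} = j as read off the first row and column of the table. The function name is an illustrative choice.

```python
def indel_distance_table(A, B):
    """Table d[i][j] of insertion/deletion distances between prefixes of A and B."""
    INF = float("inf")
    d = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(len(A) + 1):
        for j in range(len(B) + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j                     # only insertions or deletions
            else:
                match = d[i - 1][j - 1] if A[i - 1] == B[j - 1] else INF
                d[i][j] = min(match, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d

table = indel_distance_table("tukholma", "stockholm")
for row in table:
    print(row)
print(table[-1][-1])   # d_ID(A,B)
```

Tracing back from the bottom-right entry along the choices that attained each minimum yields an optimal alignment.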


    Search problem

    find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') is small

    dynamic programming with small modification: O(mn)

    lots of (practical) improvement tricks

    [Figure: pattern P aligned against an approximate occurrence P' inside text T]
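The "small modification" is commonly taken to be a zero top row, so that an occurrence may start anywhere in T. The sketch below makes that assumption and also assumes unit-cost edit distance (substitutions allowed); the function name and the threshold parameter `k` are illustrative.

```python
def approx_search(P, T, k):
    """1-based end positions j in T where some substring ending at j
    is within edit distance k of P. Runs in O(|P| * |T|) time."""
    m = len(P)
    col = list(range(m + 1))        # column for the empty text prefix: d[i] = i
    ends = []
    for j, c in enumerate(T, start=1):
        prev_diag, col[0] = col[0], 0               # zero top row: free start
        for i in range(1, m + 1):
            cur = min(prev_diag + (P[i - 1] != c),  # match or substitution
                      col[i] + 1,                   # extra text symbol
                      col[i - 1] + 1)               # pattern symbol left out
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_search("atti", "hattivatti", 0))
print(approx_search("vati", "hattivatti", 1))
```

Only one column is kept at a time, so the space used is O(|P|).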


    Index for approximate searching?

    dynamic programming: P x Tree(T) with backtracking

    [Figure: pattern P evaluated against the paths of Tree(T)]


    1. Suffix tree

    2. Suffix array

    3. Some applications

    4. Finding motifs


    Gapped motifs

    pattern: a##bc

    text: ababbbcccbcaaabca

    [Figure: the pattern a##bc aligned under its occurrences in the text]


    Gapped motifs of T

    gapped pattern: P in (A U {#})*

    gap symbol # matches any symbol in A

    example: aa##bb#b

    L(P) = occurrence locations of P in T

    P is called a motif of T if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q.

    Problem: find occurrence count |L(P)| for all gapped motifs P of T

    a^n b a^n has exponentially many motifs!


    Maximal motifs

    motif P' is the maximal version of motif P if P' has the largest possible number of non-# symbols among motifs that have the same set of occurrence locations as P

    every motif has unique maximal version

    unfortunately still exponential number of different maximal motifs
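The location set L(P) and part of the maximal-version computation can be sketched by brute force. The helper `fill_gaps` below covers only one ingredient, namely replacing a gap by a symbol when all occurrences agree at that position; extending the motif at its ends, which a full maximal version would also have to consider, is deliberately left out. Both function names are hypothetical and locations are 1-based.

```python
def locations(P, T, gap="#"):
    """L(P): 1-based start positions where the gapped pattern P matches T."""
    return [i + 1 for i in range(len(T) - len(P) + 1)
            if all(p == gap or p == T[i + j] for j, p in enumerate(P))]

def fill_gaps(P, T, gap="#"):
    """Replace each gap of P by a symbol if all occurrences agree on it.

    The set of occurrence locations is unchanged by this operation.
    """
    L = locations(P, T, gap)
    filled = []
    for j, p in enumerate(P):
        symbols = {T[i - 1 + j] for i in L}
        if p == gap and len(symbols) == 1:
            p = symbols.pop()
        filled.append(p)
    return "".join(filled)

print(locations("a##bc", "ababbbcccbcaaabca"))
print(fill_gaps("a#t#", "hattivatti"))
```

In the second call both occurrences of a#t# in hattivatti read atti, so every gap can be filled without losing an occurrence.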


    Blocks of maximal motifs

    aaa##b##ba has blocks aaa, b, ba

    Lemma: Maximal 1-block motifs <=> (branching) nodes of Tree(T)

    Thm: Each block of a maximal motif of T is a maximal substring motif of T. Hence there are O(n) different strings that can be used as a block of a maximal motif.

    => There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].


    Counting 2-block maximal motifs

    Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

    cf. Arimura et al. (2000), Apostolico et al. (2004), ...


    Algorithm (very simple)

    [Figure: a 2-block motif (X,d,Y): block X and block Y at distance d]

    for each maximal substring motif X    [O(n)]

      for each distance d = 1,2,...    [O(n)]

        mark the leaves of Tree(T) that correspond to locations L(X) + d

        for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in Tree(T)    [O(n)]

        the occurrence count of motif (X,d,Y) is h(Y)
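The marking idea can be sketched with plain sets standing in for the marked leaves of Tree(T). This brute-force version does not achieve the O(n^3) bound; it assumes that d is the offset from the start of X to the start of Y, that the end of the text counts as a distinct continuation (as with a terminator `$`, which must not occur in T), and it uses 0-based locations. All names are illustrative.

```python
from collections import defaultdict

def branching_substrings(T):
    """Substrings followed by at least two distinct continuations; these
    label the branching nodes of the suffix tree of T plus terminator."""
    followers = defaultdict(set)
    n = len(T)
    for i in range(n):
        for j in range(i + 1, n + 1):
            followers[T[i:j]].add(T[j] if j < n else "$")
    return {s for s, f in followers.items() if len(f) >= 2}

def locations(T, s):
    """0-based start positions of the substring s in T."""
    return {i for i in range(len(T) - len(s) + 1) if T.startswith(s, i)}

def count_two_block(T, X, d, Y):
    """Occurrence count h(Y) of the 2-block motif (X,d,Y)."""
    marked = {i + d for i in locations(T, X)}    # "mark the leaves L(X) + d"
    return len(marked & locations(T, Y))         # marked leaves below Y

T = "hattivatti"
print(sorted(branching_substrings(T)))
print(count_two_block(T, "t", 5, "t"))           # the motif t****t
```

With X = Y = t and d = 5 this reproduces the count #(t****t) = 2 from the beginning of the talk.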


    Counting 2-block maximal motifs (cont.)

    Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

    flexible gaps: x*y, where * = gap of any length

    Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).

    k-block case?


    Conclusion

    suffix structures provide very efficient algorithmic tools for finding and learning potentially interesting patterns in strings


    General case

    Q1: Given q and W, does T have a motif with at least W non-gap symbols and at least q occurrences?

    In k-block case, is O(n^(2k-1)) (or even better) time possible?

    related work: H. Arimura & al, A. Apostolico & al, M-F. Sagot, L. Parida, N. Pisanti, ...