
Advances in discrete energy minimisation for computer vision


String algorithms

Last time:

• Knuth-Morris-Pratt algorithm
  - find pattern P in string T
  - preprocess P (construct Fail array)
  - search time: O(n+m), n=|T|, m=|P|

Today:

• Data structures for strings
  - trie
  - suffix tree & generalized suffix tree
  - suffix array
• Many applications
  - e.g. computing the longest common substring

Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree
• Edges labeled with letters
• Leaves correspond to input strings

[Figure: trie for { had, held, help, hi }. The root has a single edge h; below it are edges a, e and i. Edge a continues with d to the leaf had, edge e continues with l and then branches into d (leaf held) and p (leaf help), and edge i ends at the leaf hi.]
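The construction of the trie for { had, held, help, hi } can be sketched in a few lines. This is only an illustration, not code from the lecture: the function name `build_trie` and the nested-dictionary representation (one dictionary per node, one key per outgoing edge letter) are assumptions made for the example.

```python
def build_trie(words):
    """Build a trie as nested dicts; each key is the letter on an outgoing edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # Follow the edge labeled `letter`, creating it if it is missing.
            node = node.setdefault(letter, {})
    return root


trie = build_trie(["had", "held", "help", "hi"])
print(sorted(trie))                  # edges leaving the root
print(sorted(trie["h"]))             # edges below h
print(sorted(trie["h"]["e"]["l"]))   # edges below hel
```

Words sharing a prefix share the corresponding path, which is why the four words need only one edge h at the root.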

Trie data structure

• Cannot represent a set that contains both a string T and a longer string TT': T would end at an internal node, not at a leaf
• Solution: append special symbol $ to the end of each word

[Figure: trie for { had$, held$, help$, hi$, h$ }. The word h, which is a prefix of the other words, now ends at its own leaf h$.]
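A minimal sketch of the terminator trick, assuming the nested-dictionary representation of a trie and that the symbol $ does not occur in any word. The names `END`, `insert` and `contains` are hypothetical and chosen for this example only.

```python
END = "$"  # terminator symbol; assumed not to occur inside any stored word


def insert(root, word):
    """Insert word followed by the terminator, one edge per letter."""
    node = root
    for letter in word + END:
        node = node.setdefault(letter, {})


def contains(root, word):
    """Return True if exactly this word (not just a prefix) was inserted."""
    node = root
    for letter in word + END:
        if letter not in node:
            return False
        node = node[letter]
    return True


root = {}
for w in ["had", "held", "help", "hi", "h"]:
    insert(root, w)

print(contains(root, "h"))    # stored word that is a prefix of others
print(contains(root, "he"))   # prefix of stored words, but not stored itself
```

The lookup follows one edge per letter, so a word of length n is found in O(n) steps when the alphabet has constant size.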

Trie data structure

• Application: storing dictionaries
• Space: O(m1+...+mk), mi = length of string Ti
• Looking up word of length n: O(n)
  - assuming alphabet has constant size
• Alternative to binary trees
  - pros and cons (see Wikipedia)

[Figure: the trie for { had$, held$, help$, hi$, h$ } next to a binary tree storing h, had, held, help, hi. In the binary tree all nodes store items, not just leaves, and the edges are unlabeled.]

Tries for storing suffixes

• This lecture: use tries for storing the set of suffixes of string T

[Figure: trie storing the suffixes of cacao$: cacao$, acao$, cao$, ao$, o$, $.]
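To make the suffix trie concrete, the sketch below inserts every suffix of cacao$ into a nested-dictionary trie and counts its nodes. The helper names are assumptions for this example. Note that this uncompacted trie can have a number of nodes quadratic in the length of T, which motivates the space-saving tricks that follow.

```python
def suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at node (including node)."""
    return 1 + sum(count_nodes(child) for child in node.values())


def count_leaves(node):
    """Number of leaves in the subtree rooted at node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = suffix_trie("cacao$")
print(count_leaves(trie), count_nodes(trie))
```

Each of the six suffixes ends at its own leaf because the terminator $ occurs only at the end of the text.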

Reducing space requirements

• Trick 1: Delete non-branching nodes, merge labels (compact trie)

[Figure: compact trie for the suffixes of cacao$ (cacao$, acao$, cao$, ao$, o$, $). Root edges: a, ca, o$, $. The node below a and the node below ca each have two leaf edges, cao$ and o$.]

Reducing space requirements

• # leaves: n+1
• # internal nodes: at most n
• # edges: at most 2n
• Trick 2: Store edge labels via two pointers into the string (begin & end)

[Figure: compact trie for the suffixes of cacao$.]

=> O(n) storage

Suffix trees

• The compact trie of all suffixes of T is called the suffix tree for string T [Weiner’73]
• D. Knuth: “algorithm of the year 1973”
• Can be constructed in linear time
  - e.g. [McCreight’76], [Ukkonen’95]
  - algorithms are complicated [will not be covered]
• Used for all sorts of string problems

[Figure: suffix tree for cacao$.]

Suffix trees for string matching

Does text T contain pattern P?

[Figure: suffix tree for cacao$.]

• Construct suffix tree for T in O(|T|) time
• Query takes O(|P|) time!
• Same as Knuth-Morris-Pratt, but...
• Scenario: text T is fixed, patterns P1,...,Pk come one at a time
  - T is very long
• Time: O(|T| + |P1| + |P2| + ... + |Pk|)
• Knuth-Morris-Pratt can be generalized to multiple patterns (Aho-Corasick alg.). Same complexity as above, but requires knowledge of P1,...,Pk in advance
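The query itself is a walk from the root along the letters of P. The sketch below shows this for patterns arriving one at a time against a fixed text. As a simplification it uses an uncompacted suffix trie of nested dictionaries, which is built in quadratic rather than linear time; only the query cost of one step per pattern letter matches the suffix tree. The function names are assumptions for this example.

```python
def suffix_trie(text):
    """Uncompacted trie of all suffixes of text (quadratic construction)."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def contains(trie, pattern):
    """P occurs in T iff P is a prefix of some suffix of T."""
    node = trie
    for letter in pattern:
        if letter not in node:
            return False
        node = node[letter]
    return True


trie = suffix_trie("cacao$")          # built once for the fixed text
for pattern in ["cao", "aca", "coa"]:  # patterns are not known in advance
    print(pattern, contains(trie, pattern))
```

The structure is built once and then reused, so each new pattern costs only its own length.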

Getting extra information

Task: Get all occurrences of P in T (give a list of start positions)

[Figure: suffix tree for cacao$.]

• Can be done in O(|P|+c) time where c is # of occurrences
  - Locate the node corresponding to P
  - Traverse all leaves of the subtree rooted at this node
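The two steps, locating the node for P and collecting the leaves below it, can be sketched as follows. The example again uses an uncompacted suffix trie for brevity, so the traversal may visit more than O(c) nodes; in the compact suffix tree the subtree has O(c) nodes. The class `Node`, its field `start`, and the function names are assumptions for this example; positions are 0-based.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.start = None  # at a leaf: start position of the suffix ending here


def build(text):
    """Suffix trie of text; text must end with a unique terminator."""
    root = Node()
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
        node.start = start
    return root


def occurrences(root, pattern):
    # Step 1: locate the node corresponding to the pattern.
    node = root
    for letter in pattern:
        node = node.children.get(letter)
        if node is None:
            return []
    # Step 2: every leaf below that node is a suffix starting with the pattern.
    found, stack = [], [node]
    while stack:
        v = stack.pop()
        if v.start is not None:
            found.append(v.start)
        stack.extend(v.children.values())
    return sorted(found)


root = build("cacao$")
print(occurrences(root, "ca"))
print(occurrences(root, "a"))
```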

Getting extra information

Task: Get # of occurrences of P in T

[Figure: suffix tree for cacao$.]

• Can be done in O(|P|) time
  - With O(|T|) preprocessing: store at each node the number of leaves in its subtree
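A hedged sketch of counting with preprocessing: each node stores how many suffixes pass through it, which equals the number of leaves below it. Here the counters are filled in while an uncompacted suffix trie is built, so the preprocessing is quadratic rather than O(|T|); on a real suffix tree the counts would be computed in one bottom-up pass. The list layout `[count, children]` and the function names are assumptions for this example.

```python
def build_counts(text):
    """Suffix trie in which each node is a list [count, children]."""
    root = [0, {}]
    for start in range(len(text)):
        node = root
        node[0] += 1
        for letter in text[start:]:
            node = node[1].setdefault(letter, [0, {}])
            node[0] += 1   # one more suffix passes through this node
    return root


def count_occurrences(root, pattern):
    """Walk down along the pattern and read the stored counter."""
    node = root
    for letter in pattern:
        if letter not in node[1]:
            return 0
        node = node[1][letter]
    return node[0]


root = build_counts("cacao$")
print(count_occurrences(root, "ca"))
print(count_occurrences(root, "cacao"))
```

After preprocessing, the query never looks at the subtree, so its cost no longer depends on the number of occurrences.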

Generalized suffix tree

• Input: a set of strings {T1,...,Tk}
• Generalized suffix tree: tree containing all suffixes of all strings
• Tree for { abba, bac }:
  - suffixes of abba$1: abba$1, bba$1, ba$1, a$1, $1
  - suffixes of bac$2: bac$2, ac$2, c$2, $2

[Figure: generalized suffix tree for { abba, bac }. Root edges: a, b, c$2, $1, $2. Below a: bba$1, $1, c$2. Below b: ba$1 and a, and below that a: $1 and c$2.]

Construction of generalized suffix tree for {X, Y}

• Construct suffix tree for X$1Y$2
• Edges leading to red leaves are labeled as “...$1Y$2”; delete “Y$2”

[Figure: suffix tree for abba$1bac$2, with suffixes abba$1bac$2, bba$1bac$2, ba$1bac$2, a$1bac$2, $1bac$2, bac$2, ac$2, c$2, $2. The red leaves are the suffixes that start inside X = abba; after deleting “bac$2” from their edge labels they read abba$1, bba$1, ba$1, a$1, $1.]

Construction of generalized suffix tree for {T1,...,Tk}

• Can use the same trick
  - construct suffix tree for T1$1...Tk$k
• Linear runtime: O(|T1|+...+|Tk|)
• In practice, modify the algorithm (e.g. Ukkonen’s alg.) to get slightly better performance

Computing longest common substring of {T1,T2}

• Mark each internal node with red (blue) if it has at least one red (blue) child
• Nodes with both colors correspond to common substrings; the one whose string is longest gives the longest common substring

[Figure: generalized suffix tree for { abba, bac }, with leaves colored red or blue according to the string they come from (suffixes abba$1, bba$1, ba$1, a$1, $1 and bac$2, ac$2, c$2, $2).]
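The coloring idea can be tried out on a small example. The sketch below is a simplification under stated assumptions: it uses an uncompacted generalized suffix trie (quadratic size), and it records colors on every node while inserting suffixes instead of propagating them upward from the leaves, which gives the same marking. The function name and the node layout `[colors, children]` are hypothetical.

```python
def longest_common_substring(t1, t2):
    # Node layout: [set of colors, children dict]; color 1 = t1, color 2 = t2.
    root = [set(), {}]
    for color, text in ((1, t1), (2, t2)):
        for start in range(len(text)):
            node = root
            for letter in text[start:]:
                node = node[1].setdefault(letter, [set(), {}])
                node[0].add(color)
    # Search the two-colored part of the trie for the deepest node.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for letter, child in node[1].items():
            if child[0] == {1, 2}:          # string occurs in both inputs
                s = prefix + letter
                if len(s) > len(best):
                    best = s
                stack.append((child, s))
    return best


print(longest_common_substring("abba", "bac"))
```

Only nodes carrying both colors are expanded, because every prefix of a common substring is itself common.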

Computing longest common substring of {T1,...,TK}

Task: For each r = 2,...,K compute the length of the longest substring contained in at least r strings

1. Construct generalized suffix tree for {T1,...,TK}
2. Mark each node v with color k=1,...,K if its subtree has a leaf with that color
3. Compute c(v) = # of colors at v
   - Naive computation: O(nK), n=|T1|+...+|TK|. Can also be done in O(n)

String corresponding to v is contained in exactly c(v) input strings
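A small sketch of this task under simplifying assumptions: an uncompacted generalized suffix trie stands in for the suffix tree, and the color sets are stored explicitly at every node, which corresponds to the naive O(nK)-style computation rather than the O(n) one. The function name `longest_in_at_least` and the node layout are hypothetical.

```python
def longest_in_at_least(strings):
    """Map r -> length of the longest substring contained in at least r strings."""
    K = len(strings)
    root = [set(), {}]                      # node layout: [colors, children]
    for color, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for letter in text[start:]:
                node = node[1].setdefault(letter, [set(), {}])
                node[0].add(color)
    # best[c] = largest depth of a node v with c(v) == c
    best = [0] * (K + 1)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for child in node[1].values():
            c = len(child[0])
            best[c] = max(best[c], depth + 1)
            stack.append((child, depth + 1))
    # "at least r" is the maximum over all c >= r
    for c in range(K - 1, 0, -1):
        best[c] = max(best[c], best[c + 1])
    return {r: best[r] for r in range(2, K + 1)}


print(longest_in_at_least(["abba", "bac", "cab"]))
```

The final suffix-maximum pass turns "exactly c(v) strings" into "at least r strings" for all r at once.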

Suffix array for text T[1..n]

Suffix array: stores suffixes of string T in a lexicographic order

• SA[i] = id of the i’th smallest suffix
• Number k corresponds to suffix T[k+1..n]

Suffixes of T = mississippi by id:

 k    suffix
 0    mississippi
 1    ississippi
 2    ssissippi
 3    sissippi
 4    issippi
 5    ssippi
 6    sippi
 7    ippi
 8    ppi
 9    pi
10    i
11    (empty)

SA[0..n], the ids in lexicographic order of their suffixes:

 i    SA[i]   suffix
 0    11      (empty)
 1    10      i
 2     7      ippi
 3     4      issippi
 4     1      ississippi
 5     0      mississippi
 6     9      pi
 7     8      ppi
 8     6      sippi
 9     3      sissippi
10     5      ssippi
11     2      ssissippi

• Lexicographic order:
  - X < X... (a string is smaller than any string it is a proper prefix of)
  - Xa... < Xb... if a < b
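The definition translates directly into a few lines of code. This sketch simply sorts the suffix ids by their suffixes, which is far from linear time but reproduces the array for mississippi. The function name `suffix_array` is an assumption, and ids are 0-based, so id k stands for the Python slice `text[k:]`.

```python
def suffix_array(text):
    """Ids 0..n sorted by their suffixes; id n is the empty suffix."""
    n = len(text)
    # Python compares strings lexicographically, and a proper prefix
    # (including the empty string) sorts before any of its extensions.
    return sorted(range(n + 1), key=lambda k: text[k:])


sa = suffix_array("mississippi")
print(sa)
```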

Pattern search from a suffix array

Task: Find pattern P = iss in text T = mississippi (m = |P|, n = |T|)

[Figure: suffix array of mississippi, SA = 11 10 7 4 1 0 9 8 6 3 5 2.]

• All occurrences appear contiguously in the array
• Computing start & end indexes: binary search
• Worst-case: O(m log n)
• In practice, often O(m + log n)
• Can be improved to O(m + log n) worst-case
  - need to store extra 2n numbers
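The two binary searches can be sketched as follows. This is the plain O(m log n) variant without the extra 2n numbers; the function name `find_occurrences` is an assumption, and the suffix array is built by naive sorting only to keep the example self-contained.

```python
def find_occurrences(text, sa, pattern):
    """Start positions (0-based) of pattern in text, using suffix array sa."""
    m = len(pattern)

    def prefix(i):
        # First m letters of the i'th smallest suffix.
        return text[sa[i]:sa[i] + m]

    # Start index: first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # End index: first suffix whose m-letter prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = sorted(range(len(text) + 1), key=lambda k: text[k:])
print(find_occurrences(text, sa, "iss"))
```

Each comparison inspects at most m letters and each search makes O(log n) comparisons, which gives the O(m log n) worst case.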

Construction of suffix arrays

• SA[0..n] can be constructed in linear time
  - from a suffix tree
  - or directly [Kim,Sim,Park,Park’03] [Kärkkäinen&Sanders’03] [Ko & Aluru’03]

[Figure: suffix array of mississippi, SA = 11 10 7 4 1 0 9 8 6 3 5 2.]

Summary

• Suffix trees & suffix arrays: two powerful data structures for strings
• Space requirements:
  - suffix trees: linear but with a large constant, e.g. 20 |T|
  - suffix arrays: more efficient, e.g. 5 |T|
• Suffix arrays are more suited to large alphabets
• Practitioners like suffix arrays (simplicity, space efficiency)
• Theoreticians like suffix trees (explicit structure)

Further information on string algorithms

• Dan Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”, 1997
• Slides mentioning more recent results: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt