CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms


Boyer – Moore algorithm

• Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule

Boyer – Moore algorithm

• Right to left comparison

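[Figure: P aligned under T and compared from its right end; a character x in T mismatches a character y in P, and P is shifted to the right.]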

Skip some chars without missing any occurrence.

Extended bad character rule

char Position in P

a 6, 3

b 7, 4

p 2

t 1

x 5

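T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab

Find the closest occurrence of the character T(k) in P to the left of the mismatch position i, and shift P to align that occurrence with T(k).

Here the mismatch (*) is at i = 5 and T(k) = a. The table lists a at positions 6, 3; the first one to the left of 5 is 3, so shift 5 – 3 = 2.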

Preprocessing O(n)

Restart the comparison here.
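The table lookup above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names bad_char_table and bad_char_shift are made up here, positions are 1-based as on the slide, and a character with no occurrence to the left of i is taken to shift P by i positions.

```python
from collections import defaultdict

def bad_char_table(pattern):
    """Map each character to its 1-based positions in pattern, rightmost first."""
    table = defaultdict(list)
    for pos in range(len(pattern), 0, -1):   # scan right to left
        table[pattern[pos - 1]].append(pos)
    return table

def bad_char_shift(table, i, text_char):
    """Shift for a mismatch at 1-based pattern position i against text_char."""
    for pos in table.get(text_char, []):
        if pos < i:                # closest occurrence to the left of i
            return i - pos
    return i                       # no such occurrence: move P past T(k)

table = bad_char_table("tpabxab")
print(dict(table))                 # positions per character, as in the slide's table
print(bad_char_shift(table, 5, "a"))
```

Building the table takes one pass over P, which is where the O(n) preprocessing bound comes from.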

(Strong) good suffix rule

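In preprocessing: for every suffix t of P, find the rightmost other copy t’ of t in P such that the char to the left of t’ differs from the char to the left of t (z ≠ y).

[Figure: T contains x followed by t. P ends with y followed by t, and contains an earlier copy t’ preceded by z. After x mismatches y, P is shifted so that t’ lines up with the t in T.]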

Example preprocessing

qcabdabdab

char Positions in P

a 9, 6, 3

b 10, 7, 4

c 2

d 8, 5

q 1

q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
0 0 0 0 2 0 0 2 0 0

dabcab

dabdabcabdab

• Bad char rule: where to shift depends on T
• Good suffix rule: where to shift does not depend on T

Tricky case

Pattern: abcab

a b c a b
0 0 0 1 0

* ^ ^

T: x y a a b c a b

shift = 4 – 1 = 3

a b c a b
N N 0 N N

c

b

c

b

i-L


qcabdabdab

char Positions in P

a 9, 6, 3

b 10, 7, 4

c 2

d 8, 5

q 1

q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 3 0 0 3 0

dabcab



dabdabcabdab


qcabdabdab

char Positions in P

a 9, 6, 3

b 10, 7, 4

c 2

d 8, 5

q 1

q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
N N N N 2 N N 2 N N

dabcab


dabdabcabdab


Algorithm KMP: Basic idea

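[Figure: T contains t followed by x. P has a prefix t’ followed by y and, ending at position i, a copy t followed by z. After x mismatches z, P is shifted so that t’ lines up under the t in T and the comparison resumes at y.]

In pre-processing: for each position i in P, find the longest proper suffix t of P[1..i] that matches a prefix t’ of P and is followed by a different char (y ≠ z). For each i, let Sp’(i) = length(t).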

Failure link

P: aataac

a a t a a c

Sp’(i) 0 1 0 0 2 0

aaat

aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (because Sp’(5) = 2).
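A minimal sketch of the failure-link search, assuming a non-empty pattern. The names strong_failure and kmp_search are illustrative only. Sp’ is derived here from the classic failure function rather than from Z-values, which is a swapped-in but equivalent way to get the same table.

```python
def strong_failure(pattern):
    """Return a list f where f[i] is Sp'(i) for 1-based i (f[0] is unused)."""
    n = len(pattern)
    sp = [0] * (n + 1)             # classic failure function, 1-based
    k = 0
    for i in range(2, n + 1):
        while k > 0 and pattern[k] != pattern[i - 1]:
            k = sp[k]
        if pattern[k] == pattern[i - 1]:
            k += 1
        sp[i] = k
    strong = [0] * (n + 1)
    for i in range(1, n + 1):
        k = sp[i]
        # keep k only if the chars after the prefix and after position i differ
        if i == n or k == 0 or pattern[k] != pattern[i]:
            strong[i] = k
        else:
            strong[i] = strong[k]
    return strong

def kmp_search(text, pattern):
    """Return 0-based start positions of all occurrences of pattern in text."""
    fail = strong_failure(pattern)
    hits, j = [], 0                # j = number of pattern chars matched so far
    for pos, ch in enumerate(text):
        while j > 0 and pattern[j] != ch:
            j = fail[j]            # follow the failure link, re-compare ch
        if pattern[j] == ch:
            j += 1
        if j == len(pattern):
            hits.append(pos - j + 1)
            j = fail[j]
    return hits

print(strong_failure("aataac")[1:])
print(kmp_search("aataataac", "aataac"))
```

In the second call the sixth pattern char c fails against t in the text, and the search continues by comparing that same t with pattern position 3.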

FSA

P: aataac

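[Figure: FSA with states 0 to 6; the forward transitions 0→1→2→3→4→5→6 are labeled a, a, t, a, a, c, and extra transitions labeled a and t replace the failure links.]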

All other input goes to state 0

Sp’(i) 0 1 0 0 2 0

aaat

aataac

If the next char in T is t, we go to state 3

Tricky case

Pattern: abcab

a b c a b

0 0 0 0 2

a b bc a

c

Failure link

FSA

dummy

How to actually do pre-processing?

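• Similar pre-processing for KMP and B-M
– Find matches between a suffix and a prefix
– Both can be done in linear time
– P is usually short, so even a more expensive pre-processing may result in a gain overall

[Figure: for KMP, a prefix t’ of P matches a later copy t ending at position i, with different following chars x and y. For B-M, the suffix t of P matches an earlier copy t’, with different preceding chars x and y.]

For each i, find a j. Similar to DP. Start from i = 2.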

Fundamental pre-processing

• Zi: length of the longest substring starting at i that matches a prefix of P
– i.e. t = t’, x ≠ y, Zi = |t|
– With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.

P = a a b c a a b x a a z
Z = 0 1 0 0 3 1 0 0 2 1 0

• How to compute Z-values in linear time?

[Figure: the prefix t’ = P[1..Zi] and the substring t = P[i..i+Zi-1] are equal and are followed by different chars x and y.]

Computing Z in Linear time

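[Figure: P with a prefix t’ followed by x and a matching substring t = P[l..r] followed by y. Position k lies inside t, and k-l+1 is the corresponding position inside t’.]

We have already computed all Z-values up to k-1 and need to compute Zk. We also know the starting and ending points, l and r, of the previous match that extends furthest to the right.

We know that t = t’, therefore the Z-value at k-l+1 may be helpful to us.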

Computing Z in Linear time

• No char inside the box is compared twice. At most one mismatch per iteration.
• Therefore, O(n).

Case 1: The previous r is smaller than k, i.e., no previous match extends to k. Do explicit comparison starting at k.

Case 2: Zk-l+1 < r-k+1. Then Zk = Zk-l+1. No comparison is needed.

Case 3: Zk-l+1 >= r-k+1. Then Zk >= r-k+1. Comparison starts from r+1.
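The three cases can be sketched as follows. This is an illustrative Python version, not the lecture's code: indices are 0-based, the box is kept as a half-open interval [l, r), and Z[0] is set to 0 to follow the convention of the example above. The copy step is used only when the stored value is strictly smaller than the remaining box length, since equality still requires comparing past the end of the box.

```python
def z_values(p):
    """z[k] = length of the longest substring of p starting at k that matches a prefix of p."""
    n = len(p)
    z = [0] * n
    l = r = 0                      # [l, r) is the rightmost match found so far
    for k in range(1, n):
        if k >= r:                 # Case 1: k is outside the box, compare from scratch
            length = 0
        elif z[k - l] < r - k:     # Case 2: the copy fits strictly inside the box
            z[k] = z[k - l]
            continue
        else:                      # Case 3: match reaches the box end, extend from r
            length = r - k
        while k + length < n and p[length] == p[k + length]:
            length += 1
        z[k] = length
        if k + length > r:         # a new rightmost box
            l, r = k, k + length
    return z

print(z_values("aabcaabxaaz"))
```

Every successful comparison in the while loop moves r to the right, which is why the total work stays linear.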

Z-preprocessing for B-M and KMP

• Both KMP and B-M preprocessing can be done in O(n)

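[Figure: the prefix t’ of length Zi is followed by x; its copy t = P[i..j] is followed by y, where j = i + Zi - 1.]

• KMP: for each j with Zj > 0, sp’(j + Zj - 1) = Zj. Process j from right to left so that the smallest such j, which gives the longest match, is written last.
• B-M: use Z backwards, i.e. compute the Z-values of the reversed pattern.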
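A short sketch of both mappings, under stated assumptions: the helper names are made up, indices are 0-based (so sp[i] here is Sp’(i+1) on the slides), and the Z-values are computed by a simple quadratic helper only to keep the block short. A linear-time Z routine can be dropped in unchanged.

```python
def z_naive(p):
    """Z-values by direct comparison (quadratic; for illustration only)."""
    n = len(p)
    z = [0] * n
    for k in range(1, n):
        while k + z[k] < n and p[z[k]] == p[k + z[k]]:
            z[k] += 1
    return z

def sp_strong_from_z(p):
    """KMP: sp[j + z[j] - 1] = z[j], scanning j from right to left."""
    z = z_naive(p)
    sp = [0] * len(p)
    for j in range(len(p) - 1, 0, -1):
        if z[j] > 0:
            sp[j + z[j] - 1] = z[j]    # a smaller j overwrites with a longer match
    return sp

def n_values(p):
    """B-M: length of the longest suffix of p[:j+1] that is also a suffix of p,
    obtained as the Z-values of the reversed pattern read backwards."""
    return z_naive(p[::-1])[::-1]

print(sp_strong_from_z("aataac"))
print(n_values("abcab"))
```

The last entry of n_values is 0 only because Z at the first position is defined as 0 in this sketch.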

Keyword tree for spell checking

• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of word
• Common prefixes only need to be compared once.

[Figure: keyword tree for the words potato, poetry, pottery, science and school, with leaves numbered 1 to 5.]
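A keyword tree for spell checking can be sketched with nested dictionaries. The class name KeywordTree and the "$" end marker are assumptions made for this example; the marker must not occur in the alphabet.

```python
class KeywordTree:
    """Keyword tree (trie): one node per distinct prefix of the stored words."""

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:                    # shared prefixes reuse existing nodes
            node = node.setdefault(ch, {})
        node["$"] = True                   # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:                    # O(m) for a word of length m
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

tree = KeywordTree(["potato", "poetry", "pottery", "science", "school"])
print(tree.contains("pottery"), tree.contains("pot"))
```

A lookup of "pot" follows existing edges but stops at a node without the end marker, so it is not reported as a word.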

Aho-Corasick algorithm

• Generalizing KMP

• Create failure links

• Basis of the fgrep algorithm

• Given the following patterns:
– potato
– tattoo
– theater
– other

[Figure: keyword tree for potato, tattoo, theater and other, with failure links drawn between nodes.]

potterisapersonwhomakespottery


O(n) preprocessing, and O(m+k) searching. k is the number of occurrences.

Can create a FSA similarly. Requires more space, and preprocessing time depends on alphabet size.

A problem with failure link

• Patterns: {potato, other, pot}

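[Figure: keyword tree for potato, other and pot, with failure links. The pattern pot ends at an internal node on the path to potato.]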

A problem with failure link for multiple patterns

• Patterns: {potato, other, pot, the, he, era}

[Figure: keyword tree for the six patterns, with numbered nodes and failure links.]

T: potherarac

Output link

• Patterns: {potato, other, pot, the}

[Figure: keyword tree for the four patterns, with failure links and output links.]

T: potherarac

Failure link: taken when a mismatch occurs. Output link: always taken (but the search then returns to the node it came from).
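The search with failure links and outputs can be sketched as below. This is one common way to write it and not necessarily the lecture's formulation: instead of storing explicit output links, each node's output list is merged with the list of its failure target while the links are built, which reports the same matches. The names build_automaton and search are illustrative.

```python
from collections import deque

def build_automaton(patterns):
    """goto[i]: char -> child node, fail[i]: failure link, out[i]: patterns ending at i."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    queue = deque(goto[0].values())        # depth-1 nodes fail to the root
    while queue:                           # breadth-first, shallow nodes first
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]   # inherit shorter matches
            queue.append(child)
    return goto, fail, out

def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence, in order of match end."""
    goto, fail, out = build_automaton(patterns)
    hits, node = [], 0
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]              # mismatch: follow failure links
        node = goto[node].get(ch, 0)
        for idx in out[node]:              # report everything that ends here
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits

print(search("potherarac", ["potato", "other", "pot", "the"]))
```

On this text the pattern the is reported while the search is still inside other, which is the situation output links are meant to handle.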

Suffix Tree

• All algorithms we talked about so far preprocess pattern(s)
– Karp-Rabin: small pattern, small alphabet
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)

• In some cases we may prefer to pre-process T
– Fixed T, varying P

• Suffix tree: basically a keyword tree of all suffixes

Suffix tree

• T: xabxac

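• Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c

[Figure: suffix tree of xabxac with leaves 1 to 6. The root has edges xa, a, bxac and c; the nodes reached by xa and by a each branch into bxac and c.]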

Naïve construction: O(m^2) using Aho-Corasick.

Smarter: O(m). Very technical. big constant factor

Create an internal node only when there is a branch

Suffix tree implementation

• Explicitly labeling seq end

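• T: xabxa → T: xabxa$

[Figure: left, the suffix tree of xabxa, where the suffixes xa and a end inside the tree and get no leaf; right, the suffix tree of xabxa$, where each of the suffixes 1 to 5 ends at its own leaf.]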

Suffix tree implementation

• Implicitly labeling edges

• T: xabxa$

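[Figure: left, the suffix tree of xabxa$ with edge labels written out; right, the same tree with each edge label replaced by a start:end pair of indices into T, such as 1:2, 2:2 and 3:$.]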

Suffix links

• Similar to failure link in a keyword tree

• Only link internal nodes having branches

[Figure: suffix links from internal nodes on the path labeled xabcdefghij to the corresponding nodes on the path labeled abcdefghij.]

Suffix tree construction

1234567890...
acatgacatt...

[Figure sequence: the suffix tree of T is built by inserting one suffix at a time, adding leaves 1 through 10. Edge labels are index pairs such as 2:$, 4:$ and 5:$, and an internal node is created only where suffixes branch.]

• With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-comparing “cat”.

ST Application: pattern matching

• Find all occurrences of P = xa in T
– Find node v in the ST that matches P
– Traverse the subtree rooted at v to get the locations

[Figure: suffix tree of xabxac; the node v reached by xa has leaves 1 and 4 below it.]

T: xabxac

O(m) to construct ST (large constant factor)

O(n) to find v – linear to length of P instead of T!

O(k) to get all leaves, k is the number of occurrences.
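The two steps (walk down to v, then collect the leaves below v) can be illustrated with the naive O(m^2) keyword tree of all suffixes rather than a linear-time suffix tree. This sketch assumes that "$" and "#" do not occur in T, and the function names are made up. Because the tree is not compressed, collecting leaves costs the size of the subtree instead of O(k).

```python
def build_suffix_trie(text):
    """Naive keyword tree of all suffixes of text + '$'.
    Each node is a dict; the key '#' stores the 1-based start of the suffix ending there."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#"] = start + 1
    return root

def find_occurrences(root, pattern):
    node = root
    for ch in pattern:                 # step 1: find the node v that matches P
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]           # step 2: collect every leaf below v
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "#":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)

trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))
```

The same tree answers any number of later queries without looking at T again, which is the point of preprocessing the text.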

ST application: repeats finding

• Genome contains many repeated DNA sequences

• Repeat sequence length: varies from 1 nucleotide to whole gene
– Highly repetitive DNA in some non-coding regions
  • 6 to 10bp x 100,000 to 1,000,000 times
– Genes may have multiple copies (50 to 10,000)

Find longest repeated substring

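• Do a tree traversal and compute L, the length of the path label, at each node; the internal node with the largest L spells the longest repeated substring
• O(m)

[Figure: an edge labeled 2:5 leads to a node with L = 4; its child edges 6:10 and 15:18 lead to nodes with L = 9 and L = 8.]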
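As a rough illustration, the traversal can be run on the naive keyword tree of all suffixes (quadratic size) instead of a linear-time suffix tree. The function name is hypothetical, and "$" is assumed not to occur in the sequence. A node with at least two children is a branch point, so its path label occurs at least twice.

```python
def longest_repeat(text):
    """Return one longest substring that occurs at least twice in text."""
    text += "$"                        # end marker so every suffix ends at a leaf
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    best = ""
    stack = [(root, "")]               # (node, path label); len(label) plays the role of L
    while stack:
        node, label = stack.pop()
        if len(node) >= 2 and len(label) > len(best):
            best = label               # deepest branching node seen so far
        for ch, child in node.items():
            stack.append((child, label + ch))
    return best

print(longest_repeat("acatgacatt"))
```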

Repeats finding

• Find all repeats that are at least k-residue long and appear at least p times in the seq
– Phase 1: top-down, count lengths of labels at each node
– Phase 2: bottom-up, count # of leaves descended from each internal node

Each node is annotated with (L, N). For each node with L >= k and N >= p, print all leaves.

O(m) to traverse tree

Repeats finding

• Find repeats with at least 3 bases and 2 occurrences
– cat
– acat
– aca

1234567890
acatgacatt

[Figure: suffix tree of acatgacatt with leaves 1 to 10 and index-pair edge labels such as 5:e.]

Repeats finding

1. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j]

2. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i+k+1] != S[j+k+1]

3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j], and S[i+k+1] != S[j+k+1]

Example: acatgacatt
1. aca (left-maximal)
2. cat (right-maximal)
3. acat (maximal)

Repeats finding

• How to find maximal repeat?
– A right-maximal repeat whose occurrences have different left chars

1234567890
acatgacatt

[Figure: suffix tree of acatgacatt with each leaf annotated by the char to the left of its suffix; Left char = [] g c c a a.]

ST application: word enumeration

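• Find all k-mers that occur at least p times
– Compute (L, N) for each node
– Find nodes v with L>=k, and L(parent)<k, and N>=p
– Traverse sub-tree rooted at v to get the locations

[Figure: a parent node with L < k above a node v with L >= k and N >= p; the k-mer ends on the edge into v, at depth L = k.]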

This can be used in many applications. For example, to find words that appeared frequently in a genome or a document

Joint Suffix Tree

• Build a ST for two or more strings

• Two strings S1 and S2

• S* = S1 & S2

• Build a suffix tree for S* in time O(|S1| + |S2|)

• The separator will only appear in the edge ending in a leaf

• S1 = abcd

• S2 = abca

• S* = abcd&abca$

[Figure: suffix tree of S*. Each leaf is labeled (string, position): 1,1 to 1,4 for the suffixes of S1 and 2,1 to 2,4 for the suffixes of S2.]

To Simplify

• We don’t really need to do anything, since all edge labels were implicit.

• The right hand side is more convenient to look at

[Figure: left, the joint suffix tree with full edge labels, where the part of each leaf edge after the separator & is marked as useless; right, the same tree with those edges truncated, leaves labeled 1,1 to 1,4 and 2,1 to 2,4.]

Application of JST

• Longest common substring
– For each internal node v, keep a bit vector B[2]
– B[i] = 1 if some leaf below v is a suffix of Si
– Find all internal nodes with B[1] = B[2] = 1
– Report one with the longest label
– Can be extended to k sequences. Just use a longer bit vector.

[Figure: simplified joint suffix tree of abcd and abca, with leaves labeled 1,1 to 1,4 and 2,1 to 2,4.]

O(m), m the total seq length
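A sketch of the bit-vector idea on a naive joint keyword tree of all suffixes (quadratic, for illustration only). Two simplifications are assumed here: the suffixes of each string are inserted separately instead of using a separator char, and the bits are set during insertion rather than in a bottom-up pass. The function name is made up.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring shared by s1 and s2."""
    root = {"kids": {}, "bits": 0}
    for idx, s in enumerate((s1, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "bits": 0})
                node["bits"] |= 1 << idx       # this path label occurs in string idx
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["kids"].items():
            if child["bits"] == 0b11:          # B[1] = B[2] = 1
                stack.append((child, label + ch))
    return best

print(longest_common_substring("abcd", "abca"))
```

For k sequences the same code works with a k-bit mask in place of 0b11.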

Application of JST

• Given K strings, find all substrings with L>=l, that appear in at least d strings

• Exact motif finding problem

• Build a joint suffix tree with all strings

S* = S1 & S2 % S3 * S4 @ S5 ! S6 + S7

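– Use a unique end char for each string
– Not really necessary if caution is taken in construction

[Figure: below a node with L < k sits a node with L >= k whose two children carry the bit vectors B = 1010 and B = 0011. Its own vector is B = 1010 | 0011 = 1011, so |B| = 3 strings contain the substring. Leaves are labeled 1,x 3,x 3,x 4,x.]

O(mK), m the total seq length. The factor K is the cost of a “bitwise or” of two bit vectors.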

Many other applications

• Reproduce the behavior of Aho-Corasick
• DNA finger printing
– A database of people’s DNA sequence
– Given a short DNA, which person is it from?
• Recognizing DNA contamination
• Indexing sequence databases
• …
• Catch
– Large constant factor for space requirement (15-40 bytes per base for DNA)
– Large constant factor for construction
– Suffix array: trade off time for space

Summary

• One T, one P
– Boyer-Moore is the choice
– KMP works but not the best
• One T, many P
– Aho-Corasick
– Suffix Tree
• One fixed T, many varying P
– Suffix tree
• Two or more T’s
– Suffix tree, joint suffix tree, suffix array

Alphabet independent

Alphabet dependent
