Filter Algorithms for Approximate String Matching Stefan Burkhardt

Outline

• Motivation
• Filter Algorithms
• Gapped q-grams
• Experimental Analysis

Motivation

Computational Biology:
• EST Clustering
• Assembly
• Genome comparison (e.g. Human/Mouse)

Information Retrieval:
• Phonebooks
• Dictionaries
• Search Engines

Many more...

Why? Approximate String Matching
• Edit and Hamming Distance
• Problems and Motivation

The global approximate string matching problem

Given a pattern P, a target S, an error level k and a string distance d(x,y):

Find all substrings y of S with d(P, y) ≤ k

Example:
P = GAT
S = ACTGATAACGTTAGCCATGG

The global approximate string matching problem

d(x,y) = Hamming Distance: the k-mismatches problem

d(x,y) = Edit Distance: the k-differences problem
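As a point of reference for the filters that follow, the k-mismatches problem can be solved by a naive scan. This is a minimal sketch, not the method proposed in the talk; the function names `hamming` and `k_mismatches` are chosen here for illustration, and the strings are the example P = GAT and S = ACTGATAACGTTAGCCATGG used on the problem slide.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(P, S, k):
    """Start positions i in S with hamming(P, S[i:i+|P|]) <= k (naive scan)."""
    m = len(P)
    return [i for i in range(len(S) - m + 1) if hamming(P, S[i:i + m]) <= k]


P, S = "GAT", "ACTGATAACGTTAGCCATGG"
print(k_mismatches(P, S, 0))  # [3]
print(k_mismatches(P, S, 1))  # [3, 9, 15]
```

The scan costs O(|P| |S|) character comparisons, which is the cost a filter tries to avoid paying for most of S.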

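How? Filter Algorithms
• BLAST
• The q-gram Lemma and QUASAR

Filter Algorithms

[Diagram: P and S are passed to the filter algorithm, which outputs potential matches; the exact algorithm splits these into false matches and true matches.]

1. Filtration Phase (Filter Algorithm): apply Filter Criterion, output Potential Matches
2. Verification Phase (Exact Algorithm): examine Potential Matches, separate False Matches from True Matches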
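The two phases can be sketched as a small pipeline. This is an illustration under stated assumptions: the filter criterion used here is a simple pigeonhole rule (split P into k+1 pieces, at least one of which must occur unchanged in any window with at most k mismatches), which was picked for brevity and is not the criterion from the slides; all function names are hypothetical.

```python
def hamming(x, y):
    """Number of mismatching positions of two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def pigeonhole_candidates(P, S, k):
    """Filtration phase: windows of S in which some piece of P matches exactly."""
    m = len(P)
    # Split P into k+1 pieces; k mismatches cannot touch all of them.
    bounds = [j * m // (k + 1) for j in range(k + 2)]
    pieces = list(zip(bounds, bounds[1:]))
    return [i for i in range(len(S) - m + 1)
            if any(S[i + a:i + b] == P[a:b] for a, b in pieces)]


def filter_then_verify(P, S, k):
    """Verification phase: check every potential match with the exact distance."""
    m = len(P)
    true_matches, false_matches = [], []
    for i in pigeonhole_candidates(P, S, k):
        if hamming(P, S[i:i + m]) <= k:
            true_matches.append(i)
        else:
            false_matches.append(i)
    return true_matches, false_matches


print(filter_then_verify("GAT", "ACTGATAACGTTAGCCATGG", 1))
# ([3, 9, 15], [13])
```

Only four of the eighteen windows reach the verification phase in this example; one of them (position 13) turns out to be a false match.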

BLAST (Altschul, Karlin, et al.):
• Sequential scan of S locates all q-grams matching P
• Iterative extension with cutoff to find good matches

Problem for high similarity:
• sequential scan quite time consuming
• single q-grams unspecific
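The scan-and-extend idea can be sketched as follows. This is a simplified illustration, not BLAST itself: the scoring values (`match`, `mismatch`), the `cutoff` rule and the function names are assumptions made for this example, and only ungapped extension to the right is shown.

```python
def seed_hits(P, S, q):
    """Sequential scan of S: all pairs (i, j) with P[i:i+q] == S[j:j+q]."""
    where = {}
    for i in range(len(P) - q + 1):
        where.setdefault(P[i:i + q], []).append(i)
    return [(i, j)
            for j in range(len(S) - q + 1)
            for i in where.get(S[j:j + q], [])]


def extend_right(P, S, i, j, q, match=1, mismatch=-1, cutoff=2):
    """Extend a seed to the right; stop once the score drops `cutoff` below the best."""
    score = best = q * match
    best_len = n = q
    while i + n < len(P) and j + n < len(S):
        score += match if P[i + n] == S[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= cutoff:
            break
    return best, best_len


P, S = "GATTACA", "CCGATTTCACC"
print(seed_hits(P, S, 3))           # [(0, 2), (1, 3)]
print(extend_right(P, S, 0, 2, 3))  # (5, 7)
```

Every position of S is visited once per query, which is the sequential-scan cost named as a problem above.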

Indexed Filter Algorithm

[Diagram: S is preprocessed into an index; the indexed filter algorithm takes P and the index and outputs potential matches.]

Verification Phase (Exact Algorithm): examine Potential Matches, separate False Matches from True Matches

Indexed Filter Algorithm

Pro:
• potentially faster evaluation of filter criterion

Con:
• preprocessing time
• extra space required
• only good for some filter criteria
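The trade-off can be made concrete with a toy q-gram index. This is a minimal sketch: a Python dictionary stands in for the index structure, and the names `build_index` and `indexed_hits` are hypothetical.

```python
from collections import defaultdict


def build_index(S, q):
    """Preprocessing: map every q-gram of S to the list of its start positions."""
    index = defaultdict(list)
    for j in range(len(S) - q + 1):
        index[S[j:j + q]].append(j)
    return index


def indexed_hits(P, index, q):
    """Query: all pairs (i, j) with P[i:i+q] == S[j:j+q], without scanning S."""
    return [(i, j)
            for i in range(len(P) - q + 1)
            for j in index.get(P[i:i + q], [])]


index = build_index("ACTGATAACGTTAGCCATGG", 2)
print(indexed_hits("GAT", index, 2))  # [(0, 3), (1, 4), (1, 16)]
```

Building the index costs time and space proportional to |S| once; each query then touches only the positions that actually produce hits.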

QUASAR (Burkhardt, Rivals et al. 99):
• Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)
• Index Structure: Lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
• Match Detection: overlapping rectangles in DP-Matrix

|P| = 8, q = 3, total # of q-grams: |P| - q + 1 = 6

T C G A T T A C
TCG CGA GAT ATT TTA TAC

Each error can 'destroy' q matching q-grams => for k errors lose at most kq q-grams

T C G A A T A C
One error (T replaced by A) destroys the three q-grams GAT, ATT and TTA.

The q-gram Lemma

(Jokinen, Ukkonen, 1991)

For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least

t = |P| - q + 1 - (kq)

substrings of length q (q-grams).
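The lemma can be checked on the example pattern T C G A T T A C and its one-error variant T C G A A T A C. This is a small sketch; the helper names are chosen for illustration, and shared q-grams are counted as a multiset intersection.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m, q, k):
    """t = |P| - q + 1 - kq from the q-gram lemma."""
    return m - q + 1 - k * q


def shared_qgrams(x, y, q):
    """Number of q-grams common to x and y, counted with multiplicity."""
    common = Counter(qgrams(x, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P, y, q, k = "TCGATTAC", "TCGAATAC", 3, 1
print(qgram_threshold(len(P), q, k))  # 3
print(shared_qgrams(P, y, q))         # 3
```

With one error, exactly the three q-grams covering the changed position are lost, so the bound t = 3 is met with equality here.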

Match Detection (Jokinen, Ukkonen 91):
• overlapping rectangles of width 2|P| in DP-Matrix
• rectangle with at least t hits => potential match

[Figure: DP-Matrix of S against P covered by overlapping rectangles containing 3 hits, 2 hits and 1 hit; with t = 3 only the rectangle with 3 hits is a potential match.]
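The rectangle test can be sketched in one dimension. This is a simplification made for illustration: hits are reduced to their positions in S, rectangles become blocks of width 2|P| that start every |P| positions, and the function name is hypothetical.

```python
def potential_match_blocks(hit_positions, n, m, t):
    """Blocks [start, end) of S holding at least t q-gram hits.

    hit_positions: positions in S of q-gram hits, n = |S|, m = |P|.
    Blocks have width 2m and start every m positions, so neighbours overlap.
    """
    width, step = 2 * m, m
    blocks = []
    for start in range(0, n, step):
        count = sum(start <= j < start + width for j in hit_positions)
        if count >= t:
            blocks.append((start, min(start + width, n)))
    return blocks


# Hits at S-positions 3, 4, 5 and 16; |S| = 20, |P| = 4, t = 3.
print(potential_match_blocks([3, 4, 5, 16], 20, 4, 3))  # [(0, 8)]
```

Because neighbouring blocks overlap by |P|, a match that straddles a block boundary still falls entirely inside one block.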

QUASAR (Burkhardt, Rivals et al. 1999):
• wider rectangles efficient in practice (2048 for QUASAR)

QUASAR (Burkhardt, Rivals et al. 1999):
• BLAST for the verification of the potential matches
• wider rectangles as match regions
• index is a combination of lookup table and suffix array
• used for EST clustering at the DKFZ in Heidelberg
• searches for EST clustering about 30 times faster than BLAST

Gapped q-grams
• A new (old?) idea
• Hamming Distance
• Finding good shapes

Gapped q-grams: A new (old?) idea

General idea:
• use gapped q-grams
• call the arrangement of gaps the shape

gapped 3-shape: # # . #   (# = Match, . = Don't care)

T C G A T T A C
TC.A CG.T GA.T AT.A TT.C
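Extracting gapped q-grams for a shape can be sketched in a few lines. The function name `gapped_qgrams` is chosen for illustration; the shape is written with `#` for match positions and `.` for don't-care positions, as on the slide.

```python
def gapped_qgrams(s, shape):
    """All gapped q-grams of s for a shape such as '##.#'."""
    span = len(shape)
    grams = []
    for i in range(len(s) - span + 1):
        window = s[i:i + span]
        # Keep characters under '#', blank out characters under '.'.
        grams.append("".join(c if mark == "#" else "."
                             for c, mark in zip(window, shape)))
    return grams


print(gapped_qgrams("TCGATTAC", "##.#"))
# ['TC.A', 'CG.T', 'GA.T', 'AT.A', 'TT.C']
```

A shape with span 4 yields |P| - 4 + 1 = 5 gapped q-grams for the 8-character example, one fewer than the 6 contiguous 3-grams.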

Previous work...
• Califano, Rigoutsos (1993)
• Pevzner, Waterman (1995)
• Lehtinen, Sutinen, Tarhio (1996)

• limited attention paid to choice of shapes
• no exact threshold for the general case given


Recently...
• Buhler (2001): Multiple Shapes
• Ma, Tromp, Li (2002): Pattern Hunter

threshold t = 1

The Threshold t

Definition: t is the number of remaining q-grams in a worst-case placement of k errors

classic 3-shape ###, k = 3: t = 0 (no filter!)

OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO

gapped 3-shape ##.#, k = 3: t = 1

OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O

• gapped shapes can have higher(!) thresholds t than ungapped shapes
• no simple formula for t
• we used a DP-based approach to compute t
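For short patterns the definition of t can be evaluated directly by trying every placement of k errors. This is a brute-force sketch for illustration only, not the DP-based approach mentioned above; its running time grows with the number of k-subsets of |P| positions, and the function name is hypothetical.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Minimum number of surviving q-grams over all placements of k errors.

    shape: string of '#' (match) and '.' (don't care); m = |P|.
    """
    offsets = [o for o, c in enumerate(shape) if c == "#"]
    n_grams = m - len(shape) + 1
    if n_grams <= 0:
        return 0
    best = n_grams
    for errors in combinations(range(m), k):
        hit = set(errors)
        # A q-gram at start j survives if none of its match positions is an error.
        surviving = sum(all(j + o not in hit for o in offsets)
                        for j in range(n_grams))
        best = min(best, surviving)
    return best


print(threshold("###", 11, 3))   # 0
print(threshold("##.#", 11, 3))  # 1
```

The two calls reproduce the slide's example for |P| = 11 and k = 3: the contiguous shape leaves no q-gram in the worst case, the gapped shape always leaves one.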


Finding good shapes

[Figure: filters plotted by # of q-gram hits / filtration time (high to low) against # of potential matches / verification time (high to low), with regions marked "good filters" and "bad filters". Later builds add a tradeoff line along which q runs from low to high, the label # of q-gram hits ~ |S| · 1/|Σ|^q, and a "?" next to the tradeoff line.]

For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.

Reason:

##.#      ###
 ##.#      ###
-----     ----
  5         4

A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.

We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.

CGACGATTGAT
   ##.#
    ##.#
   -----
ACTCGATTAGA

For t = 2 and the shape ##.# the minimum coverage is 5.
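The minimum coverage can be computed for small cases by enumerating all arrangements of t shapes. This is a sketch under the assumption that all t hits lie on one diagonal (Hamming distance, no insertions or deletions); the function name is hypothetical and the enumeration is only practical for small |P| and t.

```python
from itertools import combinations


def minimum_coverage(shape, m, t):
    """Fewest characters covered by t distinct placements of shape in a length-m pattern."""
    offsets = [o for o, c in enumerate(shape) if c == "#"]
    n_grams = m - len(shape) + 1
    return min(
        len({j + o for j in starts for o in offsets})
        for starts in combinations(range(n_grams), t)
    )


print(minimum_coverage("##.#", 13, 2))  # 5
print(minimum_coverage("###", 13, 2))   # 4
```

Both shapes have threshold t = 2 for |P| = 13 and k = 3, but two hits of the gapped shape force at least 5 matching characters, against 4 for the contiguous shape.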


[Figure: tradeoff plot with regions "good filters" and "bad filters" and a tradeoff line. Axis labels: # of q-gram hits ~ |S| · 1/|Σ|^q, with q from low to high, and |S| · 1/|Σ|^(c_m), with c_m from low to high.]

• compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6

[Figure: number of shapes (0 to 600) with given minimum coverage (8 to 22) for k = 5 and q = 8, one series each for t = 1 to t = 5; marked values: median, contiguous, best.]


Experimental Analysis
• Speed and Filtration Efficiency
• The Heuristic Zone

Speed and Filtration Efficiency

k = 5, |P| = 50, |S| = 50 Mbps

[Figure: minimum coverage (8 to 24) against q (6 to 12) for two series: gapped (Hamming) and contiguous.]

[Figure: matches (2^16 down to 2^-8) against hits (2^22 down to 2^12).]

From Hits to Matches: Describing Filter Properties

Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)

[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with marks on the error axis at k (end of the guarantee zone) and |P| - c_m (start of the negative zone).]

The Heuristic Zone

Problem: Behaviour in the Heuristic Zone hard to predict



A simple idea: Sampling!

For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)

This allows an experimental evaluation of the Heuristic Zone

|P| = 50, 1000 samples for each error level
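The three sampling steps can be sketched as follows. Several simplifications are assumed here and are not taken from the slides: errors are substitutions only (Hamming distance), the filter counts q-gram hits on the main diagonal only, the alphabet is ACGT, and all function names and the fixed seeds are hypothetical.

```python
import random


def recognized(P, y, shape, t):
    """Filter criterion: at least t shape positions where P and y agree."""
    offsets = [o for o, c in enumerate(shape) if c == "#"]
    n_grams = len(P) - len(shape) + 1
    hits = sum(all(P[j + o] == y[j + o] for o in offsets)
               for j in range(n_grams))
    return hits >= t


def with_errors(P, i, rng, alphabet="ACGT"):
    """Copy of P with exactly i substituted positions."""
    y = list(P)
    for pos in rng.sample(range(len(P)), i):
        y[pos] = rng.choice([c for c in alphabet if c != P[pos]])
    return "".join(y)


def recognition_rate(P, shape, t, i, samples=1000, seed=0):
    """Percentage of samples with i errors that pass the filter."""
    rng = random.Random(seed)
    found = sum(recognized(P, with_errors(P, i, rng), shape, t)
                for _ in range(samples))
    return 100.0 * found / samples


rng = random.Random(1)
P = "".join(rng.choice("ACGT") for _ in range(50))
shape, t = "#" * 11, 50 - 11 + 1 - 3 * 11  # contiguous, q = 11, k = 3, t = 7
for i in (0, 3, 5, 10, 20):
    print(i, recognition_rate(P, shape, t, i))
```

Up to i = k = 3 errors the rate is 100% by the q-gram lemma; the values for larger i sample the heuristic zone of this particular filter.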



[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30). Curves: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=5, q=10; k=4, q=11; k=3, q=11); BLAST. Further curve labels: k=3, q=11 and k=4, q=10.]

[Figure: detail of the same plot, recognition rate 50% to 100% against errors 0 to 15.]

Conclusion - Future Work

Our Work:
• Significant sensitivity improvement over existing filters
• Required modifications easy to implement
• Methods for describing filter properties

Future Work:
• Combination of 'orthogonal' shapes into one filter
• Use of word neighborhoods
• Database of filter properties for good shapes
