Filter Algorithms for Approximate String Matching
Stefan Burkhardt
Outline
  Motivation
  Filter Algorithms
  Gapped q-grams
  Experimental Analysis
Problems and Motivation
Why? / Approximate String Matching / Edit and Hamming Distance

Motivation
  Computational Biology:
    EST Clustering
    Assembly
    Genome comparison (e.g. Human/Mouse)
  Information Retrieval:
    Phonebooks
    Dictionaries
    Search Engines
  Many more.

The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
Find all substrings y of S with d(P,y) <= k.
[Figure: pattern P and target S; the slide shows the characters GATACTGATAACGTTAGCCATGG.]
d(x,y) = Hamming distance: the k-mismatches problem
d(x,y) = Edit distance: the k-differences problem
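The two problem variants can be illustrated with a naive reference search. This is only a sketch for small inputs, not one of the filter algorithms discussed in these slides; the function names are hypothetical, and the k-differences part uses the standard column-wise dynamic program in which a match may start at any position of the target.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(pattern: str, target: str, k: int) -> list[int]:
    """Start positions of all windows of target within Hamming distance k."""
    m = len(pattern)
    return [
        i
        for i in range(len(target) - m + 1)
        if hamming_distance(pattern, target[i:i + m]) <= k
    ]


def k_differences(pattern: str, target: str, k: int) -> list[int]:
    """End positions (exclusive) of substrings of target whose edit
    distance to pattern is at most k."""
    m = len(pattern)
    # column[i]: best edit distance between pattern[:i] and any substring
    # of target that ends at the current target position.
    column = list(range(m + 1))
    ends = []
    for j, c in enumerate(target):
        prev_diag = column[0]  # value of row i-1 in the previous column
        # column[0] stays 0 because a match may start anywhere in target.
        for i in range(1, m + 1):
            old = column[i]
            column[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                column[i] + 1,                      # extra character in target
                column[i - 1] + 1,                  # missing pattern character
            )
            prev_diag = old
        if column[m] <= k:
            ends.append(j + 1)
    return ends
```

Both functions inspect every position of the target, which is exactly the cost that filter algorithms try to avoid.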
Filter Algorithms
How? / BLAST / The q-gram Lemma and QUASAR
[Figure: pattern P and target S.]
BLAST (Altschul, Karlin, et al.):
  Sequential scan of S locates all matching q-grams with P
  Iterative extension with cutoff to find good matches
  Problem for high similarity:
    sequential scan quite time consuming
    single q-grams unspecific
Indexed Filter Algorithm
[Figure: S is preprocessed into an index; the filter algorithm takes P and the index and outputs potential matches.]
  Pro:
    potentially faster evaluation of the filter criterion
  Con:
    preprocessing time
    extra space required
    only good for some filter criteria
QUASAR (Burkhardt, Rivals et al. 99):
  Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)
  Index Structure: Lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
  Match Detection: overlapping rectangles in DP-Matrix
The q-gram Lemma (Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
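A small sketch of the lemma in code may help. The helper names below are hypothetical; `shared_qgrams` counts common q-grams with multiplicity, which is one reasonable reading of "share" and an assumption of this example.

```python
from collections import Counter


def qgram_threshold(m: int, q: int, k: int) -> int:
    """Lower bound t = m - q + 1 - k*q from the q-gram lemma."""
    return m - q + 1 - k * q


def qgrams(s: str, q: int) -> Counter:
    """Multiset of all contiguous substrings of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(p: str, y: str, q: int) -> int:
    """Number of q-grams common to p and y, counted with multiplicity."""
    return sum((qgrams(p, q) & qgrams(y, q)).values())


pattern = "GATACTGATAAC"
candidate = "GATACTCATAAC"  # one substitution relative to pattern
t = qgram_threshold(len(pattern), q=3, k=1)
print(t, shared_qgrams(pattern, candidate, q=3))
```

Each error can destroy at most q of the |P| - q + 1 q-grams of the pattern, which is where the term kq comes from. When t is zero or negative the lemma gives no usable filter.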
Match Detection (Jokinen, Ukkonen 91):
  overlapping rectangles of width 2|P| in DP-Matrix
  rectangle with at least t hits => potential match
QUASAR (Burkhardt, Rivals et al. 1999):
  wider rectangles efficient in practice (2048 for QUASAR)
QUASAR (Burkhardt, Rivals et al. 1999):
  BLAST for the verification of the potential matches
  wider rectangles as match regions
  index is a combination of lookup table and suffix array
  used for EST clustering at the DKFZ in Heidelberg
  searches for EST clustering about 30 times faster than BLAST
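The rectangle-based match detection can be sketched as hit counting over overlapping blocks of S. This is a simplified illustration and not QUASAR's implementation: it uses a plain dictionary instead of a lookup table with suffix array, it ignores diagonal information, and the names `build_qgram_index`, `candidate_blocks` and `block_size` are hypothetical. Blocks overlap by half their size, so any region no longer than half a block lies entirely inside at least one block.

```python
from collections import defaultdict


def build_qgram_index(target: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of target to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - q + 1):
        index[target[i:i + q]].append(i)
    return index


def candidate_blocks(
    pattern: str, target: str, q: int, k: int, block_size: int
) -> list[tuple[int, int]]:
    """Return (start, end) ranges of blocks of target with at least t hits."""
    m = len(pattern)
    t = m - q + 1 - k * q  # threshold from the q-gram lemma
    if t <= 0:
        raise ValueError("threshold t must be positive for the filter to work")
    index = build_qgram_index(target, q)
    half = block_size // 2  # consecutive blocks start half a block apart
    counts = defaultdict(int)
    for i in range(m - q + 1):
        for pos in index.get(pattern[i:i + q], ()):
            b = pos // half
            counts[b] += 1          # block that starts at b * half
            if b > 0:
                counts[b - 1] += 1  # the preceding block also covers pos
    return sorted(
        (b * half, min(len(target), b * half + block_size))
        for b, c in counts.items()
        if c >= t
    )


target = "C" * 40 + "GATACTCATAAC" + "C" * 40
print(candidate_blocks("GATACTGATAAC", target, q=3, k=1, block_size=32))
```

Only the reported blocks need to be passed on to a verification step; counting hits per block can produce false positives but never discards a block that meets the threshold.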
Gapped q-grams
A new (old?) idea / Hamming Distance / Finding good shapes
General idea:
  use gapped q-grams
  call the arrangement of gaps the shape
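As a minimal sketch of what applying a shape means, assuming the notation in which `#` marks a position that is compared and `.` marks a gap (the function name is hypothetical):

```python
def gapped_qgrams(s: str, shape: str) -> list[str]:
    """Apply shape at every position of s.

    '#' keeps the character at that offset, '.' replaces it by a gap marker.
    """
    span = len(shape)
    return [
        "".join(s[i + j] if c == "#" else "." for j, c in enumerate(shape))
        for i in range(len(s) - span + 1)
    ]


print(gapped_qgrams("GATACT", "##.#"))  # ['GA.A', 'AT.C', 'TA.T']
```

A shape without gaps, such as `###`, gives the ordinary contiguous q-grams.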
Previous work...
  Califano, Rigoutsos (1993)
  Pevzner, Waterman (1995)
  Lehtinen, Sutinen, Tarhio (1996)
  limited attention paid to choice of shapes
  no exact threshold for the general case given
Recently...
  Buhler (2001): Multiple Shapes
  Ma, Tromp, Li (2002): Pattern Hunter
  threshold t = 1
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.

Example (O = matching position, X = error):

  classic 3-shape ###, k = 3
    OOXOOXOOXOO
    OOX OXO XOO OOX OXO XOO OOX OXO XOO
    t = 0: no filter!

  gapped 3-shape ##.#, k = 3
    OOOXXOOXOOO
    OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
    t = 1

Gapped shapes can have higher(!) thresholds t than ungapped shapes.
There is no simple formula for t; we used a DP-based approach to compute t.
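For small patterns the threshold can be checked by exhaustive search over all error placements. Note that this is a brute-force sketch and not the DP-based approach mentioned on the slide; it is only practical for short patterns and small k, and the function name is hypothetical.

```python
from itertools import combinations


def threshold(shape: str, m: int, k: int) -> int:
    """Minimum number of error-free shape occurrences in a pattern of
    length m over all placements of k errors (Hamming distance)."""
    offsets = [j for j, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for errors in combinations(range(m), k):
        err = set(errors)
        # A shape occurrence survives if none of its '#' positions is hit.
        surviving = sum(
            all(i + j not in err for j in offsets) for i in starts
        )
        best = min(best, surviving)
    return best


print(threshold("###", 11, 3))   # 0
print(threshold("##.#", 11, 3))  # 1
```

The two printed values correspond to the example with strings of length 11: the contiguous shape loses all of its q-grams to three well-placed errors, while the gapped shape always keeps at least one.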
Finding good shapes
[Figure: # of q-gram hits, with the marks "|S|1" and "|S|q" and a question mark.]
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
We define the minimum coverage cm as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
[Figure: the strings CGACGATTGAT and ACTCGATTAGA with two occurrences of the shape ##.# and a five-character span marked "-----".]
For t = 2 and the shape ##.# the minimum coverage is 5.
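The definition of minimum coverage can be turned into a short exhaustive computation. This is a sketch under the assumption that "distinct arrangement" means t different start positions of the shape inside a pattern of length m; the function name is hypothetical.

```python
from itertools import combinations


def minimum_coverage(shape: str, m: int, t: int) -> int:
    """Fewest distinct characters covered by any t shape occurrences
    placed at different start positions in a pattern of length m."""
    offsets = [j for j, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({i + j for i in chosen for j in offsets})
        for chosen in combinations(starts, t)
    )


print(minimum_coverage("##.#", 13, 2))  # 5
print(minimum_coverage("###", 13, 2))   # 4
```

Two overlapping occurrences of ### can share two characters, whereas two occurrences of ##.# share at most one. The same number of hits therefore certifies more matching characters for the gapped shape, which is consistent with it returning fewer potential matches.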
[Figure: labels "# of potential matches" and "# of q-gram hits", with the marks "|S|1" and "|S|q".]
Compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6.
Experimental Analysis
A few different Filters / Speed and Filtration Efficiency / The Heuristic Zone
[Figure: minimum coverage (axis 8 to 24) against q (6 to 12) for the series "gapped, Hamming" and "contiguous".]
[Figure: matches and hits on an axis from 2^22 down to 2^-8; k = 5, |P| = 50, |S| = 50 Mbps.]
From Hits to Matches
Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
  Guarantee zone (finds all approximate matches)
  Heuristic zone (finds some of the approximate matches)
  Negative zone (guaranteed not to find matches)
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with k and |P|-mc marked on the error axis.]
Problem: behaviour in the Heuristic Zone is hard to predict.
A simple idea: Sampling!
For a value i:
  1. Generate s sample strings with i random errors each
  2. Run a filter algorithm on these samples
  3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the Heuristic Zone.
|P| = 50; 1000 samples for each error level
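The three sampling steps can be sketched for the Hamming-distance case as follows. Several things here are assumptions of the example rather than details from the slides: errors are substitutions at distinct random positions, a sample counts as recognized when at least t shape occurrences match at the same offsets in pattern and sample, and all function names, the alphabet and the fixed seed are hypothetical.

```python
import random


def matching_shapes(p: str, y: str, shape: str) -> int:
    """Shape occurrences whose '#' positions agree in p and y (same offsets)."""
    offsets = [j for j, c in enumerate(shape) if c == "#"]
    return sum(
        all(p[i + j] == y[i + j] for j in offsets)
        for i in range(len(p) - len(shape) + 1)
    )


def with_errors(p: str, i: int, alphabet: str, rng: random.Random) -> str:
    """Copy of p with substitutions at i distinct random positions."""
    chars = list(p)
    for pos in rng.sample(range(len(p)), i):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def recognition_rate(
    p: str,
    shape: str,
    t: int,
    i: int,
    samples: int = 1000,
    alphabet: str = "ACGT",
    seed: int = 0,
) -> float:
    """Percentage of samples with i errors that pass the filter."""
    rng = random.Random(seed)
    recognized = sum(
        matching_shapes(p, with_errors(p, i, alphabet, rng), shape) >= t
        for _ in range(samples)
    )
    return 100.0 * recognized / samples


pattern = ("GATACTGATAACGTTAGCCATGG" * 3)[:50]
for errors in (3, 10, 20):
    print(errors, recognition_rate(pattern, "###", t=39, i=errors))
```

With |P| = 50, the shape ### and k = 3, the lemma gives t = 39, so every sample with at most 3 errors is recognized; the rates for larger error counts describe the heuristic zone.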
[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30); |P| = 50, 1000 samples for each error level. Curve labels: contiguous; gapped, edit; BLAST; k=4, q=9; k=3, q=11.]
[Figure: close-up of the same plot, recognition rate 50% to 100% for 0 to 15 errors. Additional curve labels: k=3, q=11; k=4, q=11; k=5, q=10; k=3, q=11; k=4, q=10.]
Conclusion - Future Work
Our Work:
  Significant sensitivity improvement over existing filters
  Required modifications easy to implement
  Methods for describing filter properties
Future Work:
  Combination of 'orthogonal' shapes into one filter
  Use of word neighborhoods
  Database of filter properties for good shapes