Filter Algorithms for Approximate String Matching Stefan Burkhardt

  • Filter Algorithms for Approximate String Matching

    Stefan Burkhardt

  • Outline

    • Motivation
    • Filter Algorithms
    • Gapped q-grams
    • Experimental Analysis

  • Problems and Motivation

    Section outline: Why?, Approximate String Matching, Edit and Hamming Distance

    Motivation:
    • Computational Biology: EST clustering, assembly, genome comparison (e.g. Human/Mouse)
    • Information Retrieval: phonebooks, dictionaries, search engines
    • Many more.

  • Problems and Motivation: The global approximate string matching problem

    Given a pattern P, a target S, an error level k and a string distance d(x,y):

    Find all substrings y of S with d(P, y) ≤ k.

    [Figure: pattern P and target S; sequence shown: GATACTGATAACGTTAGCCATGG]

  • Problems and Motivation: The global approximate string matching problem

    d(x,y) = Hamming distance: the k-mismatches problem

    d(x,y) = Edit distance: the k-differences problem
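
The two problem variants can be illustrated with a minimal sketch. It is not taken from the slides: the function names are hypothetical, the k-mismatches part is a plain brute-force scan, and the k-differences part uses the classic column-wise dynamic programming in which a match may start at any text position.

```python
def hamming_matches(pattern, text, k):
    """Start positions i where text[i:i+m] differs from pattern in at most k positions."""
    m = len(pattern)
    starts = []
    for i in range(len(text) - m + 1):
        mismatches = sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b)
        if mismatches <= k:
            starts.append(i)
    return starts


def edit_matches(pattern, text, k):
    """End positions j (exclusive) where some substring of text ending at j
    has edit distance at most k to pattern."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and a substring ending here
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                         old + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

Both functions inspect every text position, which is exactly the cost that the filter algorithms on the following slides try to avoid.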

  • Filter Algorithms

    Section outline: How?, BLAST, The q-gram Lemma and QUASAR

    [Figure: pattern P and target S]

  • Filter Algorithms: BLAST (Altschul, Karlin, et al.)

    • A sequential scan of S locates all q-grams matching P
    • Iterative extension with a cutoff finds good matches

    Problems for high similarity:
    • the sequential scan is quite time-consuming
    • single q-grams are unspecific

    [Figure: S and P]

  • Filter Algorithms: Indexed Filter Algorithm

    [Figure labels: P, S, Preprocess, Index, Indexed Filter Algorithm, Potential Matches]

    Pro:
    • potentially faster evaluation of the filter criterion

    Con:
    • preprocessing time
    • extra space required
    • only good for some filter criteria
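
A minimal sketch of the preprocessing idea, assuming a plain Python dictionary as the index. The names `build_qgram_index` and `qgram_hits` are hypothetical and this is not the index structure of any particular tool; it only shows how preprocessing S once lets q-gram hits for a pattern be looked up without scanning S again.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def qgram_hits(pattern, index, q):
    """All (pattern_pos, text_pos) pairs at which the same q-gram starts."""
    hits = []
    for j in range(len(pattern) - q + 1):
        # .get avoids creating empty entries for q-grams absent from the text
        for i in index.get(pattern[j:j + q], ()):
            hits.append((j, i))
    return hits
```

The index costs preprocessing time and extra space proportional to |S|, which matches the drawbacks listed on the slide.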

  • Filter Algorithms: QUASAR (Burkhardt, Rivals et al. 99)

    [Figure labels: P, S, Preprocess, Index, Indexed Filter Algorithm, Potential Matches]

    • Filter criterion: q-gram Lemma (Jokinen, Ukkonen 91)
    • Index structure: lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
    • Match detection: overlapping rectangles in the DP matrix

  • Filter Algorithms: The q-gram Lemma (Jokinen, Ukkonen, 1991)

    For a pattern P, a substring y of S and a value k: if P and y match with at most k errors, they share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
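
The lemma can be checked on a small example. The sketch below is illustrative only; `qgram_threshold` and `shared_qgrams` are hypothetical helper names, and shared q-grams are counted with multiplicity.

```python
from collections import Counter


def qgram_threshold(m, q, k):
    """Lower bound t = m - q + 1 - k*q from the q-gram lemma (m = |P|)."""
    return m - q + 1 - k * q


def shared_qgrams(x, y, q):
    """Number of q-grams common to x and y, counted with multiplicity."""
    cx = Counter(x[i:i + q] for i in range(len(x) - q + 1))
    cy = Counter(y[i:i + q] for i in range(len(y) - q + 1))
    return sum((cx & cy).values())


# One substitution in a string of length 10 destroys at most q = 3 q-grams.
p = "abcdefghij"
y = "abcdXfghij"
t = qgram_threshold(len(p), 3, 1)
print(t, shared_qgrams(p, y, 3))
```

A threshold of t ≤ 0 means the lemma gives no filtering power for that combination of |P|, q and k.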

  • Filter Algorithms: Match Detection (Jokinen, Ukkonen 91)

    • overlapping rectangles of width 2|P| in the DP matrix
    • a rectangle with at least t hits is a potential match

    QUASAR (Burkhardt, Rivals et al. 1999): wider rectangles are efficient in practice (2048 for QUASAR)

    [Figure: DP matrix of S against P, covered by overlapping rectangles]
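
The rectangle counting can be sketched as follows. This is an illustration under assumptions not stated on the slide: rectangles are modelled as blocks of text positions, consecutive blocks overlap by half their width, and the function name `candidate_blocks` is hypothetical.

```python
from collections import Counter


def candidate_blocks(pattern, text, q, t, width=None):
    """Return (start, end) text ranges whose q-gram hit count reaches t."""
    m = len(pattern)
    if width is None:
        width = 2 * m              # block width 2|P| as on the slide
    step = max(1, width // 2)      # assumption: neighbours overlap by half a block
    pattern_counts = Counter(pattern[j:j + q] for j in range(m - q + 1))
    # hits_at[i] = number of pattern q-grams equal to the text q-gram starting at i
    hits_at = [pattern_counts.get(text[i:i + q], 0)
               for i in range(len(text) - q + 1)]
    blocks = []
    for start in range(0, max(1, len(hits_at)), step):
        if sum(hits_at[start:start + width]) >= t:
            blocks.append((start, min(start + width, len(text))))
    return blocks
```

Only the reported blocks need to be passed on to a verification step; all other parts of S are discarded by the filter.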

  • Filter Algorithms: QUASAR (Burkhardt, Rivals et al. 1999)

    • BLAST for the verification of the potential matches
    • wider rectangles as match regions
    • index is a combination of lookup table and suffix array
    • used for EST clustering at the DKFZ in Heidelberg
    • searches for EST clustering about 30 times faster than BLAST

  • Gapped q-grams

    Section outline: A new (old?) idea, Hamming Distance, Finding good shapes

  • Gapped q-grams: General idea

    • use gapped q-grams
    • call the arrangement of gaps the shape

  • Gapped q-grams: Previous work

    • Califano, Rigoutsos (1993)
    • Pevzner, Waterman (1995)
    • Lehtinen, Sutinen, Tarhio (1996)

    However:
    • limited attention paid to the choice of shapes
    • no exact threshold given for the general case

    Recently:
    • Buhler (2001): Multiple Shapes
    • Ma, Tromp, Li (2002): Pattern Hunter, threshold t = 1

  • Gapped q-grams: The Threshold t

    Definition: t is the number of remaining q-grams in a worst-case placement of k errors.

    Example (O = matching position, X = error):

    classic 3-shape ###, k = 3:
        OOXOOXOOXOO
        q-grams: OOX OXO XOO OOX OXO XOO OOX OXO XOO
        t = 0 (no filter!)

    gapped 3-shape ##.#, k = 3:
        OOOXXOOXOOO
        q-grams: OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
        t = 1

    • gapped shapes can have higher(!) thresholds t than ungapped shapes
    • there is no simple formula for t
    • we used a DP-based approach to compute t
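
For small parameters the definition of t can be evaluated directly. The sketch below is a brute-force enumeration over all placements of k errors, not the DP-based approach mentioned on the slide; the function name `threshold` and the shape notation ("#" = used position, "." = gap) follow the slide's examples, everything else is an assumption.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Number of q-grams of the given shape that survive the worst placement
    of k mismatches in a pattern of length m (exhaustive search)."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    n_positions = m - len(shape) + 1
    if n_positions <= 0:
        return 0
    best = n_positions
    for errors in combinations(range(m), k):
        err = set(errors)
        # a q-gram survives if none of its used positions carries an error
        remaining = sum(1 for s in range(n_positions)
                        if not any(s + o in err for o in offsets))
        best = min(best, remaining)
    return best


print(threshold("###", 11, 3), threshold("##.#", 11, 3))
```

The enumeration visits C(m, k) error placements, so it is only practical for short patterns and small k.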

  • Gapped q-grams: Finding good shapes

    [Figure: axis "# of q-gram hits" with marks |S|1 and |S|q; the second quantity is still left open ("?")]

  • Gapped q-grams: Finding good shapes

    For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.

  • Gapped q-grams: Finding good shapes

    We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.

    Example:

        CGACGATTGAT
           ##.#
            ##.#
           -----
        ACTCGATTAGA

    For t = 2 and the shape ##.# the minimum coverage is 5.
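
The definition can be evaluated by exhaustive search for small cases. This is a minimal sketch with a hypothetical function name; it tries every choice of t distinct start positions and counts the characters they cover.

```python
from itertools import combinations


def minimum_coverage(shape, m, t):
    """Smallest number of distinct characters covered by t copies of the
    shape placed at t different start positions in a pattern of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    n_positions = m - len(shape) + 1
    best = None
    for starts in combinations(range(n_positions), t):
        covered = {s + o for s in starts for o in offsets}
        if best is None or len(covered) < best:
            best = len(covered)
    return best  # None if t shapes do not fit into the pattern


print(minimum_coverage("##.#", 13, 2), minimum_coverage("###", 13, 2))
```

For t = 2 the contiguous shape ### can be satisfied by only 4 matching characters, while ##.# needs 5, which is one way to see why the gapped shape reports fewer potential matches at the same threshold.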

  • Gapped q-grams: Finding good shapes

    [Figure: axes "# of q-gram hits" and "# of potential matches"; marks |S|1 and |S|q]

  • Gapped q-grams: Finding good shapes

    Compute t and the minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6.

  • Experimental Analysis

    Section outline: Speed and Filtration Efficiency, The Heuristic Zone

  • Experimental Analysis: Speed and Filtration Efficiency

    [Figure: minimum coverage (axis 8 to 24) and numbers of hits and matches (logarithmic axis in powers of 2) against q = 6 to 12, for gapped (Hamming) and contiguous shapes; k = 5, |P| = 50, |S| = 50 Mbps]

  • From Hits to Matches: Describing Filter Properties

    Filters usually have 3 "recognition zones" depending on k:
    • Guarantee zone (finds all approximate matches)
    • Heuristic zone (finds some of the approximate matches)
    • Negative zone (guaranteed not to find matches)

    [Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with k and |P| - c_m marked on the error axis, where c_m is the minimum coverage]
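
Read off the figure labels, the zone boundaries for the Hamming distance case can be written as a small helper. This is an interpretation rather than something the slide states in words: up to k errors the threshold guarantees recognition, and with more than |P| - c_m errors fewer than c_m characters match, so t matching shapes are impossible. The function name and the example values are hypothetical.

```python
def recognition_zone(errors, m, k, min_coverage):
    """Classify an error count for a filter with guarantee level k and
    minimum coverage min_coverage on a pattern of length m."""
    if errors <= k:
        return "guarantee"   # the threshold t is always reached
    if errors > m - min_coverage:
        return "negative"    # too few matching characters for t shapes
    return "heuristic"       # recognition depends on where the errors fall


# Hypothetical filter: |P| = 50, k = 5, minimum coverage 12.
for e in (3, 20, 38, 39):
    print(e, recognition_zone(e, 50, 5, 12))
```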

  • Experimental Analysis: The Heuristic Zone

    Problem: behaviour in the Heuristic Zone is hard to predict.

  • Experimental Analysis: The Heuristic Zone

    A simple idea: Sampling!

    For a value i:
    1. Generate s sample strings with i random errors each
    2. Run a filter algorithm on these samples
    3. Record how many strings were recognized (in percent)

    This allows an experimental evaluation of the Heuristic Zone.
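
The three sampling steps can be sketched for the Hamming distance case. Everything here is an assumption made for illustration: errors are substitutions at distinct random positions, the filter counts shape hits between the pattern and the sample at identical offsets, and the names `mutate`, `aligned_hits` and `recognition_rate` are hypothetical.

```python
import random


def mutate(pattern, i, alphabet, rng):
    """Copy of pattern with i substitutions at distinct random positions."""
    chars = list(pattern)
    for p in rng.sample(range(len(chars)), i):
        chars[p] = rng.choice([c for c in alphabet if c != chars[p]])
    return "".join(chars)


def aligned_hits(pattern, sample, shape):
    """Number of shape positions at which pattern and sample agree."""
    offsets = [j for j, c in enumerate(shape) if c == "#"]
    return sum(1 for s in range(len(pattern) - len(shape) + 1)
               if all(pattern[s + o] == sample[s + o] for o in offsets))


def recognition_rate(pattern, shape, t, i, samples=1000, alphabet="ACGT", seed=0):
    """Percentage of samples with i errors that reach the threshold t."""
    rng = random.Random(seed)
    recognized = sum(
        1 for _ in range(samples)
        if aligned_hits(pattern, mutate(pattern, i, alphabet, rng), shape) >= t)
    return 100.0 * recognized / samples
```

Calling `recognition_rate` for i = 0, 1, 2, ... up to |P| traces out a recognition curve; with a contiguous shape and t taken from the q-gram lemma, the rate stays at 100% for all i ≤ k.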

  • Experimental Analysis: The Heuristic Zone

    |P| = 50, 1000 samples for each error level

    [Figure: recognition rate (0% to 100%) against the number of errors (0 to 30) for contiguous q-grams, gapped q-grams (edit distance) and BLAST; curves labelled k=4, q=9 and k=3, q=11]

    [Figure: detail of the plot for 0 to 15 errors and recognition rates from 50% to 100%; additional curve labels: k=3, q=11; k=4, q=11; k=5, q=10; k=3, q=11; k=4, q=10]

  • Conclusion - Future Work

    Our Work:
    • Significant sensitivity improvement over existing filters
    • Required modifications easy to implement
    • Methods for describing filter properties

    Future Work:
    • Combination of "orthogonal" shapes into one filter
    • Use of word neighborhoods
    • Database of filter properties for good shapes
