Approximate Matching (String Algorithms 2007)

  • Upload
    mailund

DESCRIPTION

Approximate pattern matching and a "SHIFT-and-OR" algorithm for k-mismatch pattern matching


1. Approximate pattern matching

  • Given string x = abbacbbbababacabbbba and pattern p = bbba, find all almost-occurrences of p in x.
  • (Figure: x with almost-occurrences of p highlighted; the labels 17, 6, and 1 appear on the slide.)

2. String distance

  • A number of string distances have been suggested, e.g.:
    • Hamming distance: d(x, y) = number of characters that differ between x and y
      • d(abca, abaa) = 1, d(abca, abab) = 2
    • Levenshtein distance: d(x, y) = number of deletions and insertions needed to transform x into y
      • d(abca, abaa) = 2, d(abca, aba) = 1
    • Edit distance: d(x, y) = number of insertions, deletions, or substitutions needed to transform x into y
      • d(abca, abaa) = 1, d(abca, cca) = 2
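
The three distances above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (hamming, indel_distance, edit_distance) are hypothetical and not taken from the slides, and the insertion/deletion-only distance is implemented exactly as the slide defines it. The two dynamic-programming distances share one helper that differs only in whether a substitution step is allowed.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which the equal-length strings x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def _dp_distance(x: str, y: str, allow_substitution: bool) -> int:
    """Row-by-row dynamic programme shared by the two distances below."""
    prev = list(range(len(y) + 1))            # cost of turning "" into y[:j]
    for i, a in enumerate(x, start=1):
        cur = [i]                             # cost of turning x[:i] into ""
        for j, b in enumerate(y, start=1):
            best = min(prev[j] + 1, cur[j - 1] + 1)   # delete a / insert b
            if a == b:
                best = min(best, prev[j - 1])         # characters agree, no cost
            elif allow_substitution:
                best = min(best, prev[j - 1] + 1)     # substitute a by b
            cur.append(best)
        prev = cur
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Insertions and deletions only (the slide's second distance)."""
    return _dp_distance(x, y, allow_substitution=False)


def edit_distance(x: str, y: str) -> int:
    """Insertions, deletions, and substitutions (the slide's third distance)."""
    return _dp_distance(x, y, allow_substitution=True)
```

Calling these on the slide's string pairs gives the values listed in the bullets, for instance edit_distance("abca", "cca") evaluates to 2.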

3. k-Approximate matching

  • Given string x and pattern p, find all indices i in x where d(x[i..i+h], p) ≤ k
  • Generic problem for the various distance functions d

4. Generic Algorithm

  • for i = 1..|x|: if d(x[i..i'], p) […] > b) + t[x[i]] where b = ⌈log(|p|+1)⌉, to be
      s = (s >> b) + t[x[i]]
      o = (o >> b) | (s & OF_MASK)
      s = s & NEG_OF_MASK
    where b = ⌈log(k+1)⌉ + 1 and
      OF_MASK = (10^(b-1))^|p|
      NEG_OF_MASK = ~OF_MASK = (01^(b-1))^|p|

35. Algorithm

  • Preprocessing:
      b = ⌈log(k+1)⌉ + 1
      OF_MASK = (10^(b-1))^|p|; NEG_OF_MASK = ~OF_MASK
      for h in Σ and j = 1..|p|: t[h,j] = 0^(b-1)1
      for j = 1..|p|: t[p[j],j] = 0^b
      s = 0^(b(|p|+1)); o = 0^(b(|p|+1))
  • Main:
      for i = 1..|x|:
        s = (s >> b) + t[x[i]]
        o = (o >> b) | (s & OF_MASK)
        s = s & NEG_OF_MASK
        if s[|p|] + o[|p|] ≤ k and i ≥ |p|: report i-|p|+1 as match

36. Example

  • x = babacacbababacabbbba, p = bbba, i = 0
      s0               == 000 000 000 000 000
      o0               == 000 000 000 000 000

37. Example

  • x = babacacbababacabbbba, p = bbba, i = 1
      (s0 >> b)        == 000 000 000 000 000
      t['b']           ==     000 000 000 001
      s1'              == 000 000 000 000 001
      (o0 >> b)        == 000 000 000 000 000
      (s1' & OF_MASK)  == 000 000 000 000 000
      o1               == 000 000 000 000 000
      s1               == 000 000 000 000 001

38. Example

  • x = babacacbababacabbbba, p = bbba, i = 2
      (s1 >> b)        == 000 000 000 000 000
      t['a']           ==     001 001 001 000
      s2'              == 000 001 001 001 000
      (o1 >> b)        == 000 000 000 000 000
      (s2' & OF_MASK)  == 000 000 000 000 000
      o2               == 000 000 000 000 000
      s2               == 000 001 001 001 000

39. Example

  • x = babacacbababacabbbba, p = bbba, i = 3
      (s2 >> b)        == 000 000 001 001 001
      t['b']           ==     000 000 000 001
      s3'              == 000 000 001 001 010
      (o2 >> b)        == 000 000 000 000 000
      (s3' & OF_MASK)  == 000 000 000 000 000
      o3               == 000 000 000 000 000
      s3               == 000 000 001 001 010

40. Example

  • x = babacacbababacabbbba, p = bbba, i = 4
      (s3 >> b)        == 000 000 000 001 001
      t['a']           ==     001 001 001 000
      s4'              == 000 001 001 010 001
      (o3 >> b)        == 000 000 000 000 000
      (s4' & OF_MASK)  == 000 000 000 000 000
      o4               == 000 000 000 000 000
      s4               == 000 001 001 010 001
    MATCH!

41. Example

  • x = babacacbababacabbbba, p = bbba, i = 5
      (s4 >> b)        == 000 000 001 001 010
      t['c']           ==     001 001 001 001
      s5'              == 000 001 010 010 011
      (o4 >> b)        == 000 000 000 000 000
      (s5' & OF_MASK)  == 000 000 000 000 000
      o5               == 000 000 000 000 000
      s5               == 000 001 010 010 011

42. Example

  • x = babacacbababacabbbba, p = bbba, i = 6
      (s5 >> b)        == 000 000 001 010 010
      t['a']           ==     001 001 001 000
      s6'              == 000 001 010 011 010
      (o5 >> b)        == 000 000 000 000 000
      (s6' & OF_MASK)  == 000 000 000 000 000
      o6               == 000 000 000 000 000
      s6               == 000 001 010 011 010
    MATCH!

43. Example

  • x = babacacbababacabbbba, p = bbba, i = 7
      (s6 >> b)        == 000 000 001 010 011
      t['c']           ==     001 001 001 001
      s7'              == 000 001 010 011 100
      (o6 >> b)        == 000 000 000 000 000
      (s7' & OF_MASK)  == 000 000 000 000 100
      o7               == 000 000 000 000 100
      s7               == 000 001 010 011 000

44. Example

  • x = babacacbababacabbbba, p = bbba, i = 8
      (s7 >> b)        == 000 000 001 010 011
      t['b']           ==     000 000 000 001
      s8'              == 000 000 001 010 100
      (o7 >> b)        == 000 000 000 000 000
      (s8' & OF_MASK)  == 000 000 000 000 100
      o8               == 000 000 000 000 100
      s8               == 000 000 001 010 000

45. Time complexity

  • Preprocessing takes time O(|Σ||p|log(k)/w + |p|)
  • Main search takes time O(|x||p|log(k)/w)
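
The update rule from the algorithm slides (shift s by b bits, add t[x[i]], move overflow bits into o, clear them from s) can be tried out with a minimal Python sketch. Several things here are assumptions rather than the slides' own code: the function name k_mismatch_shift_add is hypothetical, Python's arbitrary-precision integers stand in for machine words, the state uses |p| fields rather than |p|+1, the field for the full pattern is placed in the lowest b bits so that the slides' right shift can be used directly, characters that do not occur in p are treated as mismatching everywhere, and ⌈log(k+1)⌉ is computed as k.bit_length().

```python
from typing import Dict, List


def k_mismatch_shift_add(x: str, p: str, k: int) -> List[int]:
    """Return 1-based start positions i where x[i..i+|p|-1] has <= k mismatches with p."""
    m = len(p)
    if m == 0 or k < 0:
        raise ValueError("pattern must be non-empty and k non-negative")

    b = k.bit_length() + 1                     # ceil(log2(k+1)) counter bits + 1 overflow bit
    # Field for pattern position j (1-based) starts at bit b*(m-j),
    # so field m sits in the lowest b bits (assumed layout).
    of_mask = sum(1 << (f * b + b - 1) for f in range(m))   # top bit of every field
    all_ones = sum(1 << (f * b) for f in range(m))          # a mismatch in every field

    # t[c] has 0 in field j if p[j] == c and 1 otherwise.
    t: Dict[str, int] = {}
    for j, c in enumerate(p):                  # j is 0-based here
        t[c] = t.get(c, all_ones) & ~(1 << (b * (m - 1 - j)))

    low = (1 << b) - 1                         # selects field m
    s = o = 0
    matches: List[int] = []
    for i, c in enumerate(x, start=1):
        s = (s >> b) + t.get(c, all_ones)      # extend every alignment by one character
        o = (o >> b) | (s & of_mask)           # remember counters that overflowed
        s &= ~of_mask                          # clear overflow bits so adds cannot carry
        if i >= m and (s & low) + (o & low) <= k:
            matches.append(i - m + 1)
    return matches
```

With x = "babacacbababacabbbba", p = "bbba" and k = 2, the first two reported positions are 1 and 3, which corresponds to the MATCH! steps at i = 4 and i = 6 in the worked example.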