
Boyer Moore Algorithm Idan Szpektor. Boyer and Moore



  • Boyer Moore Algorithm, Idan Szpektor

  • Boyer and Moore

  • What It's About

    A String Matching Algorithm

    Preprocess a Pattern P (|P| = n)

    For a text T (|T| = m), find all of the occurrences of P in T

    Time complexity: O(n + m), but usually sub-linear

  • Right to Left (like in Hebrew)

    Matching the pattern from right to left

    For a pattern abc:

    T: bbacdcbaabcddcdaddaaabcbcb
    P: abc

    Worst case is still O(nm)
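
As a baseline, right-to-left comparison at every alignment can be sketched as below. The function name `naive_right_to_left` is hypothetical and positions are reported 1-based to match the slides. The pattern always moves by exactly one place, which is where the O(nm) worst case comes from.

```python
def naive_right_to_left(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        j = n - 1
        # Compare from the right end of the pattern towards its left end.
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:                      # every character matched
            hits.append(start + 1)
    return hits


print(naive_right_to_left("bbacdcbaabcddcdaddaaabcbcb", "abc"))  # [9, 21]
```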

  • The Bad Character Rule (BCR)

    On a mismatch between the pattern and the text, we can shift the pattern by more than one place. Sublinearity!

    T: bbacdcbaabcddcdaddaaabcbcb
    P: acabc

  • BCR Preprocessing

    A table giving, for each position in the pattern and each character, the size of the shift. O(n|Σ|) space. O(1) access time.

    P:        a b a c b
    position: 1 2 3 4 5

    A list of positions for each character. O(n + |Σ|) space. O(n) access time, but O(m) in total over the whole search.
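
The second option (a list of positions per character) might look like the sketch below. The names `bad_char_positions` and `bad_char_shift` are hypothetical, positions are 1-based as on the slide, and the convention of shifting by i when the character does not occur to the left of the mismatch is an assumption of this sketch.

```python
from collections import defaultdict


def bad_char_positions(pattern):
    """Map each character to its 1-based positions in pattern, right-most first."""
    positions = defaultdict(list)
    for idx in range(len(pattern), 0, -1):
        positions[pattern[idx - 1]].append(idx)
    return dict(positions)


def bad_char_shift(positions, i, text_char):
    """Shift after a mismatch at pattern position i (1-based) against text_char."""
    for p in positions.get(text_char, []):
        if p < i:                 # closest occurrence to the left of the mismatch
            return i - p
    return i                      # no such occurrence: move past the mismatch


table = bad_char_positions("abacb")
print(table)                          # {'b': [5, 2], 'c': [4], 'a': [3, 1]}
print(bad_char_shift(table, 5, "a"))  # 2
```

Scanning a list costs up to O(n) per lookup, but every position skipped during the scan is also skipped by the resulting shift, which is the intuition behind the O(m) total.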

  • BCR - Summary

    On a mismatch, shift the pattern to the right until the closest occurrence in P (to the left of the mismatch) of the mismatched text char is aligned with it.

    Still O(nm) worst case running time:

    T: aaaaaaaaaaaaaaaaaaaaaaaaa
    P: abaaaa
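
To see the worst case concretely, here is a minimal sketch of a search that uses only the bad character rule and counts character comparisons. The name `bcr_search` and the comparison counter are illustrative additions, not part of the slides.

```python
def bcr_search(text, pattern):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(pattern), len(text)
    positions = {}
    for idx in range(n, 0, -1):             # right-most occurrence first
        positions.setdefault(pattern[idx - 1], []).append(idx)
    hits, comparisons = [], 0
    k = n                                   # text position under P's right end
    while n > 0 and k <= m:
        i, h = n, k
        while i > 0:
            comparisons += 1
            if pattern[i - 1] != text[h - 1]:
                break
            i, h = i - 1, h - 1
        if i == 0:
            hits.append(k - n + 1)
            k += 1
        else:
            # Occurrences of the mismatched text char to the left of position i.
            left = [p for p in positions.get(text[h - 1], []) if p < i]
            k += (i - left[0]) if left else i
    return hits, comparisons


print(bcr_search("a" * 25, "abaaaa"))  # ([], 100)
```

Each of the 20 alignments matches four characters before failing on the b, and the rule can only shift by one place.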

  • The Good Suffix Rule (GSR)

    We want to use the knowledge of the matched characters in the pattern's suffix.

    If we matched S characters in T, what is the smallest shift of P (if one exists) that aligns another sub-string of P consisting of the same S characters?

  • GSR (Cont)

    Example 1, how much to move:

    T: bbacdcbaabcddcdaddaaabcbcb
    P: cabbabdbab
    P: cabbabdbab (shifted)

  • GSR (Cont)

    Example 2, what if there is no alignment:

    T: bbacdcbaabcbbabdbabcaabcbcb
    P: bcbbabdbabc
    P: bcbbabdbabc (shifted)

  • GSR - Detailed

    We mark the matched sub-string in T with t and the mismatched char with x

    In case of a mismatch: shift right until the first occurrence of t in P such that the next char y in P holds y ≠ x

    Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.

  • Boyer Moore Algorithm

    Preprocess(P)
    k := n
    while (k ≤ m) do
        Match P and T from right to left starting at k
        If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule).
        else, print the occurrence and shift P right (advance k) by the good suffix rule.

  • Algorithm Correctness

    The bad character rule shift never misses a match

    The good suffix rule shift never misses a match

  • Preprocessing the GSR: L(i)

    L(i): the biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]

    i:  1  2  3  4  5  6  7  8  9 10 11 12 13
    P:  b  b  a  b  b  a  a  b  b  c  a  b  b
    L:  0  0  0  0  0  0  0  0  0  0  9  2 12

  • Preprocessing the GSR: l(i)

    l(i): the length of the longest suffix of P[i..n] that is also a prefix of P

    i:  1  2  3  4  5  6  7  8  9 10 11 12 13
    P:  b  b  a  b  b  a  a  b  b  c  a  b  b
    l:     2  2  2  2  2  2  2  2  2  2  2  1
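
Both tables can be checked directly against their definitions. The sketch below is deliberately brute force (roughly cubic time) and is only meant for validating small examples. The function name `gsr_tables_by_definition` is hypothetical.

```python
def gsr_tables_by_definition(p):
    """Return (L, l) as 1-based lists (index 0 unused), straight from the definitions."""
    n = len(p)
    L = [0] * (n + 1)
    l = [0] * (n + 1)
    for i in range(1, n + 1):
        suffix = p[i - 1:]                       # P[i..n]
        longer = p[i - 2:] if i > 1 else None    # P[i-1..n], undefined for i = 1
        for j in range(n - 1, 0, -1):            # try the largest j first
            prefix = p[:j]
            if prefix.endswith(suffix) and not (longer and prefix.endswith(longer)):
                L[i] = j
                break
        for length in range(len(suffix), 0, -1):
            if suffix.endswith(p[:length]):      # suffix of P[i..n] == prefix of P
                l[i] = length
                break
    return L, l


L, l = gsr_tables_by_definition("bbabbaabbcabb")
print(L[1:])
print(l[2:])   # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
```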

  • Using L(i) and l(i) in GSR

    If a mismatch occurs at position n, shift P by 1

    If a mismatch occurs at position i-1 in P:
        If L(i) > 0, shift P by n - L(i)
        else shift P by n - l(i)

    If P was found, shift P by n - l(2)
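
Putting the shift rules together, a complete search could be sketched as follows. This is one possible arrangement, not the presenter's code: the helper names are hypothetical, the Z values are computed naively for brevity, and the bad character rule used here is the simple right-most-occurrence variant.

```python
def z_values(s):
    """Naive 0-based Z: z[i] = length of the longest prefix of s starting at i."""
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
    return z


def gsr_tables(p):
    """Return 1-based tables (L, l) with one spare slot at index n + 1."""
    n = len(p)
    z = z_values(p[::-1])
    N = [0] + [z[n - j] for j in range(1, n)] + [n]   # N[j] for j in 1..n
    L = [0] * (n + 2)
    l = [0] * (n + 2)
    longest = 0
    for j in range(1, n + 1):
        if j < n and N[j] > 0:
            L[n - N[j] + 1] = j        # increasing j keeps the largest index
        if N[j] == j:                  # P[1..j] is also a suffix of P
            longest = j
        l[n - j + 1] = longest
    return L, l


def boyer_moore(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    L, l = gsr_tables(pattern)
    last = {c: idx for idx, c in enumerate(pattern, 1)}   # right-most position
    hits = []
    k = n
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i, h = i - 1, h - 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - l[2]
        else:
            bad = max(1, i - last.get(text[h - 1], 0))
            if i == n:
                good = 1
            elif L[i + 1] > 0:
                good = n - L[i + 1]
            else:
                good = n - l[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("bbacdcbaabcddcdaddaaabcbcb", "abc"))  # [9, 21]
```

Note that the code indexes the tables with i + 1 because its variable `i` is the mismatch position itself, whereas the slide calls that position i-1.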

  • Building L(i) and l(i): the Z

    For a string s, Z(i) is the length of the longest sub-string of s starting at i that matches a prefix of s.

    i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    s:  b  b  a  c  d  c  b  b  a  a  b  b  c  d  d
    Z:     1  0  0  0  0  3  1  0  0  2  1  0  0  0

    Naively, we can build Z in O(n^2)
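
The naive construction simply extends each match character by character. In this sketch the name `z_naive` is hypothetical and positions are 1-based as on the slide.

```python
def z_naive(s):
    """Return [Z(2), ..., Z(n)] for string s in O(n^2) time."""
    n = len(s)
    z = []
    for i in range(2, n + 1):
        length = 0
        # Compare position i + length with position 1 + length (both 1-based).
        while i + length <= n and s[i + length - 1] == s[length]:
            length += 1
        z.append(length)
    return z


print(z_naive("bbacdcbbaabbcdd"))
# [1, 0, 0, 0, 0, 3, 1, 0, 0, 2, 1, 0, 0, 0]
```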

  • From Z to N

    N(i) is the length of the longest suffix of P[1..i] that is also a suffix of P.

    N(i) is Z(n - i + 1), built over P reversed.

    i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    s:  d  d  c  b  b  a  a  b  b  c  d  c  a  b  b
    N:  0  0  0  1  2  0  0  1  3  0  0  0  0  1
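
The relation between N and Z can be sketched in a few lines: reverse the string, compute Z, and read the values back in reverse order. The helper names are hypothetical and the Z computation is the naive one, used here only to keep the example short.

```python
def z_values(s):
    """Naive 0-based Z: z[i] = length of the longest prefix of s starting at i."""
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
    return z


def n_values(p):
    """Return [N(1), ..., N(n-1)], where N(j) is Z(n - j + 1) of reversed p."""
    n = len(p)
    z = z_values(p[::-1])
    # 1-based position n - j + 1 in the reversed string is 0-based index n - j.
    return [z[n - j] for j in range(1, n)]


print(n_values("ddcbbaabbcdcabb"))
# [0, 0, 0, 1, 2, 0, 0, 1, 3, 0, 0, 0, 0, 1]
```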

  • Building L(i) in O(n)

    L(i): the biggest index j < n, such that prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]

    L(i): the biggest index j < n such that: N(j) == |P[i..n]| == n - i + 1

    for i := 1 to n, L(i) := 0
    for j := 1 to n-1
        i := n - N(j) + 1
        L(i) := j
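
The loop above translates almost line by line. In this sketch N(j) is evaluated by a small hypothetical helper `n_value` straight from its definition, so the example as a whole is not linear; feeding it N values derived from a linear-time Z would restore the O(n) bound. Skipping N(j) = 0 is an explicit guard added here, since that case would index position n + 1.

```python
def n_value(p, j):
    """N(j): length of the longest suffix of P[1..j] that is also a suffix of P."""
    k = 0
    while k < j and p[j - 1 - k] == p[len(p) - 1 - k]:
        k += 1
    return k


def build_big_l(p):
    """Return [L(1), ..., L(n)] using the loop from the slide."""
    n = len(p)
    L = [0] * (n + 1)                 # 1-based, index 0 unused
    for j in range(1, n):             # increasing j, so the largest j wins
        nj = n_value(p, j)
        if nj > 0:                    # N(j) = 0 would point at i = n + 1
            L[n - nj + 1] = j
    return L[1:]


print(build_big_l("abcabc"))  # [0, 0, 0, 3, 0, 0]
```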

  • Building l(i) in O(n)

    l(i): the length of the longest suffix of P[i..n] that is also a prefix of P

    l(i): the biggest j ≤ |P[i..n]| = n - i + 1 such that N(j) == j
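
One way to fill the l table in a single pass is sketched below. It assumes the characterisation "l(i) is the largest j ≤ n - i + 1 with N(j) == j", which follows from the definition because a prefix of P that is also a suffix of P has N(j) == j. The names are hypothetical, and `n_value` evaluates N naively, so only the table-filling loop itself is linear.

```python
def n_value(p, j):
    """N(j): length of the longest suffix of P[1..j] that is also a suffix of P."""
    k = 0
    while k < j and p[j - 1 - k] == p[len(p) - 1 - k]:
        k += 1
    return k


def build_small_l(p):
    """Return [l(1), ..., l(n)]."""
    n = len(p)
    l = [0] * (n + 2)
    longest = 0
    for j in range(1, n + 1):          # j is a candidate prefix/suffix length
        if n_value(p, j) == j:         # P[1..j] is also a suffix of P
            longest = j
        l[n - j + 1] = longest         # best length that fits inside P[i..n]
    return l[1:n + 1]


print(build_small_l("bbabbaabbcabb")[1:])
# [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
```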

  • Building Z in O(n)

    For calculating Z(i), we want to use the previously calculated Z(1)…Z(i-1)

    For each i we remember the index j of the right-most Z(j): j < i and j + Z(j) >= k + Z(k), for all k < i

  • Building Z in O(n) (Cont)

    (figure: string S with positions i', j and i marked)

    If i < j + Z(j), s[i .. j + Z(j) - 1] appeared previously, starting at i' = i - j + 1.

    Z(i') < Z(j) - (i - j) ?

  • Building Z in O(n) (Cont)

    For Z(2) calculate explicitly
    j := 2, i := 3
    While i <= |s|:
        if i >= j + Z(j), calculate Z(i) explicitly
        else
            Z(i) := min(Z(i'), Z(j) - (i - j))
            If Z(i') >= Z(j) - (i - j), calculate the Z(i) tail explicitly, starting at position j + Z(j)
        If j + Z(j) < i + Z(i), j := i
        i := i + 1
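
The steps above can be sketched as follows. The function name `z_linear` and the inner helper `extend` are hypothetical. The list is 1-based to mirror the slides, with Z(1) set to the string length by convention.

```python
def z_linear(s):
    """Return Z as a 1-based list: z[i] for i in 1..n (z[0] unused, z[1] = n)."""
    n = len(s)
    z = [0] * (n + 1)
    if n == 0:
        return z
    z[1] = n

    def extend(i, length):
        # Explicit comparison of position i + length with position 1 + length.
        while i + length <= n and s[i + length - 1] == s[length]:
            length += 1
        return length

    if n >= 2:
        z[2] = extend(2, 0)
    j = 2                                  # start of the match reaching furthest right
    for i in range(3, n + 1):
        right = j + z[j]                   # first position not covered by that match
        if i >= right:
            z[i] = extend(i, 0)
        else:
            i_prime = i - j + 1
            remaining = right - i          # equals Z(j) - (i - j)
            if z[i_prime] < remaining:
                z[i] = z[i_prime]          # fully determined by earlier work
            else:
                z[i] = extend(i, remaining)
        if j + z[j] < i + z[i]:
            j = i
    return z


print(z_linear("bbacdcbbaabbcdd")[2:])
# [1, 0, 0, 0, 0, 3, 1, 0, 0, 2, 1, 0, 0, 0]
```

Every successful comparison inside `extend` pushes the right boundary forward, so the total number of comparisons stays linear in the string length.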

  • Building Z in O(n) - Analysis

    The algorithm builds Z correctly

    The algorithm executes in O(n):
    A new character is matched only once
    All other operations are in O(1)

  • Boyer Moore Worst Case Analysis

    Assume P consists of n copies of a single char and T consists of m copies of the same char:

    T: aaaaaaaaaaaaaaaaaaaaaaaaa
    P: aaaaaa

    Boyer Moore Algorithm runs in Θ(mn) when finding all the matches

  • The Galil Rule

    In a specific matching phase, we mark with k the position in T of the right end of P. We mark with s the position of the last matched char in this phase.

    (figure: positions s and k marked on T)

    T: bbacdcbaabcddcdaddaaabcbcb
    P: abaab
    P: abaab (shifted)

  • The Galil Rule (Cont)

    All the chars in positions s < j ≤ k are known to be matching. The algorithm doesn't need to check them.

    An extended Boyer Moore algorithm with the Galil rule runs in O(m + n) worst case (even without the bad-character rule).
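
A minimal sketch of the extended search is given below, using only the good suffix rule plus the Galil rule. All names are hypothetical, the Z values are computed naively for brevity, and the variable `known` (the right-most text position already known to match a prefix of the shifted pattern) is this sketch's way of recording the slide's observation.

```python
def z_values(s):
    """Naive 0-based Z: z[i] = length of the longest prefix of s starting at i."""
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
    return z


def gsr_tables(p):
    """Return 1-based tables (L, l) with one spare slot at index n + 1."""
    n = len(p)
    z = z_values(p[::-1])
    N = [0] + [z[n - j] for j in range(1, n)] + [n]
    L, l, longest = [0] * (n + 2), [0] * (n + 2), 0
    for j in range(1, n + 1):
        if j < n and N[j] > 0:
            L[n - N[j] + 1] = j
        if N[j] == j:
            longest = j
        l[n - j + 1] = longest
    return L, l


def boyer_moore_galil(text, pattern):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(pattern), len(text)
    if n == 0:
        return [], 0
    L, l = gsr_tables(pattern)
    hits, comparisons = [], 0
    known = 0          # text positions from P's left end up to `known` already match
    k = n
    while k <= m:
        i, h = n, k
        while i > 0 and h > known:
            comparisons += 1
            if pattern[i - 1] != text[h - 1]:
                break
            i, h = i - 1, h - 1
        if i == 0 or h == known:           # match; the skipped prefix is known
            hits.append(k - n + 1)
            shift = n - l[2]
            known = k
        else:
            if i == n:
                shift = 1
            elif L[i + 1] > 0:
                shift = n - L[i + 1]
            else:
                shift = n - l[i + 1]
            # Reuse earlier matches only if P's left end moves past the mismatch.
            known = k if k + shift - n + 1 > h else 0
        k += shift
    return hits, comparisons


hits, comparisons = boyer_moore_galil("a" * 25, "aaaaaa")
print(len(hits), comparisons)  # 20 25
```

On the all-a input, the first alignment costs six comparisons and each later alignment costs one, instead of six per alignment without the rule.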

  • Don't Sleep Yet

  • O(n + m) proof - Outline

    Preprocess in O(n): already proved

    Properties of strings
    Proof of search in O(m) if P is not in T, using only the good suffix rule.
    Proof of search in O(m) even if P is in T, adding the Galil rule.

  • Properties of Strings

    If for two strings δ, γ: δγ = γδ, then there is a string ρ such that δ = ρ^i and γ = ρ^j, i, j > 0 - Proof by induction

    Definition: A string s is semiperiodic with period β if s consists of a non-empty suffix of β (possibly the entire β) followed by one or more complete copies of β.
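
The commuting property is easy to check on small examples. The helper `primitive_root` below is a hypothetical brute-force routine that finds the shortest string whose repetition gives its argument; the variable names `delta` and `gamma` are chosen for this illustration.

```python
def primitive_root(s):
    """Shortest string rho such that s is rho repeated a whole number of times."""
    n = len(s)
    for d in range(1, n + 1):
        if n % d == 0 and s[:d] * (n // d) == s:
            return s[:d]
    return s   # only reached for the empty string


delta, gamma = "abab", "ababab"
print(delta + gamma == gamma + delta)                # True
print(primitive_root(delta), primitive_root(gamma))  # ab ab
```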

  • Properties of Strings (Cont)

    A string is prefix semiperiodic with period β if it consists of one or more complete copies of β followed by a non-empty prefix of β.

    A string is prefix semiperiodic iff it is semiperiodic with the same length period
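
Both definitions can be tested mechanically. The two predicates below are hypothetical helpers that take the candidate period as a parameter named `beta`. The example shows one string that is semiperiodic with one period and prefix semiperiodic with a different period of the same length.

```python
def is_semiperiodic(s, beta):
    """True if s is a non-empty suffix of beta followed by >= 1 copies of beta."""
    b = len(beta)
    if b == 0 or len(s) <= b:
        return False
    head_len = len(s) % b or b          # length of the leading piece, in 1..b
    copies = (len(s) - head_len) // b
    return beta.endswith(s[:head_len]) and s[head_len:] == beta * copies


def is_prefix_semiperiodic(s, beta):
    """True if s is >= 1 copies of beta followed by a non-empty prefix of beta."""
    b = len(beta)
    if b == 0 or len(s) <= b:
        return False
    tail_len = len(s) % b or b          # length of the trailing piece, in 1..b
    copies = (len(s) - tail_len) // b
    return s[:copies * b] == beta * copies and beta.startswith(s[copies * b:])


print(is_semiperiodic("bcabcabc", "abc"))         # True: "bc" + "abc" + "abc"
print(is_prefix_semiperiodic("bcabcabc", "bca"))  # True: "bca" + "bca" + "bc"
```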

  • Lemma 1

    Suppose P occurs in T starting at position p and also at position q, q > p. If q - p ≤ n/2 then P is semiperiodic with period α = P[n-(q-p)+1..n]

  • Proof - when P is Not Found in T

    We have R rounds during the search.

    After each round the good suffix rule decides on a right shift of s_i chars.

    Σ s_i ≤ m

    We shall use Σ s_i as an upper bound.

  • Proof (Cont)

    For each round we count the matched chars by:
    f_i - the number of chars matched for the first time
    g_i - the number of chars already matched in previous rounds.

    Σ f_i ≤ m
    We want to prove that g_i ≤ 3s_i (so Σ g_i ≤ 3m).

  • Proof (Cont)

    Each round that doesn't find P matches a substring t_i and hits one bad char x_i in T (x_i t_i is a substring of T)

    T: bbacdcbaabcbbabdbabcaabcbcb
    P: bdbabc

    |t_i| + 1 ≤ 3s_i ⇒ g_i ≤ 3s_i (because g_i + f_i = |t_i| + 1)

    For the rest of the proof we assume that for the specific round i: |t_i| + 1 > 3s_i

  • Lemma 2 (|t_i| + 1 > 3s_i)

    In round i we look at the matched suffix of P, marked P*. P* = y_i t_i, y_i ≠ x_i.

    Both P* and t_i are semiperiodic with period α of length s_i and hence with minimal length period β, α = β^k. Proof: by Lemma 1.

  • Lemma 3 (|t_i| + 1 > 3s_i)

    Suppose P overlapped t_i during round i. We shall examine in what ways P could overlap t_i in previous rounds.

    In any round h < i, the right end of P could not have been aligned with the right end of any full copy of β in t_i.

    Proof:
    Both round h and round i fail at char x_i
    The two cases of possible shift after round h are invalid

  • Lemma 4 (|t_i| + 1 > 3s_i)

    In round h < i, P can correctly match at most |β|-1 chars in t_i.

    By Lemma 3, P is not aligned with a right end of a copy of β in t_i in phase h.
    Thus if it matched |β| chars or more, there is a suffix γ of β followed by a prefix δ of β such that γδ = δγ = β.
    By the string properties there is a substring ρ such that β = ρ^k, k > 1.
    This contradicts the minimal period size property of β.

  • Lemma 5 (|t_i| + 1 > 3s_i)

    If in round h < i the right end of P is aligned with a char in t_i, it can only be aligned with one of the following:
    One of the left-most |β|-1 chars of t_i
    One of the right-most |β| chars of t_i

    Proof:
    If not, by Lemmas 3 and 4, at most |β|-1 chars are matched, and only from the middle of a copy of β, while there are at least |β|
    A shift cannot pass the right end of that copy

  • Proof (Cont)

    If |t_i| + 1 > 3s_i then g_i ≤ 3s_i

    Using Lemma 5, in previous rounds we could match only the bad char x_i, the last (left-most) |β|-1 chars in t_i, or start from one of the |β| right-most chars in t_i. In the last case, using Lemma 4, we can only match up to |β|-1 chars.

    In total we could previously match: g_i ≤ 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3s_i

  • Proof - Final

    Number of matches = Σ(f_i + g_i) = Σ f_i + Σ g_i ≤ m + 3 Σ s_i ≤ m + 3m = 4m

  • Proof - when P is Found in T

    Split the rounds into two groups:
    match rounds: an occurrence of P in T was found.
    mismatch rounds: P was not found in T.

    We have proved O(m) for mismatch rounds.

  • Proof (Cont)

    After P was found in T, P will be shifted by a co
