String matching algorithms

Deepak Garg · 2012-06-11

  • String matching algorithms

  • Deliverables

    String Basics

    Naïve String matching Algorithm

    Boyer Moore Algorithm

    Rabin-Karp Algorithm

    Knuth-Morris-Pratt Algorithm

    Copyright @ gdeepak.Com® 2 6/11/2012 7:22 PM

  • String Basics

    A string is a sequence of characters

    Examples of strings: C++ program, HTML document, DNA sequence, Digitized image

    An alphabet Σ is the set of possible characters for a family of strings

    Examples of alphabets: ASCII (used by C and C++), Unicode (used by Java), {0, 1}, {A, C, G, T}

  • String Basics

    Let P be a string of size m

    A substring P[i .. j] of P is the contiguous sequence of characters of P with indices i through j

    A prefix of P is a substring of the form P[0 .. i]

    A suffix of P is a substring of the form P[i .. m - 1]

    Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P

    Applications: text editors, search engines, biological research

  • Brute Force String Matching

    Naive-String-Matcher(T, P)

    1. n ← length[T]

    2. m ← length[P]

    3. for s ← 0 to n - m

    4. do if P[1 .. m]=T[s+1..s+m]

    5. then print "Pattern occurs with shift" s

    Worst case O(m*n), e.g. T = aaa … ah and P = aaah: at almost every shift the first m - 1 characters match and the comparison fails only on the last one

    On a mismatch, shift the pattern by one position and compare again, and so on.
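    The brute-force matcher above can be sketched in C as follows. This is a minimal sketch, not the slide author's code: the name naive_match is an assumption, indices are 0-based, and it returns only the first shift instead of printing every shift as the pseudocode does.

```c
#include <stddef.h>
#include <string.h>

/* Brute-force matcher (hypothetical helper name).
 * Returns the first 0-based shift s at which p occurs in t, or -1. */
long naive_match(const char *t, const char *p)
{
    size_t n = strlen(t);
    size_t m = strlen(p);

    if (m > n)
        return -1;                 /* the pattern cannot fit in the text */

    for (size_t s = 0; s + m <= n; s++) {
        size_t j = 0;
        /* compare P[0 .. m-1] with T[s .. s+m-1], left to right */
        while (j < m && t[s + j] == p[j])
            j++;
        if (j == m)
            return (long)s;        /* all m characters matched */
    }
    return -1;                     /* no shift produced a match */
}
```

    With the worst-case style input from the slide, naive_match("aaaaaaah", "aaah") examines up to m characters at each of the first shifts before it finds the match at shift 4.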

  • Boyer-Moore Algorithm

    It uses two heuristics

    Looking-glass heuristic: Compare P with T moving backwards

    Character-jump heuristic: When a mismatch occurs at T[i] = c

    If P contains c, shift P to align the last occurrence of c in P with T[i]

    Else, shift P to align P[0] with T[i + 1]

  • Boyer Moore Algorithm

    • Boyer-Moore runs in time O(nm + s), where s is the size of the alphabet

    • Example of worst case: T = aaa … a, P = baaa

    • The worst case may occur in images and DNA sequences but is unlikely in English text, so Boyer-Moore is well suited to English text

    • Example: T = "a pattern matching algorithm", P = "rithm". The figure shows seven successive alignments of P against T and 11 character comparisons (numbered 1 to 11) before the match is found.

    Character   r  i  t  h  m
    Shift       4  3  2  1  0

  • Boyer Moore Algorithm

    Algorithm BoyerMooreMatch(T, P, Σ)

    L ← lastOccurrenceFunction(P, Σ)
    i ← m - 1
    j ← m - 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i { match at i }
            else
                i ← i - 1
                j ← j - 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m - min(j, 1 + l)
            j ← m - 1
    until i > n - 1
    return -1 { no match }
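    The pseudocode above translates fairly directly into C. The following is a sketch under stated assumptions: the alphabet is taken to be the 256 byte values, the names last_occurrence and boyer_moore_match are hypothetical, and only the character-jump heuristic from the slides is used (not the good-suffix rule of the full Boyer-Moore algorithm).

```c
#include <string.h>

#define ALPHABET_SIZE 256   /* assumption: 8-bit characters */

/* last[c] = index of the last occurrence of c in p, or -1 if c is absent */
static void last_occurrence(const char *p, int m, int last[ALPHABET_SIZE])
{
    for (int c = 0; c < ALPHABET_SIZE; c++)
        last[c] = -1;
    for (int k = 0; k < m; k++)
        last[(unsigned char)p[k]] = k;
}

/* Returns the index of the first occurrence of p in t, or -1 if none. */
int boyer_moore_match(const char *t, const char *p)
{
    int n = (int)strlen(t);
    int m = (int)strlen(p);
    int last[ALPHABET_SIZE];

    if (m == 0)
        return 0;               /* the empty pattern matches at index 0 */
    if (m > n)
        return -1;

    last_occurrence(p, m, last);

    int i = m - 1;              /* current position in the text */
    int j = m - 1;              /* current position in the pattern */
    while (i <= n - 1) {
        if (t[i] == p[j]) {
            if (j == 0)
                return i;       /* looking-glass scan reached P[0]: match */
            i--;
            j--;
        } else {
            /* character-jump: realign using the last occurrence of t[i] */
            int l = last[(unsigned char)t[i]];
            int step = (j < 1 + l) ? j : 1 + l;   /* min(j, 1 + l) */
            i += m - step;
            j = m - 1;
        }
    }
    return -1;
}
```

    For the example text "a pattern matching algorithm" and pattern "rithm", this sketch returns index 23, where the final "rithm" begins.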

  • Rabin-Karp Algorithm

    • It speeds up testing the pattern for equality against substrings of the text by using a hash function. A hash function converts every string into a numeric value, called its hash value; e.g. hash("hello") = 5. The algorithm exploits the fact that if two strings are equal, their hash values are also equal, so a substring whose hash differs from the pattern's hash cannot be a match.

    • There are two problems: different strings can also produce the same hash value, so every hash hit must be verified by comparing characters, and there is the extra cost of calculating the hash of each length-m substring of the text.

    • Worst case is O(mn)

  • Rabin Karp Example

    Here we are matching 31415 in the given string. There can be many hash hits, but only a few of them may be valid matches; the rest are spurious hits.
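    One common way to keep the per-substring hashing cost low is a rolling hash, which updates the hash of the current window in constant time. The slides do not specify a hash function, so the sketch below makes its own choices: the base RK_BASE, the prime modulus RK_MOD and the name rabin_karp_match are all assumptions made for illustration.

```c
#include <string.h>

#define RK_BASE 256u        /* assumption: radix = number of byte values */
#define RK_MOD  1000003u    /* assumption: a prime modulus picked for this sketch */

/* Returns the index of the first occurrence of p in t, or -1 if none. */
int rabin_karp_match(const char *t, const char *p)
{
    size_t n = strlen(t);
    size_t m = strlen(p);

    if (m == 0)
        return 0;
    if (m > n)
        return -1;

    /* h = RK_BASE^(m-1) mod RK_MOD, the weight of the window's first character */
    unsigned long long h = 1;
    for (size_t k = 1; k < m; k++)
        h = (h * RK_BASE) % RK_MOD;

    /* hash of the pattern and of the first text window */
    unsigned long long hp = 0, ht = 0;
    for (size_t k = 0; k < m; k++) {
        hp = (hp * RK_BASE + (unsigned char)p[k]) % RK_MOD;
        ht = (ht * RK_BASE + (unsigned char)t[k]) % RK_MOD;
    }

    for (size_t s = 0; s + m <= n; s++) {
        /* a hash hit may be spurious, so verify it character by character */
        if (hp == ht && memcmp(t + s, p, m) == 0)
            return (int)s;
        if (s + m < n) {
            /* roll the window: drop t[s], then append t[s + m] */
            ht = (ht + RK_MOD - ((unsigned char)t[s] * h) % RK_MOD) % RK_MOD;
            ht = (ht * RK_BASE + (unsigned char)t[s + m]) % RK_MOD;
        }
    }
    return -1;
}
```

    For instance, on the sample digit string "2359023141526739921" (chosen here for illustration, since the slide's figure is not part of the text), searching for "31415" returns index 6.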

  • KMP Algorithm

    • It never re-compares a text symbol that has matched a pattern symbol. As a result, the searching phase of KMP has complexity O(n). The preprocessing phase has complexity O(m). Since m ≤ n, the overall complexity is O(n).

    • A border of x is a substring that is both a proper prefix and a proper suffix of x. We call its length b the width of the border.

    • Let x = abacab. The proper prefixes of x are ε, a, ab, aba, abac, abaca

    • The proper suffixes of x are ε, b, ab, cab, acab, bacab

    • The borders of x are ε, ab

  • Border Concept

    If s is the widest border of x, the next-widest border r of x is obtained as the widest border of s. For example, the widest border of x = abacab is ab, and the widest border of ab is ε, which is therefore the next-widest border of x.

  • Border Extension

    Let x be a string and a ∈ A a symbol. A border r of x can be extended by a if ra is a border of xa

  • Border calculation

    In the pre-processing phase an array b of length m+1 is computed. Each entry b[i] contains the width of the widest border of the prefix of length i of the pattern (i = 0, ..., m). Since the prefix ε of length i = 0 has no border, we set b[0] = -1.

  • KMP-Preprocess Algorithm

    void kmpPreprocess()
    {
        // p is the pattern of length m, b the border array of length m+1
        int i=0, j=-1;
        b[i]=j;
        while (i<m)
        {
            while (j>=0 && p[i]!=p[j])
                j=b[j];   // border cannot be extended: try the next-widest border
            i++;
            j++;
            b[i]=j;
        }
    }

    • For pattern p = ababaa the widths of the borders in array b have the following values. For instance we have b[5] = 3, since the prefix ababa of length 5 has a border of width 3

    j      0  1  2  3  4  5  6
    p[j]   a  b  a  b  a  a
    b[j]  -1  0  0  1  2  3  1
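    The border array also gives every border of a string, not only the widest one, because the next-widest border is the widest border of the current one. The sketch below is a self-contained variant of the preprocessing routine that takes its inputs as parameters instead of globals; the names border_table and border_chain are assumptions introduced for illustration.

```c
/* Fills b[0 .. m] so that b[i] is the width of the widest border of the
 * prefix of length i of p (b[0] = -1). b must have room for m + 1 ints. */
void border_table(const char *p, int m, int *b)
{
    int i = 0, j = -1;
    b[0] = -1;
    while (i < m) {
        /* the border of width j cannot be extended by p[i]: fall back */
        while (j >= 0 && p[i] != p[j])
            j = b[j];
        i++;
        j++;
        b[i] = j;
    }
}

/* Writes the widths of all borders of the whole pattern, widest first,
 * into out (room for m ints is enough) and returns how many there are. */
int border_chain(const int *b, int m, int *out)
{
    int count = 0;
    /* follow widest border -> widest border of that border -> ... */
    for (int w = b[m]; w >= 0; w = b[w])
        out[count++] = w;
    return count;
}
```

    For x = abacab the chain yields the widths 2 and 0, which correspond to the borders ab and ε.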

  • Preprocessing

    • The pre-processing algorithm could be applied to the string pt instead of p. If only borders up to a width of m are computed, then a border of width m of some prefix x of pt corresponds to a match of the pattern in t (provided that the border is not self-overlapping)

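  • KMP Search Algorithm

    void kmpSearch()
    {
        // t is the text of length n, p the pattern of length m, b the border array
        int i=0, j=0;
        while (i<n)
        {
            while (j>=0 && t[i]!=p[j])
                j=b[j];   // mismatch: shift the pattern to its next-widest border
            i++;
            j++;
            if (j==m)
            {
                report(i-j);   // match found at text position i-j
                j=b[j];
            }
        }
    }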

  • KMP Search Algorithm

    • When a mismatch at position j occurs in the inner while loop, the widest border of the matching prefix of length j of the pattern is considered. Resuming comparisons at position b[j], the width of the border, yields a shift of the pattern such that the border matches. If a mismatch occurs again, the next-widest border is considered, and so on, until there is no border left or the next symbol matches. Then we have a new matching prefix of the pattern and continue with the outer while loop.

    • If all m symbols of the pattern have matched the corresponding text window (j = m), a function report is called to report the match at position i-j. Afterwards, the pattern is shifted as far as its widest border allows.
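    Putting the preprocessing and search phases together gives a complete matcher. The following is a sketch, assuming the hypothetical name kmp_search and an interface in which match positions are stored in a caller-supplied array rather than passed to a report function.

```c
#include <stdlib.h>
#include <string.h>

/* Stores the start positions of matches of p in t in hits (at most
 * max_hits of them) and returns the total number of matches found,
 * or -1 if memory for the border array could not be allocated. */
int kmp_search(const char *t, const char *p, int *hits, int max_hits)
{
    int n = (int)strlen(t);
    int m = (int)strlen(p);
    if (m == 0)
        return 0;                       /* nothing to search for */

    int *b = (int *)malloc((size_t)(m + 1) * sizeof *b);
    if (b == NULL)
        return -1;

    /* preprocessing phase: border widths of all prefixes of p, O(m) */
    int i = 0, j = -1;
    b[0] = -1;
    while (i < m) {
        while (j >= 0 && p[i] != p[j])
            j = b[j];
        i++;
        j++;
        b[i] = j;
    }

    /* searching phase: i never moves backwards in the text, O(n) */
    int found = 0;
    i = 0;
    j = 0;
    while (i < n) {
        while (j >= 0 && t[i] != p[j])
            j = b[j];                   /* fall back to the next-widest border */
        i++;
        j++;
        if (j == m) {                   /* all m symbols matched */
            if (found < max_hits)
                hits[found] = i - j;
            found++;
            j = b[j];                   /* shift as far as the widest border allows */
        }
    }

    free(b);
    return found;
}
```

    Overlapping occurrences are reported as well: searching for "aa" in "aaaa" yields the positions 0, 1 and 2.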

  • KMP Example

    T = a b a c a a b a c c a b a c a b a a b b, P = a b a c a b. The figure shows five successive alignments of P against T and 19 character comparisons (numbered 1 to 19) before the match is found.

    x      0  1  2  3  4  5
    P[x]   a  b  a  c  a  b
    f(x)   0  0  1  0  1  2

    Here f(x) is the width of the widest border of P[0 .. x].

  • Questions, comments and Suggestions

  • Question 1

    How many nonempty prefixes of the string P = “aaabbaaa” are also suffixes of P?

  • Question 2

    What is the longest proper prefix of the string cgtacgttcgtacg that is also a suffix of this string?

  • Question 3

    What is the complexity of the KMP algorithm if we have a main string of length s and we wish to find a pattern of length p?

    A) O( s+p)

    B) O(p)

    C) O(sp)

    D) O(s)
