17458 String Matching

  • 8/14/2019 17458 String Matching

    Outline

    String Matching

    Introduction

    Naive Algorithm

    Rabin-Karp Algorithm

    Knuth-Morris-Pratt (KMP) Algorithm


    Introduction

    What is string matching?

    Finding all occurrences of a pattern in a given text (or body of text)

    Many applications:

    While using an editor, word processor, or browser

    Login name and password checking

    Virus detection

    Header analysis in data communications

    DNA sequence analysis, Web search engines (e.g. Google), image analysis


    String-Matching Problem

    The text is in an array T[1..n] of length n

    The pattern is in an array P[1..m] of length m

    Elements of T and P are characters from a finite alphabet Σ

    E.g., Σ = {0, 1} or Σ = {a, b, ..., z}

    Usually T and P are called strings of characters


    String-Matching Problem (contd.)

    We say that pattern P occurs with shift s in text T if:

    a) 0 ≤ s ≤ n-m, and

    b) T[(s+1)..(s+m)] = P[1..m]

    If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift

    String-matching problem: finding all valid shifts for a given T and P


    Example 1

    index:       1  2  3  4  5  6  7  8  9 10 11 12 13
    text T:      a  b  c  a  b  a  a  b  c  a  b  a  c
    pattern P:            a  b  a  a   (s = 3)

    Shift s = 3 is a valid shift

    (n = 13, m = 4, and 0 ≤ s ≤ n-m holds)


    Example 2

    index:       1  2  3  4  5  6  7  8  9 10 11 12 13
    text T:      a  b  c  a  b  a  a  b  c  a  b  a  a
    pattern P:            a  b  a  a   (s = 3)
                                            a  b  a  a   (s = 9)


    Naive String-Matching Algorithm

    Input: Text strings T[1..n] and P[1..m]

    Result: All valid shifts displayed

    NAIVE-STRING-MATCHER(T, P)
      n ← length[T]
      m ← length[P]
      for s ← 0 to n-m
        if P[1..m] = T[(s+1)..(s+m)]
          print "pattern occurs with shift" s
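    The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the original slides: the function name `naive_string_matcher` is our own, indices are 0-based, and the shifts are returned as a list instead of printed.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):           # s = 0 .. n-m
        # With 0-based slices, text[s:s+m] corresponds to T[(s+1)..(s+m)].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```

    The two calls reuse the texts from Example 1 and Example 2.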


    Naive Algorithm

    The naive algorithm checks, at every position in the text from 0 to n-m, whether an occurrence of the pattern starts there.

    After each attempt, it shifts the pattern by exactly one position to the right.

    Example (from left to right):

    a b c a b c a
    a b c a          (shift = 0)
      a b c a        (shift = 1)
        a b c a      (shift = 2)
          a b c a    (shift = 3)


    Analysis: Worst-case Example

    index:       1  2  3  4  5  6  7  8  9 10 11 12 13
    text T:      a  a  a  a  a  a  a  a  a  a  a  a  a
    pattern P:   a  a  a  b
                    a  a  a  b
                       a  a  a  b


    Worst-case Analysis

    There are m comparisons for each shift in the worst case

    There are n-m+1 shifts

    So, the worst-case running time is Θ((n-m+1)m)

    In the example on the previous slide, we have (13-4+1)·4 = 40 comparisons in total

    The naive method is inefficient because information from a shift is not used again
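    To check the count for the worst-case example, one can instrument the naive matcher so that it tallies character comparisons. The helper name `count_comparisons` below is hypothetical, and the sketch assumes each shift stops at its first mismatch.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by the naive matcher over all shifts."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                     # first mismatch ends this shift
    return comparisons


# Text of 13 a's, pattern "aaab": every shift needs all m = 4 comparisons.
print(count_comparisons("a" * 13, "aaab"))  # 40
```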


    Analysis

    Brute force pattern matching runs in time O(mn) in the worst case.

    But most searches of ordinary text take O(m+n), which is very quick.


    Brute-force Analysis (Best)

    Best Case

    Example 1: Pattern found in the first position of the text

    Text: 0000000000000000001

    Pattern: 000

    Cost = O(M)

    Example 2: Pattern not found, and always a mismatch on the first character

    Text: 0000000000000000001

    Pattern: 11

    Cost = O(N+M)
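    Both best cases can be reproduced with a small sketch that stops at the first occurrence and reports how many comparisons were made. The name `find_first` and the convention of returning -1 when nothing is found are our own assumptions.

```python
def find_first(text, pattern):
    """Return (shift, comparisons); shift is -1 if the pattern never occurs."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[s + j] != pattern[j]:
                break
            j += 1
        if j == m:                        # all m characters matched
            return s, comparisons
    return -1, comparisons


text = "0" * 18 + "1"                     # eighteen 0s followed by a 1
print(find_first(text, "000"))            # (0, 3): M comparisons
print(find_first(text, "11"))             # (-1, 18): one comparison per shift
```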


    Naive Algorithm

    Example (from right to left):

    a b c a b c a
          a b c a    (shift = 3)
        a b c a      (shift = 2)
      a b c a        (shift = 1)
    a b c a          (shift = 0)

    The pattern occurs with shifts 0 and 3


    Rabin-Karp Algorithm

    Has a worst-case running time of O((n-m+1)m), but the average case is O(n+m)

    Also works well in practice

    Based on the number-theoretic notion of modular equivalence

    We assume that Σ = {0, 1, 2, ..., 9}, i.e., each character is a decimal digit

    In general, use radix d, where d = |Σ|


    Rabin-Karp Approach

    We can view a string of k characters (digits) as a length-k decimal number

    E.g., the string 31425 corresponds to the decimal number 31,425

    Given a pattern P[1..m], let p denote the corresponding decimal value

    Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)] for s = 0, 1, ..., (n-m)


    Rabin-Karp Approach (contd.)

    t_s = p iff T[(s+1)..(s+m)] = P[1..m]

    s is a valid shift iff t_s = p

    p can be computed in O(m) time: p = P[m] + 10(P[m-1] + 10(P[m-2] + ...))

    t_0 can similarly be computed in O(m) time

    The other values t_1, t_2, ..., t_{n-m} can be computed in O(n-m) time, since t_{s+1} can be computed from t_s in constant time
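    The nested formula for p is Horner's rule. A minimal Python sketch, assuming the pattern is given as a string of decimal digits (the name `decimal_value` is ours), processes the digits left to right, which evaluates the same expression:

```python
def decimal_value(digits):
    """Evaluate a digit string as a decimal number using Horner's rule."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)      # one multiply-add per digit: O(m)
    return value


print(decimal_value("31425"))  # 31425
```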


    Rabin-Karp Approach (contd.)

    t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]

    E.g., if T = {..., 3, 1, 4, 1, 5, 2, ...}, m = 5 and t_s = 31,415, then t_{s+1} = 10(31415 - 10000·3) + 2 = 14152

    Thus we can compute p in Θ(m) time and can compute t_0, t_1, t_2, ..., t_{n-m} in Θ(n-m+1) time

    And we can find all occurrences of the pattern P[1..m] in text T[1..n] with Θ(m) preprocessing time and Θ(n-m+1) matching time.

    But there is a problem: this assumes p and t_s are small numbers

    They may be too large to work with easily
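    The constant-time update can be sketched as follows, without any modulus yet. The function name `window_values` is hypothetical; with 0-based indexing, T[s+1] becomes `text[s]` and T[s+m+1] becomes `text[s + m]`.

```python
def window_values(text, m):
    """Return [t_0, t_1, ..., t_{n-m}] for a string of decimal digits."""
    n = len(text)
    if m < 1 or m > n:
        return []
    high = 10 ** (m - 1)                  # weight of the high-order digit
    t = 0
    for ch in text[:m]:                   # t_0 by Horner's rule
        t = 10 * t + int(ch)
    values = [t]
    for s in range(n - m):
        # Drop the old high-order digit, shift left, append the new digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```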


    Rabin-Karp Approach (contd.)

    Solution: we can use modular arithmetic with a suitable modulus, q

    E.g.,

    t_{s+1} ≡ (10(t_s - T[s+1]h) + T[s+m+1]) (mod q)

    where h ≡ 10^{m-1} (mod q)

    q is chosen as a small prime number; e.g., 13 for radix 10

    Generally, if the radix is d, then dq should fit within one computer word


    How values modulo 13 are computed

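    3 1 4 1 5 2

    31415 ≡ 7 (mod 13) and 14152 ≡ 8 (mod 13)

    14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
          ≡ ((7 - 3·3)·10 + 2) (mod 13)
          ≡ 8 (mod 13)

    Here 3 is the old high-order digit and 2 is the new low-order digit; the factor 3 in 3·3 is h = 10000 mod 13.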
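    The same update under a modulus might look like the sketch below. The name `window_hashes` and its default parameters (radix d = 10, modulus q = 13) are assumptions taken from the slide's example; Python's `%` always returns a non-negative result, so no extra correction is needed when the difference is negative.

```python
def window_hashes(text, m, d=10, q=13):
    """Return t_s mod q for every length-m window of a digit string."""
    n = len(text)
    if m < 1 or m > n:
        return []
    h = pow(d, m - 1, q)                  # h = d^(m-1) mod q
    t = 0
    for ch in text[:m]:                   # hash of the first window
        t = (d * t + int(ch)) % q
    hashes = [t]
    for s in range(n - m):
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


print(window_hashes("314152", 5))  # [7, 8]
```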


    Problem of Spurious Hits

    t_s ≡ p (mod q) does not imply that t_s = p

    Modular equivalence does not necessarily mean that two integers are equal

    A case in which t_s ≡ p (mod q) when t_s ≠ p is called a spurious hit

    On the other hand, if two integers are not modular equivalent, then they cannot be equal


    Example

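    pattern:     3 1 4 1 5   (p mod 13 = 7)

    position:    1  2  3  4  5  6  7  8  9 10 11 12 13 14
    text:        2  3  1  4  1  5  2  6  7  3  9  9  2  1

    shift s:     0  1  2  3  4  5  6  7  8  9
    t_s mod 13:  1  7  8  4  5 10 11  7  9 11

    s = 1: valid match (window 3 1 4 1 5)

    s = 7: spurious hit (window 6 7 3 9 9 also gives 7 mod 13)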


    Rabin-Karp Algorithm

    Basic structure like the naive algorithm, but uses modular arithmetic as described

    For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit

    In the worst case, every shift is verified

    Running time can be shown as O((n-m+1)m)

    Average-case running time is O(n+m)
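    Putting the pieces together, a Rabin-Karp matcher for digit strings could be sketched as below. This is an illustrative version under our own assumptions (function name `rabin_karp`, radix 10, modulus 13, and a return value that separates valid shifts from spurious hits); it is not taken from the slides.

```python
def rabin_karp(text, pattern, d=10, q=13):
    """Return (valid_shifts, spurious_hits) for digit strings text and pattern."""
    n, m = len(text), len(pattern)
    valid, spurious = [], []
    if m == 0 or m > n:
        return valid, spurious
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):                    # preprocessing: p and t_0 mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    for s in range(n - m + 1):
        if t == p:                        # a hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                     # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

    On the slide's example text, shift 1 is the valid match and shift 7 is the spurious hit.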


    Example 2

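    Let T = a b c b a b and P = a b c

    Take a = 97, b = 98, c = 99 (i.e. the ASCII values of the characters).

    Radix d = |Σ| = 256.

    Integer value of P: p = c + 256(b + 256a) = 99 + 256(98 + 256·97) = 6,382,179

    The hash of P is p mod q for a prime modulus q. The modulus should not be the radix itself: p mod 256 = 99, which depends only on the last character.

    In similar fashion, we can calculate the hash value of each length-m window of the text and compare to check for a valid or spurious hit (as in previous slides).

    Analysis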

    In the worst case, every shift is verified

    Running time can be shown as O((n-m+1)m)

    Average-case running time is O (n + m)
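    For character data, the same Horner evaluation works with radix 256 and `ord()` in place of digit values. The sketch below is only an illustration: `string_value` is a hypothetical helper, and the prime modulus 13 is borrowed from the earlier decimal example, not from this slide.

```python
def string_value(s, d=256):
    """Interpret s as a radix-d number, using each character's code point."""
    value = 0
    for ch in s:
        value = d * value + ord(ch)
    return value


p = string_value("abc")                   # a = 97, b = 98, c = 99
print(p)          # 6382179
print(p % 256)    # 99: modulus equal to the radix keeps only the last character
print(p % 13)     # 11: hash under the prime modulus 13
```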


    3. The KMP Algorithm

    The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in left-to-right order (like the brute force algorithm).

    But it shifts the pattern more intelligently than the brute force algorithm.


    If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

    Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]


    Example

    [Figure: text T aligned with pattern P; after a mismatch at j = 5, the pattern is shifted so that j_new = 2.]


    Why

    (j == 5)

    Find the largest prefix (start) of "a b a a b" (P[0 .. j-1]) which is a suffix (end) of "b a a b" (P[1 .. j-1])

    Answer: "a b"

    Set j = 2 // the new j value


    KMP Failure Function

    KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

    j = mismatch position in P[]

    k = position before the mismatch (k = j-1).

    The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
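    The definition can be turned into code almost word for word. The sketch below is deliberately the slow, definition-based version (the name `failure_by_definition` is ours); practical KMP implementations compute the same table in O(m) time.

```python
def failure_by_definition(pattern):
    """F(k) = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    table = []
    for k in range(len(pattern)):
        prefix_source = pattern[:k + 1]   # P[0..k]
        suffix_source = pattern[1:k + 1]  # P[1..k]
        best = 0
        # Try candidate sizes from largest to smallest; the first hit wins.
        for size in range(len(suffix_source), 0, -1):
            if suffix_source.endswith(prefix_source[:size]):
                best = size
                break
        table.append(best)
    return table


print(failure_by_definition("abaaba"))  # [0, 0, 1, 1, 2, 3]
print(failure_by_definition("abacab"))  # [0, 0, 1, 0, 1, 2]
```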


    Failure Function Example

    P: "abaaba"
    j:  012345

    F(k) is the size of the largest prefix (k == j-1).

    j:     0 1 2 3 4
    F(j):  0 0 1 1 2

    In code, F() is represented by an array, like the table.


    Why is F(4) == 2?

    P: "abaaba"

    F(4) means

    find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]

    = find the size of the largest prefix of "abaab" that is also a suffix of "baab"

    = find the size of "ab"

    = 2


    Using the Failure Function

    The Knuth-Morris-Pratt algorithm modifies the brute-force algorithm.

    If a mismatch occurs at P[j] (i.e. P[j] != T[i]), then

        k = j-1;
        j = F(k);   // obtain the new j
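    A compact KMP matcher built around that rule might look like the sketch below. The function names, the linear-time table construction, and the choice to report all matches are our own assumptions; when j is already 0 there is no k = j-1, so the sketch simply advances in the text.

```python
def failure_function(pattern):
    """Compute the failure table F in O(m) time."""
    F = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = F[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        F[i] = j
    return F


def kmp_search(text, pattern):
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    F = failure_function(pattern)
    matches = []
    i = j = 0
    while i < n:                          # i never moves backwards
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:                    # full match ending at i-1
                matches.append(i - m)
                j = F[m - 1]
        elif j > 0:
            j = F[j - 1]                  # k = j-1; j = F(k)
        else:
            i += 1                        # mismatch at P[0]: advance in the text
    return matches


print(kmp_search("abacaabaccabacabaabb", "abacab"))  # [10]
```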


    Example

    T: a b a c a a b a c c a b a c a b a a b b
    P: a b a c a b

    k:     0 1 2 3 4
    F(k):  0 0 1 0 1

    [Figure: P is slid along T using F(k); the character comparisons are numbered 1 to 19.]


    Why is F(4) == 1?

    P: "abacab"

    F(4) means

    find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]

    = find the size of the largest prefix of "abaca" that is also a suffix of "baca"

    = find the size of "a"

    = 1


    KMP Advantages

    KMP runs in optimal time: O(m+n), which is very fast

    The algorithm never needs to move backwards in the input text, T

    This makes the algorithm good for processing very large files that are read in from external devices or through a network stream


    KMP Disadvantages

    KMP doesn't work so well as the size of the alphabet increases

    There is more chance of a mismatch (more possible mismatches)

    Mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later