15 String Matching


  • 7/28/2019 15 String Matching


    Jim Anderson Comp 750, Fall 2009 String Matching - 1

    Chapter 32: String Matching

    Given: Two strings T[1..n] and P[1..m] over alphabet Σ.

    Want to find all occurrences of P[1..m], "the pattern", in T[1..n], "the text".

    Example: Σ = {a, b, c}

    text T:     a b c a b a a b c a a b a c
    pattern P:        a b a a                  (s = 3)

    Terminology:

    - P occurs with shift s.

    - P occurs beginning at position s+1.

    - s is a valid shift.

    Goal: Find all valid shifts.

    Applications: Text editors, search for patterns in DNA sequences

    (actually, this is stretching the truth a little),


    Notation and Terminology

    w pre x -- w is a prefix of x.

    Example: aba pre abaabc.

    w suf x -- w is a suffix of x.

    Example: abc suf abaabc.

    Note: In the book the symbol ⊏ is used instead of pre, and the symbol ⊐ is used instead of suf.

    I couldn't figure out an easy way to reproduce these symbols in Powerpoint.


    Lemma 32.1

    Lemma 32.1: Suppose x suf z and y suf z. If |x| ≤ |y| then x suf y. If |x| ≥ |y| then y suf x. If |x| = |y| then x = y.

    [Figure: three panels, one per case, each showing x and y aligned as suffixes of z.]


    More Notation

    P_k = P[1..k], where k ≤ m.

    Thus, P_0 = ε, P_m = P[1..m] = P.

    Similarly T_k = T[1..k], where k ≤ n.

    Our Problem: Find all s, where 0 ≤ s ≤ n−m, such that P suf T_{s+m}.

    Assumption: We assume the test "x = y" takes Θ(t + 1) time, where t is the length of the longest string z such that z pre x and z pre y.


    Naïve Brute-Force Algorithm

    Naïve(T, P)
        n := length[T];
        m := length[P];
        for s := 0 to n−m do
            if P[1..m] = T[s+1..s+m] then
                print "pattern occurs with shift s"
            fi
        od

    Running time is Θ((n−m+1)m) in the worst case.

    Bound is tight. Consider: T = a^n, P = a^m.
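The brute-force loop can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `naive_match` is made up here, and shifts are reported as 0-based offsets, which coincide with the shift s used on the slides.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s of pattern in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # The slice comparison plays the role of P[1..m] = T[s+1..s+m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("abcabaabcaabac", "abaa"))  # [3]
```

On the first slide's example this reports the single valid shift s = 3.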

    Example

    T = a c a a b c, P = a a b

    s = 0: no match (P[2] = a, T[2] = c)
    s = 1: no match (P[1] = a, T[2] = c)
    s = 2: match! (T[3..5] = a a b)
    s = 3: no match (P[2] = a, T[5] = b)


    Rabin-Karp Algorithm

    Suppose Σ = {0, 1, 2, …, 9}.

    Let us view P as a decimal number.

    Example: View P = "31415" as 31,415.

    Can also view substrings of T as decimal numbers.

    Let t_s = the decimal number corresponding to T[s+1..s+m].

    Let p = the decimal number corresponding to P.

    We want to know all s such that t_s = p.

    We can compute p in O(m) time using Horner's Rule:

        p = P[m] + 10(P[m−1] + 10(P[m−2] + … + 10(P[2] + 10 P[1]) …))

    Can similarly compute t_0 in O(m) time.
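Horner's Rule amounts to a single left-to-right pass over the digits. The sketch below assumes the pattern is given as a string of decimal digits; the name `horner_value` and the `radix` parameter are illustrative choices, not part of the slides.

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string as a number using Horner's Rule in O(m) steps."""
    value = 0
    for ch in digits:  # processes P[1], P[2], ..., P[m] in order
        value = radix * value + int(ch)
    return value


print(horner_value("31415"))  # 31415
```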


    RK Algorithm (Continued)

    Can compute t_1, t_2, …, t_{n−m} in O(n−m) time as follows:

        t_{s+1} = 10(t_s − 10^{m−1} T[s+1]) + T[s+m+1].

    Example: T = 314152, m = 5

        t_{s+1} = 10(31415 − 10000·3) + 2 = 14152

    Time Complexity:

        O(n+m) + O(n−m+1) = O(n+m)

    where O(n+m) is the time to compute p and t_0, …, t_{n−m}, and O(n−m+1) is the time to perform n−m+1 comparisons.
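The rolling update can be checked on the example with a few lines of Python. This is only a sketch for decimal digit strings with m ≥ 1; `window_values` is a hypothetical helper name.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, ..., t_{n-m}: the decimal values of all length-m windows."""
    n = len(text)
    if m < 1 or m > n:
        return []
    high = 10 ** (m - 1)          # weight of the leading digit, 10^(m-1)
    t = int(text[:m])             # t_0
    values = [t]
    for s in range(n - m):
        # Drop the leading digit, shift left one place, append the next digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```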


    Two Problems

    Might have |Σ| = d ≠ 10. Solution: Use radix-d arithmetic.

    Numbers may be very large.

    Solution: Perform computations modulo q for some q.


    What is q?

    Select q to be a large prime such that dq fits in one memory word.

    Then all computations can be performed using single-precision arithmetic.

    To summarize, p is computed using

        p = (P[m] + d(P[m−1] + d(P[m−2] + … + d(P[2] + d P[1]) …))) mod q

    t_0 is computed similarly.

    Other t_i's are computed using

        t_{s+1} = (d(t_s − T[s+1]h) + T[s+m+1]) mod q, where h ≡ d^{m−1} (mod q)

    Unfortunately, we have a new problem: spurious hits.


    Example

    text T:     2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
    pattern P:  3 1 4 1 5,  31415 mod 13 = 7

    T[7..11]  = 3 1 4 1 5,  31415 mod 13 = 7: valid match (s = 6)
    T[13..17] = 6 7 3 9 9,  67399 mod 13 = 7: spurious hit (s = 12)


    Algorithm

    We deal with spurious hits by performing an explicit check whenever there is a potential match.

    RK(T, P, d, q)
        n := length[T];
        m := length[P];
        h := d^{m−1} mod q;
        p := 0;
        t_0 := 0;
        for i := 1 to m do
            p := (d·p + P[i]) mod q;
            t_0 := (d·t_0 + T[i]) mod q
        od;
        for s := 0 to n−m do
            if p = t_s then
                if P[1..m] = T[s+1..s+m] then
                    print "pattern occurs with shift s"
                fi
            fi;
            if s < n−m then
                t_{s+1} := (d(t_s − T[s+1]h) + T[s+m+1]) mod q
            fi
        od
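A runnable sketch of the same procedure, assuming digit strings so that each character can be read with `int`. Returning the spurious hits alongside the valid shifts is an addition made here purely for illustration; the name `rabin_karp` and its return shape are not from the slides.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid_shifts, spurious_hits) for digit strings in radix d, modulo q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)                      # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # Horner's Rule for p and t_0
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:
            # Hash values agree: confirm with an explicit comparison.
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Python's % never returns a negative value for positive q.
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


print(rabin_karp("2359023141526739921", "31415"))  # ([6], [12])
```

With q = 13 this reproduces the earlier example: one valid match at s = 6 and one spurious hit at s = 12.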


    Running Time

    Worst-Case: Θ((n−m+1)m). (Again, consider P = a^m, T = a^n.)

    Average-Case:

    Some assumptions:

    - Assume O(1) valid shifts.

    - Think of 0, 1, …, q−1 like hash buckets.

    - Assume each bucket is equally likely.

    We expect O(n/q) spurious hits.

    Expected running time is:

        O(n) + O(m(number of valid shifts + n/q)) = O(n+m), choosing q ≥ m


    Finite Automata Algorithm

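    [Figure: state diagram for P = a b a b a c a with states 0 through 7. The spine 0 → 1 → … → 7 is labeled a, b, a, b, a, c, a; the back edges are labeled a or b.]

    Transition table (rows: state; columns: input; last column: the pattern character leaving that state along the spine):

    state   a   b   c   P
      0     1   0   0   a
      1     1   2   0   b
      2     3   0   0   a
      3     1   4   0   b
      4     5   0   0   a
      5     1   4   6   c
      6     7   0   0   a
      7     1   2   0

    i             --  1  2  3  4  5  6  7  8  9 10 11
    T[i]          --  a  b  a  b  a  b  a  c  a  b  a
    state φ(T_i)   0  1  2  3  4  5  4  5  6  7  2  3

    Processing takes Θ(n) time. But have to first construct FA.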

    Main Issue: How to construct FA?
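Once the transition table exists, scanning the text is a single table lookup per character. The sketch below hard-codes the table from this slide; `DELTA` and `run_automaton` are names chosen here for illustration. It relies on the accepting state being numbered m, so a match ending at position i has shift i − m.

```python
# Transition table copied from the slide: DELTA[state][symbol] -> next state.
DELTA = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
    {"a": 3, "b": 0, "c": 0},
    {"a": 1, "b": 4, "c": 0},
    {"a": 5, "b": 0, "c": 0},
    {"a": 1, "b": 4, "c": 6},
    {"a": 7, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
]


def run_automaton(text, delta, accept):
    """Return (state after each prefix of text, list of valid shifts)."""
    state = 0
    states = [state]
    shifts = []
    for i, ch in enumerate(text, start=1):   # i is the 1-based text position
        state = delta[state][ch]
        states.append(state)
        if state == accept:                  # accept == m, the pattern length
            shifts.append(i - accept)
    return states, shifts


states, shifts = run_automaton("abababacaba", DELTA, accept=7)
print(states)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 2, 3]
print(shifts)  # [2]
```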


    Need some Notation

    φ(w) = state FA ends up in after processing w.

    Example: φ(abab) = 4.

    σ(x) = max{k: P_k suf x}. Called the suffix function.

    Examples: Let P = ab.

        σ(ε) = 0
        σ(ccaca) = 1
        σ(ccab) = 2

    Note: If |P| = m, then σ(x) = m indicates a match.

        T:      a b a b b a b b a c
        States: 0 1 … m … m …        (each m marks a match)

    Note Also: x suf y ⇒ σ(x) ≤ σ(y).
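The suffix function can be computed directly from its definition by trying the candidate lengths k from largest to smallest. This is a deliberately simple sketch (the name `sigma` is an assumption), not an efficient method.

```python
def sigma(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    k = min(len(pattern), len(x))
    # k = 0 always succeeds, because the empty string is a suffix of every x.
    while not x.endswith(pattern[:k]):
        k -= 1
    return k


print(sigma("ab", ""), sigma("ab", "ccaca"), sigma("ab", "ccab"))  # 0 1 2
```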


    FA Construction

    Given: P[1..m]

    Let Q = states = {0, 1, …, m}, where 0 is the initial state and m is the final state.

    Define transition function δ as follows:

        δ(q, a) = σ(P_q a) for each q and a.

    Example: δ(5, b) = σ(P_5 b) = σ(ababab) = 4

    Intuition: Encountering a 'b' in state 5 means the current substring doesn't match. But, you know this substring ends with "abab" -- and this is the longest suffix that matches the beginning of P. Thus, we go to state 4 and continue processing "abab…".


    Time Complexity

    FA takes O(m|Σ|) time to construct.

    (Book only gives an O(m^3 |Σ|) algorithm.)

    Total time is O(n + m|Σ|).
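For illustration, the transition function can be built straight from its definition by searching for the longest prefix of P that is a suffix of P_q a. Note that this is the simple cubic-style construction, not the faster O(m|Σ|) one mentioned above; `build_delta` is a hypothetical name.

```python
def build_delta(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Build the transition table from the definition (slow but simple)."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a          # the string P_q a
            k = min(m, q + 1)                    # the new state is at most q + 1
            while not candidate.endswith(pattern[:k]):
                k -= 1                           # terminates at k = 0 at the latest
            row[a] = k
        delta.append(row)
    return delta


delta = build_delta("ababaca", "abc")
print(delta[5])  # {'a': 1, 'b': 4, 'c': 6}
```

For P = ababaca over {a, b, c} this reproduces the transition table shown on the Finite Automata Algorithm slide.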


    Correctness

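    Lemma 32.2: σ(xa) ≤ σ(x) + 1.

    Proof:

    Let r = σ(xa).

    Case: r = 0. Clearly σ(xa) ≤ σ(x) + 1.

    Case: r > 0.

    [Figure: P_r aligned as a suffix of xa; dropping the final a leaves P_{r−1} aligned as a suffix of x.]

    We have:

        P_r suf xa ⇒ P_{r−1} suf x ⇒ r−1 ≤ σ(x).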


    Another Lemma

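    Lemma 32.3: q = σ(x) ⇒ σ(xa) = σ(P_q a).

    Proof:

    Let q = σ(x). P_q suf x ⇒ P_q a suf xa.

    Let r = σ(xa). By Lemma 32.2, r ≤ q + 1. We have:

    [Figure: P_q a and P_r both aligned as suffixes of xa, with |P_r| ≤ |P_q a|.]

    Since P_r suf xa and P_q a suf xa with |P_r| ≤ |P_q a|, Lemma 32.1 gives P_r suf P_q a, so r ≤ σ(P_q a). Conversely, P_q a suf xa gives σ(P_q a) ≤ σ(xa) = r.

    ⇒ σ(P_q a) = r.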


    Main Theorem

    Theorem 32.4: φ(T_i) = σ(T_i) for all i = 0, 1, …, n.

    Implies: the FA is in the accepting state if and only if the string processed so far has a match at position (length of string) − m.

    Proof:

    Induction on i.

    Basis: i = 0. T_0 = ε. φ(T_0) = σ(T_0) = 0.

    Step: Assume φ(T_i) = σ(T_i).

    Let q = φ(T_i), a = T[i+1].

    Then, σ(T_i) = φ(T_i) = q, which by Lemma 32.3, implies

        σ(T_i a) = σ(P_q a). (**)


    Proof Continued

    φ(T_{i+1}) = φ(T_i a)          , T_{i+1} = T_i a
               = δ(φ(T_i), a)      , φ(wa) = δ(φ(w), a)
               = δ(q, a)           , q = φ(T_i)
               = σ(P_q a)          , δ(q, a) = σ(P_q a)
               = σ(T_i a)          , by (**)
               = σ(T_{i+1})        , T_{i+1} = T_i a
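The theorem can also be spot-checked numerically: run the automaton one character at a time and compare its state with the suffix function of the prefix read so far. This is a sanity check on one example under the definitions above, not a proof; `sigma` and `check_theorem` are illustrative names.

```python
def sigma(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    k = min(len(pattern), len(x))
    while not x.endswith(pattern[:k]):
        k -= 1
    return k


def check_theorem(pattern: str, text: str) -> bool:
    """True if the automaton state equals sigma(T_i) after every prefix T_i."""
    state = 0                                          # state after T_0
    for i, a in enumerate(text):
        state = sigma(pattern, pattern[:state] + a)    # transition on (state, a)
        if state != sigma(pattern, text[:i + 1]):
            return False
    return True


print(check_theorem("ababaca", "abababacaba"))  # True
```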


    Knuth-Morris-Pratt Algorithm

    Achieves Θ(n + m) by avoiding precomputation of δ.

    Instead, we precompute π[1..m] in O(m) time.

    As T is scanned, π[1..m] is used to deduce information given by δ in FA algorithm.


    Motivating Example

    T:  b a c b a b a b a a b c b a b
    P:          a b a b a c a            (shift s; the first q = 5 characters match)

    Shift s is discovered to be invalid because of a mismatch at the 6th character of P.

    By definition of P, we also know s + 1 is an invalid shift.

    However, s + 2 may be a valid shift.


    Motivating Example

    T:  b a c b a b a b a a b c b a b
    P:              a b a b a c a        (shift s + 2; the first k = 3 characters are already known to match)

    The shift s + 2. Note that the first 3 characters of T starting at s + 2 don't have to be checked again -- we already know what they are.


    Motivating Example

    P_q:  a b a b a
    P_k:      a b a

    The longest prefix of P that is also a proper suffix of P_5 is P_3.

    We will define π[5] = 3.

    In general, if q characters have matched successfully at shift s, the next potentially valid shift is s' = s + (q − π[q]).
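As a small worked instance of the shift rule: the sketch below assumes the prefix-function values for P = ababaca are 0, 0, 1, 2, 3, 0, 1 and that the motivating example has s = 4 (read off the alignment of P under T). Both `PI` and `next_shift` are names introduced here for illustration.

```python
# Prefix-function values for P = ababaca; index 0 is unused so PI[q] is 1-based.
PI = [None, 0, 0, 1, 2, 3, 0, 1]


def next_shift(s: int, q: int, pi=PI) -> int:
    """Next potentially valid shift after q >= 1 characters matched at shift s."""
    return s + (q - pi[q])


print(next_shift(4, 5))  # 6, i.e. s + 2
```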


    The Prefix Function

    π is called the prefix function for P.

    π: {1, 2, …, m} → {0, 1, …, m−1}

    π[q] = length of the longest prefix of P that is a proper suffix of P_q, i.e.,

        π[q] = max{k: k < q and P_k suf P_q}.

    Compute-π(P)
    1   m := length[P];
    2   π[1] := 0;
    3   k := 0;
    4   for q := 2 to m do
    5       while k > 0 and P[k+1] ≠ P[q] do
    6           k := π[k]
            od;
    7       if P[k+1] = P[q] then
    8           k := k + 1
            fi;
    9       π[q] := k
        od;
    10  return π
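A direct Python transcription of the procedure, shifted to 0-based indexing: `pi[q-1]` here holds the value the slides call π[q]. The function name `compute_prefix` is an assumption made for this sketch.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Return the prefix function as a list; pi[q-1] is the slides' value for q."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # number of characters currently matched
    for q in range(1, m):                   # 0-based q corresponds to q + 1 above
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```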


    Example

    i      1 2 3 4 5 6 7
    P[i]   a b a b a c a     (same as our FA example)
    π[i]   0 0 1 2 3 0 1

    Longest prefix of P that is a proper suffix of each P_q:

    P_7 = a b a b a c a:  a = P_1
    P_6 = a b a b a c:    ε = P_0
    P_5 = a b a b a:      a b a = P_3
    P_4 = a b a b:        a b = P_2
    P_3 = a b a:          a = P_1
    P_2 = a b:            ε = P_0
    P_1 = a:              ε = P_0


    Another Explanation

    [Figure: states 0 through 7 joined by a spine labeled a, b, a, b, a, c, a.]

    Essentially KMP is computing a FA with epsilon moves. The "spine" of the FA is implicit and doesn't have to be computed -- it's just the pattern P. π gives the ε transitions. There are O(m) such transitions.

    Recall from Comp 455 that a FA with epsilon moves is conceptually able to be in several states at the same time (in parallel). That's what's happening here -- we're exploring pieces of the pattern in parallel.


    Another Example

    i      1 2 3 4 5 6 7 8 9 10
    P[i]   a b b a b a a b b a
    π[i]   0 0 0 1 2 1 1 2 3 4

    Longest prefix of P that is a proper suffix of each P_q:

    P_10 = a b b a b a a b b a:  a b b a = P_4
    P_9  = a b b a b a a b b:    a b b = P_3
    P_8  = a b b a b a a b:      a b = P_2
    P_7  = a b b a b a a:        a = P_1
    P_6  = a b b a b a:          a = P_1
    P_5  = a b b a b:            a b = P_2
    P_4  = a b b a:              a = P_1
    P_3  = a b b:                ε = P_0
    P_2  = a b:                  ε = P_0
    P_1  = a:                    ε = P_0


    Time Complexity

    Amortized Analysis --

        Φ_0 → [loop q = 2 (1st iteration)] → Φ_1 → [loop q = 3 (2nd iteration)] → Φ_2 → [loop q = 4 (3rd iteration)] → … → [loop q = m ((m−1)st iteration)] → Φ_{m−1}

    Φ = potential function = value of k

    Amortized cost: ĉ_i = c_i + Φ_i − Φ_{i−1}, where i is the iteration and c_i is the actual loop cost.


    Time Complexity (Continued)

    Total amortized cost:

        ∑_{i=1}^{m−1} ĉ_i = ∑_{i=1}^{m−1} (c_i + Φ_i − Φ_{i−1}) = ∑_{i=1}^{m−1} c_i + Φ_{m−1} − Φ_0

    If Φ_{m−1} ≥ Φ_0, then amortized cost upper bounds real cost.

    We have Φ_0 = 0 (initial value of k) and Φ_{m−1} ≥ 0 (final value of k).

    We show ĉ_i = O(1).


    Time Complexity (Continued)

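    The value of ĉ_i obviously depends on how many times statement 6 is executed.

    Note that k > π[k]. Thus, each execution of statement 6 decreases k by at least 1.

    So, suppose that statements 5..6 iterate several times, decreasing the value of k.

    We have: number of iterations ≤ k_old − k_new. Thus,

        ĉ_i ≤ O(1) + 2(k_old − k_new) + Φ_i − Φ_{i−1}

    where the O(1) term is for statements other than 5 & 6, Φ_i = k_new, and Φ_{i−1} = k_old.

    (For the k terms to cancel exactly, scale the potential to Φ = 2k, or charge one unit per pass through statements 5..6; Φ_0 = 0 and Φ_{m−1} ≥ 0 still hold.)

    Hence, ĉ_i = O(1). Total cost is therefore O(m).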
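The amortized bound can be observed empirically by counting how often the fallback step (statement 6) runs. Since k rises by at most 1 per outer iteration and each fallback lowers it by at least 1, the total count can never exceed m − 1. The instrumented function below is a sketch with a made-up name, `count_fallbacks`.

```python
def count_fallbacks(pattern: str) -> int:
    """Count executions of the fallback step while computing the prefix function."""
    m = len(pattern)
    pi = [0] * m
    k = fallbacks = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
            fallbacks += 1                  # one execution of statement 6
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return fallbacks


for p in ["ababaca", "abbabaabba", "aaaaab"]:
    # The potential argument guarantees at most m - 1 fallbacks in total.
    assert count_fallbacks(p) <= len(p) - 1

print(count_fallbacks("ababaca"))  # 2
```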


    Rest of the Algorithm

    KMP(T, P)
        n := length[T];
        m := length[P];
        π := Compute-π(P);
        q := 0;
        for i := 1 to n do
            while q > 0 and P[q+1] ≠ T[i] do
                q := π[q]
            od;
            if P[q+1] = T[i] then
                q := q + 1
            fi;
            if q = m then
                print "pattern occurs with shift i − m";
                q := π[q]
            fi
        od

    Time complexity of loop is O(n) (similar to the analysis of Compute-π).

    Total time is O(m + n).
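Putting the two pieces together, a self-contained Python sketch of the matcher might look like this. It uses 0-based indexing, so `pi[q-1]` holds the slides' prefix-function value for q, and it returns the valid shifts as a list instead of printing them; `kmp_match` is an illustrative name, and returning no matches for an empty pattern is a choice made here.

```python
def kmp_match(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text in O(m + n) time."""
    m = len(pattern)
    if m == 0:
        return []
    # Prefix function, computed as in the earlier procedure.
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    # Scan the text; q is the number of pattern characters matched so far.
    shifts = []
    q = 0
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i + 1 - m)        # 1-based position i + 1, shift i + 1 - m
            q = pi[q - 1]                   # keep going to find overlapping matches
    return shifts


print(kmp_match("abbabababc", "ababc"))  # [5]
```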


    Example

    i      1 2 3 4 5
    P[i]   a b a b c
    π[i]   0 0 1 2 0

    P = a b a b c

    i = 1 2 3 4 5 6 7 8 9 10
    T = a b b a b a b a b c

    Start of 1st loop: q = 0, i = 1 [a]
    2nd loop: q = 1, i = 2 [b]
    3rd loop: q = 2, i = 3 [b]      mismatch detected
    4th loop: q = 0, i = 4 [a]
    5th loop: q = 1, i = 5 [b]
    6th loop: q = 2, i = 6 [a]
    7th loop: q = 3, i = 7 [b]
    8th loop: q = 4, i = 8 [a]      mismatch detected
    9th loop: q = 3, i = 9 [b]
    10th loop: q = 4, i = 10 [c]
    Termination: q = 5              match detected

    Please see the book for formal correctness proofs. (They're very tedious.)