(les09)-stringsearch

  • Upload
    hanvwb

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

  • 7/25/2019 (les09)-stringsearch

    1/32

    Algoritmen & Datastructuren

    2012 2013Substring Search

    (Slides by Sedgewick)

    Philip Dutr & Benedict Brown

    Dept. of Computer Science, K.U.Leuven

  • 7/25/2019 (les09)-stringsearch

    2/32

    Copyright Ph.Dutr, Spring 20132

    Goal: Find pattern of length M in a text of length N

    Usually M

  • 7/25/2019 (les09)-stringsearch

    3/32

    Applications

    Copyright Ph.Dutr, Spring 20133

    Parsers

    Spam filters

    Digital libraries

    Word processors

    Web search engines

    Natural language processing

    Computational molecular biology

    Feature detection in digitized images

  • 7/25/2019 (les09)-stringsearch

    4/32

    Application: spam filtering

    Copyright Ph.Dutr, Spring 20134

    Identify patterns indicative

    of spam.

    PROFITS L0SE WE1GHT

    herbal V1AGRA

    There is no catch L0W M0RTGAGE RATES

    This is a one-time mailing.

    This message is sent in

    compliance with spam

    regulations

  • 7/25/2019 (les09)-stringsearch

    5/32

    Application: electronic surveillance

    Copyright Ph.Dutr, Spring 20135

    Look for certain phrases in email traffic

    Detonate Bomb

    Attack Belgians Ik zal een taart in het gezicht van Leonard werpen

    Huge amount of data, incoming streams, no backup

  • 7/25/2019 (les09)-stringsearch

    6/32

    Application: page scraping

    Copyright Ph.Dutr, Spring 20136

    Goal. Extract relevant data from web page.

    E.g. Find string delimited by and after first

    occurrence of pattern Last Trade

  • 7/25/2019 (les09)-stringsearch

    7/32

    Brute force

    Copyright Ph.Dutr, Spring 20137

    Check all possible starting positions,

    check entire length of pattern

  • 7/25/2019 (les09)-stringsearch

    8/32

    Brute force

    Copyright Ph.Dutr, Spring 20138

  • 7/25/2019 (les09)-stringsearch

    9/32

    Brute force: worst case

    Copyright Ph.Dutr, Spring 20139

    Brute-force algorithm can be slow if text and patternare repetitive

    Worst case: ~ M.N character comparisons M.(N-M+1) = M.N M.M + M ~ M.N if M

  • 7/25/2019 (les09)-stringsearch

    10/32

    Challenges

    Copyright Ph.Dutr, Spring 201310

    In many applications, backup in data to be avoided

    Solution 1: maintain buffer (memory!)

    Solution 2: avoid backup

    Challenge: we want linear time complexity!

  • 7/25/2019 (les09)-stringsearch

    11/32

    Brute force: alternate implementation

    Copyright Ph.Dutr, Spring 201311

    Same sequence of character compares

    i points to end of sequence of already-matched chars in text

    j stores number of already-matched chars(end of sequence in pattern)

  • 7/25/2019 (les09)-stringsearch

    12/32

    Knuth-Morris-Pratt (KMP)

    Copyright Ph.Dutr, Spring 201312

    Intuition: Suppose we are searching in text for pattern

    BAAAAAAAAA (alphabet = {A,B})

    Suppose we match 5 chars, with mismatch on 6th char We know previous 6 characters in text are BAAAAB

    Don't need to back up text pointer!

  • 7/25/2019 (les09)-stringsearch

    13/32

    Deterministic finite state automaton (DFA)

    Copyright Ph.Dutr, Spring 201313

    DFA is abstract string-searching machine.

    Finite number of states (including start and halt)

    Exactly one transition for each char in alphabet Accept if sequence of transitions leads to halt state

  • 7/25/2019 (les09)-stringsearch

    14/32

    Interpretation of Knuth-Morris-Pratt DFA

    Copyright Ph.Dutr, Spring 201314

    Interpretation of DFA state after reading in txt[i]?

    State-number = number of characters in pattern that have

    been matched. length of longest prefix of pat[] that is at end of txt[0..i]

    E.g. DFA is in state 3 after reading in character txt[6]

  • 7/25/2019 (les09)-stringsearch

    15/32

    KMP trace

    Copyright Ph.Dutr, Spring 201315

  • 7/25/2019 (les09)-stringsearch

    16/32

    KMP Implementation

    Copyright Ph.Dutr, Spring 201316

    Differences from brute-force implementation:

    Text pointer i never decrements (no backup needed)

    Need to precompute dfa[][] from pattern (some work)

    Running time Simulate DFA on text: at most N character accesses

    Build DFA: how to do efficiently?

  • 7/25/2019 (les09)-stringsearch

    17/32

    How to build DFA from pattern?

    Copyright Ph.Dutr, Spring 201317

    Match transition

    If in state j and next char c == pat.charAt(j),then go to state j+1.

    first j characters of pattern

    have already been matched next char matches

    now first j+1 characters of

    pattern have been matched

  • 7/25/2019 (les09)-stringsearch

    18/32

    How to build DFA from pattern?

    Copyright Ph.Dutr, Spring 201318

    Mismatch transition

    If in state j and next char c != pat.charAt(j), then the last j characters

    of input are pat[1..j-1], followed by c

    To compute dfa[c][j]:

    Simulate pat[1..j-1] on DFA (so far ) and take transition c

    Running time: Seems to require j steps

    Simulation does not start from scratch: remember state X

  • 7/25/2019 (les09)-stringsearch

    19/32

    Copyright Ph.Dutr, Spring 201319

  • 7/25/2019 (les09)-stringsearch

    20/32

    Copyright Ph.Dutr, Spring 201320

  • 7/25/2019 (les09)-stringsearch

    21/32

    KMP Analysis

    Copyright Ph.Dutr, Spring 201321

    KMP accesses no more than M + N chars

    M for constructing DFA

    N for actual search

    KMP constructs dfa[][] in time and space

    proportional to R.M R = size of alphabet

  • 7/25/2019 (les09)-stringsearch

    22/32

    The Art of Computer Programming

    Copyright Ph.Dutr, Spring 201322

    "If you think you're a really good programmer . . . read

    (Knuth's) Art of Computer Programming . . . You shoulddefinitely send me a rsum if you can read the whole

    thing. (Bill Gates)

    Donald

    Knuth

  • 7/25/2019 (les09)-stringsearch

    23/32

    Boyer-Moore

    Copyright Ph.Dutr, Spring 201323

    Intuition:

    Scan characters in pattern from right to left

    Can skip M text chars when finding one not in the pattern

  • 7/25/2019 (les09)-stringsearch

    24/32

    Boyer-Moore

    Copyright Ph.Dutr, Spring 201324

    How much to skip?

    right[c] = rightmost occurrence of character c in pattern

    -1 for characters not in pattern

  • 7/25/2019 (les09)-stringsearch

    25/32

    Boyer-Moore

    Copyright Ph.Dutr, Spring 201325

  • 7/25/2019 (les09)-stringsearch

    26/32

    Boyer-Moore

    Copyright Ph.Dutr, Spring 201326

  • 7/25/2019 (les09)-stringsearch

    27/32

    Boyer-Moore Analysis

    Copyright Ph.Dutr, Spring 201327

    ~ N/M character compares

    (characters in pattern < characters in alphabet)

    Worst-case: ~ M.N

    KMP-like rules to avoid repetition linear ~ N

  • 7/25/2019 (les09)-stringsearch

    28/32

    Rabin-Karp

    Copyright Ph.Dutr, Spring 201328

    Basic idea = modular hashing

    Compute a hash of pattern characters 0 to M-1

    For each i, compute a hash of text characters i to M+i-1 If pattern hash = text substring hash, check for a match

  • 7/25/2019 (les09)-stringsearch

    29/32

    Rabin-Karp

    Copyright Ph.Dutr, Spring 201329

    Intuition: M-digit, base-R integer, modulo Q

    ti = txt.charAt(i)

    Horner's method

    Linear-time method to evaluate degree-M polynomial

  • 7/25/2019 (les09)-stringsearch

    30/32

    Rabin-Karp

    Copyright Ph.Dutr, Spring 201330

    Can update hash function in constant time!

  • 7/25/2019 (les09)-stringsearch

    31/32

    Rabin-Karp

    Copyright Ph.Dutr, Spring 201331

  • 7/25/2019 (les09)-stringsearch

    32/32

    Rabin-Karp

    Copyright Ph.Dutr, Spring 201332

    Las Vegas variant

    If hash is equal, then check pattern

    100% correct Extremely likely to run in linear time (but worst case

    M.N)

    Monte Carlo variant

    If hash is equal, we assume pattern is equal as well

    Always runs in linear time

    Extremely likely to return correct answer

    Probability 1/Q choose large Q