Upload
hanvwb
View
213
Download
0
Embed Size (px)
Citation preview
7/25/2019 (les09)-stringsearch
1/32
Algoritmen & Datastructuren
2012 2013Substring Search
(Slides by Sedgewick)
Philip Dutr & Benedict Brown
Dept. of Computer Science, K.U.Leuven
7/25/2019 (les09)-stringsearch
2/32
Copyright Ph.Dutr, Spring 20132
Goal: Find pattern of length M in a text of length N
Usually M
7/25/2019 (les09)-stringsearch
3/32
Applications
Copyright Ph.Dutr, Spring 20133
Parsers
Spam filters
Digital libraries
Word processors
Web search engines
Natural language processing
Computational molecular biology
Feature detection in digitized images
7/25/2019 (les09)-stringsearch
4/32
Application: spam filtering
Copyright Ph.Dutr, Spring 20134
Identify patterns indicative
of spam.
PROFITS L0SE WE1GHT
herbal V1AGRA
There is no catch L0W M0RTGAGE RATES
This is a one-time mailing.
This message is sent in
compliance with spam
regulations
7/25/2019 (les09)-stringsearch
5/32
Application: electronic surveillance
Copyright Ph.Dutr, Spring 20135
Look for certain phrases in email traffic
Detonate Bomb
Attack Belgians Ik zal een taart in het gezicht van Leonard werpen
Huge amount of data, incoming streams, no backup
7/25/2019 (les09)-stringsearch
6/32
Application: page scraping
Copyright Ph.Dutr, Spring 20136
Goal. Extract relevant data from web page.
E.g. Find string delimited by and after first
occurrence of pattern Last Trade
7/25/2019 (les09)-stringsearch
7/32
Brute force
Copyright Ph.Dutr, Spring 20137
Check all possible starting positions,
check entire length of pattern
7/25/2019 (les09)-stringsearch
8/32
Brute force
Copyright Ph.Dutr, Spring 20138
7/25/2019 (les09)-stringsearch
9/32
Brute force: worst case
Copyright Ph.Dutr, Spring 20139
Brute-force algorithm can be slow if text and patternare repetitive
Worst case: ~ M.N character comparisons M.(N-M+1) = M.N M.M + M ~ M.N if M
7/25/2019 (les09)-stringsearch
10/32
Challenges
Copyright Ph.Dutr, Spring 201310
In many applications, backup in data to be avoided
Solution 1: maintain buffer (memory!)
Solution 2: avoid backup
Challenge: we want linear time complexity!
7/25/2019 (les09)-stringsearch
11/32
Brute force: alternate implementation
Copyright Ph.Dutr, Spring 201311
Same sequence of character compares
i points to end of sequence of already-matched chars in text
j stores number of already-matched chars(end of sequence in pattern)
7/25/2019 (les09)-stringsearch
12/32
Knuth-Morris-Pratt (KMP)
Copyright Ph.Dutr, Spring 201312
Intuition: Suppose we are searching in text for pattern
BAAAAAAAAA (alphabet = {A,B})
Suppose we match 5 chars, with mismatch on 6th char We know previous 6 characters in text are BAAAAB
Don't need to back up text pointer!
7/25/2019 (les09)-stringsearch
13/32
Deterministic finite state automaton (DFA)
Copyright Ph.Dutr, Spring 201313
DFA is abstract string-searching machine.
Finite number of states (including start and halt)
Exactly one transition for each char in alphabet Accept if sequence of transitions leads to halt state
7/25/2019 (les09)-stringsearch
14/32
Interpretation of Knuth-Morris-Pratt DFA
Copyright Ph.Dutr, Spring 201314
Interpretation of DFA state after reading in txt[i]?
State-number = number of characters in pattern that have
been matched. length of longest prefix of pat[] that is at end of txt[0..i]
E.g. DFA is in state 3 after reading in character txt[6]
7/25/2019 (les09)-stringsearch
15/32
KMP trace
Copyright Ph.Dutr, Spring 201315
7/25/2019 (les09)-stringsearch
16/32
KMP Implementation
Copyright Ph.Dutr, Spring 201316
Differences from brute-force implementation:
Text pointer i never decrements (no backup needed)
Need to precompute dfa[][] from pattern (some work)
Running time Simulate DFA on text: at most N character accesses
Build DFA: how to do efficiently?
7/25/2019 (les09)-stringsearch
17/32
How to build DFA from pattern?
Copyright Ph.Dutr, Spring 201317
Match transition
If in state j and next char c == pat.charAt(j),then go to state j+1.
first j characters of pattern
have already been matched next char matches
now first j+1 characters of
pattern have been matched
7/25/2019 (les09)-stringsearch
18/32
How to build DFA from pattern?
Copyright Ph.Dutr, Spring 201318
Mismatch transition
If in state j and next char c != pat.charAt(j), then the last j characters
of input are pat[1..j-1], followed by c
To compute dfa[c][j]:
Simulate pat[1..j-1] on DFA (so far ) and take transition c
Running time: Seems to require j steps
Simulation does not start from scratch: remember state X
7/25/2019 (les09)-stringsearch
19/32
Copyright Ph.Dutr, Spring 201319
7/25/2019 (les09)-stringsearch
20/32
Copyright Ph.Dutr, Spring 201320
7/25/2019 (les09)-stringsearch
21/32
KMP Analysis
Copyright Ph.Dutr, Spring 201321
KMP accesses no more than M + N chars
M for constructing DFA
N for actual search
KMP constructs dfa[][] in time and space
proportional to R.M R = size of alphabet
7/25/2019 (les09)-stringsearch
22/32
The Art of Computer Programming
Copyright Ph.Dutr, Spring 201322
"If you think you're a really good programmer . . . read
(Knuth's) Art of Computer Programming . . . You shoulddefinitely send me a rsum if you can read the whole
thing. (Bill Gates)
Donald
Knuth
7/25/2019 (les09)-stringsearch
23/32
Boyer-Moore
Copyright Ph.Dutr, Spring 201323
Intuition:
Scan characters in pattern from right to left
Can skip M text chars when finding one not in the pattern
7/25/2019 (les09)-stringsearch
24/32
Boyer-Moore
Copyright Ph.Dutr, Spring 201324
How much to skip?
right[c] = rightmost occurrence of character c in pattern
-1 for characters not in pattern
7/25/2019 (les09)-stringsearch
25/32
Boyer-Moore
Copyright Ph.Dutr, Spring 201325
7/25/2019 (les09)-stringsearch
26/32
Boyer-Moore
Copyright Ph.Dutr, Spring 201326
7/25/2019 (les09)-stringsearch
27/32
Boyer-Moore Analysis
Copyright Ph.Dutr, Spring 201327
~ N/M character compares
(characters in pattern < characters in alphabet)
Worst-case: ~ M.N
KMP-like rules to avoid repetition linear ~ N
7/25/2019 (les09)-stringsearch
28/32
Rabin-Karp
Copyright Ph.Dutr, Spring 201328
Basic idea = modular hashing
Compute a hash of pattern characters 0 to M-1
For each i, compute a hash of text characters i to M+i-1 If pattern hash = text substring hash, check for a match
7/25/2019 (les09)-stringsearch
29/32
Rabin-Karp
Copyright Ph.Dutr, Spring 201329
Intuition: M-digit, base-R integer, modulo Q
ti = txt.charAt(i)
Horner's method
Linear-time method to evaluate degree-M polynomial
7/25/2019 (les09)-stringsearch
30/32
Rabin-Karp
Copyright Ph.Dutr, Spring 201330
Can update hash function in constant time!
7/25/2019 (les09)-stringsearch
31/32
Rabin-Karp
Copyright Ph.Dutr, Spring 201331
7/25/2019 (les09)-stringsearch
32/32
Rabin-Karp
Copyright Ph.Dutr, Spring 201332
Las Vegas variant
If hash is equal, then check pattern
100% correct Extremely likely to run in linear time (but worst case
M.N)
Monte Carlo variant
If hash is equal, we assume pattern is equal as well
Always runs in linear time
Extremely likely to return correct answer
Probability 1/Q choose large Q