Upload
rashedulislamriyad
View
241
Download
0
Embed Size (px)
Citation preview
8/12/2019 10 String Algorithms
1/36
CS 97SI: INTRODUCTION TOPROGRAMMING CONTESTS
Jaehyun Park
8/12/2019 10 String Algorithms
2/36
Last Lecture: String Algorithms
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Note on String Problems
8/12/2019 10 String Algorithms
3/36
String Matching Problem
Given a text and a pattern , find all theoccurrences of within
Notations:
and : lengths of and
: set of alphabets
Constant size
: th letter of (1-indexed) ,,: single letters in
,,: strings
8/12/2019 10 String Algorithms
4/36
String Matching Example
=AGCATGCTGCAGTCATGCTTAGGCTA
= GCT
A nave method takes () time
We initiate string comparison at every starting point
Each comparison takes time
We can certainly do better!
8/12/2019 10 String Algorithms
5/36
Hash Function
A function that takes a string and outputs a number
A good hash function has few collisions
i.e. If , () with high probability
An easy and powerful hash function is a polynomial
mod some prime
Consider each letter as a number (ASCII value is fine)
= How do we find ( +) from ( )?
8/12/2019 10 String Algorithms
6/36
Hash Table
Main idea: preprocess to speedup queries
Hash every substring of length
is a small constant
For each query , hash the first letters of toretrieve all the occurrences of it within
Dont forget to check collisions!
8/12/2019 10 String Algorithms
7/36
Hash Table
Pros:
Easy to implement
Significant speedup in practice
Cons:
Doesnt help the asymptotic efficiency
Can take () time if hashing is terrible A lot of memory consumption
8/12/2019 10 String Algorithms
8/36
Knuth-Morris-Pratt (KMP) Matcher
A linear time (!) algorithm that solves the string
matching problem by preprocessing in time
Main idea is to skip some comparisons by using the
previous comparison result Uses an auxiliary array that is defined as the
following:
[] is the largest integer smaller than such that [] is a suffix of
Its better to see an example than the definition
8/12/2019 10 String Algorithms
9/36
Table Example (from CLRS)
[]: the largest integer smaller than such that [] is a suffix of e.g. [6] = 4 since abab is a suffix of ababab
e.g. [9] = 0 since no prefix of length 8 ends with c
Lets see why this is useful
1 2 3 4 5 6 7 8 9 10
a b a b a b a b c a
[] 0 0 1 2 3 4 5 6 0 1
8/12/2019 10 String Algorithms
10/36
Using the Table
=ABC ABCDAB ABCDABCDABDE
=ABCDABD
= (0, 0, 0, 0, 1, 2, 0)
Start matching at the first position of:
Mismatch at the 4th letter of!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
11/36
Using the Table
There is no point in starting the comparison at , We matched = 3 letters so far
Shift by = 3 letters
Mismatch at again!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
12/36
Using the Table
We define 0 = 1
We matched = 0 letters so far
Shift by = 1 letter
Mismatch at !
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
13/36
Using the Table
[6] = 2 says is a suffix of Shift by 6 [6] = 4 letters
Again, no point in shifting by 1, 2, or 3 letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
||
ABCDABD
1234567
8/12/2019 10 String Algorithms
14/36
Using the Table
Mismatch at again!
Currently 2 letters are matched
We shift by 2 = 2 2 letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
15/36
Using the Table
Mismatch at yet again!
Currently no letters are matched
We shift by 1 = 0 0 letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
16/36
Using the Table
Mismatch at
Currently 6 letters are matched
We shift by 4 = 6 6 letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
17/36
Using the Table
Finally, there it is!
Currently all 7 letters are matched
After recording this match (match at ), we shift again in order to find other matches
Shift by 7 = 7 7 letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
8/12/2019 10 String Algorithms
18/36
Computing
Observation 1: if [] is a suffix of ,
then is a suffix of Well, obviously
Observation 2: all the prefixes of P that are a
suffix of can be obtained by recursivelyapplying to
e.g. , , are allsuffixes of
8/12/2019 10 String Algorithms
19/36
Computing
A non-obvious conclusion:
First, lets write () as [] applied times to
e.g. =
is equal to 1 1, where is the smallestinteger that satisfies + =
If there is no such , [] = 0
Intuition: we look at all the prefixes of that aresuffixes of and find the longest onewhose next letter matches too
8/12/2019 10 String Algorithms
20/36
Implementation
pi[0] = -1;
int k = -1;
for(int i = 1; i = 0 && P[k+1] != P[i])k = pi[k];
pi[i] = ++k;
}
8/12/2019 10 String Algorithms
21/36
Pattern Matching Implementation
int k = 0;
for(int i = 1; i = 0 && P[k+1] != T[i])
k = pi[k];k++;
if(k == m) {
// P matches T[i-m+1..i]
k = pi[k];
}
}
8/12/2019 10 String Algorithms
22/36
Suffix Trie
Suffix trie of a string is a rooted tree that storesall the suffixes (thus all the substrings)
Each node corresponds to some substring of
Each edge is associated with an alphabet
For each node that corresponds to , there is aspecial pointer called suffix link that leads to the
node corresponding to Surprisingly easy to implement!
8/12/2019 10 String Algorithms
23/36
Suffix Trie Example
(Figure modified from Ukkonens original paper)
8/12/2019 10 String Algorithms
24/36
Incremental Construction
Given the suffix tree for Then we append + = to , creating necessary
nodes
Start at node corresponding to Create an -transition to a new node
Take the suffix link at to go to , corresponding
to Create an -transition to a new node
Create a suffix link from to
8/12/2019 10 String Algorithms
25/36
Incremental Construction
We repeat the previous process:
Take the suffix link at the current node
Make a new -transition there
Create the suffix link from the previous node
We stop if the node already has an -transition
Because from this point, all nodes that are reachable
via suffix links already have an -transition
8/12/2019 10 String Algorithms
26/36
Construction Example
a
b
a
a
b
Given the suffix trie for aba
We want to add a new letter c
8/12/2019 10 String Algorithms
27/36
Construction Example
a
b
a
a
b
1. Start at the green node and make a c-transition
c
2. Then follow the suffix link
8/12/2019 10 String Algorithms
28/36
8/12/2019 10 String Algorithms
29/36
Construction Example
a
b
a
a
b
c
c
c
8/12/2019 10 String Algorithms
30/36
Construction Example
a
b
a
a
b
c
c
c
c
8/12/2019 10 String Algorithms
31/36
Construction Example
a
b
a
a
b
c
c
c
c
8/12/2019 10 String Algorithms
32/36
Suffix Trie Analysis
Construction time is linear in the tree size
But the tree size can be quadratic in
e.g. = aaabbb
8/12/2019 10 String Algorithms
33/36
Pattern Matching
To find , start at the root and keep followingedges labeled with , , etc.
Got stuck? Then doesnt exist in
8/12/2019 10 String Algorithms
34/36
Suffix Array
Input string Get all suffixes Sort the suffixes Take the indices
BANANA
1 BANANA
2 ANANA
3 NANA
4 ANA
5 NA
6 A
6 A
4 ANA
2 ANANA
1 BANANA
5 NA
3 NANA
6,4,2,1,5,3
8/12/2019 10 String Algorithms
35/36
Suffix Array
Memory usage is
Has the same computational power as suffix trie
Can be constructed in time (!)
But its hard to implement
There is an approachable log algorithm
If you want to see how it works, read the paper on the
course website http://cs97si.stanford.edu/suffix-array.pdf
http://cs97si.stanford.edu/suffix-array.pdfhttp://cs97si.stanford.edu/suffix-array.pdfhttp://cs97si.stanford.edu/suffix-array.pdf8/12/2019 10 String Algorithms
36/36
Note on String Problems
Always be aware of the null-terminators
Simple hash works so well in many problems
Even for problems that arent supposed to be solved by
hashing
If a problem involves rotations of a string, consider
concatenating it with itself and see if it helps
Stanford team notebook has implementations ofsuffix arrays and the KMP matcher