10 String Algorithms

Embed Size (px)

Citation preview

  • 8/12/2019 10 String Algorithms

    1/36

    CS 97SI: INTRODUCTION TOPROGRAMMING CONTESTS

    Jaehyun Park

  • 8/12/2019 10 String Algorithms

    2/36

    Last Lecture: String Algorithms

    String Matching Problem

    Hash Table

    Knuth-Morris-Pratt (KMP) Algorithm

    Suffix Trie

    Suffix Array

    Note on String Problems

  • 8/12/2019 10 String Algorithms

    3/36

    String Matching Problem

    Given a text and a pattern , find all theoccurrences of within

    Notations:

    and : lengths of and

    : set of alphabets

    Constant size

    : th letter of (1-indexed) ,,: single letters in

    ,,: strings

  • 8/12/2019 10 String Algorithms

    4/36

    String Matching Example

    =AGCATGCTGCAGTCATGCTTAGGCTA

    = GCT

    A nave method takes () time

    We initiate string comparison at every starting point

    Each comparison takes time

    We can certainly do better!

  • 8/12/2019 10 String Algorithms

    5/36

    Hash Function

    A function that takes a string and outputs a number

    A good hash function has few collisions

    i.e. If , () with high probability

    An easy and powerful hash function is a polynomial

    mod some prime

    Consider each letter as a number (ASCII value is fine)

    = How do we find ( +) from ( )?

  • 8/12/2019 10 String Algorithms

    6/36

    Hash Table

    Main idea: preprocess to speedup queries

    Hash every substring of length

    is a small constant

    For each query , hash the first letters of toretrieve all the occurrences of it within

    Dont forget to check collisions!

  • 8/12/2019 10 String Algorithms

    7/36

    Hash Table

    Pros:

    Easy to implement

    Significant speedup in practice

    Cons:

    Doesnt help the asymptotic efficiency

    Can take () time if hashing is terrible A lot of memory consumption

  • 8/12/2019 10 String Algorithms

    8/36

    Knuth-Morris-Pratt (KMP) Matcher

    A linear time (!) algorithm that solves the string

    matching problem by preprocessing in time

    Main idea is to skip some comparisons by using the

    previous comparison result Uses an auxiliary array that is defined as the

    following:

    [] is the largest integer smaller than such that [] is a suffix of

    Its better to see an example than the definition

  • 8/12/2019 10 String Algorithms

    9/36

    Table Example (from CLRS)

    []: the largest integer smaller than such that [] is a suffix of e.g. [6] = 4 since abab is a suffix of ababab

    e.g. [9] = 0 since no prefix of length 8 ends with c

    Lets see why this is useful

    1 2 3 4 5 6 7 8 9 10

    a b a b a b a b c a

    [] 0 0 1 2 3 4 5 6 0 1

  • 8/12/2019 10 String Algorithms

    10/36

    Using the Table

    =ABC ABCDAB ABCDABCDABDE

    =ABCDABD

    = (0, 0, 0, 0, 1, 2, 0)

    Start matching at the first position of:

    Mismatch at the 4th letter of!

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    11/36

    Using the Table

    There is no point in starting the comparison at , We matched = 3 letters so far

    Shift by = 3 letters

    Mismatch at again!

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    12/36

    Using the Table

    We define 0 = 1

    We matched = 0 letters so far

    Shift by = 1 letter

    Mismatch at !

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    13/36

    Using the Table

    [6] = 2 says is a suffix of Shift by 6 [6] = 4 letters

    Again, no point in shifting by 1, 2, or 3 letters

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    ||

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    14/36

    Using the Table

    Mismatch at again!

    Currently 2 letters are matched

    We shift by 2 = 2 2 letters

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    15/36

    Using the Table

    Mismatch at yet again!

    Currently no letters are matched

    We shift by 1 = 0 0 letters

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    16/36

    Using the Table

    Mismatch at

    Currently 6 letters are matched

    We shift by 4 = 6 6 letters

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    17/36

    Using the Table

    Finally, there it is!

    Currently all 7 letters are matched

    After recording this match (match at ), we shift again in order to find other matches

    Shift by 7 = 7 7 letters

    12345678901234567890123

    ABC ABCDAB ABCDABCDABDE

    ABCDABD

    1234567

  • 8/12/2019 10 String Algorithms

    18/36

    Computing

    Observation 1: if [] is a suffix of ,

    then is a suffix of Well, obviously

    Observation 2: all the prefixes of P that are a

    suffix of can be obtained by recursivelyapplying to

    e.g. , , are allsuffixes of

  • 8/12/2019 10 String Algorithms

    19/36

    Computing

    A non-obvious conclusion:

    First, lets write () as [] applied times to

    e.g. =

    is equal to 1 1, where is the smallestinteger that satisfies + =

    If there is no such , [] = 0

    Intuition: we look at all the prefixes of that aresuffixes of and find the longest onewhose next letter matches too

  • 8/12/2019 10 String Algorithms

    20/36

    Implementation

    pi[0] = -1;

    int k = -1;

    for(int i = 1; i = 0 && P[k+1] != P[i])k = pi[k];

    pi[i] = ++k;

    }

  • 8/12/2019 10 String Algorithms

    21/36

    Pattern Matching Implementation

    int k = 0;

    for(int i = 1; i = 0 && P[k+1] != T[i])

    k = pi[k];k++;

    if(k == m) {

    // P matches T[i-m+1..i]

    k = pi[k];

    }

    }

  • 8/12/2019 10 String Algorithms

    22/36

    Suffix Trie

    Suffix trie of a string is a rooted tree that storesall the suffixes (thus all the substrings)

    Each node corresponds to some substring of

    Each edge is associated with an alphabet

    For each node that corresponds to , there is aspecial pointer called suffix link that leads to the

    node corresponding to Surprisingly easy to implement!

  • 8/12/2019 10 String Algorithms

    23/36

    Suffix Trie Example

    (Figure modified from Ukkonens original paper)

  • 8/12/2019 10 String Algorithms

    24/36

    Incremental Construction

    Given the suffix tree for Then we append + = to , creating necessary

    nodes

    Start at node corresponding to Create an -transition to a new node

    Take the suffix link at to go to , corresponding

    to Create an -transition to a new node

    Create a suffix link from to

  • 8/12/2019 10 String Algorithms

    25/36

    Incremental Construction

    We repeat the previous process:

    Take the suffix link at the current node

    Make a new -transition there

    Create the suffix link from the previous node

    We stop if the node already has an -transition

    Because from this point, all nodes that are reachable

    via suffix links already have an -transition

  • 8/12/2019 10 String Algorithms

    26/36

    Construction Example

    a

    b

    a

    a

    b

    Given the suffix trie for aba

    We want to add a new letter c

  • 8/12/2019 10 String Algorithms

    27/36

    Construction Example

    a

    b

    a

    a

    b

    1. Start at the green node and make a c-transition

    c

    2. Then follow the suffix link

  • 8/12/2019 10 String Algorithms

    28/36

  • 8/12/2019 10 String Algorithms

    29/36

    Construction Example

    a

    b

    a

    a

    b

    c

    c

    c

  • 8/12/2019 10 String Algorithms

    30/36

    Construction Example

    a

    b

    a

    a

    b

    c

    c

    c

    c

  • 8/12/2019 10 String Algorithms

    31/36

    Construction Example

    a

    b

    a

    a

    b

    c

    c

    c

    c

  • 8/12/2019 10 String Algorithms

    32/36

    Suffix Trie Analysis

    Construction time is linear in the tree size

    But the tree size can be quadratic in

    e.g. = aaabbb

  • 8/12/2019 10 String Algorithms

    33/36

    Pattern Matching

    To find , start at the root and keep followingedges labeled with , , etc.

    Got stuck? Then doesnt exist in

  • 8/12/2019 10 String Algorithms

    34/36

    Suffix Array

    Input string Get all suffixes Sort the suffixes Take the indices

    BANANA

    1 BANANA

    2 ANANA

    3 NANA

    4 ANA

    5 NA

    6 A

    6 A

    4 ANA

    2 ANANA

    1 BANANA

    5 NA

    3 NANA

    6,4,2,1,5,3

  • 8/12/2019 10 String Algorithms

    35/36

    Suffix Array

    Memory usage is

    Has the same computational power as suffix trie

    Can be constructed in time (!)

    But its hard to implement

    There is an approachable log algorithm

    If you want to see how it works, read the paper on the

    course website http://cs97si.stanford.edu/suffix-array.pdf

    http://cs97si.stanford.edu/suffix-array.pdfhttp://cs97si.stanford.edu/suffix-array.pdfhttp://cs97si.stanford.edu/suffix-array.pdf
  • 8/12/2019 10 String Algorithms

    36/36

    Note on String Problems

    Always be aware of the null-terminators

    Simple hash works so well in many problems

    Even for problems that arent supposed to be solved by

    hashing

    If a problem involves rotations of a string, consider

    concatenating it with itself and see if it helps

    Stanford team notebook has implementations ofsuffix arrays and the KMP matcher