String Matching Algorithms

Antonio Carzaniga

Faculty of InformaticsUniversità della Svizzera italiana

December 22, 2011

Outline

Problem definition

Naïve algorithm

Knuth-Morris-Pratt algorithm

Boyer-Moore algorithm

Problem

Given the text

Nel mezzo del cammin di nostra vitami ritrovai per una selva oscurache la dritta via era smarrita. . .

Find the string “trova”

Problem

Given the text

A more challenging example: How many times does the string“110011” appear in the following text

001111010101101001100011010111101101011101101110010010101010111110111101100001011011000010111111011110011000011111000100100101001011101110101101111010100110010100101110010000111111100100110111010110100110011011101001010010101000010100111110

Problem

Given the text

A more challenging example: How many times does the string“110011” appear in the following text

001111010101101001100011010111101101011101101110010010101010111110111101100001011011000010111111011110011000011111000100100101001011101110101101111010100110010100101110010000111111100100110111010110100110011011101001010010101000010100111110

String Matching: Definitions

Given a text T

T ∈ Σ∗: finite alphabet Σ

|T | = n: the length of T is n

Given a text T

Given a pattern P P ∈ Σ

∗: same finite alphabet Σ |P| = m: the length of P ism

Given a text T

Both T and P can be modeled as arrays T[1 . . . n] and P[1 . . .m]

Given a text T

Both T and P can be modeled as arrays T[1 . . . n] and P[1 . . .m]

Pattern P occurs with shift s in T iff 0 ≤ s ≤ n −m T[s + i] = P[i] for all positions 1 ≤ i ≤ m

Example

Problem: find all s such that

0 ≤ s ≤ n −m

T[s + i] = P[i] for 1 ≤ i ≤ m

Example

0 ≤ s ≤ n −m

T[s + i] = P[i] for 1 ≤ i ≤ m

n = 14

T a b c a a b a a b a b a c a

Example

0 ≤ s ≤ n −m

T[s + i] = P[i] for 1 ≤ i ≤ m

n = 14

Result

Example

0 ≤ s ≤ n −m

T[s + i] = P[i] for 1 ≤ i ≤ m

n = 14

Ps = 4

Results = 4

Example

0 ≤ s ≤ n −m

T[s + i] = P[i] for 1 ≤ i ≤ m

n = 14

Ps = 7

Results = 4s = 7

Example

0 ≤ s ≤ n −m

T[s + i] = P[i] for 1 ≤ i ≤ m

n = 14

Ps = 9

Results = 4s = 7s = 9

Naïve Algorithm

For each position s in 0 . . . n −m, see if T[s + i] = P[i] for all1 ≤ i ≤ m

Naïve Algorithm

NAIVE-STRING-MATCHING(T, P)

1 n = length(T)2 m = length(P)3 for s = 0 to n −m4 if SUBSTRING-AT(T, P, s)5 OUTPUT(s)

Naïve Algorithm

NAIVE-STRING-MATCHING(T, P)

1 n = length(T)2 m = length(P)3 for s = 0 to n −m4 if SUBSTRING-AT(T, P, s)5 OUTPUT(s)

SUBSTRING-AT(T, P, s)

1 for i = 1 to length(P)2 if T[s + i] , P[i]3 return FALSE

4 return TRUE

Complexity of the Naïve Algorithm

Complexity of NAIVE-STRING-MATCH is O((n −m + 1)m)

Worst case example

T = an, P = am

n︷︸︸︷

aa · · · a, P =

m︷︸︸︷

aa · · · a

Worst case example

T = an, P = am

n︷︸︸︷

aa · · · a, P =

m︷︸︸︷

aa · · · a

So, (n −m + 1)m is a tight bound, so the (worst-case)complexity of NAIVE-STRING-MATCH is

Θ((n −m + 1)m)

Improvement Strategy

Observation

P a b a

Observation

P a b a

Observation

P a b a

Observation

P a b a

Observation

P a b a

What now?

Observation

P a b a

What now?

the naïve algorithm goes back to the second position in T andstarts from the beginning of P

Observation

P a b a

What now?

can’t we simply move along through T?

Observation

P a b a

What now?

can’t we simply move along through T?

Improvement Strategy (2)

Here’s a wrong but insightful strategy

WRONG-STRING-MATCHING(T, P)

1 n = length(T)2 m = length(P)3 q = 0 // number of characters matched in P4 s = 15 while s ≤ n6 s = s + 17 if T[s] == P[q + 1]8 q = q + 19 if q == m10 OUTPUT(s −m)11 q = 012 else q = 0

Example run ofWRONG-STRING-MATCHING

T p a g l i a i o b a g o r d o

P a g o

Output: 10

P a g o Output: 10

Done. Perfect!

P a g o Output: 10

Done. Perfect!

Complexity: Θ(n)

What is wrong withWRONG-STRING-MATCHING?

T a a b a a a b a b a b a c a

P a a b

OUTPUT(0)

P a a b

missed!

P a a b

missed!

SoWRONG-STRING-MATCHING doesn’t work, but it tells ussomething useful

Where didWRONG-STRING-MATCHING go wrong?

P a a b

Wrong: by going all the way back to q = 0 we throw away agood prefix of P that we already matched

Another example

T a b a b a b a c b a c b c a

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

We have matched “ababa”

Another example

P a b a b a c

We have matched “ababa” suffix “aba” can be reused as a prefix

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

Another example

P a b a b a c

OUTPUT(2)

New Strategy

P[1 . . . q] is the prefix of Pmatched so far

New Strategy

Find the longest prefix of P that is also a suffix of P[2 . . . q]

i.e., find 0 ≤ π < q such that P[q − π + 1 . . . q] = P[1 . . . π] π = 0 means that such a prefix does not exist

New Strategy

P a b a b a c

New Strategy

P a b a b a c

New Strategy

P a b a b a c

q+1π = 3

New Strategy

P a b a b a c

q+1π = 3

New Strategy

P a b a b a c

Restart from q = π

New Strategy

P a b a b a c

Restart from q = π

Iterate as usual

New Strategy

P a b a b a c

Restart from q = π

Iterate as usual

In essence, this is the Knuth-Morris-Pratt algorithm

The Prefix Function

Given a pattern prefix P[1 . . . q], the longest prefix of P that isalso a suffix of P[2 . . . q] depends only on P and q

The Prefix Function

This prefix is identified by its length π(q)

The Prefix Function

Because π(q) depends only on P (and q), π can be computedat the beginning by PREFIX-FUNCTION

we represent π as an array of lengthm

The Prefix Function

Example

P a b a b a c

The Prefix Function

Example

P a b a b a c

π 0 0 1 2 3 0

The Knuth-Morris-Pratt Algorithm

KMP-STRING-MATCHING(T, P)

1 n = length(T)2 m = length(P)3 π = PREFIX-FUNCTION(P)4 q = 0 // number of character matched5 for i = 1 to n // scan the text left-to-right6 while q > 0 and P[q + 1] , T[i]7 q = π[q] // no match: go back using π8 if P[q + 1] == T[i]9 q = q + 110 if q == m11 OUTPUT(i −m)12 q = π[q] // go back for the next match

Prefix Function Algorithm

Computing the prefix function amounts to finding all theoccurrences of a pattern P in itself

Prefix Function Algorithm

Computing the prefix function amounts to finding all theoccurrences of a pattern P in itself

In fact, PREFIX-FUNCTION is remarkably similar toKMP-STRING-MATCHING

PREFIX-FUNCTION(P)

1 m = length(P)2 π[1] = 03 k = 04 for q = 2 tom5 while k > 0 and P[k + 1] , P[q]6 k = π[k]7 if P[k + 1] == P[q]8 k = k + 19 π[q] = k

Prefix Function at Work

PREFIX-FUNCTION(P)

1 m = length(P)2 π[1] = 03 k = 04 for q = 2 to m5 while k > 0 and P[k + 1] , P[q]6 k = π[k]7 if P[k + 1] == P[q]8 k = k + 19 π[q] = k

P a b a b a c

PREFIX-FUNCTION(P)

P a b a b a c

PREFIX-FUNCTION(P)

P a b a b a c

PREFIX-FUNCTION(P)

P a b a b a c

PREFIX-FUNCTION(P)

P a b a b a c

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1 2

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1 2

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1 2

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1 2 3

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1 2 3

PREFIX-FUNCTION(P)

P a b a b a c

π 0 0 1 2 3

Complexity of KMP

O(n) for the search phase

Complexity of KMP

O(m) for the pre-processing of the pattern

Complexity of KMP

The complexity analysis is non-trivial

Complexity of KMP

The complexity analysis is non-trivial

Can we do better?

Comments on KMP

Knuth-Morris-Pratt is Ω(n)

KMP will always go through at least n character comparisons

it fixes our “wrong” algorithm in the case of periodic patternsand texts

Comments on KMP

Knuth-Morris-Pratt is Ω(n)

KMP will always go through at least n character comparisons

it fixes our “wrong” algorithm in the case of periodic patternsand texts

Perhaps there’s another algorithm that works better on theaverage case

e.g., in the absence of periodic patterns

A New Strategy

h e r e i s a s i m p l e e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

We match the pattern right-to-left

A New Strategy

e x a m p l e

If we find a bad character α in the text, we can shift

so that the pattern skips α , if α is not in the pattern

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

so that the pattern lines up with the rightmost occurrence of αin the pattern, if the pattern contains α

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

so that a pattern prefix lines up with a suffix of the currentpartial (or complete) match

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

A New Strategy

e x a m p l e

In essence, this is the Boyer-Moore algorithm

Comments on Boyer-Moore

Like KMP, Boyer-Moore includes a pre-processing phase

The pre-processing is O(m)

The search phase is O(nm)

The search phase can be as low as O(n/m) in common cases

In practice, Boyer-Moore is the fastest string-matchingalgorithm for most applications

String Matching Algorithms - USI InformaticsString Matching Algorithms Antonio Carzaniga Faculty of...

Documents

Algorithms and Data Structures Exercises - USI · Algorithms and Data Structures Exercises Antonio Carzaniga University of Lugano Edition 1.8 February 2013 1

A parallel ‘String Matching Engine’ for use in high speed network … · and Non-deterministic Finite Automata. 2.3 String matching algorithms There are many string matching algorithms

Pattern matching and text compression algorithms

Efficient Algorithms for Matching

An Experimental Test of House Matching Algorithms

Filter Algorithms for Approximate String Matching

NC Algorithms for Computing a Perfect Matching and a Maximum … · 2020. 5. 6. · Our matching algorithms, also, use matching-mimicking networks on at most three terminals, replacing

Multithreaded Algorithms for Maximum Matching in Bipartite ...€¦ · Multithreaded Algorithms for Maximum Matching in Bipartite Graphs Ariful Azad1, Mahantesh Halappanavar2, Sivasankaran

Multithreaded Algorithms for Maxmum Matching in …Multithreaded Algorithms for Maximum Matching in Bipartite Graphs Ariful Azad1, Mahantesh Halappanavar2, Sivasankaran Rajamanickam

Algebraic Algorithms for Matching and Matroid Problemsnickhar/Publications/AlgebraicMatching/... · Algebraic Algorithms for Matching and Matroid Problems Nicholas J. A. Harvey Computer

Exercises for Algorithms and Data Structures for Algorithms and Data Structures Antonio Carzaniga Faculty of Informatics USI (Università della Svizzera italiana) Edition 2.5 February

Matching Algorithms and Networks. Algorithms and Networks: Matching2 This lecture Matching: problem statement and applications Bipartite matching Matching

Non-Rigid Point Matching: Algorithms, Extensions …noodle.med.yale.edu/thesis/chui.pdfAbstract Non-Rigid Point Matching: Algorithms, Extensions and Applications Haili Chui Yale University

Fingerprint Matching by Genetic Algorithms - VISLabvislab.ucr.edu/PUBLICATIONS/pubs/Journal and Conference Papers... · Fingerprint Matching by Genetic Algorithms Xuejun Tan and Bir

6600089 Handbook of Exact String Matching Algorithms

Algorithms for Graph Similarity and Subgraph Matching

Fast Matching Algorithms for Repetitive Optimization

Algorithms for matching - DFKI

ALGORITHMS FOR VERTEX-WEIGHTED MATCHING IN GRAPHShpc.pnl.gov/people/hala/files/HalappanavarThesisVertexWtdMatchin… · ALGORITHMS FOR VERTEX-WEIGHTED MATCHING IN GRAPHS Mahantesh