Survey: String Matching Algorithms

Vishal Kumar Jaiswal, 1st yr. M.Tech. CST, IIEST Shibpur



Page 2: Survey of String Matching Algorithm

Outline

❖ Problem Statement

❖ Exact String Matching
    ➢ Naive String Matching
    ➢ Rabin Karp Algorithm
    ➢ Finite State Automaton
    ➢ Knuth Morris Pratt Algorithm

❖ Approximate String Matching

Page 3: Survey of String Matching Algorithm

String Matching

The objective of string searching is to find the location of a specific text pattern within a larger body of text (e.g., a sentence, a paragraph, a book, etc.).

Formally, let the text be an array T[1..n] of length n and the pattern an array P[1..m] of length m <= n. The elements of P and T are characters drawn from a finite alphabet Σ.

Page 4: Survey of String Matching Algorithm

Assumptions and Terminology

The text T is static, given before queries are made, available for preprocessing and storing in a data structure.

The concatenation of two strings x and y, denoted xy, has length |x|+|y| and consists of the characters from x followed by the characters from y.

A string w is a prefix of a string x, denoted w ⊏ x, if x = wy for some string y ∈ Σ*.

A string w is a suffix of a string x, denoted w ⊐ x, if x = yw for some string y ∈ Σ*.

Page 5: Survey of String Matching Algorithm

Problem Statement

The pattern P occurs with shift s in text T (or, equivalently, pattern P occurs beginning at position s+1 in text T) if 0 <= s <= n-m and T[s+1..s+m] = P[1..m] (that is, if T[s+j] = P[j] for 1 <= j <= m).

Fig. 1. Demonstration of string search problem

Page 6: Survey of String Matching Algorithm

Valid / Invalid shifts

If P occurs with shift s in T, then we call s a valid shift; otherwise we call s an invalid shift. The string matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
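The definition of a valid shift can be checked directly. The following is a minimal sketch, assuming Python's 0-based strings, so the window T[s+1..s+m] of the slide becomes text[s:s+m]; the function names and the example strings are illustrative only.

```python
def is_valid_shift(text, pattern, s):
    """Return True if pattern occurs in text with shift s."""
    n, m = len(text), len(pattern)
    # Condition from the definition: 0 <= s <= n - m and
    # T[s+1..s+m] == P[1..m]; with 0-based indexing the window is text[s:s+m].
    return 0 <= s <= n - m and text[s:s + m] == pattern


def valid_shifts(text, pattern):
    """Collect every valid shift by testing each candidate shift in turn."""
    return [s for s in range(len(text) - len(pattern) + 1)
            if is_valid_shift(text, pattern, s)]


# Illustrative input: the only valid shift of "abaa" in this text is 3.
shifts = valid_shifts("abcabaabcabac", "abaa")
```

Every other shift in the range 0..n-m is an invalid shift for this input.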

If a variable-width encoding is in use, finding the Nth character is slow: it takes time proportional to N.

Page 7: Survey of String Matching Algorithm

Algorithms

Algorithms that search for a finite set of patterns:

Aho-Corasick string matching algorithm

Commentz-Walter algorithm

Rabin-Karp string search algorithm

Algorithms that search for an infinite set of patterns:

The pattern set is represented by a regular expression or a regular grammar.

Page 8: Survey of String Matching Algorithm

Exact String Matching

Page 9: Survey of String Matching Algorithm

Naive string matching

Check every position in the text between 0 and n-m to see whether an occurrence of the pattern starts there.

After each attempt, shift the pattern by exactly one position to the right.

The time complexity of this searching phase is O(mn) (for instance, when searching for a^(m-1)b in a^n).

Requires no preprocessing phase.

Only constant extra space needed.

The expected number of text character comparisons is 2n.

Page 10: Survey of String Matching Algorithm

Naive string matching Algorithm
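The procedure described on the previous slide can be sketched as follows. This is an illustrative Python version with 0-based shifts, not a transcription of the original slide's pseudocode; the name naive_search is an assumption.

```python
def naive_search(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):           # candidate shifts 0 .. n-m
        j = 0
        # Compare left to right and stop at the first mismatch.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                        # all m characters matched
            shifts.append(s)
        # The pattern then moves exactly one position to the right.
    return shifts
```

For example, naive_search("abababa", "aba") returns [0, 2, 4]. No preprocessing is done and only a few integer variables are kept besides the result list.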

Page 11: Survey of String Matching Algorithm

Example

Page 12: Survey of String Matching Algorithm

String matching with finite automata

The automaton examines each text character exactly once, taking a constant amount of time per text character. So, the matching time after preprocessing is Θ(n).

To search a word x with an automaton, first build the minimal Deterministic Finite Automaton (DFA) A(x) recognizing the language Σ*x.

If Σ is large, then the time to build the automaton can be large.

Page 13: Survey of String Matching Algorithm

String matching with finite automata

The DFA A(x) = (Q, q0, T, E) recognizing the language Σ*x is defined as follows:

Q is the set of all prefixes of x: Q = {ε, x[0], x[0..1], ..., x[0..m-2], x};

q0 = ε;

T = {x};

For q in Q (q is a prefix of x) and a in Σ, (q, a, qa) is in E if and only if qa is also a prefix of x; otherwise (q, a, p) is in E, where p is the longest suffix of qa which is a prefix of x.
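The transition rule above can be turned into a small table-building routine. The sketch below is a deliberately simple construction, assuming each state is represented by the length of the prefix it stands for (0 for ε, m for x) and that the alphabet passed in contains every character of x; it is slower than the Θ(m|Σ|) construction and the names are hypothetical.

```python
def build_automaton(x, alphabet):
    """delta[q][a] = length of the longest prefix of x that is a suffix of x[:q] + a."""
    m = len(x)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            qa = x[:q] + a
            k = min(m, q + 1)
            # Shrink k until x[:k] is a suffix of qa; k = 0 (empty prefix) always fits.
            while not qa.endswith(x[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def automaton_search(text, x, alphabet):
    """Scan text once; report 0-based shifts where the final state m is reached."""
    delta = build_automaton(x, alphabet)
    m = len(x)
    q = 0
    shifts = []
    for i, c in enumerate(text):
        # Assumption: a character outside the alphabet cannot extend any prefix of x.
        q = delta[q].get(c, 0)
        if q == m:
            shifts.append(i - m + 1)
    return shifts
```

With x = "ababaca" over the alphabet "abc", the call automaton_search("abababacaba", "ababaca", "abc") returns [2]. Each text character costs one table lookup, which is where the Θ(n) matching time comes from.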

Page 14: Survey of String Matching Algorithm

Finite automata: Example

Page 15: Survey of String Matching Algorithm

Rabin Karp String Matching

Performs well in practice and also generalizes to other problems e.g. two dimensional pattern matching.

Uses hashing to find any one of a set of pattern strings in a text.

For a text of length n and p patterns of combined length m, its average and best-case running time is O(n+m) using O(p) space, but its worst-case running time is O(nm).

A practical application of this algorithm is detecting plagiarism.

Page 16: Survey of String Matching Algorithm

Hash Function

A hash function is a function which converts every string into a numeric value, called its hash value. For example, hash("hello") = 5. A hash function should be computationally efficient and highly discriminating.

hash(y[j+1..j+m]) must be easily computable from hash(y[j..j+m-1]), y[j] and y[j+m]:

hash(y[j+1..j+m]) = rehash(y[j], y[j+m], hash(y[j..j+m-1]))

Page 17: Survey of String Matching Algorithm

Hash Function

For a word w of length m let hash(w) be defined as follows:

hash(w[0..m-1]) = (w[0]*2^(m-1) + w[1]*2^(m-2) + ... + w[m-1]*2^0) mod q, where q is a large number.

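If two strings are equal, their hash values are also equal. The converse does not hold, so a hash match must be confirmed by comparing the characters.

Compute the hash value of the pattern we're searching for, and then look for a substring of the text with the same hash value.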
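The hash and rehash definitions above can be combined into a single-pattern Rabin Karp search. The following is a sketch under stated assumptions: characters are mapped to numbers with ord(), the modulus q defaults to an arbitrarily chosen large prime, and every hash hit is verified by a direct comparison.

```python
def rabin_karp(text, pattern, q=1_000_000_007):
    """Return 0-based shifts of pattern in text using a base-2 rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(2, m - 1, q)              # weight 2^(m-1) of the leading character
    hp = ht = 0
    for i in range(m):                   # Horner evaluation of the hash formula
        hp = (hp * 2 + ord(pattern[i])) % q
        ht = (ht * 2 + ord(text[i])) % q
    shifts = []
    for j in range(n - m + 1):
        # Equal hashes do not imply equal strings, so verify the window.
        if hp == ht and text[j:j + m] == pattern:
            shifts.append(j)
        if j < n - m:
            # rehash: drop text[j], multiply by the base, append text[j+m]
            ht = ((ht - ord(text[j]) * high) * 2 + ord(text[j + m])) % q
    return shifts
```

A small modulus such as q=13 produces many spurious hash hits, but the verification step keeps the result correct: rabin_karp("2359023141526739921", "31415", q=13) still returns [6].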

Page 18: Survey of String Matching Algorithm

Rabin Karp Example

Page 19: Survey of String Matching Algorithm

The Knuth Morris Pratt Algorithm

Consider an attempt at a left position j, that is, when the window is positioned on the text factor y[j..j+m-1]. Assume that the first mismatch occurs between x[i] and y[i+j] with 0 < i < m. Then x[0..i-1] = y[j..i+j-1] = u and a = x[i] ≠ y[i+j] = b.

When shifting, a prefix v of the pattern can match some suffix of the portion u of the text. If we want to avoid another immediate mismatch, the character following the prefix v in the pattern must be different from a. The longest such prefix v is called the tagged border of u.

Page 20: Survey of String Matching Algorithm

Prefix Table

q      1  2  3  4  5  6  7  8
P[q]   G  C  A  G  A  G  A  G
π[q]   0  0  0  1  0  1  0  1

Page 21: Survey of String Matching Algorithm

Knuth Morris Pratt Example

Page 22: Survey of String Matching Algorithm

The Knuth Morris Pratt Algorithm
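The algorithm can be sketched in Python as below. This version is an assumption about presentation, not a transcription of the slide: it uses the plain prefix function π (the length of the longest proper prefix of P[1..q] that is also a suffix of it), which reproduces the prefix table shown for GCAGAGAG, and it reports 0-based shifts.

```python
def prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]                # fall back to the next shorter border
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = prefix_function(pattern)
    shifts = []
    k = 0                                # number of pattern characters matched so far
    for i, c in enumerate(text):
        while k > 0 and pattern[k] != c:
            k = pi[k - 1]                # reuse the matched prefix instead of rescanning
        if pattern[k] == c:
            k += 1
        if k == m:
            shifts.append(i - m + 1)
            k = pi[k - 1]                # continue, allowing overlapping occurrences
    return shifts
```

Here prefix_function("GCAGAGAG") gives [0, 0, 0, 1, 0, 1, 0, 1], matching the table, and kmp_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG") returns [5]. The text index i never moves backwards, which is why the matching phase is linear in n.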

Page 23: Survey of String Matching Algorithm

Fig. 2. Comparison of single-pattern search algorithms

Algorithm                             Preprocessing time      Matching time
Naive search algorithm                0 (no preprocessing)    O((n-m+1)m)
Rabin Karp string search algorithm    Θ(m)                    average Θ(n+m), worst O((n-m+1)m)
Finite State Automaton                Θ(m|Σ|)                 Θ(n)
Knuth Morris Pratt algorithm          Θ(m)                    Θ(n)
Boyer Moore string search algorithm   Θ(m+|Σ|)                best Ω(n/m), worst O(mn) (O(n) with the Galil rule)
Bitap algorithm                       Θ(m+|Σ|)                O(mn)

Note: Boyer Moore String Search Algorithm is the standard benchmark for the practical string search literature

Page 24: Survey of String Matching Algorithm

Approximate String Matching

Page 25: Survey of String Matching Algorithm

Approximate / Fuzzy String Searching

Finds strings that match a pattern approximately (rather than exactly).

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

Insertion: cot->coat

Deletion: coat->cot

Substitution: coat->cost

Transposition: cost->cots
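An edit distance that counts all four operations can be computed with a small dynamic program. The sketch below assumes unit cost for every operation and uses the restricted form in which a transposed pair is not edited again (often called optimal string alignment); the function name is illustrative.

```python
def edit_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    n, m = len(a), len(b)
    # d[i][j] = distance between the first i characters of a and the first j of b
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(m + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]
```

Each of the four example pairs above (cot/coat, coat/cot, coat/cost, cost/cots) has distance 1 under this definition.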

Page 26: Survey of String Matching Algorithm

Problem Definition and Solution Strategy

Given a pattern string P = p1 p2 ... pm and a text string T = t1 t2 ... tn, find a substring T[j'..j] = tj' ... tj of T which, of all substrings of T, has the smallest edit distance to the pattern P.

Brute force approach:

First, compute the edit distance to P for every substring of T.

Then choose the substring with the minimum edit distance. However, this algorithm has running time O(n^3 m).

Page 27: Survey of String Matching Algorithm

Dynamic programming based approach

For each position j in text T and each position i in the pattern P, go through all substrings of T ending at position j, and determine which one of them has the minimal edit distance to the first i characters of the pattern P. Write this minimal distance as E(i,j).

After computing E(i,j), for all i and j, we can easily find a solution to the original problem: It is the substring for which E(m,j) is minimal (m being the length of the pattern P).

Page 28: Survey of String Matching Algorithm

Online Matching

Online searching:

The pattern can be preprocessed before searching, but the text cannot. A widely used online searching algorithm is the bitap algorithm, which is used by the Unix search utility agrep.
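The core of bitap is a bit-parallel state update. The sketch below shows only the exact-matching (shift-and) form, as a simplifying assumption; the approximate variant keeps one such bit vector per allowed error level and combines them at each step. Python integers stand in for machine words, and the function name is illustrative.

```python
def bitap_search(text, pattern):
    """Exact shift-and bitap: return every 0-based shift of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Preprocess the pattern only: bit i of masks[c] is set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)                # bit m-1 set means the whole pattern matched
    state = 0                            # bit i set: pattern[:i+1] matches a suffix of the text read
    shifts = []
    for j, c in enumerate(text):
        # Extend every partial match by one character and start a new one at bit 0.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            shifts.append(j - m + 1)
    return shifts
```

For example, bitap_search("abababa", "aba") returns [0, 2, 4]. The text is read once, left to right, with no preprocessing of the text itself, which is what makes the method online.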

Page 29: Survey of String Matching Algorithm

Applications

• Spam Filtering,

• De-duplication,

• Identity Resolution,

• Microsoft’s Spell Checker and Autocorrect feature,

• Google search engine’s “Showing results for”,

• Searching lyrics of a song.

Note: Approximate string search cannot be used for most binary data, such as images and music. These applications require different algorithms, such as acoustic fingerprinting.

Page 30: Survey of String Matching Algorithm

Applications

Error in Fuzzy String Matching!


Page 32: Survey of String Matching Algorithm

Thank You

Find this presentation at: www.slideshare.net/VishalKumarJaiswal2