View
236
Download
1
Category
Preview:
Citation preview
7/30/2019 String Matching Algorithms and their Applicability in various Applications
1/5
International Journal of Soft Computing and Engineering (IJSCE)
ISSN: 2231-2307, Volume-I, Issue-6, January 2012
218
String Matching Algorithms and their
Applicability in various Applications
Nimisha Singla, Deepak Garg
Abstract- In this paper the applicability of the various stringsmatching algorithms are being described. Which algorithm is best in
which application and why. This describes the optimal algorithm forvarious activities that include string matching as an importantaspect of functionality. In all applications test string and patternclass needs to be matched always.
Keywords: Databases, Dynamic programming, Search engine,String matching algorithms
I. INTRODUCTIONThe problem of string matching is that there are two strings
one is text T [1..n] i.e. is main string given and the other is
pattern P [1.m] i.e. is the given string to be matched with
the given main string given m
7/30/2019 String Matching Algorithms and their Applicability in various Applications
2/5
String Matching Algorithms and their Applicability in various Applications
219
a classic computer scienceproblem, the basis ofdifferent (a
file comparison program that outputs the differences between
two files), and has applications in bioinformatics.
2.vi Algorithm using Levenshtein Distance: Levenshteindistance is a metric for measuring the amount of difference
between two sequences (i.e. an edit distance). The term edit
distance is often used to refer specifically to Levenshtein
distance. The Levenshtein distance between two strings is
defined as the minimum number of edits needed to transform
one string into the other, with the allowable edit operations
being insertion, deletion, or substitution of a single character.
2.vii Smith Waterman Algorithm: It is a well-knownalgorithm for performing local sequence alignment; that is,
for determining similar regions between two nucleotide or
protein sequences. Instead of looking at the total sequence, the
Smith Waterman algorithm compares segments of all possible
lengths and optimizes similarity measure.
2.viii Brute Force Algorithm:It is also known as proof byexhaustion, also known as proof by cases, perfect induction,
or the brute force method, is a method ofmathematicalproofin which the statement to be proved is split into a finite
number of cases and each case is checked to see if the
proposition in question holds. A proof by exhaustion contains
two stages: proof that the cases are exhaustive; i.e., that each
instance of the statement to be proved matches the conditions
of (at least) one of the cases and a proof of each of the cases.
2.ix Boyer Moore Algorithm: It is a particularlyefficient string searching algorithm, and it has been the
standard benchmark for the practical string search literature.
The algorithm preprocesses the target string (key) that is
being searched for, but not the string being
searched in (unlike some algorithms that preprocess the stringto be searched and can then amortize the expense of the
preprocessing by searching repeatedly). The execution time of
the Boyer Moore algorithm, while still linear in the size of the
string being searched, can have a significantly lower constant
factor than many other search algorithms: it doesn't need to
check every character of the string to be searched, but rather
skips over some of them. Generally the algorithm gets faster
as the key being searched for becomes longer. Its efficiency
derives from the fact that with each unsuccessful attempt to
find a match between the search string and the text it is
searching, it uses the information gained from that attempt to
rule out as many positions of the text as possible where the
string cannot match. BMH approach uses only the Bad
Character Heuristic of Boyer Moore for skipping comparisons
rather than both the Bad Character and Good Suffix
Heuristics.
2.x Jaccard Similarity: It is a measure of the similaritybetween two binary vectors. It can be simply used to measure
the similarity of the strings.
Standard
Boyer-
Moore
bad
character
shift
Before Shift:
p -> six plus two*
t -> two plus three equals five
After Shift:
p -> six plus two
*
t -> two plus three equals five
The characters are matched starting at *
and compared in right to the left
manner. The whole pattern shift one by
one along the text to be searched from
left to right. The first only comparison
fails on the r. No r exists in the
pattern; it can be shifted by 12
characters as shown. The next
comparison begins after shifting and at
the second *.
Standard
Boyer-
Moore
repeated
substring
shift
Before Shift:
p ->two plus two
*
t -> count to two hundred four
After Shift:
p -> two plus two
*t ->count to two hundred four
The characters are matched starting at *
and continues examining characters
from right to left. It fails on second o
from the text. Since two is a repeated
substring in the pattern, the pattern is
shifted by 9 characters so that the two
of pattern is aligned with the matching
part of the text. The next comparison
begins after shifting and at the second
*.
III. VARIOUS ALGORITHM ANALYSIS TABLE
Algorithm name Compariso
n Order
Preprocess time
complexity
Search time
complexity
Main characteristic
http://en.wikipedia.org/wiki/Computer_sciencehttp://en.wikipedia.org/wiki/Diffhttp://en.wikipedia.org/wiki/Bioinformaticshttp://en.wikipedia.org/wiki/String_metrichttp://en.wikipedia.org/wiki/Edit_distancehttp://en.wikipedia.org/wiki/String_(computer_science)http://en.wikipedia.org/wiki/Sequence_alignmenthttp://en.wikipedia.org/wiki/Nucleotide_sequencehttp://en.wikipedia.org/wiki/Protein_sequencehttp://en.wikipedia.org/wiki/Mathematical_proofhttp://en.wikipedia.org/wiki/Mathematical_proofhttp://en.wikipedia.org/wiki/String_searching_algorithmhttp://en.wikipedia.org/wiki/Algorithmhttp://en.wikipedia.org/wiki/Preprocessorhttp://en.wikipedia.org/wiki/String_(computer_science)http://en.wikipedia.org/wiki/Amortizationhttp://en.wikipedia.org/wiki/Amortizationhttp://en.wikipedia.org/wiki/String_(computer_science)http://en.wikipedia.org/wiki/Preprocessorhttp://en.wikipedia.org/wiki/Algorithmhttp://en.wikipedia.org/wiki/String_searching_algorithmhttp://en.wikipedia.org/wiki/Mathematical_proofhttp://en.wikipedia.org/wiki/Mathematical_proofhttp://en.wikipedia.org/wiki/Protein_sequencehttp://en.wikipedia.org/wiki/Nucleotide_sequencehttp://en.wikipedia.org/wiki/Sequence_alignmenthttp://en.wikipedia.org/wiki/String_(computer_science)http://en.wikipedia.org/wiki/Edit_distancehttp://en.wikipedia.org/wiki/String_metrichttp://en.wikipedia.org/wiki/Bioinformaticshttp://en.wikipedia.org/wiki/Diffhttp://en.wikipedia.org/wiki/Computer_science7/30/2019 String Matching Algorithms and their Applicability in various Applications
3/5
International Journal of Soft Computing and Engineering (IJSCE)
ISSN: 2231-2307, Volume-I, Issue-6, January 2012
220
Boyer Moore Right to Left O(m+n) O(mn) Use both good suffix shift and bad
character shift
BM Horspool Not relevant O(m+n) O(mn) Use only bad character shift with
rightmost character
Brute Force Not relevant No
preprocessing
O(mn) Use one by one character shift. Not
an optimal one
KMP Left to Right O(m) O(m+n) Independent of alphabet size, use the
notion of border of the string,
increases performance, decrease
delay and decrease time of comparing
Quick Search Not relevant O(m+n) O(mn) Use only bad character shift, very fast
Rabin Karp Left to Right O(m) O(mn) Use hashing function, very effective
for multiple pattern matching in 1D
matching
Approximate string
matching
Not relevant - O(mn) First matching approximate then
searching dictionary
Needleman Wunsch Left to Right - O(mn) Global alignment of proteins and
nucleotides
Smith Waterman Left to Right - O(mn) Align local sequence in segments of
proteins and nucleotides
IV. APPLICATIONS WITH THEIR OPTIMALALGORITHMS
3.i Text Editor, Digital Library and Search Engine:Every person uses a text editor and every user of a digital
library or search engine, needs to find patterns in a text. The
Boyer Moore algorithm is directly implemented the search
command of practically all text editors. The longest common
subsequence dynamic programming algorithm is implemented
in system commands that test differences between files.
3.ii Multimedia and Computational Biology: It hasshown that a much more generalized theoretical basis of
pattern matching could be of tremendous benefit. Pattern
matching has to adapt itself to increasingly broader definitions
of matching. In computational biology one may be interested
in finding a close mutation, in communications one may want
to adjust for transmission noise, in texts it may be desirable to
allow common typing errors. In multimedia one may want to
adjust for loss compressions, occlusions, scaling, affine
transformations or dimension loss. The largest overlap
heuristic for finding the shortest common superstring has been
used in DNA sequencing. The algorithms that are used in
Needleman Wunsch and Smith Waterman.
3.iii Medical Tests:The BMH algorithm achieves the bestoverall results when used with medical tests. This algorithm
usually performs at least twice as fast as the other algorithms
tested. The time performance of exact string pattern matching
can be greatly improved if an efficient algorithm is used.
Considering the growing amount of text handled in the
electronic patient record, it is worth implementing this
efficient algorithm.
7/30/2019 String Matching Algorithms and their Applicability in various Applications
4/5
String Matching Algorithms and their Applicability in various Applications
221
3.iv String Prefix Matching Problem: this refers to thematching the prefixes of the pattern and the text. It also
checks the longest prefix of some given sequence/text. This
occurs at the starting of the patterns. It also includes
preprocessing of the pattern. KMP algorithm and
deterministic sequential comparison model are used to solve
this problem. This can be done by assigning the lower and the
upper bounds of the prefix to be matched.[1]
3.v Polymorphic String Matching: In this some basicstring matching algorithm are combined to make one or more
efficient algorithms for further computation as in fusion of the
algorithms some functions become easy to define and some of
the time consuming parameters need not be used. In this data
representation technique used is tree. One example is KMP
and Boyer Moore fusion. There combination completes the
task in amortized constant time and cost is also equal to
equality check cost. The quadratic time is dropped. The
features of both the algorithms are combined and used for
producing a better functional algorithm.[2]
3.vi
Network Intrusion Detection System: This problemrequires exact pattern matching technique. This is open source
IDS snort. It reduces computational time and higher order of
context matching is performed. To solve this Boyer Moore
algorithm is used as it needs exact matching of the given
pattern. To achieve this tree data structure is used in modified
algorithms that convert the bad character shift to good prefix
shift and resulting in far better performance. [3]
3.vii Denoising of the Pattern: When the text is long itcreates noise in the pattern. To remove these denoise
technique i.e. DUDE algorithm is used for this. It comprises
of simple string matching technique with radix sort for getting
a noise free sequence. It has 2 passes: 1st collects the
empirical probability distribution and observe the noise
pattern within the double sided window and in 2nd pass those
noise patterns are replaced by denoise pattern/sequence.[4]
3.viii Approximate Similarity Measure: This algorithm isused in real world applications. This is for finite length
strings. It searches the optimal alignment for pattern
searching. It is far-far better than the any other algorithms. In
the previous similarity measure algorithm the variations in the
pattern decreases but now in the new proposed algorithm the
numbers of the comparisons are reduced. This is done by
picking only those strings from the database whose lengths
are either equal or greater than the pattern to be matched.
Time variation depend on the length of query string, L and
number of strings in the database of the length U where (L-
L/2)
7/30/2019 String Matching Algorithms and their Applicability in various Applications
5/5
International Journal of Soft Computing and Engineering (IJSCE)
ISSN: 2231-2307, Volume-I, Issue-6, January 2012
222
V. CONCLUSIONIn todays online scenario finding the appropriate content in
minimum time is most important. String algorithms play a
vital role for this. Different groups are working on software
and hardware levels to make pattern searching faster. By
applying various algorithms in various applications the
approximate best algorithm for different applications is
determined. The recommended algorithms give the reduced
complexity and also reduced computation time. The
algorithms assigned to various applications may not be best
optimal algorithm but better than the general algorithms.
Rather than applying each algorithm to every application one
application is explained with particular optimal algorithm.
And it has been noted that most applications uses Boyer
Moore, BMH or KMP algorithms for their effective and
efficient functionality and other applications uses the basics of
these algorithms for their functionalities as the KMP
algorithm has less time complexity and Boyer Moore and
BMH algorithms has preprocessing time complexity less.
Other algorithms depends upon the type of input and is
efficient for certain or particular application.
REFERENCES
[1] Dany Breslauer, Livio Colussi and Laura Toniolo, Tight ComparisonBounds for the String Prefix Matching Problem, Stiching
Mathematisch Centrum, Amsterdam, 1-9, 1992.
[2] Richard S. Bird, Polymorphic String Matching, 110-115, Haskell052005, Tallin, Estonia.
[3] C. Jason Coit, Stuart Staniford and Joseph McAlerney, Towards FasterString Matching for Intrusion Detection or Exceeding the Speed ofSnort, 1-7.
[4] S. Chen, S. Diggavi, S. Dusad and S. Muthukrishnan, Efficient StringMatching Algorithms for Combinatorial Universal Denoising,1-10.
[5] Narendra Kumar, Vimal Bibhu, Mohammad Islam and ShashankBhardwaj, Approximate string matching Algorithm, International
Journal on Computer Science and Engineering, Vol. 02, No. 03, 2010,641-644.
[6] Yu-lung Lo Chien-Chi Huang, Fault Tolerant Music Retrieval bysimilar String Matching, National Science Council of ROC Grant
NSC98-2221-E-324-027,1-10.
[7] S. Viswanadha Raju and A. Vinayababu. Optimal Parallel algorithmfor String Matching on Mesh Network Structure, International Journal
of Applied Mathematical Sciences ISSN 0973-0176 Vol.3 No.2 2006,
167-175.[8] Ahmad Fadel Klaib, Zurinahni Zainol, Nurul Hashimah Ahamed,
Rosma Ahmad, and Wahidah Hussin, Application of Exact String
Matching Algorithms towards SMILES Representation of Chemical
Structure, World Academy of Science, Engineering and Technology 34
2007,36-40.
[9] Venkata Padmavati Metta, Kamala Kritivasan, Deepak Garg, "OnString Languages Generated by SN P Systems with Anti-Spikes",
International Journal of Foundations in Computer Science, World
Scientific May 2011.[10] www.cs.biu.ac.il.[11] Venkatesan T. Chakaravarthy, Rajasekar Krishnamurthy, The Problem
of Context Sensitive String Matching, 1-12.
AUTHORS PROFILE
Nimisha Singla has received her Bachelor of Technology in ComputerScience & Engineering from Punjab Technical University, Jalandhar, India.
She is pursuing her Master in Engineering in Computer Science &
Engineering from Thapar University, Patiala, India. She is Secretary ofStudent Chapter of Technical Society viz. Association for Computing
Machinery (ACM). Her main research interests include: String matching
problems.
Dr.Deepak Garg is Chair ACM SIGACT North India and Secretary of IEEE
Computer Society, Delhi Section. He is senior member of IEEE and senior
member of ACM. He is Life Member of CSI, IETE, ISCA and ISTE. He has
undertaken research projects funded of Indian Govt. He is on various
advisory boards and technical committees of International workshops andconferences. He is a certified on various technologies. He is guiding many
MTech and PhD students. His active area of research is advance algorithm
design and theory of computer science. He has more than 80 Publications inInternational Journal and Conferences of Repute. He is Faculty at Computer
Science and Engineering Department at Thapar University, Patiala.
Recommended