
A Guided Tour to Approximate String Matching

GONZALO NAVARRO, University of Chile

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.

Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems—Pattern matching, Computations on discrete structures; H.3.3 [Information storage and retrieval]: Information search and retrieval—Search process

General Terms: Algorithms

Additional Key Words and Phrases: Edit distance, Levenshtein distance, online string matching, text searching allowing errors

1. INTRODUCTION

This work focuses on the problem of string matching that allows errors, also called approximate string matching. The general goal is to perform string matching of a pattern in a text where one or both of them have suffered some kind of (undesirable) corruption. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors.

Partially supported by Fondecyt grant 1-990627.
Author's address: Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile; e-mail: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or [email protected].
© 2001 ACM 0360-0300/01/0300-0031 $5.00
ACM Computing Surveys, Vol. 33, No. 1, March 2001, pp. 31–88.

The problem, in its most general form, is to find the positions of a text where a given pattern occurs, allowing a limited number of "errors" in the matches. Each application uses a different error model, which defines how different two strings are. The idea for this "distance" between strings is to make it small when one of the strings is likely to be an erroneous variant of the other under the error model in use.

The goal of this survey is to present an overview of the state of the art in approximate string matching. We focus on online searching (that is, when the text cannot be preprocessed to build an index on it), explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We also consider some variants of the problem. We present a number of experiments to compare the performance of the different algorithms and show the best choices. We conclude with some directions for future work and open problems.

Unfortunately, the algorithmic nature of the problem strongly depends on the type of "errors" considered, and the solutions range from linear time to NP-complete. The scope of our subject is so broad that we are forced to narrow our focus to a subset of the possible error models. We consider only those defined in terms of replacing some substrings by others at varying costs. In this light, the problem becomes minimizing the total cost to transform the pattern and its occurrence in the text to make them equal, and reporting the text positions where this cost is low enough.

One of the best studied cases of this error model is the so-called edit distance, which allows us to delete, insert and substitute simple characters (with a different one) in both strings. If the different operations have different costs or the costs depend on the characters involved, we speak of general edit distance. Otherwise, if all the operations cost 1, we speak of simple edit distance or just edit distance (ed). In this last case we simply seek the minimum number of insertions, deletions and substitutions to make both strings equal. For instance, ed("survey", "surgery") = 2. The edit distance has received a lot of attention because its generalized version is powerful enough for a wide range of applications. Despite the fact that most existing algorithms concentrate on the simple edit distance, many of them can be easily adapted to the generalized edit distance, and we pay attention to this issue throughout this work. Moreover, the few algorithms that exist for the general error model that we consider are generalizations of edit distance algorithms.

On the other hand, most of the algorithms designed for the edit distance are easily specialized to other cases of interest. For instance, by allowing only insertions and deletions at cost 1, we can compute the longest common subsequence (LCS) between two strings. Another simplification that has received a lot of attention is the variant that allows only substitutions (Hamming distance).

An extension of the edit distance enriches it with transpositions (i.e. a substitution of the form ab → ba at cost 1). Transpositions are very important in text searching applications because they are typical typing errors, but few algorithms exist to handle them. However, many algorithms for edit distance can be easily extended to include transpositions, and we keep track of this fact in this work.

Since the edit distance is by far the best studied case, this survey focuses basically on the simple edit distance. However, we also pay attention to extensions such as generalized edit distance, transpositions and general substring substitution, as well as to simplifications such as LCS and Hamming distance. In addition, we also pay attention to some extensions of the type of pattern to search: when the algorithms allow it, we mention the possibility of searching some extended patterns and regular expressions allowing errors. We now point out what we are not covering in this work.

—First, we do not cover other distance functions that do not fit the model of substring substitution. This is because they are too different from our focus and the paper would lose cohesion. Some of these are: Hamming distance (short survey in [Navarro 1998]), reversals [Kececioglu and Sankoff 1995] (which allows reversing substrings), block distance [Tichy 1984; Ehrenfeucht and Haussler 1988; Ukkonen 1992; Lopresti and Tomkins 1997] (which allows rearranging and permuting the substrings), q-gram distance [Ukkonen 1992] (based on finding common substrings of fixed length q), allowing swaps [Amir et al. 1997b; Lee et al. 1997], etc. Hamming distance, despite being a simplification of the edit distance, is not covered because specialized algorithms for it exist that go beyond the simplification of an existing algorithm for edit distance.

—Second, we consider pattern matching over sequences of symbols, and at most generalize the pattern to a regular expression. Extensions such as approximate searching in multidimensional texts (short survey in [Navarro and Baeza-Yates 1999a]), in graphs [Amir et al. 1997a; Navarro 2000a] or multipattern approximate searching [Muth and Manber 1996; Baeza-Yates and Navarro 1997; Navarro 1997a; Baeza-Yates and Navarro 1998] are not considered. None of these areas is very developed, and the algorithms should be easy to grasp once approximate pattern matching under the simple model is well understood. Many existing algorithms for these problems borrow from those we present here.

—Third, we leave aside nonstandard algorithms, such as approximate,¹ probabilistic or parallel algorithms [Tarhio and Ukkonen 1988; Karloff 1993; Atallah et al. 1993; Altschul et al. 1990; Lipton and Lopresti 1985; Landau and Vishkin 1989].

—Finally, an important area that we leave aside in this survey is indexed searching, i.e. the process of building a persistent data structure (an index) on the text to speed up the search later. Typical reasons that prevent keeping indices on the text are: extra space requirements (as the indices for approximate searching tend to take many times the text size), volatility of the text (as building the indices is quite costly and needs to be amortized over many searches) and simple inadequacy (as the field of indexed approximate string matching is quite immature and the speedup that the indices provide is not always satisfactory). Indexed approximate searching is a difficult problem, and the area is quite new and active [Jokinen and Ukkonen 1991; Gonnet 1992; Ukkonen 1993; Myers 1994a; Holsti and Sutinen 1994; Manber and Wu 1994; Cobbs 1995; Sutinen and Tarhio 1996; Araujo et al. 1997; Navarro and Baeza-Yates 1999b; Baeza-Yates and Navarro 2000; Navarro et al. 2000]. The problem is very important because the texts in some applications are so large that no online algorithm can provide adequate performance. However, virtually all the indexed algorithms are strongly based on online algorithms, and therefore understanding and improving the current online solutions is of interest for indexed approximate searching as well.

¹ Please do not confuse an approximate algorithm (which delivers a suboptimal solution with some suboptimality guarantees) with an algorithm for approximate string matching. Indeed, approximate string matching algorithms can be regarded as approximation algorithms for exact string matching (where the maximum distance gives the guarantee of optimality), but in this case it is harder to find the approximate matches, and of course the motivation is different.

These issues have been put aside to keep a reasonable scope in the present work. They certainly deserve separate surveys. Our goal in this survey is to explain the basic tools of approximate string matching, as many of the extensions we are leaving aside are built on the basic algorithms designed for online approximate string matching.

This work is organized as follows. In Section 2 we present in detail some of the most important application areas for approximate string matching. In Section 3 we formally introduce the problem and the basic concepts necessary to follow the rest of the paper. In Section 4 we show some analytical and empirical results about the statistical behavior of the problem.

Sections 5–8 cover all the work of interest we could trace on approximate string matching under the edit distance. We divided it into four sections that correspond to different approaches to the problem: dynamic programming, automata, bit-parallelism, and filtering algorithms. Each section is presented as a historical tour, so that we not only explain the work done but also show how it was developed.

Section 9 presents experimental results comparing the most efficient algorithms. Finally, we give our conclusions and discuss open questions and future work in Section 10.

There exist other surveys on approximate string matching, which are, however, too old for this fast moving area [Hall and Dowling 1980; Sankoff and Kruskal 1983; Apostolico and Galil 1985; Galil and Giancarlo 1988; Jokinen et al. 1996] (the last one was in its definitive form in 1991). So all previous surveys lack coverage of the latest developments. Our aim is to provide a long awaited update. This work is partially based on Navarro [1998], but the coverage of previous work is much more detailed here. The subject is also covered, albeit with less depth, in some textbooks on algorithms [Crochemore and Rytter 1994; Baeza-Yates and Ribeiro-Neto 1999].

2. MAIN APPLICATION AREAS

The first references to this problem we could trace are from the sixties and seventies, where the problem appeared in a number of different fields. In those times, the main motivation for this kind of search came from computational biology, signal processing, and text retrieval. These are still the largest application areas, and we cover each one here. See also [Sankoff and Kruskal 1983], which has a lot of information on the birth of this subject.

2.1 Computational Biology

DNA and protein sequences can be seen as long texts over specific alphabets (e.g. {A,C,G,T} in DNA). Those sequences represent the genetic code of living beings. Searching specific sequences over those texts appeared as a fundamental operation for problems such as assembling the DNA chain from the pieces obtained by the experiments, looking for given features in DNA chains, or determining how different two genetic sequences are. This was modeled as searching for given "patterns" in a "text." However, exact searching was of little use for this application, since the patterns rarely matched the text exactly: the experimental measures have errors of different kinds, and even the correct chains may have small differences, some of them significant due to mutations and evolutionary alterations and others unimportant. Finding DNA chains very similar to those sought represents a significant result as well. Moreover, establishing how different two sequences are is important to reconstruct the tree of evolution (phylogenetic trees). All these problems required a concept of "similarity," as well as an algorithm to compute it.

This gave a motivation to "search allowing errors." The errors were those operations that biologists knew were common in genetic sequences. The "distance" between two sequences was defined as the minimum (i.e. most likely) sequence of operations to transform one into the other. With regard to likelihood, the operations were assigned a "cost," such that the more likely operations were cheaper. The goal was then to minimize the total cost.

Computational biology has since then evolved and developed a lot, with a special push in recent years due to the "genome" projects that aim at the complete decoding of the DNA and its potential applications. There are other, more exotic, problems such as structure matching or searching for unknown patterns. Even the simple problem where the pattern is known is very difficult under some distance functions (e.g. reversals).

Some good references for the applications of approximate pattern matching to computational biology are Sellers [1974], Needleman and Wunsch [1970], Sankoff and Kruskal [1983], Altschul et al. [1990], Myers [1991, 1994b], Waterman [1995], Yap et al. [1996], and Gusfield [1997].

2.2 Signal Processing

Another early motivation came from signal processing. One of the largest areas deals with speech recognition, where the general problem is to determine, given an audio signal, a textual message which is being transmitted. Even simplified problems, such as discerning a word from a small set of alternatives, are complex, since parts of the signal may be compressed in time, parts of the speech may not be pronounced, etc. A perfect match is practically impossible.

Another problem is error correction. The physical transmission of signals is error-prone. To ensure correct transmission over a physical channel, it is necessary to be able to recover the correct message after a possible modification (error) introduced during the transmission. The probability of such errors is obtained from signal processing theory and used to assign a cost to them. In this case we may not even know what we are searching for; we just want a text which is correct (according to the error correcting code used) and closest to the received message. Although this area has not developed much with respect to approximate searching, it has generated the most important measure of similarity, known as the Levenshtein distance [Levenshtein 1965; 1966] (also called "edit distance").

Signal processing is a very active area today. The rapidly evolving field of multimedia databases demands the ability to search by content in image, audio and video data, which are potential applications for approximate string matching. We expect in the next years a lot of pressure on nonwritten human-machine communication, which involves speech recognition. Strong error correcting codes are also sought, given the current interest in wireless networks, as the air is a low quality transmission medium.

Good references for the relations of approximate pattern matching with signal processing are Levenshtein [1965], Vintsyuk [1968], and Dixon and Martin [1979].

2.3 Text Retrieval

The problem of correcting misspelled words in written text is rather old, perhaps the oldest potential application for approximate string matching. We could find references from the twenties [Masters 1927], and perhaps there are older ones.

Since the sixties, approximate string matching has been one of the most popular tools to deal with this problem. For instance, 80% of these errors are corrected allowing just one insertion, deletion, substitution, or transposition [Damerau 1964].

There are many areas where this problem appears, and Information Retrieval (IR) is one of the most demanding. IR is about finding the relevant information in a large text collection, and string matching is one of its basic tools.

However, classical string matching is normally not enough, because the text collections are becoming larger (e.g. the Web text has surpassed 6 terabytes [Lawrence and Giles 1999]), more heterogeneous (different languages, for instance), and more error prone. Many are so large and grow so fast that it is impossible to control their quality (e.g. in the Web). A word which is entered incorrectly in the database cannot be retrieved anymore. Moreover, the pattern itself may have errors, for instance in cross-lingual scenarios where a foreign name is incorrectly spelled, or in old texts that use outdated versions of the language.

For instance, text collections digitized via optical character recognition (OCR) contain a nonnegligible percentage of errors (7–16%). The same happens with typing (1–3.2%) and spelling (1.5–2.5%) errors. Experiments on typing Dutch surnames (by the Dutch) reached 38% of spelling errors. All these percentages were obtained from Kukich [1992]. Our own experiments with the name "Levenshtein" in Altavista gave more than 30% of errors, allowing just one deletion or transposition.

Nowadays, there is virtually no text retrieval product that does not allow some extended search facility to recover from errors in the text or pattern. Other text processing applications are spelling checkers, natural language interfaces, command language interfaces, computer aided tutoring and language learning, to name a few.

A very recent extension, which became possible thanks to word-oriented text compression methods, is the possibility of performing approximate string matching at the word level [Navarro et al. 2000]. That is, the user supplies a phrase to search and the system searches the text positions where the phrase appears with a limited number of word insertions, deletions and substitutions. It is also possible to disregard the order of the words in the phrases. This allows the query to survive different wordings of the same idea, which extends the applications of approximate pattern matching well beyond the recovery of syntactic mistakes.

Good references about the relation of approximate string matching and information retrieval are Wagner and Fischer [1974], Lowrance and Wagner [1975], Nesbit [1986], Owolabi and McGregor [1988], Kukich [1992], Zobel and Dart [1996], French et al. [1997], and Baeza-Yates and Ribeiro-Neto [1999].

2.4 Other Areas

The number of applications for approximate string matching grows every day. We have found solutions to the most diverse problems based on approximate string matching, for instance handwriting recognition [Lopresti and Tomkins 1994], virus and intrusion detection [Kumar and Spafford 1994], image compression [Luczak and Szpankowski 1997], data mining [Das et al. 1997], pattern recognition [Gonzalez and Thomason 1978], optical character recognition [Elliman and Lancaster 1990], file comparison [Heckel 1978], and screen updating [Gosling 1991], to name a few. Many more applications are mentioned in Sankoff and Kruskal [1983] and Kukich [1992].

3. BASIC CONCEPTS

We present in this section the important concepts needed to understand all the development that follows. Basic knowledge of the design and analysis of algorithms and data structures, basic text algorithms, and formal languages is assumed. If this is not the case, we refer the reader to good books on these subjects, such as Aho et al. [1974], Cormen et al. [1990], and Knuth [1973] (for algorithms), Gonnet and Baeza-Yates [1991], Crochemore and Rytter [1994], and Apostolico and Galil [1997] (for text algorithms), and Hopcroft and Ullman [1979] (for formal languages).

We start with some formal definitions related to the problem. Then we cover some data structures not widely known which are relevant for this survey (they are also explained in Gonnet and Baeza-Yates [1991] and Crochemore and Rytter [1994]). Finally, we make some comments about the tour itself.

3.1 Approximate String Matching

In the discussion that follows, we use s, x, y, z, v, w to represent arbitrary strings, and a, b, c, ... to represent letters. Writing a sequence of strings and/or letters represents their concatenation. We assume that concepts such as prefix, suffix and substring are known. For any string s ∈ Σ* we denote its length as |s|. We also denote s_i the ith character of s, for an integer i ∈ {1..|s|}. We denote s_{i..j} = s_i s_{i+1} ... s_j (which is the empty string if i > j). The empty string is denoted as ε.

In the Introduction we have defined the problem of approximate string matching as that of finding the text positions that match a pattern with up to k errors. We now give a more formal definition.

Let Σ be a finite² alphabet of size |Σ| = σ.
Let T ∈ Σ* be a text of length n = |T|.
Let P ∈ Σ* be a pattern of length m = |P|.
Let k ∈ ℝ be the maximum error allowed.
Let d : Σ* × Σ* → ℝ be a distance function.
The problem is: given T, P, k and d(·), return the set of all the text positions j such that there exists i such that d(P, T_{i..j}) ≤ k.

² However, many algorithms can be adapted to infinite alphabets with an extra O(log m) factor in their cost. This is because the pattern can have at most m different letters, and all the rest can be considered equal for our purposes. A table of size σ could be replaced by a search structure over at most m + 1 different letters.
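To make the definition concrete, here is a minimal brute-force sketch in Python (the function name is ours; none of the surveyed algorithms works this way). It reports every end position j such that some substring of T ending at j is within distance k of P, for an arbitrary distance function d passed as a parameter:

```python
def approximate_matches(T, P, k, d):
    """Report all end positions j (1-based) such that some substring
    T[i..j] is within distance k of the pattern P, for a generic
    distance function d. This takes O(n^2) distance computations;
    the algorithms surveyed later do far better for the edit distance."""
    n = len(T)
    matches = []
    for j in range(1, n + 1):
        for i in range(1, j + 2):        # i = j + 1 yields the empty substring
            if d(P, T[i - 1:j]) <= k:
                matches.append(j)
                break
    return matches

# With this trivial distance, the search degenerates to exact matching:
exact = lambda a, b: 0 if a == b else float("inf")
print(approximate_matches("abracadabra", "abra", 0, exact))  # [4, 11]
```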


Note that endpoints of occurrences are reported to ensure that the output is of linear size. By reversing all strings we can obtain start points.

In this work we restrict our attention to a subset of the possible distance functions. We consider only those defined in the following form:

The distance d(x, y) between two strings x and y is the minimal cost of a sequence of operations that transform x into y (and ∞ if no such sequence exists). The cost of a sequence of operations is the sum of the costs of the individual operations. The operations are a finite set of rules of the form δ(z, w) = t, where z and w are different strings and t is a nonnegative real number. Once the operation has converted a substring z into w, no further operations can be done on w.

Note especially the restriction that forbids acting many times over the same string. Freeing the definition from this condition would allow any rewriting system to be represented, and therefore determining the distance between two strings would not be computable in general.

If for each operation of the form δ(z, w) there exists the respective operation δ(w, z) at the same cost, then the distance is symmetric (i.e. d(x, y) = d(y, x)). Note also that d(x, y) ≥ 0 for all strings x and y, that d(x, x) = 0, and that it always holds d(x, z) ≤ d(x, y) + d(y, z). Hence, if the distance is symmetric, the space of strings forms a metric space.

General substring substitution has been used to correct phonetic errors [Zobel and Dart 1996]. In most applications, however, the set of possible operations is restricted to:

—Insertion: δ(ε, a), i.e. inserting the letter a.

—Deletion: δ(a, ε), i.e. deleting the letter a.

—Substitution or replacement: δ(a, b) for a ≠ b, i.e. substituting a by b.

—Transposition: δ(ab, ba) for a ≠ b, i.e. swapping the adjacent letters a and b.

We are now in position to define the most commonly used distance functions (although there are many others).

—Levenshtein or edit distance [Levenshtein 1965]: allows insertions, deletions and substitutions. In the simplified definition, all the operations cost 1. This can be rephrased as "the minimal number of insertions, deletions and substitutions to make two strings equal." In the literature the search problem is in many cases called "string matching with k differences." The distance is symmetric, and it holds 0 ≤ d(x, y) ≤ max(|x|, |y|).

—Hamming distance [Sankoff and Kruskal 1983]: allows only substitutions, which cost 1 in the simplified definition. In the literature the search problem is in many cases called "string matching with k mismatches." The distance is symmetric, and it is finite whenever |x| = |y|. In this case it holds 0 ≤ d(x, y) ≤ |x|.

—Episode distance [Das et al. 1997]: allows only insertions, which cost 1. In the literature the search problem is in many cases called "episode matching," since it models the case where a sequence of events is sought, where all of them must occur within a short period. This distance is not symmetric, and it may not be possible to convert x into y in this case. Hence, d(x, y) is either |y| − |x| or ∞.

—Longest common subsequence distance [Needleman and Wunsch 1970; Apostolico and Guerra 1987]: allows only insertions and deletions, all costing 1. The name of this distance refers to the fact that it measures the length of the longest pairing of characters that can be made between both strings, so that the pairings respect the order of the letters. The distance is the number of unpaired characters. The distance is symmetric, and it holds 0 ≤ d(x, y) ≤ |x| + |y|.

In all cases, except the episode distance, one can think that the changes can be made over x or y. Insertions on x are the same as deletions in y and vice versa, and substitutions can be made in either of the two strings to match the other.
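As an illustration, the following Python sketches (ours) compute two of these distances directly from their definitions; the edit and LCS distances require the dynamic programming of Section 5:

```python
def hamming(x, y):
    """Substitutions only; finite only when |x| = |y|."""
    if len(x) != len(y):
        return float("inf")
    return sum(a != b for a, b in zip(x, y))

def episode(x, y):
    """Insertions only: |y| - |x| if x is a subsequence of y, else infinite."""
    it = iter(y)
    if all(c in it for c in x):   # consumes it, testing x as a subsequence of y
        return len(y) - len(x)
    return float("inf")

print(hamming("abc", "bcd"))      # 3: every position differs
print(episode("abc", "xaxbxcx"))  # 4: four insertions turn "abc" into "xaxbxcx"
```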

This paper is most concerned with the simple edit distance, which we denote ed(·). Although transpositions are of interest (especially in the case of typing errors), there are few algorithms to deal with them. However, we will consider them at some point in this work (note that a transposition can be simulated with an insertion plus a deletion, but the cost is different). We also point out when the algorithms can be extended to have different costs for the operations (which is of special interest in computational biology), including the extreme case of not allowing some operations. This includes the other distances mentioned.

Note that if the Hamming or edit distance is used, then the problem makes sense for 0 < k < m, since if we can perform m operations we can make the pattern match at any text position by means of m substitutions. The case k = 0 corresponds to exact string matching and is therefore excluded from this work. Under these distances, we call α = k/m the error level, which, given the above conditions, satisfies 0 < α < 1. This value gives an idea of the "error ratio" allowed in the match (i.e. the fraction of the pattern that can be wrong).

We finish this section with some notes about the algorithms we are going to consider. Like string matching, this area is suitable for very theoretical and for very practical contributions. There exist a number of algorithms with important improvements in their theoretical complexity, but they are very slow in practice. Of course, for carefully built scenarios (say, m = 100,000 and k = 2) these algorithms could be a practical alternative, but these cases do not appear in applications. Therefore, we now point out the parameters of the problem that we consider "practical," i.e. likely to be of use in some applications; when we later say "in practice" we mean under the following assumptions.

—The pattern length can be as short as 5 letters (e.g. text retrieval) and as long as a few hundred letters (e.g. computational biology).

—The number of errors allowed, k, satisfies that k/m is a moderately low value. Reasonable values range from 1/m to 1/2.

—The text length can be as short as a few thousand letters (e.g. computational biology) and as long as megabytes or gigabytes (e.g. text retrieval).

—The alphabet size σ can be as low as four letters (e.g. DNA) and as high as 256 letters (e.g. compression applications). It is also reasonable to think of even larger alphabets (e.g. oriental languages or word oriented text compression). The alphabet may or may not be random.

3.2 Suffix Trees and Suffix Automata

Suffix trees [Weiner 1973; Knuth 1973; Apostolico and Galil 1985] are widely used data structures for text processing [Apostolico 1985]. Any position i in a string S automatically defines a suffix of S, namely S_{i..|S|}. In essence, a suffix tree is a trie data structure built over all the suffixes of S. At the leaf nodes the pointers to the suffixes are stored. Each leaf represents a suffix and each internal node represents a unique substring of S. Every substring of S can be found by traversing a path from the root. Each node representing the substring ax has a suffix link that leads to the node representing the substring x.

To improve space utilization, this trie is compacted into a Patricia tree [Morrison 1968]. This involves compressing unary paths. At the nodes that root a compressed path, an indication of which character to inspect is stored. Once unary paths are not present, the tree has O(|S|) nodes instead of the worst-case O(|S|²) of the trie (see Figure 1). The structure can be built in time O(|S|) [McCreight 1976; Ukkonen 1995].

Fig. 1. The suffix trie and suffix tree for a sample string. The "$" is a special marker to denote the end of the text. Two suffix links are exemplified in the trie: from "abra" to "bra" and then to "ra". The internal nodes of the suffix tree show the character position to inspect in the string.
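As a rough illustration of the uncompacted structure, the following Python sketch (ours) builds a plain suffix trie as nested dictionaries and uses it to test substrings; practical implementations use the linear-time suffix tree constructions cited above:

```python
def suffix_trie(S):
    """Naive suffix trie of S (O(|S|^2) nodes in the worst case).
    Each node is a dict from character to child node; '$' ends a suffix."""
    root = {}
    for i in range(len(S)):
        node = root
        for c in S[i:] + "$":
            node = node.setdefault(c, {})
    return root

def is_substring(trie, x):
    """Every substring of S is a prefix of some suffix, i.e. a path from the root."""
    node = trie
    for c in x:
        if c not in node:
            return False
        node = node[c]
    return True

t = suffix_trie("abracadabra")
print(is_substring(t, "acad"), is_substring(t, "abd"))  # True False
```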

A DAWG (Deterministic Acyclic Word Graph) [Crochemore 1986; Blumer et al. 1985] built on a string S is a deterministic automaton able to recognize all the substrings of S. As each node in the suffix tree corresponds to a substring, the DAWG is no more than the suffix tree augmented with failure links for the letters not present in the tree. Since final nodes are not distinguished, the DAWG is smaller. DAWGs have similar applications to those of suffix trees, and also need O(|S|) space and construction time. Figure 2 illustrates.

A suffix automaton on S is an automaton that recognizes all the suffixes of S. The nondeterministic version of this automaton has a very regular structure and is shown in Figure 3 (the deterministic version can be seen in Figure 2).

Fig. 2. The DAWG or the suffix automaton for the sample string. If all the states are final, it is a DAWG. If only the 2nd, 5th and rightmost states are final then it is a suffix automaton.

Fig. 3. A nondeterministic suffix automaton to recognize any suffix of "abracadabra." Dashed lines represent ε-transitions (i.e. they occur without consuming any input).

3.3 The Tour

Sections 5–8 present a historical tour across the four main approaches to online approximate string matching (see Figure 4). In those historical discussions, keep in mind that there may be a long gap between the time when a result is discovered and when it finally gets published in its definitive form. Some apparent inconsistencies can be explained in this way (e.g. algorithms which are "finally" analyzed before they appear). We did our best in the bibliography to trace the earliest version of the works, although the full reference corresponds generally to the final version.

At the beginning of each of these sections we give a taxonomy to help guide the tour. The taxonomy is an acyclic graph where the nodes are the algorithms and the edges mean that the work lower down can be seen as an evolution of the work in the upper position (although sometimes the developments are in fact independent).

Fig. 4. Taxonomy of the types of solutions for online searching.

Finally, we specify some notation regarding time and space complexity. When we say that an algorithm is O(x) time we refer to its worst case (although sometimes we say that explicitly). If the cost is average, we say so explicitly. We also sometimes say that the algorithm is O(x) cost, meaning time. When we refer to space complexity we say so explicitly. The average case analysis normally assumes a random text, where each character is selected uniformly and independently from the alphabet. The pattern is not normally assumed to be random.

4. THE STATISTICS OF THE PROBLEM

A natural question about approximate searching is: what is the probability of a match? This question is not only interesting in itself, but also essential for the average case analysis of many search algorithms, as will be seen later. We now present the existing results and an empirical validation. In this section we consider the edit distance only. Some variants can be adapted to these results.

The effort in analyzing the probabilistic behavior of the edit distance has not given good results in general [Kurtz and Myers 1997]. An exact analysis of the probability of the occurrence of a fixed pattern allowing k substitution errors (i.e. Hamming distance) can be found in Regnier and Szpankowski [1997], although the result is not easy to average over all the possible patterns. The results we present here apply to the edit distance model and, although not exact, are easier to use in general.

The result of Regnier and Szpankowski [1997] holds under the assumption that the characters of the text are independently generated with fixed probabilities, i.e. a Bernoulli model. In the rest of this paper we consider a simpler model, the "uniform Bernoulli model," where all the characters occur with the same probability 1/σ. Although this is a gross simplification of the real processes that generate the texts in most applications, the results obtained are quite reliable in practice. In particular, all the analyses apply quite well to biased texts if we replace σ by 1/p, where p is the probability that two random text characters are equal.

Although the problem of the average edit distance between two strings is closely related to the better studied LCS, the well known results of Chvatal and Sankoff [1975] and Deken [1979] can hardly be applied to this case. It can be shown that the average edit distance between two random strings of length m tends to a constant fraction of m as m grows, but the fraction is not known. It holds that for any two strings of length m, m − lcs ≤ ed ≤ 2(m − lcs), where ed is their edit distance and lcs is the length of their longest common subsequence. As proved in Chvatal and Sankoff [1975], the average LCS is between m/√σ and me/√σ for large σ, and therefore the average edit distance is between m(1 − e/√σ) and 2m(1 − 1/√σ). For large σ it is conjectured that the true value is m(1 − 1/√σ) [Sankoff and Mainville 1983].

For our purposes, bounding the probability of a match allowing errors is more important than the average edit distance. Let f(m, k) be the probability of a random pattern of length m matching a given text position with k errors or less under the edit distance (i.e. the text position is reported as the end of a match). In Baeza-Yates and Navarro [1999], Navarro [1998], and Navarro and Baeza-Yates [1999b], upper and lower bounds on the maximum error level α∗ for which f(m, k) is exponentially decreasing on m are found. This is important because many algorithms search for potential matches that have to be verified later, and the cost of such verifications is polynomial in m, typically O(m²). Therefore, if that event occurs with probability O(γ^m) for some γ < 1, then the total cost of verifications is O(m² γ^m) = o(1), which makes the verification cost negligible.

We first show the analytical bounds for f(m, k), then give a new result on average edit distance, and finally present an experimental verification.

4.1 An Upper Bound

The upper bound for α∗ comes from the proof that the matching probability is f(m, k) = O(γ^m) for

γ = (1 / (σ α^{2α/(1−α)} (1−α)²))^{1−α} ≤ (e² / (σ (1−α)²))^{1−α}    (1)

where we note that γ is 1/σ for α = 0 and grows to 1 as α grows. This matching probability is exponentially decreasing on m as long as γ < 1, which is equivalent to

α < 1 − e/√σ + O(1/σ)    (2)

Therefore, α < 1 − e/√σ is a conservative condition on the error level which ensures "few" matches, and hence the maximum level α∗ satisfies α∗ > 1 − e/√σ.

The proof is obtained using a combinatorial model. Based on the observation that m − k common characters must appear in the same order in two strings that match with k errors, all the possible alternatives to select the matching characters from both strings are enumerated. This model, however, does not take full advantage of the properties of the edit distance: even if m − k characters match, the distance can be larger than k. For example, ed(abc, bcd) = 2: although two characters match, the distance is not 1.
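To get a feeling for Eq. (1), the following small Python sketch (ours) evaluates γ and its upper bound for given σ and α; error levels below 1 − e/√σ indeed yield γ < 1:

```python
from math import e, sqrt

def gamma(sigma, alpha):
    """Base of Eq. (1) and its simpler upper bound (e^2/(sigma(1-alpha)^2))^(1-alpha)."""
    exact = (1.0 / (sigma * alpha ** (2 * alpha / (1 - alpha))
                    * (1 - alpha) ** 2)) ** (1 - alpha)
    bound = (e ** 2 / (sigma * (1 - alpha) ** 2)) ** (1 - alpha)
    return exact, bound

sigma = 32
limit = 1 - e / sqrt(sigma)       # conservative error level, about 0.52 for sigma = 32
print(gamma(sigma, 0.9 * limit))  # both values below 1: few matches
print(gamma(sigma, limit))        # the upper bound reaches 1 at the limit
```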

4.2 A Lower Bound

On the other hand, the only optimistic bound we know of is based on the consideration that only substitutions are allowed (i.e. Hamming distance). This distance is simpler to analyze, but its matching probability is much lower. Using a combinatorial model again, it is shown that the matching probability is f(m, k) ≥ δ^m m^{−1/2}, where

δ = (1 / ((1−α) σ))^{1−α}

Therefore an upper bound for the maximum α∗ value is α∗ ≤ 1 − 1/σ, since otherwise it can be proved that f(m, k) is not exponentially decreasing on m (i.e. it is Ω(m^{−1/2})).

4.3 A New Result on Average Edit Distance

We can now prove that the average edit distance is larger than m(1 − e/√σ) for any σ (recall that the result of Chvatal and Sankoff [1975] holds for large σ). We define p(m, k) as the probability that the edit distance between two strings of length m is at most k. Note that p(m, k) ≤ f(m, k), because in the latter case we can match with any text suffix of length from m − k to m + k. Then the average edit distance is

∑_{k=0}^{m} k Pr(ed = k) = ∑_{k=0}^{m} Pr(ed > k) = ∑_{k=0}^{m} (1 − p(m, k)) = m − ∑_{k=0}^{m} p(m, k)

which, since p(m, k) increases with k, is larger than

m − (K p(m, K) + (m − K)) = K (1 − p(m, K))

for any K of our choice. In particular, for K/m < 1 − e/√σ we have that p(m, K) ≤ f(m, K) = O(γ^m) for γ < 1. Therefore choosing K = m(1 − e/√σ) − 1 yields that the average edit distance is at least m(1 − e/√σ) + O(1), for any σ. As we see later, this proof converts a conjecture about the average running time of an algorithm [Chang and Lampe 1992] into a fact.

4.4 Empirical Verification

We verify the analysis experimentally in this section (this is also taken from Baeza-Yates and Navarro [1999] and Navarro [1998]). The experiment consists of generating a large random text (n = 10 MB) and running the search of a random pattern on that text, allowing k = m errors. At each text character, we record the minimum allowed error k for which that text position matches the pattern. We repeat the experiment with 1,000 random patterns.

Finally, we build the cumulative histogram, finding how many text positions have matched with up to k errors, for each k value. We consider that k is "low enough" up to where the histogram values become significant, that is, as long as few text positions have matched. The threshold is set to n/m², since m² is the normal cost of verifying a match. However, the selection of this threshold is not very important, since the histogram is extremely concentrated. For example, for m in the hundreds, it moves from almost zero to almost n in just five or six increments of k.
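A sketch (ours, under the stated setup) of this bookkeeping step, given the recorded minimal error per text position:

```python
def cumulative_histogram(min_errors, m, n):
    """min_errors[p] = minimal k with which text position p matched the pattern.
    Returns hist[k] = number of positions matching with up to k errors, and the
    largest k still considered "low enough" under the n/m^2 threshold."""
    hist = [0] * (m + 1)
    for k in min_errors:
        hist[k] += 1
    for k in range(1, m + 1):
        hist[k] += hist[k - 1]          # make the histogram cumulative
    low = max((k for k in range(m + 1) if hist[k] <= n / (m * m)), default=0)
    return hist, low
```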

Figure 5 shows the results for σ = 32. On the left we show the histogram we have built, where the matching probability undergoes a sharp increase at α∗. On the right we show the α∗ value as m grows. It is clear that α∗ is essentially independent of m, although it is a bit lower for short patterns. The increase in the left plot at α∗ is so sharp that the right plot would be the same if we plotted the value of the average edit distance divided by m.

Fig. 5. On the left, probability of an approximate match as a function of the error level (m = 300). On the right, the observed α∗ error level as a function of the pattern length. Both cases correspond to random text with σ = 32.

Figure 6 uses a stable m = 300 to show the α∗ value as a function of σ. The curve α = 1 − 1/√σ is included to show its closeness to the experimental data. Least squares give the approximation α∗ = 1 − 1.09/√σ, with a relative error smaller than 1%. This shows that the upper bound analysis (Eq. (2)) matches reality better, provided we replace e by 1.09 in the formulas.

Fig. 6. Theoretical and practical values for α∗, for m = 300 and different σ values.

Therefore, we have shown that the matching probability has a sharp behavior: for low α it is very low, not as low as 1/σ^m like exact string matching, but still exponentially decreasing in m, with an exponent base larger than 1/σ. At some α value (that we call α∗) it sharply increases and quickly becomes almost 1. This point is close to α∗ = 1 − 1/√σ in practice.

This is why the problem is of interest only up to a given error level, since for higher errors almost all text positions match. This is also the reason that some algorithms have good average behavior only for low enough error levels. The point α∗ = 1 − 1/√σ matches the conjecture of Sankoff and Mainville [1983].

ACM Computing Surveys, Vol. 33, No. 1, March 2001.

4 approaches to approx. string matching

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

DP=Dyn. Progr.

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

For Evaluation Only.Copyright (c) by Foxit Software Company, 2004Edited by Foxit PDF Editor

Page 13: A Guided Tour to Approximate String Matchingcheung/papers/Matching/Navarro-Survey-Approx...A Guided Tour to Approximate String Matching 33 distance, despite being a simplification

A Guided Tour to Approximate String Matching 43

Fig. 5 . On the left, probability of an approximate match as a function of the error level (m =300). On the right, the observed α∗ error level as a function of the pattern length. Both casescorrespond to random text with σ = 32.

5. DYNAMIC PROGRAMMING ALGORITHMS

We start our tour with the oldest among the four areas, which directly inherits from the earliest work. Most of the theoretical breakthroughs in the worst-case algorithms belong to this category, although only a few of them are really competitive in practice. The latest practical work in this area dates back to 1992, although there are recent theoretical improvements. The major achievements are O(kn) worst-case algorithms and O(kn/√σ) average-case algorithms, as well as other recent theoretical improvements on the worst case.

We start by presenting the first algorithm that solved the problem and then give a historical tour of the improvements over the initial solution. Figure 7 helps guide the tour.

5.1 The First Algorithm

We now present the first algorithm to solve the problem. It has been rediscovered many times in the past, in different areas, e.g. Vintsyuk [1968], Needleman and Wunsch [1970], Sankoff [1972], Sellers [1974], Wagner and Fischer [1974], and Lowrance and Wagner [1975] (there are more in Ullman [1977], Sankoff and Kruskal [1983], and Kukich [1992]). However, this algorithm computed the edit distance, and it was converted into a search algorithm only in 1980 by Sellers [Sellers 1980]. Although the algorithm is not very efficient, it is among the most flexible ones in adapting to different distance functions.

We first show how to compute the edit distance between two strings x and y. Later, we extend that algorithm to search a pattern in a text allowing errors. Finally, we show how to handle more general distance functions.

5.1.1 Computing Edit Distance. The algorithm is based on dynamic programming. Imagine that we need to compute ed(x, y). A matrix C_{0..|x|, 0..|y|} is filled, where C_{i,j} represents the minimum number of operations needed to match x_{1..i} to y_{1..j}. This is computed as follows:

C_{i,0} = i
C_{0,j} = j
C_{i,j} = if (x_i = y_j) then C_{i−1,j−1}
          else 1 + min(C_{i−1,j}, C_{i,j−1}, C_{i−1,j−1})

where at the end C_{|x|,|y|} = ed(x, y). The rationale for the above formula is as follows. First, C_{i,0} and C_{0,j} represent the edit distance between a string of length i or j and the empty string. Clearly i (respectively j) deletions are needed on the nonempty string. For two nonempty strings of length i and j, we assume inductively that all the edit distances between shorter strings have already been computed, and try to convert x_{1..i} into y_{1..j}.

Consider the last characters x_i and y_j. If they are equal, then we do not need to


Fig. 6. Theoretical and practical values for α*, for m = 300 and different σ values.

consider them and we proceed in the best possible way to convert x_{1..i−1} into y_{1..j−1}. On the other hand, if they are not equal, we must deal with them in some way. Following the three allowed operations, we can delete x_i and convert in the best way x_{1..i−1} into y_{1..j}, insert y_j at the end of x_{1..i} and convert in the best way x_{1..i} into y_{1..j−1}, or substitute x_i by y_j and convert in the best way x_{1..i−1} into y_{1..j−1}. In all cases, the cost is 1 plus the cost for the rest of the process (already computed). Notice that the insertions in one string are equivalent to deletions in the other.
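
To fix ideas, here is a minimal C sketch of this computation (the function name and layout are ours, not from the survey); it fills the whole matrix and returns C_{|x|,|y|}:

```c
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of the dynamic programming computation of ed(x, y).
   Fills the full (|x|+1) x (|y|+1) matrix: O(|x||y|) time and space. */
int edit_distance(const char *x, const char *y)
{
    int lx = strlen(x), ly = strlen(y), d;
    int *C = malloc((lx + 1) * (ly + 1) * sizeof(int));
#define CELL(i, j) C[(i) * (ly + 1) + (j)]
    for (int i = 0; i <= lx; i++) CELL(i, 0) = i;   /* delete i characters */
    for (int j = 0; j <= ly; j++) CELL(0, j) = j;   /* insert j characters */
    for (int i = 1; i <= lx; i++)
        for (int j = 1; j <= ly; j++) {
            if (x[i-1] == y[j-1])                    /* x_i = y_j: no cost  */
                CELL(i, j) = CELL(i-1, j-1);
            else {                                   /* cheapest of three   */
                int m = CELL(i-1, j);                         /* deletion     */
                if (CELL(i, j-1) < m) m = CELL(i, j-1);       /* insertion    */
                if (CELL(i-1, j-1) < m) m = CELL(i-1, j-1);   /* substitution */
                CELL(i, j) = 1 + m;
            }
        }
    d = CELL(lx, ly);
#undef CELL
    free(C);
    return d;
}
```

With this sketch, edit_distance("survey", "surgery") returns 2, the value shown in Figure 8.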

An equivalent formula which is also widely used is

C'_{i,j} = min(C_{i−1,j−1} + δ(x_i, y_j), C_{i−1,j} + 1, C_{i,j−1} + 1)

where δ(a, b) = 0 if a = b and 1 otherwise. It is easy to see that both formulas are equivalent because neighboring cells differ in at most one (just recall the meaning of C_{i,j}), and therefore when δ(x_i, y_j) = 0 we have that C_{i−1,j−1} cannot be larger than C_{i−1,j} + 1 or C_{i,j−1} + 1.

The dynamic programming algorithm must fill the matrix in such a way that the upper, left, and upper-left neighbors of a cell are computed prior to computing that cell. This is easily achieved by either a row-wise left-to-right traversal or a column-wise top-to-bottom traversal, but we will see later that, using a difference recurrence, the matrix can also be filled by (upper-left to lower-right) diagonals or "secondary" (upper-right to lower-left) diagonals. Figure 8 illustrates the algorithm to compute ed("survey", "surgery").

Therefore, the algorithm is O(|x||y|) time in the worst and average case. However, the space required is only O(min(|x|, |y|)). This is because, in the case of a column-wise processing, only the previous column must be stored in order to compute the new one, and therefore we just keep one column and update it. We can process the matrix row-wise or column-wise so that the space requirement is minimized.

On the other hand, the sequence of operations performed to transform x into y can be easily recovered from the matrix, simply by proceeding from the cell C_{|x|,|y|} to the cell C_{0,0} following the path (i.e. sequence of operations) that matches the update formula (multiple paths may exist). In this case, however, we need to store the complete matrix or at least an area around the main diagonal.
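
As an illustration of this backward walk, a sketch under the same 0-based conventions as above (it assumes the full matrix of the previous sketch was kept, and prints one valid operation sequence from the end of the strings toward the beginning):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: recover one optimal operation sequence by walking from
   C[|x|][|y|] back to C[0][0]; C is the matrix of the previous sketch.
   The operations come out in reverse order (end of strings first). */
void print_one_path(const int *C, const char *x, const char *y)
{
    int i = strlen(x), ly = strlen(y), j = ly;
#define CELL(i, j) C[(i) * (ly + 1) + (j)]
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && x[i-1] == y[j-1] && CELL(i, j) == CELL(i-1, j-1))
            { i--; j--; }                               /* characters match */
        else if (i > 0 && j > 0 && CELL(i, j) == CELL(i-1, j-1) + 1)
            { printf("substitute %c by %c\n", x[i-1], y[j-1]); i--; j--; }
        else if (i > 0 && CELL(i, j) == CELL(i-1, j) + 1)
            { printf("delete %c\n", x[i-1]); i--; }
        else
            { printf("insert %c\n", y[j-1]); j--; }
    }
#undef CELL
}
```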

This matrix has some properties that can be easily proved by induction (see, e.g. Ukkonen [1985a]) and which make it


Fig. 7. Taxonomy of algorithms based on the dynamic programming matrix. References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: Vin68 = [Vintsyuk 1968], NW70 = [Needleman and Wunsch 1970], San72 = [Sankoff 1972], Sel74 = [Sellers 1974], WF74 = [Wagner and Fischer 1974], LW75 = [Lowrance and Wagner 1975], Sel80 = [Sellers 1980], MP80 = [Masek and Paterson 1980], Ukk85a & Ukk85b = [Ukkonen 1985a; 1985b], Mye86a & Mye86b = [Myers 1986a; 1986b], LV88 & LV89 = [Landau and Vishkin 1988; 1989], GP90 = [Galil and Park 1990], UW93 = [Ukkonen and Wood 1993], GG88 = [Galil and Giancarlo 1988], CL92 = [Chang and Lampe 1992], CL94 = [Chang and Lawler 1994], SV97 = [Sahinalp and Vishkin 1997], CH98 = [Cole and Hariharan 1998], and BYN99 = [Baeza-Yates and Navarro 1999].


Fig. 8. The dynamic programming algorithm to compute the edit distance between "survey" and "surgery." The bold entries show the path to the final result.

possible to design better algorithms. Some of the most useful are that the values of neighboring cells differ in at most one, and that upper-left to lower-right diagonals are nondecreasing.

5.1.2 Text Searching. We now show how to adapt this algorithm to search a short pattern P in a long text T. The algorithm is basically the same, with x = P and y = T (proceeding column-wise so that O(m) space is required). The only difference is that we must allow that any text position is the potential start of a match. This is achieved by setting C_{0,j} = 0 for all j ∈ 0..n. That is, the empty pattern matches with zero errors at any text position (because it matches with a text substring of length zero).

The algorithm then initializes its column C_{0..m} with the values C_i = i, and processes the text character by character. At each new text character T_j, its column vector is updated to C'_{0..m}. The update formula is

C'_i = if (P_i = T_j) then C_{i−1}
       else 1 + min(C'_{i−1}, C_i, C_{i−1})

and the text positions where C_m ≤ k are reported.

The search time of this algorithm is O(mn) and its space requirement is O(m). This is a sort of worst case in the analysis of all the algorithms that we consider later. Figure 9 exemplifies this algorithm applied to search the pattern "survey" in the text "surgery" (a very short text indeed) with at most k = 2 errors. In this case there are 3 occurrences.

Fig. 9. The dynamic programming algorithm to search "survey" in the text "surgery" with two errors. Each column of this matrix is a value of the C vector. Bold entries indicate matching text positions.
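
A small C sketch of this online search, keeping only the current column (again our own code, following the recurrence above):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of Sellers' search: O(mn) time, O(m) space. Reports every text
   position j where the pattern matches ending at j with at most k errors. */
void sellers_search(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    int *C = malloc((m + 1) * sizeof(int));
    for (int i = 0; i <= m; i++) C[i] = i;       /* initial column         */
    for (int j = 1; j <= n; j++) {               /* one column per T_j     */
        int prev = C[0];                         /* old C_{i-1}; C[0] is 0 */
        for (int i = 1; i <= m; i++) {
            int old = C[i];
            if (P[i-1] == T[j-1]) C[i] = prev;
            else {
                int min = C[i-1] < old ? C[i-1] : old;  /* C'_{i-1}, old C_i */
                if (prev < min) min = prev;             /* old C_{i-1}       */
                C[i] = 1 + min;
            }
            prev = old;
        }
        if (C[m] <= k) printf("occurrence ends at %d\n", j);
    }
    free(C);
}
```

Calling sellers_search("survey", "surgery", 2) reports the three occurrences of Figure 9.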

5.1.3 Other Distance Functions. It is easy to adapt this algorithm for the other distance functions mentioned. If the operations have different costs, we add the cost instead of adding 1 when computing C_{i,j}, i.e.

C_{0,0} = 0
C_{i,j} = min(C_{i−1,j−1} + δ(x_i, y_j), C_{i−1,j} + δ(x_i, ε), C_{i,j−1} + δ(ε, y_j))

where we assume δ(a, a) = 0 for any a ∈ Σ and that C_{−1,j} = C_{i,−1} = ∞ for all i, j.

For distances that do not allow some operations, we just take them out of the minimization formula, or, which is the same, we assign ∞ to their δ cost. For transpositions, we allow a fourth rule that says that C_{i,j} can be C_{i−2,j−2} + 1 if x_{i−1}x_i = y_j y_{j−1} [Lowrance and Wagner 1975].
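
In this simple form the rule can be grafted directly onto the dynamic programming loop; a sketch (our code, extending the earlier edit distance example):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: edit distance extended with the simple transposition rule above
   (C_{i,j} may also be C_{i-2,j-2} + 1 when x_{i-1} x_i = y_j y_{j-1}). */
int edit_distance_transpositions(const char *x, const char *y)
{
    int lx = strlen(x), ly = strlen(y), d;
    int *C = malloc((lx + 1) * (ly + 1) * sizeof(int));
#define CELL(i, j) C[(i) * (ly + 1) + (j)]
    for (int i = 0; i <= lx; i++) CELL(i, 0) = i;
    for (int j = 0; j <= ly; j++) CELL(0, j) = j;
    for (int i = 1; i <= lx; i++)
        for (int j = 1; j <= ly; j++) {
            int best;
            if (x[i-1] == y[j-1]) best = CELL(i-1, j-1);
            else {
                best = CELL(i-1, j);                              /* deletion  */
                if (CELL(i, j-1) < best) best = CELL(i, j-1);     /* insertion */
                if (CELL(i-1, j-1) < best) best = CELL(i-1, j-1); /* subst.    */
                best++;
            }
            /* fourth rule: transpose the last two characters */
            if (i > 1 && j > 1 && x[i-1] == y[j-2] && x[i-2] == y[j-1]
                && CELL(i-2, j-2) + 1 < best)
                best = CELL(i-2, j-2) + 1;
            CELL(i, j) = best;
        }
    d = CELL(lx, ly);
#undef CELL
    free(C);
    return d;
}
```

For example, edit_distance_transpositions("ab", "ba") returns 1, where the plain edit distance is 2.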

The most complex case is to allow general substring substitutions, in the form of a finite set R of rules. The formula is given in Ukkonen [1985a].

C_{0,0} = 0
C_{i,j} = min( C_{i−1,j−1} if x_i = y_j,
               C_{i−|s1|, j−|s2|} + δ(s1, s2) for each (s1, s2) ∈ R, x_{1..i} = x's1, y_{1..j} = y's2 )

An interesting problem is how to compute this recurrence efficiently. A naive approach takes O(|R|mn), where |R| is the sum of all the lengths of the strings in


Fig. 10. The Masek and Paterson algorithm partitions the dynamic programming matrix in cells (r = 2 in this example). On the right, we shaded the entries of adjacent cells that influence the current one.

R. A better solution is to build two Aho–Corasick automata [Aho and Corasick 1975] with the left and right hand sides of the rules, respectively. The automata are run as we advance in both strings (left hand sides in x and right hand sides in y). For each pair of states (i1, i2) of the automata we precompute the set of substitutions that can be tried (i.e. those δ's whose left and right hand sides match the suffixes of x and y, respectively, represented by the automata states). Hence, we know in constant time (per cell) the set of possible substitutions. The complexity is now much lower; in the worst case it is O(cmn), where c is the maximum number of rules applicable to a single text position.

As mentioned, the dynamic programming approach is unbeaten in flexibility, but its time requirements are indeed high. A number of improved solutions have been proposed over the years. Some of them work only for the edit distance, while others can still be adapted to other distance functions. Before considering the improvements, we mention that there exists a way to see the problem as a shortest path problem on a graph built on the pattern and the text [Ukkonen 1985a]. This reformulation has been conceptually useful for more complex variants of the problem.

5.2 Improving the Worst Case

5.2.1 Masek and Paterson (1980). It is interesting that one important worst-case theoretical result in this area is as old as the Sellers algorithm [Sellers 1980] itself. In 1980, Masek and Paterson [1980] found an algorithm whose worst case cost is O(mn/log²_σ n) and requires O(n) extra space. This is an improvement over the O(mn) classical complexity.

The algorithm is based on the Four-Russians technique [Arlazarov et al. 1975]. Basically, it replaces the alphabet Σ by r-tuples (i.e. Σ^r) for a small r. Considered algorithmically, it first builds a table of solutions of all the possible problems (i.e. portions of the matrix) of size r × r, and then uses the table to solve the original problem in blocks of size r. Figure 10 illustrates.

The values inside the r × r size cells depend on the corresponding letters in the pattern and the text, which gives σ^{2r} possibilities. They also depend on the values in the last column and row of the upper and left cells, as well as the bottom-right state of the upper left cell (see Figure 10). Since neighboring cells differ in at most one, there are only three choices for adjacent cells once the current cell is known. Therefore, this adds only m 3^{2r} possibilities. In total, there are m (3σ)^{2r} different cells to precompute. Using O(n) memory we have enough space for r = log_{3σ} n, and since we finally compute mn/r² cells, the final complexity follows.

The algorithm is only of theoretical interest, since, as the same authors estimate, it will not beat the classical algorithm for


Fig. 11. On the left, the O(k²) algorithm to compute the edit distance. On the right, the way to compute the strokes in diagonal transition algorithms. The solid bold line is guaranteed to be part of the new stroke of e errors, while the dashed part continues as long as both strings match.

texts below 40 GB size (and it would need that extra space!). Adapting it to other distance functions does not seem difficult, but the dependencies among different cells may become more complex.

5.2.2 Ukkonen (1983). In 1983, Ukkonen [1985a] presented an algorithm able to compute the edit distance between two strings x and y in O(ed(x, y)²) time, or to check in time O(k²) whether that distance was ≤ k or not. This is the first member of what has been called "diagonal transition algorithms," since it is based on the fact that the diagonals of the dynamic programming matrix (running from the upper-left to the lower-right cells) are monotonically increasing (more than that, C_{i+1,j+1} ∈ {C_{i,j}, C_{i,j} + 1}). The algorithm is based on computing in constant time the positions where the values along the diagonals are incremented. Only O(k²) such positions are computed to reach the lower-right decisive cell.

Figure 11 illustrates the idea. Each diagonal stroke represents a number of errors, and is a sequence where both strings match. When a stroke of e errors starts, its beginning is dictated by the adjacent strokes of e − 1 errors; it then continues as long as both strings keep matching. To compute each stroke in constant time we need to know for how long it keeps matching the text. The way to do this in constant time is explained shortly.

5.2.3 Landau and Vishkin (1985). In 1985 and 1986, Landau and Vishkin found the first worst-case time improvements for the search problem. All of them, and the thread that followed, were diagonal transition algorithms. In 1985, Landau and Vishkin [1988] showed an algorithm which was O(k²n) time and O(m) space, and in 1986 they obtained O(kn) time and O(n) space [Landau and Vishkin 1989].

The main idea of Landau and Vishkin was to adapt Ukkonen's diagonal transition algorithm for edit distance [Ukkonen 1985a] to text searching. Basically, the dynamic programming matrix was computed diagonal-wise (i.e. stroke by stroke) instead of column-wise. They wanted to compute the length of each stroke in constant time (i.e. the point where the values along a diagonal were to be incremented). Since a text position was to be reported when matrix row m was reached before incrementing more than k times the values along the diagonal, this immediately gave the O(kn) algorithm. Another way to see it is that each diagonal is abandoned as soon as its kth stroke ends; there are n diagonals and hence nk strokes, each of them computed in constant time (recall Figure 11).

A recurrence on diagonals (d) and number of errors (e), instead of rows (i) and columns (j), is set up in the following way:


Fig. 12. The diagonal transition matrix to search "survey" in the text "surgery" with two errors. Bold entries indicate matching diagonals. The rows are e values and the columns are the d values.

L_{d,−1} = L_{n+1,e} = −1, for all e, d
L_{d,|d|−2} = |d| − 2, for −(k + 1) ≤ d ≤ −1
L_{d,|d|−1} = |d| − 1, for −(k + 1) ≤ d ≤ −1
L_{d,e} = i + max ℓ such that P_{i+1..i+ℓ} = T_{d+i+1..d+i+ℓ},
    where i = max(L_{d,e−1} + 1, L_{d−1,e−1}, L_{d+1,e−1} + 1)

where the external loop updates e from 0 to k and the internal one updates d from −e to n. Negative numbered diagonals are those virtually starting before the first text position. Figure 12 shows our search example using this recurrence.

Note that the L matrix has to be filled by diagonals, e.g. L_{0,3}, L_{1,2}, L_{2,1}, L_{0,4}, L_{1,3}, L_{2,2}, L_{0,5}, .... The difficult part is how to compute the strokes in constant time (i.e. the max ℓ above). The problem is equivalent to knowing which is the longest prefix of P_{i..m} that matches T_{j..n}. This data is called "matching statistics." The algorithms of this section differ basically in how they manage to quickly compute the matching statistics.

We defer the explanation of Landau and Vishkin [1988] for later (together with Galil and Park [1990]). In Landau and Vishkin [1989], the longest match is obtained by building the suffix tree (see Section 3.2) of T;P (text concatenated with pattern), which is where the huge O(n) extra space comes from. The longest prefix common to both suffixes P_{i..m} and T_{j..n} can be visualized in the suffix tree as follows: imagine the root-to-leaf paths that end in each of the two suffixes. Both paths share their beginning (at least they share the root). The last suffix tree node common to both paths represents a substring which is precisely the longest common prefix. In the literature, this last common node is called the lowest common ancestor (LCA) of the two nodes.

Despite being conceptually clear, it is not easy to find this node in constant time. In 1986, the only existing LCA algorithm was that of Harel and Tarjan [1984], which had constant amortized time, i.e. it answered n' > n LCA queries in O(n') time. In our case we have kn queries, so each one finally costs O(1). The resulting algorithm, however, is quite slow in practice.

5.2.4 Myers (1986). In 1986, Myers also found an algorithm with O(kn) worst-case behavior [Myers 1986a; 1986b]. It needed O(n) extra space, and shared the idea of computing the k new strokes using the previous ones, as well as the use of a suffix tree on the text for the LCA algorithm. Unlike other algorithms, this one is able to report the O(kn) matching substrings of the text (not only the endpoints) in O(kn) time. This makes the algorithm suitable for more complex applications, for instance in computational biology. The original reference is a technical report that never went to press, but it has recently been included in a larger work [Landau et al. 1998].

5.2.5 Galil and Giancarlo (1988). In 1988, Galil and Giancarlo [1988] obtained the same time complexity as Landau and Vishkin using O(m) space. Basically, the suffix tree of the text is built by overlapping pieces of size O(m). The algorithm scans the text four times, being even slower than Landau and Vishkin [1989]. Therefore, the result was of theoretical interest.

5.2.6 Galil and Park (1989). One year later, in 1989, Galil and Park [1990] obtained O(kn) worst-case time and O(m²) space, worse in theory than Galil and Giancarlo [1988] but much better in practice. Their idea is rooted in the work of Landau and Vishkin [1988] (which had obtained O(k²n) time). In both cases, the idea is to build the matching statistics of the pattern against itself (longest match between P_{i..m} and P_{j..m}), resembling in some sense the basic ideas of Knuth et al. [1977]. But this algorithm is still slow in practice.


Fig. 13. On the left, the progress of the stroke-wise algorithm. The relevant strokes are enclosed in a dotted triangle and the last k strokes computed are in bold. On the right, the selection of the k relevant strokes to cover the last text area. We put in bold the parts of the strokes that are used.

Consider again Figure 11, and in particular the new stroke with e errors at the right. The beginning of the stroke is dictated by the three neighboring strokes of e − 1 errors, but after the longest of the three ceases to affect the new stroke, how long it continues (dashed line) depends only on the similarity between pattern and text. More specifically, if the dotted line (suffix of a stroke) at diagonal d spans rows i1 to i1 + ℓ, the longest match between T_{d+i1..} and P_{i1..} has length ℓ. Therefore, the strokes computed by the algorithm give some information about longest matches between text and pattern. The difficult part is how to use that information.

Figure 13 illustrates the algorithm. As explained, the algorithm progresses by strokes, filling the matrix of Figure 12 diagonally, so that when a stroke is computed, its three neighbors are already computed. We have enclosed in a dotted triangle the strokes that may contain the information on longest matches relevant to the new strokes that are being computed. The algorithm of Landau and Vishkin [1988] basically searches for the relevant information in this triangle and hence is O(k²n) time.

This is improved in Galil and Park [1990] to O(kn) by considering the relevant strokes carefully. Let us call e-stroke a stroke with e errors. First consider a 0-stroke. This full stroke (not only a suffix) represents a longest match between pattern and text. So, from the k previous 0-strokes we can keep the one that lasts longest in the text, and up to that text position we have all the information we need about longest matches. We consider now all the 1-strokes. Although only a suffix of those strokes really represents a longest match between pattern and text, we know that this is definitely true after the last text position reached by a 0-stroke (since by then no 0-stroke can "help" a 1-stroke to last longer). Therefore, we can keep the 1-stroke that lasts longest in the text and use it to define longest matches between pattern and text when there are no more active 0-strokes. This argument continues for all the k errors, showing that in fact the complete text area that is relevant can be covered with just k strokes. Figure 13 (right) illustrates this idea.

The algorithm of Galil and Park [1990] basically keeps this list of k relevant strokes3 up to date all the time. Each time a new e-stroke is produced, it is compared against the current relevant e-stroke, and if the new one lasts longer in the text than the old one, it replaces the old stroke. Since the algorithm progresses in the text, old strokes are naturally eliminated by this procedure.

A final problem is how to use the indirect information given by the relevant strokes to compute the longest matches

3 Called “reference triples” there.


between pattern and text. What we have is a set of longest matches covering the text area of interest, plus the precomputed longest matches of the pattern against itself (starting at any position). We now know where the dashed line of Figure 11 starts (say it is P_{i1} and T_{d+i1}) and want to compute its length. To know where the longest match between pattern and text ends, we find the relevant stroke where the beginning of the dashed line falls. That stroke represents a maximal match between T_{d+i1..} and some P_{j1..}. As we know by preprocessing the longest match between P_{i1..} and P_{j1..}, we can derive the longest match between P_{i1..} and T_{d+i1..}. There are some extra complications to take care of when both longest matches end at the same position or one has length zero, but all of them can be sorted out in O(k) time per diagonal of the L matrix.

Finally, Galil and Park show that the O(m²) extra space needed to store the matrix of longest matches can be reduced to O(m) by using a suffix tree of the pattern (not the text as in previous work) and LCA algorithms, so we add different entries in Figure 7 (note that Landau and Vishkin [1988] already had O(m) space). Galil and Park also show how to add transpositions to the edit operations at the same complexity. This technique can be extended to all these diagonal transition algorithms. We believe that allowing different integral costs for the operations or forbidding some of them can be achieved with simple modifications of the algorithms.

5.2.7 Ukkonen and Wood (1990). An idea similar to that of using the suffix tree of the pattern (and similarly slow in practice) was independently discovered by Ukkonen and Wood in 1990 [Ukkonen and Wood 1993]. They use a suffix automaton (described in Section 3.2) on the pattern to find the matching statistics, instead of the table. As the algorithm progresses over the text, the suffix automaton keeps count of the pattern substrings that match the text at any moment. Although they report O(m²) space for the suffix automaton, it can take O(m) space.

5.2.8 Chang and Lawler (1994). In 1990, Chang and Lawler [1994] repeated the idea that was briefly mentioned in Galil and Park [1990]: that matching statistics can be computed using the suffix tree of the pattern and LCA algorithms. However, they used a newer and faster LCA algorithm [Schieber and Vishkin 1988], truly O(1), and reported the best time among algorithms with guaranteed O(kn) performance. However, the algorithm is still not competitive in practice.

5.2.9 Cole and Hariharan (1998). In 1998, Cole and Hariharan [1998] presented an algorithm with worst case O(n(1 + k^c/m)), where c = 3 if the pattern is "mostly aperiodic" and c = 4 otherwise.4 The idea is that, unless a pattern has a lot of self-repetition, only a few diagonals of a diagonal transition algorithm need to be computed.

This algorithm can be thought of as a filter (see the following sections) with worst case guarantees, useful for very small k. It resembles some ideas about filters developed in Chang and Lawler [1994]. Probably other filters can be proved to have good worst cases under some periodicity assumptions on the pattern, but this thread has not been explored up to now. This algorithm is an improvement over a previous one [Sahinalp and Vishkin 1997], which is more complex and has a worse complexity, namely O(nk^8 (α log* n)^{1/log 3}). In any case, the interest of this work is theoretical too.

5.3 Improving the Average Case

5.3.1 Ukkonen (1985). The first improvement to the average case is due to Ukkonen in 1985. The algorithm, a short note at the end of Ukkonen [1985b], improved the dynamic programming algorithm to O(kn) average time and O(m) space. This algorithm was later called the "cut-off heuristic." The main idea is that, since a pattern does not normally match in the text, the values at each column

4 The definition of "mostly aperiodic" is rather technical and related to the number of auto-repetitions that occur in the pattern. Most patterns are "mostly aperiodic."


(from top to bottom) quickly reach k + 1 (i.e. mismatch), and that if a cell has a value larger than k + 1, the result of the search does not depend on its exact value. A cell is called active if its value is at most k. The algorithm simply keeps count of the last active cell and avoids working on the rest of the cells.

To keep the last active cell, we must be able to recompute it for each new column. At each new column, the last active cell can be incremented in at most one, so we check if we have activated the next cell at O(1) cost. However, it is also possible that the last active cell now becomes inactive. In this case we have to search upwards for the new last active cell. Although we can work O(m) in a given column, we cannot work more than O(n) overall, because there are at most n increments of this value in the whole process, and hence there are no more than n decrements. Hence, the last active cell is maintained at O(1) amortized cost per column.
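
A sketch of the cut-off heuristic on top of Sellers' column update (our own code; it also caps values at k + 1, which, as recalled in Section 6.2, does not change the output):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the cut-off heuristic: only rows up to the last active cell
   (value <= k) plus one are recomputed; values are capped at k + 1. */
void cutoff_search(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    int *C = malloc((m + 1) * sizeof(int));
    for (int i = 0; i <= m; i++) C[i] = i;
    int last = k < m ? k : m;                 /* last active cell, column 0 */
    for (int j = 1; j <= n; j++) {
        int top = last + 1 < m ? last + 1 : m;
        if (top > last) C[top] = k + 1;       /* row below last was inactive */
        int prev = C[0];                      /* C[0] stays 0                */
        for (int i = 1; i <= top; i++) {
            int old = C[i];
            if (P[i-1] == T[j-1]) C[i] = prev;
            else {
                int min = C[i-1] < old ? C[i-1] : old;
                if (prev < min) min = prev;
                C[i] = min + 1 > k + 1 ? k + 1 : min + 1;  /* cap at k + 1  */
            }
            prev = old;
        }
        while (top > 0 && C[top] > k) top--;  /* search upwards if needed    */
        last = top;
        if (last == m) printf("occurrence ends at %d\n", j);
    }
    free(C);
}
```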

Ukkonen conjectured that this algorithm was O(kn) on average, but this was proven only in 1992 by Chang and Lampe [1992]. The proof was refined in 1996 by Baeza-Yates and Navarro [1999]. The result can probably be extended to more complex distance functions, although with substrings the last active cell must exceed k by enough to ensure that it can never return to a value smaller than k. In particular, it must have the value k + 2 if transpositions are allowed.

5.3.2 Myers (1986). An algorithm in Myers [1986a] is based on diagonal transitions like those in the previous sections, but the strokes are simply computed by brute force. Myers showed that the resulting algorithm was O(kn) on average. This is clear because the length of the strokes is σ/(σ − 1) = O(1) on average. The same algorithm was proposed again in 1989 by Galil and Park [1990]. Since only the k strokes need to be stored, the space is O(k).

5.3.3 Chang and Lampe (1992). In 1992, Chang and Lampe [1992] produced a new algorithm called "column partitioning," based on exploiting a different property of the dynamic programming matrix. They again consider the fact that, along each column, the numbers are normally increasing. They work on "runs" of consecutive increasing cells (a run ends when C_{i+1} ≠ C_i + 1). They manage to work O(1) per run in the column update process.

To update each run in constant time, they precompute loc(j, x) = min{j' ≥ j : P_{j'} = x} for all pattern positions j and all characters x (hence it needs O(mσ) space). At each column of the matrix, they consider the current text character x and the current row j, and know in constant time where the run is going to end (i.e. the next character match). The run can end before this, namely where the parallel run of the previous column ends.
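
The loc table itself is easy to build right to left; a C sketch (0-based indices, 256-entry byte alphabet, and our own naming):

```c
#include <string.h>

/* Sketch: loc[j][c] = smallest j' >= j with P[j'] == c, or m if none.
   Built right to left in O(m * 256) time and space. */
void build_loc(const char *P, int m, int (*loc)[256])
{
    for (int c = 0; c < 256; c++) loc[m][c] = m;      /* sentinel row      */
    for (int j = m - 1; j >= 0; j--) {
        memcpy(loc[j], loc[j + 1], sizeof loc[j]);    /* inherit row j + 1 */
        loc[j][(unsigned char)P[j]] = j;              /* P[j] occurs here  */
    }
}
```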

Based on empirical observations, they conjecture that the average length of the runs is O(√σ). Notice that this matches our result that the average edit distance is m(1 − e/√σ), since this is the number of increments along columns, and therefore there are O(m/√σ) nonincrements (i.e. runs). From there it is clear that each run has average length O(√σ). Therefore, we have just proved Chang and Lampe's conjecture.

Since the paper uses the cut-off heuristic of Ukkonen, their average search time is O(kn/√σ). This is, in practice, the fastest algorithm of this class.

Unlike the other algorithms in this section, it seems difficult to adapt Chang and Lampe [1992] to other distance functions, since their idea relies strongly on the unitary costs. It is mentioned that the algorithm could run in average time O(kn log log(m)/σ) but it would not be practical.

6. ALGORITHMS BASED ON AUTOMATA

This area is also rather old. It is interesting because it gives the best worst-case time algorithm (O(n), which matches the lower bound of the problem). However, there is a time and space exponential dependence on m and k that limits its practicality.


Fig. 14. Taxonomy of algorithms based on deterministic automata. References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: Ukk85b = [Ukkonen 1985b], MP80 = [Masek and Paterson 1980], Mel96 = [Melichar 1996], Kur96 = [Kurtz 1996], Nav97b = [Navarro 1997b], and WMM96 = [Wu et al. 1996].

We first present the basic solution and then discuss the improvements. Figure 14 shows the historical map of this area.

6.1 An Automaton for Approximate Search

An alternative and very useful way to consider the problem is to model the search with a nondeterministic finite automaton (NFA). This automaton (in its deterministic form) was first proposed in Ukkonen [1985b], and first used in nondeterministic form (although implicitly) in Wu and Manber [1992b]. It is shown explicitly in Baeza-Yates [1991], Baeza-Yates [1996], and Baeza-Yates and Navarro [1999].

Consider the NFA for k = 2 errors under the edit distance shown in Figure 15. Every row denotes the number of errors seen (the first row zero, the second row one, etc.). Every column represents matching a pattern prefix. Horizontal arrows represent matching a character (i.e. if the pattern and text characters match, we advance in the pattern and in the text). All the others increment the number of errors (move to the next row): vertical arrows insert a character in the pattern (we advance in the text but not in the pattern), solid diagonal arrows substitute a character (we advance in the text and pattern), and dashed diagonal arrows delete a character of the pattern (they are ε-transitions, since we advance in the pattern without advancing in the text). The initial self-loop allows a match to start anywhere in the text. The automaton signals (the end of) a match whenever a rightmost state is active. If we do not care about the number of errors in the occurrences, we can consider final states those of the last full diagonal.

It is not hard to see that once a state in the automaton is active, all the states of the same column and higher numbered rows are active too. Moreover, at a given text character, if we collect the smallest active rows at each column, we obtain the vertical vector of the dynamic programming algorithm (in this case [0, 1, 2, 3, 3, 3, 2]; compare to Figure 9).

Other types of distances (Hamming, LCS, and Episode) are obtained by deleting some arrows of the automaton. Different integer costs for the operations can also be modeled by changing the arrows. For instance, if insertions cost 2 instead of 1, we make the vertical arrows move from row i to row i + 2. Transpositions are modeled by adding an extra state S_{i,j} between each pair of states


at positions (i, j) and (i + 1, j + 2), and arrows labeled P_{i+2} from state (i, j) to S_{i,j} and P_{i+1} between S_{i,j} and (i + 1, j + 2) [Melichar 1996]. Adapting to general substring substitution needs more complex setups but is always possible.

Fig. 15. An NFA for approximate string matching of the pattern "survey" with two errors. The shaded states are those active after reading the text "surgery".

This automaton can simply be made deterministic to obtain O(n) worst-case search time. However, as we see next, the main problem becomes the construction of the DFA (deterministic finite automaton). An alternative solution is based on simulating the NFA instead of making it deterministic.

6.2 Implementing the Automaton

6.2.1 Ukkonen (1985). In 1985, Ukkonen proposed the idea of a deterministic automaton for this problem [Ukkonen 1985b]. However, an automaton like that of Figure 15 was not explicitly considered. Rather, each possible set of values for the columns of the dynamic programming matrix is a state of the automaton. Once the set of all possible columns and the transitions among them were built, the text was scanned with the resulting automaton, performing exactly one transition per character read.

The big problem with this scheme was that the automaton had a potentially huge number of states, which had to be built and stored. To improve space usage, Ukkonen proved that all the elements in the columns that were larger than k + 1 could be replaced by k + 1 without affecting the output of the search (the lemma was used in the same paper to design the cut-off heuristic described in Section 5.3). This reduced the potential number of different columns. He also showed that adjacent cells in a column differ in at most one. Hence, the column states could be defined as a vector of m incremental values in the set {−1, 0, 1}.

All this made it possible in Ukkonen [1985b] to obtain a nontrivial bound on the number of states of the automaton, namely O(min(3^m, m(2mσ)^k)). This size, although much better than the obvious O((k + 1)^m), is still very large except for short patterns or very low error levels. The resulting space complexity of the algorithm is m times the above value. This exponential space complexity has to be added to the O(n) time complexity, as the preprocessing time to build the automaton.

As a final comment, Ukkonen suggested that the columns could be computed only partially, up to, say, 3k/2 entries. Since he conjectured (and later was proved correct in Chang and Lampe [1992]) that the columns of interest were O(k) on average, this would normally not affect the algorithm, though it will reduce the number of possible states. If at some point the states not computed were really


needed, the algorithm would compute them by dynamic programming.

Fig. 16. On the left, the automaton of Ukkonen [1985b] where each column is a state. On the right, the automaton of Wu et al. [1996] where each region is a state. Both compute the columns of the dynamic programming matrix.

Notice that to incorporate transpositions and substring substitutions into this conception we need to consider that each state is the set of the j last columns of the dynamic programming matrix, where j is the longest left-hand side of a rule. In this case it is better to build the automaton of Figure 15 explicitly and make it deterministic.

6.2.2 Wu, Manber and Myers (1992). It was not until 1992 that Wu et al. looked into this problem again [Wu et al. 1996]. The idea was to trade time for space using a Four Russians technique [Arlazarov et al. 1975]. Since the cells could be expressed using only values in {−1, 0, 1}, the columns were partitioned into blocks of r cells (called "regions") which took 2r bits each. Instead of precomputing the transitions from a whole column to the next, the transitions from a region to the next region in the column were precomputed, although the current region could now depend on three previous regions (see Figure 16). Since the regions were smaller than the columns, much less space was necessary. The total amount of work was O(m/r) per column in the worst case, and O(k/r) on average. The space requirement was exponential in r. By using O(n) extra space, the algorithm was O(kn/log n) on average and O(mn/log n) in the worst case. Notice that this shares the Four Russians approach with Masek and Paterson [1980], but there is an important difference: the states in this case do not depend on the letters of the pattern and text. The states of the "automaton" of Masek and Paterson [1980], on the other hand, depend on the text and pattern.

This Four Russians approach is so flexible that this work was extended to handle regular expressions allowing errors [Wu et al. 1995]. The technique for exact regular expression searching is to pack portions of the deterministic automaton in bits and compute transition tables for each portion. The few transitions among portions are left nondeterministic and simulated one by one. To allow errors, each state is no longer just active or inactive; instead, it keeps count of the minimum number of errors that makes it active, in O(log k) bits.

6.2.3 Melichar (1995). In 1995, Melichar [1996] again studied the size of the deterministic automaton. By considering the properties of the NFA of Figure 15, he refined the bound of Ukkonen [1985b] to O(min(3^m, m(2mt)^k, (k + 2)^{m−k} (k + 1)!)), where t = min(m + 1, σ). The space complexity and preprocessing time of the automaton is t times the number of states. Melichar also conjectured that this automaton is bigger when there are periodicities in the pattern, which matches the results of Cole and Hariharan [1998] (Section 5.2), in the sense that periodic


patterns are more problematic. This is in fact a property shared with other problems in string matching.

6.2.4 Kurtz (1996). In 1996, Kurtz [1996] proposed another way to reduce the space requirements to at most O(mn). It is an adaptation of Baeza-Yates and Gonnet [1994], who first proposed it for the Hamming distance. The idea was to build the automaton in lazy form, i.e. build only the states and transitions actually reached in the processing of the text. The automaton starts as just one initial state, and the states and transitions are built as needed. By doing this, all those transitions that Ukkonen [1985b] considered and that were not necessary were in fact never built, without the need to guess. The price was the extra overhead of a lazy construction versus a direct construction, but the idea pays off. Kurtz also proposed building only the initial part of the automaton (which should contain the most commonly traversed states) to save space.

Navarro [1997b; 1998] studied the growth of the complete and lazy automata as a function of m, k, and n (this last value for the lazy automaton only). The empirical results show that the lazy automaton grows with the text at a rate of O(n^β), for 0 < β < 1, depending on σ, m, and k. Some replacement policies designed to work with bounded memory are proposed in Navarro [1998].

7. BIT-PARALLELISM

These algorithms are based on exploiting the parallelism of the computer when it works on bits. This is also a new (after 1990) and very active area. The basic idea is to "parallelize" another algorithm using bits. The results are interesting from the practical point of view, and are especially significant when short patterns are involved (typical in text retrieval). They may work effectively for any error level.

In this section we find elements which could strictly belong to other sections, since we parallelize other algorithms. There are two main trends: parallelize the work of the nondeterministic automaton that solves the problem (Figure 15), or parallelize the work of the dynamic programming matrix.

We first explain the technique and then the results achieved by using it. Figure 17 shows the historical development of this area.

7.1 The Technique of Bit-Parallelism

This technique, in common use in string matching [Baeza-Yates 1991; 1992], was introduced in the Ph.D. thesis of Baeza-Yates [1989]. It consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By using this fact cleverly, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in a computer word. Since in current architectures w is 32 or 64, the speedup is very significant in practice and improves with technological progress. In order to relate the behavior of bit-parallel algorithms to other work, it is normally assumed that w = Θ(log n), as dictated by the RAM model of computation. We, however, prefer to keep w as an independent value. We now introduce some notation we use for bit-parallel algorithms.

—The length of a computer word (in bits) is w.

—We denote as b_ℓ..b_1 the bits of a mask of length ℓ. This mask is stored somewhere inside the computer word. Since the length w of the computer word is fixed, we are hiding the details on where we store the ℓ bits inside it.

—We use exponentiation to denote bit repetition (e.g. 0^3 1 = 0001).

—We use C-like syntax for operations on the bits of computer words: "|" is the bitwise-or, "&" is the bitwise-and, "^" is the bitwise-xor, and "~" complements all the bits. The shift-left operation, "<<," moves the bits to the left and enters zeros from the right, i.e. b_m b_{m−1}..b_2 b_1 << r = b_{m−r}..b_2 b_1 0^r. The shift-right ">>" moves the bits in the other direction. Finally, we can perform arithmetic operations on the bits, such as addition and subtraction, which operate the bits as if they formed a number. For instance, b_ℓ..b_x 10000 − 1 = b_ℓ..b_x 01111.

Fig. 17. Taxonomy of bit-parallel algorithms. References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: BY89 = [Baeza-Yates 1989], WM92b = [Wu and Manber 1992b], Wri94 = [Wright 1994], BYN99 = [Baeza-Yates and Navarro 1999], and Mye99 = [Myers 1999].

We now explain the first bit-parallel algorithm, Shift-Or [Baeza-Yates and Gonnet 1992], since it is the basis of much of what follows. The algorithm searches a pattern in a text (without errors) by parallelizing the operation of a nondeterministic finite automaton that looks for the pattern. Figure 18 illustrates this automaton.

This automaton has m + 1 states, and can be simulated in its nondeterministic form in O(mn) time. The Shift-Or algorithm achieves O(mn/w) worst-case time (i.e. optimal speedup). Notice that if we convert the nondeterministic automaton to a deterministic one with O(n) search time, we get an improved version of the KMP algorithm [Knuth et al. 1977]. However, KMP is twice as slow for m ≤ w.

The algorithm first builds a table B which for each character c stores a bit mask B[c] = b_m..b_1. The mask B[c] has the bit b_i set if and only if P_i = c. The state of the search is kept in a machine word D = d_m..d_1, where d_i is 1 whenever P_{1..i} matches the end of the text read up to now (i.e. the state numbered i in Figure 18 is active). Therefore, a match is reported whenever d_m = 1.

D is set to 0^m originally, and for each new text character T_j, D is updated using the formula5

D' ← ((D << 1) | 0^{m−1}1) & B[T_j]

5 The real algorithm uses the bits with the inverse meaning and therefore the operation "| 0^{m−1}1" is not necessary. We preferred to explain this more didactic version.


Fig. 18. Nondeterministic automaton that searches "survey" exactly.

The formula is correct because the ith bit is set if and only if the (i − 1)th bit was set for the previous text character and the new text character matches the pattern at position i. In other words, T_{j−i+1..j} = P_{1..i} if and only if T_{j−i+1..j−1} = P_{1..i−1} and T_j = P_i. It is possible to relate this formula to the movement that occurs in the nondeterministic automaton for each new text character: each state gets the value of the previous state, but this happens only if the text character matches the corresponding arrow.

For patterns longer than the computer word (i.e. m > w), the algorithm uses ⌈m/w⌉ computer words for the simulation (not all of them are active all the time). The algorithm is O(n) on average.
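
For concreteness, a C sketch of the didactic version just described, for m ≤ w = 64 (our code; the production version inverts the bits, as explained in the footnote):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch of the didactic Shift-Or version above (m <= 64): bit i of D is 1
   iff P_{1..i} matches the text read so far; a match ends when d_m = 1. */
void shift_or_search(const char *P, const char *T)
{
    int m = strlen(P), n = strlen(T);
    uint64_t B[256] = {0}, D = 0;              /* D = 0^m: nothing matches yet */
    for (int i = 0; i < m; i++)
        B[(unsigned char)P[i]] |= 1ULL << i;   /* bit i of B[c] set iff P_i = c */
    for (int j = 0; j < n; j++) {
        D = ((D << 1) | 1) & B[(unsigned char)T[j]];
        if (D & (1ULL << (m - 1)))
            printf("match ends at %d\n", j + 1);
    }
}
```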

It is easy to extend Shift-Or to handle classes of characters. In this extension, each position in the pattern matches a set of characters rather than a single character. The classical string matching algorithms are not extended so easily. In Shift-Or, it is enough to set the ith bit of B[c] for every c ∈ P_i (P_i is now a set). For instance, to search for "survey" in case-insensitive form, we just set to 1 the first bit of B["s"] and B["S"], and the same with the rest. Shift-Or can also search for multiple patterns (where the complexity is O(mn/w) if we consider that m is the total length of all the patterns); it was later enhanced [Wu and Manber 1992b] to support a larger set of extended patterns and even regular expressions.

Many online text algorithms can be seen as implementations of an automaton (classically, in its deterministic form). Bit-parallelism has since its invention become a general way to simulate simple nondeterministic automata instead of converting them to deterministic form. It has the advantage of being much simpler, in many cases faster (since it makes better usage of the registers of the computer word), and easier to extend to handle complex patterns than its classical counterparts. Its main disadvantage is the limitation it imposes on the size of the computer word. In many cases its adaptations to cope with longer patterns are not very efficient.

7.2 Parallelizing Nondeterministic Automata

7.2.1 Wu and Manber (1992). In 1992, Wu and Manber [1992b] published a number of ideas that had a great impact on the future of practical text searching. They first extended the Shift-Or algorithm to handle wild cards (i.e. allow an arbitrary number of characters between two given positions in the pattern), and regular expressions (the most flexible pattern that can be searched efficiently). Of more interest to us is that they presented a simple scheme to combine any of the preceding extensions with approximate string matching.

The idea is to simulate, using bit-parallelism, the NFA of Figure 15, so that each row i of the automaton fits in a computer word R_i (each state is represented by a bit). For each new text character, all the transitions of the automaton are simulated using bit operations among the k + 1 computer words. Notice that all the k + 1 computer words have the same structure (i.e. the same bit is aligned on the same text position). The update formula to obtain the new R'_i values at text position j from the current R_i values is

R'_0 = ((R_0 << 1) | 0^{m−1}1) & B[T_j]
R'_{i+1} = ((R_{i+1} << 1) & B[T_j]) | R_i | (R_i << 1) | (R'_i << 1)

and we start the search with R_i = 0^{m−i}1^i. As expected, R_0 undergoes a simple


Shift-Or process, while the other rows receive ones (i.e. active states) from previous rows as well. In the formula for R'_{i+1} appear, in that order, the horizontal, vertical, diagonal, and dashed diagonal arrows.

The cost of this simulation is O(k⌈m/w⌉n) in the worst and average case, which is O(kn) for patterns typical in text searching (i.e. m ≤ w). This is a perfect speedup over the serial simulation of the automaton, which would cost O(mkn) time. Notice that for short patterns, this is competitive with the best worst-case algorithms.
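
A C sketch of this simulation for m ≤ w = 64, following the update formula above (our code):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch of the Wu-Manber row-wise simulation (m <= 64): R[i] holds row i
   of the NFA of Figure 15; an occurrence ends when bit m of row k is set. */
void wu_manber_search(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    uint64_t B[256] = {0}, R[k + 1];
    for (int p = 0; p < m; p++) B[(unsigned char)P[p]] |= 1ULL << p;
    for (int i = 0; i <= k; i++) R[i] = (1ULL << i) - 1;  /* R_i = 0^{m-i} 1^i */
    for (int j = 0; j < n; j++) {
        uint64_t b = B[(unsigned char)T[j]], old = R[0];
        R[0] = ((R[0] << 1) | 1) & b;            /* exact matching (Shift-Or) */
        for (int i = 1; i <= k; i++) {
            uint64_t nextR = ((R[i] << 1) & b)   /* horizontal: match         */
                           | old                 /* vertical: insertion       */
                           | (old << 1)          /* diagonal: substitution    */
                           | (R[i-1] << 1);      /* dashed diagonal: deletion */
            old = R[i];
            R[i] = nextR;
        }
        if (R[k] & (1ULL << (m - 1)))
            printf("occurrence ends at %d\n", j + 1);
    }
}
```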

Thanks to the simplicity of the construction, the automaton used at each row can be replaced by a different one. As long as one is able to solve a problem for exact string matching, one can make k + 1 copies of the resulting computer word, perform the same operations in the k + 1 words (plus the arrows that connect the words), and obtain an algorithm to find the same pattern allowing errors. Hence, with this algorithm one is able to perform approximate string matching with sets of characters, wild cards, and regular expressions. The algorithm also allows some extensions unique to approximate searching: a part of the pattern can be searched with errors while another may be forced to match exactly, and different integer costs of the edit operations can be accommodated (including not allowing some of them). Finally, one is able to search a set of patterns at the same time, but this capability is very limited (since all the patterns must fit in a computer word).

The great flexibility obtained encouraged the authors to build software called Agrep [Wu and Manber 1992a],6 where all these capabilities are implemented (although some particular cases are solved in a different manner). This software has been taken as a reference in all the subsequent research.

7.2.2 Baeza-Yates and Navarro (1996). In 1996, Baeza-Yates and Navarro presented a new bit-parallel algorithm able to parallelize the computation of the

6 Available at ftp.cs.arizona.edu.

automaton even more [Baeza-Yates and Navarro 1999]. The classical dynamic programming algorithm can be thought of as a column-wise "parallelization" of the automaton [Baeza-Yates 1996]; Wu and Manber [1992b] proposed a row-wise parallelization. Neither algorithm was able to increase the parallelism (even if all the NFA states fit in a computer word) because of the ε-transitions of the automaton, which caused what we call zero-time dependencies. That is, the current values of two rows or two columns depend on each other, and hence cannot be computed in parallel.

In Baeza-Yates and Navarro [1999] the bit-parallel formula for a diagonal parallelization was found. They packed the states of the automaton along diagonals instead of rows or columns, which run in the same direction as the diagonal arrows (notice that this is totally different from the diagonals of the dynamic programming matrix). This idea had been mentioned much earlier by Baeza-Yates [1991], but no bit-parallel formula was found. There are m − k + 1 complete diagonals (the others are not really necessary), which are numbered from 0 to m − k. The number D_i is the row of the first active state in diagonal i (all the subsequent states in the diagonal are active because of the ε-transitions). The new D'_i values after reading text position j are computed as

D'_i = min(D_i + 1, D_{i+1} + 1, g(D_{i−1}, T_j))

where the first term represents the substitutions, the second term the insertions, and the last term the matches (deletions are implicit since we represent only the lowest-row active state of each diagonal). The main problem is how to compute the function g, defined as

g(D_i, T_j) = min({k + 1} ∪ {r : r ≥ D_i ∧ P_{i+r} = T_j})

Notice that an active state that crosses a horizontal edge has to propagate all the way down the diagonal. This was finally solved in 1996 [Baeza-Yates and Navarro 1999; Navarro 1998] by representing


the D_i values in unary form and using arithmetic operations on the bits which have the desired propagation effects. The formula can be understood either numerically (operating on the D_i's) or logically (simulating the arrows of the automaton).

The resulting algorithm is O(n) worst-case time and very fast in practice if all the bits of the automaton fit in the computer word (while Wu and Manber [1992b] keeps O(kn)). In general, it is O(⌈k(m − k)/w⌉n) worst-case time, and O(⌈k²/w⌉n) on average since the Ukkonen cut-off heuristic is used (see Section 5.3). The scheme can handle classes of characters, wild cards, and different integral costs in the edit operations.

7.3 Parallelizing the Dynamic Programming Matrix

7.3.1 Wright (1994). In 1994, Wright [1994] presented the first work using bit-parallelism on the dynamic programming matrix. The idea was to consider secondary diagonals (i.e. those that run from the upper-right to the bottom-left) of the matrix. The main observation is that the elements of the matrix follow the recurrence7

C_{i,j} = C_{i−1,j−1}        if P_i = T_j
                             or C_{i−1,j} = C_{i−1,j−1} − 1
                             or C_{i,j−1} = C_{i−1,j−1} − 1
C_{i,j} = C_{i−1,j−1} + 1    otherwise

which shows that the new secondary diagonal can be computed using the two previous ones. The algorithm stores the differences between C_{i,j} and C_{i−1,j−1} and represents the recurrence using modulo 4 arithmetic. The algorithm packs many pattern and text characters in a computer word and performs in parallel a number of pattern versus text comparisons, then uses the vector of comparison results to update many cells of the diagonal in parallel. Since it has to store characters of the alphabet in the bits,

⁷ The original one in Wright [1994] has errors.

the algorithm is O(nm log(σ)/w) in the worst and average case. This was competitive at that time for very small alphabets (e.g., DNA). As the author recognizes, it seems quite difficult to adapt this algorithm to other distance functions.

7.3.2 Myers (1998). In 1998, Myers [1999] found a better way to parallelize the computation of the dynamic programming matrix. He represented the differences along columns instead of the columns themselves, so that two bits per cell were enough (in fact, this algorithm can be seen as the bit-parallel implementation of the automaton that is made deterministic in Wu et al. [1996]; see Section 6.2). A new recurrence is found where the cells of the dynamic programming matrix are expressed using horizontal and vertical differences, i.e., Δv_{i,j} = C_{i,j} − C_{i−1,j} and Δh_{i,j} = C_{i,j} − C_{i,j−1}:

$$\Delta v_{i,j} = \min(-Eq_{i,j},\; \Delta v_{i,j-1},\; \Delta h_{i-1,j}) + (1 - \Delta h_{i-1,j})$$

$$\Delta h_{i,j} = \min(-Eq_{i,j},\; \Delta v_{i,j-1},\; \Delta h_{i-1,j}) + (1 - \Delta v_{i,j-1})$$

where Eq_{i,j} is 1 if P_i = T_j and zero otherwise. The idea is to keep packed binary vectors representing the current (i.e., jth) values of the differences, and to find a way to update the vectors in a single operation. Each cell C_{i,j} is seen as a small processor that receives inputs Δv_{i,j−1}, Δh_{i−1,j}, and Eq_{i,j} and produces outputs Δv_{i,j} and Δh_{i,j}. There are 3 × 3 × 2 = 18 possible inputs, and a simple formula is found to express the cell logic (unlike Wright [1994], the approach is logical rather than arithmetical). The hard part is to parallelize the work along the column because of the zero-time dependency problem. The author finds a solution which, despite the fact that a very different model is used, resembles that of Baeza-Yates and Navarro [1999].
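
The whole column can then be advanced with a constant number of word operations, as in the following sketch (our code, in the form popularized by later expositions of Myers' scheme, for m ≤ w = 64; variable names are ours): Pv and Mv pack the +1 and −1 vertical differences of the current column, and score maintains C_{m,j}.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit-parallel update of the DP column differences, reporting every
   text position where P matches with at most k errors (m <= 64). */
void bpm_search(const char *P, const char *T, int k) {
    int m = strlen(P), score = m;            /* score = C[m][j]          */
    uint64_t Peq[256] = {0};
    for (int i = 0; i < m; i++) Peq[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t Pv = ~0ULL, Mv = 0, msb = 1ULL << (m - 1);
    for (long j = 0; T[j]; j++) {
        uint64_t Eq = Peq[(unsigned char)T[j]];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);       /* +1 horizontal deltas     */
        uint64_t Mh = Pv & Xh;               /* -1 horizontal deltas     */
        if (Ph & msb) score++;
        else if (Mh & msb) score--;
        Ph <<= 1;                            /* row 0 stays 0 in search  */
        Mh <<= 1;
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
        if (score <= k) printf("occurrence ending at %ld\n", j);
    }
}
```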

The result is an algorithm that uses the bits of the computer word better, with a worst case of O(⌈m/w⌉ n) and an average case of O(⌈k/w⌉ n) since it uses the Ukkonen cut-off (Section 5.3). The update formula is a little more complex than that of Baeza-Yates and Navarro [1999], and hence the algorithm is a bit slower, but it adapts better to longer patterns because fewer computer words are needed.

As it is difficult to surpass O(kn) algorithms, this algorithm may be the last word with respect to the asymptotic efficiency of parallelization, except for the possibility of parallelizing an O(kn) worst case algorithm. As is now common to expect of bit-parallel algorithms, this scheme is able to search some extended patterns as well, but it seems difficult to adapt it to other distance functions.

8. FILTERING ALGORITHMS

Our last category is quite new, starting in 1990, and still very active. It is formed by algorithms that filter the text, quickly discarding text areas that do not match. Filtering algorithms address only the average case, and their major interest is the potential for algorithms that do not inspect all text characters. The major theoretical achievement is an algorithm with average cost O(n(k + log_σ m)/m), which was proven optimal. In practice, filtering algorithms are the fastest too. All of them, however, are limited in their applicability by the error level α. Moreover, they need a nonfilter algorithm to check the potential matches.

We first explain the general concept and then consider the developments that have occurred in this area. See Figure 19.

8.1 The Concept of Filtering

Filtering is based on the fact that it may be much easier to tell that a text position does not match than to tell that it matches. For instance, if neither "sur" nor "vey" appear in a text area, then "survey" cannot be found there with one error under the edit distance. This is because a single edit operation cannot alter both halves of the pattern.

Most filtering algorithms take advantage of this fact by searching pieces of the pattern without errors. Since the exact searching algorithms can be much faster than approximate searching ones, filtering algorithms can be very competitive

(in fact, they dominate in a large range of parameters).

It is important to notice that a filtering algorithm is normally unable to discover the matching text positions by itself. Rather, it is used to discard (hopefully large) areas of the text that cannot contain a match. For instance, in our example, it is necessary that either "sur" or "vey" appear in an approximate occurrence, but it is not sufficient. Any filtering algorithm must be coupled with a process that verifies all those text positions that could not be discarded by the filter.

Virtually any nonfiltering algorithm can be used for this verification, and in many cases the developers of a filtering algorithm do not care to look for the best verification algorithm, but just use the dynamic programming algorithm. The selection is normally independent, but the verification algorithm must behave well on short texts because it can be started at many different text positions to work on small text areas. By careful programming it is almost always possible to keep the worst-case behavior of the verifying algorithm (i.e., avoid verifying overlapping areas).
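
For concreteness, a window verifier based on the classical dynamic programming algorithm could look as follows (a sketch with our own naming; it is not the verification code used in the experiments of Section 9):

```c
#include <stdio.h>
#include <string.h>

/* Sellers-style verification of a small text area T[from..to):
   reports every position where an occurrence of P with at most
   k errors ends.  Assumes m < 256. */
void verify(const char *T, long from, long to, const char *P, int k) {
    int m = strlen(P), C[256];
    for (int i = 0; i <= m; i++) C[i] = i;      /* column at window start */
    for (long j = from; j < to; j++) {
        int diag = 0, tmp;                      /* C[0] stays 0: a match
                                                   may start anywhere     */
        for (int i = 1; i <= m; i++) {
            tmp = C[i];
            int v = diag + (P[i - 1] != T[j]);  /* match / substitution   */
            if (C[i] + 1 < v) v = C[i] + 1;         /* skip text char     */
            if (C[i - 1] + 1 < v) v = C[i - 1] + 1; /* skip pattern char  */
            C[i] = v;
            diag = tmp;
        }
        if (C[m] <= k) printf("occurrence ending at %ld\n", j);
    }
}
```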

Finally, the performance of filtering algorithms is very sensitive to the error level α. Most filters work very well on low error levels and very badly otherwise. This is related to the amount of text that the filter is able to discard. When evaluating filtering algorithms, it is important to consider not only their time efficiency but also their tolerance for errors. One possible measure for this filtration efficiency is the total number of matches found divided by the total number of potential matches pointed out by the filtration algorithm [Sutinen 1998].

A term normally used when referring to filters is "sublinearity." It is said that a filter is sublinear when it does not inspect all the characters in the text (like the Boyer–Moore algorithms [Boyer and Moore 1977] for exact searching, which can be at best O(n/m)). However, no online algorithm can be truly sublinear, i.e., o(n), if m is independent of n. This is only achievable with indexing algorithms.


Fig. 19. Taxonomy of filtering algorithms. Complexities are all on average. References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: TU93 = [Tarhio and Ukkonen 1993], JTU96 = [Jokinen et al. 1996], Nav97a = [Navarro 1997a], CL94 = [Chang and Lawler 1994], Ukk92 = [Ukkonen 1992], BYN99 = [Baeza-Yates and Navarro 1999], WM92b = [Wu and Manber 1992b], BYP96 = [Baeza-Yates and Perleberg 1996], Shi96 = [Shi 1996], NBY99c = [Navarro and Baeza-Yates 1999c], Tak94 = [Takaoka 1994], CM94 = [Chang and Marr 1994], NBY98a = [Navarro and Baeza-Yates 1998a], NR00 = [Navarro and Raffinot 2000], ST95 = [Sutinen and Tarhio 1995], and GKHO97 = [Giegerich et al. 1997].

We divide this area into two parts: moderate and very long patterns. The algorithms for the two areas are normally different, since more complex filters are only worthwhile for longer patterns.

8.2 Moderate Patterns

8.2.1 Tarhio and Ukkonen (1990). Tarhio and Ukkonen [1993]⁸ launched this area in 1990, publishing an algorithm that used Boyer–Moore–Horspool techniques [Boyer and Moore 1977; Horspool 1980] to filter the text.

⁸ See also Jokinen et al. [1996], which has a correction to the algorithm.

The idea is to align the pattern with a text window and scan the text backwards. The scanning ends where more than k "bad" text characters are found. A "bad" character is one that not only does not match the pattern position it is aligned with, but also does not match any pattern character at a distance of k characters or less. More formally, assume that the window starts at text position j + 1, and therefore T_{j+i} is aligned with P_i. Then T_{j+i} is bad when Bad(i, T_{j+i}), where Bad(i, c) has been precomputed as c ∉ {P_{i−k}, P_{i−k+1}, ..., P_i, ..., P_{i+k}}.

The idea of the bad characters is that we know for sure that we have to pay an error to match them, i.e., they will not match as a byproduct of inserting or deleting other characters. When more than k characters that are errors for sure are found, the current text window can be abandoned and shifted forward. If, on the other hand, the beginning of the window is reached, the area T_{j+1−k..j+m} must be checked with a classical algorithm.

To know how much we can shift the window, the authors show that there is no point in shifting P to a new position j′ where none of the k + 1 text characters that are at the end of the current window (T_{j+m−k}, ..., T_{j+m}) match the corresponding character of P, i.e., where T_{j+m−r} ≠ P_{m−r−(j′−j)}. If those differences are fixed with substitutions, we make k + 1 errors, and if they can be fixed with fewer than k + 1 operations, then it is because we aligned some of the involved pattern and text characters using insertions and deletions. In this case, we would have obtained the same effect by aligning the matching characters from the start.

So for each pattern position i ∈ {m − k..m} and each text character a that could be aligned to position i (i.e., for all a ∈ Σ), the shift to align a in the pattern is precomputed, i.e., Shift(i, a) = min{s > 0 : P_{i−s} = a} (or m if no such s exists). Later, the shift for the window is computed as min_{i ∈ m−k..m} Shift(i, T_{j+i}). This last minimum is computed together with the backward window traversal.
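
The two tables can be precomputed as in the following sketch (hypothetical code following the definitions above, with 0-based indices and a byte alphabet; the algorithm only needs Shift for the last k + 1 pattern positions, but we fill the whole table for simplicity):

```c
#define SIGMA 256

/* Bad[i][c] = 1 iff c matches no pattern character at distance <= k
   from position i; Shift[i][c] = min { s > 0 : P[i-s] == c }, or m. */
void tu_precompute(const char *P, int m, int k,
                   char Bad[][SIGMA], int Shift[][SIGMA]) {
    for (int i = 0; i < m; i++)
        for (int c = 0; c < SIGMA; c++) {
            Bad[i][c] = 1;
            for (int d = -k; d <= k; d++)
                if (i + d >= 0 && i + d < m && (unsigned char)P[i + d] == c)
                    Bad[i][c] = 0;
            Shift[i][c] = m;
            for (int s = 1; s <= i; s++)
                if ((unsigned char)P[i - s] == c) { Shift[i][c] = s; break; }
        }
}
```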

The analysis in Tarhio and Ukkonen [1993] shows that the search time is O(kn(k/σ + 1/(m − k))), without considering verification. In Appendix A.1 we show that the amount of verification is negligible for α < e^{−(2k+1)/σ}. The analysis is valid for m ≫ σ > k, so we can simplify the search time to O(k²n/σ). The algorithm is competitive in practice for low error levels. Interestingly, the version k = 0 corresponds exactly to the Horspool algorithm [Horspool 1980]. Like Horspool, it does not take proper advantage of very long patterns. The algorithm can probably be adapted to other simple distance functions if we define k as the minimum number of errors needed to reject a string.

8.2.2 Jokinen, Tarhio, and Ukkonen (1991). In 1991, Jokinen, Tarhio, and Ukkonen [Jokinen et al. 1996] adapted a previous filter for the k-mismatches problem [Grossi and Luccio 1989]. The filter is based on the simple fact that inside any match with at most k errors there must be at least m − k letters belonging to the pattern. The filter does not care about the order of those letters. This is a simple version of Chang and Lawler [1994] (see Section 8.3), with less filtering efficiency but a simpler implementation.

The search algorithm slides a window of length m over the text⁹ and keeps count of the number of window characters that belong to the pattern. This is easily done with a table that, for each character a, stores a counter of the a's in the pattern that have not yet been seen in the text window. The counter is incremented when an a enters the window and decremented when it leaves the window. Each time a positive counter is decremented, the window character is considered as belonging to the pattern. When there are m − k such characters, the area is verified with a classical algorithm.
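
A sketch of this sliding-window count (hypothetical code following the description above; the verification call is left as a comment):

```c
#include <stdio.h>
#include <string.h>

/* Counting filter: flag every window of length m holding at least
   m - k characters that "belong" to the pattern. */
void counting_filter(const char *P, const char *T, int k) {
    int m = strlen(P), count = 0, cnt[256] = {0};
    long n = strlen(T);
    for (int i = 0; i < m; i++) cnt[(unsigned char)P[i]]++;
    for (long j = 0; j < n; j++) {
        if (cnt[(unsigned char)T[j]]-- > 0) count++;     /* T[j] enters  */
        if (j >= m && ++cnt[(unsigned char)T[j - m]] > 0)
            count--;                                     /* T[j-m] leaves */
        if (count >= m - k) {
            /* candidate: verify the area around T[j-m+1 .. j]
               with a classical algorithm */
            printf("candidate window ending at %ld\n", j);
        }
    }
}
```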

The algorithm was analyzed by Navarro [1997a] using a model of urns and balls. He shows that the algorithm is O(n) time for α < e^{−m/σ}. Some possible extensions are studied in Navarro [1998].

The resulting algorithm is competitive in practice for short patterns, but it worsens for long ones. It is simple to adapt to other distance functions by just determining how many characters must match in an approximate occurrence.

8.2.3 Wu and Manber (1992). In 1992, a very simple filter was proposed by Wu and Manber [1992b] (among many other ideas in that work).

⁹ The original version used a variable size window. This simplification is from Navarro [1997a].


The basic idea is in fact very old [Rivest 1976]: if a pattern is cut in k + 1 pieces, then at least one of the pieces must appear unchanged in an approximate occurrence. This is evident, since k errors cannot alter all the k + 1 pieces. The proposal was then to split the pattern in k + 1 approximately equal pieces, search the pieces in the text, and check the neighborhood of their matches (of length m + 2k). They used an extension of Shift-Or [Baeza-Yates and Gonnet 1992] to search all the pieces simultaneously in O(mn/w) time. In the same year, 1992, Baeza-Yates and Perleberg [1996] suggested better algorithms for the multipattern search: an Aho–Corasick machine [Aho and Corasick 1975] to guarantee O(n) search time (excluding verifications), or Commentz-Walter [1979].
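
A bare-bones version of this filter can be sketched as follows (hypothetical code: strstr stands in for the multipattern search, which the proposals above implement with Shift-Or, Aho–Corasick, or Commentz-Walter; verify can be any nonfilter checker, e.g., the dynamic programming sketch of Section 8.1; this naive version may re-verify overlapping areas):

```c
#include <string.h>

void verify(const char *T, long from, long to, const char *P, int k);

/* Split P into k+1 nearly equal pieces, search each piece exactly,
   and verify a text window large enough to contain any occurrence of
   P (with <= k errors) around the piece hit.  Assumes k < m < 64. */
void partition_filter(const char *P, const char *T, int k) {
    long m = strlen(P), n = strlen(T);
    char piece[64];
    for (int i = 0; i <= k; i++) {
        long from = i * m / (k + 1), len = (i + 1) * m / (k + 1) - from;
        memcpy(piece, P + from, len);
        piece[len] = 0;
        for (const char *h = strstr(T, piece); h; h = strstr(h + 1, piece)) {
            long lo = (h - T) - from - k;        /* window around the hit */
            long hi = (h - T) - from + m + 2 * k;
            if (lo < 0) lo = 0;
            if (hi > n) hi = n;
            verify(T, lo, hi, P, k);
        }
    }
}
```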

Only in 1996 was the improvement really implemented [Baeza-Yates and Navarro 1999], by adapting the Boyer–Moore–Sunday algorithm [Sunday 1990] to multipattern search (using a trie of patterns and a pessimistic shift table). The resulting algorithm is surprisingly fast in practice for low error levels.

There is no closed expression for the average case cost of this algorithm [Baeza-Yates and Regnier 1990], but we show in Appendix A.2 that a gross approximation is O(kn log_σ(m)/σ). Two independent proofs in Baeza-Yates and Navarro [1999] and Baeza-Yates and Perleberg [1996] show that the cost of the search dominates for α < 1/(3 log_σ m). A simple way to see this is to consider that checking a text area costs O(m²) and is done whenever any of the k + 1 pieces of length m/(k + 1) matches, which happens with probability near k/σ^{1/α}. The result follows from requiring the average verification cost to be O(1).

This filter can be adapted, with some care, to other distance functions. The main issue is to determine how many pieces an edit operation can destroy and how many edit operations can be made before surpassing the error threshold. For example, a transposition can destroy two pieces in one operation, so we would need to split the pattern in 2k + 1 pieces to ensure that one is unaltered. A more clever solution for this case is to leave a hole of one character between each pair of pieces, so that the transposition cannot alter both.

8.2.4 Baeza-Yates and Navarro (1996). The bit-parallel algorithms presented in Section 7 [Baeza-Yates and Navarro 1999] were also the basis for novel filtering techniques. As the basic algorithm is limited to short patterns, the algorithms split longer patterns in j parts, making them short enough to be searchable with the basic bit-parallel automaton (using one computer word).

The method is based on a more general version of the partition into k + 1 pieces [Myers 1994a; Baeza-Yates and Navarro 1999]. For any j, if we cut the pattern in j pieces, then at least one of them appears with ⌊k/j⌋ errors in any occurrence of the pattern. This is clear, since if each piece needs more than k/j errors to match, then the complete match needs more than k errors.

Hence, the pattern was split in j pieces (of length m/j), which were searched with ⌊k/j⌋ errors using the basic algorithm. Each time a piece was found, the neighborhood was verified to check for the complete pattern. Notice that the error level α for the pieces is kept unchanged.

The resulting algorithm is $O(\sqrt{mk/w}\,n)$ on average. Its maximum α value is $1 - e\,m^{O(1/\sqrt{w})}/\sqrt{\sigma}$, smaller than $1 - e/\sqrt{\sigma}$ and worsening as m grows. This may be surprising, since the error level α is the same for the subproblems. The reason is that the verification cost keeps O(m²), but the matching probability is O(γ^{m/j}), larger than O(γ^m) (see Section 4).

In 1997, the technique was enriched with "superimposition" [Baeza-Yates and Navarro 1999]. The idea is to avoid performing one separate search for each piece of the pattern. A multipattern approximate search is designed using the ability of bit-parallelism to search for classes of characters. Assume that we want to search "survey" and "secret." We search the pattern "s[ue][rc][vr]e[yt]," where [ab] means {a, b}. In the NFA of Figure 15, the horizontal arrows are traversable by more than one letter.


Fig. 20. The hierarchical verification method for a pattern split in four parts. The boxes (leaves) are the elements which are really searched, and the root represents the whole pattern. At least one pattern at each level must match in any occurrence of the complete pattern. If the bold box is found, all the bold lines may be verified.

Clearly, any match of each of the two patterns is also a match of the superimposed pattern, but not vice versa (e.g., "servet" matches with zero errors). So the filter is weakened, but the search is made faster. Superimposition allowed lowering the average search time to O(n) for $\alpha < 1 - e\,m^{O(1/\sqrt{w})}\sqrt{m/(\sigma w)}$, and to $O(\sqrt{mk/(\sigma w)}\,n)$ for the maximum α of the 1996 version. By using a j value smaller than the one necessary to put the automata in single machine words, an intermediate scheme was obtained that softly adapted to higher error levels. The algorithm was O(kn log(m)/w) for $\alpha < 1 - e/\sqrt{\sigma}$.

8.2.5 Navarro and Baeza-Yates (1998). The final twist in the previous scheme was the introduction of "hierarchical verification" in 1998 [Navarro and Baeza-Yates 1998a]. For simplicity assume that the pattern is partitioned in j = 2^r pieces, although the technique is general. The pattern is split in two halves, each one to be searched with ⌊k/2⌋ errors. Each half is recursively split in two and so on, until the pattern is short enough to make its NFA fit in a computer word (see Figure 20). The leaves of this tree are the pieces actually searched. When a leaf finds a match, instead of checking the whole pattern as in the previous technique, its parent is checked (in a small area around the piece that matched). If the parent is not found, the verification stops; otherwise, it continues with the grandparent until the root (i.e., the whole pattern) is found. This is correct because the partitioning scheme applies to each level of the tree: the grandparent

cannot appear if none of its children appear, even if a grandchild appeared.

Figure 20 shows an example. If one searches the pattern "aaabbbcccddd" with four errors in the text "xxxbbxxxxxxx," and splits the pattern in four pieces to be searched with one error, the piece "bbb" will be found in the text. In the original approach, one would verify the complete pattern in the text area, while with the new approach one verifies only its parent "aaabbb" and immediately determines that there cannot be a complete match.
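
In code, the bottom-up checking can be sketched as follows (hypothetical types and helper: occurs() answers whether a pattern piece appears in a text area with a given number of errors, using any nonfilter algorithm such as the DP sketch of Section 8.1; a node at depth d of the tree carries error budget ⌊k/2^d⌋, with k at the root):

```c
/* Assumed helper: does P[from..to) appear in T[lo..hi) with <= err
   errors? */
int occurs(const char *P, int from, int to, int err,
           const char *T, long lo, long hi);

typedef struct Node {              /* one node of the partition tree  */
    struct Node *parent;
    int from, to;                  /* this node covers P[from..to)    */
    int err;                       /* its error budget                */
} Node;

/* Called when the leaf piece P[leaf->from..leaf->to) is found starting
   at text position pos: check the ancestors bottom-up, stopping as
   soon as one is absent.  Returns 1 iff the whole pattern matched. */
int hierarchical_check(const Node *leaf, const char *P,
                       const char *T, long n, long pos) {
    for (const Node *v = leaf->parent; v; v = v->parent) {
        long lo = pos - (leaf->from - v->from) - v->err;
        long hi = pos + (v->to - leaf->from) + 2 * v->err;
        if (lo < 0) lo = 0;
        if (hi > n) hi = n;
        if (!occurs(P, v->from, v->to, v->err, T, lo, hi))
            return 0;              /* parent absent: discard this hit */
    }
    return 1;                      /* root, i.e., the whole P, matched */
}
```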

An orthogonal hierarchical verification technique is also presented in Navarro and Baeza-Yates [1998a] to include superimposition in this scheme. If the superimposition of four patterns matches, the set is split in two sets of two patterns each, and it is checked whether some of them match, instead of verifying all four patterns one by one.

The analysis in Navarro [1998] and Navarro and Baeza-Yates [1998a] shows that the average verification cost drops to O((m/j)²). Only now does the problem scale well (i.e., O(γ^{m/j}) verification probability and O((m/j)²) verification cost). With hierarchical verification, the verification cost stays negligible for $\alpha < 1 - e/\sqrt{\sigma}$. All the simple extensions of bit-parallel algorithms apply, although the partition into j pieces may need some redesign for other distances. Notice that it is very difficult to break the barrier of $\alpha^* = 1 - e/\sqrt{\sigma}$ for any filter because, as shown in Section 4, there are too many real matches, and even the best filters must check real matches.

In the same year, 1998, the same authors [Navarro and Baeza-Yates 1999c; Navarro 1998] added hierarchical verification to the filter that splits the pattern in k + 1 pieces and searches them with zero errors. The analysis shows that with this technique the verification cost does not dominate the search time for α < 1/log_σ m. The resulting filter is the fastest for most cases of interest.

8.2.6 Navarro and Raffinot (1998). In 1998, Navarro and Raffinot [Navarro and Raffinot 2000; Navarro 1998] presented a novel approach based on suffix automata (see Section 3.2).


Fig. 21. The construction to search any reverse prefix of "survey" allowing 2 errors.

They adapted an exact string matching algorithm, BDM, to allow errors.

The idea of the original BDM algorithm is as follows [Crochemore et al. 1994; Crochemore and Rytter 1994]. The deterministic suffix automaton of the reverse pattern is built so that it recognizes the reverse prefixes of the pattern. Then the pattern is aligned with a text window, and the window is scanned backwards with the automaton (this is why the pattern is reversed). The automaton is active as long as what it has read is a substring of the pattern. Each time the automaton reaches a final state, it has seen a pattern prefix, so we remember the last time this happened. If the automaton arrives with active states at the beginning of the window, then the pattern has been found; otherwise, what is there is not a substring of the pattern and hence the pattern cannot be in the window. In any case, the last window position that matched a pattern prefix gives the next initial window position. The algorithm BNDM [Navarro and Raffinot 2000] is a bit-parallel implementation (using the nondeterministic suffix automaton; see Figure 3) which is much faster in practice and allows searching for classes of characters, etc.
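
For reference, the exact-matching BNDM scan that the approximate variant generalizes can be sketched as follows (hypothetical code for m ≤ 64; B[c] marks the positions of c in P, bit-reversed, so that bit m − 1 becomes set exactly when the window suffix read so far is a prefix of P):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Exact BNDM: scan each window backwards with the nondeterministic
   suffix automaton simulated by the bit mask D. */
void bndm(const char *P, const char *T) {
    int m = strlen(P);
    long n = strlen(T), pos = 0;
    uint64_t B[256] = {0};
    for (int i = 0; i < m; i++)
        B[(unsigned char)P[i]] |= 1ULL << (m - 1 - i);
    while (pos <= n - m) {
        int j = m, last = m;
        uint64_t D = ~0ULL >> (64 - m);          /* all m states active */
        while (D) {
            D &= B[(unsigned char)T[pos + j - 1]];
            j--;
            if (D & (1ULL << (m - 1))) {         /* pattern prefix seen */
                if (j > 0) last = j;             /* remember for shift  */
                else { printf("match at %ld\n", pos); break; }
            }
            if (j == 0) break;
            D <<= 1;
        }
        pos += last;                             /* shift the window    */
    }
}
```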

A modification of Navarro and Raffinot [2000] is to build an NFA to search the reversed pattern allowing errors, modify it to match any pattern suffix, and apply essentially the same BNDM algorithm using this automaton. Figure 21 shows the resulting automaton.

This automaton recognizes any reverse prefix of P allowing k errors. The window will be abandoned when no pattern substring matches what was read with k errors. The window is shifted to the next pattern prefix found with k errors. The matches must start exactly at the initial window position. The window length is m − k, not m, to ensure that if there is an occurrence starting at the window position then a substring of the pattern occurs in any suffix of the window (so that we do not abandon the window before reaching the occurrence). Reaching the beginning of the window does not guarantee a match, however, so we have to check the area by computing edit distance from the beginning of the window (at most m + k text characters).

In Appendix A.3 it is shown that the average complexity¹⁰ is $O(n(\alpha + \alpha^*\log_\sigma(m)/m)/((1-\alpha)\alpha^* - \alpha))$, and the filter works well for $\alpha < (1 - e/\sqrt{\sigma})/(2 - e/\sqrt{\sigma})$, which for large alphabets tends to 1/2. The result is competitive for low error levels, but the pattern cannot be very long because of the bit-parallel implementation.

¹⁰ The original analysis of Navarro [1998] is inaccurate.


Fig. 22. Algorithms LET and SET. LET covers all the text with pattern substrings, while SET works only at block beginnings and stops when it finds k differences.

Notice that trying to do this with the deterministic BDM would have generated a very complex construction, while the algorithm with the nondeterministic automaton is simple. Moreover, a deterministic automaton would have too many states, just as in Section 6.2. All the simple extensions of bit-parallelism apply, provided the window length m − k is carefully reconsidered.

A recent software program, called nrgrep, capable of fast exact and approximate searching of simple and complex patterns, has been built with this method [Navarro 2000b].

8.3 Very Long Patterns

8.3.1 Chang and Lawler (1990). In 1990, Chang and Lawler [1994] presented two algorithms (better analyzed in Giegerich et al. [1997]). The first one, called LET (for "linear expected time"), works as follows: the text is traversed linearly, and at each time the longest pattern substring that matches the text is maintained. When the substring cannot be extended further, it starts again from the current text position; Figure 22 illustrates.

The crucial observation is that, if fewer than m − k text characters have been covered by concatenating k longest substrings, then the text area does not match the pattern. This is evident because a match is formed by k + 1 correct strokes (recall Section 5.2) separated by k errors. Moreover, the strokes need to be ordered, which is not required by the filter.

The algorithm uses a suffix tree on the pattern to determine in a linear pass the longest pattern substring that matches the text seen up to now. Notice that the article is from 1990, the same year that Ukkonen and Wood [1993] did the same with a suffix automaton (see Section 5.2). Therefore, the filtering is in O(n) time. The authors use Landau and Vishkin [1989] as the verifying algorithm, and therefore the worst case is O(kn). The authors show that the filtering time dominates for α < 1/log_σ m + O(1). The constants are involved, but practical figures are α ≤ 0.35 for σ = 64, or α ≤ 0.15 for σ = 4.

The second algorithm presented is called SET (for "sublinear expected time"). The idea is similar to LET, except that the text is split in fixed blocks of size (m − k)/2, and the check for k contiguous strokes starts only at block boundaries. Since the shortest match is of length m − k, at least one of these blocks is always contained completely in a match. If one is able to discard the block, no occurrence can contain it. This is also illustrated in Figure 22.

The sublinearity is clear once it is proven that a block is discarded on average in O(k log_σ m) comparisons. Since 2n/(m − k) blocks are considered, the average time is O(α n log_σ(m)/(1 − α)). The maximum α level stays the same as in LET, so the complexity can be simplified to O(α n log_σ m). Although the proof that limits the comparisons per block is quite involved, it is not hard to see intuitively why it is true: the probability of finding a stroke of length ℓ in the pattern is limited by m/σ^ℓ, and the detailed proof shows that ℓ = log_σ m is on average the longest stroke found. This contrasts with the result of Myers [1986a] (Section 5.3), which shows that k strokes add up to O(k) length.


Fig. 23. Q-gram algorithms. The left one [Ukkonen 1992] counts the number of pattern q-grams in a text window. The right one [Sutinen and Tarhio 1995] finds sequences of pattern q-grams in approximately the same text positions (we have put in bold a text sample and the possible q-grams to match it).

The difference is that here we can take the strokes from anywhere in the pattern.

Both LET and SET are effective for very long patterns only, since their overhead does not pay off on short patterns. Different distance functions can be accommodated after rereasoning the adequate k values.

8.3.2 Ukkonen (1992). In 1992, Ukkonen [1992] independently rediscovered some of the ideas of Chang and Lawler. He presented two filtering algorithms, one of which (based on what he called "maximal matches") is similar to the LET of Chang and Lawler [1994] (in fact, Ukkonen presents it as a new "block distance" computable in linear time, and shows that it serves as a filter for the edit distance). The other filter is the first reference to "q-grams" for online searching (there are much older ones in indexed searching [Ullman 1977]).

A q-gram is a substring of length q. A filter was proposed based on counting the number of q-grams shared between the pattern and a text window (this is presented in terms of a new "q-gram distance," which may be of interest on its own). A pattern of length m has (m − q + 1) overlapping q-grams. Each error can alter q q-grams of the pattern, and therefore (m − q + 1 − kq) pattern q-grams must appear in any occurrence; Figure 23 illustrates.

Notice that this is a generalization of the counting filter of Jokinen et al. [1996] (Section 8.2), which corresponds to q = 1. The search algorithm is similar as well, although of course keeping a table with a counter for each of the σ^q q-grams is impractical (especially because only m − q + 1 of them are present). Ukkonen uses a suffix tree to keep count of the last q-gram seen in linear time (the relevant information can be attached to the m − q + 1 important nodes at depth q in the suffix tree).

The filter therefore takes linear time. There is no analysis to show which is the maximum error level tolerated by the filter, so we attempt a gross analysis in Appendix A.4, valid for large m. The result is that the filter works well for α < O(1/log_σ m), and that the optimal q to obtain it is q = log_σ m. The search algorithm is more complicated than that of Jokinen et al. [1996]; therefore, using larger q values only pays off for larger patterns. Different distance functions are easily accommodated by recomputing the number of q-grams that must be preserved in any occurrence.

8.3.3 Takaoka (1994). In 1994, Takaoka [1994] presented a simplification of Chang and Lawler [1994]. He considered h-samples of the text (which are nonoverlapping q-grams of the text taken every h characters, for h ≥ q). The idea is that if one h-sample is found in the pattern, then a neighborhood of the area is verified.

By using h = ⌊(m − k − q + 1)/(k + 1)⌋, one cannot miss a match. The easiest way to see this is to start with k = 0. Clearly, we need h = m − q + 1 so as not to lose any matches.


For larger k, recall that if the pattern is split in k + 1 pieces, some of them must appear with no errors. The filter divides h by k + 1 to ensure that any occurrence of those pieces will be found (we are assuming q < m/(k + 1)).

Using a suffix tree of the pattern, the h-sample can be found in O(q) time. Therefore the filtering time is O(qn/h), which is O(αn log_σ(m)/(1 − α)) if the optimal q = log_σ m is used. The error level is again α < O(1/log_σ m), which makes the time O(αn log_σ m).

8.3.4 Chang and Marr (1994). It looks like O(αn log_σ m) is the best complexity achievable by using filters, and that it will work only for α = O(1/log_σ m). But in 1994, Chang and Marr obtained an algorithm which was

$$O\!\left(\frac{k + \log_\sigma m}{m}\; n\right)$$

for α < ρ_σ, where ρ_σ depends only on σ and tends to $1 - e/\sqrt{\sigma}$ for very large σ.

At the same time, they proved that this was a lower bound for the average complexity of the problem (and therefore their algorithm was optimal on average). This is a major theoretical breakthrough.

The lower bound is obtained by taking the maximum (or sum) of two simple facts: the first one is the O(n log_σ(m)/m) bound of Yao [1979] for exact string matching, and the second one is the obvious fact that in order to discard a block of m text characters, at least k characters should be examined to find the k errors (and hence O(kn/m) is a lower bound). Also, the maximum error level is optimal according to Section 4. What is impressive is that an algorithm with such complexity was found.

The algorithm is a variation of SET [Chang and Lawler 1994]. It is of polynomial space in m, i.e., O(m^t) space for some constant t which depends on σ. It is based on splitting the text in contiguous substrings of length ℓ = t log_σ m. Instead of finding in the pattern the longest exact matches starting at the beginning of blocks of size (m − k)/2, it searches the text substrings of length ℓ in the pattern allowing errors.

The algorithm proceeds as follows. The best matches allowing errors inside P are precomputed for every ℓ-tuple (hence the O(m^t) space). Starting at the beginning of the block, it searches consecutive ℓ-tuples in the pattern (each in O(ℓ) time), until the total number of errors made exceeds k. If by that time it has not yet covered m − k text characters, the block can be safely skipped.

The reason why this works is a simple extension of SET. We have found an area contained in the possible occurrence which cannot be covered with k errors (even allowing the use of unordered portions of the pattern for the match). The algorithm is only practical for very long patterns, and can be extended to other distances with the same ideas as the other filtration and q-gram methods.

It is interesting to notice that $\alpha \leq 1 - e/\sqrt{\sigma}$ is the limit we have discussed in Section 4, which is a firm barrier for any filtering mechanism. Chang and Marr proved an asymptotic result, while a general bound is proved in Baeza-Yates and Navarro [1999]. The filters of Chang and Marr [1994] and Navarro and Baeza-Yates [1998a] reduce the problem to fewer errors instead of to zero errors. An interesting observation is that it seems that all the filters that partition the problem into exact searching can be applied for α = O(1/log_σ m), and that in order to improve this to $1 - e/\sqrt{\sigma}$ we must partition the problem into (smaller) approximate searching subproblems.

8.3.5 Sutinen and Tarhio (1995). Sutinen and Tarhio [1995] generalized the Takaoka filter in 1995, improving its filtering efficiency. This is the first filter that takes into account the relative positions of the pattern pieces that match in the text (all the previous filters matched pieces of the pattern in any order). The generalization is to force s q-grams of the pattern to match (not just one). The pieces must conserve their relative ordering in the pattern and must not be more than k characters away from their correct position (otherwise we need to make more than k errors to use them). This method is also illustrated in Figure 23.

In this case, the sampling step is reduced to h = ⌊(m − k − q + 1)/(k + s)⌋. The reason for this reduction is that, to ensure that s pieces of the pattern match, we need to cut the pattern into k + s pieces. The pattern is divided in k + s pieces, and a hashed set is created for each piece so that the pieces are forced not to be too far away from their correct positions. The set contains the q-grams of the piece and some neighboring ones too (because the sample can be slightly misaligned). At search time, instead of a single h-sample, they consider text windows of contiguous sequences of k + s h-samples. Each of these h-samples is searched for in the corresponding set, and if at least s are found, the area is verified. This is a sort of Hamming distance, and the authors resort to an efficient algorithm for that distance [Baeza-Yates and Gonnet 1992] to process the text.

The resulting algorithm is O(αn log_σ m) on average using the optimal q = log_σ m, and works well for α < 1/log_σ m. The algorithm is better suited for long patterns, although with s = 2 it can be reasonably applied to short ones as well. In fact, the analysis is done for s = 2 only in Sutinen and Tarhio [1995].

8.3.6 Shi (1996). In 1996, Shi [1996] proposed to extend the idea of the k + 1 pieces (explained in Section 8.2) to k + s pieces, so that at least s pieces must match. This idea is implicit in the filter of Sutinen and Tarhio but had not been explicitly written down. Shi compared his filter against the simple one, finding that the filtering efficiency was improved. However, this improvement will be noticeable only for long patterns. Moreover, the online searching efficiency is degraded because the pieces are shorter (which affects any Boyer–Moore-like search), and because the verification logic is more complex. No analysis is presented in the paper, but we conjecture that the optimum s is O(1) and therefore the same complexity and tolerance to errors is maintained.

8.3.7 Giegerich, Kurtz, Hischke, and Ohlebusch (1996). Also in 1996, a general method to improve filters was developed [Giegerich et al. 1997]. The idea is to mix the phases of filtering and checking, so that the verification of a text area is abandoned as soon as the combined information from the filter (number of guaranteed differences left) and the verification in progress (number of actual differences seen) shows that a match is not possible. As they show, however, the improvement occurs in a very narrow area of α. This is a consequence of the statistics of this problem that we have discussed in Section 4.

9. EXPERIMENTS

In this section we make empirical comparisons among the algorithms described in this work. Our goal is to show the best options at hand depending on the case. Nearly 40 algorithms have been surveyed, some of them without existing implementations and many of them already known to be impractical. To avoid excessively long comparisons among algorithms known not to be competitive, we have left many of them aside.

9.1 Included and Excluded Algorithms

A large group of excluded algorithms is from the theoretical side based on the dynamic programming matrix. Although these algorithms are not competitive in practice, they represent (or represented at their time) a valuable contribution to the development of the algorithmic aspect of the problem. The dynamic programming algorithm [Sellers 1980] is excluded because the cut-off heuristic of Ukkonen [1985b] is known to be faster (e.g., in Chang and Lampe [1992] and in our internal tests); the Masek and Paterson algorithm [1980] is argued in the same paper to be worse than dynamic programming (which is quite bad) for n < 40 GB; Landau and Vishkin [1988] has bad complexity and was improved later by many others in theory and practice; Landau and Vishkin [1989] is implemented with a better LCA algorithm in Chang and Lampe [1992] and found too slow; Myers [1986a] is considered slow in practice by the same author in Wu et al. [1996]; Galil and Giancarlo [1988] is clearly slower than Landau and Vishkin [1989]; Galil and Park [1990], one of the fastest among the O(kn) worst case algorithms, is shown to be extremely slow in Ukkonen and Wood [1993], Chang and Lampe [1992], and Wright [1994], and in internal tests done by ourselves; Ukkonen and Wood [1993] is shown to be slow in Jokinen et al. [1996]; the O(kn) algorithm implemented in Chang and Lawler [1994] is in the same paper argued to be the fastest of the group and shown to be not competitive in practice; Sahinalp and Vishkin [1997] and Cole and Hariharan [1998] are clearly theoretical; their complexities show that the patterns have to be very long and the error level too low to be of practical application. To give an idea of how slow is "slow," we found Galil and Park [1990] 10 times slower than Ukkonen's cut-off heuristic (a similar result is reported by Chang and Lampe [1992]). Finally, other O(kn) average time algorithms are proposed in Myers [1986a] and Galil and Park [1990], and they are shown to be very similar to Ukkonen's cut-off [Ukkonen 1985b] in Chang and Lampe [1992]. Since the cut-off heuristic is already not very competitive, we leave aside the other similar algorithms. Therefore, from the group based on dynamic programming we consider only the cut-off heuristic (mainly as a reference) and Chang and Lampe [1992], which is the only one competitive in practice.

From the algorithms based on automata we consider the DFA algorithm [Ukkonen 1985b], but prefer its lazy version implemented in Navarro [1997b], which is equally fast for small automata and much faster for large automata. We also consider the Four Russians algorithm of Wu et al. [1996]. From the bit-parallel algorithms we consider Wu and Manber [1992b], Baeza-Yates and Navarro [1999], and Myers [1999], leaving aside Wright [1994]. As shown in the 1996 version of Baeza-Yates and Navarro [1999], the algorithm of Wright [1994] was competitive only on binary text, and this was shown to not hold anymore in Myers [1999].

From the filtering algorithms, we have included Tarhio and Ukkonen [1993]; the counting filter proposed in Jokinen et al. [1996] (as simplified in Navarro [1997a]); the algorithm of Navarro and Raffinot [2000]; and those of Sutinen and Tarhio [1995] and Takaoka [1994] (the latter seen as the case s = 1 of Sutinen and Tarhio [1995], since this implementation worked better). We have also included the filters proposed in Baeza-Yates and Navarro [1999], Navarro and Baeza-Yates [1998a], and Navarro [1998], preferring to present only the last version, which incorporates all the twists of superimposition, hierarchical verification, and mixed partitioning. Many previous versions are outperformed by this one. We have also included the best version of the filters that partition the pattern in k + 1 pieces, namely the one incorporating hierarchical verification [Navarro and Baeza-Yates 1999c; Navarro 1998]. In those publications it is shown that this version clearly outperforms the previous ones proposed in Wu and Manber [1992b], Baeza-Yates and Perleberg [1996], and Baeza-Yates and Navarro [1999]. Finally, we are discarding some filters [Chang and Lawler 1994; Ukkonen 1992; Chang and Marr 1994; Shi 1996] which are applicable only to very long patterns, since this case is excluded from our experiments, as explained shortly. Some comparisons among them were carried out by Chang and Lampe [1992], showing that LET is equivalent to the cut-off algorithm with k = 20, and that the time for SET is 2α times that of LET. LET was shown to be the fastest with patterns a hundred letters long and a few errors in Jokinen et al. [1996], but we recall that many modern filters were not included in that comparison.

We now list the included algorithms and the relevant comments about them. All the algorithms implemented by us represent our best coding effort and have been found similar to or faster than other implementations found elsewhere. The implementations coming from other authors were checked with the same standards, and in some cases their code was improved with better register usage and I/O management. The number in parentheses following the name of each algorithm is the number of lines of the C implementation we use. This gives a rough idea of how complex the implementation of each algorithm is.

CTF (239) The cut-off heuristic of Ukkonen [1985b], implemented by us.

CLP (429) The column partitioning algorithm of Chang and Lampe [1992], implemented by them. We replaced their I/O by ours, which is faster.

DFA (291) The lazy deterministic automaton of Navarro [1997b], implemented by us.

RUS (304) The Four-Russians algorithm of Wu et al. [1996], implemented by them. We tried different r values (related to the time/space tradeoff) and found that the best option is always r = 5 on our machine.

BPR (229) The NFA bit-parallelized by rows [Wu and Manber 1992b], implemented by us and restricted to m ≤ w. Separate code is used for k = 1, 2, 3 and k > 3. We could continue writing separate versions, but decided that this is reasonable up to k = 3, as at that point the algorithm is not competitive anyway.

BPD (249–1,224) The NFA bit-parallelized by diagonals [Baeza-Yates and Navarro 1999], implemented by us. Here we do not include any filtering technique. The first number (249) corresponds to the plain technique and the second one (1,224) to handling partitioned automata.

BPM (283–722) The bit-parallel implementation of the dynamic programming matrix [Myers 1999], implemented by that author. The two numbers have the same meaning as in the previous item.

BMH (213) The adaptation of Horspool to allow errors [Tarhio and Ukkonen 1993], implemented by them. We use their algorithm 2 (which is faster), improve some register usage, and replace their I/O by ours, which is faster.

CNT (387) The counting filter of Jokinen et al. [1996], as simplified in Navarro [1997a] and implemented by us.

EXP (877) Partitioning in k + 1 pieces plus hierarchical verification [Navarro and Baeza-Yates 1999c; Navarro 1998], implemented by us.

BPP (3,466) The bit-parallel algorithms of Baeza-Yates and Navarro [1999], Navarro and Baeza-Yates [1998a], and Navarro [1998] using pattern partitioning, superimposition, and hierarchical verification. The implementation is ours and is packaged software that can be downloaded from the Web page of the author.

BND (375) The BNDM algorithm adapted to allow errors in Navarro and Raffinot [2000] and Navarro [1998], implemented by us and restricted to m ≤ w. Separate code is used for k = 1, 2, 3 and k > 3. We could continue writing separate versions, but decided that this is reasonable up to k = 3.

QG2 (191) The q-gram filter of Sutinen and Tarhio [1995], implemented by them and used with s = 2 (since s = 1 is the algorithm of Takaoka [1994], see next item, and s > 2 worked well only for very long patterns). The code is restricted to k ≤ w/2 − 3, and it is also not run when q is found to be 1, since the performance is very poor. We improved register usage and replaced the I/O management by our faster versions.

QG1 (191) The q-gram algorithm of Takaoka [1994], run as the special case s = 1 of the previous item. The same restrictions on the code apply.

We did our best to uniformize the algorithms. The I/O is the same in all cases: the text is read in chunks of 64 KB to improve locality (this is the optimum on our machine), and care is taken not to lose or repeat matches at the borders; open is used instead of fopen because the latter is slower. We also uniformize internal conventions: only a final special character (zero) is used at the end of the buffer to help algorithms recognize it, and only the number of matches found is reported.

In the experiments we separate the filtering and nonfiltering algorithms. This is because the filters can in general use any nonfilter to check for potential matches, so the best algorithm is formed by a combination of both. All the filtering algorithms in the experiments use the cut-off algorithm [Ukkonen 1985b] as their verification engine, except for BPP (whose very essence is to switch smoothly to BPD) and BND (which uses a reverse BPR to search in the window and a forward BPR for the verifications).

9.2 Experimental Setup

Apart from the algorithms and their details, we describe our experimental setup. We measure CPU times and show the results in tenths of seconds per megabyte. Our machine is a Sun UltraSparc-1 of 167 MHz with 64 MB of main memory; we run Solaris 2.5.1, and the texts are on a local disk of 2 GB. Our experiments were run on texts of 10 MB and repeated 20 times (with different search patterns). The same patterns were used for all the algorithms.

For the applications, we have selected three types of texts.

DNA: This file is formed by concatenating the 1.34 MB DNA chain of h. influenzae with itself until 10 MB is obtained. Lines are cut at 60 characters. The patterns are selected randomly from the text, avoiding line breaks if possible. The alphabet size is four, save for a few exceptions along the file, and the results are similar to those on a random four-letter text.

Natural language: This file is formed by 1.29 MB from the work of Benjamin Franklin, filtered to lower-case and with separators converted to a space (except line breaks, which are respected). This mimics common information retrieval scenarios. The text is replicated to obtain 10 MB, and search patterns are randomly selected from the same text at word beginnings. The results are roughly equivalent to those on a random text over 15 characters.

Speech: We obtained speech files from discussions of U.S. law from Indiana University, in PCM format with 8 bits per sample. Of course, the standard edit distance is of no use here, since it has to take into account the absolute values of the differences between two characters. We simplified the problem in order to use the edit distance: we reduced the range of values to 64 by quantization, considering two samples that lie in the same range as equal. We used the first 10 MB of the resulting file. The results are similar to those on a random text of 50 letters, although the file shows smooth changes from one letter to the next.

We present results using different pattern lengths and error levels in two flavors: we fix m and show the effect of increasing k, or we fix α and show the effect of increasing m. A given algorithm may not appear at all in a plot when its times are above the y range or its restrictions on m and k do not intersect with the x range. In particular, filters are shown only for α ≤ 1/2. We remind readers that in most applications the error levels of interest are low.

9.3 Results

Figure 24 shows the results for short patterns (m = 10) and varying k. Among nonfiltering algorithms, BPD is normally the fastest, up to 30% faster than the next one, BPM. The DFA is also quite close in most cases. For k = 1, a specialized version of BPR is slightly faster than BPD (recall that for k > 3 BPR starts to use a nonspecialized algorithm, hence the jump). An exception occurs in DNA text, where for k = 4 and k = 5 BPD shows a nonmonotonic behavior and BPM becomes the fastest. This behavior comes from its O(k(m − k)n/w) complexity.¹¹

¹¹ Another reason for this behavior is that there are integer round-off effects that produce nonmonotonic results.


Fig. 24. Results for m = 10 and varying k. The left plots show nonfiltering and the right plots show filtering algorithms. Rows 1–3 show DNA, English, and speech files, respectively.


In texts with larger alphabets the effect is not noticeable, because the cut-off heuristic keeps the cost unchanged. Indeed, the behavior of BPD would have been totally stable if we had chosen m = 9 instead of m = 10, because the problem would fit in a computer word all the time. BPM, on the other hand, handles much longer patterns while maintaining stability, although it takes up to 50% more time than BPD.

With respect to filters, EXP is the fastest for low error levels. The value of "low" increases for larger alphabets. At some point, BPP starts to dominate. BPP adapts smoothly to higher error levels by slowly switching to BPD, so BPP is a good alternative for intermediate error levels, where EXP ceases to work until it switches to BPD. However, this range is void on DNA and English text for m = 10. Other filters competitive with EXP are BND and BMH. In fact, BND is the fastest for k = 1 on DNA, although no filter works very well in that case. Finally, QG2 does not appear because it only works for k = 1 and it was worse than QG1.

The best choice for short patterns seems to be EXP while it works, switching to the best bit-parallel algorithm for higher error levels. Moreover, the verification algorithm for EXP should be BPR or BPD (which are the fastest where EXP dominates).

Figure 25 shows the case of longer patterns (m = 30). Many of the observations are still valid in this case. However, in this case the algorithm BPM shows its advantage over BPD, since the entire problem still fits in a computer word for BPM and it does not for BPD. Hence in the left plots the best algorithm is BPM, except for low k, where BPR or BPD are better. With respect to filters, EXP or BND are the fastest, depending on the alphabet, until a certain error level is reached. At that point BPP becomes the fastest, in some cases still faster than BPM. Notice that for DNA a specialized version of BND for k = 4 and even 5 could be the fastest choice.

In Figure 26 we consider the case of fixed α = 0.1 and growing m. For nonfiltering algorithms, the results somewhat repeat the previous ones: BPR is the best for k = 1 (i.e., m = 10), then BPD is the best until a certain pattern length is reached (which varies from 30 on DNA to 80 on speech), and finally BPM becomes the fastest. Note that for such a low error level the number of active columns is quite small, which permits algorithms like BPD and BPM to keep their good behavior for patterns much longer than they could handle in a single machine word. The DFA is also quite competitive until its memory requirements become unreasonable.

The real change, however, is in the filters. In this case EXP becomes the star filter on English and speech texts. The situation for DNA, on the other hand, is quite complex. For m ≤ 30, BND is the fastest, and indeed an extended implementation allowing longer patterns could keep it the fastest for a few more points. However, that case would have to handle four errors, and only a specialized implementation for fixed k = 4, 5, ... could maintain a competitive performance. We have determined that such specialized code is worthwhile up to k = 3 only. When BND ceases to be applicable, EXP becomes the fastest algorithm, and finally QG2 beats it (for m ≥ 60). However, notice that for m > 30 all the filters are beaten by BPM and therefore make little sense (on DNA).

There is a final phenomenon that deserves mention with respect to filters. The algorithms QG1 and QG2 improve as m grows. These algorithms are the most practical ones, and the only ones we tested, in the family of algorithms suitable for very long patterns. Thus, although these algorithms are not competitive in our tests (where m ≤ 100), they should be considered in scenarios where the patterns are much longer and the error level is kept very low. In such a scenario, those algorithms would finally beat all the algorithms we consider here.

The situation becomes worse for the filters when we consider α = 0.3 and varying m (Figure 27). On DNA, no filter can beat the nonfiltering algorithms, and among the latter the tricks to maintain a few active columns do not work well.


Fig. 25. Results for m = 30 and varying k. The left plots show nonfiltering and the right plots show filtering algorithms. Rows 1–3 show DNA, English, and speech files, respectively.


Fig. 26. Results for α = 0.1 and varying m. The left plots show nonfiltering and the right plots show filtering algorithms. Rows 1–3 show DNA, English, and speech files, respectively.


Fig. 27. Results for α = 0.3 and varying m. The left plots show nonfiltering and the right plots show filtering algorithms. Rows 1–3 show DNA, English and speech files, respectively.


Fig. 28. The areas where each algorithm is the best; gray is that of filtering algorithms.

This favors the algorithms that pack more information per bit, which makes BPM the best in all cases except for m = 10 (where BPD is better). The situation is almost the same on English text, except that BPP works reasonably well and becomes quite similar to BPM (the periods where each one dominates are interleaved). On speech, on the other hand, the scenario is similar to that for nonfiltering algorithms, but the PEX filter still beats all of them, as 30% of errors is low enough on the speech files. Note in passing that the error level is too high for QG1 and QG2, which can only be applied in a short range and yield bad results.

To give an idea of the areas where each algorithm dominates, Figure 28 shows the case of English text. There is more information in Figure 28 than can be inferred from the previous plots, such as the area where RUS is better than BPM. We have shown the nonfiltering algorithms and superimposed in gray the area where the filters dominate. Therefore, in the gray area the best choice is to use the corresponding filter with the dominating nonfilter as its verification engine. In the nongray area it is better to use the dominating nonfiltering algorithm directly, with no filter.

A code implementing such a heuristic (including EXP, BPD and BPP only) is publicly available from the author's Web page.12 This combined code is faster than each isolated algorithm, although of course it is not really a single algorithm but the combination of the best choices.
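To illustrate the structure of such a combination, here is a minimal sketch of ours (not the released code): a partition filter in the spirit of PEX splits the pattern into k + 1 disjoint pieces, scans the text for exact occurrences of the pieces, and runs a nonfiltering verifier only around the candidate positions.

    def verify(pattern, window, k):
        # Plain DP check: does pattern occur inside window with at
        # most k errors?  Row 0 is all zeros, so a match may start
        # anywhere; any nonfiltering algorithm (e.g. BPM) could
        # replace this quadratic verifier.
        prev = [0] * (len(window) + 1)
        for i, pc in enumerate(pattern, 1):
            cur = [i] + [0] * len(window)
            for j, wc in enumerate(window, 1):
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                             prev[j - 1] + (pc != wc))
            prev = cur
        return min(prev) <= k

    def filtered_search(pattern, text, k):
        # Pigeonhole filter: if the pattern occurs with <= k errors,
        # at least one of k + 1 disjoint pieces occurs unchanged.
        # Assumes m > k, so that every piece is nonempty.
        m = len(pattern)
        plen = m // (k + 1)
        pieces = [pattern[i * plen:(i + 1) * plen] for i in range(k + 1)]
        windows = set()
        for idx, piece in enumerate(pieces):
            pos = text.find(piece)
            while pos != -1:
                # Verify a neighborhood wide enough to hold any
                # occurrence this piece could belong to.
                lo = max(0, pos - idx * plen - k)
                hi = min(len(text), pos + (m - idx * plen) + k)
                if verify(pattern, text[lo:hi], k):
                    windows.add((lo, hi))
                pos = text.find(piece, pos + 1)
        return sorted(windows)

All the speed comes from how rarely the verifier is invoked; once the error level makes verifications frequent, the filter performs worse than running the verifier over the whole text, which is exactly the transition visible in the plots.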

10. CONCLUSIONS

We reach the end of this tour on approximate string matching. Our goal has been to present and explain the main ideas behind the existing algorithms, to classify them according to the type of approach proposed, and to show how they perform in practice in a subset of possible practical scenarios. We have shown that the oldest approaches, based on the dynamic programming matrix, yield the most important theoretical developments, but in general the algorithms have been improved by modern developments based on filtering and bit-parallelism. In particular, the fastest algorithms combine a fast filter to discard most of the text with a fast nonfilter algorithm to check the potential matches.

We show some plots summarizing the contents of the survey. Figure 29 shows the historical order in which the algorithms appeared in the different areas.

12 http://www.dcc.uchile.cl/∼gnavarro/pubcode. To apply EXP the option -ep must be used.


Fig. 29. Historical development of the different areas. References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: Sel80 = [Sellers 1980], MP80 = [Masek and Paterson 1980], LV88 = [Landau and Vishkin 1988], Ukk85b = [Ukkonen 1985b], LV89 = [Landau and Vishkin 1989], Mye86a = [Myers 1986a], GG88 = [Galil and Giancarlo 1988], GP90 = [Galil and Park 1990], CL94 = [Chang and Lawler 1994], UW93 = [Ukkonen and Wood 1993], TU93 = [Tarhio and Ukkonen 1993], JTU96 = [Jokinen et al. 1996], CL92 = [Chang and Lampe 1992], WMM96 = [Wu et al. 1996], WM92b = [Wu and Manber 1992b], BYP96 = [Baeza-Yates and Perleberg 1996], Ukk92 = [Ukkonen 1992], Wri94 = [Wright 1994], CM94 = [Chang and Marr 1994], Tak94 = [Takaoka 1994], Mel96 = [Melichar 1996], ST95 = [Sutinen and Tarhio 1995], Kur96 = [Kurtz 1996], BYN99 = [Baeza-Yates and Navarro 1999], Shi96 = [Shi 1996], GKHO97 = [Giegerich et al. 1997], SV97 = [Sahinalp and Vishkin 1997], Nav97a = [Navarro 1997a], CH98 = [Cole and Hariharan 1998], Mye99 = [Myers 1999], NBY98a & NBY99b = [Navarro and Baeza-Yates 1998a; 1999b], and NR00 = [Navarro and Raffinot 2000].

Figure 30 shows a worst-case time/space complexity plot for the nonfiltering algorithms. Figure 31 considers filtration algorithms, showing their average-case complexity and the maximum error level α for which they work. Some practical assumptions have been made to order the different functions of k, m, σ, w, and n.

Approximate string matching is a very active research area, and it should continue in that status in the foreseeable future: strong genome projects in computational biology, the pressure for oral human-machine communication, and the heterogeneity and spelling errors present in textual databases are just a sample of the reasons that drive researchers to look for faster and more flexible algorithms for approximate pattern matching.

It is interesting to point out theoretical and practical questions that are still open.


Fig. 30. Worst-case time and space complexity of nonfiltering algorithms. We replaced w by Θ(log n). References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: Sel80 = [Sellers 1980], LV88 = [Landau and Vishkin 1988], WM92b = [Wu and Manber 1992b], GG88 = [Galil and Giancarlo 1988], UW93 = [Ukkonen and Wood 1993], GP90 = [Galil and Park 1990], CL94 = [Chang and Lawler 1994], Mye86a = [Myers 1986a], LV89 = [Landau and Vishkin 1989], CH98 = [Cole and Hariharan 1998], BYN99 = [Baeza-Yates and Navarro 1999], Mye99 = [Myers 1999], WMM96 = [Wu et al. 1996], MP80 = [Masek and Paterson 1980], and Ukk85a = [Ukkonen 1985a].

—The exact matching probability and the average edit distance between two random strings are difficult open questions. We found a new bound in this survey, but the problem is still open.

—A worst-case lower bound for the problem is clearly Ω(n), but the only algorithms achieving it have space and preprocessing cost exponential in m or k. The only improvements to the worst case with polynomial space complexity are the O(kn) algorithms and, for very small k, O(n(1 + k^4/m)). Is it possible to improve the algorithms or to find a better lower bound for this case?

—The previous question also has a practical side: Is it possible to find an algorithm which is O(kn) in the worst case and efficient in practice? Using bit-parallelism, there are good practical algorithms that achieve O(kn/w) on average and O(mn/w) in the worst case.

—The lower bound of the problem for the average case is known to be Ω(n(k + log_σ m)/m), and there exists an algorithm achieving it, so from the theoretical point of view the problem is closed. However, from the practical side, the algorithms approaching those limits work well only for very long patterns, while a much simpler algorithm (EXP) is the best for moderate and short patterns. Is it possible to find a unified approach, good in practice and with that theoretical complexity?


Fig. 31. Average time and maximum tolerated error level for the filtration algorithms. References are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits of years. Key: BYN99 = [Baeza-Yates and Navarro 1999], NBY98a = [Navarro and Baeza-Yates 1998a], JTU96 = [Jokinen et al. 1996], Ukk92 = [Ukkonen 1992], CL94 = [Chang and Lawler 1994], WM92b = [Wu and Manber 1992b], TU93 = [Tarhio and Ukkonen 1993], Tak94 = [Takaoka 1994], Shi96 = [Shi 1996], ST95 = [Sutinen and Tarhio 1995], NR00 = [Navarro and Raffinot 2000], and CM94 = [Chang and Marr 1994].

—Another practical question on filtering algorithms is: Is it possible in practice to improve over the current best existing algorithms?

—Finally, there are many other open questions related to offline approximate searching, which is a much less mature area needing more research.

APPENDIX A. SOME ANALYSES

Since some of the source papers lack an analysis, or they do not analyze exactly what is of interest to us, we provide a simple analysis. This is not the purpose of this survey, so we content ourselves with rough figures. In particular, our analyses are valid for σ ≪ m. All refer to filters and are organized according to the original order, so the reader should first read the algorithm description to understand the terminology.

A.1 Tarhio and Ukkonen (1990)

First, the probability of a text character being "bad" is that of not matching 2k + 1 pattern positions, i.e., P_bad = (1 − 1/σ)^{2k+1} ≈ e^{−(2k+1)/σ}, so we try on average 1/P_bad characters until we find a bad one. Since k + 1 bad characters have to be found, we make O(k/P_bad) comparisons until we leave the window. On the other hand, the probability of verifying a text window is that of reaching its beginning. We approximate that probability by equating m to the average portion of the traversed window (k/P_bad), to obtain α < e^{−(2k+1)/σ}.
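As a quick numerical check of this rough model (a sketch of ours, not part of the original analysis), the predicted applicability limit can be tabulated for concrete parameters:

    import math

    def tu_alpha_limit(sigma, m):
        # Largest k/m still satisfying k/m < exp(-(2k+1)/sigma),
        # i.e. the rough applicability limit derived above.
        k = 0
        while k + 1 < m and (k + 1) / m < math.exp(-(2 * (k + 1) + 1) / sigma):
            k += 1
        return k / m

    print(tu_alpha_limit(26, 20))   # English-sized alphabet
    print(tu_alpha_limit(4, 20))    # DNA: a much lower limit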

A.2 Wu and Manber (1992)

The Sunday algorithm can be analyzed as follows. To see how far we can verify in the current window, consider that (k + 1) patterns have to fail. Each one fails on average after log_σ(m/(k + 1)) character comparisons, but the time for all of them to fail is longer. By Yao's bound [Yao 1979], it cannot be less than log_σ m. Otherwise we could split the test of a single pattern into (k + 1) tests of subpatterns, and all of them would fail in less than log_σ m time, breaking the lower bound. To compute the average shift, consider that k characters must be different from the last window character, and therefore the average shift is σ/k. The final complexity is therefore O(kn log_σ(m)/σ). This is optimistic, but we conjecture that it is the correct complexity. An upper bound is obtained by replacing k by k^2 (i.e., adding up the times for all the pieces to fail).
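Under the same rough assumptions, the conjectured cost per text character and its pessimistic variant are easy to tabulate (again a sketch of ours, with all constants dropped):

    import math

    def wm_cost_per_char(m, k, sigma, pessimistic=False):
        # Optimistic model: one window costs log_sigma(m) and the
        # average shift is sigma/k, giving k * log_sigma(m) / sigma
        # work per text character.  The upper bound adds up the
        # failure times of all the pieces, replacing k by k^2.
        work = k * k if pessimistic else k
        return work * math.log(m, sigma) / sigma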

A.3 Navarro and Raffinot (1998)

The automaton matches the text window with k errors until almost surely k/α* characters have been inspected (so that the error level becomes lower than α*). From there on, the matching probability becomes exponentially decreasing on γ, which can be made 1/σ in O(k) total steps. From that point on, we are in a case of exact string matching and then log_σ m characters are inspected, for a total of O(k/α* + log_σ m). When the window is shifted to the last prefix that matched with k errors, this prefix is also at distance k/α* from the end of the window, on average. The window length is m − k, and therefore we shift the window by m − k − k/α* on average. Therefore, the total amount of work is O(n(α + α* log_σ(m)/m)/((1 − α)α* − α)). The filter works well unless the probability of finding a pattern prefix with errors at the beginning of the window is high. This is the same as saying that k/α* = m − k, which gives

α < (1 − e/√σ) / (2 − e/√σ).
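Evaluating this limit for concrete alphabet sizes (our own sketch) matches the experimental picture: the bound is negative for DNA, where the filter indeed pays off only at low k, and it grows with σ:

    import math

    def nr_alpha_limit(sigma):
        # alpha < (1 - e/sqrt(sigma)) / (2 - e/sqrt(sigma)); a
        # negative value means the rough model predicts no useful
        # error level at all.
        r = math.e / math.sqrt(sigma)
        return (1 - r) / (2 - r)

    for sigma in (4, 16, 26, 64):
        print(sigma, round(nr_alpha_limit(sigma), 3))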

A.4 Ukkonen (1992)

The probability of finding a given q-gram in the text window is 1 − (1 − 1/σ^q)^m ≈ 1 − e^{−m/σ^q}. So the probability of verifying the text position is that of finding (m − q + 1 − kq) q-grams of the pattern, i.e.,

C(m − q + 1, kq) (1 − e^{−m/σ^q})^{m−q+1−kq}.

This must be O(1/m^2) in order not to interfere with the search time. Taking logarithms and approximating the combinatorials using Stirling's n! = (n/e)^n √(2πn) (1 + O(1/n)), we arrive at

kq < (2 log_σ m + (m − q + 1) log_σ(1 − e^{−m/σ^q})) / (log_σ(1 − e^{−m/σ^q}) + log_σ(kq) − log_σ(m − q + 1)),

from which, by replacing q = log_σ m, we obtain

α < (log_σ α + log_σ log_σ m) / log_σ m = O(1/log_σ m),

a quite common result for this type of filter. The value q = log_σ m is chosen because the result improves as q grows, but it is necessary that q ≤ log_σ m holds, since otherwise log_σ(1 − e^{−m/σ^q}) becomes zero and the result worsens.
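The verification probability above is easy to evaluate directly (a sketch of ours, under the same independence assumptions), showing for which k the filter keeps verification below the O(1/m^2) target:

    import math

    def qgram_verification_prob(m, k, q, sigma):
        # Probability, under the rough model above, that a text
        # window triggers verification: all but kq of the m - q + 1
        # pattern q-grams are found in the window.
        p = 1 - math.exp(-m / sigma ** q)     # one q-gram is present
        return math.comb(m - q + 1, k * q) * p ** (m - q + 1 - k * q)

    m, sigma = 64, 4
    q = max(1, int(math.log(m, sigma) + 1e-9))  # q = log_sigma m
    for k in range(1, 6):
        ok = qgram_verification_prob(m, k, q, sigma) < 1 / m ** 2
        print(k, ok)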

ACKNOWLEDGMENTS

The author thanks the many researchers in this area for their willingness to exchange ideas and/or share their implementations: Amihood Amir, Ricardo Baeza-Yates, William Chang, Udi Manber, Gene Myers, Erkki Sutinen, Tadao Takaoka, Jorma Tarhio, Esko Ukkonen, and Alden Wright. The referees also provided important suggestions that improved the presentation.


REFERENCES

AHO, A. AND CORASICK, M. 1975. Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 6, 333–340.

AHO, A., HOPCROFT, J., AND ULLMAN, J. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA.

ALTSCHUL, S., GISH, W., MILLER, W., MYERS, G., AND LIPMAN, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

AMIR, A., LEWENSTEIN, M., AND LEWENSTEIN, N. 1997a. Pattern matching in hypertext. In Proceedings of the 5th International Workshop on Algorithms and Data Structures (WADS '97). LNCS, vol. 1272, Springer-Verlag, Berlin, 160–173.

AMIR, A., AUMANN, Y., LANDAU, G., LEWENSTEIN, M., AND LEWENSTEIN, N. 1997b. Pattern matching with swaps. In Proceedings of the Foundations of Computer Science (FOCS '97), 144–153.

APOSTOLICO, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. Springer-Verlag, Berlin, 85–96.

APOSTOLICO, A. AND GALIL, Z. 1985. Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, Berlin.

APOSTOLICO, A. AND GALIL, Z. 1997. Pattern Matching Algorithms. Oxford University Press, Oxford, UK.

APOSTOLICO, A. AND GUERRA, C. 1987. The Longest Common Subsequence problem revisited. Algorithmica 2, 315–336.

ARAUJO, M., NAVARRO, G., AND ZIVIANI, N. 1997. Large text searching allowing errors. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 2–20.

ARLAZAROV, V., DINIC, E., KRONROD, M., AND FARADZEV, I. 1975. On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209–1210. Original in Russian in Dokl. Akad. Nauk SSSR 194, 1970.

ATALLAH, M., JACQUET, P., AND SZPANKOWSKI, W. 1993. A probabilistic approach to pattern matching with mismatches. Random Struct. Algor. 4, 191–213.

BAEZA-YATES, R. 1989. Efficient Text Searching. Ph.D. thesis, Dept. of Computer Science, University of Waterloo. Also as Res. Rep. CS-89-17.

BAEZA-YATES, R. 1991. Some new results on approximate string matching. In Workshop on Data Structures, Dagstuhl, Germany. Abstract.

BAEZA-YATES, R. 1992. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress. Elsevier Science, Amsterdam, vol. I, 465–476.

BAEZA-YATES, R. 1996. A unified view of string matching algorithms. In Proceedings of the Theory and Practice of Informatics (SOFSEM '96). LNCS, vol. 1175, Springer-Verlag, Berlin, 1–15.

BAEZA-YATES, R. AND GONNET, G. 1992. A new approach to text searching. Commun. ACM 35, 10, 74–82. Preliminary version in ACM SIGIR '89.

BAEZA-YATES, R. AND GONNET, G. 1994. Fast string matching with mismatches. Information and Computation 108, 2, 187–199. Preliminary version as Tech. Rep. CS-88-36, Data Structuring Group, Univ. of Waterloo, Sept. 1988.

BAEZA-YATES, R. AND NAVARRO, G. 1997. Multiple approximate string matching. In Proceedings of the 5th International Workshop on Algorithms and Data Structures (WADS '97). LNCS, vol. 1272, Springer-Verlag, Berlin, 174–184.

BAEZA-YATES, R. AND NAVARRO, G. 1998. New and faster filters for multiple approximate string matching. Tech. Rep. TR/DCC-98-10, Dept. of Computer Science, University of Chile. Random Struct. Algor., to appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/multi.ps.gz.

BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127–158. Preliminary versions in Proceedings of CPM '96 (LNCS, vol. 1075, 1996) and in Proceedings of WSP '96, Carleton Univ. Press, 1996.

BAEZA-YATES, R. AND NAVARRO, G. 2000. Block-addressing indices for approximate text retrieval. J. Am. Soc. Inf. Sci. (JASIS) 51, 1 (Jan.), 69–82.

BAEZA-YATES, R. AND PERLEBERG, C. 1996. Fast and practical approximate pattern matching. Information Processing Letters 59, 21–27. Preliminary version in CPM '92 (LNCS, vol. 644, 1992).

BAEZA-YATES, R. AND REGNIER, M. 1990. Fast algorithms for two dimensional and multiple pattern matching. In Proceedings of the Scandinavian Workshop on Algorithmic Theory (SWAT '90). LNCS, vol. 447, Springer-Verlag, Berlin, 332–347.

BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.

BLUMER, A., BLUMER, J., HAUSSLER, D., EHRENFEUCHT, A., CHEN, M., AND SEIFERAS, J. 1985. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55.

BOYER, R. AND MOORE, J. 1977. A fast string searching algorithm. Commun. ACM 20, 10, 762–772.

CHANG, W. AND LAMPE, J. 1992. Theoretical and empirical comparisons of approximate string matching algorithms. In Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching (CPM '92). LNCS, vol. 644, Springer-Verlag, Berlin, 172–181.

CHANG, W. AND LAWLER, E. 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 4/5, 327–344. Preliminary version in FOCS '90.

CHANG, W. AND MARR, T. 1994. Approximate string matching and local similarity. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM '94). LNCS, vol. 807, Springer-Verlag, Berlin, 259–273.


CHVATAL, V. AND SANKOFF, D. 1975. Longest common subsequences of two random sequences. J. Appl. Probab. 12, 306–315.

COBBS, A. 1995. Fast approximate matching using suffix trees. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching (CPM '95), 41–54.

COLE, R. AND HARIHARAN, R. 1998. Approximate string matching: a simpler faster algorithm. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (SODA '98), 463–472.

COMMENTZ-WALTER, B. 1979. A string matching algorithm fast on the average. In Proc. ICALP '79. LNCS, vol. 6, Springer-Verlag, Berlin, 118–132.

CORMEN, T., LEISERSON, C., AND RIVEST, R. 1990. Introduction to Algorithms. MIT Press, Cambridge, MA.

CROCHEMORE, M. 1986. Transducers and repetitions. Theor. Comput. Sci. 45, 63–86.

CROCHEMORE, M. AND RYTTER, W. 1994. Text Algorithms. Oxford Univ. Press, Oxford, UK.

CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Speeding up two string-matching algorithms. Algorithmica 12, 247–267.

DAMERAU, F. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3, 171–176.

DAS, G., FLEISHER, R., GASIENIEC, L., GUNOPULOS, D., AND KARKAINEN, J. 1997. Episode matching. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97). LNCS, vol. 1264, Springer-Verlag, Berlin, 12–27.

DEKEN, J. 1979. Some limit results for longest common subsequences. Discrete Math. 26, 17–31.

DIXON, R. AND MARTIN, T., Eds. 1979. Automatic Speech and Speaker Recognition. IEEE Press, New York.

EHRENFEUCHT, A. AND HAUSSLER, D. 1988. A new distance metric on strings computable in linear time. Discrete Appl. Math. 20, 191–203.

ELLIMAN, D. AND LANCASTER, I. 1990. A review of segmentation and contextual analysis techniques for text recognition. Pattern Recog. 23, 3/4, 337–346.

FRENCH, J., POWELL, A., AND SCHULMAN, E. 1997. Applications of approximate word matching in information retrieval. In Proceedings of the 6th ACM International Conference on Information and Knowledge Management (CIKM '97), 9–15.

GALIL, Z. AND GIANCARLO, R. 1988. Data structures and algorithms for approximate string matching. J. Complexity 4, 33–72.

GALIL, Z. AND PARK, K. 1990. An improved algorithm for approximate string matching. SIAM J. Comput. 19, 6, 989–999. Preliminary version in ICALP '89 (LNCS, vol. 372, 1989).

GIEGERICH, R., KURTZ, S., HISCHKE, F., AND OHLEBUSCH, E. 1997. A general technique to improve filter algorithms for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 38–52. Preliminary version as Tech. Rep. 96-01, Universität Bielefeld, Germany, 1996.

GONNET, G. 1992. A tutorial introduction to Computational Biochemistry using Darwin. Tech. rep., Informatik E.T.H., Zuerich, Switzerland.

GONNET, G. AND BAEZA-YATES, R. 1991. Handbook of Algorithms and Data Structures, 2nd ed. Addison-Wesley, Reading, MA.

GONZALEZ, R. AND THOMASON, M. 1978. Syntactic Pattern Recognition. Addison-Wesley, Reading, MA.

GOSLING, J. 1991. A redisplay algorithm. In Proceedings of the ACM SIGPLAN/SIGOA Symposium on Text Manipulation, 123–129.

GROSSI, R. AND LUCCIO, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113–120.

GUSFIELD, D. 1997. Algorithms on Strings, Trees and Sequences. Cambridge Univ. Press, Cambridge.

HALL, P. AND DOWLING, G. 1980. Approximate string matching. ACM Comput. Surv. 12, 4, 381–402.

HAREL, D. AND TARJAN, E. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 2, 338–355.

HECKEL, P. 1978. A technique for isolating differences between files. Commun. ACM 21, 4, 264–268.

HOLSTI, N. AND SUTINEN, E. 1994. Approximate string matching using q-gram places. In Proceedings of the 7th Finnish Symposium on Computer Science. Univ. of Joensuu, 23–32.

HOPCROFT, J. AND ULLMAN, J. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.

HORSPOOL, R. 1980. Practical fast searching in strings. Software Practice Exper. 10, 501–506.

JOKINEN, P. AND UKKONEN, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 2nd Mathematical Foundations of Computer Science (MFCS '91). Springer-Verlag, Berlin, vol. 16, 240–248.

JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. A comparison of approximate string matching algorithms. Software Practice Exper. 26, 12, 1439–1458. Preliminary version in Tech. Rep. A-1991-7, Dept. of Computer Science, Univ. of Helsinki, 1991.

KARLOFF, H. 1993. Fast algorithms for approximately counting mismatches. Inf. Process. Lett. 48, 53–60.

KECECIOGLU, J. AND SANKOFF, D. 1995. Exact and approximation algorithms for the inversion distance between two permutations. Algorithmica 13, 180–210.

KNUTH, D. 1973. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA.


KNUTH, D., MORRIS, J., JR., AND PRATT, V. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 1, 323–350.

KUKICH, K. 1992. Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 4, 377–439.

KUMAR, S. AND SPAFFORD, E. 1994. A pattern-matching model for intrusion detection. In Proceedings of the National Computer Security Conference, 11–21.

KURTZ, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). Carleton Univ. Press, 156–170.

KURTZ, S. AND MYERS, G. 1997. Estimating the probability of approximate matches. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97). LNCS, vol. 1264, Springer-Verlag, Berlin, 52–64.

LANDAU, G. AND VISHKIN, U. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 63–78. Preliminary version in FOCS '85.

LANDAU, G. AND VISHKIN, U. 1989. Fast parallel and serial approximate string matching. J. Algor. 10, 157–169. Preliminary version in ACM STOC '86.

LANDAU, G., MYERS, E., AND SCHMIDT, J. 1998. Incremental string comparison. SIAM J. Comput. 27, 2, 557–582.

LAWRENCE, S. AND GILES, C. L. 1999. Accessibility of information on the web. Nature 400, 107–109.

LEE, J., KIM, D., PARK, K., AND CHO, Y. 1997. Efficient algorithms for approximate string matching with swaps. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97). LNCS, vol. 1264, Springer-Verlag, Berlin, 28–39.

LEVENSHTEIN, V. 1965. Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transmission 1, 8–17.

LEVENSHTEIN, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 8, 707–710. Original in Russian in Dokl. Akad. Nauk SSSR 163, 4, 845–848, 1965.

LIPTON, R. AND LOPRESTI, D. 1985. A systolic array for rapid string comparison. In Proceedings of the Chapel Hill Conference on VLSI, 363–376.

LOPRESTI, D. AND TOMKINS, A. 1994. On the searchability of electronic ink. In Proceedings of the 4th International Workshop on Frontiers in Handwriting Recognition, 156–165.

LOPRESTI, D. AND TOMKINS, A. 1997. Block edit models for approximate string matching. Theor. Comput. Sci. 181, 1, 159–179.

LOWRANCE, R. AND WAGNER, R. 1975. An extension of the string-to-string correction problem. J. ACM 22, 177–183.

LUCZAK, T. AND SZPANKOWSKI, W. 1997. A suboptimal lossy data compression based on approximate pattern matching. IEEE Trans. Inf. Theor. 43, 1439–1451.

MANBER, U. AND WU, S. 1994. GLIMPSE: A tool to search through entire file systems. In Proceedings of the USENIX Technical Conference. USENIX Association, Berkeley, CA, 23–32. Preliminary version as Tech. Rep. 93-34, Dept. of Computer Science, Univ. of Arizona, Oct. 1993.

MASEK, W. AND PATERSON, M. 1980. A faster algorithm for computing string edit distances. J. Comput. Syst. Sci. 20, 18–31.

MASTERS, H. 1927. A study of spelling errors. Univ. of Iowa Studies in Educ. 4, 4.

MCCREIGHT, E. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262–272.

MELICHAR, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition (ICPR '96). IEEE CS Press, Silver Spring, MD, 256–260. Preliminary version in Computer Analysis of Images and Patterns (LNCS, vol. 970, 1995).

MORRISON, D. 1968. PATRICIA—Practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4, 514–534.

MUTH, R. AND MANBER, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). LNCS, vol. 1075, Springer-Verlag, Berlin, 75–86.

MYERS, G. 1994a. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345–374. Preliminary version in Tech. Rep. TR90-25, Computer Science Dept., Univ. of Arizona, Sept. 1991.

MYERS, G. 1994b. Algorithmic Advances for Searching Biosequence Databases. Plenum Press, New York, 121–135.

MYERS, G. 1986a. Incremental alignment algorithms and their applications. Tech. Rep. 86-22, Dept. of Computer Science, Univ. of Arizona.

MYERS, G. 1986b. An O(ND) difference algorithm and its variations. Algorithmica 1, 251–266.

MYERS, G. 1991. An overview of sequence comparison algorithms in molecular biology. Tech. Rep. TR-91-29, Dept. of Computer Science, Univ. of Arizona.

MYERS, G. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395–415. Earlier version in Proceedings of CPM '98 (LNCS, vol. 1448).

NAVARRO, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 125–139.

NAVARRO, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 112–124.


NAVARRO, G. 1998. Approximate Text Searching. Ph.D. thesis, Dept. of Computer Science, Univ. of Chile. Tech. Rep. TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.

NAVARRO, G. 2000a. Improved approximate pattern matching on hypertext. Theor. Comput. Sci. 237, 455–463. Previous version in Proceedings of LATIN '98 (LNCS, vol. 1380).

NAVARRO, G. 2000b. Nrgrep: A fast and flexible pattern matching tool. Tech. Rep. TR/DCC-2000-3, Dept. of Computer Science, Univ. of Chile, Aug. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/nrgrep.ps.gz.

NAVARRO, G. AND BAEZA-YATES, R. 1998a. Improving an algorithm for approximate pattern matching. Tech. Rep. TR/DCC-98-5, Dept. of Computer Science, Univ. of Chile. Algorithmica, to appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/dexp.ps.gz.

NAVARRO, G. AND BAEZA-YATES, R. 1998b. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1, 2. http://www.clei.cl.

NAVARRO, G. AND BAEZA-YATES, R. 1999a. Fast multidimensional approximate pattern matching. In Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM '99). LNCS, vol. 1645, Springer-Verlag, Berlin, 243–257. Extended version to appear in J. Discrete Algor. (JDA).

NAVARRO, G. AND BAEZA-YATES, R. 1999b. A new indexing method for approximate string matching. In Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM '99). LNCS, vol. 1645, Springer-Verlag, Berlin, 163–185. Extended version to appear in J. Discrete Algor. (JDA).

NAVARRO, G. AND BAEZA-YATES, R. 1999c. Very fast and simple approximate string matching. Inf. Process. Lett. 72, 65–70.

NAVARRO, G. AND RAFFINOT, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algor. 5, 4. Previous version in Proceedings of CPM '98. Lecture Notes in Computer Science, Springer-Verlag, New York.

NAVARRO, G., MOURA, E., NEUBERT, M., ZIVIANI, N., AND BAEZA-YATES, R. 2000. Adding compression to block addressing inverted indexes. Kluwer Inf. Retrieval J. 3, 1, 49–77.

NEEDLEMAN, S. AND WUNSCH, C. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 444–453.

NESBIT, J. 1986. The accuracy of approximate string matching algorithms. J. Comput.-Based Instr. 13, 3, 80–83.

OWOLABI, O. AND MCGREGOR, R. 1988. Fast approximate string matching. Software Practice Exper. 18, 4, 387–393.

REGNIER, M. AND SZPANKOWSKI, W. 1997. On the approximate pattern occurrence in a text. In Proceedings of Compression and Complexity of SEQUENCES '97. IEEE Press, New York.

RIVEST, R. 1976. Partial-match retrieval algorithms. SIAM J. Comput. 5, 1.

SAHINALP, S. AND VISHKIN, U. 1997. Approximate pattern matching using locally consistent parsing. Manuscript, Univ. of Maryland Institute for Advanced Computer Studies (UMIACS).

SANKOFF, D. 1972. Matching sequences under deletion/insertion constraints. In Proceedings of the National Academy of Sciences of the USA, vol. 69, 4–6.

SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.

SANKOFF, D. AND MAINVILLE, S. 1983. Common Subsequences and Monotone Subsequences. Addison-Wesley, Reading, MA, 363–365.

SCHIEBER, B. AND VISHKIN, U. 1988. On finding lowest common ancestors: simplification and parallelization. SIAM J. Comput. 17, 6, 1253–1262.

SELLERS, P. 1974. On the theory and computation of evolutionary distances. SIAM J. Appl. Math. 26, 787–793.

SELLERS, P. 1980. The theory and computation of evolutionary distances: pattern recognition. J. Algor. 1, 359–373.

SHI, F. 1996. Fast approximate string matching with q-blocks sequences. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). Carleton Univ. Press, 257–271.

SUNDAY, D. 1990. A very fast substring search algorithm. Commun. ACM 33, 8, 132–142.

SUTINEN, E. 1998. Approximate Pattern Matching with the q-Gram Family. Ph.D. thesis, Dept. of Computer Science, Univ. of Helsinki, Finland. Tech. Rep. A-1998-3.

SUTINEN, E. AND TARHIO, J. 1995. On using q-gram locations in approximate string matching. In Proceedings of the 3rd Annual European Symposium on Algorithms (ESA '95). LNCS, vol. 979, Springer-Verlag, Berlin, 327–340.

SUTINEN, E. AND TARHIO, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). LNCS, vol. 1075, Springer-Verlag, Berlin, 50–61.

TAKAOKA, T. 1994. Approximate pattern matching with samples. In Proceedings of ISAAC '94. LNCS, vol. 834, Springer-Verlag, Berlin, 234–242.

TARHIO, J. AND UKKONEN, E. 1988. A greedy approximation algorithm for constructing shortest common superstrings. Theor. Comput. Sci. 57, 131–145.


TARHIO, J. AND UKKONEN, E. 1993. Approximate Boyer–Moore string matching. SIAM J. Comput. 22, 2, 243–260. Preliminary version in SWAT '90 (LNCS, vol. 447, 1990).

TICHY, W. 1984. The string-to-string correction problem with block moves. ACM Trans. Comput. Syst. 2, 4, 309–321.

UKKONEN, E. 1985a. Algorithms for approximate string matching. Information and Control 64, 100–118. Preliminary version in Proceedings of the International Conference Foundations of Computation Theory (LNCS, vol. 158, 1983).

UKKONEN, E. 1985b. Finding approximate patterns in strings. J. Algor. 6, 132–137.

UKKONEN, E. 1992. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci. 1, 191–211.

UKKONEN, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM '93), 228–242.

UKKONEN, E. 1995. Constructing suffix trees on-line in linear time. Algorithmica 14, 3, 249–260.

UKKONEN, E. AND WOOD, D. 1993. Approximate string matching with suffix automata. Algorithmica 10, 353–364. Preliminary version in Rep. A-1990-4, Dept. of Computer Science, Univ. of Helsinki, Apr. 1990.

ULLMAN, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Comput. J. 10, 141–147.

VINTSYUK, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 52–58.

WAGNER, R. AND FISCHER, M. 1974. The string to string correction problem. J. ACM 21, 168–178.

WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.

WEINER, P. 1973. Linear pattern matching algorithms. In Proceedings of the IEEE Symposium on Switching and Automata Theory, 1–11.

WRIGHT, A. 1994. Approximate string matching using within-word parallelism. Software Practice Exper. 24, 4, 337–362.

WU, S. AND MANBER, U. 1992a. Agrep—a fast approximate pattern-matching tool. In Proceedings of the USENIX Technical Conference. USENIX Association, Berkeley, CA, 153–162.

WU, S. AND MANBER, U. 1992b. Fast text searching allowing errors. Commun. ACM 35, 10, 83–91.

WU, S., MANBER, U., AND MYERS, E. 1995. A subquadratic algorithm for approximate regular expression matching. J. Algor. 19, 3, 346–360.

WU, S., MANBER, U., AND MYERS, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50–67. Preliminary version as Tech. Rep. TR29-36, Computer Science Dept., Univ. of Arizona, 1992.

YAO, A. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 368–387.

YAP, T., FRIEDER, O., AND MARTINO, R. 1996. High Performance Computational Methods for Biological Sequence Analysis. Kluwer Academic Publishers, Dordrecht.

ZOBEL, J. AND DART, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th ACM International Conference on Information Retrieval (SIGIR '96), 166–172.

Received August 1999; revised March 2000; accepted May 2000
