19
Journal of Algorithms 52 (2004) 82–100 www.elsevier.com/locate/jalgor Efficient algorithms for the scaled indexing problem Biing-Feng Wang, Jyh-Jye Lin, and Shan-Chyun Ku Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 30043, Republic of China Received 17 November 2003 Available online 6 May 2004 Abstract A real scaled occurrence of a pattern in a text is a position of the text at which the pattern occurs in some real scale 1. The real scaled indexing problem is to preprocess a text so that all real scaled occurrences of a pattern in the text can be found efficiently. Let T be a text of length n over a finite alphabet Σ . We show that with O(n 3 )-time preprocessing on T , using O(n 3 ) storage, all real scaled occurrences of a pattern P in T can be found in O(|P |+ U r ) time, where U r is the number of real scaled occurrences of P in T . The decision version of the real scaled indexing problem is to preprocess a text so that a query of the following form can be answered efficiently: “Does a pattern P have a real scaled occurrence in the text?” We show that with O(n 2 )-time preprocessing on T , using O(n 2 ) storage, whether a pattern P has a real scaled occurrence in T can be determined in O(|P |) time. The discrete scaled indexing problem is a restricted version of the real scaled indexing problem, in which only discrete scales are considered. For this restricted version, we show that with O(n log n)- time preprocessing on T , using O(n log n) storage, all discrete scaled occurrences of a pattern P in T can be found in O(|P |+ U d ) time, where U d is the number of discrete scaled occurrences of P in T . 2004 Elsevier Inc. All rights reserved. Keywords: String matching; Approximate string matching; Scaled matching; Suffix trees; Indexing 1. Introduction Given a text string and a pattern string, the string-matching problem is to find all occurrences of the pattern in the text. String matching is a fundamental problem in many This research is supported by the National Science Council of the Republic of China under grant NSC-92- 2213-E-007-021. * Corresponding author. Tel.: 886-3-5742805, fax: 886-3-5723694. E-mail address: [email protected] (B.-F. Wang). 0196-6774/$ – see front matter 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jalgor.2004.03.006

Efficient algorithms for the scaled indexing problem

Embed Size (px)

Citation preview

r

a

cursal

l

ttern

em,

many

2-

Journal of Algorithms 52 (2004) 82–100

www.elsevier.com/locate/jalgo

Efficient algorithms for the scaledindexing problem

Biing-Feng Wang,∗ Jyh-Jye Lin, and Shan-Chyun KuDepartment of Computer Science, National Tsing HuaUniversity, Hsinchu, Taiwan 30043, Republic of Chin

Received 17 November 2003

Available online 6 May 2004

Abstract

A real scaled occurrenceof a pattern in a text is a position of the text at which the pattern ocin some real scale 1. The real scaled indexing problemis to preprocess a text so that all rescaled occurrences of a pattern in the text can be found efficiently. LetT be a text of lengthn over afinite alphabetΣ . We show that withO(n3)-time preprocessing onT , usingO(n3) storage, all reascaled occurrences of a patternP in T can be found inO(|P | + Ur) time, whereUr is the numberof real scaled occurrences ofP in T . Thedecision version of the real scaled indexing problemis topreprocess a text so that a query of the following form can be answered efficiently: “Does a paP

have a real scaled occurrence in the text?” We show that withO(n2)-time preprocessing onT , usingO(n2) storage, whether a patternP has a real scaled occurrence inT can be determined inO(|P |)time. Thediscrete scaled indexing problemis a restricted version of the real scaled indexing problin which only discrete scales are considered. For this restricted version, we show that withO(n logn)-time preprocessing onT , usingO(n logn) storage, all discrete scaled occurrences of a patternP in T

can be found inO(|P |+Ud) time, whereUd is the number of discrete scaled occurrences ofP in T . 2004 Elsevier Inc. All rights reserved.

Keywords:String matching; Approximatestring matching; Scaled matching; Suffix trees; Indexing

1. Introduction

Given a text string and a pattern string, thestring-matching problemis to find alloccurrences of the pattern in the text. String matching is a fundamental problem in

This research is supported by the National ScienceCouncil of the Republic of China under grant NSC-92213-E-007-021.

* Corresponding author. Tel.: 886-3-5742805, fax: 886-3-5723694.E-mail address:[email protected] (B.-F. Wang).

0196-6774/$ – see front matter 2004 Elsevier Inc. All rights reserved.doi:10.1016/j.jalgor.2004.03.006

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 83

rching: theionsre toatmost

e

so

ndels.s of arn aretringorithmedmateatching

ap.rt non-ientwhere

ales.gedsed

time.two-

, Amircaledey

sional

puterppearalso be

important computer applications. The most obvious such application is context seawithin a text editor. There are two classic linear-time algorithms for string matchingKnuth–Morris–Pratt algorithm [16] and the Boyer–Moore algorithm [11]. For applicatin which a fixed text is queried many times, it is worthwhile to build a data structuallow fast queries. Given a text string, theindexing problemis to preprocess the text so tha search for an on-line pattern can be done efficiently. The suffix tree [20,21] is thewell-known data structure that admits efficient on-line string searches. LetT be a text oflengthn over an alphabetΣ . Thesuffix treeof T is essentially a trie [15] storing all thsuffixes ofT . Assuming thatΣ is finite, the suffix tree ofT can be built inO(n) timeusingO(n) storage so that a search for a patternP can be done inO(|P |+U) time, where|P | is the length ofP andU is the number of occurrences ofP in T . Another importantproblem isdictionary matching[2,7], which is to preprocess a dictionary of patternsthat all occurrences of the preprocessed patterns in a text can be found efficiently.

Recently, a variety of approximate stringmatching problems have been defined astudied. Abrahamson [1] investigated a generalization of string matching, in which thpattern is a sequence of pattern elements, each compatible with a set of symboFredriksson and Ukkonen [14] considered the problem of finding the occurrencetwo-dimensional pattern in a two-dimensional text when also rotations of the patteallowed. Landau and Vishkin [18] gave efficient serial and parallel algorithms for smatching in the presence of errors. Krithivasan and Sitalakshmi [17] proposed an algfor two-dimensional string matching in the presence of errors. Their result was improvin [6,8]. Cole and Hariharan [12] presented efficient algorithms for finding all approximatches of a pattern in a text, where the edit distance between the pattern and the mtext substring is at mostk. A swap versionof a patternP is a string derived fromP bya series of local swaps, where each symbolcan participate in no more than one swThe pattern matching with swaps problemis that of finding every location in a text fowhich there exists a swap version of a given pattern. Amir et al. [3] proposed the firstrivial result for this problem. Latter, Amir, Lewenstein, and Porat [10] gave an efficalgorithm that counts the number of necessary swaps at every location in the textthere is a swapped matching.

Another important approximate string matching problem is called thescaled matchingproblem, which is to find all occurrences of a pattern in a text in all possible scA restricted version of the scaled matching problem is called thediscrete scaled matchinproblem, in which only discrete scales are considered. This restricted version was studifor the first time by Amir et al. [9]. They showed that a simple algorithm propoby Eilam-Tzoreff and Vishkin [13] can be adapted to solve the problem in linearFurthermore, for finite alphabets, they presented a linear-time algorithm for thedimensional discrete scaled matching problem. Later, by using a new techniqueand Calinescu [5] proposed another algorithm for the two-dimensional discrete smatching problem. Their algorithm requireslinear time and is alphabet-independent. Thalso generalized the new technique to obtain an efficient algorithm for two-dimendiscrete scaled dictionary matching.

The scaled matching problem was originally inspired by applications in comvision [4]. The above results dealt with discrete scales only. In reality, an object may ain non-discrete scales. Therefore, to be truly practical, non-discrete scales should

84 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

ebutngl scale

atet al.stion:dl

s

f adtion

ngdwithdmeg

s ofle isxtendng and

ns

nt thegive7, we

considered. Amir et al. [4] pioneered the study on real scales. In order to avoid somconceptual problems, they considered only scales 1, that is, a pattern can be enlargednot reduced. They gave a natural definition forproportionally enlarging a pattern accordito a real scale and showed that finding all occurrences of a pattern in a text in any reacan be done in linear time. Thereal scaled indexing problemis to preprocess a text so thall occurrences of a pattern in the text in any real scale can be found efficiently. Amir[4] defined the real scaled indexing problem and posed the following interesting que“Can we efficiently preprocess a text in a manner that will enable finding all real scaleoccurrences of a patternP in the text inO(|P | + Ur) time, whereUr is the number of reascaled occurrences ofP in T ?”

The real scaled indexing problem is the focus of this paper. As in [4], only scale 1are considered. LetT be a text of lengthn over a finite alphabetΣ . We show that withO(n3)-time preprocessing onT , usingO(n3) storage, all real scaled occurrences opatternP in T can be found inO(|P | + Ur) time, whereUr is the number of real scaleoccurrences ofP in T . This is the first result that satisfies the query time in the quesposed by Amir et al. We can do better for thedecision version of the real scaled indexiproblem, which is to preprocess a text so that a query of the following form can be answereefficiently: “Does a patternP have a real scaled occurrence in the text?” We show thatO(n2)-time preprocessing onT , usingO(n2) storage, whether a patternP has a real scaleoccurrence inT can be determined inO(|P |) time. Thediscrete scaled indexing probleis a restricted version of the real scaled indexing problem, in which only discrete scales arconsidered. For this restricted version, we show that withO(n logn)-time preprocessinon T , usingO(n logn) storage, all discrete scaled occurrences inT of a patternP canbe found inO(|P | + Ud) time, whereUd is the number of discrete scaled occurrenceP in T . In Amir et al.’s definition, the enlarging of a pattern according to a real scadefined by using truncation as the approximation function. In this paper, we also ethe proposed real scaled indexing algorithm to many other functions such as roundiceiling.

The remainder of this paper is organized asfollows. In Section 2, necessary notatioand definitions are given. In Section 3, we present theO(n3)-time algorithm for thereal scaled indexing problem. In Section 4, we present theO(n2)-time algorithm forthe decision version of the real scaled indexing problem. In Section 5, we preseO(n logn)-time algorithm for the discrete scaled indexing problem. In Section 6, wean extension of the real scaled indexing algorithm in Section 3. Finally, in Sectionconclude this paper with some final remarks.

2. Preliminaries

Let Σ be analphabetconsisting of a finite number ofsymbols. Thesizeof Σ , denotedby |Σ|, is the number of symbols inΣ . A string overΣ is a sequence of symbols inΣ .The lengthof a stringX, denoted by|X|, is the number of symbols inX. We assumethat a stringX is represented by an array such thatX(i) is the ith symbol ofX, where1 i |X|.

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 85

ls

tation.

aded

g

ed. Inairs.

path

Let X be a string of lengthm. A substringof X is a contiguous block of symboin X. For any indicesi and j such that 1 i j m, the substring ofX consistingof X(i),X(i + 1), . . . , andX(j) is denoted byX(i, j). A substring of the formX(1, i),where 1 i m, is a prefix of X, whereas a substring of the formX(j,m), where1 j m, is a suffix of X. A string Y occurs in X at position i if Y is equal toX(i, i + |Y | − 1).

The concatenationof two stringsX andY is XY , which is the sequence definingXfollowed by the sequence definingY . Denote the stringaa . . . a, in which the symbolais repeatedr times, byar . The run-length representationof a stringY is y

r11 y

r22 . . . y

rpp

such thatY = yr11 y

r22 . . . y

rpp , yi = yi+1 for 1 i < p, andrj > 0 for 1 j p. For easy

description, in this paper, we always describe a string by using its run-length represenLet Y = y

r11 y

r22 . . . y

rpp be a string. For any realk ∈ [1,∞), thek-scalingof Y , denoted

by δk(Y ), is yr1k1 y

r2k2 . . . y

rpkp . A real scaled occurrenceof Y in a stringX is a position

i of X, where 1 i |X|, at whichδk(Y ) occurs for somek ∈ [1,∞). A discrete scaledoccurrenceof Y in X is a positioni of X, where 1 i |X|, at whichδk(Y ) occurs forsomek ∈ Z+.

The real scaled indexing problemis to preprocess a stringT , called thetext, so thatall real scaled occurrences inT of an on-line stringP , called thepattern, can be foundefficiently. Thedecision version of the real scaled indexing problemis to preprocesstext T so that whether a patternP has a real scaled occurrence inT can be determineefficiently. Thediscrete scaled indexing problemis a restricted version of the real scalindexing problem, in which only discrete scaled occurrences of the given patternP areneeded to be found.

3. An algorithm for the real scaled indexing problem

Let T be a text of lengthn. In this section, we show that withO(n3)-time preprocessinon T , usingO(n3) storage, all real scaled occurrences of a patternP in T can be foundin O(|P | + Ur) time, whereUr is the number of real scaled occurrences ofP in T . Forconvenience, we say that a patternP can r-matcha stringS if there existsk ∈ [1,∞)

such thatδk(P ) = S. If a patternP has a real scaled occurrence at a positionl of T , where1 l n, we callP avalid patternand(P, l) avalid pattern–position pair. In Section 3.1,the data structure that we use for the real scaled indexing problem is introducSection 3.2, anO(n3) upper bound is given on the number of valid pattern–position pIn Section 3.3, we present theO(n3)-time algorithm for preprocessingT .

3.1. The real scaled indexing tree

A trie [15] associated with a setS1, S2, . . . , Sm of strings is a rooted tree such that

1. Each edge is labeled with a symbol inΣ , and it is directed away from the root.2. No two edges emanating from the same vertex have the same label.3. Each vertexv represents the string obtained by concatenating the labels on the

from the root tov.

86 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

d the

s.n–

tly,g annd

off

t

Fig. 1. A RSIT storing(ab,3), (ab,4), (aa,2), (aaab,2), (bcc,5), (ccca,1), (cccc,7).

4. EachSi is represented by some vertex, where 1 i m.5. Each leaf represents someSi , where 1 i m.

The data structure that we use for the real scaled indexing problem is callereal scaled indexing tree, which is defined as follows. Areal scaled indexing tree(abbreviated to RSIT) that stores a set(P1, l1), (P2, l2), . . . , (Pm, lm) of valid pattern–position pairs is a trie associated withP1,P2, . . . ,Pm; in addition, eachli is storedin the vertex representingPi , where 1 i m. For example, a RSIT that store(ab,3), (ab,4), (aa,2), (aaab,2), (bcc,5), (ccca,1), (cccc,7) is depicted in Fig. 1The real scaled indexing tree ofT is the RSIT that stores the set of all valid patterposition pairs.

In a top–down traversal on a RSIT, we need to proceed from a vertexv to a childu suchthat the label of the edge(v,u) is a given symbol. To support such a traversal efficienfor each vertexv with more than one outgoing edge, we represent its children by usinarray of length|Σ| such that the childu correspondent with a given symbol can be fouin O(1) time. In such a representation, searching the vertex representing a given patternP

takesO(|P |) time. Thus, given the RSIT ofT , we can find the real scaled occurrencesa patternP in O(|P | + Ur) time, whereUr is the number of real scaled occurrences oP

in T . Clearly, inserting a valid pattern–position pair(P, l) into a RSIT takesO(|P |) time.

3.2. An upper bound on the number of valid pattern–position pairs

For k ∈ [1,∞) andd ∈ Z+, let φ(k, d) be the largest integerr such thatkr d .For example,φ(1.5,5) = 3 and φ(1.5,6) = 4. By definition, it is easy to show thaφ(k, d) = d/k. The following lemma shows that for a fixedk ∈ [1,∞) and a givenstringS, there is at most one patternP such thatδk(P ) = S.

Lemma 1. Let k ∈ [1,∞) and S = ad11 a

d22 . . . a

dss . If there is a patternP such that

δk(P ) = S, P = ar1a

r2 . . . arss , whereri = φ(k, di) for 1 i s.

1 2

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 87

t

at

atem

onns.t

lid

Proof. Consider a fixedri , where 1 i s. Sinceδk(P ) = S, kri = di . Thus, by thedefinition ofφ, φ(k, di) ri . Sincek 1, k(ri + 1) kri + k > kri > di . Thus,φ(k, di) < ri + 1. Fromφ(k, di) ri andφ(k, di) < ri + 1, we conclude thatri = φ(k, di)

and thus the lemma holds.Lemma 2. Let S be a string andm be an integer such that1 m |S|. Then, there is amost one pattern of lengthm that canr-matchS.

Proof. Let S = ad11 a

d22 . . . a

dss . Suppose that there are two patternsP andP ′ of lengthm

such thatδk(P ) = δk′(P ′) = S for somek, k′ ∈ [1,∞). In the following, we prove thislemma by showing thatP = P ′. By Lemma 1,

P = ar11 a

r22 . . . ars

s and P ′ = ar ′1

1 ar ′2

2 . . . ar ′s

s ,

whereri = φ(k, di) andr ′i = φ(k′, di) for 1 i s. Without loss of generality, assume th

k k′. Sincek k′, ri r ′i for 1 i s. By combining this and(r1 + r2 + · · · + rs) =

(r ′1 + r ′

2 + · · · + r ′s ) = m, we conclude easily thatri = r ′

i for 1 i s. Thus,P = P ′ andthe lemma holds.

The upper bound is as follows.

Theorem 1. The number of valid pattern–position pairs of a textT of lengthn is O(n3).

Proof. By Lemma 2, there are at most|S| patterns that canr-match a stringS. Thus, foreach substringT (i, j), where 1 i j n, there are at most|T (i, j)| valid pattern–position pairs(P, i) such thatP can r-matchT (i, j). Therefore, in total, there aremost

∑1in

∑ijn |T (i, j)| = O(n3) valid pattern–position pairs. Thus, the theor

holds. The number of valid patterns cannot be larger than the number of valid pattern–positi

pairs. Thus, Theorem 1 also implies anO(n3) upper bound on the number of valid patterAn artificial example is givenas follows to show that theO(n3) bound is tight. Assume than is a multiple of 8. Consider an exampleT = Acn/2B, whereA = abab . . .ab, in whichab is repeatedn/8 times, andB = dede . . .de, in which de is repeatedn/8 times. In thefollowing, we show thatT hasΩ(n3) valid patterns by showing that there is a distinct vapattern for every triple(p, q, r), where 1 p,q, r n/4. Consider a fixed such(p, q, r).Let P = A(p,n/4)cn/4+qB(1, r) andk = (n/2)/(n/4+ q). Clearly, 1 k < 2. Since 1k < 2 and no two consecutive symbols inA andB are the same, we haveδk(A(p,n/4)) =A(p,n/4) andδk(B(1, r)) = B(1, r). Thus,δk(P ) = A(p,n/4)cn/2B(1, r), which is equalto T (p,n/4+n/2+r). Therefore,P is a valid pattern. SinceP = A(p,n/4)cn/4+qB(1, r)

is distinct for every triple(p, q, r), where 1 p,q, r n/4, T hasΩ(n3) valid patterns.

88 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

ingthee is

ere

ng

s

3.3. The algorithm

According to the definition of a RSIT, each leaf of the RSIT ofT represents a validpattern. The following lemma further shows that every vertex in the RSIT ofT representsa valid pattern.

Lemma 3. Let v be a vertex in the RSIT ofT and l be a position stored inv. Then, in theRSIT ofT , l is stored in all ancestors ofv.

Proof. LetP be the pattern represented byv. Sincel is stored inv, there isk ∈ [1,∞) suchthatδk(P ) occurs inT at the positionl. Let u be any ancestor ofv andP ′ be the patternrepresented byu. SinceP ′ is a prefix ofP , δk(P

′) is a prefix ofδk(P ). Thus,δk(P′) also

occurs inT at the positionl. Therefore,l is stored inu and the lemma holds.Since every vertex in the RSIT ofT represents a valid pattern and there areO(n3)

valid patterns, we conclude that the number of vertices in the RSIT ofT is O(n3). In theremainder of this section, we give anO(n3)-time algorithm to construct the RSIT ofT .The idea of our preprocessing algorithm is simple, which is to generate for each substrS of T all patterns that canr-matchS and then to insert the generated patterns intoRSIT of T . There are two difficulties in implementing the simple idea. First, therno obvious way to compute the patterns that canr-match a given substringS, althoughwe know from Lemma 2 that there are at most|S| such patterns. Second, since thareO(n3) patterns and the insertion of each pattern takesO(n) time, a straightforwardimplementation requiresO(n4) time.

Let k ∈ [1,∞). An integerd ∈ Z+ is k-invertible if there is an integerr ∈ Z+ suchthat kr = d; that is,d is k-invertible if kφ(k, d) = d . A string S = a

d11 a

d22 . . . a

dss is

k-invertibleif eachdi is k-invertible, where 1 i s. For ease of presentation, for a striS = a

d11 a

d22 . . . a

dss , we defineδ−1

k (S) = aφ(k,d1)

1 aφ(k,d2)

2 . . . aφ(k,ds)s . By definition, for any

k-invertiblestringS, δk(δ−1k (S)) = S. Let l be a position ofT , where 1 l n. We define

I∗(k, l) to be the longestk-invertible prefix ofT (l, n) and defineP ∗(k, l) = δ−1k (I∗(k, l)).

SinceI∗(k, l) is k-invertible,P ∗(k, l) is a valid pattern ofT . As we will show later, thereexists a set ofO(n2) patternsP ∗(k, l) such that the RSIT ofT can be built by doinginsertion for these patterns only. Before going on, the computation ofI∗(k, l) andP ∗(k, l)

is discussed. We computeI∗(k, l) andP ∗(k, l) as follows. LetT (l, n) = ad11 a

d22 . . . a

dqq . If

all di , where 1 i q , arek-invertible, we simply compute

I∗(k, l) = T (l, n) and P ∗(k, l) = aφ(k,d1)1 a

φ(k,d2)2 . . . a

φ(k,dq)q .

Otherwise, we compute

I∗(k, l) = ad11 a

d22 . . . a

dt−1t−1 a

kφ(k,dt)t and P ∗(k, l) = a

φ(k,d1)1 a

φ(k,d2)2 . . . a

φ(k,dt )t ,

wheret 1 is the smallest integer such thatdt is not k-invertible. Sinceφ(k, d) can beeasily computed asd/k in O(1) time for anyd ∈ Z+, the above computation takeO(q) = O(n) time.

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 89

fal

t

For an intervalX, we denote its left endpoint bye(X). An intervalX is left-closedif itincludese(X), and isleft-openotherwise. Forr d ∈ Z+, defineK(r, d) to be the set oreal numbersk ∈ [1,∞) such thatkr = d , which by definition is the left-closed interv[d/r, (d + 1)/r). We have the following lemma.

Lemma 4. Let k ∈ [1,∞) and P = ar11 a

r22 . . . a

rww such thatδk(P ) occurs at a positionl

of T , where1 l n. LetT (l, n) = ad11 a

d22 . . . a

dqq andK∗ = K(r1, d1)∩K(r2, d2)∩ · · ·∩

K(rw−1, dw−1). Then,P is a prefix ofP ∗(g, l) for anyg ∈ K∗ such thatg k.

Proof. Sinceδk(P ) occurs at the positionl of T , δk(P ) = akr11 a

kr22 . . . a

krww is a prefix

of T (l, n). Thus,w q, kri = di for 1 i < w, andkrw dw. Let g ∈ K∗ suchthatg k. Sinceg ∈ K∗, δg(P ) = a

d11 a

d22 . . . a

dw−1w−1 a

grww , which isg-invertible. Moreover,

sinceg k, grw krw dw and thusδg(P ) is a prefix ofT (l, n). Therefore,δg(P )

is a g-invertible prefix ofT (l, n). SinceI∗(g, l) is the longest such prefix,δg(P ) is aprefix of I∗(g, l). By the definition ofφ, it is easy to see that for a stringS and aprefix S′ of S, δ−1

g (S′) is a prefix ofδ−1g (S). Therefore,P = δ−1

g (δg(P )) is a prefix of

P ∗(g, l) = δ−1g (I∗(g, l)). Thus, the lemma holds.

For a stringS = ad11 a

d22 . . . a

dss , defineΓ (S) = e(K(j, di)) | 1 i s,1 j di.

SinceK(r, d) = [d/r, (d + 1)/r) for d r ∈ Z+, Γ (S) can be computed asdi/j | 1 i s,1 j di in O(|S|) time. The following lemma is the key of ourO(n3)-timealgorithm.

Lemma 5. LetP be a pattern that has a real scaled occurrence at a positionl of T , where1 l n. Then,P is a prefix ofP ∗(g, l) for someg ∈ Γ (T (l, n)).

Proof. Let P = ar11 a

r22 . . . a

rww andT (l, n) = a

d11 a

d22 . . . a

dqq . Let k ∈ [1,∞) such thatδk(P )

occurs at the positionl of T . Let K∗ = K(r1, d1) ∩ K(r2, d2) ∩ · · · ∩ K(rw−1, dw−1). Aswe had shown in the proof of Lemma 4,w q, kri = di for 1 i < w, andkrw dw.Sincekri = di for 1 i < w,k ∈ K∗. Since eachK(ri, di), where 1 i < w, is a left-closed interval,K∗ is a left-closed interval. Letg = e(K∗). Clearly,g is the left endpoinof someK(ri, di), where 1 i < w. The setΓ (T (l, n)) containse(K(j, di)) for 1 i s

and 1 j di . Thus,g ∈ Γ (T (l, n)). Sincek ∈ K∗, g k. Therefore, by Lemma 4,P isa prefix ofP ∗(g, l). Thus, the lemma holds.

Based upon Lemmas 3 and 5, we can construct the RSIT ofT by doing onlyO(n2)

insertions for the pairs(P ∗(g, l), l), where 1 l n andg ∈ Γ (T (l, n)). The constructionis described as follows.

Algorithm 1.Input: a textT of lengthn

Output: the real scaled indexing tree ofT

90 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

.

theoct

l ae

time

eITt

time

begin

1. create an empty RSIT2. for l ← 1 to n do3. begin4. computeΓ (T [l, n])5. for eachg ∈ Γ (T [l, n]) do6. begin7. P ∗(g, l) ← δ−1

k (I∗(g, l)), whereI∗(g, l) is the longestg-invertible prefix ofT (l, n)

8. insert(P ∗(g, l), l) into the RSIT9. storel into every ancestor of the vertex representingP ∗(g, l) in the RSIT

10. end11. end

end

Clearly, the computation time of Algorithm 1 isO(n3). We have the following theorem

Theorem 2. With O(n3)-time preprocessing on a textT , usingO(n3) storage, all realscaled occurrences of a patternP in T can be found inO(|P | + Ur) time, wheren is thelength ofT andUr is the number of real scaled occurrences ofP in T .

The size of the alphabetΣ plays an important role on the space requirement andsearch time of a RSIT. We had assumed thatΣ is finite. In case it is not true, we can dthe following. Assume thatΣ = 1,2, . . . , n. This assumption can be justified by the fathat we can always sort the symbols appearing inT , and assign to each distinct symbounique integer between 1 andn. For each vertexv with more than one outgoing edge, wassociate it with an array of sizen to store its children such that the childu correspondento a given integer can be found inO(1) time. In such a representation, the construction tis the same. Since there areO(n2) patternsP ∗(g, l), where 1 l n andg ∈ Γ (T [l, n]),the RSIT ofT hasO(n2) leaves. Thus, there at mostO(n2) internal vertices that arassociated with arrays to store their children.Therefore, the space requirement of the RSis still O(n3). Unfortunately, since we need to transform a patternP into a correspondenstring over1,2, . . . , n, the query time becomesO(|P | logn + Ur). However, in caseΣ is 1,2, . . . , n at the beginning, the transformation is unnecessary and the queryremainsO(|P | + Ur).

4. An algorithm for the decision version of the real scaled indexing problem

In this section, we show that withO(n2)-time preprocessing on a textT , usingO(n2)

storage, whether a patternP is a valid pattern can be determined inO(|P |) time, wherenis the length ofT .

Lemma 6. Any substring of a valid pattern is a valid pattern.

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 91

,

lyls.t

Proof. Let P be a valid pattern andP ′ be a substring ofP . SinceP is a valid patternthere existsk ∈ [1,∞) such thatδk(P ) is a substring ofT . Clearly,δk(P

′) is a substring ofδk(P ). Consequently,δk(P

′) is a substring ofT and thus the lemma holds.Lemma 7. For any valid patternP , there existsg ∈ Γ (T ) such thatδg(P ) occurs inT .

Proof. Let P = ar11 a

r22 . . . a

rww be a valid pattern. Forw 2, it is easy to see thatδg(P )

occurs inT for g = 1. Clearly,Γ (T ) contains 1. Thus, the lemma holds forw 2.Assume thatw > 2. SinceP is a valid pattern, there existk ∈ [1,∞) and a substringS = a

d11 a

d22 . . . a

dww of T such thatS = δk(P ). Let K∗ = K(r2, d2) ∩ K(r3, d3) ∩ · · · ∩

K(rw−1, dw−1). Sincekri = di for 1 < i < w, k ∈ K∗. Let g = e(K∗). Sincek ∈ K∗,g k. Clearly, g is the left endpoint of someK(ri, di), where 1< i < w. SinceS

is a substring ofT , Γ (T ) containse(K(j, di)) for 1 < i < w and 1 j di . Thus,g ∈ Γ (T ). Sinceg ∈ K∗, δg(P ) = a

gr11 a

d22 . . . a

dw−1w−1 a

grww . Sincegr1 kr1 d1 and

grw krw dw, δg(P ) is a substring ofS and thus is a substring ofT . Therefore, thelemma holds.

The idea of our preprocessing algorithm is to construct for everyg ∈ Γ (T ) a stringTg such that for any patternP , δg(P ) occurs inT if and only if P occurs inTg . Then,according to Lemma 7, whether a given pattern is valid can be determined by simpchecking whether it occurs in someTg , whereg ∈ Γ (T ). We begin to describe the detaiIn the remainder of this section, we letT = x

n11 x

n22 . . . x

ntt . Let $ be a special symbol no

in Σ . For eachg ∈ Γ (T ), the stringTg is obtained fromT by doing the following: for

1 i t , we replacexni

i by xφ(g,ni)i if ni is g-invertible and byxφ(g,ni)

i $xφ(g,ni)i otherwise.

For example, in the case ofT = a6b5c4 andg = 1.5, Tg = a4b3$b3c3.

Lemma 8. Letg ∈ Γ (T ) andP be a pattern such thatδg(P ) occurs inT . Then,P occursin Tg .

Proof. LetS = xupx

np+1p+1 x

np+2p+2 . . . x

nq−1q−1 xv

q be the substring ofT such thatδg(P ) = S, where1 p q t, u np , andv nq . By Lemma 1,

P = xφ(g,u)p x

φ(g,np+1)

p+1 . . . xφ(g,nq−1)

q−1 xφ(g,v)q .

Sinceδg(P ) = S, all u,np+1, np+2, . . . , nq−1, andv areg-invertible. Sinceni is g-inver-tible for p < i < q , according to the construction ofTg ,

X = xφ(g,np)p x

φ(g,np+1)

p+1 . . . xφ(g,nq−1)

q−1 xφ(g,nq)q

is a substring ofTg . Moreover, sinceu np andv nq , φ(g,u) φ(g,np) andφ(g, v) φ(g,nq). Thus,P is a substring ofX. Consequently,P is a substring ofTg and thus thelemma holds. Lemma 9. Letg ∈ Γ (T ) andP be a substring ofTg that does not contain $. Then,δg(P )

occurs inT .

92 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

to

g

em by

n time

gs.

Proof. SinceP does not contain$, according to the construction ofTg , it is easy to seethat

P = xupx

φ(g,np+1)

p+1 xφ(g,np+2)

p+2 . . . xφ(g,nq−1)

q−1 xvq

for somep, q , u, andv such that 1 p q t , u φ(g,np), v φ(g,nq), andni isg-invertible forp < i < q . Thus,

δg(P ) = xgup x

np+1p+1 x

np+2p+2 . . . x

nq−1q−1 x

gvq .

Sinceu φ(g,np), gu np . Similarly, gv nq . Thus,δg(P ) is a substring ofT .Therefore, the lemma holds.

By combining Lemmas 7, 8, and 9, we obtain the following theorem immediately.

Theorem 3. A patternP (overΣ) is a valid pattern if and only ifP occurs in someTg ,whereg ∈ Γ (T ).

Let Γ (T ) = g(1), g(2), . . . , g(z), where z = |Γ (T )|. Based upon Theorem 3,determine whether a given patternP has a real scaled occurrence inT is equivalentto determine whetherP occurs in some ofTg(1), Tg(2), . . . , Tg(z). A well-known simpletechnique to find the occurrences of a patternP in a setS1, S2, . . . , Sm of strings isto construct a stringS∗ = S1$S2$ . . .$Sm and then to do the finding by determininthe occurrences ofP in S∗. Let T ∗ = Tg(1)$Tg(2)$. . .$Tg(z). By using the abovesimple technique, we solve the decision version of the real scaled indexing problconstructing the suffix tree [20,21] ofT ∗ so that whether a given patternP occurs in someof Tg(1), Tg(2), . . . , Tg(z) can be determined inO(|P |) time.

The performance of the above solution is mainly dependent on the computatioand the length ofT ∗. EachTg has lengthO(n) and can be computed inO(n) time, whereg ∈ Γ (T ). Since|Γ (T )| = O(n),T ∗ has lengthO(n2) and can be computed inO(n2)

time. Since the length ofT ∗ is O(n2), constructing the suffix tree ofT ∗ requiresO(n2)

time andO(n2) storage. Therefore, we have the following theorem.

Theorem 4. With O(n2)-time preprocessing on a textT , usingO(n2) storage, whether apatternP has a real scaled occurrence inT can be determined inO(|P |) time, wheren isthe length ofT .

Consider again the exampleT = Acn/2B in Section 3.2, whereA = (ab)n/8 andB = (de)n/8. As we had shown, allAcn/4+qB are valid patterns, where 1 q n/4.The total length of thesen/4 patterns isO(n2). Clearly, any string having all thesen/4patterns as substrings is of lengthΩ(n2). Therefore, for an arbitrary textT of lengthn, it isimpossible to construct a string of lengtho(n2) that has all the valid patterns as substrin

5. An algorithm for the discrete scaled indexing problem

In this section, we show that withO(n logn)-time preprocessing on a textT , usingO(n logn) storage, all discrete scaled occurrences of a patternP in T can be found in

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 93

d

) is

of

eplyrselves

f

eruct a

ncef

ing

O(|P | + Ud) time, wheren is the length ofT andUd is the number of discrete scaleoccurrences ofP in T . For discrete scaled indexing, a patternP is called avalid patternifδk(P ) occurs inT for somek ∈ Z+. For convenience, we say that a pattern (or a stringtrivial if all symbols contained in it are the same and isnon-trivial otherwise.

Lemma 10. Let P be a trivial pattern. The set of positions ofT at whichP has discretescaled occurrences is the same with the set of positions ofT at whichP occurs.

Proof. SinceP is trivial, for anyk ∈ Z+,P is a prefix ofδk(P ). Thus, ifδk(P ) occurs ata positionl of T for somek ∈ Z+, P occurs atl too. On the other hand, any occurrenceP in T is a discrete scaled occurrence ofP in T . Therefore, the lemma holds.

According to Lemma 10, using the suffix tree ofT , we can efficiently find the discretscaled occurrences of any trivial pattern and for each occurrence found, we can simreport the scale as 1. Therefore, in the remainder of this section, we devote outo non-trivial patterns. LetT = x

n11 x

n22 . . . x

ntt . For 1 i t and 1 j ni , we define

π(i, j) to be the positionl such thatT (l, n) = xji x

ni+1i+1 . . . x

ntt . Consider a fixedk ∈ Z+.

Let l = π(i, j) be a position ofT at whichδk(P ) occurs for some non-trivial patternP ,where 1 i t and 1 j ni . SinceP is non-trivial,j must be a multiple ofk. Thus,it is concluded that a positionl of T , where 1 l n, can be the starting position othe k-scaling of a non-trivial pattern only whenl = π(i, ck) for somei andc. For easeof description, the positionsπ(i, ck) of T are said to bek-feasible, where 1 i t and1 c ni/k.

Let m = maxn1, n2, . . . , nt andD = 1,2, . . . ,m. For k > m, no position ofT isk-feasible. Thus, we only need to considerk ∈ D. Our preprocessing algorithm for thdiscrete scaled indexing problem is similar to the algorithm in Section 4. We constset T1, T2, . . . , Tm of strings such that for everyk ∈ D, a non-trivial patternP occursin Tk if and only if δk(P ) occurs inT . Moreover, there is a one-to-one correspondebetween the occurrences ofP in T1$T2$. . .$Tm and the discrete scaled occurrences oP

in T .Let k ∈ D. For convenience, we defineTk by giving a construction. For eachi, where

1 i t , let ω(k, i) bexni/k

i if k is a factor ofni and bexni/ki $x

ni/ki otherwise. First,

we obtain a stringXk from T by replacing eachxni

i by ω(k, i), where 1 i t . Then,we obtainTk from Xk by replacing every sequence of more than one consecutive$ by justone$. For example, in the case ofT = a6c5a1c1a1b3 andk = 2, Xk = a3c2$c2$$$b1$b1

and Tk = a3c2$c2$b1$b1. For ak-feasible positionl = π(i, ck) of T , where 1 i t

and 1 c ni/k, we defineα(k, l) to be the position of thecth last symbol ofω(k, i)

in Tk . Figure 2 gives an illustration for the definition ofα(k, l). Note thatα is a one-to-onefunction.

With proofs similar to those of Lemmas 8 and 9, it is not difficult to obtain the followtwo lemmas.

Lemma 11. Let k ∈ D and P be a non-trivial pattern. Ifδk(P ) occurs inT at a posi-tion l,P occurs inTk at the positionα(k, l).

94 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

Fig. 2. An illustration forα(k, l).

Lemma 12. Let k ∈ D andP be a non-trivial pattern. IfP occurs inTk at a positionl′,δk(P ) occurs inT at the positionl, wherel is the position such thatl′ = α(k, l).

For a fixedk ∈ D, α(k, l) is unique for everyk-feasible positionl of T . Thus, bycombining Lemmas 11 and 12, we obtain the following.

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 95

f,

e-

tween

fnon-

Lemma 13. Let k ∈ D and P be a non-trivial pattern. LetLk be the set of positions oT at whichδk(P ) occurs. LetL′

k be the set of positions ofTk at whichP occurs. Then|Lk| = |L′

k| andLk = l | 1 l n,α(k, l) ∈ L′k.

Let P be a non-trivial pattern. For a fixedk ∈ D, Lemma 13 shows that there is a onto-one correspondence between the occurrences ofP in Tk and the occurrences ofδk(P )

in T . In the following, we further show that there is a one-to-one correspondence bethe occurrences ofP in T1$T2$. . .$Tm and the discrete scaled occurrences ofP in T .

Lemma 14. Let k, k′ ∈ D andP be a non-trivial pattern. If bothδk(P ) andδk′(P ) occurat the same position ofT , k = k′.

Proof. Suppose that bothδk(P ) andδk′(P ) occur at the same positionl = π(i, j) of T ,where 1 i t and 1 j ni . Sincel is bothk-feasible andk′-feasible,k andk′ arefactors of j . SinceP is non-trivial andδk(P ) occurs atl, x

j/ki xi+1 is a prefix ofP .

Similarly, xj/k′i xi+1 is a prefix ofP . Thus,j/k = j/k′. Therefore,k = k′ and the lemma

holds. From Lemmas 13 and 14, the following theorem is obtained.

Theorem 5. Let P be a non-trivial pattern. LetL∗ be the set of positions ofT at whichP has discrete scaled occurrences. For eachk ∈ D, let L′

k be the set of positions ofTk atwhichP occurs. Then,|L∗| = ∑

k∈D |L′k| andL∗ = l | 1 l n,α(k, l) ∈ L′

k, k ∈ D.

According to Theorem 5, the problem of finding the discrete scaled occurrences inT

of a non-trivial patternP is equivalent to the problem of finding the occurrences oP

in T1$T2$. . .$Tm. Therefore, we can solve the discrete scaled indexing problem fortrivial patterns by simply constructing the suffix tree ofT1$T2$. . .$Tm. The computationtime and the total length ofT1$T2$. . .$Tm are discussed as follows.

Lemma 15. For k ∈ D, the length ofTk is O(n/k).

Proof. Consider a fixedk ∈ D. Every symbol inTk is either inΣ or is $. By definition,eachω(k, i), 1 i t , contains at most 2ni/k symbols inΣ . Therefore,Tk containsO(2n1/k + 2n2/k + · · · + 2nt/k) = O(n/k) symbols inΣ . By combining this andthe fact that no two$ in Tk are consecutive, we conclude that the number of$ in Tk is alsoO(n/k). Therefore, the length ofTk is O(n/k) and thus the lemma holds.

By Lemma 15, the total length ofT1$T2$. . .$Tm is O(n/1 + n/2 + · · · + n/m) =O(n logm). Therefore, constructing the suffix tree ofT1$T2$. . .$Tm requiresO(n logm)

time and storage. Next, let us discuss the computation time ofT1$T2$ . . .$Tm. For k ∈ D,let Ik be the increasing sequence of the indicesi such thatni k, where 1 i t .For example, in the case ofT = a6b4c5a1c1b3c1a1c1 andk = 2, Ik = (1,2,3,6). Since|T | = n, for each k ∈ D, Ik contains at mostn/k elements. By definition,I1 =

96 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

.

g

tain

exingvalid

.

.

(1,2, . . . , t). Fork 2, Ik can be obtained fromIk−1 in O(|Ik−1|) time by a simple scanTherefore, allIk , wherek ∈ D, can be computed inO(n+n/1+n/2+· · ·+n/(m−1)) =O(n logm) time. For eachk ∈ D, given the sequenceIk , Tk can be computed by runninthe following algorithm.

Algorithm 2.Input: k andIk = (i(1), i(2), . . . , i(z))

Output: the stringTk

begin

1. if i(1) = 1 thenTk ← ∅ elseTk ← $ /* replacexn11 . . . x

ni(1)−1i(1)−1 by $ */

2. for j = 1 to z do3. begin4. computeω(k, i(j))

5. Tk ← Tkω(k, i(j)) /* replacexni(j)

i(j) by ω(k, i(j)) */

6. if i(j + 1) > i(j) + 1 thenTk ← Tk$ /* replacexni(j)+1i(j)+1 . . . x

ni(j+1)−1i(j+1)−1 by $ */

7. end

end

For convenience, we assume that|Ik| 1 andi(z + 1) = t + 1 in Algorithm 2. Foreachk ∈ D, Algorithm 2 computesTk in O(n/k) time. Therefore,T1$T2$ . . .$Tm can becomputed inO(n logm) time. Sincem n, based upon the above discussion, we obthe following theorem.

Theorem 6. With O(n logn)-time preprocessing on a textT , usingO(n logn) storage, alldiscrete scaled occurrences of a patternP in T can be found inO(|P | + Ud) time, wheren is the length ofT andUd is the number of discrete scaled occurrences ofP in T .

In Section 3, we had shown that the number of valid patterns for the real scaled indproblem isO(n3). To conclude this section, we give an upper bound on the number ofpatterns for the discrete scaled indexing problem.

Theorem 7. There areO(n2) patterns that have discrete scaled occurrences in a textT oflengthn.

Proof. According to Lemma 11, only substrings ofT1, T2, . . . , Tm can be valid patternsEachTk is of lengthO(n/k) and thus hasO((n/k)2) substrings, wherek ∈ D. Therefore,the number of valid patterns isO((n/1)2 + (n/2)2 + · · · + (n/m)2) = O(n2) and thus thetheorem holds.

Every substring ofT is a valid pattern andT may haveΩ(n2) distinct substringsTherefore, the upper bound in Theorem 7 is tight.

6. Real scaled indexing for other scaling functions

A scaling functionis a functionF : [1,∞) × Z+ → Z+ such thatF(k, r) r forany k ∈ [1,∞) and r ∈ Z+. In this section, we redefine thek-scaling of a stringY =

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 97

g

oldsthatma 5,e,

f

t

yr11 y

r22 . . . y

rpp to beδk(Y ) = y

F(k,r1)1 y

F(k,r2)2 . . . y

F (k,rp)p , wherek ∈ [1,∞) andF is a given

scaling function. Agood scaling functionis a scaling functionF such that

(I) F(k1, r) F(k2, r) for k1 < k2 ∈ [1,∞) andr ∈ Z+, and(II) F(k, r1) < F(k, r2) for k ∈ [1,∞) andr1 < r2 ∈ Z+.

Important examples of good scaling functions includeF(k, r) = kr, F(k, r) = kr, andF(k, r) = [kr], where[kr] denotes the nearest integer tokr. For the real scaled indexinalgorithm in Section 3, the scaling function was defined byF(k, r) = kr. In this section,we extend the algorithm to any good scaling function.

Throughout this section, we assume that the given scaling functionF is good. Fork ∈ [1,∞) and r ∈ Z+, let F(k+, r) = limx→k+ F(x, r), which is the limit ofF(x, r)

asx converges tok through values greater thank. By (I), it is easy to see thatF(k+, r)

exists for anyk ∈ [1,∞). For ease of analysis, we assume that fork ∈ [1,∞) andr ∈ Z+,bothF(k, r) andF(k+, r) can be computed inO(1) time.

We need to redefine some notations according toF . For k ∈ [1,∞) andd ∈ Z+, letφ(k, d) be the largest integerr such thatF(k, r) d . An integerd ∈ Z+ is k-invertible ifthere isr ∈ Z+ such thatF(k, r) = d . For r d ∈ Z+, let K(r, d) be the set ofk ∈ [1,∞)

such thatF(k, r) = d . It is easy to conclude from (I) thatK(r, d) is an interval.By using (II), it is easy to modify the proof of Lemma 1 to show that Lemma 1 h

for any good scaling function. Similarly, by using (I) and (II), we can also showLemmas 2–4 and Theorem 1 hold for any good scaling function. In the proof of Lemwe needK(r, d) to be left-closed for anyr d ∈ Z+. This is not always true. For examplin the case ofF(k, r) = kr, K(r, d) is [1,1] if r = d and is((d − 1)/r, d/r] otherwise.Therefore, some modification to Lemma 5 is necessary. For a stringS = a

d11 a

d22 . . . a

dss ,

we defineΓclosed(S) = e(K(j, di)) | K(j, di) is left-closed, 1 i s, 1 j di andΓopen(S) = e(K(j, di)) | K(j, di) is left-open, 1 i s, 1 j di. For example, inthe case ofF(k, r) = kr, Γclosed(S) = 1 andΓopen(S) = (di − 1)/j | 1 i s, 1 j di − 1. Note that|Γclosed(S)| + |Γopen(S)| |S|. The following is an extension oLemma 5.

Lemma 16. LetP be a pattern that has a real scaled occurrence at a positionl of T , where1 l n. Then,P is a prefix ofP ∗(g, l) for someg ∈ Γclosed(T (l, n)), or P is a prefix oflimx→h+ P ∗(x, l) for someh ∈ Γopen(T (l, n)).

Proof. Let P = ar11 a

r22 . . . a

rww andT (l, n) = a

d11 a

d22 . . . a

dqq . Let k ∈ [1,∞) such thatδk(P )

occurs at the positionl of T . LetK∗ = K(r1, d1)∩K(r2, d2)∩· · ·∩K(rw−1, dw−1). Sinceδk(P ) is a prefix ofT (l, n), w q , F(k, rw) dw, andk ∈ K∗. Since eachK(ri, di),1 i < w, is an interval,K∗ is an interval. Two cases are discussed:K∗ is left-closedand K∗ is left-open. Assume first thatK∗ is left-closed. Letg = e(K∗). Clearly, g isthe left endpoint of some left-closed intervalK(ri, di), where 1 i < w. Therefore,g ∈ Γclosed(T (l, n)). Sincek ∈ K∗, g k. Thus, by Lemma 4,P is a prefix ofP ∗(g, l).Next, assume thatK∗ is left-open. Leth = e(K∗). In this case,h is the left endpoinof some left-open intervalK(ri, di), where 1 i < w. Thus,h ∈ Γopen(T (l, n)). Since

98 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

ws.

chne

l

ck that

edsion is

e realnce of

t

e,dy. Thenysoftain

k ∈ K∗, (h, k] ⊆ K∗. Therefore, by Lemma 4, for anyx ∈ (h, k],P is a prefix ofP ∗(x, l).Thus,P is a prefix of limx→h+ P ∗(x, l), which completes the proof of this lemma.

Based upon Lemmas 3 and 16, we can construct the RSIT ofT by doingO(n2) inser-tions for the patternsP ∗(g, l) and limx→h+ P ∗(x, l), where 1 l n, g ∈ Γclosed(T (l, n)),andh ∈ Γopen(T (l, n)). The time complexity for the construction is analyzed as folloFor anyk ∈ [1,∞) andd ∈ Z+, φ(k, d) can be determined inO(d) time by simply check-ing the integers between 0 tod . Thus, eachP ∗(g, l) can be computed inO(n) time, where1 l n andg ∈ Γclosed(T (l, n)). SinceF(k+, r) exists and can be computed inO(1)

time for anyk ∈ [1,∞) andr ∈ Z+, the computation of each limx→h+ P ∗(x, l) is similarand also takesO(n) time, where 1 l n andh ∈ Γopen(T (l, n)). Inserting theO(n2)

patternsP ∗(g, l) and limx→h+ P ∗(x, l) into the RSIT takesO(n3) time, where 1 l n,g ∈ Γclosed(T (l, n)), and h ∈ Γopen(T (l, n)). Thus, except for the computation of eaΓclosed(T (l, n)) andΓopen(T (l, n)), where 1 l n, the whole construction can be doin O(n3) time. We have the following theorem.

Theorem 8. LetT be a text of lengthn andF be a good scaling function. WithO(t +n3)-time preprocessing onT , usingO(n3) storage, all real scaled occurrences ofP in T ,with respect toF , can be found inO(|P | + Ur) time, whereUr is the number of reascaled occurrences ofP in T and t is the time for computing eachΓclosed(T (l, n)) andΓopen(T (l, n)), where1 l n.

As we had mentioned,F(k, r) = kr, F(k, r) = kr, andF(k, r) = [kr] are threeimportant examples of good scaling functions. For these functions, it is easy to chethe timet in Theorem 8 isO(n2) and thus the preprocessing time isO(n3).

We remark that the algorithm in Section 4 for the decision version of the real scalindexing problem can also be extended to any good scaling function. The extensimilar and thus is omitted in this paper.

7. Concluding remarks

In Sections 4 and 5, we had used suffix trees to solve the decision version of thscaled indexing problem and the discrete scaled indexing problem. The performasuffix trees is highly dependent upon the size ofΣ . Let T be a text of lengthn. If |Σ|is finite, the suffix tree ofT can be constructed inO(n) time usingO(n) storage so thaa search for any patternP can be done inO(|P | + U) time, whereU is the number ofoccurrences ofP in T . However, if|Σ| is infinite, the construction time and search timrespectively, increase toO(n logn) andO(|P | logn + U). Manber and Myers [19] haproposed an alternative data structure to the suffix tree, which is called the suffix arrasuffix array ofT can be built inO(n logn) time usingO(n) storage so that a search for apatternP can be done inO(|P | + logn + U) time, whereU is the number of occurrenceof P in T . The construction time and search time of a suffix array are independentΣ .Therefore, in case|Σ| is infinite, using suffix arrays, instead of suffix trees, we can ob

B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100 99

nd the

ge2t

orkfor

escucreteuthors’al real

scaledth of

which

8 (6)

go-

0.–

nd

94)

eoret.

2.39.

6)

better search time for the decision version of the real scaled indexing problem adiscrete scaled indexing problem.

As we had mentioned in Section 3.3, for infiniteΣ , the preprocessing time and storain Theorem 2 are stillO(n3). However, for infiniteΣ , the search time in Theoremincreases toO(|P | logn + Ur). The suffix array of a textT is essentially a sorted lisof all the suffixes ofT . By using the technique behind the suffix array, for infiniteΣ , wecan reduce the search time for the real scaled indexing problem toO(|P | + logn + Ur).The idea is to construct a sorted list of the patternsP ∗(g, l), where 1 l n andg ∈ Γ (T (l, n)). By using the RSIT ofT , the sorted list can be easily obtained inO(n3)

time usingO(n3) storage. We leave the details as an exercise for interested readers.As in [4], only real scales 1 were discussed in this paper. One direction for future w

is to study scales< 1 for the scaled matching and indexing problem. Another directionfuture work is to study the problem of real scaled dictionary matching. Amir and Calin[5] had proposed efficient algorithms for the problems of two-dimensional disscaled matching and two-dimensional discrete scaled dictionary matching. To the aknowledge, no results have been proposed so far for the problems of two-dimensionscaled matching, two-dimensional real scaled indexing, and two-dimensional realdictionary matching. To design efficient algorithms for these problems is also worfurther study.

Acknowledgments

The author is grateful to the anonymous referees for their valuable comments,greatly improved the presentation of this paper.

References

[1] K. Abrahamson, Generalized stringmatching, SIAM J. Comput. 16 (6) (1987) 1039–1051.[2] A.V. Aho, M.J. Corasick, Efficient string matching: an aid to bibliographic search, Commun. ACM 1

(1975) 333–340.[3] A. Amir, A. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with Swaps, J. Al

rithms 37 (2000) 247–266.[4] A. Amir, A. Butman, M. Lewenstein, Real scaled matching, Inform. Process. Lett. 70 (1999) 185–19[5] A. Amir, G. Calinescu, Alphabet-independent and scaled dictionary matching, J. Algorithms 36 (2000) 34

62.[6] A. Amir, M. Farach, Efficient 2-dimensional approximate matching of half-rectangular figures, Inform. a

Comput. 118 (1995) 1–11.[7] A. Amir, M. Farach, R. Giancarlo, K. Park, Dynamic dictionary matching, J. Comput. System Sci. 49 (19

208–222.[8] A. Amir, G. Landau, Fast parallel and serial multidimensional approximate array matching, Th

Comput. Sci. 81 (1991) 97–115.[9] A. Amir, G.M. Landau, U. Vishkin, Efficient pattern matching with scaling, J. Algorithms 13 (1992) 2–3

[10] A. Amir, M. Lewenstein, E. Porat, Approximateswapped matching, Inform. Process. Lett. 83 (2002) 33–[11] R.S. Boyer, J.S. Moore, A fast string searching algorithm, Commun. ACM 20 (10) (1977) 762–772.[12] R. Cole, R. Hariharan, Approximate string matching: a simpler faster algorithm, SIAM J. Comput. 31 (

(2002) 1761–1782.

100 B.-F. Wang et al. / Journal of Algorithms 52 (2004) 82–100

oret.

gst.

0..

)

(1993)

.itching

[13] T. Eilam-Tzoreff, U. Vishkin, Matching patterns in a string subject to multi-linear transformation, TheComput. Sci. 60 (1988) 231–254.

[14] K. Fredriksson, E. Ukkonen, A rotation invariant filter for two-dimensional string matching, in: Proceedinof the 9th Annual Symposium on Combinatorial PatternMatching (CPM 98), in: Lecture Notes in CompuSci., vol. 1448, Springer, Berlin, 1998, pp. 118–125.

[15] E. Horowitz, S. Sahni, Fundamentals of Data Structures in Pascal, 4th ed., Freeman, 1994.[16] D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–35[17] K. Krithivasan, R. Sitalakshmi, Efficient two-dimensional pattern matching in the presence of errors, Inform

Sci. 13 (1987) 169–184.[18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (2) (1989

157–169.[19] U. Manber, G. Myers, Suffix array: a new method for on-line string searches, SIAM J. Comput. 22 (5)

935–948.[20] E.M. McCreight, A space-economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272[21] P. Weiner, Linear pattern matching algorithm, in: Proceedings of the 14th IEEE Symposium on Sw

and Automata Theory, 1973, pp. 1–11.