JOURNAL OF ALGORITHMS 24, 223–265 (1997)
ARTICLE NO. AL960844

Multi-Dimensional Pattern Matching with Dimensional Wildcards: Data Structures and Optimal On-Line Search Algorithms*

Raffaele Giancarlo†

Dipartimento di Matematica, Università di Palermo, Italy

and

Roberto Grossi‡

Dipartimento di Sistemi e Informatica, Università di Firenze, Italy

Received September 28, 1994; revised August 27, 1996

We introduce a new multidimensional pattern matching problem that is a natural generalization of string matching, a well studied problem (A. V. Aho, "Handbook of Theoretical Computer Science" (J. van Leeuwen, Ed.), pp. 257–295, Elsevier, Amsterdam, 1990). The motivation for its algorithmic study is mainly theoretical. Let A[1:n_1, ..., 1:n_d] be a text matrix with N = n_1 ⋯ n_d entries and B[1:m_1, ..., 1:m_r] be a pattern matrix with M = m_1 ⋯ m_r entries, where d ≥ r ≥ 1 (the matrix entries are taken from an ordered alphabet Σ). We study the problem of checking whether some r-dimensional submatrix of A is equal to B (i.e., a decision query). A can be preprocessed and B is given on-line. We define a new data structure for preprocessing A and propose CRCW-PRAM algorithms that build it in O(log N) time with N^2/n_max processors, where n_max = max(n_1, ..., n_d), such that the decision query for B takes O(M) work and O(log M) time. By using known techniques, we would get the same preprocessing bounds but an O(C(d, r) M) work bound for the decision query, where C(d, r) denotes the binomial coefficient (d choose r). The latter bound is undesirable since it can depend exponentially on d; our bound, in contrast, is independent of d and optimal. We can also answer, in optimal work, two further types of queries: (a) an enumeration query retrieving all the r-dimensional submatrices of A that are equal to B and (b) an occurrence query retrieving only the distinct positions in A that correspond to all of these submatrices. As a byproduct, we also derive the first efficient sequential algorithms for the new problem. © 1997 Academic Press

* Work partially supported by MURST of Italy. An extended abstract related to this paper has been presented at the 6th Annual Symposium on Combinatorial Pattern Matching, Helsinki, 1995.

† Part of this work was done while the author was a member of technical staff at AT&T Bell Laboratories, Murray Hill, NJ 07974, U.S.A. E-mail: [email protected].

‡ Part of this work was done while the author was visiting AT&T Bell Laboratories. E-mail: [email protected].

0196-6774/97 $25.00
Copyright © 1997 by Academic Press
All rights of reproduction in any form reserved.

1. INTRODUCTION

In its simplest and most classic version, multidimensional pattern matching is the following problem. We are given two d-dimensional matrices A and B. The "larger" of the two is the text and the "smaller" is the pattern. We want to know whether B occurs as a submatrix of A. The study of this problem has both theoretical and practical motivation. On the theoretical side, it provides a challenging generalization for the algorithmic and combinatorial tools used for string pattern matching. For instance, in pursuing this avenue, Amir and Benson introduced a notion of two-dimensional periodicity [2], which is the analog of periodicity for strings [27]. This notion has become increasingly important for the design of multidimensional pattern matching algorithms, to the extent of motivating its generalization to d > 2 dimensions and further study of its combinatorial nature [14, 17, 31]. From the practical point of view, the problem has applications in low-level image processing [32] and visual databases [21]. The case d = 2 seems to be well understood (see [3, 9, 14, 16] and references in those papers), whereas results for the case d > 2 seem rare (see [8, 24] and references in those papers).

Here we propose a generalization of multidimensional pattern matching and devise new data structures and algorithms for that generalization. We point out that both the motivations for the new problem and for its algorithmic study are mainly theoretical. The new data structure that we propose can be thought of as an index for a d-dimensional matrix. Intuitively, an index for a d-dimensional matrix is a tree that stores all submatrices of that matrix. A detailed definition of an index of a two-dimensional matrix is provided in [15]. It generalizes easily to the case d > 2. All the index data structures in this paper will satisfy that definition. We use an Arbitrary Concurrent Read Concurrent Write (CRCW) PRAM model of computation (e.g., see [22]), in which any number of processors is allowed to read from and to write to the same memory location simultaneously: in case of a write conflict, an arbitrary processor succeeds.

1.1. Statement of the Problem

Let a d-dimensional matrix A = A[1:n_1, 1:n_2, ..., 1:n_d] be the text, and an r-dimensional matrix B = B[1:m_1, 1:m_2, ..., 1:m_r] be the pattern, where d ≥ r ≥ 1. In the remainder of this paper, we say that A has dimension d,


size N = n_1 n_2 ⋯ n_d, and shape n_1 × n_2 × ⋯ × n_d. The same terminology applies to any other matrix. The size of B is denoted by M = m_1 m_2 ⋯ m_r. Both A and B have entries defined over an ordered alphabet Σ. To simplify the presentation of our ideas, we assume that n_i, m_j ≥ 2, for i = 1, ..., d and j = 1, ..., r. We remark, omitting the proof, that all of our results hold also for the case when some n_i's and m_j's are equal to one.

Intuitively, we wish to check whether B is equal to some r-dimensional submatrix of A. We now formulate this new pattern matching problem, which we refer to as multidimensional pattern matching with dimensional wildcards.

First, observe that B can be embedded into a d-dimensional matrix in C(d, r) possible ways. Figure 1 illustrates this observation for r = 3 and d = 4. We refer to each of those d-dimensional matrices as an embedding of B and denote the set of all those matrices by B*. We say that C ∈ B* occurs in position x = (x_1, x_2, ..., x_d) of A if and only if the d-dimensional submatrix of A, having the same shape as C and "upper left corner" in x, is identical to C. Again, we illustrate this fact by resorting to figures: the embedding in Fig. 1d of the pattern matrix occurs in position (1, 1, 1, 1) of the text matrix in Fig. 2 (the embedding in Fig. 1e also occurs in that position). This allows us to say that B* occurs in A if and only if there exists a position x of A such that at least one embedding in B* occurs in x. For instance, for the pattern matrix in Fig. 1a, B* occurs in the text matrix A in Fig. 2. The problem we address can be formally stated as: given text A and pattern B, check whether B* occurs in A. We refer to this formulation as the decision version of the problem, and to the one in which we are required to give all submatrices of A that are equal to matrices in B* as the enumeration version. Notice that some positions of A may be the origin of many submatrices equal to matrices in B*. However, we may be interested in knowing only which positions of A are occurrences, irrespective of which embeddings appear there. When we are asked to report only those positions, we have the occurrence version of our problem. From now on, we will consider the decision version only, except when we specifically mention the enumeration and occurrence versions.
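To make the notion of an embedding concrete, the following sketch (Python, with illustrative names of our own choosing, e.g. `embeddings`, not taken from the paper) generates the shapes of all d-dimensional embeddings of an r-dimensional matrix: choose which r of the d axes carry the pattern; the remaining d − r axes get extent 1, playing the role of the "*" dimensions of Fig. 1.

```python
from itertools import combinations

def embeddings(pattern_shape, d):
    """Return the shapes of all d-dimensional embeddings of an
    r-dimensional matrix: each embedding picks which r of the d
    axes carry the pattern; the other axes have extent 1."""
    r = len(pattern_shape)
    assert d >= r >= 1
    result = []
    for axes in combinations(range(d), r):
        shape = [1] * d
        for axis, extent in zip(axes, pattern_shape):
            shape[axis] = extent
        result.append(tuple(shape))
    return result

# The 2 x 2 x 2 pattern of Fig. 1a has C(4, 3) = 4 embeddings:
print(embeddings((2, 2, 2), 4))
# [(2, 2, 2, 1), (2, 2, 1, 2), (2, 1, 2, 2), (1, 2, 2, 2)]
```

For the pattern of Fig. 1 and d = 4, this produces exactly the four shapes of Figs. 1b–1e.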

In regular multidimensional pattern matching, both the pattern matrix and the "candidate" submatrices of the text have the same number of dimensions d. The "wildcard" generalization that we propose here allows the pattern to be an r-dimensional matrix, r ≤ d, and then asks for a "comparison" of all possible ways in which that matrix can match a d-dimensional submatrix of A. In some sense, our problem considers all "degrees of freedom" that one has in matching an r-dimensional pattern matrix with a d-dimensional submatrix of the text. We point out that


FIG. 1. (a) A three-dimensional pattern matrix of shape 2 × 2 × 2. (b)–(e) All four-dimensional embeddings of the matrix in (a). They are four-dimensional matrices of shape 2 × 2 × 2 × 1, 2 × 2 × 1 × 2, 2 × 1 × 2 × 2, and 1 × 2 × 2 × 2, respectively. The * symbol denotes a dimension "that is not really there." We will adopt the following conventions to draw a four-dimensional matrix. In each matrix, the elements on each of the x_1–x_2 planes are skewed, while those planes are separated by space. The three-dimensional planes x_1–x_2–x_3 are separated by dotted lines (they divide the first and second plane in this figure).

current algorithms for multidimensional pattern matching would not deal efficiently with our problem, when they can be extended to work for it at all.

We point out that this problem has some analogy to string pattern matching with don't care symbols, introduced by Fischer and Paterson [13]. Intuitively, the "*" symbol in the matrices of Figs. 1b–1e is a dimensional wildcard, marking a dimension "we do not care about" in the pattern matching process, while in [13] the "*" symbol is a wildcard character matching any character of the alphabet, marking positions of the pattern and/or text string "we do not care about" in the pattern matching process.

As with many pattern matching problems (see for instance [26, 29]), there are two variants of the problem: off-line and on-line.


FIG. 1 (continued).

Off-line. The pattern and the text are presented together. In this case, obtaining a "satisfactory" algorithm is easy. Assign |B*| processors to each of the N positions of A. For each position, each processor checks sequentially whether a given embedding in B* occurs in that position. The total work is bounded by O(|B*| N M) = O(2^d N M), since |B*| ≤ max_{r ∈ [1, d]} C(d, r) ≤ 2^d. (The algorithm just outlined solves also the enumeration and occurrence versions.) For the special cases d = 2, 3, better solutions are available by using the algorithms in [9, 24].
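A direct sequential transcription of this brute-force check can be sketched as follows (Python; the helpers `get`, `dims`, and `occurs` are illustrative names of our own, and the nested-list matrix representation is an assumption, not the paper's):

```python
from itertools import combinations, product

def get(mat, idx):
    """Entry of a nested-list matrix at index tuple idx (0-based)."""
    for i in idx:
        mat = mat[i]
    return mat

def dims(mat):
    """Shape of a nested-list matrix."""
    shape = []
    while isinstance(mat, list):
        shape.append(len(mat))
        mat = mat[0]
    return tuple(shape)

def occurs(text, pattern):
    """Off-line decision: does some embedding of the r-dimensional
    pattern occur in the d-dimensional text?  Total work is
    O(|B*| N M), as for the naive algorithm above."""
    tshape, pshape = dims(text), dims(pattern)
    d, r = len(tshape), len(pshape)
    for axes in combinations(range(d), r):       # one embedding per axis choice
        eshape = [1] * d
        for a, m in zip(axes, pshape):
            eshape[a] = m
        # try every position x with x + eshape - 1 inside the text
        for x in product(*(range(n - e + 1) for n, e in zip(tshape, eshape))):
            if all(get(text, tuple(xi + oi for xi, oi in zip(x, off)))
                   == get(pattern, tuple(off[a] for a in axes))
                   for off in product(*(range(e) for e in eshape))):
                return True
    return False
```

For instance, in the 2 × 2 text [[1, 2], [3, 4]] the one-dimensional pattern [2, 4] occurs (as a column), while [2, 3] does not occur as any row or column.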

On-line. The text is preprocessed to build an index data structure. The pattern is given on-line, and we must check whether B* occurs in A by querying the index. The query may be repeated with different patterns on the same text. We refer to this query as the decision query. It is desirable that the work spent in answering the decision query be independent of the text size N (otherwise we lose the advantage of having preprocessed the text).

A solution can be obtained through standard algorithmic techniques. Indeed, in Section 3, we briefly describe the raw index R_A for matrix A. It


FIG. 2. A text matrix A[1:2, 1:2, 1:2, 1:4].

is a forest of trees and it can be thought of as a generalization of the suffix tree data structure [29, 37] to higher dimensions. Let n_max = max(n_1, n_2, ..., n_d). The construction of R_A (the preprocessing step) requires O(log N) time with N^2/n_max processors, by using a generalization of the algorithm of Apostolico et al. for the construction of the suffix tree of a string [5, 16]. Recall that it has been shown in [15] that any sequential algorithm must require Ω(N^2/n_max) time to build an index for a d-dimensional matrix (the proof is for d = 2, but it generalizes easily to the case d > 2). Since R_A satisfies that definition of index, any sequential algorithm must require Ω(N^2/n_max) time to build it. Therefore, the parallel algorithm building R_A is a polylogarithmic factor away from optimal work. However, the decision query using R_A requires O(|B*| M) work (as we will briefly outline, resorting to standard algorithmic techniques, each embedding in B* needs to be searched in an appropriate and distinct tree of R_A in O(M) work). This is undesirable because the query work depends exponentially on d.


Consider now the enumeration query and the occurrence query (those two queries report the output for the corresponding versions of the problem). We anticipate that, when we use R_A, answering the decision query is not sufficient to compute the information required by the other two queries. The enumeration query can be solved in O(|B*| M + enum) work, where enum is the number of submatrices of A that are identical to matrices in B*. Once the list of submatrices is available, we can answer the occurrence query in an additional O(log enum / log log enum) time and O(enum) work. Notice that both those computations are output-bound (as, for instance, in [30] for Dictionary Matching). In order for them to be optimal, the work should be proportional both to the input and output sizes. Therefore, neither of those queries is solved optimally.

From now on, we will concentrate on the on-line version of our problem. We remark that although the off-line case has a "good" solution, it remains an open problem to find better ones. We also point out that once the index R_A is available, one can answer other kinds of queries that are related to the gathering of statistics about the text (for examples see [4, 15]). However, we will not consider those queries.

1.2. Our Results

Our main result is to show that we can achieve an optimal work decision query, i.e., O(M), at no expense to the preprocessing work and time. To this end, we introduce a new data structure, the heterogeneous index H_A, that is functionally equivalent to R_A. It can be built in O(log N) time with N^2/n_max processors and can be queried in O(log M) time with M/log M processors. We observe that H_A also belongs to the class of index data structures defined in [15] and therefore it must satisfy the already stated Ω(N^2/n_max) bound on work. Therefore, the algorithm building it is a logarithmic factor away from optimal work. It remains a challenging open problem to devise more efficient data structures for the representation of all submatrices of A. However, the optimal work query obtained here can be regarded as a fundamental contribution of our paper. Indeed, it is not difficult to imagine situations where we have plenty of time to preprocess the text, but the queries must be answered quickly (see for instance [36, 37]).

When we have answered the decision query, we can provide the output for the enumeration query in O(enum) additional work with p processors, 1 ≤ p ≤ enum. This is optimal. As for the output for the occurrence query, we need a slightly more complicated preprocessing to build auxiliary data structures once H_A is given. The additional preprocessing still requires O(log N) time with N^2/n_max processors. Then, once we have answered


the decision query, we can report in an array the occ positions of A in which an embedding of B occurs, in O(log occ / log log occ) time and O(occ) work. Again, this is optimal. We point out that, if we are willing to allow some "freedom" in the output format, the above bounds reduce to O(occ) work with p processors, 1 ≤ p ≤ occ. This point is discussed in detail in Section 10. It is worth noting that occ = enum / C(d, r) in the worst

case.

H_A is still a forest of trees. Intuitively, each tree in H_A suitably combines several trees of R_A. As we will show, the advantage of using H_A instead of R_A to check whether B* occurs in A is that we will need at most two searches in distinct trees of H_A, instead of searching in at most |B*| = C(d, r) distinct trees of R_A. Such an advantage is even more evident for output-bound computations, since we can provide the output for the enumeration and occurrence queries in optimal work, mainly because we can limit the decision query to two trees.

Both the algorithms for the construction of H_A and for its "query" are based on generalizations of the naming technique by Karp et al. [23] for strings and of the algorithms for the parallel construction of the suffix tree by Apostolico et al. [5]. While the generalization of the latter technique is easy, the generalization of the former is not so immediate.

The main subproblem we need to solve to get the auxiliary data structures to report, on-line, all distinct positions of A in which some matrix in B* occurs is related to the Subtree Max Gap Problem introduced in [7]. However, the techniques used there do not seem to extend to our subproblem. Another subproblem we "encounter" to get the auxiliary data structures is Space Compaction (for this kind of problem, see [19, 20, 28] and references therein). It is interesting to note that Interval Allocation (see [19, 20] for a definition), which is related to Space Compaction, appears as a subproblem in the output-bound computation for Dictionary Matching [30].

In what follows, for the sake of clarity of exposition, we present all data

structures and algorithms for the case d = 4. That is, we assume that the text is a four-dimensional matrix A[1:n_1, 1:n_2, 1:n_3, 1:n_4]. We also assume that n_max = n_d = n_4. The results generalize easily to arbitrary values of d and to the cases in which n_max ≠ n_d. Furthermore, we assume that n_1 ≥ n_2 ≥ n_3 (if not, it suffices to "relabel" the "dimensions" of A so that the inequalities are satisfied).

Throughout this paper, we assume that the alphabet Σ, from which the entries of both the text and pattern matrix are drawn, is a set of integers in [1, N]. Moreover, to fix ideas, we assume that the text matrix A is mapped in a table in main memory in column major order. That is, position


x = (x_1, x_2, x_3, x_4) of A is in entry f(x) = (x_4 − 1)(n_1 × n_2 × n_3) + (x_3 − 1)(n_1 × n_2) + (x_2 − 1) n_1 + (x_1 − 1) of the table. For simplicity, we also assume that the pattern matrix is stored in that order.

Let I be the interval [1:n_1] × [1:n_2] × [1:n_3]. For all l = (l_1, l_2, l_3) in I, we encode l into an integer by a map function similar to f. From now on, we assume that when working with l and x, we are in fact using the integers corresponding to them.
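The map f is ordinary column-major (Fortran-style) linearization, with the first coordinate varying fastest. A direct transcription can be sketched as follows (Python, specialized to d = 4 as in the text, with 1-based coordinates and a 0-based table address):

```python
def f(x, n):
    """Column-major table address of 1-based position x = (x1, x2, x3, x4)
    in a matrix of shape n = (n1, n2, n3, n4); addresses start at 0,
    so f((1, 1, 1, 1)) = 0."""
    (x1, x2, x3, x4), (n1, n2, n3, _) = x, n
    return ((x4 - 1) * n1 * n2 * n3
            + (x3 - 1) * n1 * n2
            + (x2 - 1) * n1
            + (x1 - 1))

shape = (2, 2, 2, 4)                 # the text matrix of Fig. 2: N = 32 entries
print(f((1, 1, 1, 1), shape))        # 0
print(f((2, 1, 1, 1), shape))        # 1   (first coordinate varies fastest)
print(f((2, 2, 2, 4), shape))        # 31  (last entry of the table)
```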

The remainder of the paper is organized as follows. In Section 2 we provide some well-known notions on compacted tries, needed to make the paper self-contained. R_A, the raw index for A, is presented in Section 3. The material in that section is quite simple and generalizes the concepts introduced in previous papers for the two-dimensional case [15, 16]. However, its introduction greatly simplifies the presentation of more sophisticated material in later sections. The heterogeneous index H_A and its properties are discussed in Section 4. Algorithms for its construction and for the decision query are presented in Sections 4–8. Sections 9 and 10 present the additional preprocessing and the algorithms for the enumeration and occurrence queries, respectively. The problem definition and Sections 4–10 represent the main contributions of this paper.

2. BASIC NOTIONS

2.1. Compacted Tries

For the convenience of the reader, we recall a few known facts about compacted tries for strings [25, 29] and the parallel algorithms for building and searching in them [5]. We state those facts in terms of strings over an alphabet Γ, which we leave unspecified. In later sections, we apply the notions introduced here to strings over some specific alphabets that we define.

Usually, compacted tries and related algorithms are described by implicitly using the standard equality relation between characters of an alphabet. For reasons that will be clear later, here we use instead an equivalence relation ≡, defined over the characters of Γ, that, for the time being, we leave unspecified. However, ≡ does not necessarily correspond to equality of characters. Two strings α = α[1:n] and β = β[1:n] of length n are equivalent (shortly, α ≡ β) if and only if α[i] ≡ β[i], for each i ∈ [1, n]. Two equivalent strings do not necessarily have equal characters in homologous positions. We say that a string α is a prefix, suffix, or substring of another string β if and only if it is equivalent to a prefix, suffix, or substring of β, respectively.


Let Y be a set of s ≥ 1 strings y_1, y_2, ..., y_s in Γ* such that no y_i is a prefix of any other y_j, with 1 ≤ i, j ≤ s and i ≠ j. (If y_i is a prefix of y_j, then append an endmarker symbol, not in Γ, to y_i.) A compacted trie representing the set of strings Y is the same as a Patricia tree for strings [25], except that we use ≡ instead of the standard equality relation:

(C0) No unary nodes are allowed, except possibly for the root.

(C1) Each arc is labeled with a nonempty substring y_i[j:k] ∈ Γ*, for some i ∈ [1, s] and j, k ∈ [1, |y_i|].

(C2) Sibling arcs are labeled with strings from Γ* that start with nonequivalent characters.

(C3) For each string y_i ∈ Y, with i ∈ [1, s], there is exactly one leaf v such that the concatenation, say γ, of the labels along the path from the root to v satisfies γ ≡ y_i. That leaf v is labeled with y_i.

We need some terminology. Given a compacted trie CT, consider a node u ∈ CT. Let β be the string obtained by concatenating the labels on the arcs found along the path from the root of CT to u. For each string α ≡ β, we say that u is the locus of α in CT. Notice that some strings may not have a locus in CT. Pick an edge (u, v) in CT and let u and v be the loci of strings β_u and β_v, respectively. We say that v is the extended locus of α if β_u is a proper prefix of α and, in turn, α is a prefix of β_v. Notice that, when α ≡ β_v, its locus and extended locus are the same node. Again, some strings may not have an extended locus in CT.

LEMMA 2.1. Let σ be a nonempty string from Γ* and let CT be the previously defined compacted trie representing the set Y. σ is a prefix of at least one string in Y if and only if σ has an extended locus v in CT. Moreover, all the strings in Y that have σ as a prefix are stored in the leaves descending from v.

A well known example of a compacted trie is the suffix tree of a string x [29, 37], where ≡ is the standard equality relation = and Y is the set of suffixes of x.
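To fix ideas, a compacted trie over an equivalence relation can also be built by a naive sequential insertion procedure, sketched below (Python). This is only an editorial illustration, not the parallel Refine procedure of Section 2.2; it assumes, as in the text, that no string of Y is a prefix of another (endmarkers appended), and it uses case-insensitive equality as a stand-in for ≡.

```python
class Node:
    def __init__(self):
        self.children = []            # list of (label, child) arcs

def insert(root, s, equiv):
    """Insert string s into the compacted trie rooted at `root`,
    comparing characters with `equiv` instead of plain equality.
    Assumes no inserted string is (equivalent to) a prefix of another."""
    node = root
    while True:
        for i, (label, child) in enumerate(node.children):
            if equiv(label[0], s[0]):               # (C2): at most one such arc
                k = 1
                while k < len(label) and k < len(s) and equiv(label[k], s[k]):
                    k += 1
                if k == len(label):                 # arc fully matched: descend
                    node, s = child, s[k:]
                else:                               # split the arc at depth k
                    mid = Node()
                    mid.children = [(label[k:], child)]
                    node.children[i] = (label[:k], mid)
                    node, s = mid, s[k:]
                break
        else:                                       # no sibling arc matches (C2)
            leaf = Node()
            node.children.append((s, leaf))         # (C3): one leaf per string
            return leaf

equiv = lambda a, b: a.lower() == b.lower()         # example equivalence relation
root = Node()
for y in ["abc$", "ABD$", "bc$"]:
    insert(root, y, equiv)
# The root gets two arcs ("ab" and "bc$"); the "ab" arc leads to a node
# with two arcs, since "abc$" and "ABD$" are equivalent on their first
# two characters only.
```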

In Sections 2.2 and 2.3, we give two "black boxes" that build and search in a compacted trie. They are easy extensions of the algorithms of Apostolico et al. for the construction of, and the search in, a suffix tree of a string [5] (again, we use ≡ rather than the standard equality relation).

2.2. Parallel Construction of a Compacted Trie

Apostolico et al. [5] introduced a refinement algorithm (hereafter referred to as Refine) that builds in parallel the suffix tree T_x for a string x on an Arbitrary CRCW PRAM. However, Refine can be suitably used to build in parallel a compacted trie CT over the alphabet Γ as defined in


(C0)–(C3). We describe the "functionality" of the algorithm by means of the "black box" Refine(Γ, ≡, Y, s, n, BB, Q, Name), omitting the details about its "inner workings."

There are three main inputs for a given Γ and ≡. (i) A set Y of s ≥ 1 nonequivalent strings y_1, y_2, ..., y_s ∈ Γ*. (We assume without loss of generality that they have the same length n, and that log n is an integer. If not, we pad the strings in Y with instances of a new character not in Γ.) (ii) A Q × Q noninitialized matrix BB, called the Bulletin Board, for a parameter Q ≥ max(s, n) to be specified when we use the "black box." (iii) A function Name, on which all the computation hinges. Its argument is a substring y_i[j : j + 2^f − 1] such that i ∈ [1, s], j ∈ [1, n], and j + 2^f − 1 ≤ n. It returns an integer in [1, Q], called the name of that substring; two argument substrings of the same length must receive the same name if and only if they are equivalent.

We remark that Apostolico et al. [5] also provided a "black box" to

compute names. It is a parallel version of the naming technique of Karp et al. [23]. We cannot use that "black box" directly for the "strings" we are interested in. Indeed, in order to achieve the claimed work bounds for the construction of our index, we need to suitably modify it (see Section 6).

The output of Refine(Γ, ≡, Y, s, n, BB, Q, Name) is a sequence of trees, called refinement trees D^(r), for r = log n, log n − 1, ..., 0. The final tree D^(0) is isomorphic to the compacted trie CT representing the strings in Y. We point out that all arcs and leaves of the refinement trees (including CT) are labeled with integers that encode substrings of strings in Y (see [5, 22] for details). The encoding is such that, given the label of an arc or leaf, the corresponding substrings can be identified in constant time.

THEOREM 2.1 (Apostolico et al. [5, 22]). Assuming that each call to function Name requires constant work, Algorithm Refine(Γ, ≡, Y, s, n, BB, Q, Name) correctly builds the compacted trie CT representing the set Y, together with the companion refinement trees D^(r) for r ∈ [0, log n], in O(log n) time with s processors.

We point out that the work bound in the previous theorem does notinclude the computation of the function Name.
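For intuition, the naming technique of Karp et al. [23] can be rendered sequentially as follows (a Python sketch with names of our own; the actual black box is parallel, and Section 6 modifies it further). A dictionary plays the role of the Bulletin Board BB: a substring of length 2^f is named by the pair of names of its two halves of length 2^(f-1), so two substrings of the same power-of-two length receive the same name exactly when they are equal (plain equality stands in for ≡ here).

```python
def names(y, max_log):
    """name[f][j] is an integer identifying the substring y[j : j + 2**f]
    (0-based); two substrings of the same power-of-two length get the
    same name iff they are equal."""
    n = len(y)
    table = {}                                   # stands in for the Bulletin Board
    def canon(key):
        return table.setdefault(key, len(table))
    name = [[canon(('char', c)) for c in y]]     # level f = 0: single characters
    for f in range(1, max_log + 1):
        half = 1 << (f - 1)
        name.append([canon((name[f - 1][j], name[f - 1][j + half]))
                     for j in range(n - (1 << f) + 1)])
    return name

nm = names("abab", 2)
print(nm[1][0] == nm[1][2])   # True: y[0:2] and y[2:4] are both "ab"
print(nm[1][0] == nm[1][1])   # False: "ab" differs from "ba"
```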

2.3. Parallel Search in a Compacted Trie

Given a string γ[1:m] ∈ Γ*, to find out whether γ is a prefix of any string in Y, we check whether γ has an extended locus in CT (by Lemma 2.1). The check can be done as follows.

Let r_0, r_1, ..., r_e, with e ≤ ⌊log m⌋, be the decreasing sequence of integers such that m = Σ_{j=0}^{e} 2^{r_j}. Moreover, let d_{-1} = 0 and d_j = Σ_{l=0}^{j} 2^{r_l}, for each j ∈ [0, e]. Consider the partition of γ into e + 1 substrings γ = γ_0 γ_1 ⋯ γ_e, where γ_j = γ[d_{j-1} + 1 : d_j] is a substring of length 2^{r_j}, with 0 ≤ j ≤ e.
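The decomposition of m into decreasing powers of two is simply its binary expansion; for example, m = 13 = 2^3 + 2^2 + 2^0 splits γ into the blocks γ[1:8], γ[9:12], γ[13:13]. A small sketch (Python, with an illustrative function name):

```python
def blocks(m):
    """Partition [1, m] into intervals (d_{j-1}+1, d_j) of decreasing
    power-of-two lengths 2**r_0 > 2**r_1 > ... > 2**r_e, following the
    binary expansion m = sum of the 2**r_j."""
    exps = [r for r in range(m.bit_length() - 1, -1, -1) if (m >> r) & 1]
    out, d = [], 0
    for r in exps:
        out.append((d + 1, d + (1 << r)))        # 1-based interval of gamma_j
        d += 1 << r
    return out

print(blocks(13))   # [(1, 8), (9, 12), (13, 13)]
```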


The algorithm ExLocus, which finds (if any) the extended locus of γ in CT, is a generalization of a homologous algorithm by Apostolico et al. [5] that finds the extended locus of a string in a suffix tree. First, it identifies a prefix η of some string in Y and its extended locus u in CT. η is such that η ≡ γ if and only if γ has extended locus u (η can be identified through the labels on the arcs of the refinement trees). Then it checks directly whether η ≡ γ.

The inputs to ExLocus are: (i) all the refinement trees D^(r) produced by Refine; (ii) the names h_j of the γ_j, for j ∈ [0, e], which must be consistent in the following sense: let β be an arbitrary substring of some string in Y whose length is the same as that of γ_j (a power of 2); then γ_j must be assigned the same name as β if and only if the two strings are equivalent; (iii) an algorithm that can check whether γ is equivalent to another string of the same length in O(log |γ|) time and O(|γ|) work.

THEOREM 2.2 (Apostolico et al. [5, 22]). Assume that (i), (ii), and (iii) are available. Algorithm ExLocus correctly finds the extended locus of γ in CT, if it exists, in O(log |γ|) time and O(|γ|) work.

3. THE RAW INDEX R_A

We present natural generalizations to four-dimensional matrices of notions introduced in previous papers [15, 16] for the two-dimensional case. They will be used in Section 4 to define the heterogeneous index for matrix A.

Recall that the text matrix A is of shape n_1 × n_2 × n_3 × n_4 and that I is the interval [1:n_1] × [1:n_2] × [1:n_3]. Fix l = (l_1, l_2, l_3) ∈ I. For the text matrix in Fig. 2, I = [1:2] × [1:2] × [1:2] and a possible choice of l is (1, 2, 2). Consider all four-dimensional matrices of shape l_1 × l_2 × l_3 × 1 and with entries from Σ. We consider those matrices as atomic characters of a new alphabet, which we denote by Σ^l (see Fig. 3 for examples). When l = (1, 1, 1), we adopt the convention that Σ^l = Σ, i.e., each character in Σ is seen as a four-dimensional matrix of shape 1 × 1 × 1 × 1. Two characters a and b in Σ^l are equivalent if and only if they are two identical four-dimensional matrices. Let ≡_B denote that equivalence relation.

Consider now a four-dimensional matrix C[1:l_1, 1:l_2, 1:l_3, 1:f]. That matrix can be seen as a string over Σ^l, where its i-th character is C[1:l_1, 1:l_2, 1:l_3, i]. We denote that string as C(l)[1:f] and, when it is clear from the context, we will omit (l) (see Fig. 4 for an example). The equivalence relation ≡_B easily extends to strings over Σ^l. The definitions of prefix, suffix, substring, and compacted tries for strings over Σ^l follow from the ones given in Section 2.1 (with Γ = Σ^l and letting ≡ be ≡_B).
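As a purely illustrative sketch of this decomposition, the following Python fragment splits a four-dimensional matrix into its string of characters over Σ^l. The nested-list representation and the helper name `characters` are our own conventions, not the paper's.

```python
# A 4-D matrix C of shape (l1, l2, l3, f) as nested lists:
# C[a][b][c][i] holds the entry at position (a+1, b+1, c+1, i+1).

def characters(C):
    """Split C into its string over Sigma^l: the i-th character is
    the slab C[1:l1, 1:l2, 1:l3, i], flattened into a tuple."""
    l1, l2, l3 = len(C), len(C[0]), len(C[0][0])
    f = len(C[0][0][0])
    return [
        tuple(
            C[a][b][c][i]
            for a in range(l1) for b in range(l2) for c in range(l3)
        )
        for i in range(f)
    ]

# A matrix of shape 2 x 1 x 2 x 2 (as in Fig. 4) becomes a
# string of length 2 over Sigma^(2,1,2).
C = [[[[0, 1], [2, 3]]], [[[4, 5], [6, 7]]]]
print(characters(C))   # two characters, each a 4-entry tuple
```

Two matrices of the same shape are then equivalent under ≡_B exactly when their character lists are equal.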

Page 13: Multi-Dimensional Pattern Matching with Dimensional Wildcards: Data Structures and Optimal On-Line Search Algorithms

PATTERN MATCHING WITH DIMENSIONAL WILDCARDS 235

FIG. 3. (a) A four-dimensional matrix of shape 1 × 2 × 2 × 1. It is an element of Σ^(1,2,2). (b) A four-dimensional matrix of shape 2 × 1 × 2 × 1. It is an element of Σ^(2,1,2).

Informally, the index R_A is a forest of trees that stores all submatrices of A. For each fixed l = (l_1, l_2, l_3) ∈ I, there is a distinct tree CT^l that stores the submatrices of A of shape l_1 × l_2 × l_3 × i, where 1 ≤ i ≤ n_4. For the text matrix in Fig. 2, R_A consists of eight trees. Two of them are shown in Figs. 5 and 6. We now define one such CT^l.

For each position x = (x_1, x_2, x_3, x_4) in A such that x_i + l_i − 1 ≤ n_i, i = 1, ..., 3, consider the submatrix A[x_1:x_1+l_1−1, x_2:x_2+l_2−1, x_3:x_3+l_3−1, x_4:n_4]. Extend its fourth dimension so that it becomes of length n_4 + 1 and in such a way that the resulting matrix is unique (that can be done by appending to it a sequence of "endmarker" matrices of shape l_1 × l_2 × l_3 × 1; examples are reported in Figs. 5 and 6). Let Y^l be

FIG. 4. A four-dimensional matrix of shape 2 × 1 × 2 × 2. It can also be seen as a string of length 2 obtained by concatenating two characters from Σ^(2,1,2). The characters are separated by dotted lines.


GIANCARLO AND GROSSI 236

FIG. 5. CT^(1,2,2) for the text matrix of Fig. 2. We show only the paths from root to leaf that correspond to the submatrices A[x_1:x_1, x_2:x_2+1, x_3:x_3+1, x_4:4] of the matrix in Fig. 2, for all x ∈ {(1, 1, 1, 1); (1, 1, 1, 3); (2, 1, 1, 1); (2, 1, 1, 3)}. They are of shape 1 × 2 × 2 × (4 − x_4 + 1). Those submatrices are extended so that they are unique and all of length n_4 + 1 = 5 along the fourth dimension. That is done by appending to them "endmarker" matrices of shape 1 × 2 × 2 × 1 containing $'s ($ is a character not in Σ that does not match any other character, including itself). We show only one of those extended matrices explicitly (the $̂ at the end of the other matrices encodes the endmarkers not explicitly shown). Notice that the chosen submatrices of A are stored in CT^(1,2,2) as strings over the alphabet Σ^(1,2,2). The solid horizontal line is drawn for reference.

the set of all those extended matrices. CT^l is the compacted trie representing the matrices in Y^l. They are stored as strings over the alphabet Σ^l (see Figs. 5 and 6).

We define R_A as the forest of compacted tries CT^l, for all l ∈ I, and we refer to it as the raw index for a matrix A. We have:

THEOREM 3.1. Let R_A be the raw index of a four-dimensional matrix A of shape n_1 × n_2 × n_3 × n_4. Let C be a matrix with entries defined over the alphabet Σ and of shape l_1 × l_2 × l_3 × i, for some i ∈ [1:n_4] and l = (l_1, l_2, l_3) ∈ I. Then C is a submatrix of A if and only if C(l)[1:i] has extended locus v in the compacted trie CT^l of R_A. Moreover, all the positions of A corresponding to the occurrences of C in A are stored in the leaves that descend from v.
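Theorem 3.1 reduces submatrix detection to a prefix query in a trie. The paper builds compacted tries in parallel; as a hedged sequential illustration, here is an uncompacted dictionary-based sketch in Python (the `'$leaves'` convention and the position labels are our own devices, not the paper's).

```python
def make_trie(strings):
    """Uncompacted trie over sequences of hashable 'characters'.
    Each stored string carries a position label, kept at its leaf."""
    root = {}
    for label, s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault('$leaves', []).append(label)
    return root

def occurrences(trie, pattern):
    """Labels stored under the extended locus of `pattern`; empty if
    the pattern is a prefix of no stored string (Theorem 3.1)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []            # no extended locus: not a submatrix
        node = node[ch]
    out, stack = [], [node]
    while stack:                 # collect labels of descendant leaves
        n = stack.pop()
        out += n.get('$leaves', [])
        stack += [v for k, v in n.items() if k != '$leaves']
    return out

# Characters stand for l1 x l2 x l3 x 1 slabs; '$1', '$2' for endmarkers.
trie = make_trie([((1, 1), ('a', 'b', '$1')), ((1, 2), ('b', 'c', '$2'))])
print(occurrences(trie, ('a', 'b')))   # [(1, 1)]
print(occurrences(trie, ('c',)))       # []
```

The unique endmarkers play the same role as in Figs. 5 and 6: they force every stored string to end at its own leaf.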


FIG. 6. CT^(2,1,2) for the text matrix of Fig. 2. We show only the paths from root to leaf that correspond to the submatrices A[x_1:x_1+1, x_2:x_2, x_3:x_3+1, x_4:4] of the matrix in Fig. 2, for all x ∈ {(1, 1, 1, 1); (1, 1, 1, 3); (1, 2, 1, 1); (1, 2, 1, 3)}. They are of shape 2 × 1 × 2 × (4 − x_4 + 1). The extension of those matrices is similar to that already discussed for the matrices in Fig. 5 and it is shown only for one (again, the $̂ at the end of the other matrices encodes the endmarkers not explicitly shown). Notice that the chosen submatrices of A are stored in CT^(2,1,2) as strings over Σ^(2,1,2). The solid horizontal line is drawn for reference.

Proof. The result follows from the fact that C occurs in A if and only if C(l) is a prefix of some matrix in Y^l, and from Lemma 2.1 (applied to CT^l, Y^l and letting ≡ be ≡_B). For details see [15, 16].

R_A can be built in O((N^2 log N)/n_4) work. Indeed, the computation of names for all substrings of length a power of 2 of strings in Y^l, for all l = (l_1, l_2, l_3) ∈ I, can be done in the stated work bound by using the Karp et al. [23] naming technique. We leave the details to the interested reader


(a more general version of what is required here is presented in Section 6; see Remark 6.1). Once we have those names, we can use |I| distinct versions of algorithm Refine (outlined in Section 2.2), each of which builds one CT^l. The work is still bounded by O((N^2 log N)/n_4). Details are left to the reader.

We can answer, on-line, the decision query by searching through R_A.
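The naming step invoked above follows the Karp, Miller, and Rosenberg scheme: substrings whose length is a power of 2 receive integer names such that equal substrings get equal names, each level being named by the pair of names of its two halves. A minimal sequential sketch for ordinary strings (the paper uses a parallel, work-bounded variant; the function name is ours):

```python
def kmr_names(s):
    """names[k][i] = name of s[i : i + 2**k]; two substrings of
    length 2**k are equal iff they receive the same name."""
    n = len(s)
    names = [{}]
    table = {}
    for i, ch in enumerate(s):          # level 0: name single characters
        names[0][i] = table.setdefault(ch, len(table))
    k = 1
    while (1 << k) <= n:
        prev, cur, table = names[k - 1], {}, {}
        half = 1 << (k - 1)
        for i in range(n - (1 << k) + 1):
            pair = (prev[i], prev[i + half])  # name by pair of halves
            cur[i] = table.setdefault(pair, len(table))
        names.append(cur)
        k += 1
    return names

names = kmr_names("abab")
# the substrings of length 2 starting at positions 0 and 2 are both "ab"
print(names[1][0] == names[1][2])   # True
```

For the index, the same doubling is applied to strings of matrix-characters, with character equality replaced by the relevant equivalence relation.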

We briefly discuss this search method. Fix an embedding C of B and assume that it is of shape l_1 × l_2 × l_3 × i, with l = (l_1, l_2, l_3) ∈ I and 1 ≤ i ≤ n_4. We search for the extended locus of C in CT^l. If that node exists, then C occurs in A by Theorem 3.1 and therefore B* occurs in A by definition. Conversely, if we fail the search with all the embeddings of B, then there is no occurrence of B*. This strategy needs at most |B*| = (d choose r) searches in compacted tries. For each search, we use the "black box" ExLocus (outlined in Section 2.3) so that each search takes O(log M) time with O(M/log M) processors. Therefore, for a given embedding C, apart from CT^l as built by Refine, we need to satisfy the additional hypotheses of Theorem 2.2. We omit the details on how that is done, since the required techniques are essentially the same as those in Section 8.1. We obtain:

THEOREM 3.2. Once R_A has been built using the algorithm Refine, we can check whether B* occurs in A in O((d choose r) M) work.
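The (d choose r) factor counts the embeddings of B: each one places the r pattern extents, in order, on r of the d axes and sets the remaining extents to 1. An illustrative enumeration (the helper name is ours):

```python
from itertools import combinations

def embeddings(shape, d):
    """All d-dimensional embeddings of an r-dimensional shape: place
    the r extents, in order, on r of the d axes; other axes get 1."""
    r = len(shape)
    out = []
    for axes in combinations(range(d), r):
        emb = [1] * d
        for ax, ext in zip(axes, shape):
            emb[ax] = ext
        out.append(tuple(emb))
    return out

# A 3-dimensional pattern in a 4-dimensional text has C(4, 3) = 4 embeddings.
print(embeddings((2, 3, 4), 4))
```

Searching the raw index thus costs one trie search per element of this list, which is exactly the exponential dependence on d that the heterogeneous index of Section 4 removes.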

Remark 3.1. Consider the enumeration query. In order to answer it, we need some additional data structures for each tree CT^l. Namely, for each CT^l and each node u in that tree, we need the list of leaves in the subtree rooted at u. We omit the details here since they are outlined in Section 9 for different trees. However, we point out that the required information can be computed by Euler tour techniques and parallel evaluation of arithmetic expressions (see for instance [10, 22, 35]). The total time and processors needed do not exceed the ones for the construction of R_A.

Once the auxiliary information is available, we search in each tree of R_A for the extended locus u of the corresponding embedding C of B. By Theorem 3.1, all submatrices of A equal to C are at the leaves of the subtree rooted at u. So, we need O((d choose r) M) work to search in all trees plus O(enum) work to output all the leaves corresponding to the submatrices of A equal to embeddings of B. Once we have those submatrices, we can answer the occurrence query by eliminating "duplicate" positions, i.e., positions that are the origin of more than one embedding of B. That can be done in O(log enum/log log enum) time and O(enum) work by means of prefix sums [11].
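Sequentially, the duplicate elimination at the end of the remark amounts to keeping each origin once; the parallel version achieves the stated bounds by marking duplicates and compacting with a prefix sum. A simple sequential sketch (names are ours):

```python
def unique_origins(occurrences):
    """Given (origin, embedding_shape) pairs for all matched
    submatrices, keep each origin once (occurrence query)."""
    seen, out = set(), []
    for origin, _shape in occurrences:
        if origin not in seen:   # origin of more than one embedding
            seen.add(origin)
            out.append(origin)
    return out

occ = [((1, 1, 1, 1), (1, 2, 2, 2)), ((1, 1, 1, 1), (2, 1, 2, 2)),
       ((2, 1, 1, 1), (1, 2, 2, 2))]
print(unique_origins(occ))   # [(1, 1, 1, 1), (2, 1, 1, 1)]
```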


4. THE HETEROGENEOUS INDEX H_A

We start by giving an intuitive idea of how one can modify the raw index for A so as to reduce the number of trie searches needed to check whether B* occurs in A. We need the notion of the source of a matrix. We first give some intuition using the example of the matrix C in Fig. 1c. Then, we state it formally.

Consider the example matrix C. It is of shape 2 × 2 × 1 × 2. Its source is the matrix B in Fig. 1a. Notice that the shape of B is obtained by deleting the 1 in the shape of C. Moreover, B[y_1, y_2, y_3] = C[x_1, x_2, x_3, x_4], where y_1 = x_1, y_2 = x_2, and y_3 = x_4. Notice that x_3, the component corresponding to the 1 in the shape of C, is ignored. Formally, consider a four-dimensional matrix C of shape s_1 × s_2 × s_3 × s_4. When s_i = 1, for all 1 ≤ i ≤ 4, the source of C is D[1] = C[1, 1, 1, 1]. Now assume that k of the s_i's are not equal to one, 0 < k ≤ 4. From left to right, let them be s_{i_1}, ..., s_{i_k}. In this case, the source of C is a matrix D of shape s_{i_1} × s_{i_2} × ··· × s_{i_k} such that D[y_1, y_2, ..., y_k] = C[x_1, x_2, x_3, x_4], where y_1 = x_{i_1}, y_2 = x_{i_2}, ..., y_k = x_{i_k} (the remaining components are ignored because they correspond to 1's in the shape of C).

Observe that, for the three-dimensional pattern matrix B in Fig. 1, all its four-dimensional embeddings have B as source. Therefore, an alternative statement of our pattern matching problem is to check whether any four-dimensional submatrix of A has B as source (we are stating it for the case d = 4 that we are presenting). This fact suggests the following idea: represent all submatrices of A in terms of their sources. In that case, one can observe the following. Consider the four-dimensional matrix on the path from the root of CT^(1,2,2) to the leaf labeled (1, 1, 1, 1) (see Fig. 5). It is easy to verify that the initial part of this matrix (ending at the solid line) has the same source as the corresponding initial part of the four-dimensional matrix on the path from the root of CT^(2,1,2) to the leaf labeled (1, 1, 1, 1) (see Fig. 6). So, it would be convenient if those two matrices "shared" the same path from the root to an internal node of a suitably defined tree. In general, the length of the path they share depends on "how much in common" the sources of those two matrices have.
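As an illustrative sketch, the source operation just forgets the axes of extent one. Here matrices are represented as dictionaries from 1-based coordinate tuples to entries; this representation and the helper name are our choices for brevity, not the paper's.

```python
def source(shape, entries):
    """Source of a matrix: drop the dimensions equal to 1 and keep
    the entries, indexed by the surviving coordinates only."""
    keep = [i for i, s in enumerate(shape) if s != 1]
    if not keep:                     # all extents are 1: a single entry
        return (1,), {(1,): entries[tuple(1 for _ in shape)]}
    src_shape = tuple(shape[i] for i in keep)
    src = {tuple(x[i] for i in keep): v for x, v in entries.items()}
    return src_shape, src

# C of shape 2 x 2 x 1 x 2: its source has shape 2 x 2 x 2, and
# C[x1, x2, 1, x4] maps to D[x1, x2, x4] (x3 is ignored).
shape = (2, 2, 1, 2)
entries = {(a, b, 1, c): (a, b, c)
           for a in (1, 2) for b in (1, 2) for c in (1, 2)}
src_shape, src = source(shape, entries)
print(src_shape)        # (2, 2, 2)
print(src[(1, 2, 2)])   # (1, 2, 2)
```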

The idea of using sources suggests partitioning the forest of trees R_A into subforests, each of which represents a suitable combination of appropriate trees. Assuming that the text matrix is the one of Fig. 2, we anticipate that the partition of the forest R_A is the following: {CT^(1,1,1)}, {CT^(2,1,1), CT^(1,2,1), CT^(1,1,2)}, {CT^(2,1,2), CT^(2,2,1), CT^(1,2,2)}, and {CT^(2,2,2)}. In the next subsection we will show how CT^(2,1,2), CT^(1,2,2), and CT^(2,2,1) can be combined into a single tree and justify why trees in different subforests of the partition need to be combined into different new trees. We define H_A in general terms in Section 4.2.


4.1. Combining Trees of R_A: An Example

Consider the four-dimensional matrices in Σ^(1,2,2), Σ^(2,1,2), and Σ^(2,2,1). The sources of all of those matrices have the same shape: 2 × 2. Recall that the matrices in those alphabets have been used as characters to store submatrices of A as strings in CT^(1,2,2), CT^(2,1,2), and CT^(2,2,1), respectively. Since we want to "combine" those trees into one, we combine those alphabets into one: Σ[2, 2]. We refer to the characters of this new alphabet as heterogeneous characters of shape 2 × 2. For example, the two matrices in Fig. 3 are both in Σ[2, 2]. Two characters in Σ[2, 2] are equivalent if and only if the corresponding four-dimensional matrices have identical sources. For example, the matrices in Fig. 3 are equivalent since they have the same source matrix. We denote this new equivalence relation by ≡_H.

The concatenation of heterogeneous characters in Σ[2, 2] is analogous to that of characters of Σ. A string obtained by concatenating a sequence of f characters from Σ[2, 2] is called a heterogeneous string (over the alphabet Σ[2, 2]), whose length is f (see Fig. 7). The equivalence relation ≡_H easily extends to heterogeneous strings so that two of those strings are equivalent if and only if they have equivalent heterogeneous characters in homologous positions. The definitions of prefix, suffix, heterogeneous substring, and compacted tries for heterogeneous strings follow from the ones given in Section 2.1 (with Γ = Σ[2, 2] and letting ≡ be ≡_H). We define a set E[2, 2] = {(1, 2, 2), (2, 1, 2), (2, 2, 1)}.

Heterogeneous strings do not have a one-to-one correspondence with four-dimensional matrices (as can easily be checked in Fig. 7). However, they display some useful properties that we now point out.

FIG. 7. (a) A heterogeneous string of length 2 over Σ[2, 2]. The first character (top) is also a character in Σ^(2,1,2), while the second (bottom) is also a character in Σ^(1,2,2). Notice that the shown string does not correspond to a four-dimensional matrix.


Remark 4.1. We now show how a single four-dimensional matrix can be represented by many equivalent heterogeneous strings. We give an example first. The matrix in Fig. 1c can be represented by the heterogeneous string in Fig. 1c, or that in Fig. 1d, or that in Fig. 7. They are all equivalent. In general, consider C[1:l_1, 1:l_2, 1:l_3, 1:f], l = (l_1, l_2, l_3) ∈ E[2, 2]. It can be seen as a string over the alphabet Σ^(l_1, l_2, l_3). For C(l)[i], substitute any character (including itself) from Σ[2, 2] that is equivalent to it. All the heterogeneous strings so obtained are equivalent. Out of all those equivalent representations of a matrix, we can choose the most convenient. As we will see, that gives us the required flexibility to combine trees of R_A. In what follows, when we say that a heterogeneous string represents a matrix, we mean that the string is in the class of heterogeneous strings that can be obtained from that matrix.

Remark 4.2. Heterogeneous strings are useful in comparing matrices in terms of their sources. Indeed, let Z_1 and Z_2 be two four-dimensional matrices of shape a_1 × a_2 × a_3 × k and b_1 × b_2 × b_3 × k, respectively, such that a = (a_1, a_2, a_3) and b = (b_1, b_2, b_3) are both in E[2, 2]. Let h-z_1 and h-z_2 be two heterogeneous strings representing those matrices, respectively. One can show that h-z_1 ≡_H h-z_2 if and only if Z_1 and Z_2 have the same source.

We now define CT[2, 2], the trie over the alphabet Σ[2, 2] that combines CT^(1,2,2), CT^(2,1,2), and CT^(2,2,1). Consider the sets of matrices Y^(1,2,2), Y^(2,1,2), and Y^(2,2,1) defined as in Section 3, but for the text matrix in Fig. 2. Some matrices in those sets are shown in Figs. 5 and 6. Recall that all the matrices in those sets have been augmented so that they are all nonequivalent with respect to ≡_B (see Fig. 5 again). They are also heterogeneous strings over the alphabet Σ[2, 2], each of length n_4 + 1 = 5. Let Y[2, 2] be the set of all of those heterogeneous strings. So Y[2, 2] is the union of Y^(1,2,2), Y^(2,1,2), and Y^(2,2,1). Those heterogeneous strings are nonequivalent with respect to ≡_H (the endmarker matrices we have added all have distinct source matrices). Let CT[2, 2] be the compacted trie storing all the heterogeneous strings in Y[2, 2] (see Fig. 8). The following remarks are useful in illustrating the definition. Moreover, they also state some useful facts about CT[2, 2].

Remark 4.3. Consider the matrix on the path from the root to leaf (1, 1, 1, 1) in CT^(1,2,2) (see Fig. 5). That matrix is in Y^(1,2,2) and, as a heterogeneous string, it is also in Y[2, 2] (by definition). However, it is represented by an equivalent heterogeneous string in CT[2, 2], the one on the path from the root to the leaf labeled (1, 2, 2); (1, 1, 1, 1) (see Fig. 8). An analogous fact holds for the matrix on the path from the root to leaf (1, 1, 1, 1) in CT^(2,1,2) (see Fig. 6). It is represented by an equivalent


FIG. 8. CT[2, 2]. We are showing only the part that combines the partially shown trees CT^(1,2,2) and CT^(2,1,2) in Figs. 5 and 6, respectively. Notice that a prefix of the heterogeneous string in Y[2, 2] that is stored on the path from the root to the leaf labeled (1, 2, 2); (1, 1, 1, 3) is equivalent to a prefix of the one stored on the path to (2, 1, 2); (1, 1, 1, 1).

heterogeneous string in CT[2, 2], the one on the path from the root to the leaf labeled (2, 1, 2); (1, 1, 1, 1) (see Fig. 8). So, in general, heterogeneous strings may appear in one form in Y[2, 2] and be stored in an equivalent form in CT[2, 2]. The form in which they appear in CT[2, 2] cannot be established a priori; it is fixed by the algorithm building CT[2, 2].

Remark 4.4. We justify why, among the trees in R_A for the matrix in Fig. 2, CT^(2,1,2), CT^(1,2,2), and CT^(2,2,1) can be suitably combined into a single tree (the same reasons hold for CT^(1,1,2), CT^(1,2,1), and CT^(2,1,1)). Indeed, consider the two example matrices we have used in Remark 4.3, one from CT^(1,2,2) and the other from CT^(2,1,2). As heterogeneous strings, they share a path in CT[2, 2]. That is possible because the "characters" forming those two matrices are matrices that have sources of equal shape: 2 × 2. CT^(2,1,2), CT^(1,2,2), and CT^(2,2,1) are the only trees in R_A storing matrices formed by "characters" with sources of shape 2 × 2.


Remark 4.5. There is a one-to-one correspondence between the submatrices of A in Y^(1,2,2), Y^(2,1,2), and Y^(2,2,1) and the leaves of CT[2, 2]. Thus, the total number of nodes in CT[2, 2] is O(N × |E[2, 2]|).

Remark 4.6. CT[2, 2] can be efficiently stored in O(N × |E[2, 2]|) space. Each label at the leaves is represented by two tuples that are encoded into two integers. Consider now the information on the arcs. Again, we provide an example of how it is stored. Consider the arc (root, x) in Fig. 8. We can encode the label on that arc as follows: l = (2, 1, 2); x = (1, 2, 1, 1); j = 3, meaning that the heterogeneous substring on that arc is the source of A[x_1:x_1+l_1−1, x_2:x_2+l_2−1, x_3:x_3+l_3−1, x_4:x_4+j−1]. Again, those tuples can be encoded into integers. Since all the information labeling the arcs and leaves of CT[2, 2] can be stored in constant space and since the total number of nodes is O(N × |E[2, 2]|), the total space required to store that tree is O(N × |E[2, 2]|).

4.2. The Index H_A

In this section we extend the notion of heterogeneous string so that we can define the heterogeneous index H_A. Then, we show how to use it to answer the various queries.

We need some notation. Let m_j = (m_1, m_2, ..., m_j) be a tuple such that j ≤ 3 and m_i ≥ 2, for i = 1, ..., j. Let E[m_j] be the set of triples l ∈ I such that l is equal to m_j once all components in l equal to one have been eliminated. For example, when I = [1:2] × [1:2] × [1:2], E[2, 2] is the set defined in Section 4.1. Moreover, let m_0 = (1) and E[1] = {(1, 1, 1)}. Let M be the set of all tuples m_j, 0 ≤ j ≤ 3, for which the corresponding E sets are nonempty. For I = [1:2] × [1:2] × [1:2], M = {(1), (2), (2, 2), (2, 2, 2)}. From now on, we will drop the subscript j from m_j, even though m will denote a tuple of variable length.
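The sets E[m] and M can be computed mechanically from I. The following sketch reproduces the running example; the function names mirror the paper's notation, but the representation (tuples of extents) is our own, and the special case m = (1) is folded into M via the empty reduction.

```python
from itertools import product

def E(m, I):
    """E[m]: triples l in I that reduce to m once the 1's are dropped."""
    return [l for l in I if tuple(c for c in l if c != 1) == tuple(m)]

def M(I):
    """All tuples m (writing the empty reduction as (1,)) whose E set
    is nonempty; these index the trees of the heterogeneous forest."""
    reductions = {tuple(c for c in l if c != 1) or (1,) for l in I}
    return sorted(reductions, key=lambda m: (len(m), m))

# I = [1:2] x [1:2] x [1:2] as in the running example.
I = list(product((1, 2), repeat=3))
print(M(I))           # [(1,), (2,), (2, 2), (2, 2, 2)]
print(E((2, 2), I))   # [(1, 2, 2), (2, 1, 2), (2, 2, 1)]
```

Note that the E sets partition I, which is the fact used in the space bound of Theorem 4.1.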

Fix m ∈ M and let Σ[m] be the alphabet obtained by putting together all characters in Σ^l, for all l ∈ E[m]. When m = (2, 2), Σ[2, 2] is defined as in Section 4.1. We define the equivalence relation ≡_H between symbols of the new alphabet exactly as it has been defined for Σ[2, 2] in Section 4.1. The definitions of heterogeneous string, substring, prefix, suffix, and compacted trie are also as in that subsection.

Let Y[m] be the set of heterogeneous strings obtained as the union of Y^l (defined in Section 3), for all l ∈ E[m]. When m = (2, 2), Y[2, 2] is defined as in Section 4.1. Let CT[m] be the compacted trie over the alphabet Σ[m] representing all the heterogeneous strings in Y[m]. It combines the trees CT^l of R_A, for all l ∈ E[m].

Remark 4.7. The same remarks that hold for CT[2, 2] in Section 4.1 can be easily extended to hold for CT[m]. In particular, it can be stored in


O(N × |E[m]|) space. Moreover, Remark 4.2 extends to the case in which a and b are both in E[m] and the two heterogeneous strings h-z_1 and h-z_2 are both over the alphabet Σ[m].

The heterogeneous index H_A is the forest of trees CT[m], for all m ∈ M. For the text matrix in Fig. 2, H_A is composed of the trees CT[1] (which is equal to CT^(1,1,1) in R_A), CT[2] (which combines {CT^(2,1,1), CT^(1,2,1), CT^(1,1,2)} of R_A), CT[2, 2] (which combines {CT^(2,1,2), CT^(2,2,1), CT^(1,2,2)} of R_A), and CT[2, 2, 2] (which is equal to CT^(2,2,2) in R_A).

THEOREM 4.1. The index H_A for matrix A can be stored in O(N^2/n_4) space.

Proof. As already noted, each CT[m] requires space that is proportional to N × |E[m]|. The result follows since the E sets partition I and |I| = N/n_4.

Recall that B is a pattern matrix of shape m_1 × m_2 × ··· × m_r, with m_i ≥ 2, i = 1, ..., r, and r ≤ 4. We now show how the index H_A can be used to answer the decision query, i.e., to check whether B* occurs in A. Through the same process, we can also identify all submatrices of A that have B as source (enumeration query) and all positions of A in which those submatrices occur (occurrence query). The strategy can be outlined as follows: we choose two types of embeddings from B* (the choice depends on r and not on the specific pattern matrix). Then we search in two distinct trees of the forest H_A. We discuss first the case in which r = 3 and provide examples.

Let C_1[1:m_1, 1:m_2, 1:m_3, 1:1] and C_2[1:m_1, 1:m_2, 1:1, 1:m_3] be two embeddings of B. Notice that C_1 has a nonexistent fourth dimension, while C_2 has a fourth dimension of length m_3 = m_r > 1. When (m_1, m_2, m_3) ∈ M, C_1 can be represented by a heterogeneous character h-g_1 in Σ[m_1, m_2, m_3]. Similarly, when (m_1, m_2) ∈ M, C_2 can be represented by a heterogeneous string h-g_2 over the alphabet Σ[m_1, m_2].

EXAMPLE 4.1. Let the pattern matrix be the one in Fig. 1a. Then C_1 is the matrix in Fig. 1b and it can be seen as a heterogeneous character h-g_1 in Σ[2, 2, 2] (also shown in that figure). C_2 is the matrix in Fig. 1c and it can be seen as a heterogeneous string h-g_2 over Σ[2, 2] (also shown in that figure).

LEMMA 4.1. With the conventions stated above, B* occurs in A if and only if (i) (m_1, m_2, m_3) ∈ M and h-g_1 has extended locus u′ in CT[m_1, m_2, m_3], or (ii) (m_1, m_2) ∈ M and h-g_2 has extended locus u in CT[m_1, m_2]. In case (i), all submatrices of A with shape l_1 × l_2 × l_3 × 1, (l_1, l_2, l_3) ∈ E[m_1, m_2, m_3], with origin in x and that have B as source are identified by the labels at the leaves of the subtree of CT[m_1, m_2, m_3] rooted at


u′. In case (ii), all submatrices of A with shape l_1 × l_2 × l_3 × m_3, (l_1, l_2, l_3) ∈ E[m_1, m_2], with origin in x and that have B as source are identified by the labels at the leaves of the subtree of CT[m_1, m_2] rooted at u.

Proof. The proof is structured as follows: we present examples followed by general arguments. The example matrix is the one in Fig. 1a for the pattern B and the one in Fig. 2 for the text A. C_2 and the example heterogeneous string h-g_2 are those in Example 4.1.

(⇒) Since B* occurs in A, there is at least one four-dimensional submatrix of A that has B as a source (recall the definition of occurrence of B* and the definitions of source and embedding). Denote that submatrix by C[1:l_1, 1:l_2, 1:l_3, 1:l_4]. Since C is a submatrix of A, we must have l = (l_1, l_2, l_3) ∈ I. Since C has shape l_1 × l_2 × l_3 × l_4 and has B as source, l_4 is either equal to 1 or to m_3 (the shape of B is m_1 × m_2 × m_3). We have two cases:

Case l_4 = m_3. Let h-g be a heterogeneous string representing C. We give an example first:

EXAMPLE 4.2. Let C = A[1:2, 1:1, 1:2, 1:2]. Notice that it is equal to the embedding in Fig. 1d; let that embedding also be h-g. Moreover, h-g has extended locus u in the tree in Fig. 8 (it is actually its locus). Although C ≠ C_2, it can be easily verified that those two matrices have the same source and the two heterogeneous strings representing them are equivalent. Therefore, h-g_2 has extended locus in the tree in Fig. 8 (even though C_2 is not a submatrix of A). Moreover, all submatrices of A having shape l_1 × l_2 × l_3 × 2, (l_1, l_2, l_3) ∈ E[2, 2], and with B as source have labels at the leaves of the subtree rooted at u.

When l_4 = m_3, one of l_1, l_2, l_3 must be one and the other two must be equal to m_1 and m_2, in order of increasing indices. So, E[m_1, m_2] is nonempty (at least l is in there). Therefore, (m_1, m_2) is in M. We now show that h-g_2 has extended locus u in CT[m_1, m_2] (even though C_2 may not be a submatrix of A).

Notice that both C and C_2 have B as source. Moreover, C and C_2 have shape l_1 × l_2 × l_3 × m_3 and m_1 × m_2 × 1 × m_3, respectively, and both (l_1, l_2, l_3) and (m_1, m_2, 1) are in E[m_1, m_2]. So, by Remarks 4.2 and 4.7, h-g ≡_H h-g_2. Since C is a submatrix of A, h-g (or an equivalent heterogeneous string) must be a prefix of some heterogeneous string in Y[m_1, m_2] (by construction). Therefore, h-g has extended locus u in CT[m_1, m_2]. Since h-g ≡_H h-g_2, u is also the extended locus of h-g_2. Since CT[m_1, m_2] is a compacted trie representing all strings in Y[m_1, m_2], we have by Lemma 2.1 that all heterogeneous strings in Y[m_1, m_2] that have h-g_2 as prefix are "stored" at the leaves of the subtree rooted at u. But, by the correspondence between submatrices of A


and the strings in Y[m_1, m_2], that is equivalent to saying that all submatrices of A with shape l_1 × l_2 × l_3 × m_3, (l_1, l_2, l_3) ∈ E[m_1, m_2], with origin in x and that have B as source are identified by the labels at the leaves of the subtree of CT[m_1, m_2] rooted at u.

Case l_4 = 1. Using the same ideas outlined before, one can show that h-g_1 has extended locus u′ in CT[m_1, m_2, m_3] and that the submatrices of A indicated in part (i) of the lemma are identified by the labels at the leaves of the subtree of CT[m_1, m_2, m_3] rooted at u′.

(⇐) Let us assume that h-g_2 has extended locus u in CT[m_1, m_2], where (m_1, m_2) ∈ M. (The proof for the case in which h-g_1 has an extended locus u′ in CT[m_1, m_2, m_3] is similar and omitted.) We show that an embedding of B (not necessarily C_2) occurs in A.

EXAMPLE 4.3. The heterogeneous string h-g_2 in Fig. 1c has extended locus u in CT[2, 2] (see Fig. 8). Based on that, we can conclude that it represents submatrix P = A[1:2, 1:1, 1:2, 1:2] of the matrix in Fig. 2. (Notice that P is equal to the embedding in Fig. 1d.) Since h-g_2 also represents C_2, we have that P and C_2 must have the same source (the matrix in Fig. 1a). Therefore, B* occurs in A (a submatrix of A has B as source). Again, all submatrices of A having shape l_1 × l_2 × l_3 × 2, (l_1, l_2, l_3) ∈ E[2, 2], and with B as source have labels at the leaves of the subtree rooted at u.

Recall that h-g_2 is a heterogeneous string of length m_3. Let h-b be the heterogeneous string on the path from the root to one of the leaves in the subtree rooted at u. Notice that h-b is in Y[m_1, m_2]. By Lemma 2.1 (applied to CT = CT[m_1, m_2], Y = Y[m_1, m_2] and letting ≡ be ≡_H), h-g_2 is equivalent to h-b[1:m_3] (a prefix of h-b of length m_3). By the way we have defined Y[m_1, m_2] and CT[m_1, m_2], we know that h-b represents an extended submatrix of A, say, A[x_1:x_1+l_1−1, x_2:x_2+l_2−1, x_3:x_3+l_3−1, x_4:n_4+x_4]. Hence, h-b[1:m_3] represents the submatrix P = A[x_1:x_1+l_1−1, x_2:x_2+l_2−1, x_3:x_3+l_3−1, x_4:x_4+m_3−1]. Since h-b[1:m_3] ≡_H h-g_2, we have that P and C_2 have the same source matrix (by Remarks 4.2 and 4.7). So B is the source of P (the source of a matrix is unique). But P occurs in A and therefore an embedding of B occurs in A. Using this fact and Lemma 2.1, one can again show that all the submatrices of A indicated in part (ii) of the lemma have labels at the leaves of the subtree rooted at u.

Remark 4.8. Notice that the origins of the submatrices of A that have B as source are the positions of A in which some embedding of B occurs.


THEOREM 4.2. Whether B* occurs in A can be ascertained by querying at most two distinct trees in H_A. Through the same method, we can also identify all submatrices of A that are embeddings of B and where those embeddings occur in A.

Proof. If r = d = 4, then B is also its own four-dimensional embedding and CT[m_1, m_2, m_3] is isomorphic to CT^(m_1, m_2, m_3). Hence, we can apply Theorem 3.1. If r < d = 4, then we follow the proof given in Lemma 4.1 for the case r = 3. The same ideas extend to the case r = 2, except that m_3 "disappears" and m_2 plays the role of m_3. For the case r = 1, one can show that B* occurs in A if and only if (i) m_1 ∈ M and h-g_1 has extended locus u′ in CT[m_1], or (ii) h-g_2 has extended locus u in CT[1]. In case (i), all submatrices of A with shape l_1 × l_2 × l_3 × 1, (l_1, l_2, l_3) ∈ E[m_1], with origin in x and that have B as source are identified by the labels at the leaves of the subtree of CT[m_1] rooted at u′. In case (ii), all submatrices of A with shape l_1 × l_2 × l_3 × m_1, (l_1, l_2, l_3) ∈ E[1], with origin in x and that have B as source are identified by the labels at the leaves of the subtree of CT[1] rooted at u. The proof is similar to the one in Lemma 4.1 and therefore omitted.

Remark 4.9. So far, we have considered the case r ≤ d ≤ 4. We now provide some intuition as to why Theorem 4.2 extends to the case 4 < r ≤ d. Let m_1 × ··· × m_r be the shape of B, the pattern matrix. Again, m_i ≥ 2, i = 1, ..., r. When r = d, B is its own d-dimensional embedding and we need to search only in CT[m_1, ..., m_d]. Consider the case 4 < r < d. Observe that the shape of a d-dimensional embedding of B has its last component either equal to 1 (type one) or to m_r (type two). An embedding of type one is a heterogeneous character in Σ[m_1, ..., m_r] and we must search for it in CT[m_1, ..., m_r]. An embedding of type two is a string over the alphabet Σ[m_1, ..., m_{r−1}] and we must search for it in CT[m_1, ..., m_{r−1}]. Now, we can arbitrarily pick one embedding of each type because, within each type, all embeddings of B are equivalent with respect to the "equal source" relation. But we have also grouped submatrices of A according to the same equivalence relation. The relevant ones, if any, are "stored" in the compacted tries we have mentioned.

5. PARALLEL CONSTRUCTION OF INDEX ℋ_A: AN OUTLINE

We use N^2/n_4 processors, indexed from 1 to N^2/n_4, to build ℋ_A. The algorithm follows the classical scheme outlined in Section 2.2; i.e., there are two phases, naming and refining. During naming we assign integers (names) to all heterogeneous substrings, of length a power of 2, of the


GIANCARLO AND GROSSI 248

heterogeneous strings in Y[m], for all m ∈ ℳ. This algorithm is a suitable modification of the naming technique by Karp et al. [23]. Since all strings in Y[m], m ∈ ℳ, are of length n_4 + 1, we assume (for convenience and without loss of generality) that n_4 + 1 is a power of 2. During refining, we use those names and an instance of Refine to build each of CT[m], m ∈ ℳ. This phase is a "verbatim" application of ideas in [5] and notions in Section 4. It is briefly outlined in Section 7. We now focus on the naming phase, which requires suitable modification of Karp et al. [23].

6. NAMING

In order to compute the names for the required substrings of strings in Y[m], for all m ∈ ℳ, we use N/n_4 groups of N processors each. For each l ∈ I, let G^l denote one of those groups. Moreover, we refer to the collection of all groups G^l such that l ∈ ℰ[m] as a cluster. We now describe the computation for a fixed m = (m_1, m_2) ∈ ℳ. In particular, we give a high level description of the computation for one group G^l, l ∈ ℰ[m], and the communication with other groups in the same cluster. Consider the set Y[m_1, m_2]. The names that we assign to heterogeneous substrings, having length a power of 2, of strings in Y[m_1, m_2] must satisfy the following:

PROPERTY 6.1. Let h-α and h-β be any two of those substrings. Name(h-α) = Name(h-β) and |h-α| = |h-β| if and only if h-α ≡_H h-β.

Recall that Y[m_1, m_2] is the union of three sets, Y^(1,m_1,m_2), Y^(m_1,1,m_2) and Y^(m_1,m_2,1). Each of those sets "contains" some suitably chosen submatrices of A. For instance, Y^(1,m_1,m_2) consists of the matrices A[x_1:x_1, x_2:x_2 + m_1 − 1, x_3:x_3 + m_2 − 1, x_4:x_4 + n_4], for each position x in A. Those matrices are represented as heterogeneous strings in Y[m_1, m_2]. When "endmarker" matrices are ignored, every suffix of each heterogeneous string in Y[m_1, m_2] is also in that set. That is, all suffixes of heterogeneous strings in Y[m_1, m_2] are also in Y[m_1, m_2]. This fact, in turn, implies that a heterogeneous substring of a heterogeneous string in Y[m_1, m_2] is a prefix of a heterogeneous string in Y[m_1, m_2]. Therefore, Property 6.1 needs to be enforced only for substrings that are prefixes (of length a power of 2) of heterogeneous strings in Y[m_1, m_2]. Then, it will automatically hold for all the required substrings. From now on, we concentrate on prefixes only.

Remark 6.1. Consider the problem of assigning names to all block substrings of length a power of two of block strings in Y^(1,m_1,m_2). Using the same arguments as for Y[m_1, m_2], we can reduce the problem to the


computation of names for prefixes of length a power of 2. Moreover, when restricted to Σ^(1,m_1,m_2), the equivalence relation ≡_H is the same as ≡_B. So, we can use the algorithm in this section to assign names to the required block substrings in Y^(1,m_1,m_2) (recall the discussion about naming of block strings in Section 3).

The group of processors G^(1,m_1,m_2) will compute the names of the prefixes of the heterogeneous strings associated with the matrices in Y^(1,m_1,m_2). Groups G^(m_1,1,m_2) and G^(m_1,m_2,1) will do the same for the sets Y^(m_1,1,m_2) and Y^(m_1,m_2,1).

Groups proceed independently and each of them implements a straightforward extension of the "naming" algorithm of Karp et al. [23]. However, groups need to communicate. Indeed, a group can assign a name to a heterogeneous string at one time, and another group can assign a name to an equivalent heterogeneous string at a different time. Since we want Property 6.1 to hold, those names must be the same. Communication among groups is implemented through a standard tool: Bulletin Boards [5]. However, the set-up of the proper "communication paths" is tricky. We now describe the general structure of the computation within each group and, for the convenience of the reader, the Bulletin Boards.

Fix l = (l_1, l_2, l_3) ∈ ℰ[m_1, m_2]. All processors in G^l work synchronously and each is assigned to a different position x of A. The computation is divided into d + 1 = 5 rounds, each of which is divided in steps. Round k = 0 lasts one step. Round k, 0 < k < d = 4, lasts ⌈log n_k⌉ steps. The processors in G^l will be active only for the first ⌈log l_k⌉ steps and will be idle for the remaining steps of the round. Notice that when l_k is equal to 1, 0 < k < d, the entire group is idle during round k. Round k = d = 4 is divided into (log n_4) + 1 steps, and all processors in G^l are active. Thus, there are overall 1 + Σ_{k=1}^{d−1} ⌈log n_k⌉ + (log n_4 + 1) = O(log N) steps. The number of steps would be larger if we kept all available processors active at all times: in the kth step processors in G^l could "process" the kth component of l that is different from one, but this would always require log n_4 + 1 steps per round for a total of O(d(log n_4 + 1)) steps.

Communication among groups is implemented through a set of noninitialized Q × Q Bulletin Boards BB^k_γ, 1 ≤ k ≤ d = 4 and 1 ≤ γ ≤ ⌈log n_k⌉. For technical reasons, we choose Q = 3N^2/n_4. They are shared globally by all groups of processors in all clusters during different steps of the computation. We now outline the use and the rules for reading and writing in one BB (they will apply also to the others). BB[a, b] is used to store a unique integer c ∈ [1, N^2/n_4] assigned to a pair (a, b) ∈ [1, Q] × [1, Q]. BB contains either "garbage" or a valid entry corresponding to the index of the processor that successfully writes first in that cell. Once the entry is valid, no processor can change it (including the one who wrote in it). We


denote the operation of possibly setting and reading the content of BB as c ← BB[a, b]. The details on how that operation can be performed in constant work can be derived from the one described in [5]. We need to check whether an entry of BB is garbage. That can be done by well known methods (see [6, 5, 18] for sequential and parallel techniques). From now on, we simply assume that this "validity" check is done when we access a BB. Moreover, using the techniques in [5], the space requirement of a BB can be reduced from Q^2 to Q^(1+ε), for any ε, 0 < ε ≤ 1. That results in a slowdown of a 1/ε factor in time.
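The first-writer-wins semantics of a Bulletin Board cell can be sketched sequentially as follows; this is only an illustration, not the paper's constant-work CRCW implementation. A Python dict stands in for the uninitialized Q × Q array, so the "garbage" validity check becomes a simple membership test, and the class and method names are our own:

```python
class BulletinBoard:
    """Sketch of one BB: a cell (a, b) holds the value written by the
    first writer; later writers cannot change it and simply read it."""

    def __init__(self):
        self.cells = {}  # (a, b) -> integer set by the first writer

    def set_and_read(self, a, b, writer_id):
        # c <- BB[a, b]: write writer_id only if the cell is still
        # invalid ("garbage"), then return whatever the cell now holds.
        if (a, b) not in self.cells:          # validity check
            self.cells[(a, b)] = writer_id    # first successful writer
        return self.cells[(a, b)]

bb = BulletinBoard()
c1 = bb.set_and_read(3, 7, writer_id=42)   # first writer sets the cell
c2 = bb.set_and_read(3, 7, writer_id=99)   # later writer cannot change it
assert c1 == c2 == 42
```

Because every processor carries a distinct integer and writes at most once, two lookups of the same pair (a, b) always return the same value, which is exactly what Property 6.1 relies on.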

6.1. Computation within One Group

To simplify the presentation, we fix l = (1, m_1, m_2) and consider group G^(1,m_1,m_2). The same algorithm generalizes to other groups and other clusters. At the end of the computation, the processor in charge of x has computed the names of the heterogeneous strings h-β[1:2^γ] corresponding to matrices A[x_1:x_1, x_2:x_2 + m_1 − 1, x_3:x_3 + m_2 − 1, x_4:x_4 + 2^γ − 1], for 0 ≤ γ ≤ (log n_4) + 1. Since the computation is a straightforward extension of the naming technique by Karp et al. [23], we will limit ourselves to an outline, giving details only for the parts that are needed later on to explain the communication among processors. The use of the Bulletin Boards is presented in the procedure below and explained in the paragraphs following it.

PROCEDURE HNAMES. (1) Round k = 0. The processor in charge of x computes the name of the heterogeneous character h-α equal to A[x_1:x_1, x_2:x_2, x_3:x_3, x_4:x_4]. Since h-α is a character of Σ and we assume that all of those characters are integers in [1, N^2/n_4], the name of h-α is the character itself.

(2) Round k, 0 < k < d = 4. Consider step γ for some γ, 1 ≤ γ ≤ ⌈log l_k⌉. Since we have fixed l = (1, m_1, m_2), the processor in charge of x is idle during the first round and l_2 = m_1. To simplify notation, let k = 2. The processor computes the name of the heterogeneous character h-α corresponding to P = A[x_1:x_1, x_2:x_2 + q − 1, x_3:x_3, x_4:x_4], where q = min(2^γ, m_1). Notice that h-α is a heterogeneous character in Σ[q]. That is done as follows. Let f be the largest integer such that 2^f < q. Divide P into two possibly overlapping submatrices P_1 = A[x_1:x_1, x_2:x_2 + 2^f − 1, x_3:x_3, x_4:x_4] and P_2 = A[x_1:x_1, y:y + 2^f − 1, x_3:x_3, x_4:x_4], where y = x_2 + q − 2^f. Assume that the names a and b of the two heterogeneous characters h-α_1 and h-α_2 corresponding to P_1 and P_2 have been computed, respectively. (If P_2 is outside the boundaries of A, we assume that a unique integer in [N^2/n_4 + 1, Q] is assigned to h-α_2.) Name(h-α) ← BB^1_γ[a, b]. For k = 3 and step γ, h-α corresponds to


A[x_1:x_1, x_2:x_2 + m_1 − 1, x_3:x_3 + q − 1, x_4:x_4], where q = min(2^γ, m_2). It is a heterogeneous character in Σ[m_1, q]. The computation is similar to the previous round, except that BB^2_γ is used.

(3) Round k = d = 4. Consider step γ, 1 ≤ γ ≤ (log n_4) + 1. The processor in charge of x computes the name of the heterogeneous string h-β[1:2^γ] corresponding to P = A[x_1:x_1, x_2:x_2 + m_1 − 1, x_3:x_3 + m_2 − 1, x_4:x_4 + 2^γ − 1]. As in the previous rounds and steps, that is done by dividing P into two submatrices. Let a and b be the names of the two heterogeneous strings corresponding to those two matrices. Name(h-β[1:2^γ]) ← BB^4_γ[a, b].

Consider round k = 2, step γ. Notice that the processor in charge of x gives a name to the heterogeneous character corresponding to A[x_1:x_1, x_2:x_2 + q − 1, x_3:x_3, x_4:x_4], where q = min(2^γ, m_1). This matrix has only one component of its shape that is larger than one. This is the reason that the processor uses BB^1_γ (γ denotes the step). In general, when a processor in some cluster gives a name to a submatrix of A during round k, 0 < k < d, and that matrix has c ≤ k components in its shape larger than one, then BB^c_γ is used (again, γ denotes the step and, for any c, we have enough BBs: γ ≤ ⌈log l_k⌉ ≤ ⌈log n_c⌉ because n_c ≥ n_k). This policy will assure that all valid entries in each BB^c_γ are distinct, 1 ≤ c ≤ d − 1 = 3, as we now show.

Indeed, fix a processor in a cluster. One can easily verify that the number of components larger than one in the shape of the matrix being assigned a name during active round k and step γ, 0 < k < d, is increasing with the active rounds. Notice that within each active round also the step is increasing, so each processor will access BB^c_γ, 1 ≤ c ≤ d − 1 = 3, at most once. Since each processor has a distinct integer assigned to it and it can write at most once in those BBs, all valid entries in each BB^c_γ are distinct, 1 ≤ c ≤ d − 1 = 3. Using similar arguments, one can show that all valid entries in BB^d_γ, d = 4, are also distinct.
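As a point of comparison, the doubling scheme underlying HNAMES can be sketched sequentially on an ordinary one-dimensional string: the name of each power-of-2-length substring is obtained by looking up the pair of names of its two halves in a shared table, which plays the role of the BB^k_γ boards. This is a sketch of the Karp-Miller-Rosenberg technique for the 1-D case only, not of the multidimensional procedure, and the function name is ours:

```python
def kmr_names(s):
    """Sequential sketch of Karp-Miller-Rosenberg naming on a string:
    the name of a length-2^g substring is derived from the pair of names
    of its two halves; a shared dictionary (the "bulletin board" for
    that level) guarantees equal substrings receive equal names."""
    n = len(s)
    names = {1: {i: s[i] for i in range(n)}}  # round 0: a character names itself
    length = 2
    while length <= n:
        bb, half, cur = {}, length // 2, {}
        for i in range(n - length + 1):
            pair = (names[half][i], names[half][i + half])
            if pair not in bb:        # first-writer-wins on the pair
                bb[pair] = len(bb)
            cur[i] = bb[pair]
        names[length] = cur
        length *= 2
    return names

names = kmr_names("abab")
# equal substrings of equal length receive equal names
assert names[2][0] == names[2][2]   # "ab" at 0 and at 2
assert names[2][0] != names[2][1]   # "ab" versus "ba"
```

In the paper the same idea is applied per shape component and round, with possibly overlapping halves when the target length q is not a power of 2, and with processors writing to the shared boards concurrently.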

Remark 6.2. For later reference, we point out that the computation of the name of a single h-β[1:n_4 + 1] can be thought of as a bottom-up computation over a balanced binary tree having O((n_4 + 1)m_1 m_2) nodes. This observation generalizes to the computation of the name of a heterogeneous string defined over the alphabet Σ[m] and of length n_4 + 1, m = (m_1, m_2, ..., m_j). In that case, the tree has O((n_4 + 1)m_1 ··· m_j) nodes.

Remark 6.3. The names of all h-β[1:2^γ] can be suitably stored in a table and retrieved in constant work by one processor each. The same fact holds for the names computed by other clusters of processors.


6.2. Communication with Other Groups in the Cluster

We will show how processors within a cluster communicate so that Property 6.1 is inductively enforced, and we also sketch a proof of correctness. Again, to simplify the presentation, we will work with a specific group of processors. Fix l' = (m_1, 1, m_2) and consider group G^(m_1,1,m_2). At the end of the computation, the processor in charge of x' = (x'_1, x'_2, x'_3, x'_4) has computed the names of the heterogeneous strings h-β'[1:2^γ] corresponding to matrices A[x'_1:x'_1 + m_1 − 1, x'_2:x'_2, x'_3:x'_3 + m_2 − 1, x'_4:x'_4 + 2^γ − 1], for 0 ≤ γ ≤ (log n_4) + 1. The computation proceeds as in the previous subsection. In what follows, we will discuss how processors in G^(1,m_1,m_2) and G^(m_1,1,m_2) communicate.

(a') Consider round k' = 0. The processor in charge of x' computes the name of the heterogeneous character h-α' equal to A[x'_1:x'_1, x'_2:x'_2, x'_3:x'_3, x'_4:x'_4] (again, the name is the character itself). With reference to (1) in Procedure HNAMES, we will have that the name of h-α is equal to the name of h-α' if and only if they are the same character of Σ = Σ[1]. So, Property 6.1 holds for those two characters of Σ[1]. This is the base of the induction.

(b') Consider round k', 0 < k' < d = 4, and step γ', 1 ≤ γ' ≤ ⌈log l'_{k'}⌉. To simplify notation, let k' = 1. Since we have fixed l' = (m_1, 1, m_2), the first round will take place. The processor in charge of x' computes the name of the heterogeneous character h-α' corresponding to P' = A[x'_1:x'_1 + q' − 1, x'_2:x'_2, x'_3:x'_3, x'_4:x'_4], where q' = min(2^γ', m_1). (Notice that h-α' is a heterogeneous character in Σ[q'].) Name(h-α') ← BB^1_{γ'}[a', b'], where a' and b' are the names of the heterogeneous characters h-α'_1 and h-α'_2 (coming out of the division of P' in two). With reference to (2) in Procedure HNAMES, notice that during round k = 2, step γ, the processor in charge of x in G^(1,m_1,m_2) assigns a name to h-α (a heterogeneous character of Σ[q]). We show that, when both characters are in the same heterogeneous alphabet, h-α ≡_H h-α' if and only if Name(h-α) = Name(h-α'). Notice that Σ[q] = Σ[q'] implies q = q' which, in turn, implies γ = γ'. Now, we have that h-α ≡_H h-α' if and only if h-α_i ≡_H h-α'_i, i = 1, 2 (which is true because Property 6.1 holds inductively), if and only if a = a' and b = b'. Since all valid entries in BB^1_γ are distinct, we have a = a' and b = b' if and only if Name(h-α) = Name(h-α'). Therefore, Property 6.1 holds for those two heterogeneous characters. Using similar arguments, one can show that Property 6.1 holds for all characters to which processors in the cluster G^(1,m_1,m_2), G^(m_1,1,m_2) and G^(m_1,m_2,1) assign names during round k', 0 < k' ≤ d − 1 = 3.


(c') Consider round k' = d = 4, step γ'. The processor in charge of x' computes the name of the heterogeneous string h-β'[1:2^γ'] corresponding to P' = A[x'_1:x'_1 + m_1 − 1, x'_2:x'_2, x'_3:x'_3 + m_2 − 1, x'_4:x'_4 + 2^γ' − 1]. Name(h-β'[1:2^γ']) ← BB^4_{γ'}[a', b'], where a' and b' are the names of the heterogeneous strings coming out of the division of P' in two. With reference to (3) in Procedure HNAMES, we have that h-β'[1:2^γ'] and h-β[1:2^γ'] are heterogeneous strings of the same length and both over the alphabet Σ[m_1, m_2]. In order for Property 6.1 to hold, we need to show that those two heterogeneous strings are equivalent if and only if they have the same name. The proof is a continuation of the induction in (b') above and it uses the same ideas. The details are omitted. We can conclude that Property 6.1 holds for all the required substrings in Y[m_1, m_2].

LEMMA 6.1. The names of the heterogeneous substrings, having length a power of 2, of strings in Y[m] can be computed in O(log N) time by N^2/n_4 processors, for all m ∈ ℳ. They satisfy Property 6.1.

Proof. The proof that all names satisfy Property 6.1 is by induction on the number of rounds and steps. We have outlined the proof only for the set Y[m_1, m_2]. For other sets, the proof is similar and therefore omitted. As for the time analysis, each step can be done in constant time by a group of processors and the first round also takes constant time. So, the total time for this part of the computation is O(Σ_{k=1}^{d} ⌈log n_k⌉) = O(log N).

7. REFINING PHASE

The parallel construction of the index ℋ_A can now be easily done through the use of Algorithm Refine(G, ≡, Y, s, n, BB, Q, Name) presented in Section 2.2. For each CT[m] ∈ ℋ_A, m ∈ ℳ, we use a different instance of Refine. All those instances share a new Bulletin Board BB, which is of size Q × Q, where now Q = N^2/n_4. For a given CT[m], the remaining parameters of Refine are set as follows. The set Y is Y[m] with s = |Y[m]| and n = n_4 + 1. The alphabet G is the alphabet Σ[m] of the heterogeneous characters, and the equivalence relation ≡ becomes ≡_H. The naming function is the one computed in Section 6. Moreover, the cluster of processors G^l, for l ∈ ℰ[m], is in charge of computing CT[m].

THEOREM 7.1. Consider a matrix A of size N = n_1 × n_2 × n_3 × n_4. ℋ_A can be correctly built in O(log N) time with N^2/n_4 processors.

Proof. By Lemma 6.1, the names of the needed heterogeneous stringscan be computed in the work bound stated by the theorem. Finally, since


each name can be accessed in constant work (see Remark 6.3), each CT[m] can be built in O(log N) time with s = |Y[m]| processors (by Theorem 2.1). But Σ_m |Y[m]| = N^2/n_4.

Remark 7.1. Recall from Section 4 that the arcs and leaves of CT[m] are labeled with substrings and strings in Y[m]. The arcs and leaves of the refinement trees computed by Refine are labeled in the same way. The information stored on those arcs and leaves is not relevant here. However, it is important to point out that we can encode that information with integers in the same way as in Remark 4.6.

8. DECISION QUERY

Given an r-dimensional matrix B, of shape m_1 × m_2 × ··· × m_r, we want to check whether B* occurs in A. According to Theorem 4.2, the required check can be done by "searching" in at most two distinct trees in ℋ_A. Moreover, the choice of those two trees depends on r only (see the proof of Theorem 4.2). We will explicitly discuss the case r = 3 since the other cases in Theorem 4.2 are similar.

Remark 8.1. Recall from Section 1.2 that, for ease of exposition, we are presenting our results for the case d = 4. So r ≤ d = 4. By Remark 4.9, Theorem 4.2 generalizes to the case 4 < r ≤ d. The query procedure presented here works also in that more general case.

Assume that r = 3 (and, again, d = 4). We need to check whether condition (i) or (ii) in Lemma 4.1 holds. We show only how to check (ii), since the method is similar for (i). In this section, let h-γ denote the heterogeneous string h-γ_2 in Lemma 4.1. We need to check whether (m_1, m_2) ∈ ℳ and h-γ has extended locus in CT[m_1, m_2]. The first condition can be easily checked in O(r) ≤ O(log M) time by one processor. We omit the details. In Section 8.1, we show how to find the extended locus of h-γ in CT[m_1, m_2], if it exists, in O(log M) time with O(M/log M) processors. Since similar bounds hold also to check condition (i), we get:

THEOREM 8.1. It can be checked on-line whether a pattern matrix B occurs as an r-dimensional submatrix of A (i.e., B* occurs in A) in O(log M) time with M/log M processors.

8.1. Searching in CT[m_1, m_2]

w xWe now outline how to find the extended locus of h y g in CT m , m ,1 2Žif it exists. We use the method outlined in Section 2.3 which we have


described in terms of a "parametric" string γ and "parametric" compacted trie). Partition h-γ in a way analogous to the partition of γ in Section 2.3. That gives a set of heterogeneous strings 𝒫 = {h-γ_0, h-γ_1, ..., h-γ_e}, with e ≤ ⌊log m_3⌋. Since h-γ is a heterogeneous string over Σ[m_1, m_2], all the heterogeneous strings in 𝒫 are also over that alphabet. Moreover, h-γ_j has length 2^{r_j}, where Σ_{j=0}^{e} 2^{r_j} = m_3. Let Y[m_1, m_2] be the set Y of Section 2.3 and let CT[m_1, m_2] be the trie CT, also from that section. In order to apply ExLocus to this set and compacted trie, we need the following:

(a) The refinement trees obtained from the construction of CT[m_1, m_2]. They are available after the preprocessing step.

(b) The names of the heterogeneous strings in 𝒫. They must be consistent with the names assigned to the prefixes of heterogeneous strings in Y[m_1, m_2] in the following sense: h-γ_j must have the same name as a prefix of a heterogeneous string in Y[m_1, m_2] of the same length if and only if they are equivalent with respect to ≡_H. That is, Property 6.1 must be extended to hold for the heterogeneous strings in 𝒫. In Section 8.2, we will outline how that is done in O(log M) time with O(M/log M) processors.

(c) We also need an optimal work procedure to check for the equivalence of two heterogeneous strings. Indeed, as briefly outlined in Section 2.3, ExLocus finds a candidate prefix h-η from the strings in Y[m_1, m_2] and its extended locus u in CT[m_1, m_2] such that h-η ≡_H h-γ if and only if h-γ has u as extended locus. In order to perform this equivalence test, we observe that h-η encodes a (d = 4)-dimensional submatrix A' of A in terms of its source. For the sake of discussion, assume that A' is of shape l'_1 × l'_2 × l'_3 × m_3 and that it originates in position x' of A (its fourth dimension must be m_3 because h-η is a string of that length). Omitting details, A' can be identified through the integers labeling the refinement trees (by Remark 7.1, they are available), and in the same work bound in which ExLocus identifies h-η. A' is not reported explicitly; rather, it is identified by giving the integers encoding l' = (l'_1, l'_2, l'_3), x' and m_3. Now, let C be the embedding of B that has the same shape as A'. Then, the check h-γ ≡_H h-η consists of checking whether C = A', where the matrix A' is given through l' = (l'_1, l'_2, l'_3), x' and m_3. In Subsection 8.3, we show how that can be done in O(log M) time with O(M/log M) processors.

We can now apply Theorem 2.2 to get that we can find the extended locus of h-γ in CT[m_1, m_2] in O(log M) time with O(M/log M) processors.


8.2. Consistent Naming

We now outline how to assign a consistent name to h-γ_j. The computation proceeds in exactly the same way as the computation of names discussed in Section 6. However, the entries of the Bulletin Boards are read, but not modified. Recall that the Bulletin Boards are not initialized, so one can read an entry that has not been set during preprocessing, i.e., "garbage." In that case, we can be sure that h-γ_j is not a substring of any string in Y[m_1, m_2] and therefore h-γ does not have an extended locus in CT[m_1, m_2]. If no "garbage" entry is read, then one can be sure that the name of h-γ_j is consistent (the computation is the same as if the name of h-γ_j were computed during the preprocessing step).
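A sequential sketch of this read-only reuse of the naming tables, again on a one-dimensional string for brevity: preprocessing fills the pair tables, and query-time naming only reads them, so a missing ("garbage") entry immediately certifies that the pattern piece is not a substring of the text. Both function names are illustrative, and this is a 1-D stand-in for the heterogeneous-string computation:

```python
def build_tables(text):
    """Preprocessing sketch: for every power-of-2 length, build the pair
    table mapping (left-name, right-name) to the name of the whole."""
    levels = {1: {i: ord(c) for i, c in enumerate(text)}}
    tables = {}
    length = 2
    while length <= len(text):
        half, bb, cur = length // 2, {}, {}
        for i in range(len(text) - length + 1):
            pair = (levels[half][i], levels[half][i + half])
            bb.setdefault(pair, len(bb))   # first-writer-wins
            cur[i] = bb[pair]
        tables[length], levels[length] = bb, cur
        length *= 2
    return tables

def pattern_name(pattern, tables):
    """Query-time naming: the same doubling computation, but the tables
    are only read, never written.  A pair never entered during
    preprocessing ("garbage") proves the piece is not a text substring."""
    level = {i: ord(c) for i, c in enumerate(pattern)}
    length = 2
    while length <= len(pattern):
        half, cur = length // 2, {}
        for i in range(len(pattern) - length + 1):
            pair = (level[i], level[i + half])
            if pair not in tables[length]:   # the "garbage" case
                return None                  # no occurrence is possible
            cur[i] = tables[length][pair]
        level = cur
        length *= 2
    return level[0]

tables = build_tables("abracadab")
assert pattern_name("abra", tables) is not None   # occurs as a substring
assert pattern_name("zzzz", tables) is None       # provably does not occur
```

The consistency requirement of item (b) corresponds to both phases drawing names from the same tables.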

We now discuss how to allocate processors. Notice that Remark 6.2 applies to h-γ_j. Therefore, the computation of its name can be considered as a bottom-up computation over a balanced binary tree of O(2^{r_j} m_1 m_2) nodes. Consider the forest obtained by all the trees corresponding to the heterogeneous strings in 𝒫. The total number of nodes is O(M) = O(m_1 m_2 m_3) and the height of the tallest tree is O(log M). Therefore, the computation of the names for the heterogeneous strings in 𝒫 is a standard parallel tree computation on that forest (e.g., see [22]) and it can be performed in O(log M) time with M/log M processors.

8.3. Checking A' = C

Recall that the submatrix A' is given by l', x', and m_3. The check A' = C boils down to checking whether C[y_1, y_2, y_3, y_4] = A[x'_1 + y_1 − 1, x'_2 + y_2 − 1, x'_3 + y_3 − 1, x'_4 + y_4 − 1], for all y = (y_1, y_2, y_3, y_4) ∈ I' = [1:l'_1] × [1:l'_2] × [1:l'_3] × [1:m_3]. To simplify the discussion, let us assume that l'_3 = 1 (while l'_1 = m_1 and l'_2 = m_2, because l' ∈ ℰ[m_1, m_2]). As stated in Section 1, A is stored in column major order (as established by function f). Therefore, to compare A' with C, we must pick the entries of A' from the table storing A and compare them with the corresponding entries of C. To perform that task in the required work bounds, there are some "difficulties," as we now explain.

A[x'_1 + y_1 − 1, x'_2 + y_2 − 1, x'_3 + y_3 − 1, x'_4 + y_4 − 1] is stored in entry f̂(y) = f(x'_1 + y_1 − 1, x'_2 + y_2 − 1, x'_3 + y_3 − 1, x'_4 + y_4 − 1), for all y in I'. Notice that f̂(y) = f(x') + f(y). f(x') can be accessed in constant work because we know the integer encoding x'. But f(y) cannot be accessed in constant work, because we do not know the integer encoding y and we cannot compute it, unless we want to spend O(dM) work instead of O(M). Similarly, C[y_1, y_2, y_3, y_4] = B[y_1, y_2, y_4] and B, in turn, is stored in column major order according to the map function g(y) = (y_4 − 1)(m_1 × m_2) + (y_2 − 1)m_1 + (y_1 − 1). So the difficult part in comparing A' with C is to pick entry g(y) in the table storing B and access f(y) in constant


work based only on the value of g(y) and, as we will see, the integer encoding l' (which is available). Once we have f(y), we can compute f̂(y) in constant work and compare B[y_1, y_2, y_4] with the corresponding entry of A'. We show how that can be achieved by precomputing some information during preprocessing.

The precomputed information is as follows. Fix l = (l_1, l_2, l_3) ∈ I. For each position y of A compute g_l(y), which is (y_4 − 1)(l_1 × l_2 × l_3) + (y_3 − 1)(l_1 × l_2) + (y_2 − 1)l_1 + (y_1 − 1), except that we do not compute the parts corresponding to l's equal to one. For instance, when l_3 = 1 and both l_2 and l_1 are not equal to one, g_l(y) = (y_4 − 1)(l_1 × l_2) + (y_2 − 1)l_1 + (y_1 − 1). Store f(y) in a table indexed by l and g_l(y). This computation can be done in O(d) = O(log N) time with N processors. Since |I| = N/n_4, the total work is still bounded as in Theorem 7.1.

Now, for the check A' = C, consider again the computation of f̂(y) = f(x'_1 + y_1 − 1, x'_2 + y_2 − 1, x'_3 + y_3 − 1, x'_4 + y_4 − 1), for all y in I'. Since g_{l'}(y) = g(y), f(y) can be accessed, as desired, in constant work by knowing only the integer encoding l' and the value of g(y). Therefore, the computation of f̂(y), for all y ∈ I', can obviously be done in O(log M) time with O(M/log M) processors. In the same time and work bounds, we can also check whether A' = C.
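The two addressing functions above can be sketched as follows, using 0-based coordinates for brevity (so the "− 1" terms of f and g_l disappear); the names `col_major_index` and `g_l` are ours:

```python
def col_major_index(pos, dims):
    """Sketch of f: map a 0-based position (y_1, ..., y_d) in a matrix
    with the given dimensions to its offset in column-major storage."""
    idx, stride = 0, 1
    for y, n in zip(pos, dims):
        idx += y * stride
        stride *= n
    return idx

def g_l(pos, shape):
    """Sketch of g_l (Section 8.3): the same map, except that the shape
    components equal to 1 are skipped, so every position of a degenerate
    submatrix gets the same compact index as the corresponding position
    of the pattern embedding."""
    idx, stride = 0, 1
    for y, n in zip(pos, shape):
        if n > 1:                 # omit the parts for l's equal to one
            idx += y * stride
            stride *= n
    return idx

# With shape (2, 3, 1, 4) the third coordinate is ignored, so g_l agrees
# with the column-major index of the squeezed position:
assert g_l((1, 2, 0, 3), (2, 3, 1, 4)) == col_major_index((1, 2, 3), (2, 3, 4))
```

Indexing the precomputed table by l and g_l(y) is what lets f(y), and hence f̂(y), be fetched in constant work during the comparison of A' with C.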

9. ENUMERATION QUERY

When we need to report all the enum submatrices of A that are equal to matrices in B*, we need some additional preprocessing of the trees CT[m] ∈ ℋ_A, for all m ∈ ℳ. For each CT[m], we list its leaves and their labels, from left to right, and store them in that order into a LEAVES array (each tree has its own array). Moreover, for each internal node in CT[m], we set pointers into the LEAVES array to the leftmost and rightmost leaves in the subtree rooted at that node. For each tree, this computation can be done by resorting to standard parallel techniques, such as Euler Tour Techniques and parallel evaluation of arithmetic expressions (see for instance [10, 22, 35]). The total time and processors needed (for all trees in ℋ_A) are still bounded as in Theorem 7.1.

As explained in Section 8, once we have checked whether B* occurs in A, we get (at most) two nodes u and u' in (at most) two distinct trees of ℋ_A such that all submatrices of A that have B as source are labels of the leaves in the subtrees rooted at u and u', respectively. Then we use the LEAVES arrays of those two trees to report all the required submatrices in O(enum) work with p processors, 1 ≤ p ≤ enum. We have:
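A sequential sketch of the LEAVES machinery: one depth-first traversal (standing in for the parallel Euler-tour computation) numbers the leaves left to right and records, for each node, the interval of its subtree's leaves, after which reporting costs work proportional to the output size. The dict-based tree representation and function names are illustrative:

```python
def build_leaves_index(tree, root):
    """Number the leaves left to right into a LEAVES array and give
    every node the interval [lo, hi] of its subtree's leaves."""
    leaves, interval = [], {}

    def dfs(u):
        kids = tree.get(u, [])
        if not kids:                       # a leaf takes the next slot
            interval[u] = (len(leaves), len(leaves))
            leaves.append(u)
            return
        for v in kids:
            dfs(v)
        interval[u] = (interval[kids[0]][0], interval[kids[-1]][1])

    dfs(root)
    return leaves, interval

def report(u, leaves, interval):
    lo, hi = interval[u]
    return leaves[lo:hi + 1]               # O(output size) work

tree = {"r": ["a", "b"], "a": ["x", "y"], "b": ["z"]}
leaves, interval = build_leaves_index(tree, "r")
assert leaves == ["x", "y", "z"]
assert report("a", leaves, interval) == ["x", "y"]
```

Splitting the slice among p processors, 1 ≤ p ≤ enum, gives the O(enum)-work bound of Theorem 9.1.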


THEOREM 9.1. Assume that the forest ℋ_A is available. We can build auxiliary data structures within the time and processor bounds of Theorem 7.1. Once those data structures are available and we have answered the decision query, we can report the enum submatrices of A that have B as source in O(enum) work with p processors, 1 ≤ p ≤ enum.

10. OCCURRENCE QUERY

We are interested in reporting the positions x of A such that at least one of the matrices in B* is equal to a submatrix of A originating in x. We are not interested in which particular embedding occurs in that position.

Consider a tree CT[m] ∈ ℋ_A. Recall from Section 4.2 that its leaves are labeled by means of "shapes" l and positions x. Since we are interested only in the positions x, we ignore the "shape" l from the labels of the leaves. Moreover, for ease of notation, we refer to the positions x as integers in [1, N]. Now, we may have many leaves of CT[m] labeled with the same integer and, in order to answer the occurrence query, we need to "eliminate" duplicate copies of the same integer. In this section we show the following.

THEOREM 10.1. Assume that the forest ℋ_A is available. We can build auxiliary data structures in O(log N) time with O(N^2/n_4) processors so that, given a tree CT[m] ∈ ℋ_A and a node u in that tree, we can report the g distinct positions of A at the leaves of the subtree rooted at u in O(g) work with p processors, 1 ≤ p ≤ g. The space required by the additional data structures is O(N^2 log N/n_4).

Once we have Theorem 10.1, it is easy to report the occ distinct positions of A where at least one embedding of B occurs. Indeed, let u and u′ be the nodes of the previous section. We apply Theorem 10.1 to u to get a list of distinct positions of A that are at the leaves of the subtree rooted at u. We do the same for u′. So we obtain two lists of positions in which each position is repeated at most twice. We first eliminate duplicates from the two lists and then extract only the remaining valid entries from the two lists and copy them into an output array. That can be done in O(log occ / log log occ) time and optimal work by techniques analogous to the computation of prefix sums [11]. Notice that if we allow the positions of A reported in output to be "intermixed" with "garbage," we can eliminate duplicates from the two lists in O(occ) work with p processors, 1 ≤ p ≤ occ, and give the resulting lists as output.


THEOREM 10.2. Assume that the trees of H_A have been processed as in Theorem 10.1. Once we have checked whether B* occurs in A, we can report the occ positions of A in which an embedding of B occurs in O(log occ / log log occ) time and optimal work.

Remark 10.1. We point out that the approach outlined here to answer the occurrence query using H_A cannot work for R_A. Indeed, each tree of R_A already has distinct positions of A at its leaves. In the case of R_A, the slowdown is due to the fact that we need to search in all of its trees.

10.1. Distinct Elements under a Tree

Let us consider a tree CT[m] ∈ H_A and recall from Section 4.2 that it has O(N E[m]) nodes. We transform CT[m] into a binary tree T by "expanding" the q children of any given node into a complete and balanced binary tree of q leaves, so that T has m = O(N E[m]) nodes. This transformation can be done in O(log m) time and optimal work (see [22] and references therein). It is worth noting that, given a node u in CT[m], we can identify the node û corresponding to it in T in constant work. Moreover, all leaves in the subtree rooted at u are leaves in the subtree rooted at û. So we can work with tree T instead of CT[m].

We want to preprocess tree T so that we can answer the following query: given a node v ∈ T, report the list of all distinct integers associated with the leaves of the subtree rooted at v. We point out that this problem has some analogy with the Subtree Max Gap Problem introduced in [7].

We need some notation and definitions. Given two lists of integers L_1 and L_2, let L_1 − L_2 be the difference list consisting of the integers that are in L_1 but not in L_2. For each node v in T, let s(v) be the number of nodes in its subtree (including v), and let LT(v) and RT(v) be the left and right children of v. We say that edge (v, LT(v)) is light and edge (v, RT(v)) is heavy when s(LT(v)) ≤ s(RT(v)). If s(LT(v)) > s(RT(v)), then (v, LT(v)) is heavy and (v, RT(v)) is light. Let a solid path consist only of heavy edges in T, and recall from [34] that solid paths partition T. Moreover, there is exactly one solid path from any node v leading to some leaf in T (see Fig. 9). For each node v, there are at most log m light edges on the path from v to root(T).
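The heavy/light classification and the solid paths it induces can be sketched sequentially as follows (a minimal illustration, not the paper's parallel procedure; the array-based node representation, function names, and tie-breaking convention are ours):

```python
# Illustrative sequential sketch of the heavy/light edge classification
# and the solid paths it induces (not the paper's parallel procedure).
# Nodes are 0..m-1; left[v] / right[v] give the children, -1 for none.

def subtree_sizes(root, left, right):
    """s(v) = number of nodes in the subtree rooted at v, including v."""
    s = [1] * len(left)
    order, stack = [], [root]
    while stack:                      # iterative DFS: collect nodes,
        v = stack.pop()               # then sum sizes bottom-up
        order.append(v)
        for c in (left[v], right[v]):
            if c != -1:
                stack.append(c)
    for v in reversed(order):
        for c in (left[v], right[v]):
            if c != -1:
                s[v] += s[c]
    return s

def heavy_child(v, left, right, s):
    """Child reached by the heavy edge leaving v; by the convention above,
    (v, RT(v)) is heavy when s(LT(v)) <= s(RT(v))."""
    l, r = left[v], right[v]
    if l == -1:
        return r
    if r == -1:
        return l
    return r if s[l] <= s[r] else l

def solid_path_from(v, left, right, s):
    """Follow heavy edges from v down to a leaf: the portion of a solid
    path starting at v (solid paths partition T)."""
    path = [v]
    while left[v] != -1 or right[v] != -1:
        v = heavy_child(v, left, right, s)
        path.append(v)
    return path
```

Since every light edge at least halves the subtree size, the O(log m) bound on light edges per root-to-node path is immediate from this classification.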

We let il(v) denote the list of the distinct integers associated with the leaves of the subtree rooted at v. For a leaf, its list consists of the integer associated with the leaf. Rather than storing il(v) explicitly, we "distribute" it over the solid path from v to the leaf at the end of that path. When v is an internal node, we do the following: if s(LT(v)) ≤ s(RT(v)), we store the difference list [il(LT(v)) − il(RT(v))] explicitly at v. Similarly, if s(LT(v)) > s(RT(v)), we store the difference list [il(RT(v)) − il(LT(v))] explicitly at v. When v is a leaf, its difference list consists of the integer associated with that leaf and we store this list explicitly (see Fig. 9).

FIG. 9. A tree partitioned into heavy edges (solid lines) and light edges (dotted lines). The difference lists of all internal nodes are shown in curly brackets. The difference lists of the leaves are the integers stored at the leaves.

Property 10.1. List il(v) is obtained by the union of all the difference lists on the solid path from v leading to a leaf in T. Since there are O(log m) light edges in each leaf-to-root path, all the difference lists take a total of O(m log m) space.

Proof. Using the fact that either il(v) = [il(LT(v)) − il(RT(v))] ∪ il(RT(v)) or il(v) = [il(RT(v)) − il(LT(v))] ∪ il(LT(v)), the first part is a simple induction on the length of the solid path (starting at the leaf). The total space is O(m log m) because each integer in a leaf can belong to no more than O(log m) lists.
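The storage scheme and Property 10.1 can be illustrated with a small sequential sketch (Python sets stand in for the paper's lists; note that the full lists il(v) are materialized below only as scaffolding to build the difference lists, something the actual data structure never does):

```python
# Illustrative sketch of the difference-list scheme of Section 10.1.
# val[f] is the integer at leaf f. An internal node v stores
# diff[v] = il(light child) - il(heavy child); by Property 10.1, il(v)
# is the union of the diff lists on the solid path from v to a leaf.
# Assumes every internal node has two children (true after binarization).

def build_diff_lists(root, left, right, s, val):
    il, diff = {}, {}
    def go(v):
        l, r = left[v], right[v]
        if l == -1 and r == -1:          # leaf: its own integer
            il[v] = diff[v] = {val[v]}
            return
        go(l); go(r)
        heavy, light = (r, l) if s[l] <= s[r] else (l, r)
        diff[v] = il[light] - il[heavy]  # stored explicitly at v
        il[v] = il[l] | il[r]            # scaffolding only
    go(root)
    return diff

def il_from_solid_path(v, left, right, s, diff):
    """Recover il(v) as the union of the difference lists met while
    following heavy edges from v down to a leaf (Property 10.1)."""
    out = set()
    while True:
        out |= diff[v]
        l, r = left[v], right[v]
        if l == -1 and r == -1:
            return out
        v = r if s[l] <= s[r] else l     # follow the heavy edge
```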

The computation of difference lists consists of several tasks (some of them are quite standard and we omit the details about their implementation).

• We compute the Euler tour of T and number its nodes accordingly. This can be done in O(log m) time and O(m) work [35]. Based on that tour, we assign a number E(f) to each leaf f, where E(f) is the first time f is visited during the tour. We assign two numbers FE(v) and LE(v) to any internal node v, corresponding to the first and last time v is visited during the tour. We also process the tree T so that we can answer Lowest Common Ancestor queries (LCA for short). This can be done in O(log m) time and O(m) work [33].
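The Euler-tour numbering can be sketched with a simple recursive traversal (the paper computes the tour in parallel [35]; this sequential version only fixes the meaning of E, FE, and LE). By construction, FE(v) ≤ E(f) ≤ LE(v) for every leaf f below v, the property used in the stage processing:

```python
# Illustrative sequential computation of the Euler-tour numbers:
# E(f) for a leaf f (its single visit time), FE(v)/LE(v) for any node
# (first and last visit times). The tour makes 2m-1 node visits in total.

def euler_numbers(root, left, right):
    E, FE, LE = {}, {}, {}
    t = 0
    def tour(v):
        nonlocal t
        FE[v] = t
        t += 1
        children = [c for c in (left[v], right[v]) if c != -1]
        for c in children:
            tour(c)
            t += 1          # time slot for revisiting v after child c
        LE[v] = t - 1       # last time v was visited (== FE[v] at a leaf)
        if not children:
            E[v] = FE[v]
    tour(root)
    return E, FE, LE
```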

Page 39: Multi-Dimensional Pattern Matching with Dimensional Wildcards: Data Structures and Optimal On-Line Search Algorithms

PATTERN MATCHING WITH DIMENSIONAL WILDCARDS 261

• We compute the partition of T into solid paths, such that each linked list forming a solid path is ranked. Moreover, for each light edge e, we compute its distance d(e) from the root, i.e., the number of light edges (including e) on the path from e to root(T). We also link e to the closest such edge (towards the root). Once the Euler tour is known, we can use algorithms similar to prefix sums to perform those tasks in O(log m) time and O(m log m) work.

• For each integer i stored (in multiple copies) at the leaves of T, we compute an array LEAF(i) that contains the leaves storing i, as they appear in the postorder visit of T. These arrays can be obtained by bucket sorting pairs (i, E(f)), where f is a leaf. The time and work required for this computation are still bounded by O(log m) and O(m log m), respectively (see for instance [22] and references therein). We compute a boolean table S in constant time and O(m) work, such that S(w, i) = 1 if and only if node w is the LCA of any two adjacent nodes in LEAF(i). The space in table S is not initialized and we access S by using the same rules of the Bulletin Boards (see Section 6).

• We obtain the difference lists by processing light edges. All the light edges e ∈ T are processed in decreasing order of their distance d(e) from the root. Edges having the same distance value are processed in parallel in the same stage. Consequently, we have a total of O(log m) stages (because there are no more than O(log m) light edges on any path of T). Initially, a processor is assigned to each leaf of T. We maintain the following invariant for processing an edge e = (w, v), where w = parent(v): all the light edges of greater distance value than d(e) have already been processed, and the processors in charge of the leaves in the subtree rooted at v know that it is time to process e (there is no conflict for processors among light edges, since all light edges in the subtree rooted at v have already been processed, by the invariant). Once the processing of e is done, all processors in its subtree migrate to the light edge e′ that follows e on the path to the root (e has a link to e′).

• We now concentrate on the processing of e (i.e., stage k, where k = d(e)). All processors in charge of the leaves in the subtree rooted at v do the following. Processors that hold the same integer i elect a leader (this can be done in constant time by means of a Bulletin Board of N entries). Assume that the winner is the processor in charge of leaf f. It checks S(w, i). If it is one, w is the LCA of two leaves, one in the subtree rooted at v and the other in the subtree rooted at the sibling of v. So i is not in the difference list of w, and the processor remains idle for the rest of the stage. Else, i is in the difference list of w, and the processor writes i in position E(f) of a temporary array TEMP_k (having m entries and initialized to empty at the beginning of the stage). At the end of stage k, the c integers that must be in the difference list of w are stored in TEMP_k[FE(v), LE(v)] (because FE(v) ≤ E(f) ≤ LE(v)). Once again, there is no conflict with other light edges that are currently being processed, because they have the same distance.
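The test performed for each integer i can be illustrated sequentially: i, held by leaves under the light child v, enters the difference list of w = parent(v) exactly when i does not also occur under the sibling of v, which is what the constant-time check S(w, i) = 1 on adjacent entries of LEAF(i) detects. A sketch with illustrative names (it recomputes leaf sets directly instead of using the table S):

```python
# Illustrative sequential version of the stage test for a light edge
# e = (w, v): the difference list stored at w consists of the integers
# under v that do not occur under the sibling of v. (The parallel
# algorithm detects the "also occurs under the sibling" case via
# S(w, i) = 1, i.e., w being the LCA of two adjacent leaves of LEAF(i).)

def leaves_under(v, left, right):
    out, stack = [], [v]
    while stack:
        u = stack.pop()
        if left[u] == -1 and right[u] == -1:
            out.append(u)
        else:
            for c in (right[u], left[u]):
                if c != -1:
                    stack.append(c)
    return out

def difference_list_at(w, v, left, right, val):
    """Integers under the light child v of w that are absent under
    v's sibling; these are exactly the entries written into TEMP_k."""
    sib = right[w] if left[w] == v else left[w]
    under_v = {val[f] for f in leaves_under(v, left, right)}
    under_sib = {val[f] for f in leaves_under(sib, left, right)}
    return under_v - under_sib
```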

• After the last stage (i.e., k = 1), we have the difference lists of all nodes stored in TEMP arrays (there are a total of O(log m) arrays of m entries each). Consider TEMP_k. For each node w such that (w, v) has been processed during stage k, its difference list is in noncontiguous positions of TEMP_k[FE(v), LE(v)]. We extract the difference lists of all such nodes by resorting to compaction. Assume that TEMP_k has c′ nonempty entries (we do not need to know c′ a priori). We extract those nonempty entries and store them (as they appear from left to right in TEMP_k) into a new array G_k of c′ cells. Moreover, for each position TEMP_k[h] we also compute the number of nonempty entries in TEMP_k[1, h]. Let those numbers be g_1 and g_2 for TEMP_k[FE(v) − 1] and TEMP_k[LE(v)]. Then, the difference list of w is G_k[g_1 + 1, g_2] (if g_1 = g_2, the difference list is empty). It is folklore that this kind of compaction problem can be solved by means of a prefix sum [12]. Therefore, for a single k, the time is bounded by O(log m) and the work by O(m). Since we process all TEMP arrays in parallel, we have a total of O(log m) time and O(m log m) work. We point out that we could use more sophisticated techniques, such as the algorithm in [11] for prefix sums or the algorithm in [20] for linear approximate compaction (see [19, 20, 28] and references therein for definition and applications). However, we would not get a better work bound for the element distinctness problem considered here.
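The compaction of one TEMP array can be sketched with a sequential running count standing in for the parallel prefix-sum computation [12] (indices are 0-based here, unlike in the text, and names are illustrative):

```python
# Illustrative sketch of the compaction step: extract the nonempty
# entries of TEMP_k (left to right) into G_k, and keep a prefix count
# of nonempty entries so that the difference list of a node w processed
# at stage k is the slice of G_k between the counts at FE(v)-1 and LE(v).

def compact(temp):
    """Return (G, count): G holds the nonempty entries of temp in order,
    count[h] = number of nonempty entries in temp[0..h]."""
    G, count, c = [], [], 0
    for x in temp:
        if x is not None:
            G.append(x)
            c += 1
        count.append(c)
    return G, count

def diff_list_slice(G, count, fe, le):
    """Difference list of the node whose TEMP entries lie in [fe, le]:
    G[g1:g2] with g1 = count[fe-1] (0 when fe == 0) and g2 = count[le];
    empty when g1 == g2, exactly as in the text."""
    g1 = count[fe - 1] if fe > 0 else 0
    g2 = count[le]
    return G[g1:g2]
```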

• In order to simplify the query procedure, the difference lists are further assembled together. Consider a solid path p. Visit this path from its first node (closest to the root) to the last node (a leaf f) and "stack up" the difference lists that are met on the way. The result is an array D_p with the following property. Assume that we know, for a node w′ in p, that its difference list starts at D_p[j]. Then, by Property 10.1, il(w′) is given by all entries from D_p[j] until the end of D_p. (D_p is the union of subarrays previously obtained by compaction of TEMP arrays.) For each solid path, we compute its D array and store it at the leaf at the end of the path, as we now briefly sketch. We compute, for each w′ in p, the start point of il(w′) in D_p. If w′ has an empty difference list, the start point of il(w′) is the same as the start point for the il list of the closest "downward" node in p. Since we know the rank of each node in the linked list of each solid path, and since we also know the size of the difference list of each node w, we can compute the start points of the difference lists of all nodes in the various D arrays in O(log m) time and O(m log m) work. Once we know those start points, we can copy the relevant portions of the G arrays into the appropriate entries of the D arrays in O(log m) time and O(m log m) work. The details are omitted.
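The final assembly and the resulting query procedure can be sketched as follows (sequential and 0-based; `diff` maps each node of a solid path to its already-computed difference list; names are illustrative). Observe that a node with an empty difference list automatically receives the same start point as the closest "downward" node, as required:

```python
# Illustrative sketch of the D-array assembly for one solid path and of
# the query it supports: by Property 10.1, the distinct integers at the
# leaves below a node w of the path are exactly D_p[start[w]:].

def assemble_solid_path(path, diff):
    """path: nodes of a solid path, first node closest to the root.
    diff: difference list (a Python list) per node. Returns (D, start)."""
    D, start = [], {}
    for w in path:
        start[w] = len(D)     # empty diff[w] => same start as next node
        D.extend(diff[w])
    return D, start

def distinct_under(w, D, start):
    """All distinct integers at the leaves below w: O(g) sequential work
    for an answer of size g (split among p <= g processors in the paper)."""
    return D[start[w]:]
```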

LEMMA 10.1. We can preprocess T in O(log m) time with O(m) processors, such that each query reporting the g distinct integers in the subtree rooted at a node can be answered in O(g) work with p processors, 1 ≤ p ≤ g. The space required to store the difference lists is O(m log m).

Proof. The fact that the query can be answered in the stated time and processor bounds is obvious. As for the preprocessing, notice that all tasks can be done in O(log m) time with O(m) processors. The space bound follows immediately from Property 10.1.

Theorem 10.1 easily follows from Lemma 10.1 by processing all trees of H_A in parallel. Adding the time, processors, and space for the auxiliary data structures needed over all trees, we get the bounds of Theorem 10.1.

11. CONCLUDING REMARKS AND OPEN PROBLEMS

We have introduced a new multidimensional pattern matching problem that allows us to account for the degrees of freedom in matching an r-dimensional pattern against a d-dimensional text, r ≤ d. We have devised new data structures and algorithms for the on-line version of this problem, achieving optimal work for the queries considered. It would be interesting to obtain better algorithms for the preprocessing step of the on-line case, as well as fast algorithms for the off-line case.

ACKNOWLEDGMENTS

The authors thank D. Breslauer, S. Das, Y. Matias, and K. Park for helpful discussions related to this paper, and the referees for very stimulating comments.

REFERENCES

1. A. V. Aho, Algorithms for finding patterns in strings, in "Handbook of Theoretical Computer Science" (J. van Leeuwen, Ed.), pp. 257–295, Elsevier, Amsterdam, 1990.
2. A. Amir and G. Benson, Two-dimensional periodicity and its applications, in "Proceedings of ACM–SIAM Symposium on Discrete Algorithms, 1992," pp. 440–452.
3. A. Amir, G. Benson, and M. Farach, An alphabet independent approach to two-dimensional pattern matching, SIAM J. Comput. 23 (1994), 313–323.
4. A. Apostolico, The myriad virtues of subword trees, in "Combinatorial Algorithms on Words" (A. Apostolico and Z. Galil, Eds.), NATO ASI Series F, Vol. 12, pp. 85–96, Springer-Verlag, Berlin, 1984.


5. A. Apostolico, C. Iliopoulos, G. Landau, B. Schieber, and U. Vishkin, Parallel construction of a suffix tree with applications, Algorithmica 3 (1988), 347–365.
6. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, "The Design and Analysis of Computer Algorithms," Addison–Wesley, Reading, MA, 1974.
7. O. Berkman, C. S. Iliopoulos, and K. Park, The subtree max gap problem with application to parallel string covering, Inform. and Comput. 123 (1995), 127–137.
8. D. Breslauer, Dictionary-matching on unbounded alphabets: Uniform length dictionaries, J. Algorithms 18 (1995), 278–295.
9. R. Cole, M. Crochemore, Z. Galil, L. Gasieniec, R. Hariharan, S. Muthukrishnan, K. Park, and W. Rytter, Optimally fast parallel algorithms for preprocessing and pattern matching in one and two dimensions, in "Proceedings, 34th Symposium on Foundations of Computer Science," pp. 248–258, IEEE, New York, 1993.
10. R. Cole and U. Vishkin, The accelerated centroid decomposition technique for optimal parallel tree evaluation in logarithmic time, Algorithmica 3 (1988), 329–346.
11. R. Cole and U. Vishkin, Faster optimal parallel prefix sums and list ranking, Inform. and Comput. 81 (1989), 334–352.
12. R. E. Ladner and M. J. Fischer, Parallel prefix computation, J. Assoc. Comput. Mach. 27 (1980), 831–838.
13. M. J. Fischer and M. S. Paterson, String matching and other products, in "Complexity of Computation" (R. M. Karp, Ed.), pp. 113–125, SIAM/American Mathematical Society, Providence, RI, 1974.
14. Z. Galil and K. Park, A truly alphabet independent two-dimensional pattern matching algorithm, in "Proceedings, 33rd Symposium on Foundations of Computer Science," pp. 247–256, IEEE, New York, 1992.
15. R. Giancarlo, An index data structure for matrices, with applications to fast two-dimensional pattern matching, in "Proceedings, Workshop on Algorithms and Data Structures," Lecture Notes in Computer Science, Vol. 709, pp. 337–348, Springer-Verlag, Berlin/New York, 1993.
16. R. Giancarlo and R. Grossi, Parallel construction and query of suffix trees for two-dimensional matrices, in "Proceedings of the 5th ACM Symposium on Parallel Algorithms and Architectures, 1993," pp. 86–97.
17. R. Giancarlo and F. Mignosi, Generalizations of the periodicity theorem of Fine and Wilf, in "Proceedings of Colloquium on Algebraic Aspects of Programming," Lecture Notes in Computer Science, Vol. 787, pp. 130–141, Springer-Verlag, Berlin, 1994.
18. T. Hagerup, On saving space in parallel computation, Inform. Process. Lett. 29 (1988), 327–329.
19. T. Hagerup, The log-star revolution, in "Proceedings, 9th Symposium on Theoretical Aspects of Computer Science," Lecture Notes in Computer Science, Vol. 577, pp. 259–280, Springer-Verlag, Berlin/New York, 1992.
20. T. Hagerup, Fast deterministic processor allocation, J. Algorithms 18 (1995), 629–649.
21. R. Jain, "Workshop Report on Visual Information Systems," Technical report, National Science Foundation, 1992.
22. J. JáJá, "An Introduction to Parallel Algorithms," Addison–Wesley, Reading, MA, 1992.
23. R. Karp, R. Miller, and A. Rosenberg, Rapid identification of repeated patterns in strings, arrays and trees, in "Proceedings, 4th Symposium on Theory of Computing," pp. 125–136, Assoc. Comput. Mach., New York, 1972.
24. M. Karpinski and W. Rytter, An alphabet independent optimal parallel search for three dimensional patterns, in "Proceedings, 5th Symposium on Combinatorial Pattern Matching," Lecture Notes in Computer Science, Vol. 807, pp. 90–103, Springer-Verlag, Berlin/New York, 1994.
25. D. E. Knuth, "The Art of Computer Programming," Vol. 3, Sorting and Searching, Addison–Wesley, Reading, MA, 1973.


26. D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977), 323–350.
27. R. C. Lyndon and M. P. Schützenberger, The equation a^M = b^N c^P in a free group, Michigan Math. J. 9 (1962), 289–298.
28. Y. Matias and U. Vishkin, Converting high probability into nearly constant time, with applications to parallel hashing, in "Proceedings, 23rd Symposium on Theory of Computing," pp. 307–316, Assoc. Comput. Mach., New York, 1991.
29. E. M. McCreight, A space-economical suffix tree construction algorithm, J. Assoc. Comput. Mach. 23 (1976), 262–272.
30. S. Muthukrishnan and K. Palem, Highly efficient dictionary matching in parallel, in "Proceedings of the 5th ACM Symposium on Parallel Algorithms and Architectures, 1993," pp. 69–78.
31. M. Régnier and L. Rostami, A unifying look at d-dimensional periodicity and space coverings, in "Proceedings, 4th Symposium on Combinatorial Pattern Matching," Lecture Notes in Computer Science, Vol. 684, pp. 215–227, Springer-Verlag, Berlin/New York, 1993.
32. A. Rosenfeld and A. C. Kak, "Digital Picture Processing," Academic Press, San Diego, 1982.
33. B. Schieber and U. Vishkin, On finding lowest common ancestors: Simplification and parallelization, SIAM J. Comput. 17 (1988), 1253–1262.
34. D. D. Sleator and R. E. Tarjan, A data structure for dynamic trees, J. Comput. System Sci. 26 (1983), 362–391.
35. R. E. Tarjan and U. Vishkin, Finding biconnected components and computing tree functions in logarithmic parallel time, SIAM J. Comput. 14 (1985), 862–874.
36. U. Vishkin, Deterministic sampling: A new technique for fast pattern matching, SIAM J. Comput. 20 (1991), 303–314.
37. P. Weiner, Linear pattern matching algorithms, in "Proceedings, 14th Symposium on Switching and Automata Theory," pp. 1–11, IEEE, New York, 1973.