Faster Tree Pattern Matching


  • 7/27/2019 Faster Tree Pattern Matching


    Faster Tree Pattern Matching

    MOSHE DUBINER
    Tel-Aviv University, Tel-Aviv, Israel

    ZVI GALIL
    Tel-Aviv University, Tel-Aviv, Israel, and Columbia University, New York, New York

    AND

    EDITH MAGEN
    Tel-Aviv University, Tel-Aviv, Israel

    Abstract. Recently, R. Kosaraju gave an $O(nm^{0.75}\,\mathrm{polylog}(m))$ step algorithm for tree pattern matching. We improve this result by designing a simple $O(n\sqrt{m}\,\mathrm{polylog}(m))$ algorithm.

    Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - computations on discrete structures.

    General Terms: Algorithms, Theory.

    Additional Key Words and Phrases: convolution, don't-care, pattern matching, period, string matching, tree.

    1. Introduction

    For brevity, we shall ignore the $\log(m)$ factor in the time complexity, using the notation $\tilde{O}(f(n, m)) = O(f(n, m)\,\mathrm{polylog}(m))$.

    We consider ordered, labeled trees, except that roots are unlabeled. This is equivalent to edge labeling. (Having unlabeled roots is for technical convenience only.) A pattern tree $P$ matches a target tree $T$ at node $v$ if there exists a one-to-one map from the nodes of $P$ into the nodes of $T$ such that

    (1) the root of $P$ maps to $v$,
    (2) if $x$ maps to $y$ and $x$ is not the root of $P$, then $x$ and $y$ have the same labels, and
    (3) if $x$ maps to $y$ and $x$ is not a leaf, then the $i$th child of $x$ maps to the $i$th child of $y$. (In particular the degree of $y$ is no less than the degree of $x$.)

    Z. Galil was partially supported by National Science Foundation grant CCR 90-14625 and CISE Institutional Infrastructure grant CDA 90-24735.

    Authors' addresses: M. Dubiner, Department of Applied Mathematics, Tel-Aviv University, Tel-Aviv, Israel; Z. Galil, Department of Computer Science, 450 Computer Science Building, Columbia University, New York, NY 10027; E. Magen, Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel.

    Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

    © 1994 ACM 0004-5411/94/0300-0205 $03.50

    Journal of the Association for Computing Machinery, Vol. 41, No. 2, March 1994, pp. 205-213.


    Given trees $P$ and $T$ of sizes $m$ and $n$ respectively ($m \le n$), we want to compute the set of nodes of $T$ at which $P$ matches. We assume that the order information of the children of a node is absorbed into the children's node labels. Consequently, the labels of the children of any node are distinct. For convenience, we assume that the alphabet is $\{0, 1\}$. If this is not the case, we encode every symbol with 0s and 1s. Consequently, our trees will be larger by a factor of at most $\log(m)$, which is absorbed in the $\tilde{O}$ notation.

    Tree pattern matching is an important problem with many applications (see Hoffmann and O'Donnell [1982]). The obvious algorithm takes $O(mn)$ time. A classic open problem was whether this bound could be improved. Recently, Kosaraju [1989] broke the $O(nm)$ barrier for this problem with an $\tilde{O}(nm^{0.75})$ algorithm. We improve this result, giving an $\tilde{O}(n\sqrt{m})$ algorithm. Kosaraju introduced three new techniques:

    (1) a suffix tree of a tree;
    (2) convolution of a tree and a string;
    (3) partitioning of trees into chains and anti-chains.

    Even though our algorithm was inspired by Kosaraju's algorithm, we do not use any of his techniques. A different version of our algorithm can use the suffix tree of trees. Instead, we use truncated suffix trees that can be constructed in an obvious way. The improvement is obtained by discovering and exploiting periodical strings appearing in $P$.

    We make use of very simple properties of periods of strings. We denote by $|\alpha|$ the length of the string $\alpha$. A string $\alpha$ is a period of a string $\beta$ if $\beta$ is a prefix of $\alpha^k$ for some $k > 0$. The next two facts are well known and the third follows from the second:

    Fact 1.1. $\alpha$ is a period of $\beta$ iff $\beta = \alpha\gamma = \gamma\delta$.

    Fact 1.2. $\beta$ has a period of length $k$ iff $\beta_i = \beta_{i+k}$ for $1 \le i \le |\beta| - k$.

    Fact 1.3. If $\alpha\beta$ and $\beta\gamma$ have a period of length $k$ and $|\beta| \ge k$, then $\alpha\beta\gamma$ has a period of length $k$.

    Throughout the paper, there are details of the algorithm that could be performed more efficiently than described. However, when the total time is not affected, we prefer simplicity over efficiency.

    2. The Tree Pattern Matching Algorithm

    For every node $w$ of $P$, denote by $\sigma_w$ the labeled path from the root of $P$ to $w$. Let $w$ be a leaf of $P$. We say that $w$ matches $T$ at $v$ if $\sigma_w$, considered as a tree, matches $T$ at $v$. Using our assumption that the labels encode the order information, we see that $P$ matches $T$ at $v$ iff every leaf $w$ of $P$ matches $T$ at $v$. Therefore, we may partition the set of leaves of $P$ into subsets (as we shall often do), check where each subset matches, and in $O(n)$ time conclude where $P$ matches.

    The following lemma observes that the time of the naive algorithm (in which we try to match $P$ at every node of $T$ in a straightforward way) is actually better than $O(mn)$ in many cases.
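As a minimal sketch of the naive algorithm and of the leaf-by-leaf reformulation above, the Python below represents a tree as nested dictionaries keyed by edge label (so children of a node automatically have distinct labels). That representation and the names `matches_at`, `naive_match`, `leaf_paths`, and `path_occurs` are assumptions made for illustration; they do not come from the paper.

```python
# A tree is a dict mapping an edge label ("0" or "1") to a child subtree.
# Labels sit on edges, which the text notes is equivalent to unlabeled roots.

def matches_at(pattern, target):
    """Return True if `pattern` matches at the root of `target`."""
    for label, p_child in pattern.items():
        t_child = target.get(label)
        if t_child is None or not matches_at(p_child, t_child):
            return False
    return True


def naive_match(pattern, target):
    """Try the pattern at every node of the target.

    Each node is reported by the label string of its path from the root.
    """
    hits = []
    stack = [("", target)]
    while stack:
        path, node = stack.pop()
        if matches_at(pattern, node):
            hits.append(path)
        for label, child in node.items():
            stack.append((path + label, child))
    return sorted(hits)


def leaf_paths(tree, prefix=""):
    """Label strings sigma_w of the root-to-leaf paths of `tree`."""
    if not tree:
        return [prefix]
    paths = []
    for label, child in tree.items():
        paths.extend(leaf_paths(child, prefix + label))
    return paths


def path_occurs(path, node):
    """Return True if the single path `path`, viewed as a tree, matches at `node`."""
    for label in path:
        node = node.get(label)
        if node is None:
            return False
    return True
```

With these helpers, `matches_at(P, v)` should agree with `all(path_occurs(s, v) for s in leaf_paths(P))`, which is the "every leaf matches" criterion stated above.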

    LEMMA 2.1. If the height of $P$ is $h$, then the naive algorithm takes time $O(nh)$.

    PROOF. A node $v$ in $T$ is compared with a node $u$ in $P$ at level $r$ only if the path consisting of the $r$ first ancestors of $v$ matches the path of the ancestors of $u$ in $P$. Since different children have different labels, the path of the $r$ ancestors of $v$ cannot match a path to another node in $P$. Thus, every node of $T$ is compared with at most one node of $P$ at every depth-level. □

    The following is a well-known fact. The proof (following Knuth et al. [1977]) is given for completeness.

    LEMMA 2.2. A string, $s$, of length $m$ can be matched with a tree, $T$, of size $n$ in time $O(m + n)$.

    PROOF. Define a function, $f$, as follows: For every $1 \le i \le m$, let

        $f(i)$ = the largest $j$ for which $s_1 \cdots s_j$ is a proper suffix of $s_1 \cdots s_i$.

    Given $f$, matching $s$ and $T$ in linear time is very simple: Traverse $T$ by DFS, keeping at each node, $v$, a pointer to its highest ancestor, $w$, such that the path from $w$ to $v$ is a prefix of $s$. Consider a node, $v$, and suppose the pointer of its parent points to node $w$. Initially, set the pointer of $v$ to $w$, and let $i$ be the length of the path from the pointer of $v$ to $v$. If this path is a prefix of $s$ (this is checked just by comparing the label of $v$), continue with the DFS. Otherwise, reset the pointer of $v$ to its $f(i - 1)$th ancestor (by starting at $w$ and following a prefix of $s$ of length $i - 1 - f(i - 1)$) and compare again, until a (possibly empty) prefix of $s$ is found. Since the pointers can only advance with the DFS, the total time spent is linear.

    The function $f$ itself (the failure function of Knuth et al. [1977]) is constructed recursively in a similar way, only now the role of $T$ is played by the string $s_2 \cdots s_m$. □

    Remark. One can prove in a similar way that a set of strings can be matched in linear time, provided there is no string in the set that is a suffix of another string in the set.

    For every node $v$ in $P$, denote by $\sigma_{v,k}$ the string of the last $k$ symbols of $\sigma_v$, if they exist; $\sigma_{v,k} = \sigma_v$ in case $|\sigma_v| < k$. The $k$-truncated suffix tree of the tree $P$, $\Sigma_{P,k}$, is defined to be the trie of the set $\{\sigma_{v,k}^R \mid v \text{ is a node of } P\}$, where $R$ stands for reversal. This means that for any node $v$ in $P$ there is a corresponding node $\hat{v}$ in $\Sigma_{P,k}$, such that the path from the root of $\Sigma_{P,k}$ to $\hat{v}$ is the reversal of the path $\sigma_{v,k}$. Different nodes of $P$ may have equal corresponding nodes in $\Sigma_{P,k}$. $\Sigma_{P,k}$ may have additional nodes, but each of its leaves corresponds to a node, not necessarily a leaf, of $P$.

    Example. Figure 1 shows a tree $P$ (of height 4) with its 4-truncated suffix tree and its 2-truncated suffix tree. The label of each node is marked on the edge leading to that node.

    LEMMA 2.3. For a tree $P$ of size $m$, $\Sigma_{P,k}$ can be computed in $O(mk)$ steps.

    PROOF. Simply insert the strings $\sigma_{v,k}^R$ into the trie one at a time. □

    The following fact follows immediately from the definition of $\Sigma_{P,k}$.

    Fact 2.4. $\hat{u}$ is an ancestor of $\hat{v}$ in $\Sigma_{P,k}$ iff $\sigma_{u,k}$ is a suffix of $\sigma_{v,k}$. In particular, if $\hat{u}$ and $\hat{v}$ are distinct leaves of $\Sigma_{P,k}$, then $\sigma_u$ and $\sigma_v$ cannot be a suffix of one another.

    Set $l = \lceil \sqrt{m}\, \rceil$.
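The construction behind Lemma 2.3 can be sketched as follows. This is a hedged illustration only: trees and tries are nested dictionaries keyed by edge label, and the names `root_paths`, `truncated_suffix_tree`, and `count_leaves` are assumptions, not the paper's. The sketch builds each path string explicitly, so it is simpler than, and not tuned to, the $O(mk)$ bound.

```python
def root_paths(tree, prefix=""):
    """Yield sigma_v for every node v of `tree` (the root yields the empty string)."""
    yield prefix
    for label, child in tree.items():
        yield from root_paths(child, prefix + label)


def truncated_suffix_tree(tree, k):
    """Trie of the reversed strings sigma_{v,k}^R over all nodes v of `tree`."""
    trie = {}
    for sigma in root_paths(tree):
        # sigma[-k:] is the last k symbols, or all of sigma if it is shorter.
        truncated = sigma[-k:] if k > 0 else ""
        node = trie
        for symbol in reversed(truncated):
            node = node.setdefault(symbol, {})
    return trie


def count_leaves(trie):
    """Number of leaves of the trie (a lone root counts as one leaf)."""
    if not trie:
        return 1
    return sum(count_leaves(child) for child in trie.values())
```

Calling `count_leaves(truncated_suffix_tree(P, 3 * l))` gives the quantity that the case distinction below compares against $l$.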

    [FIGURE 1. A tree $P$ with its 4-truncated and 2-truncated suffix trees.]

    The first step in our algorithm computes the $3l$-truncated suffix tree of $P$, $\Sigma = \Sigma_{P,3l}$, in $O(ml) = O(m\sqrt{m})$ steps.

    Case 1. $\Sigma$ has at least $l$ leaves.

    Choose a set, $S$, of $l$ nodes in $P$, which corresponds to leaves in $\Sigma$. For each node $v$ of $S$, mark the nodes of $T$ at which the string $\sigma_v$ matches in time $O(n)$ using Lemma 2.2. By Fact 2.4, any node of $T$ can be the end of at most one path to a node of $S$. Therefore, at most $n$ marks are made, and at most $n/l$ nodes of $T$ can be marked $l$ times (i.e., for all members of $S$) and be considered as possible roots of $P$ (i.e., nodes at which $P$ could match $T$). At each of the possible roots, we check in $O(m)$ time whether $P$ matches or not. The total time spent in this case is therefore $O(ln + (n/l)m) = O(n\sqrt{m})$.

    Remark. By the remark of Lemma 2.2, we see that marking the possible roots can be done in linear time. However, so far we do not know how to use this to improve the total time of the algorithm.

    Case 2. $\Sigma$ has at most $l$ leaves.

    Let $P_h$ be the subtree of $P$ composed of the paths to the leaves of $P$ whose depths are at most $3l$. By Lemma 2.1, we can find all the nodes of $T$ at which these high leaves match in time $O(nl) = O(n\sqrt{m})$ using the naive algorithm for $P_h$ and $T$. So, without loss of generality, we may assume that the depth of every leaf of $P$ is at least $3l$. In this case, if $v$ is a leaf of $P$, then $|\sigma_{v,3l}| = 3l$ and $\hat{v}$ is a leaf of $\Sigma$.
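As a side calculation (not spelled out in the paper), the two running-time claims made so far, the Case 1 total and the naive step just used in Case 2, can be checked using only $\sqrt{m} \le l \le \sqrt{m} + 1$, which holds for $l = \lceil \sqrt{m}\, \rceil$:

```latex
\begin{align*}
\text{Case 1:}\quad
  & l\,n + \frac{n}{l}\,m
    \;\le\; (\sqrt{m} + 1)\,n + \frac{n}{\sqrt{m}}\,m
    \;=\; (2\sqrt{m} + 1)\,n
    \;=\; O(n\sqrt{m}), \\
\text{Case 2 (leaves of depth at most } 3l\text{):}\quad
  & n \cdot 3l
    \;\le\; 3(\sqrt{m} + 1)\,n
    \;=\; O(n\sqrt{m}).
\end{align*}
```

The first line shows why $l$ is taken near $\sqrt{m}$: a larger $l$ inflates the marking term $ln$, while a smaller $l$ inflates the verification term $(n/l)m$.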

    LEMMA 2.5. Given $l + 1$ nodes in $P$: $v_0, \ldots, v_l$, there are $i \ne j$ such that $\sigma_{v_i,3l}$ is a suffix of $\sigma_{v_j,3l}$.

    PROOF. Since there are at most $l$ leaves in $\Sigma$, there are $i \ne j$ for which $\hat{v}_i$ is equal to or an ancestor of $\hat{v}_j$ in $\Sigma$. By Fact 2.4, $\sigma_{v_i,3l}$ is a suffix of $\sigma_{v_j,3l}$. □

    LEMMA 2.6. For each leaf $v$ of $P$, the string of all but the at most $l$ last nodes of $\sigma_v$ has a period of length at most $l$.

    PROOF. Consider any node in $P$, $v_0$, with depth at least $l$, and let $v_i$, $i \le l$, be its $i$th ancestor. By Lemma 2.5, there are $0 \le i < j$ [...] such that $\sigma_{v_i,3l}$ labels the path from $u$ to $v_i$, and let $\tau$ be the label of the path from $u$ to $v_j$. $\tau$ is a prefix and a suffix of $\sigma_{v_i,3l}$ (since it is a suffix of $\sigma_{v_j,3l}$). By Fact 1.1, $\sigma_{v_i,3l}$ has a period of length $0 < j - i \le l$.

    Now let $v$ be a leaf of $P$. By consideration above, there is an $0 \le i$ [...] gives us the period of $v$ in some cyclic permutation. The exact period and tail can now be easily found in time $O(l)$ from $s$.

    In order to find the set of the at most $l$ pairs, we sort lexicographically the $O(m)$ pairs we found in time $\tilde{O}(ml)$. □

    In the following, we show how to match a set of leaves with a given pair of tail-period in time $\tilde{O}(n)$. Hence, we will obtain an $\tilde{O}(ln) = \tilde{O}(n\sqrt{m})$ algorithm for matching $P$. Consider a fixed tail-period pair and denote the period by $p$. We call a path in a tree a maximal-periodic-path if

    (1) the path has $p$ as a period (it may start at any place of $p$, but it ends at the end of a full period),
    (2) the period is repeated at least twice in the path, and
    (3) the path is maximal in the sense that it cannot be extended to a longer path with property (1).

    LEMMA 2.9. If two maximal-periodic-paths intersect at a node $v$, then for one of the paths $v$ is among the first $|p|$ nodes.

    PROOF. Otherwise, at least the first $|p|$ ancestors of $v$ are common to the two paths. Since both paths are periodic with the same period and maximal (and recall that different children must have different labels), they must be equal. □

    COROLLARY 2.10. The total length of all maximal-periodic-paths in a tree $T$ of size $n$ is $O(n)$.

    PROOF. For every maximal-periodic-path, we call its suffix obtained by deleting its prefix of length $|p|$ the main path. By Lemma 2.9, the main paths in $T$ are disjoint. Since every maximal-periodic-path has length at least $2|p|$, the total length is bounded above by twice the total length of the main paths, which is no more than $2n$. □
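To make conditions (1) and (2) of a maximal-periodic-path concrete, here is a minimal sketch for a single path given by its label string. The function names `has_period` and `is_periodic_path` are hypothetical, and maximality (condition (3)) is deliberately not checked, since it depends on the surrounding tree.

```python
def has_period(beta, k):
    """Fact 1.2: beta has a period of length k iff beta[i] == beta[i + k] throughout."""
    return all(beta[i] == beta[i + k] for i in range(len(beta) - k))


def is_periodic_path(labels, p):
    """Check conditions (1) and (2) for the label string of one path.

    (1) `labels` may start anywhere inside p but must end at the end of a
        full copy of p, i.e. it is a suffix of p repeated often enough.
    (2) At least two full copies of p fit, i.e. len(labels) >= 2 * len(p).
    """
    if not p or len(labels) < 2 * len(p):
        return False
    copies = -(-len(labels) // len(p))  # ceiling division
    return (p * copies).endswith(labels)
```

For instance, with `p = "01"` the string `"10101"` starts in the middle of the period and still qualifies, whereas `"1010"` does not, because it ends in the middle of a period.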

    LEMMA 2.11. Finding all maximal-periodic-paths in a tree of size $n$ can be done in time $O(n)$.

    PROOF. In the same manner as finding the occurrences of a string in a tree (Lemma 2.2), we can find in $T$ all sequences of as many (at least two) consecutive periods as possible in time $O(n)$. For every such sequence, we augment it as far as possible upwards. By Corollary 2.10, this takes $O(n)$ steps as well. □

    Using Lemma 2.11 and Lemma 2.2, we find in time $O(n)$ all maximal-periodic-paths in $T$, as well as all the occurrences of the tail in $T$.

    For every maximal-periodic-path in $T$, we define the $\{0, 1\}$-sequence (starting with $i = 0$)

        $b_i = \begin{cases} 1 & \text{if after } i \text{ periods in the path the tail occurs,} \\ 0 & \text{otherwise.} \end{cases}$

    The $b$-sequence can be computed in time proportional to the length of the maximal-periodic-path. Hence, by Corollary 2.10, all the $b$-sequences can be computed in time $O(n)$. Note that the length of the $b$-sequence is at most the length of the maximal-periodic-path divided by $|p|$ plus 1.

    Now we find in time $O(m)$ the maximal-periodic-paths in $P$ that start at the root of $P$, and from which tails lead on to leaves with the given tail and period. (Note the two distinctions from maximal-periodic-paths in $T$.) The occurrences of the tail in $P$ that start at an end of a period and end at a leaf are found as well. This is done simply by tracing in $P$ all paths from leaves with the given tail and period up to the root or to a previously visited node in a maximal-periodic-path. By Lemma 2.9, there are at most $|p|$ maximal-periodic-paths in $P$ that start at the root: at most one for every possible starting place in the period.

    For every such maximal-periodic-path in $P$, we define the $\{0, 1\}$-sequence

        $a_i = \begin{cases} 1 & \text{if after } i \text{ periods in the path the tail occurs and ends at a leaf,} \\ 0 & \text{otherwise.} \end{cases}$

    The $a$-sequence can be computed in time proportional to the length of the maximal-periodic-path. Hence, by Corollary 2.10 applied to $P$, all the $a$-sequences can be computed in time $O(m)$. Note that the length of the $a$-sequence is at most the length of the maximal-periodic-path divided by $|p|$ plus 1.

    We show how to match the leaves of a fixed maximal-periodic-path of $P$ in time $\tilde{O}(n/|p|)$. Thus, matching the up to $|p|$ sets of leaves (corresponding to the up to $|p|$ maximal-periodic-paths in $P$) will be done in time $\tilde{O}(n)$ as promised. So, let us fix one such maximal-periodic-path in $P$ with the corresponding $a$-sequence. For our set of leaves to match at a node $v$ of $T$, there should be a maximal-periodic-path in $T$ that passes through $v$ such that

    (1) in the path of $T$, $v$ is at the same place in the period as the root of $P$ is in the maximal-periodic-path of $P$, and
    (2) whenever $a_i$ is 1, the corresponding $b_i$ of the path through $v$ should be 1 too.

    This means that our problem amounts to solving a problem of string matching with don't-care for the $a$-sequence and any one of the $b$-sequences, with an alphabet of $\{0, 1\}$, where the 0 plays the role of the don't-care symbol in the $a$-sequence. Each such problem is solved via the convolution algorithm in time

        $\tilde{O}\!\left(\dfrac{\text{length of path in } T}{|p|}\right)$

    (see Fischer and Paterson [1974]). (We only perform the computation if the $a$-sequence is no longer than the $b$-sequence.) Finally, by Corollary 2.10, all these matching problems can be solved in total time $\tilde{O}(n/|p|)$. Every match between the $a$-sequence and a $b$-sequence is translated to at most one node in the maximal-periodic-path of $T$ at which our set of leaves matches, depending on the starting place in $p$ of the maximal-periodic-path of $P$ for which the $a$-sequence was built. (If the $a$-sequence occurs at position 0 of the $b$-sequence, and the corresponding maximal-periodic-path of $T$ starts at a later place in $p$ than the corresponding maximal-periodic-path of $P$, then the occurrence is not translated into a match of the set of leaves.)

    We summarize the two cases above as follows:

    THEOREM 2.12. Matching a pattern tree $P$ of size $m$ and a target tree $T$ of size $n$ can be done in time $\tilde{O}(n\sqrt{m})$.

    Example. We take $l = \lceil \sqrt[3]{m}\, \rceil$ in order to have a small instructive example. Figure 2 shows a pattern tree, $P$, of size $m = 24$ ($l = 3$), with its $3l$-truncated suffix tree, $\Sigma$, and a target tree, $T$. Since $\Sigma$ has only 3 ($\le l$) leaves, this is Case 2 of the algorithm. Since all the leaves are of a depth that is at least 9 ($= 3l$), we do not use the naive algorithm.

    [FIGURE 2. The pattern tree $P$, its $3l$-truncated suffix tree $\Sigma$, and the target tree $T$.]

    Leaves $v_1$, $v_3$, $v_4$ all have period 01 and tail 1; $v_2$ has period 10 and a null tail.

    The pair (01, 1). $P$ has two maximal-periodic-paths with period 01 that start at the root. One going left with an $a$-sequence 000011, and one going right (starting with the suffix 1 of the period) with an $a$-sequence 00001. $T$ has two maximal-periodic-paths with period 01. (They intersect only at a node that is the first node of one of them.) The two corresponding $b$-sequences are 1000111 for the path starting at $w_1$, and 000010 for the path starting at $w_{?}$. Since the $a$-sequence 000011 matches the $b$-sequence 1000111 at positions 0 and 1 and does not match the $b$-sequence 000010, we have two nodes in $T$ where $v_1$ and $v_{?}$ match: $w_{?}$ and $w_1$. Since 00001 matches 1000111 at positions 0, 1, and 2 and also matches 000010 at position 0, we have four nodes in $T$ where $v_{?}$ matches: $w_0$, $w_2$, $w_6$, and $w_{?}$.

    The pair (10, null). $P$ has only one maximal-periodic-path with period 10 that starts at the root and from which there is a (null) tail leading to a leaf ($v_2$). The corresponding $a$-sequence for this path is 000001. $T$ has two maximal-periodic-paths with period 10. The corresponding $b$-sequences are 1111111 for the path starting at $w_0$ and 111111 for the path starting at $w_1$. Matching the $a$-sequence against the $b$-sequences as before gives the nodes $w_0$, $w_{?}$, $w_{?}$ at which $v_2$ matches.

    By traversing $T$ one final time, we find that $P$ matches $T$ at $w_{?}$, which is the only node of $T$ at which the four leaves of $P$ match.
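The sequence matches quoted in the example can be reproduced with a small sketch of the don't-care matching step. The function name `dont_care_matches` is an assumption. The correlation sum is evaluated directly here, in time proportional to the product of the two lengths; the stated time bound relies on computing the same sums with a fast convolution, which this sketch does not attempt.

```python
def dont_care_matches(a, b):
    """Positions j at which the a-sequence matches the b-sequence.

    a and b are strings over {"0", "1"}. A 0 in `a` is a don't-care; a 1 in
    `a` must be aligned with a 1 in `b`.
    """
    if len(a) > len(b):
        return []  # the text only runs the computation when a is no longer than b
    ones_a = [int(c) for c in a]
    zeros_b = [1 - int(c) for c in b]  # indicator of the 0s of b
    hits = []
    for j in range(len(b) - len(a) + 1):
        # One term of the correlation of ones_a with zeros_b:
        # the number of i with a_i = 1 and b_{i+j} = 0.
        mismatches = sum(ones_a[i] * zeros_b[i + j] for i in range(len(a)))
        if mismatches == 0:
            hits.append(j)
    return hits
```

On the sequences of the example, `dont_care_matches("000011", "1000111")` yields positions 0 and 1, `dont_care_matches("000011", "000010")` yields none, and `dont_care_matches("00001", "1000111")` yields positions 0, 1, and 2, in agreement with the text.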


    REFERENCES

    FISCHER, M. J., AND PATERSON, M. S. 1974. String-matching and other products. In Proceedings of the SIAM-AMS Symposium on Complexity of Computation, R. M. Karp, ed. SIAM, New York, pp. 113-125.

    HOFFMANN, C. M., AND O'DONNELL, M. J. 1982. Pattern matching in trees. J. ACM 29, 1 (Jan.), 68-95.

    KNUTH, D. E., MORRIS, J. H., AND PRATT, V. R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 323-350.

    KOSARAJU, S. R. 1989. Efficient tree pattern matching. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. IEEE, New York, pp. 178-183.

    RECEIVED FEBRUARY 1991; REVISED OCTOBER 1992; ACCEPTED DECEMBER 1992