13
Checking Multivalued Dependencies in XML Jixue Liu 1 Millist Vincent 1 Chengfei Liu 2 Mukesh Mohania 3 1 School of Computer and Information Science, University of South Australia 2 School of Information Technology, Swinburne University of Technology 3 IBM Research Laboratories, New Delhi, India email: {jixue.liu, millist.vincent}@unisa.edu.au, chengfi[email protected] Abstract Recently, the issues of how to define functional dependencies (XFDs) and multivalued dependencies (XMVDs) in XML have been investigated. In this paper we consider the problem of checking the satisfaction of a set of XMVDs in an XML document. We present an algorithm using extensible hashing to check whether an XML document satisfies a given set of XMVDs. The performance of the algorithm is shown to be linear in relation to the ”tuple size” of the XML document, a measure which is related to, but not the same as, the size of the XML document. We then propose a method to estimate the ”tuple size” of an XML document. We also conduct a comparison between the hashed based approach and a sorting based approach to checking XMVDs and show that the hash based approach provides superior performance. 1 Introduction XML has recently emerged as a standard for data representation and interchange on the Internet [18, 1]. While providing syntactic flexibility, XML provides little semantic content and as a result several papers have addressed the topic of how to improve the semantic expressiveness of XML. Among the most important of these approaches has been that of defining integrity constraints in XML [8]. Several different classes of integrity constraints for XML have been defined including key constraints [6, 7], path constraints [9, 2], and inclusion constraints [10, 11], and properties such as axiomatization and satisfiability have been investigated for these constraints. Most of the constraints just discussed are particular to XML, but recently constraints which correspond to the classical relational constraints have been defined. These include functional dependencies [3, 4, 5, 16], multivalued dependencies (XMVDs) [14, 13, 15] and inclusion dependencies [17]. The motivation for the study of such constraints is that at the moment a significant amount of XML data is generated from relational databases and so it is important to understand how relational constraints are propagated to XML, given the growing recognition of the importance of semantics in XML and the WWW. Once such constraints have been defined, one important issue that arises is to develop efficient methods of checking an XML document for constraint satisfaction, which is the topic of this paper. In this paper we address the problem of developing an efficient algorithm for checking whether an XML document satisfies a set of XMVDs. The problem is addressed in several aspects. Firstly, we propose an algorithm that is based on a modification of the extensible hashing technique. Another key idea of the algorithm is an encryption technique which reduces the problem of checking XMVD satisfaction to the problem of checking MVD satisfaction. Secondly, we show that this algorithm scans a document only once and all the information required for XMVD checking can be extracted in this single scan, even when there are multiple XMVDs to be checked. At the same time, we show that the algorithm runs in linear time in relation to what we call the ”tuple size” of the XML document. The ”tuple size” of a document is different (though related) to the size of the XML document and is the number of combinations (we call them tuples) of the path values involved in the XMVDs. We then propose a method for estimating the ”tuple size” of an XML document. Finally, we compare the performance of our hashing based approach with that of a sorting based approach to highlight the advantages of the hash based approach. The rest of this paper is organized as follows. Section 2 contains some preliminary definitions. Section 3 contains the definition of an XMVD. In Section 4 we present our main algorithm for checking XMVD satisfaction. Section 5 contains some experiments with our algorithm and Section 6 contains some concluding comments. 1

Checking Multivalued Dependencies in XML

Embed Size (px)

Citation preview

Checking Multivalued Dependencies in XML

Jixue Liu 1 Millist Vincent 1 Chengfei Liu 2 Mukesh Mohania 3

1 School of Computer and Information Science, University of South Australia2 School of Information Technology, Swinburne University of Technology

3 IBM Research Laboratories, New Delhi, Indiaemail: {jixue.liu, millist.vincent}@unisa.edu.au, [email protected]

Abstract

Recently, the issues of how to define functional dependencies (XFDs) and multivalued dependencies(XMVDs) in XML have been investigated. In this paper we consider the problem of checking the satisfactionof a set of XMVDs in an XML document. We present an algorithm using extensible hashing to check whetheran XML document satisfies a given set of XMVDs. The performance of the algorithm is shown to be linearin relation to the ”tuple size” of the XML document, a measure which is related to, but not the same as, thesize of the XML document. We then propose a method to estimate the ”tuple size” of an XML document.We also conduct a comparison between the hashed based approach and a sorting based approach to checkingXMVDs and show that the hash based approach provides superior performance.

1 Introduction

XML has recently emerged as a standard for data representation and interchange on the Internet [18, 1].While providing syntactic flexibility, XML provides little semantic content and as a result several papers haveaddressed the topic of how to improve the semantic expressiveness of XML. Among the most important ofthese approaches has been that of defining integrity constraints in XML [8]. Several different classes of integrityconstraints for XML have been defined including key constraints [6, 7], path constraints [9, 2], and inclusionconstraints [10, 11], and properties such as axiomatization and satisfiability have been investigated for theseconstraints. Most of the constraints just discussed are particular to XML, but recently constraints whichcorrespond to the classical relational constraints have been defined. These include functional dependencies[3, 4, 5, 16], multivalued dependencies (XMVDs) [14, 13, 15] and inclusion dependencies [17]. The motivationfor the study of such constraints is that at the moment a significant amount of XML data is generated fromrelational databases and so it is important to understand how relational constraints are propagated to XML,given the growing recognition of the importance of semantics in XML and the WWW. Once such constraintshave been defined, one important issue that arises is to develop efficient methods of checking an XML documentfor constraint satisfaction, which is the topic of this paper.

In this paper we address the problem of developing an efficient algorithm for checking whether an XMLdocument satisfies a set of XMVDs. The problem is addressed in several aspects. Firstly, we propose analgorithm that is based on a modification of the extensible hashing technique. Another key idea of the algorithmis an encryption technique which reduces the problem of checking XMVD satisfaction to the problem of checkingMVD satisfaction. Secondly, we show that this algorithm scans a document only once and all the informationrequired for XMVD checking can be extracted in this single scan, even when there are multiple XMVDs tobe checked. At the same time, we show that the algorithm runs in linear time in relation to what we call the”tuple size” of the XML document. The ”tuple size” of a document is different (though related) to the sizeof the XML document and is the number of combinations (we call them tuples) of the path values involvedin the XMVDs. We then propose a method for estimating the ”tuple size” of an XML document. Finally, wecompare the performance of our hashing based approach with that of a sorting based approach to highlight theadvantages of the hash based approach.

The rest of this paper is organized as follows. Section 2 contains some preliminary definitions. Section 3contains the definition of an XMVD. In Section 4 we present our main algorithm for checking XMVD satisfaction.Section 5 contains some experiments with our algorithm and Section 6 contains some concluding comments.

1

2 Preliminary Definitions

In this section we review some preliminary definitions.

Definition 2.1 Assume a countably infinite set E of element labels (tags), a countable infinite set A of attributenames and a symbol S indicating text. An XML tree is defined to be T = (V, lab, ele, att, val, vr) where Vis a finite set of nodes in T ; lab is a function from V to E ∪ A ∪ {S}; ele is a partial function from V to asequence of V nodes such that for any v ∈ V , if ele(v) is defined then lab(v) ∈ E; att is a partial function fromV × A to V such that for any v ∈ V and l ∈ A, if att(v, l) = v1 then lab(v) ∈ E and lab(v1) = l; val is afunction such that for any node in v ∈ V, val(v) = v if lab(v) ∈ E and val(v) is a string if either lab(v) = Sor lab(v) ∈ A; vr is a distinguished node in V called the root of T and we define lab(vr) = root. Since nodeidentifiers are unique, a consequence of the definition of val is that if v1 ∈ E and v2 ∈ E and v1 6= v2 thenval(v1) 6= val(v2). We also extend the definition of val to sets of nodes and if V1 ⊆ V , then val(V1) is the setdefined by val(V1) = {val(v)|v ∈ V1}.

For any v ∈ V , if ele(v) is defined then the nodes in ele(v) are called subelements of v. For any l ∈ A, ifatt(v, l) = v1 then v1 is called an attribute of v. The set of ancestors of a node v, is denoted by Ancestor(v)and the parent node of v is denoted by parentV (v) where the suffix V means that the parent is a vertex (node).2

Definition 2.2 (path) A path is an expression of the form l1. · · · .ln, n ≥ 1, where li ∈ E ∪A ∪ {S} for alli, 1 ≤ i ≤ n and l1 = root. If p is the path l1. · · · .ln then endL(p) = ln. 2

For instance, if E = {root, Dept, Section, Emp} and A = {Project} then root, root.Dept,root.Dept.Section, root.Dept.Section.Project, root.Section.Emp.S are all paths.

Definition 2.3 Let p denote the path l1. · · · .ln and the function parentP (p) return the path l1. · · · .ln−1. Letq denote the path q1. · · · .qm. The path p is said to be a prefix of the path q, denoted by p ⊆ q, if n ≤ m andl1 = q1, . . . , ln = qn. Two paths p and q are equal, denoted by p = q, if p is a prefix of q and q is a prefix of p.The path p is said to be a strict prefix of q, denoted by p ⊂ q, if p is a prefix of q and p 6= q. We also definethe intersection of two paths p1 and p2, denoted by p1 ∩ p2, to be the maximal common prefix of both paths.It is clear that the intersection of two paths is also a path. 2

For example, root.Dept is a strict prefix of root.Dept.Emp and root.Dept.Section.Emp ∩root.Dept.Section.Project = root.Dept.Section.

Definition 2.4 A path instance in an XML tree T is a sequence v1. · · · .vn such that v1 = vr and for allvi, 1 < i ≤ n,vi ∈ V and vi is a child of vi−1. A path instance v1. · · · .vn is said to be defined over the pathl1. · · · .ln if for all vi, 1 ≤ i ≤ n, lab(vi) = li. Two path instances v1. · · · .vn and v′1. · · · .v′n are said to be distinctif vi 6= v′i for some i, 1 ≤ i ≤ n. The path instance v1. · · · .vn is said to be a prefix of v′1. · · · .v′m if n ≤ m andvi = v′i for all i, 1 ≤ i ≤ n. The path instance v1. · · · .vn is said to be a strict prefix of v′1. · · · .v′m if n < m andvi = v′i for all i, 1 ≤ i ≤ n. The set of path instances over a path p in a tree T is denoted by instances(p). Fora node v, we use instnodes(v) to denote all nodes of the path instance ended at v. 2

For example, in Figure 1, vr.v1.v3 is a path instance defined over the path root.Dept.Section and vr.v1.v3

is a strict prefix of vr.v1.v3.v4. instances(root.Dept) = {vr.v1, vr.v2}. instnodes(v2) = {vr, v2}.We now assume the existence of a set of legal paths P for an XML application. Essentially, P defines the

semantics of an XML application in the same way that a set of relational schema define the semantics of arelational application. P may be derived from the DTD, if one exists, or P be derived from some other sourcewhich understands the semantics of the application if no DTD exists. The advantage of assuming the existenceof a set of paths, rather than a DTD, is that it allows for a greater degree of generality since having an XML treeconforming to a set of paths is much less restrictive than having it conform to a DTD. This is because havinga tree conform to a DTD requires not only that the paths in the tree are consistent with those generated fromthe DTD, but also the tree must obey the cardinality and structural constraints defined by the DTD. Firstlywe place the following restriction on the set of paths.

Definition 2.5 A set P of paths is consistent if for any path p ∈ P , if p1 ⊂ p then p1 ∈ P . 2

2

Figure 1: An XML tree.

This is natural restriction on the set of paths and any set of paths that is generated from a DTD will beconsistent. We now define the notion of an XML tree conforming to a set of paths P .

Definition 2.6 An XML tree T is said to conform to a set P of paths if every path instance in T is a pathinstance over a path in P . 2

The next definition is to limit XML trees from having missing information.

Definition 2.7 Let P be a consistent set of paths, let T be an XML tree that conforms to P . Then T isdefined to be complete if whenever there exist paths p1 and p2 in P such that p1 ⊂ p2 and there exists a pathinstance v1. · · · .vn defined over p1, in T , then there exists a path instance v′1. · · · .v′m defined over p2 in T suchthat v1. · · · .vn is a prefix of the instance v′1. · · · .v′m. 2

For example, if we take P to be {root, root.Dept, root.Dept.Section, root.Dept.Section.Emp,root.Dept.Section.Project} then the tree in Figure 1 conforms to P and is complete.

The next function returns all the final nodes of the path instances of a path p in T .

Definition 2.8 Let P be a consistent set of paths, let T be an XML tree that conforms to P . The functionendnodes(p), where p ∈ P , is the set of nodes defined by endnodes(p) = {v|v1. · · · .vn ∈ instances(p)∧ v = vn}.2

For example, in Figure 1, endnodes(root.Dept) = {v1, v2}.The next function returns the end nodes of the path instances of a path p under a given node. It is a

restriction to endnodes(p).

Definition 2.9 Let P be a consistent set of paths, let T be an XML tree that conforms to P . The functionbranEndnodes(v, p) (meaning branch end nodes), where v ∈ V and p ∈ P , is the set of nodes in T defined bybranEndnodes(v, p) = {x|x ∈ endnodes(p) ∧ v ∈ instnodes(x)} 2

For example in Figure 1 , branEndnodes(v1, root.Dept.Section.Emp) = {v4, v5}.We also define a partial ordering on the set of nodes and the set of paths as follows. We use the same symbol

for both orderings but this causes no confusion as they are being applied to different sets.

Definition 2.10 The partial ordering > on the set of paths P is defined by p1 > p2 if p2 is a strict prefix ofp1, where p1 ∈ P and p2 ∈ P . The partial ordering > on the set of nodes V in an XML tree T is defined byv1 > v2 iff v2 ∈ Ancestor(v1), where v1 ∈ V and v2 ∈ V . 2

3 XMVDs in XML

In this section, we present the XMVD definition and then give two examples.

3

Definition 3.1 Let P be a consistent set of paths and let T be an XML tree that conforms to P and is complete.An XMVD is a statement of the form p1, · · · , pk →→ q1, · · · , qm|r1, · · · , rs where p1, · · · , pk, q1, · · · , qm andr1, · · · , rs are paths in P and {p1, · · · , pk} ∩ {q1, · · · , qm} ∩ {r1, · · · , rs} = φ. A tree T satisfies p1, · · · , pk →→q1, · · · , qm|r1, · · · , rs if whenever there exists a qi, 1 ≤ i ≤ m, and two distinct path instances vi

1. · · · .vin and

wi1. · · · .wi

n in instances(qi) such that:(i) val(vi

n) 6= val(win);

(ii) there exists a rj , 1 ≤ j ≤ s, and two nodes z1, z2, where z1 ∈ branEndnodes(xij , rj) and z2 ∈branEndnodes(yij , rj) such that val(z1) 6= val(z2);

(iii) for all pl, 1 ≤ l ≤ k, there exists two nodes z3 and z4, where z3 ∈ branEndnodes(xijl, pl) and

z4 ∈ branEndnodes(yijl, pl), such that val(z3) = val(z4);

then:(a) there exists a path instance v′i1 . · · · .v′in in instances(qi) such that val(v′in) = val(vi

n) and there ex-ists a node z′1 in branEndnodes(x′

ij, ri) such that val(z′1) = val(z2) and there exists a node z′3 in

branEndnodes(x′ijl

, pl) such that val(z′3) = val(z3);(b) there exists a path instance w′i

1 . · · · .w′in in instances(qi) such that val(w′i

n) = val(win) and there

exists a node z′2 in branEndnodes(y′ij, ri) such that val(z′2) = val(z1) and there exists a node z′4 in

branEndnodes(y′ijl, pl) such that val(z′4) = val(z4);

where xij= {v|v ∈ {vi

1, · · · , vin} ∧ v ∈ instnodes(endnodes(rj ∩ qi))} and yij

= {v|v ∈ {wi1, · · · , wi

n} ∧ v ∈instnodes(endnodes(rj ∩ qi))} and xijl

= {v|v ∈ {vi1, · · · , vi

n} ∧ v ∈ instnodes(endnodes(pl ∩ rj ∩ qi))} andyijl

= {v|v ∈ {wi1, · · · , wi

n} ∧ v ∈ instnodes(endnodes(pl ∩ rj ∩ qi))} and x′ij

= {v|v ∈ {v′i1 , · · · , v′in} ∧ v ∈instnodes(endnodes(rj ∩ qi))} and y′ij

= {v|v ∈ {w′i1 , · · · , w′i

n} ∧ v ∈ instnodes(endnodes(rj ∩ qi))} andx′

ijl= {v|v ∈ {v′i1 , · · · , v′in} ∧ v ∈ instnodes(endnodes(pl ∩ rj ∩ qi))} and y′ijl

= {v|v ∈ {w′i1 , · · · , w′i

n} ∧ v ∈instnodes(endnodes(pl ∩ rj ∩ qi))}. 2

We note that since the path rj ∩ qi is a prefix of qi, there exists only one node in vi1. · · · .vi

n thatis also in branEndnodes(rj ∩ qi) and so xij is always defined and is a single node. Similarly foryij

, xijl, yijl

, x′ij

, y′ij, x′

ijl, y′ijl

. We also note that the definition of an XMVD is symmetrical, i.e. the XMVDp1, · · · , pk →→ q1, · · · , qm|r1, · · · , rs holds if and only if the XMVD p1, · · · , pk →→ r1, · · · , rs|q1, · · · , qm holds.We now illustrate the definition by some examples.

Example 3.1 Consider the XML tree shown in Figure 2 and the XMVD C : root.A.Course →→root.A.B.Teacher.S|root.A.C.Text.S. Let vi

1. · · · .vin be the path instance vr.v12.v2.v4.v8 and let wi

1. · · · .win

be the path instance vr.v12.v2.v5.v9. Both path instances are in instances(root.A.B.Teacher.S) and val(v8) 6=val(v9). Moreover, x11 = v12, y11 = v12, x111

= v12 and y111= v12. So if we let z1 = v10 and z2 = v11 then

z1 ∈ branEndnodes(x11 , root.A.C.Text.C) and z2 ∈ branEndnodes(y11 , root.A.C.Text.S). Also if we let z3 = v1

and z4 = v1 then z3 ∈ branEendnodes(x111, root.A.Course) and z4 ∈ branEndnodes(y111

, root.A.Course)and val(z3) = val(z4). Hence conditions (i), (ii) and (iii) of the definition of an XMVD are satisfied.If we let v′i1 . · · · .v′in be the path vr.v12.v2.v4.v8 we firstly have that val(v′in) = val(vn) as required. Also,since the path instances are the same we have that x11 = x′

11and x111

= x′111

. So if we let z′1 = v11

then z′1 ∈ branEndnodes(x′11

, root.A.C.Text.S) and val(z′1) = val(z2) and if we let z′3 = v1 then z′3 ∈branEndnodes(x′

111, root.A.Course) and val(z′3) = val(z3). So part (a) of the definition of an XMVD is

satisfied. Next if we let w′i1 . · · · .w′i

n be the path vr.v8.v2.v5.v9 then we firstly have that val(w′in) = val(wn) since

the paths are the same. Also, since the paths are the same we have that y11 = y′11and y111

= y′111. So if we

let z′2 = v10 then z′2 ∈ branEndnodes(y′11, root.A.C.Text.S) and val(z′2) = val(z1) and if we let z′4 = v1 then

z′4 ∈ branEndnodes(x′11l

, root.A.Course) and val(z′4) = val(z4). Hence part (b) on the definition of an XMVDis satisfied and so T satisfies the XMVD C. 2

Example 3.2 Consider the XML tree shown in Figure 3 and the XMVD C : root.Project.P# →→Root.Project.Person.Name|root.Project.Part.Pid. For the path instances vr.v1.v5.v13 and vr.v2.v8.v16 ininstances(Root.Project.Person.Name) we have that val(v13) 6= val(v16). Moreover, x11 = v1, y11 = v2, x111

=v1 and y111

= v2. So if we let z1 = v15 and z2 = v18 then z1 ∈ branEndnodes(x11 , root.Project.Part.Pid)and z2 ∈ branEndnodes(y11 , root.Project.Part.Pid). Also if we let z3 = v4 and z4 = v7 then z3 ∈branEndnodes(x111

, root.Project.P#) and z4 ∈ branEndnodes(y111, root.Project.P#) and val(z3) = val(z4).

Hence conditions (i), (ii) and (iii) of the definition of an XMVD are satisfied. However, for the right most

4

Figure 2: An XML tree

path instance in instances(root.Project.Person.Name), namely vr.v3.v11.v19 we have that x′11

= v3 and sobranEndnodes(x′

11, root.Project.part.Pid) = v21 and since val(v21) 6= val(z2) and so it does not satisfy

condition (a) of the definition of an XMVD and thus the XMVD C is violated in T .Consider Figure 3 again and another XMVD C′ : root.Project.Person.Name →→

root.Project.Person.Skill|root.Project.P#. For the path instances vr.v1.v5.v14 and vr.v3.v11.v20 ininstances(root.Project.Person.Skill) we have that val(v14) 6= val(v20). Moreover, x11 = v1, y11 = v3,x111

= v1 and y111= v3. So if we let z1 = v4 and z2 = v10 then z1 ∈ branEndnodes(x11 , root.Project.P#) and

z2 ∈ branEndnodes(y11 , root.Project.P#) and val(z1) 6= val(z2). Also if we let z3 = v13 and z4 = v19 then z3 ∈branEndnodes(x111

, root.Project.Person.Name) and z4 ∈ branEndnodes(y111, root.Project.Person.Name)

and val(z3) = val(z4). Hence the conditions of (i), (ii) and (iii) of the definition of an XMVD are satisfied.However there does not exist another path instance in instances(root.Project.Person.Skill) such that valof the last node in the path is equal to val(v14) and so part (a) of the definition of an XMVD is violated. 2

Figure 3: An XML tree

4 Algorithm of checking XMVD

In this section, we present our algorithm for checking XMVD satisfaction.Given an XMVD P →→ Q|R, where P = {p1, · · · pnp

}, Q = {q1, · · · qnq}, and R = {r1, · · · rnr

}, we letS = {s′1, ..., s′n} = P ∪Q ∪R. We call n the number of paths invovled in the XMVD. To check this XMVDagainst a document, we firstly parse the document to extract values for s′i, 1 ≤ i ≤ n. These values are then

5

combined into tuples of the form < v1 · · · vn > where vi is a value for the path s′i. Finally, the tuples are usedto check the satisfaction of the XMVD. For parsing a document, we need to define a control structure based onthe paths involved.

4.1 Defining parsing control structure

We sort S = P ∪ Q ∪ R by using string sorting and denote the result by So = [s1, ..., sn]. We call the set ofend elements of the paths in So prime end elements and denoted by PE, i.e., PE = {endL(s)|s ∈ S} whereendL() is defined in Definition 2.2. Note that So being a list can simplify the calculation of all the intersectionsof path in So as shown below. Consider the example in Figure 4 which will be a running example in this section.Let an XMVD be r.A.C.F→→ r.A.C.D.E|r.A.B. Then So = [r.A.B, r.A.C.D.E, r.A.C.F] and PE = {B, E, F}.

Figure 4: a set of paths and an example XML document

Let H = {h′1, · · · , h′

m} be a set of paths that are the intersections of paths in So where h′i = si ∩ sj ,

(i = 1, · · · ,m), j = i + 1 and m = n − 1. We call the elements ending the paths in H ′ intersection endelements and denote the set by IE. We call both intersection end elements and prime end elements key endelements and denote them by KE. In Figure 4, H = {r.A, r.A.C}, IE = {A, C}, KE = {A, C, B, E, F}.

Now we define and calculate the contributing elements of intersection elements. Let iE be the intersectionend element of a intersection path h,i.e., iE = endL(h). The contributing elements of iE, denoted by CE(iE),are defined to be all key end elements of So and H under h but not contributing elements of any other intersectionend elements. To calculate the contributing elements, we first sort the paths in H by applying string sorting andput the result in Ho as [h1, · · · , hm]. We then follow the following algorithm to calculate contributing elements.

Algorithm 4.1 (calculation of contributing elements)

Inpput: So = [s1, · · · , sn] and Ho = [h1, · · · , hm]Do : For h = hm, · · · , h1 in order,

For each s in So, if s is not marked and if endL(s) is a descendentof h, put endL(s) in CE(endL(h)) and mark s.

For each hx ∈ Ho, if hx is not marked and if endL(hx) is a descendentof h, put endL(hx) in CE(endL(h)) and mark hx.

Output: CE(endL(h)) for all h ∈ Ho.

In the running example of Figure 4, the results of applying Algorithm 4.1 are CE(C) = {E, F} andCE(A) = {B, C}.

4.2 Parsing a document

We define the function val(k) to mean the value set of a key end element. Note that if k is a prime end element,val(k) will be accumulated if there are multiple presences of the same elements under a same node. If k is anintersection end element, val(k) is a set of tuples generated by the production of the values and/or value sets ofcontributing elements of k. We call each element of the production a tuple. Obviously, val(k) changes as theparsing progresses. We now show some examples of val(k) w.r.t Figure 4. When parsing reaches tag <C> in Line3, val(C), val(E) and val(F ) are all set to empty. When parsing has completed Line 6, val(E) = {”e1”, ”e2”}

6

and val(F ) = {”f1”, ”f2”}. When parsing of Line 7 is completed, val(C) = {< ”e1”, ”f1” >, < ”e1”, ”f2” >, < ”e2”, ”f1” >, < ”e2”, ”f2” >}.

We further define some notation. stk denotes a stack while stkTop is used to refer to the element currentlyat the top of the stack. For simplicity, we define e ∈ X, where e is an element and X is a path set, to be trueif there exists a path x ∈ X such that endL(x) = e. With all these definitions, we present the algorithm thatparses a document. Also for simplicity, we use val(h) to mean val(endL(h)) where h is a path.

Algorithm 4.2 (parsing documents)

Inpput: So, Ho, CE, PE, an empty stk, and a documentDo : For each element e in the document in the order of presence

if e ∈ Ho and e closes stkTop, do production ofthe contributing attributes as the following:let CE(e) = {e1, ..., ec}, thenval(e) = val(e) ∪ {val(e1) × · · · × val(ec)}Note that some val(ei) can be sets of tuples while the otherscan be sets of values.

else if e ∈ Ho

push e to stkreset val(e1), · · ·, val(ec) to empty where

e1, · · · , ec ∈ CE(e)else if e ∈ PE

read the value v of e and add v to val(e)if e is a leaf element, v is the constant string on the node.if e is an internal node, v is ’null’.

elseignore the element and keep reading

Output: val(h1) where h1 is the first element in Ho - the shortest intersection path.

Note that the algorithm scans the document only once and all tuples are generated for the XMVD.In the algorithm, the structures So, Ho, CE, PE, an empty stk form a structure group. If there are multiple

XMVDs to be checked at the same time, we create a structure group for each XMVD. During document parsing,for each element e read from the document, the above algorithm is applied to all structure groups. This meansfor checking multiple XMVDs, the same document is still scanned once.

4.3 Tuple attribute shifting and XMVD checking

For each tuple t in val(h1), where h1 is the first element in Ho - the shortest intersection path, it contains avalue for each paths of the XMVD, but the values are in the order of their presence in the document. Thisorder is different from the order of paths in the XMVD. For example in Figure 4, the tuple < ”b1”, ”e1”, ”f1” >is in val(root.A) and the values of the tuple are for the paths r.A.B, r.A.C.D.E, r.A.C.F in order. This order isdifferent from the order of paths in the XMVD: r.A.B, r.A.C.D.E r.A.C.F. We define the operation shiftAttr()to rearrange the order of values of t so that the values for the paths in P are moved to the beginning of t, thevalues for the paths in Q are moved to the middle of t, and the values for the paths in R are at the end of t. TheshiftAttr() operation is applied to every tuple of val(h1) and the result is denoted by Tsa = shiftAttr(val(h1)).

We define a further function removeDuplicates(Tsa) to remove duplicating tuples in Tsa and we denote thereturned set as Tdist. Thus, the following algorithm checks whether an XML document satisfies an XMVD andthis algorithm is one of the main results of our proposal. The basic idea of checking is to group all the tuples forthe XMVD so that tuples with the same P value is put into one group. In each group, the number of distinctQ values, |Q|, and the number of distinct R values, |R|, are calculated. Then if |Q| × |R| is the same as thenumber of distinct tuples in the group, the group satisfies the XMVD; otherwise, it violates the XMVD. If allthe groups satisfy the XMVD, the document satisfies the XMVD.

Algorithm 4.3 (checking mvd)

Inpput: Tdist

Do :violated = 0For G be each set of all tuples in Tdist having the same P value

7

let |G| be the number of tuples in Glet dist(Q) and dist(R) be the numbers of distinct Q values and

of distinct R values in G respectivelyif ( |G| ! = dist(Q)× dist(R) ) then violated + +;

end forOutput: if ( violated == 0 ) return TRUE; otherwise return FLASE.

Note that this algorithm assumes that G is the set of all tuples having the same P value. To satisfy thisassumption, one has to group tuples in Tdist so that tuples with the same P value can be put together. A quicksolution to this is not direct as show in the next section and therefore the way of achieving the assumptiongreatly affects the performance of whole checking algorithm.

5 Implementation and Performance

In this section, we present the performance results of two implementations, each with a different way of groupingtuples to satisfy the assumption of Algorithm 4.3. The first implementation uses sorting, an approach that isthe simplest to implement. The second implementation uses hashing to group tuples and this approach is shownto have superior performance to that of the sort-based method.

The performance of the two methods were tested on a Pentium 4 computer with 398 MB of main memory.Figure 5 shows the DTD structure used in the implementation. Based on the DTD, XMVDs involving 12 paths,9 paths, 6 paths, and 3 paths were specified. In the XML documents used for the experiments, random stringvalues of about 15 characters were generated as values for labels prefixed by id and c. This means that if thereare three paths involved in an XMVD, the length of a tuple is about 45 characters while if there are twelvepaths in an XMVD, the length of a tuple is around 180 characters.

Figure 5: DTD used in implementation

5.1 Performance of the sorting approach

In Algorithm 4.3, the assumption is that all tuples having the same P value are grouped together. At the sametime, in each group, the number of distinct Q values and the number of distinct R values need to be calculated.To do so with sorting, our implementation firstly sorts all tuples in Tdist based on P values using string sorting.Thus tuples of the same P values are put together, which we refer as a group. In the same group, we sort andcalculate the number of distinct P values and the number of R values and then used Algorithm 4.3 to checkXMVD satisfaction.

8

One important issue in analyzing the performance of the implementation is that we can not use file sizeas the independent variable when measuring the performance. The correct factor is the number of tuples thealgorithms have to process. Although the number of tuples is affected by the file size, it is mainly determinedby the number of paths involved in an XMVD. The reason that the number of tuples is mainly determined bythe number of paths involved in an XMVD is that as the number of paths in an XMVD increases, the numberof intersection elements increases, which means more production operations are applied when the document isparsed and more tuples are generated. This is demonstrated in Figure 6. In the figure, the vertical axis indicatesnumber of tuples and the horizontal axis indicates file size in megabytes. The line of diamonds represents thecase where an XMVD involves 6 paths while the line of squares represents the case where an XMVD involvesonly 3 paths. From the figure we see that with the same file size, the number of tuples of 6 paths is about 8times of the number of tuples of 3 paths. On the other hand, to produce the same number of tuples, the filesize of the case with 3 paths is about 70 times larger than that of the case with 6 paths.

Figure 6: File size, number of tuples, and number of paths in XMVD

The first experiment performed was to examine the time distribution of different processes involved in thesorting implementation and the results are given in Table 1. In the table, the column parsing is the time ofapplying Algorithm 4.2 to construct tuples. The next column is the time for shifting attributes of tuples tocomply with the order of paths in the XMVD. The column rem-dup means the time to remove duplicate tuplesand the column checking is the time taken for applying Algorithm 4.3. The last column is the total of allprevious columns. The table shows that with the sorting approach, the main cost is from sorting tuples and allother costs are negligible.

Table 1: Time distribution of the sorting approachprocess parsing shifting sorting rem-dup checking total

time (sec) 0.88 0.59 1655.37 0.14 1.01 1657.99

A graph of the running time of the algorithm when the number of tuples is varied is given in Figure 7.Because the running time is dominated by the time taken for sorting, which in our implementation is aninsertion sort, the running time of this approach is O(n2) where n is the number of tuples. At the same time,the performance gets worse as the number of paths involved in an XMVD increases. This is because an increasein the number of paths in the XMVD causes the length of tuples to increase and therefore the comparison oftwo tuples takes longer. Note that we are dealing with XML data and the comparison of two tuples has to bedone by string comparison. Fortunately, the relationship between the the number of paths and running time islinear, as indicated in Figure 8.

From these experiments, we see the importance of the number of tuples in in the time taken to check XMVDsatisfaction. We now propose a formula for estimating the number of tuples. Before presenting the formula,we define two functions. We recall that PE is the set of ending elements of all paths involved in the XMVDand IE is the set of ending elements of all intersection paths H. We define RE to be set of all elementsthat are in the paths of PE and are descendants of the shortest path in H. Let e be an element in RE andei ∈ Lch(e) [i = 1, · · · ,m] be a child label of e. We define the following functions.

• tpl(e) returns the number of tuples generated for element e;• rep(e) is the number of repeating e elements under the same parent.

With these functions, the following recursive formula estimates the number of tuples.

9

Figure 7: time and number of tuples

Figure 8: time and the number of paths in XMVD

tpl(e) =

rep(e) ∗∏m

i=1 tpl(ei), if e ∈ IErep(e) ∗ tpl(e1), if e ∈ RE ∧ e 6∈ IE ∧ e 6∈ PErep(e), if e ∈ PE

Note that if the condition of the second option of the formula is true, then e has only one child elementrelating to a path in S and this child element is denoted by e1. The formula estimates the number of tuplesbased on three factors: the number of levels of the document, the number of repetitions of the same element ateach level and the number of paths involved in the XMVD. The number of levels is represented in the formula byrecursion. The number of repetitions is represented by the function rep() and the number of paths is representedby the product in the first option of the formula. Take Figure 4 as an example, then rep(E) = 2, rep(D) =2, rep(F ) = 2, rep(C) = 1, rep(B) = 2, rep(A) = 1, tpl(E) = 2, tpl(D) = rep(D) ∗ tpl(E) = 4, tpl(F ) =2, tpl(C) = rep(C) ∗ tpl(D) ∗ tpl(F ) = 8, tpl(B) = 2, tpl(A) = rep(A) ∗ tpl(B) ∗ tpl(C) = 16.

5.2 Performance of the hashing approach

Our second implementation uses the standard extensible hashing [12] with some modifications. In the standardextensible hashing technique, each object to be hashed has a distinct key value. By using a hash function, thekey value is mapped to an index to a pointer, pointing to a fixed size basket, in a pointer directory. Theobject with the key value will then be put into the pointed basket. Every time a basket becomes full, thedirectory space is doubled and the full basket is split into two. In our implementation, we use the digests,integers converted from strings, of P values of tuples as the key values of the standard extensible hashing. Ourmodification to the standard extensible hashing technique is the following.

The baskets we use are extensible, meaning that the size of each basket is not fixed. We allow only thosetuples with the same key value to be put into a basket. We call the key value of the tuples in the basket thebasket key. Tuples with different keys are said to conflict. Every time when placing a tuple into an existing

10

basket causes a conflict, a new basket is created and the conflicting tuple is put into the new basket. At thesame time, the directory is doubled and new hash codes are calculated for both the existing basket key and thenew basket key. The doubling process continues until the two hash codes are different. Then the existing andthe new baskets are connected to the pointers indexed by the corresponding hash codes in the directory.

Figure 9 shows the directory and a basket. Note that because of directory space doubling, a basket maybe referenced by multiple pointers. In the diagram, p stands for the basket key, the three spaces on the rightof p are lists storing distinct Q values, distinct R values and distinct Q and R combinations. On top of thelists, three counters are defined. nq stands for the number of distinct Q values, nr the number of distinct Rvalues and nqr the number of distinct Q and R combinations. After hashing is completed, the three countersare used to check the XMVD as required by Algorithm 4.3. When a new tuple t =< p, q, r >, where p, q andr are values for P, Q and R respectively, with the same p value is inserted to a basket, q is checked against allexisting values to see if it equals to one of them. If yes, the q value is ignored; otherwise, the q value is appendedto the end of the list and the counter is stepped. Similar processes are applied to insert r and the combination< q, r >.

Figure 9: The structure of a basket and one extension

We now analyze the performance of hashing. Obviously the calculation of hash codes to find baskets fortuples is linear in relation to the number of tuples. When a tuple t =< p, q, r > is put into a basket that hasalready has tuples, then comparisons are needed to see if q, r and the combination < q, r > are already in thelists. The performance of the comparison relates to the number of distinct existing values. Generally, if weneed to put n′ values into a list that has had m distinct values, then the performance is O(n′ ∗m) comparisons.Obviously the worst case is that all n′ values are distinct and the performance is then O(n′2). This is similar towhat occurs in sorting. However, it is different from the sorting implementation. The sorting implementationhas the performance of O(n2) but n is the number of total tuples. With the hashing implementation, becausehashing has spread the total n tuples into different baskets and therefore reduced greatly the number of tuplesto be put in a basket to n′.

We conducted a number of experiments to analyze the performance of this hash-based implementation andthe results are given in Figure 10. In this experiment we plotted the time taken for checking XMVD satisfactionagainst the number of tuples in the document, for varying numbers of paths in the XMVD (we used 6, 9, and12). In the cases of 3 paths and 6 paths, we see that the former has a higher cost. This can be explainedbecause the overall performance contains the time for parsing documents. As shown in Figure 6, to have thesame number of tuples in the cases of 3 paths and 6 paths, the 3 path case has a much larger file size, about 70times of that of 6 paths and therefore the parsing time used is much larger in contrast to that of 6 path case.It is the parsing time that makes performance for 3 paths worse than that for 9 or 12 paths.

5.3 Performance comparison of sorting and hashing

By comparing the performances of the sorting implementation and the hashing implementation, we see thathashing performs significantly better. Firstly, the performance of hashing is linear in the number of tuples whilethe performance of sorting is quadratic. Secondly, as shown in Table 2, the time taken for hashing is muchsmaller than that of sorting, for the same number of tuples, and the difference becomes larger as the number ofpaths involved in the XMVD increases.

On the other hand, hashing relies on an uniform key value (P value) distribution. If all tuples have thesame key value, then the performance of hashing would be the same as sorting. Secondly, if the hashing keydistribution is not uniform, then there will be more conflicting tuples and this causes the directory space to

11

Figure 10: Performance of hash method

Table 2: Comparison of performances of sorting and hashing12paths 9paths 6path 3paths

sorting 2048.8 1658 1259 809.8hashing 5.35 5.96 3.38 25

double. When memory usage reaches the limit of available physical memory, external memory would have tobe used. We argue that even in this case, linear performance is still achievable.

6 Conclusions

In this paper we have addressed the problem of developing an efficient algorithm for checking the satisfactionof XMVDs, a new type of XML constraint that has recently been introduced [14, 13, 15]. We have developedan algorithm bases on extendible hashing that requires only one scan of the XML document to check XMVDs.At the same time, its running time is linear in the size of the application which is proved to be the number oftuples. The algorithm can check not only the cases where there is only one XMVD, but also the cases involvingmultiple XMVDs. We also conducted comparisons between the hashing approach and another approach whichuses sorting. The results of comparison indicate that the hashing approach is a much better one.

One issue we intend to investigate in the future is the issue of incrementally checking XMVDs when an XMLdocument is updated. Solving this problem not only is an extension to solution of the problem solved in thispaper, but also has significance in cases where XML documents are changing frequently and the enforcement ofthe constraints is critical in the application.

References

[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kauffman, 2000.

[2] S. Abiteboul and O. M. Duschka. Complexity of answering queries using materialized views. In Proceedingsof the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages254 – 263, 1998.

[3] M. Arenas and L. Libkin. A normal form for XML documents. In Proc. ACM PODS Conference, pages85–96, 2002.

[4] M. Arenas and L. Libkin. An information-theoretic approach to normal forms for relational and XML data.In Proc. ACM PODS Conference, pages 15–26, 2003.

[5] M. Arenas and L. Libkin. A normal form for XML documents. TODS, 29(1):195 – 232, 2004.

[6] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for XML. Computer Networks, 39(5):473–487, 2002.

12

[7] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Reasoning about keys for XML. InformationSystems, 28(8):1037–1063, 2003.

[8] P. Buneman, W. Fan, J. Simeon, and S. Weinstein. Constraints for semistructured data and XML. ACMSIGMOD Record, 30(1):45–47, 2001.

[9] P. Buneman, W. Fan, and S. Weinstein. Path constraints in semistructured data. Journal of Computerand System Sciences, 61:146 – 193, 2000.

[10] W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. Journal of the ACM,49(3):368 – 406, 2002.

[11] W. Fan and J. Simeon. Integrity constraints for XML. Journal of Computer and System Sciences, 66(1):254–291, 2003.

[12] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw Hill, 2002.

[13] M.W. Vincent and J. Liu. Multivalued dependencies and a 4NF for XML. In 15th International Conferenceon Advanced Information Systems Engineering (CAISE), pages 14–29. Lecture Notes in Computer Science2681 Springer, 2003.

[14] M.W. Vincent and J. Liu. Multivalued dependencies in XML. In 20th British National Conference onDatabases (BNCOD), pages 4–18. Lecture Notes in Computer Science 2712 Springer, 2003.

[15] M.W. Vincent, J. Liu, and C. Liu. Multivalued dependencies and a redundancy free 4NF for XML. InInternational XML database symposium (Xsym), pages 254–266. Lecture Notes in Computer Science 2824Springer, 2003.

[16] M.W. Vincent, J. Liu, and C. Liu. Strong functional dependencies and their application to normal formsin XML. ACM Transactions on Database Systems, 29(3):445–462, 2004.

[17] M.W. Vincent, M. Schrefl, J. Liu, C. Liu, and S. Dogen. Generalized inclusion dependencies in XML. InThe Sixth Asia Pacific Web Conference., 2004.

[18] J. Widom. Data management for XML - research directions. IEEE data Engineering Bulletin, 22(3):44–52,1999.

13