Containment of Nested XML Queries Presented by: Orly Goren Xin Dong, Igor Tatarinov Alon Halevy,


Containment of Nested XML Queries

Presented by: Orly Goren

Xin Dong, Alon Halevy, Igor Tatarinov

Query Containment

The most fundamental relationship between a pair of queries

Query Q is contained in Q’ if, for every database D, Q(D) is a subset of Q’(D).
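The definition quantifies over every database, so testing on samples can only refute containment, never prove it. A minimal sketch of that check for the relational case, assuming databases are sets of tuples; the two queries and the sample databases are hypothetical and not taken from the slides:

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

Row = Tuple[str, ...]
Database = FrozenSet[Row]
Query = Callable[[Database], FrozenSet[Row]]


def find_counterexample(q: Query, q_prime: Query,
                        databases: Iterable[Database]) -> Optional[Database]:
    """Return the first sample D with Q(D) not a subset of Q'(D), else None."""
    for d in databases:
        if not q(d) <= q_prime(d):
            return d
    return None


# Hypothetical queries over rows of the form (project, member).
def members_named_alice(d: Database) -> FrozenSet[Row]:
    return frozenset(row for row in d if row[1] == "Alice")


def all_members(d: Database) -> FrozenSet[Row]:
    return frozenset(d)


# Hypothetical sample databases.
SAMPLES = [
    frozenset(),
    frozenset({("p1", "Alice")}),
    frozenset({("p1", "Alice"), ("p2", "Bob")}),
]

print(find_counterexample(members_named_alice, all_members, SAMPLES))
print(find_counterexample(all_members, members_named_alice, SAMPLES))
```

Finding no counterexample among the samples is only evidence of containment; a decision procedure has to reason about all inputs at once.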

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

Complexity of containment, by fanout (rows) and nesting depth (columns):

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   coNP-complete    in coNEXPTIME

Applications of Query Containment
  • Semantic caching
  • Determining independence of database updates
  • Query answering using views
  • Detecting that a reformulated query is redundant
  • Query minimization
  • Verification of knowledge bases

Query Processing in PDMS

XML query containment in a Peer Data Management System (PDMS):
  • Answering queries using views to extract remote data
  • Removing redundant queries to enhance performance

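[Figure: a PDMS with four peers (UW, Stanford, Berkeley, UPenn) linked by the mappings MWS, MPW, MSB and MBW; the query QW posed at UW is reformulated into the queries QS, QP, QB1 and QB2 at the other peers.]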

Query Containment: Relational vs. XML

                        Relational                     XML
  Input D               Sets of tuples                 An XML instance tree
  Output Q(D)           A set of tuples                An XML instance tree
  Instance containment  Q(D) ⊆ Q’(D): subset           Q(D) ⊆ Q’(D): tree embedding
  Query containment     Q ⊆ Q’: for every input D,     Q ⊆ Q’: for every input D,
                        Q(D) ⊆ Q’(D)                   Q(D) ⊆ Q’(D)

Example – An XML Instance

D:

<project>

<member>Alice</member>

</project>

<project>

<member>Bob</member>

</project>

[Figure: the tree of D, with two project nodes, each having one member child; the member nodes contain Alice and Bob respectively.]

Example – An XML Query

Q:
for $x in /project return
<group>{
  for $y in $x/member return
  <name>{
    where $y = "Alice" return <Alice/>
    where $y = "Bob" return <Bob/>
  }</name>
}</group>

[Figure: D as above, and Q(D), which consists of two group trees: one group with a name child containing Alice, and one group with a name child containing Bob.]

Example – Another XML Query

Q’:
for $x in /project return
<group>{
  for $y in /project/member return
  <name>{
    where $y = "Alice" return <Alice/>
    where $y = "Bob" return <Bob/>
  }</name>
}</group>

[Figure: D as above, and Q’(D), drawn as a group node with two name children, one containing Alice and the other containing Bob.]

Tree Embedding

Given two trees T1 and T2, a node mapping ψ from T1 to T2 is said to be an embedding from T1 to T2 if:

  • ψ maps the root of T1 to the root of T2.
  • If node n2 is a child of node n1 in T1, then ψ(n2) is a child of ψ(n1), and n1 and n2 have the same labels as ψ(n1) and ψ(n2), respectively.

What is the time complexity of finding an embedding from T1 to T2?
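The definition above can be sketched as a recursive check. This is a minimal sketch, not the authors' implementation: the Node class and the synthetic "root" label are assumptions made for illustration, and the mapping is not required to be injective, as in the definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def embeds(t1: Node, t2: Node) -> bool:
    """True if the tree rooted at t1 can be embedded in the tree rooted at t2."""
    if t1.label != t2.label:
        return False
    # Every child of t1 needs some child of t2 that it embeds into.
    return all(any(embeds(c1, c2) for c2 in t2.children)
               for c1 in t1.children)


# Q(D): two groups, each with a single name child.
q_d = Node("root", [
    Node("group", [Node("name", [Node("Alice")])]),
    Node("group", [Node("name", [Node("Bob")])]),
])

# Q'(D): one group with two name children.
q_prime_d = Node("root", [
    Node("group", [
        Node("name", [Node("Alice")]),
        Node("name", [Node("Bob")]),
    ]),
])

print(embeds(q_d, q_prime_d))
print(embeds(q_prime_d, q_d))
```

In this sketch each pair of nodes (one from T1, one from T2) is compared at most once, which suggests a running time on the order of |T1|·|T2|.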

XML Instance Containment

Let e and e’ be two XML instances. e is contained in e’, denoted e ⊆ e’, if the tree of e can be embedded in the tree of e’.

Containment is reflexive and transitive. Containment is not antisymmetric: e ⊆ e’ and e’ ⊆ e do not imply e = e’.

[Figure: two XML instances over the labels a and b that contain each other but are not equivalent.]

XML Query Containment

Let Q and Q’ be two XML queries. Q is contained in Q’, denoted Q ⊆ Q’, if for every input XML instance D, Q(D) ⊆ Q’(D).

Example – Tree Embedding and Query Containment

  • Q(D) ⊆ Q’(D): the tree of Q(D) can be embedded in the tree of Q’(D).
  • Q’(D) ⊄ Q(D): there is no embedding in the other direction.

[Figure: Q(D), two group trees with one name child each (Alice, Bob), next to Q’(D), a group with two name children (Alice and Bob).]

Query Containment Problem

From answer containment to query containment
  • Q’(D) ⊄ Q(D) for this D, so Q’ ⊄ Q.
  • Q(D) ⊆ Q’(D) for this D; does Q ⊆ Q’ hold? A single instance cannot establish it.

Our problems
  • Given queries Q and Q’, decide whether Q ⊆ Q’.
  • The complexity of query containment.

Previous Work (I)

Relational query containment
  • Conjunctive queries [Chandra and Merlin, STOC 1977]
  • Acyclic queries [Yannakakis, VLDB 1981]
  • Queries with union [Sagiv and Yannakakis, JACM 1980]
  • Queries with negation [Levy and Sagiv, VLDB 1993]
  • Queries with arithmetic comparisons [Klug, JACM 1988]
  • Recursive queries [Shmueli, 1993], [Chaudhuri and Vardi, 1992]
  • Queries over bags [Ioannidis and Ramakrishnan, 1995]

Previous Work (II)

XML query containment: two new challenges
  • XPath containment
      • With *, // and […] [Miklau and Suciu, PODS 2002]
      • With equality testing on tag variables [Deutsch and Tannen, KRDB 2001]
      • Conjunctive queries over path expressions [Florescu, Levy and Suciu, PODS 1998]
  • Nested query containment

Containment Cannot be Determined Solely by Comparing XPath Components

Q:
for $g in /group where $g/gname/text() = "database" return
<area>{
  for $p in $g/person return
  <person>
    <name>{$p/text()}</name>
    {for $q in $g/paper where $q/author/text() = $p/text() return
      <paper>{$q/title/text()}</paper>}
  </person>
}</area>

Q’:
for $g in /group return
<area>{
  for $p in $g/person return
  <person>
    <name>{$p/text()}</name>
    <group>{$g/gname/text()}</group>
    {for $q in $g/paper where $q/author/text() = $p/text() return
      <paper>{$q/title/text()}</paper>}
  </person>
}</area>

Previous Work (II)

XML query containment: two new challenges
  • XPath containment
      • With *, // and […] [Miklau and Suciu, PODS 2002]
      • With equality testing on tag variables [Deutsch and Tannen, KRDB 2001]
      • Conjunctive queries over path expressions [Florescu, Levy and Suciu, PODS 1998]
  • Nested query containment
      • Complex object query containment [Levy and Suciu, PODS 1997]

Containment of nested XML queries has not been fully studied.

Conjunctive XML Queries (c-XQueries)

Returned variables are bound to tag names or text values only.

Conjunctive: no two sibling query blocks return the same tag.

XPath fragment:
  • Included: child axis (/), wildcards (*), branches ([…])
  • Not included: descendant axis (//), arithmetic comparison, union

For this fragment, XPath containment is in PTIME.

Conjunctive Queries – cont.

A c-XQuery consists of nested query blocks.

The fan-out of a query block is the number of its immediate sub-blocks.

The nesting depth of a query block is 1 plus the maximal nesting depth of its sub-blocks. The nesting depth of the query is the depth of its outermost block.
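Both measures follow directly from the block structure. A minimal sketch, assuming a query is modelled as a tree of blocks; the Block class is a hypothetical stand-in and the example encodes one reading of the earlier query Q (a group block containing a name block with two sub-blocks):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    tag: str
    sub_blocks: List["Block"] = field(default_factory=list)


def fan_out(block: Block) -> int:
    """Number of immediate sub-blocks."""
    return len(block.sub_blocks)


def nesting_depth(block: Block) -> int:
    """1 plus the maximal nesting depth of the sub-blocks (1 for a leaf block)."""
    return 1 + max((nesting_depth(b) for b in block.sub_blocks), default=0)


def max_fan_out(block: Block) -> int:
    """Largest fan-out over the block and all of its descendants."""
    return max([fan_out(block)] + [max_fan_out(b) for b in block.sub_blocks])


# group block -> name block -> two sibling blocks returning <Alice/> and <Bob/>.
q = Block("group", [Block("name", [Block("Alice"), Block("Bob")])])

print(fan_out(q), nesting_depth(q), max_fan_out(q))
```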

Query Head Tree

The structure of an XML query and its answers can be described using a query head tree. Edges represent query blocks.

The label of the node n in the head tree is the returned tag of the block corresponding to the incoming edge of n in Q .

A head tree is also an XML instance if its variables are substituted with actual values.

Query Head Tree Example:

Q:
for $x in /project return
<group>
  {for $s in $x/title/text() return <projtitle>{$s}</projtitle>}
  {for $t in $x/member/text() return <name>{$t}</name>}
</group>

[Figure: the query head tree, with root group and two children: projtitle (with child s) and name (with child t).]

What is the fan-out and the nesting depth of Q?

Constant Conjunctive XML Queries (cc-XQueries)

A cc-XQuery is a c-XQuery that does not return tag variables.

The head tree of a cc-XQuery has constant labels only.

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   coNP-complete    in coNEXPTIME

Deciding Q ⊆ Q’?

How can a property be checked over an infinite number of input XML instances?

  • Standard technique: find a finite set of input representatives (canonical databases).
  • Relational queries: each canonical database is a minimal input that generates the answer template.
  • XML queries: the answers have an infinite number of shapes.
  • Find a finite set of answer templates (canonical answers).

Answer Shapes Determined by the Head Tree

Q’:
for $x in /project return
<group>{
  for $y in /project/member return
  <name>{
    where $y = "Alice" return <Alice/>
    where $y = "Bob" return <Bob/>
  }</name>
}</group>

[Figure: the head tree of Q’ (group → name → Alice, Bob) and the answer shapes derived from it: group alone; group with name; group with name containing Alice; group with name containing Bob; group with name containing both Alice and Bob.]

An Additional Candidate Answer

[Figure: the additional candidate answer, a group with two name children (one containing Alice, the other Bob), shown next to the head tree and the answer shapes from the previous slide.]

Why Consider the Additional Case

[Figure: for the instance D (two project elements with members Alice and Bob), Q(D) consists of two group trees with one name child each, while Q’(D) is a group with two name children containing Alice and Bob.]

What can Serve as Canonical Answers?

  • Prefix subtrees of the head tree? Necessary but not sufficient.
  • Trees contained in the head tree? Necessary and sufficient, but too many and too complex.

A Head Tree can Have Many Trees Contained in it

[Figure: the head tree (group → name → Alice, Bob) and several larger trees contained in it, obtained by repeating group and name subtrees with different combinations of Alice and Bob children.]

What can Serve as Canonical Answers?

  • Prefix subtrees of the head tree? Necessary but not sufficient.
  • Trees contained in the head tree? Necessary and sufficient, but too many and too complex.

Solution: consider only minimal trees that are contained in the head tree.

Canonical Answer

  • A minimal XML instance: no two sibling subtrees where one is contained in the other.
  • Canonical answer: a minimal XML instance contained in the head tree.
  • Every answer A of query Q corresponds to a unique canonical answer CA such that A ⊆ CA and CA ⊆ A.

[Figure: an answer tree with repeated name subtrees, the head tree (group → name → Alice, Bob), and the corresponding canonical answer, a group with two name children containing Alice and Bob.]
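Minimality can be illustrated by pruning sibling subtrees that are contained in another sibling. A minimal sketch, assuming trees are (label, children) tuples; the function names and the example trees are illustrative only, and this is not claimed to be the construction used in the paper:

```python
def embeds(t1, t2):
    """True if tree t1 can be embedded in tree t2 (labels must match)."""
    label1, kids1 = t1
    label2, kids2 = t2
    return label1 == label2 and all(
        any(embeds(c1, c2) for c2 in kids2) for c1 in kids1)


def minimize(tree):
    """Return a tree with no two sibling subtrees where one contains the other."""
    label, kids = tree
    kept = []
    for kid in (minimize(k) for k in kids):
        if any(embeds(kid, other) for other in kept):
            continue  # kid is already covered by a kept sibling
        # Drop kept siblings that the new subtree covers, then keep it.
        kept = [other for other in kept if not embeds(other, kid)]
        kept.append(kid)
    return (label, kept)


# Hypothetical answer: name(Alice) is contained in name(Alice, Bob).
answer = ("group", [
    ("name", [("Alice", [])]),
    ("name", [("Alice", []), ("Bob", [])]),
])
minimal = minimize(answer)
print(minimal)
print(embeds(answer, minimal), embeds(minimal, answer))
```

The pruned tree and the original contain each other, which matches the statement that every answer corresponds to a canonical answer in both directions.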

Canonical Database

Canonical database DB_CA: the minimal XML instance that generates CA.

for $x in /project return
<group>{
  for $y in /project/member return
  <name>{
    where $y = "Alice" return <Alice/>
    where $y = "Bob" return <Bob/>
  }</name>
}</group>

[Figure: CA is a group with two name children containing Alice and Bob; DB shows three project nodes, two of which have a member child (containing Alice and Bob respectively).]

Canonical Database – Formal Def.

The canonical database of a cc-XQuery for a canonical answer CA, written DB_CA, is an XML instance such that, for each node N of CA whose generator query block is q_n, the following holds:

Let p0/p1/…/pn be a path expression in q_n, where p0 is an optional node variable from an ancestor query block. For each pi, i ∈ [1, n], there is a distinct node, labeled pi, that is a child of the node for p(i-1). If p0 is absent, then the node for p1 is a child of DB_CA’s root.
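The construction adds one fresh chain of nodes per path expression. A minimal sketch of that step, assuming trees are (label, children) tuples; add_path, the synthetic "root" label, and modelling an equality test such as $y = "Alice" as a final text step are all assumptions made for illustration:

```python
def add_path(parent, steps):
    """Append a fresh chain of nodes, one per path step, below `parent`.

    `parent` is the database root when p0 is absent, or the node bound to
    the ancestor variable p0 otherwise. Returns the last node of the chain.
    """
    node = parent
    for label in steps:
        child = (label, [])
        node[1].append(child)  # a distinct new node for every step
        node = child
    return node


# Paths of Q' for the canonical answer group(name(Alice), name(Bob)):
#   /project                      (outer block, returns <group>)
#   /project/member = "Alice"     (inner block, first name node)
#   /project/member = "Bob"       (inner block, second name node)
db = ("root", [])
add_path(db, ["project"])
add_path(db, ["project", "member", "Alice"])
add_path(db, ["project", "member", "Bob"])
print(db)
```

For a path that starts from an ancestor variable, such as $x/member in Q, the same helper would be called with the node returned for $x as `parent`.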

Sound and Complete Conditions for Nested Query Containment

Let Q and Q’ be two cc-XQueries. The following three conditions are equivalent:

1. Q ⊆ Q’

2. For every canonical database DB of Q, Q(DB) ⊆ Q’(DB)

3. For every canonical answer CA of Q,

a) CA is a canonical answer of Q’

b) DB’_CA ⊆ DB_CA

Properties of Canonical Answers and Databases

Lemma 1: Let Q be a cc-XQuery and D be an XML instance. There exists a unique canonical answer CA of Q such that Q(D) ⊆ CA and CA ⊆ Q(D).

Lemma 2: Let Q be a cc-XQuery, CA be a canonical answer of Q, DB_CA be the canonical database for CA of Q, and D be an XML instance. CA ⊆ Q(D) if and only if DB_CA ⊆ D.

Containment of cc-XQueries – Proof (1)

1) ⇒ 2): Follows from the definition.

2) ⇒ 3):
  • CA ⊆ Q(DB_CA) (Lemma 2) and Q(DB_CA) ⊆ Q’(DB_CA) (by 2), so CA ⊆ Q’(DB_CA) (containment is transitive); hence a) holds.
  • CA is a canonical answer of Q’ (by a) and CA ⊆ Q’(DB_CA), so DB’_CA ⊆ DB_CA (Lemma 2); hence b) holds.

Containment of cc-XQueries – Proof (2)

3) ⇒ 1): To show Q ⊆ Q’, we need to show that Q(D) ⊆ Q’(D) for every XML instance D.
  • There exists a unique CA of Q such that Q(D) ⊆ CA and CA ⊆ Q(D) (Lemma 1), so DB_CA ⊆ D (Lemma 2).
  • DB’_CA ⊆ DB_CA (by 3b), so DB’_CA ⊆ D (containment is transitive).
  • Hence CA ⊆ Q’(D) (Lemma 2), and therefore Q(D) ⊆ Q’(D) (containment is transitive).

Query Containment Algorithm

for every canonical answer CA of Q do
  1. check whether CA is a canonical answer of Q’
  2. generate DB_CA and DB’_CA
  3. check DB’_CA ⊆ DB_CA
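The loop above can be written as a small driver that takes the three sub-steps as parameters. This is only a skeleton under stated assumptions: every argument name is hypothetical, and the toy stand-ins at the bottom (strings for canonical answers, sets for databases, subset for embedding) exist solely to make the sketch executable.

```python
def is_contained(canonical_answers_of_q, is_canonical_answer_of_q_prime,
                 canonical_db, canonical_db_prime, embeds):
    """Return True if every canonical answer CA of Q passes both checks."""
    for ca in canonical_answers_of_q:
        # Step 1: CA must also be a canonical answer of Q'.
        if not is_canonical_answer_of_q_prime(ca):
            return False
        # Steps 2 and 3: generate both canonical databases and compare them.
        if not embeds(canonical_db_prime(ca), canonical_db(ca)):
            return False
    return True


# Toy stand-ins, not real canonical answers or databases.
cas = ["ca1", "ca2"]
db = {"ca1": {1, 2}, "ca2": {3}}
db_prime = {"ca1": {1}, "ca2": {3}}

result = is_contained(cas, lambda ca: True, db.get, db_prime.get,
                      lambda a, b: a <= b)
print(result)
```

Passing the sub-steps as callables keeps the driver independent of how canonical answers are enumerated, which is where the cost of the approach lies.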

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         ?                ?
  Arbitrary   ?                ?

Query Containment Algorithm

for every canonical answer CA of Q do
  1. check whether CA is a canonical answer of Q’
  2. generate DB_CA and DB’_CA
  3. check DB’_CA ⊆ DB_CA

The algorithm is polynomial in the size and number of canonical answers.
  • What are the sizes of canonical answers?
  • What is the number of canonical answers?

Containment of XML Queries with Fanout 1

E.g. d = 3 (the depth), m = 1 (the maximum fanout):

for $x in /project return
<group>{
  for $y in /project/member return
  <name>{
    where $y = "Alice" return <Alice/>
  }</name>
}</group>

[Figure: the three canonical answers: group; group → name; group → name → Alice.]

Canonical answers and complexity
  • Number: the depth of the query
  • Size: bounded by the depth of the query
  • Complexity: O(d·|Q|·|Q’|)

Theorem: Testing containment of XML queries with fanout 1 is in PTIME.

Nesting with fanout 1 does not increase complexity.
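With fanout 1 the head tree is a single chain, so the canonical answers in the example are its non-empty prefixes. A minimal sketch, assuming the chain is given as a list of labels from the root down; the function name is hypothetical:

```python
from typing import List


def chain_canonical_answers(head_chain: List[str]) -> List[List[str]]:
    """Return every non-empty prefix of a chain-shaped head tree."""
    return [head_chain[:k] for k in range(1, len(head_chain) + 1)]


answers = chain_canonical_answers(["group", "name", "Alice"])
for answer in answers:
    print(" -> ".join(answer))
```

There is one prefix per level, so both the number and the size of the canonical answers are bounded by the depth d, in line with the counts stated above.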

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   ?                ?

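Containment of XML Queries with Arbitrary Fanout

E.g. d = 4 (the depth), m = 3 (the maximum fanout)

Canonical answers and complexity
  • Number: [formula not recoverable; surviving exponents: d, d-1]
  • Size: [formula not recoverable; surviving exponent: d]

Theorem: Testing containment of XML queries with depth 2 and arbitrary fanout is coNP-hard.

[Figure: trees whose nodes carry different combinations of the child labels 1, 2 and 3, illustrating the canonical answers.]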

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   coNP-hard        coNP-hard

Note: the bounds for arbitrary fanout are not tight.

Effect of the Depth on Containment of XML Queries

Insight: kernel canonical answer (KCA)
  • The root node has a single child.
  • In any subtree, a path pattern is repeated no more than cd times, where d is the query depth and c is the maximum number of path steps in a query block.

The size of kernel canonical answers
  • Polynomial in the query size (for fixed nesting depth).
  • Exponential in the query depth (for arbitrary depth).

Theorem:
  • Testing containment of XML queries with fixed depth is coNP-complete.
  • Testing containment of XML queries with arbitrary depth is in coNEXPTIME.

Effect of the Depth on Containment of XML Queries – Cont.

Lemma 3: Let Q and Q’ be two cc-XQueries. Q ⊆ Q’ iff for each KCA of Q:
  1. KCA is a canonical answer of Q’.
  2. DB’_KCA ⊆ DB_KCA.

The size of a KCA is O((bcd)^d).

The number of KCAs is O(m^((bcd)^d)), where b is the number of query blocks in Q and m is the maximum fanout in Q.


Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   coNP-complete    in coNEXPTIME

Containment Checking in Practice

Analyze element cardinality to reduce the number of canonical answers for containment checking.

Given the query structure and the underlying XML database schema, we can infer the cardinality of elements in the query answer.

Specifically, CAs are pruned according to the following three rules:
  1. (=1) The schema implies that a certain element occurs exactly once under its parent element.
  2. (≥1) The schema implies that a certain element occurs at least once under its parent element.
  3. (≤1) The schema implies that a certain element occurs at most once under its parent element.

Containment Checking in Practice – Example

Q:
for $g in /group where $g/gname/text() = "database" return
<area>{
  for $p in $g/person return
  <person>
    <name>{$p/text()}</name>
    {for $q in $g/paper where $q/author/text() = $p/text() return
      <paper>{$q/title/text()}</paper>}
  </person>
}</area>

Q’:
for $g in /group return
<area>{
  for $p in $g/person return
  <person>
    <name>{$p/text()}</name>
    <group>{$g/gname/text()}</group>
    {for $q in $g/paper where $q/author/text() = $p/text() return
      <paper>{$q/title/text()}</paper>}
  </person>
}</area>

Number of canonical answers
  • Originally: 71
  • After analysis: 2

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   coNP-complete    in coNEXPTIME

An Example Query that Returns Tag Variables

for $x in dbGrp return
<result>{
  for $y in $x/proj return
  <group>{
    for $u in $y/member return <name> $u/text() </name>
    for $v in $y/paper return <pub> $v/text() </pub>
  }</group>
}</result>

Deciding Query Containment
  • Leverage previous results: simulation mapping [Levy and Suciu, PODS 1997]
  • Check the query simulation mapping for every canonical answer

Complexity
  • A simulation mapping can be checked in time polynomial in the query size.
  • The complexity of checking containment does not increase.

Roadmap
  • Introduction and problem definition
  • Containment of a subset of XML queries
  • Query containment is decidable
  • Query containment in practice
  • Relaxing the assumptions
  • Conclusions

  Fanout      Fixed depth      Arbitrary depth
  = 1         PTIME            PTIME
  Arbitrary   coNP-complete    in coNEXPTIME

Other Extensions

  Query type                   Un-nested       Fan-out = 1     Fixed depth
  No tag variables             PTIME           PTIME           coNP-complete
  With tag variables           PTIME           PTIME           coNP-complete
  With unions                  coNP-complete   coNP-complete   coNP-complete
  With negation                coNP-complete   coNP-complete   coNP-complete
  With //                      coNP-complete   coNP-complete   coNP-complete
  With equi-join on tags       NP-complete     NP-complete     Π2P-complete
  With arithmetic comparisons  Π2P-complete    Π2P-complete    Π2P-complete

  General: in coNEXPTIME

Conclusions

Contributions
  • A sound and complete condition for containment of nested XML queries
  • Detailed complexity analysis

Future work
  • Evaluate and optimize the containment algorithm with element cardinality analysis
  • Answering nested XML queries using views