Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Dr. Yangjun Chen
Dept. Applied Computer Science, University of Winnipeg
515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9


Outline

• Motivation
• Algorithm for unordered tree pattern query evaluation
  - Tree encoding
  - Algorithm description
• On the strict unordered tree matching
• Experiment results
• Summary

Motivation

Efficient method to evaluate XPath expression queries (XML query processing)

[Figure: XML documents and a tree pattern query.]

Motivation

Document:

<Purchase>
  <Seller>
    <Name>dell</Name>
    <Item>
      <Manufacturer>IBM</Manufacturer>
      <Name>part#1</Name>
      <Item>
        <Manufacturer>Intel</Manufacturer>
      </Item>
    </Item>
    <Item>
      <Name>Part#2</Name>
    </Item>
    <Location>Houston</Location>
  </Seller>
  <Buyer>
    <Location>Winnipeg</Location>
    <Name>Y-Chen</Name>
  </Buyer>
</Purchase>

[Figure: the document drawn as a tree, with the element names abbreviated (P = Purchase, S = Seller, B = Buyer, I = Item, L = Location, N = Name, M = Manufacturer) and the text values (Dell, IBM, Intel, Part#1, Part#2, Houston, Winnipeg, Y-Chen) as leaves.]

Motivation

Query: XPath expressions

Q1: /Purchase[Seller[Loc = ‘Boston’]]/Buyer[Loc = ‘New York’]

Q2: /Purchase//Item[Manufacturer = ‘Intel’]

[Figure: tree patterns for the queries, with node labels Purchase, Seller, Buyer, Location (‘Houston’, ‘Winnipeg’) and Item, Manufacturer (‘Intel’), shown next to the document tree of the previous slide.]

d-edge: ancestor-descendant relationship

c-edge: parent-child relationship

Motivation: XPath evaluation against XML documents

- XPath expression

a[b[c and .//d]]/b[c and e//d]

book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’]

[Figure: the tree patterns for the two expressions above.]

Document fragment:

<document>
  <book>
    <title>Art of Programming</title>
    <author>
      <fn>Donald Knuth</fn>
      … …


Motivation: XPath evaluation against XML documents

- Evaluation based on unordered tree matching:

Definition 1 An embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:

(i) Preserve node label: For each u ∈ Q, label(u) matches label(f(u)).

(ii) Preserve parent-child/ancestor-descendant relationships: If (u, v) is a c-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a d-edge in Q, then f(v) is a descendant of f(u) in T.

[Figure: a tree pattern Q with nodes q1, q2, q3 (labels a, b, c) and a document tree T with nodes v1, ..., v6.]
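As a minimal sketch of Definition 1, the following Python checks whether a given mapping f is an embedding. The tree representation (a child-to-parent dictionary), the edge kinds "c" and "d", the use of plain equality for label matching, and the small example trees are assumptions made for illustration; they are not taken from the slides.

```python
def ancestors(parent, node):
    """Yield the proper ancestors of node, following a child -> parent map."""
    while node in parent:
        node = parent[node]
        yield node


def is_embedding(q_edges, q_label, t_parent, t_label, f):
    """Check conditions (i) and (ii) of Definition 1 for a mapping f.

    q_edges  : list of (u, v, kind), kind "c" (parent-child) or
               "d" (ancestor-descendant); a hypothetical representation.
    q_label  : dict query node -> label.
    t_parent : dict child -> parent for the document tree T.
    t_label  : dict document node -> label.
    f        : dict query node -> document node.
    """
    # (i) Labels are preserved (plain equality is assumed here).
    if any(q_label[u] != t_label[f[u]] for u in q_label):
        return False
    # (ii) Every query edge is preserved.
    for u, v, kind in q_edges:
        if kind == "c":
            if t_parent.get(f[v]) != f[u]:
                return False
        elif f[u] not in ancestors(t_parent, f[v]):
            return False
    return True


# Hypothetical query: q3 (a) with a c-child q1 (b) and a d-child q2 (c).
q_label = {"q1": "b", "q2": "c", "q3": "a"}
q_edges = [("q3", "q1", "c"), ("q3", "q2", "d")]
# Hypothetical document: v1 (a) has children v2 (b) and v3 (b); v3 has child v4 (c).
t_label = {"v1": "a", "v2": "b", "v3": "b", "v4": "c"}
t_parent = {"v2": "v1", "v3": "v1", "v4": "v3"}

good = {"q3": "v1", "q1": "v2", "q2": "v4"}
bad = {"q3": "v1", "q1": "v4", "q2": "v4"}  # label b does not match label c

print(is_embedding(q_edges, q_label, t_parent, t_label, good))
print(is_embedding(q_edges, q_label, t_parent, t_label, bad))
```

Note that nothing in the check requires f to be injective or to keep sibling order, which is what makes the matching unordered.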

Algorithm for query evaluation: Tree encoding

Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.

<A>
  <B>
    <C></C>
    <B>
      <C></C>
      <C></C>
      <D></D>
    </B>
  </B>
  <B></B>
</A>

T:

node   label   (DocId, LeftPos, RightPos, LevelNum)
v1     A       (1, 1, 11, 1)
v2     B       (1, 2, 9, 2)
v3     C       (1, 3, 3, 3)
v4     B       (1, 4, 8, 3)
v5     C       (1, 5, 5, 4)
v6     C       (1, 6, 6, 4)
v7     D       (1, 7, 7, 4)
v8     B       (1, 10, 10, 2)

Tree encoding

With this encoding, the structural relationship between two nodes can be decided from their quadruples alone:

(i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.

(ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.

(iii) from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
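The encoding and the three tests can be sketched in a few lines of Python. The numbering rule used here (each start tag takes the next position, an end tag takes a new position only if the element has children) is an assumption chosen because it reproduces the quadruples of the example tree; the function names are hypothetical.

```python
import re


def encode(xml, doc_id=1):
    """Return [(label, (DocId, LeftPos, RightPos, LevelNum)), ...] in document order."""
    counter = 0
    stack, nodes = [], []
    for closing, name in re.findall(r"<(/?)(\w+)>", xml):
        if not closing:
            counter += 1
            if stack:
                stack[-1]["leaf"] = False  # the enclosing element has a child
            node = {"label": name, "left": counter,
                    "level": len(stack) + 1, "leaf": True}
            stack.append(node)
            nodes.append(node)
        else:
            node = stack.pop()
            if not node["leaf"]:
                counter += 1  # assumed rule: only non-leaf end tags advance
            node["right"] = counter
    return [(n["label"], (doc_id, n["left"], n["right"], n["level"]))
            for n in nodes]


def is_ancestor(a, b):
    """(i): a is an ancestor of b."""
    return a[0] == b[0] and a[1] < b[1] and a[2] > b[2]


def is_parent(a, b):
    """(ii): a is the parent of b."""
    return is_ancestor(a, b) and b[3] == a[3] + 1


def is_left_of(a, b):
    """(iii): a is to the left of b."""
    return a[0] == b[0] and a[2] < b[1]


doc = "<A><B><C></C><B><C></C><C></C><D></D></B></B><B></B></A>"
codes = encode(doc)
for label, quad in codes:
    print(label, quad)

v1, v2, v3, v4 = (quad for _, quad in codes[:4])
print(is_ancestor(v1, v4), is_parent(v2, v4), is_left_of(v3, v4))
```

Because each test is a handful of integer comparisons, no tree traversal is needed once the quadruples are stored.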

Algorithm for query evaluation: Tree encoding

Data streams, sorted by LeftPos values:

A: (1, 1, 11, 1)
B: (1, 2, 9, 2), (1, 4, 8, 3), (1, 10, 10, 2)
C: (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4)
D: (1, 7, 7, 4)

Algorithm for query evaluation: Algorithm description

• Our algorithm works bottom-up. Therefore, we need to sort XML streams by (DocId, RightPos) values.

• Each time a query Q is submitted to the system, we will associate each query node q with a data stream L(q) such that for each v ∈ L(q), label(v) = label(q). In this way, each query node is attached with a list of matching nodes of the document tree.

[Figure: the document tree T and the query tree Q. Q has root q1 (A) with q2 (B) and q5 (B) below it, q3 (C) below q2 and q4 (C) below q5; each query node q is annotated with its stream L(q).]

Data streams, sorted by RightPos values:

L(q1) = (1, 1, 11, 1), i.e. {v1}
L(q2) = L(q5) = (1, 4, 8, 3), (1, 2, 9, 2), (1, 10, 10, 2), i.e. {v4, v2, v8}
L(q3) = L(q4) = (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4), i.e. {v3, v5, v6}
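A minimal sketch of how the streams L(q) could be assembled from the encoded document nodes. The node table is copied from the example tree; the dictionary layout and the function name build_streams are assumptions for illustration only.

```python
from collections import defaultdict

# Encoded document nodes of the example tree: name -> (label, quadruple).
nodes = {
    "v1": ("A", (1, 1, 11, 1)),
    "v2": ("B", (1, 2, 9, 2)),
    "v3": ("C", (1, 3, 3, 3)),
    "v4": ("B", (1, 4, 8, 3)),
    "v5": ("C", (1, 5, 5, 4)),
    "v6": ("C", (1, 6, 6, 4)),
    "v7": ("D", (1, 7, 7, 4)),
    "v8": ("B", (1, 10, 10, 2)),
}

# Labels of the query nodes in the example query.
query_labels = {"q1": "A", "q2": "B", "q3": "C", "q4": "C", "q5": "B"}


def build_streams(nodes, query_labels):
    """Attach to each query node q the list of nodes v with label(v) = label(q)."""
    by_label = defaultdict(list)
    for name, (label, quad) in nodes.items():
        by_label[label].append((name, quad))
    # Bottom-up processing: order every stream by (DocId, RightPos).
    for stream in by_label.values():
        stream.sort(key=lambda item: (item[1][0], item[1][2]))
    # Query nodes with the same label share the same stream.
    return {q: by_label.get(label, []) for q, label in query_labels.items()}


streams = build_streams(nodes, query_labels)
for q in sorted(streams):
    print(q, [name for name, _ in streams[q]])
```

Sorting by RightPos makes every node appear after all of its descendants, which is the order a bottom-up scan needs.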

Algorithm for query evaluation: Algorithm description

1. S(v): a set of query nodes q associated with document tree node v such that Q[q] can be embedded in T[v].

2. δ(q): a variable associated with query node q.

• During the process, δ(q) will be dynamically assigned a series of values a0, a1, ..., am for some m in sequence, where a0 = φ and the ai's (i = 1, ..., m) are different nodes of T.

• Initially, δ(q) is set to a0 = φ. δ(q) will be changed from ai-1 to ai = v (i = 1, ..., m) when the following conditions are satisfied:
i) v is the node currently encountered.
ii) q appears in S(u) for some child node u of v.
iii) q is a d-child; or q is a c-child, and u is a c-child with label(u) = label(q).

[Figure: the two cases for a query node q with parent q’. d-child: S(u) = {…, q, …}. c-child: S(u) = {…, q, …}, label(u) = label(q), and u must be a direct child of v. In both cases δ(q) is changed from ai-1 to ai = v.]

3. Subtree embedding by using δ(q)

If δ(q1) = δ(q2) = … = δ(ql) = v for the children q1, ..., ql of a query node q’, and label(v) = label(q’), then T[v] embeds Q[q’].

Algorithm for query evaluation: Algorithm description

4. Construction of S(v)

[Figure: the query tree Q with its nodes numbered in postorder: q3 (C) = 1, q2 (B) = 2, q4 (C) = 3, q5 (B) = 4, q1 (A) = 5.]

• The nodes of Q are numbered in postorder, so the nodes in Q will be referenced by their postorder numbers.

• For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(v) = label(q). (Since Q is explored bottom-up, the query nodes in S(v) must be increasingly sorted.)

• For a non-leaf node v with children v1, …, vk, S(v) = S(v1) ∪ … ∪ S(vk) ∪ {all nodes q with T[v] embedding Q[q]}.

Since S(v) is sorted, ‘union’ can be implemented using the ‘merge’ operation.
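The construction of S(v) can be illustrated with a simplified recursive sketch. It is not the stream-based algorithm of the slides and makes no claim to its time bound: it walks an in-memory tree, keeps a second list (here called rooted) of the query nodes embedded exactly at v, uses plain label equality, and assumes that every edge of the example query is a c-edge. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    label: str
    children: list = field(default_factory=list)  # list of TNode


@dataclass
class QNode:
    label: str
    children: list = field(default_factory=list)  # list of (edge, QNode), edge "c" or "d"


def number_postorder(q, out):
    """Collect the query nodes in postorder and store each node's number in q.num."""
    for _, child in q.children:
        number_postorder(child, out)
    out.append(q)
    q.num = len(out)
    return out


def merge(a, b):
    """Union of two increasingly sorted lists by a linear merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]


def compute(v, qnodes):
    """Return (rooted, S) for document node v as sorted lists of postorder numbers."""
    results = [compute(c, qnodes) for c in v.children]
    s = []
    for _, child_s in results:
        s = merge(s, child_s)          # S(v1) ∪ … ∪ S(vk)
    rooted = []
    for q in qnodes:                   # postorder, so rooted stays sorted
        if q.label != v.label:
            continue
        # c-child: embedded exactly at a child of v; d-child: anywhere below v.
        if all(any(qc.num in (r if edge == "c" else cs) for r, cs in results)
               for edge, qc in q.children):
            rooted.append(q.num)
    return rooted, merge(s, rooted)


# Example document tree T from the tree-encoding slides.
t = TNode("A", [TNode("B", [TNode("C"),
                            TNode("B", [TNode("C"), TNode("C"), TNode("D")])]),
                TNode("B")])
# Example query: q1 (A) over q2 (B) and q5 (B), each with a C child (c-edges assumed).
q = QNode("A", [("c", QNode("B", [("c", QNode("C"))])),
                ("c", QNode("B", [("c", QNode("C"))]))])
qnodes = number_postorder(q, [])
rooted, s = compute(t, qnodes)
print(rooted, s)
```

Keeping rooted separate from S(v) is a design choice of this sketch: it lets the c-edge test ask whether a child itself embeds the query subtree, rather than some deeper descendant with the same label.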


Strict unordered tree matching: Definition

Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:

(i) same as (i) in Definition 1.
(ii) same as (ii) in Definition 1.
(iii) For any two nodes v1 ∈ Q and v2 ∈ Q, if v1 and v2 are not related by an ancestor-descendant relationship, then f(v1) and f(v2) in T are not related by an ancestor-descendant relationship.

[Figure: the tree pattern Q (nodes q1, q2, q3) and the document tree T (nodes v1, ..., v6), with one mapping from Q into T highlighted.]

By Definition 2, this mapping is not allowed.
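Condition (iii) can be checked for a given mapping with a short Python sketch. The child-to-parent dictionaries and the two small document trees are hypothetical examples, not the trees of the figure; the sketch also treats two query nodes mapped to the same document node as unrelated, which is an assumption.

```python
from itertools import combinations


def ancestors(parent, node):
    """Yield the proper ancestors of node, following a child -> parent map."""
    while node in parent:
        node = parent[node]
        yield node


def related(parent, x, y):
    """True if x and y are in an ancestor-descendant relationship."""
    return x in ancestors(parent, y) or y in ancestors(parent, x)


def satisfies_condition_iii(q_parent, t_parent, f):
    """Check condition (iii) of Definition 2 for a mapping f (dict q -> v)."""
    for u, v in combinations(f, 2):
        if not related(q_parent, u, v) and related(t_parent, f[u], f[v]):
            return False
    return True


# Hypothetical query: q3 is the root, q1 and q2 are its children.
q_parent = {"q1": "q3", "q2": "q3"}
f = {"q3": "v1", "q1": "v3", "q2": "v2"}

# Hypothetical document 1: a chain v1 -> v2 -> v3, so f(q2) is an ancestor of f(q1).
chain = {"v2": "v1", "v3": "v2"}
# Hypothetical document 2: v2 and v3 are both children of v1.
siblings = {"v2": "v1", "v3": "v1"}

print(satisfies_condition_iii(q_parent, chain, f))
print(satisfies_condition_iii(q_parent, siblings, f))
```

The check looks at every pair of query nodes, so it is quadratic in |Q| for one candidate mapping; the cost of strict matching comes from the number of candidate mappings, not from this test.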

Strict unordered tree matching

Strict unordered tree matching requires exponential time.

[Figure: a query tree Q whose root a has children labelled b, c, d, e, and a document tree T whose root a has children including x, y, b, c, d, e, …]

Time complexity: O(|T||Q|d^k), where

d is the largest out-degree of the nodes in T, and
k is the largest out-degree of the nodes in Q.

Using the concept of hypergraphs, the time complexity can be slightly improved to O(|T||Q|2^k).

Algorithm for query evaluation: Experiments

• We conducted our experiments on a DELL desktop PC equipped with a Pentium(R) 4 CPU at 2.80GHz, 0.99GB RAM and a 20GB hard disk. The code was compiled using the Microsoft Visual C++ compiler version 6.0, running standalone.

• Tested methods
In the experiments, we have tested five methods:
- TwigStack (TS for short) [1],
- Twig2Stack (T2S for short) [2],
- Twig-List (TL for short) [3],
- One-Phase Holistic (OPH for short) [4],
- tree-embedding (discussed in this paper, TE for short).

[1] N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321.

[2] S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawal, and K.S. Candan, Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294.

[3] L. Qin, J.X. Yu, and B. Ding, TwigList: Make Twig Pattern Matching Fast, in Proc. 12th Int’l Conf. on Database Systems for Advanced Applications (DASFAA), Apr. 2007, pp. 850-862.

[4] Z. Jiang, C. Luo, W.-C. Hou, Q. Zhu, and D. Che, Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution, in Proc. 18th Int’l Conf. on Database and Expert Systems Applications (DEXA), Sept. 2007, pp. 87-97.

Algorithm for query evaluation: Experiments

• Theoretical computational complexities

methods      Query time                         Runtime space usage
TwigStack    O(|D||Q|)                          O(|D||Q|)
Twig2Stack   O(|D||Q|^2 + |subTwig-Results|)    O(|D||Q|)
TL           O(|D||Q|^2)                        O(|D||Q|)
OPH          O(|D||Q|^2)                        O(|D||Q|)
TE           O(|D||Q|)                          O(|D||Q|)

D: a largest data stream associated with a node q of Q such that for each v in the data stream we have label(v) = label(q).

Algorithm for query evaluation: Experiments

• Indexes

All the tested methods use XB-trees as their index structure.

Algorithm for query evaluation: Experiments

• Data sets

The data sets used for the tests are the DBLP data set [30] and a synthetic XMark data set [35]. The DBLP data set is a real data set with high similarity in structure. It is in fact a wide and shallow document.

                   DBLP      XMark
Data size          127 MB    113 MB
Number of nodes    3332k     1666k
Max/Avg. depth     6/2.9     12/5.5

Algorithm for query evaluation: Experiments

• Queries: altogether 20 queries, divided into 4 groups

Group I:

Q1: //inproceedings [author]//year [text() = ‘2004’]

Q2: //inproceedings [author and title]//year [text() = ‘2004’]

Q3: //inproceedings [author and title and .//pages]//year[text() = ‘2004’]

Q4: //inproceedings [author and title and .//pages and .//url]//year [text() = ‘2004’]

Q5: //articles [author and title and .//volume and .//pages and //url]//year [text() = ‘2004’]

Algorithm for query evaluation: Experiments

• Test results
For all the experiments, the buffer pool size was fixed at 2000 pages. The page size of 8KB was used. For each data set, all the tag names are stored in a single list and then each tag name is represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A LevelNum value takes only 1 byte.

Algorithm for query evaluation: Experiments

[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for queries Q1 to Q5.]

Algorithm for query evaluation: Experiments

Group II:

Q6: //inproceedings[author/* and ./*]/year

Q7: //inproceedings[author/* and title and ./*]/year

Q8: //inproceedings[author/* and title and .//pages and ./*]/year

Q9: //inproceedings[author/* and title and .//pages and .//url and ./*]/year

Q10: //articles[author/* and title and .//volume and .//pages and .//pages and .//url and ./*]/year

Algorithm for query evaluation: Experiments

[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for queries Q6 to Q10.]

Algorithm for query evaluation: Experiments

Group III:

Q11: /site//open_auction[.//seller/person]//date[text() = ‘10/23/2006’]

Q12: /site//open_auction[.//seller/person and .//bidder]//date[text() = ‘10/23/2006’]

Q13: /site//open_auction[.//seller/person and .//bidder/increase]//date[text() = ‘10/23/2006’]

Q14: /site//open_auction[.//seller/person and .//bidder/increase and .//initial]//date[text() = ‘10/23/2006’]

Q15: /site//open_auction[.//seller/person and .//bidder/increase and //initial and .//description]//date[text() = ‘10/23/2006’]

Algorithm for query evaluation: Experiments

[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for queries Q11 to Q15.]

Algorithm for query evaluation: Experiments

Group IV:

Q16: /site//open_auction[.//seller/person/* and ./*]/date

Q17: /site//open_auction[.//seller/person/* and .//bidder and ./*]/date

Q18: /site//open_auction[.//seller/person/* and .//bidder/increase]/date

Q19: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and ./*]/date

Q20: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and .//description and ./*]/date

Algorithm for query evaluation: Experiments

[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for queries Q16 to Q20.]

Algorithm for query evaluation: Experiments

• Test results

In the following figure, we compare the runtime memory usage of all the five tested approaches for the second group of queries. By the memory usage, we mean the intermediate data structures, not including data streams (concretely, path stacks for TwigStack; hierarchical stacks for Twig2Stack, TL and OPH; and QSs for ours).

[Figure: runtime space usage (axis from 0 to 10000) of TS, T2S, OPH, TL and TM for queries Q6 to Q10.]

Summary

• An efficient method for evaluating unordered tree pattern queries in XML document databases
- parent/child and ancestor/descendant relation
- O(|D||Q|) time and O(|D||Q|) space

• Strict unordered tree matching
- exponential time

• Experiments
- TreeBank database, DBLP and XMark documents
- I/O time and CPU processing time

Thank you.