Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Unordered Tree Matching and Strict Unordered Tree Matching: the

Evaluation of Tree Pattern Queries

Dr. Yangjun Chen

Dept. Applied Computer Science,

University of Winnipeg

515 Portage Ave.

Winnipeg, Manitoba, Canada R3B 2E9

Outline

Motivation Algorithm for unordered tree pattern

query evaluation- Tree encoding- Algorithm description

On the strict unordered tree matching

Experiment results Summary

Efficient method to evaluate XPath expression queries – XML query processing

XML documentsa tree pattern query

Motivation

Document:<Purchase>

<Seller><Name>dell</Name><Item>

<Manufacturer>IBM</Manufacturer><Name>part#1</Name><Item>

<Manufacturer>Intel</Manufacturer></Item>

</Item><Item>

<Name>Part#2</Name></Item><Location>Houston</Location>

</Seller><Buyer>

<Location>Winnipeg</Location><Name>Y-Chen</Name>

</Buyer></Purchase>

P

S B

I I L L NN

I

M

M N N

IBM Part#1 Part#2

Dell Houston Winnipeg Y-Chen

Intel

Motivation

Motivation

Document: Query – XPath expressions:

Q1: /Purchase[Seller[Loc=‘Boston’]]/Buyer[Loc = ‘New York’

Q2: /Purchase//Item[Manufacturer = ‘Intel’]

Purchase

Seller Buyer

Location Location

‘Houston’ ‘Winnipeg’

Buyer

Item

Manufacturer

‘Intel’

d-edge: ancestor-descendant relationship

c-edge: parent-childrelationship

P

S B

I I L L NN

I

M

M N N

IBM Part#1 Part#2

Dell Houston Winnipeg Y-Chen

Intel

Motivation

- XPath expression

a[b[c and .//d]]/b[c and e//d]

book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and

ln = ‘Knuth’]

a

b b

c d c e

d

title

Art of Programming

book

author

fn ln

KnuthDonald

<document><book>

<title>Art of Programming

</title><author>

<fn>Donald Knuth</fn>… …

XPath evaluation against XML documents

- Evaluation based on unordered tree matching:Definition An embedding of a tree pattern Q into an XML document T is a mapping f: Q T, from the nodes of Q to the nodes of T, which satisfies the following conditions:

(i) Preserve node label: For each u Q, label(u) matches label(f(u)).

(ii) Preserve parent-child/ancestor-descendant relationships: If u v in Q, then f(v) is a child of f(u) in T; if u v in Q, then f(v) is a descendant of f(u) in T.

Motivation XPath evaluation against XML documents

a

b c

Q: q3

q1 q2

a

b c

c b

b

T:

v1

v2

v3

v4 v5

v6

- Evaluation based on unordered tree matching:Definition 1 An embedding of a tree pattern Q into an XML document T is a mapping f: Q T, from the nodes of Q to the nodes of T, which satisfies the following conditions:

(i) Preserve node label: For each u Q, label(u) matches label(f(u)).

(ii) Preserve parent-child/ancestor-descendant relationships: If u v in Q, then f(v) is a child of f(u) in T; if u v in Q, then f(v) is a descendant of f(u) in T.

Motivation XPath evaluation against XML documents

a

b c

Q: q3

q1 q2

a

b c

c b

b

T:

v1

v2

v3

v4 v5

v6

Algorithm for query evaluation Tree encoding

Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.

<A>

<C></C>

<C></C><C></C><D></D>



</A>

(1, 1, 11, 1)

(1, 10, 10, 2)B v8

A v1

(1, 7, 7, 4)

(1, 6, 6, 4)

T:

(1, 5, 5, 4)

(1, 3, 3, 3)

B v2

v3 C B v4

D v7v5 C

(1, 4, 8, 3)

(1, 2, 9, 2)

v6 C

12

3

43

5

65

6

77

891010

11

Tree encoding

Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as (v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.

(i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.

(ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.

(iii)from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, r1 < l2.

A: (1, 1, 11, 1)

B: (1, 2, 9, 2) (1, 4, 8, 3), (1, 10, 10, 2)C: (1, 3, 3, 3) (1, 5, 5, 4), (1, 6, 6, 4)

D: (1, 7, 7, 4)

Data streams:

Algorithm for query evaluation Tree encoding

(1, 1, 11, 1)

(1, 10, 10, 2)B v8

A v1

(1, 7, 7, 4)

(1, 6, 6, 4)

T:

(1, 5, 5, 4)

(1, 3, 3, 3)

B v2

v3 C B v4

D v7v5 C

(1, 4, 8, 3)

(1, 2, 9, 2)

v6 C

sorted by LeftPos values

<A>

<C></C>

<C></C><C></C><D></D>



</A>

12

3

43

5

65

6

77

891010

11

Algorithm for query evaluation Algorithm description

• Our algorithm works bottom-up. Therefore, we need to sort XMLstreams by (DocID, RightPos) values.

• Each time a query Q is submitted to the system, we will associateeach query node q with a data stream L(q) such that foreach v L(q) label(v) = label(q), in which each query node isattached with a list of matching nodes of the document tree.

{v1}

A q1

q2 B B q5

q3 C C q4

L(q1)Q:

L(q2) L(q5)

L(q4)L(q3)

L(q1 ) = (1, 1, 11, 1) -

L(q2 ) = L(q5) = (1, 4, 8, 3), (1, 2, 9, 2) (1, 10, 10, 2) -

L(q3) = L(q4) = (1, 3, 3, 3) (1, 5, 5, 4), (1, 6, 6, 4) -

{v4, v2, v8}

{v3, v5, v6}

sorted by RightPos values

T: Q:


1. S(v) – a set of query nodes q associated with document tree nodev such that Q[q] can be embedded in T[v].

2. (q) – a variable associated with query node q.• During the process, (q) will be dynamically assigned a series

of values a0, a1, ..., am for some m in sequence, where a0 = and ai’s (i = 1, ..., m) are different nodes of T.

• Initially, (q) is set to a0 = . (q) will be changed from ai-1 to

ai = v (i = 1, ..., m) when the following conditions are satisfied.i) v is the node currently encountered. ii) q appears in S(u) for some child node u of v.iii) q is a d-child, or

q is a c-child, and u is a c-child with label(u) = label(q).

v S(v)T:

q (q)Q:


i) v is the node currently encountered. ii) q appears in S(u) for some child node u of v.iii) q is a d-child, or

q is a c-child, and u is a c-child with label(u) = label(q).

v

u

S(u) = {…, q, …}

q’

q

v

u

S(u) = {…, q, …}

q’

q

label(u) = label(q).

u must be a direct child of v.

(q) is changedfrom ai-1 to ai = v

d-child

c-child

v

(q1) = (q2) = … = (ql) = v

q’

q1


3. Subtree embedding by using (q)

ql

label(v) = label(q’}

T[v] embeds Q[q’].


4. Construction of S(v)

5

3

A q1

B q5

C q4

q2 B

q3 C

Q :

1 2

4

• The nodes of Q are numbered in postorder. So the nodes in Q will be referenced by their postorder numbers.

• For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(u) = label(q). (Since Q is explored bottom-up, the query nodes in S(v) must be increasingly sorted.)

• For a non-leaf node v with children v1, …, vk, S(v) S(v1) … S(vk) {all nodes q with T[v] embedding Q[q]}.

Since S(v) is sorted, ‘union’ can beimplemented using the ‘merge’ operation.

Strict unordered tree matching Definition

Definition 2 A strict embedding of a tree pattern Q into an XMLdocument T is a mapping f: Q T, from the nodes of Q to the nodesof T, which satisfies the following conditions:

(i) same as (i) in Definition 1.(ii) same as (ii) in Definition 1.(iii) For any two nodes v1 Q and v2 Q, if v1 and v2 are not

related by an ancestor-descendant relationship, then f(v1) andf(v2) in T are not related by an ancestor-descendant relationship.

a

b c

Q: q3

q1 q2

a

b c

c b

b

T:

v1

v2

v3

v4 v5

v6

Strict unordered tree matching Definition

Definition 2 A strict embedding of a tree pattern Q into an XMLdocument T is a mapping f: Q T, from the nodes of Q to the nodesof T, which satisfies the following conditions:

(i) same as (i) in Definition 1.(ii) same as (ii) in Definition 1.(iii) For any two nodes v1 Q and v2 Q, if v1 and v2 are not

related by an ancestor-descendant relationship, then f(v1) andf(v2) in T are not related by an ancestor-descendant relationship.

a

b c

Q: q3

q1 q2

a

b c

c b

b

T:

v1

v2

v3

v4 v5

v6

By Definition 2, this mappingis not allowed.

Strict unordered tree matching

Strict unordered tree matching needs the exponential time.

Q:

a

T:

b dc e

a

x dy bce…

O(|T||Q|dk)

d – the largest out-degree of the nodes in T.k – the largest out-degree of the nodes in Q.

Using the concept of hypergraphs, the time complexity can be slightly improvedto O(|T||Q|2k).

Algorithm for query evaluation Experiments

• We conducted our experiments on a DELL desktop PCequipped with Pentium(R) 4 CPU 2.80GHz, 0.99GB RAMand 20GB hard disk. The code was compiled usingMicrosoft Visual C++ compiler version 6.0, runningstandalone.

• Tested methodsIn the experiments, we have tested four methods:- TwigStack (TS for short),- Twig2Stack (T2S for short),- Twig-List (TL for short),- One-Phase Holistic (OPH for short),- tree-embedding (discussed in this paper, TE for short).

• Tested methodsIn the experiments, we have tested five methods:- TwigStack (TS for short) [1],- Twig2Stack (T2S for short) [2],- Twig-List (TL for short) [3],- One-Phase Holistic (OPH for short) [4],- tree-embedding (discussed in this paper, TE for short).


[1] N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321.

[2] S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawa, and K.S. Canda, Twig2Stack: Bottom-up Processing of General ized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294.

[3] Qin, L., Yu, J. X., and Ding, B., “TwigList: Make Twig Pattern Matching fast,” In Proc. 12th Int’l Conf. on Database Systems for Advanced Applications (DASFAA),pp. 850-862, Apr. 2007.

[4] Jiang, Z., Luo, C., Hou, W.-C., Zhu, Q., and Che, D., “Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution,” In Proc. The 18th Int’l Conf. on Database and Expert Systems Applications (DEXA), pp. 87-97, Sept. 2007.


• Theoretical computational complexities

methods Query time Runtime space usage

TwigStack O(|D||Q|) O(|D||Q|)

Twig2Stack O(|D||Q|2+|subTwig-Results|

O(|D||Q|)

TL O(|D||Q|2) O(|D||Q|)

OPH O(|D||Q|2) O(|D||Q|)

TE O(|D||Q|) O(|D||Q|)

D - a largest data stream associated with a node q of Q such that for each vin the data stream we have label(v) = label(q).


• Indexes

All the tested methods use XB-trees as their indexstructure.


• Data sets

The data sets used for the tests are DBLP data set [30] and a syntheticXMARK data set [35]. The DBLP data set is another real data set withhigh similarity in structure. It is in fact a wide and shallow document.

DBLP XMark

Data size 127 (MB) 113 (MB)

Number of nodes 3332k 1666k

Max/Avg. depth 6/2.9 12/5.5


• Queries – altogether 20 queries, divided into 4 groups

Q1: //inproceedings [author]//year [text() = ‘2004’]

Q2: //inproceedings [author and title]//year [text() = ‘2004’]

Q3: //inproceedings [author and title and .//pages]//year[text() = ‘2004’]

Q4: //inproceedings [author and title and .//pages and .//url]//year [text() = 2004’]

Q5: //articles [author and title and .//volume and .//pages and //url]//year [text() = ‘2004’]

Group I:


• Test resultsFor all the experiments, the buffer pool size was fixed at 2000 pages. Thepage size of 8KB was used. For each data set, all the tag names are storedin a single list and then each tag name is represented by its order numberin that list during the evaluation of queries. In our implementation, eachDocId occupies 4 bytes while a number in a Prüfer sequence, a LeftPos ora RightPos occupies 2 bytes. A levelNum value takes only 1 byte.

Q1 Q2 Q3 Q4 Q5

5

4

3

2

1

execution time (sec.)

TST2SOPHTLTM

+

++

+


*

*

*



Q6: //inproceedings[author/* and ./*]/year

Q7: //inproceedings[author/* and title and ./*]/year

Q8: //inproceedings[author/* and title and .//pages and ./*]/year

Q9: //inproceeding[author/* and title and .//pages and .//url and ./*]/year

Q10: //articles[author/* and title and .//volume and .//pages and .//pages and .//url and ./*]/year

Group II:

Q6 Q7 Q8 Q9 Q10

60

50

40

30

20


+

+


TST2SOPHTLTM

+*

**

**



Q11: /site//open_auction[.//seller/person]//date[text() = ‘10/23/2006’]

Q12: /site//open_auction[.//seller/person and .//bidder]//date[text() = ‘10/23/2006’]

Q13: /site//open_auction[.//seller/person and /./bidder/increase]//date[text() = ‘10/23/2006’]

Q14: /site//open_auction[.//seller/person and .//bidder/increase and .//initial]//date [text() = ‘10/23/ 2006’]

Q15: /site//open_auction[.//seller/person and .//bidder/increase and //initial and .//description]//date [text() = ‘10/23/2006’]

Group III:

Q11 Q12 Q13 Q14 Q15

5

4

3

2

1


++

+TST2SOPHTLTM

+*

**


* *



Q16: /site//open_auction[.//seller/person/* and ./*]/date

Q17: /site//open_auction[.//seller/person/* and .//bidder and ./*]/date

Q19: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and ./*]/date

Q20: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and .//description and ./*]/ date

Group IV:

Q18: /site//open_auction[.//seller/person/* and .//bidder/increase]/date

Q16 Q17 Q18 Q19 Q20

5

4

3

2

1


+TST2SOPHTLTM

+*

*

*

*

*



• Test results

0

2000

4000

6000

8000

10000

Q6 Q7 Q8 Q9 Q10

TS T2S OPH TL TM

Run t

ime s

pace

usa

ge

In the following figure, we compare the runtime memory usage of all thefive tested approaches for the second group of queries. By the memoryusage, we mean the intermediate data structures, not including datastream (concretely, path stacks for TwigStack; hierarchical stacks forTwig2Stack, TL and OPH; and QSs for ours.)

Summary

• An efficient method for evaluating unordered tree pattern queries in XML document databases- parent/child and ancestor/descendant relation- O(|D||Q|) time and O(|D||Q|) space

• Strict unordered tree matching- exponential time

• Experiments- TreeBank database, DBLP and XMark documents - I/O time and CPU processing time

Thank you.