1 Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching Ting Chen, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore Sept, 2004




2

Outline

XML Twig Pattern Matching
  Background
  State of the Art: TwigStack
  Sub-optimality of TwigStack
  Contributions of our Work
PPS: Prefix-Path-Streaming
  Notations
  Difference of Tag Streaming and PP Streaming
  PPS Stream Pruning
PPS Based Twig Pattern Matching
  Algorithm: PPS-TwigStack
  Analysis
Performance
Conclusion


3

XML Twig Pattern Matching
XML Data Model

An XML document is commonly modeled as a rooted, ordered and labeled tree.

E.g. document D1. Note that identifiers (e.g. b1) are given to tree nodes for easy reference.

[Figure: document tree D1, rooted at book (b1), with elements tagged preface, chapter, section, title, paragraph and figure; the nodes carry identifiers such as pf1, c1, c2, s1 to s3, t1, t2, p1 to p4 and f1 to f3.]


4

XML Twig Pattern Matching
Regional Coding [1]

Node label: (startPos: endPos, LevelNum)
startPos and endPos are calculated by performing a pre-order traversal of the document tree; LevelNum is the level of the node in the tree.

E.g. D1:

book (0:50, 1)
  preface (1:3, 2)
    paragraph (2:2, 3)
  chapter (4:22, 2)
    section (5:21, 3)
      title (6:6, 4)
      section (7:12, 4)
        title (8:8, 5)
        paragraph (9:11, 5)
          figure (10:10, 6)
      section (13:17, 4)
        paragraph (14:16, 5)
          figure (15:15, 6)
      paragraph (18:20, 4)
        figure (19:19, 5)
  chapter (23:45, 2)

1. M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
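The labeling above can be reproduced with a short depth-first traversal. The following is a minimal sketch, not the scheme of [1] itself: the nested-tuple tree format and the function names are hypothetical, and the rule that a leaf gets startPos equal to endPos is an assumption read off the example labels (e.g. paragraph (2:2, 3)). The sample tree omits the second chapter, whose content is not shown on the slide, so the root's endPos differs from the slide.

```python
def region_labels(tree, level=1, counter=None, out=None):
    """Return (tag, startPos, endPos, level) labels in document order.

    `tree` is a hypothetical nested tuple (tag, [children]).
    Assumption inferred from the slide: a leaf has startPos == endPos,
    an inner node takes one position on entry and one on exit.
    """
    if counter is None:
        counter, out = [0], []
    tag, children = tree
    start = counter[0]
    counter[0] += 1
    slot = len(out)
    out.append(None)  # reserve the slot so labels stay in document order
    for child in children:
        region_labels(child, level + 1, counter, out)
    if children:
        end = counter[0]
        counter[0] += 1
    else:
        end = start
    out[slot] = (tag, start, end, level)
    return out


def is_ancestor(a, d):
    # a encloses d: a's region strictly contains d's region
    return a[1] < d[1] and d[2] < a[2]


def is_parent(a, d):
    # parent-child additionally requires adjacent levels
    return is_ancestor(a, d) and a[3] + 1 == d[3]


# First chapter of D1 only (the second chapter is elided on the slide).
D1 = ("book", [
    ("preface", [("paragraph", [])]),
    ("chapter", [
        ("section", [
            ("title", []),
            ("section", [("title", []), ("paragraph", [("figure", [])])]),
            ("section", [("paragraph", [("figure", [])])]),
            ("paragraph", [("figure", [])]),
        ]),
    ]),
])

labels = region_labels(D1)
```

With these labels, structural relationships reduce to integer comparisons: an element is a descendant of another exactly when its region lies strictly inside the other's region.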


5

XML Twig Pattern Matching

What is a Twig Pattern?

A twig pattern is a small tree whose nodes are predicates (e.g. element type tests) and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.

E.g. the XPath query Q1 selects Figure elements which are descendants of some Paragraph elements which in turn are children of Section elements having at least one child element Title.

Q1: Section[Title]/Paragraph//Figure

[Figure: twig pattern Q1 with root Section, child nodes Title and Paragraph, and Figure below Paragraph.]


6

XML Twig Pattern Matching

Twig Pattern Matching Problem Statement

Given a query twig pattern Q and an XML database D that has index structures (e.g. a regional coding scheme) to identify the database nodes that satisfy each of Q’s node predicates, compute ALL the answers to Q in D.

E.g. the matches for twig pattern Section[Title]/Paragraph//Figure in the document D1 are:

(s1, t1, p4, f3)
(s2, t2, p2, f1)

[Figure: document tree D1.]
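To make the problem statement concrete, here is a brute-force sketch that enumerates the answers to Q1 on D1 by testing every combination of candidate elements. It is only an illustration of what must be computed, not one of the algorithms discussed in these slides. The mapping of identifiers (s1, t1, ...) to region labels is an assumption inferred from the regional coding example and the two matches listed above; all function names are hypothetical.

```python
from itertools import product

# (id, tag, startPos, endPos, level); the id-to-label mapping is an assumption
# inferred from the regional coding slide and the matches listed above.
ELEMENTS = [
    ("p1", "paragraph", 2, 2, 3),
    ("s1", "section", 5, 21, 3),
    ("t1", "title", 6, 6, 4),
    ("s2", "section", 7, 12, 4),
    ("t2", "title", 8, 8, 5),
    ("p2", "paragraph", 9, 11, 5),
    ("f1", "figure", 10, 10, 6),
    ("s3", "section", 13, 17, 4),
    ("p3", "paragraph", 14, 16, 5),
    ("f2", "figure", 15, 15, 6),
    ("p4", "paragraph", 18, 20, 4),
    ("f3", "figure", 19, 19, 5),
]


def by_tag(tag):
    return [e for e in ELEMENTS if e[1] == tag]


def is_ancestor(a, d):
    # A-D edge: a's region strictly contains d's region
    return a[2] < d[2] and d[3] < a[3]


def is_child(p, c):
    # P-C edge: containment plus adjacent levels
    return is_ancestor(p, c) and p[4] + 1 == c[4]


def match_q1():
    """Naive evaluation of Q1: Section[Title]/Paragraph//Figure."""
    result = []
    for s, t, p, f in product(by_tag("section"), by_tag("title"),
                              by_tag("paragraph"), by_tag("figure")):
        if is_child(s, t) and is_child(s, p) and is_ancestor(p, f):
            result.append((s[0], t[0], p[0], f[0]))
    return result


matches = match_q1()
```

The nested enumeration is exponential in the number of query nodes, which is the cost holistic algorithms such as TwigStack are designed to avoid.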


7

XML Twig Pattern Matching
TwigStack [2]: a Holistic Approach

Tag Streaming: all elements of tag q are grouped in a stream Tq, ordered by their startPos.
Optimal when all the edges in the twig pattern are A-D edges.
Two-phase algorithm:
  Phase 1, TwigJoin: a list of intermediate paths is output.
  Phase 2, Merge: the intermediate path list is merged to get the result.

2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.


8

XML Twig Pattern Matching
TwigStack Review

A node q in a twig pattern Q is coupled with a stack Sq.

An element e with tag E is pushed into its stack SE if and only if e is in some match to Q. This ensures that no redundant paths are output. E.g. only the color-highlighted elements are pushed into their stacks.

An element e is popped from its stack once all matches involving it have been reported. This ensures that the memory space used by the stacks is bounded.

Q: Section[//Title]//Paragraph//Figure

[Figure: stacks SSection, STitle, SParagraph and SFigure beside document tree D1, with the elements pushed into stacks highlighted.]


9

XML Twig Pattern Matching
Optimality of TwigStack for A-D edge only twig patterns

Each stream Tq is scanned only once (q must, of course, appear in the twig pattern).
No redundant intermediate results: all intermediate paths output in Phase 1 appear in the final result.
CPU and I/O cost: O(|Input| + |Output|)
Space complexity: O(|Longest Path in the XML tree|)

E.g. Q2: Section[//Title]//Paragraph//Figure on document D1

Phase 1 output:
<s1,t1>, <s1,t2>, <s2,t2>
<s1,p2,f1>, <s2,p2,f1>, <s1,p3,f2>, <s1,p4,f3>
Note that <s3,p3,f2> will not be output; although it matches one root-to-leaf path of Q2, it is not in a match to Q2.

Phase 2 (i.e. the final) output:
<s1,t1,p2,f1>, <s1,t1,p3,f2>, <s1,t1,p4,f3>, <s1,t2,p2,f1>, <s1,t2,p3,f2>, <s1,t2,p4,f3>, <s2,t2,p2,f1>
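The step from the Phase 1 output to the Phase 2 output in this example can be checked with a small join on the shared Section element. This is a minimal sketch using a dictionary-based join rather than the merge procedure of TwigStack [2]; the function name and the tuple representation of paths are hypothetical.

```python
def merge_paths(title_paths, figure_paths):
    """Join <section, title> paths with <section, paragraph, figure> paths
    on their common section element (a hash join, used here for brevity)."""
    titles_by_section = {}
    for s, t in title_paths:
        titles_by_section.setdefault(s, []).append(t)
    return [(s, t, p, f)
            for s, p, f in figure_paths
            for t in titles_by_section.get(s, [])]


# Phase 1 output for Q2 on D1, copied from the example above.
title_paths = [("s1", "t1"), ("s1", "t2"), ("s2", "t2")]
figure_paths = [("s1", "p2", "f1"), ("s2", "p2", "f1"),
                ("s1", "p3", "f2"), ("s1", "p4", "f3")]

final = merge_paths(title_paths, figure_paths)
```

Every Phase 1 path contributes to at least one of the seven final matches, which is what "no redundant intermediate results" means for this query.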


10

XML Twig Pattern Matching
TwigStack is not optimal for twig patterns with P-C edges

E.g. Q1: Section[Title]/Paragraph//Figure on document D1

Tsection: s1 s2 s3
Ttitle: t1 t2
Tpara: p1 p2 p3 p4
Tfigure: f1 f2 f3

• The only match involving s1 is <s1,t1,p4,f3>.
• To know that s1 is in a match, the stream Tpara must be advanced to p4, so p2 and p3 have to be read.
• p2 is in a match <s2,t2,p2,f1> to Q1 but p3 is not. However, since s2 and s3 have not been read yet, both p2 and p3 must be stored.
• Space complexity is no longer O(|Longest Path in the XML tree|).


11

XML Twig Pattern Matching
TwigStack is not optimal for twig patterns with P-C edges

Remove elements p4 and f3 from document D1 to get D2.
To answer Q1: Section[Title]/Paragraph//Figure on document D2, TwigStack will push s1 into stack Ssection because s1 is in a match to Section[//Title]//Paragraph//Figure (instead of Q1).
Consequently, a redundant intermediate path s1/t1 will be output.

[Figure: document trees D1 and D2, each with its matches to Q1 highlighted.]


12

XML Twig Pattern Matching
Why is TwigStack not optimal for twig patterns with P-C edges? Match Order

Q1: Section[Title]/Paragraph//Figure

Tsection: s1 s2 s3
Ttitle: t1 t2
Tpara: p1 p2 p3 p4
Tfigure: f1 f2 f3

• The only match involving s1 is m1 = <s1,t1,p4,f3>.
• The only match involving s2 is m2 = <s2,t2,p2,f1>.
• s1 is ahead of s2 in Tsection, but p4 is behind p2 in stream Tpara.
• m1 does not have all its elements ahead of the corresponding elements of m2 in their streams. Neither does m2.


13

XML Twig Pattern Matching
Why is TwigStack optimal for twig patterns with only A-D edges? Match Order

Q: Section[//Title]//Paragraph//Figure

Tsection: s1 s2 s3
Ttitle: t1 t2
Tpara: p1 p2 p3 p4
Tfigure: f1 f2 f3

• s1 is in a match m1 = <s1,t1,p2,f1> to Q.
• The only match involving s2 is m2 = <s2,t2,p2,f1>.
• Now no element of m1 is behind the corresponding element of m2 in its stream.
• It is possible to know whether s1 is in a match to Q without skipping any element of a match of s2 to Q.


14

XML Twig Pattern Matching

Tag Streaming cannot achieve optimality. How about other streaming methods?
Solution: divide the elements of the same tag into more than one stream, as the following diagram shows.
E.g. notice that now the match of s1 is not “blocked” by that of s2 any more.
Question: how can the division be done?

Q1: Section[Title]/Paragraph//Figure

Tsection: s1 s2 s3
Ttitle: t1 t2
Tpara: p1 p2 p3 p4
Tfigure: f1 f2 f3

[Figure: the same elements divided into smaller streams: s1 | s2 s3; t1 | t2; p4 | p2 p3; f3 | f1 f2.]


15

Our Contributions

Prefix Path Streaming: a new XML streaming scheme for XML twig pattern matching.
PPS reduces I/O cost by pruning irrelevant streams.
PPS allows optimal processing of a large class of twig patterns with both A-D and P-C edges.
Optimality is achieved by outputting, for this class of twig patterns, only intermediate paths (the Phase 1 output) which appear in the final results.
The class of twig patterns includes:
  A-D only twig patterns (with any number of branch nodes); and
  P-C only twig patterns (with any number of branch nodes); and
  1-branch-node twig patterns (which can have both A-D and P-C edges).


16

XML Twig Pattern Matching
Our contributions

[Figure: classes of XML twig patterns. Optimal for Tag Streaming: A-D edge only twig patterns. Optimal for PP Streaming: A-D edge only, P-C edge only and 1-branch-node twig patterns.]


17

Prefix Path Streaming

The prefix-path of an element in an XML document tree is the path from the document root to the element.
A prefix-path stream (PP stream) contains all elements in the document tree with the same prefix-path, ordered by their startPos numbers.

Example: for the sample XML document D1, there are 11 PP streams, compared with 7 streams generated by Tag Streaming.
For example, instead of a single stream Tsection in Tag Streaming, now we have two PP streams: Tbook/chapter/section and Tbook/chapter/section/section.
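The difference between the two clustering schemes can be sketched as two ways of grouping the same elements. The element list below is an assumption: it contains only the elements of D1 that carry identifiers in the examples, with prefix paths and startPos values inferred from the regional coding slide, so it does not reproduce the stream counts quoted above. The helper name is hypothetical.

```python
from collections import defaultdict

# (id, prefix path, startPos); inferred from the regional coding example.
D1_ELEMENTS = [
    ("p1", "book/preface/paragraph", 2),
    ("s1", "book/chapter/section", 5),
    ("t1", "book/chapter/section/title", 6),
    ("s2", "book/chapter/section/section", 7),
    ("t2", "book/chapter/section/section/title", 8),
    ("p2", "book/chapter/section/section/paragraph", 9),
    ("f1", "book/chapter/section/section/paragraph/figure", 10),
    ("s3", "book/chapter/section/section", 13),
    ("p3", "book/chapter/section/section/paragraph", 14),
    ("f2", "book/chapter/section/section/paragraph/figure", 15),
    ("p4", "book/chapter/section/paragraph", 18),
    ("f3", "book/chapter/section/paragraph/figure", 19),
]


def build_streams(elements, key):
    """Group element ids into streams, each ordered by startPos."""
    streams = defaultdict(list)
    for eid, path, start in sorted(elements, key=lambda e: e[2]):
        streams[key(path)].append(eid)
    return dict(streams)


# Tag Streaming: one stream per tag (the last step of the prefix path).
tag_streams = build_streams(D1_ELEMENTS, key=lambda path: path.rsplit("/", 1)[-1])
# PP Streaming: one stream per distinct prefix path.
pp_streams = build_streams(D1_ELEMENTS, key=lambda path: path)
```

Here `tag_streams["section"]` holds s1, s2 and s3 together, while under PP Streaming s1 sits alone in the stream for book/chapter/section and s2, s3 share the stream for book/chapter/section/section.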


18

Prefix Path Streaming
Notations

Twig Pattern
  Q: a twig pattern
  q: a node in a twig pattern
  Qq: the sub-twig of Q rooted at q
  We do not allow nodes with the same tag in a twig pattern; therefore we will use node q and tag q interchangeably according to the context.
Element
  e is used to refer to a node in an XML document tree; the capital letter E is used to denote the tag of e.
Element Stream
  T is used to denote an element stream.
  A Tag Stream is denoted by Tq, where q is the tag name of the elements in Tq.
  A PP Stream is denoted by Tprefix-path, where prefix-path is the common prefix path of the elements in Tprefix-path.
  A PP Stream is of class q if its elements are of tag q.
A match to a twig pattern is denoted by M.


19

Prefix Path Streaming

PP Streaming allows some degree of “parallel” access to sequential files containing elements with the same tag. More formally:

Definition 1 (Match Order)
Intuitively, a match MA is ahead of another match MB if no element of MA is placed behind any element of MB in the same stream (either tag stream or PP stream).

Note that the definition of Match Order can be applied to both Tag Streaming and PP Streaming.


20

Prefix Path Streaming

Match Order Example: Q1: Section[Title]/Paragraph//Figure
M1: <s1,t1,p4,f3> and M2: <s2,t2,p2,f1>

Tag Streaming
  Neither M1 ≤ M2 nor M2 ≤ M1, because of (s1, s2) and (p2, p4).
  Informally, we can think of M1 and M2 as “tangled”.

  Tsection: s1 s2 s3
  Ttitle: t1 t2
  Tpara: p1 p2 p3 p4
  Tfigure: f1 f2 f3

PP Streaming
  M1 ≤ M2 and M2 ≤ M1.
  M1 and M2 are not “tangled”.

  TB/C/S: s1
  TB/C/S/T: t1
  TB/C/S/S: s2 s3
  TB/C/S/S/T: t2
  TB/C/S/P: p4
  TB/C/S/S/P: p2 p3
  TB/C/S/P/F: f3
  TB/C/S/S/P/F: f1 f2
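The Match Order comparison of M1 and M2 can be checked mechanically. The sketch below follows the intuitive wording of Definition 1 (no element of one match behind an element of the other in the same stream); the element table, with startPos values inferred from the regional coding example, and the function names are assumptions made for illustration.

```python
# id -> (tag, prefix path, startPos); values inferred from the D1 examples.
ELEMENTS = {
    "s1": ("section", "B/C/S", 5),
    "s2": ("section", "B/C/S/S", 7),
    "t1": ("title", "B/C/S/T", 6),
    "t2": ("title", "B/C/S/S/T", 8),
    "p2": ("paragraph", "B/C/S/S/P", 9),
    "p4": ("paragraph", "B/C/S/P", 18),
    "f1": ("figure", "B/C/S/S/P/F", 10),
    "f3": ("figure", "B/C/S/P/F", 19),
}


def tag_stream(eid):
    return ELEMENTS[eid][0]  # Tag Streaming: the stream is named by the tag


def pp_stream(eid):
    return ELEMENTS[eid][1]  # PP Streaming: the stream is named by the prefix path


def is_ahead(ma, mb, stream_of):
    """True if no element of ma is placed behind an element of mb
    in the same stream (streams are ordered by startPos)."""
    return not any(
        stream_of(a) == stream_of(b) and ELEMENTS[a][2] > ELEMENTS[b][2]
        for a in ma for b in mb
    )


M1 = ("s1", "t1", "p4", "f3")
M2 = ("s2", "t2", "p2", "f1")
```

Under `tag_stream` neither match is ahead of the other, while under `pp_stream` the two matches never share a stream, so both orderings hold trivially.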


21

Properties of Prefix Path Streaming

Definition (Element Order)

Example: for the XML document and twig pattern Q1 on the previous slide,
Under Tag Streaming
  t1 ≤Q1 s2, but s2 ≤Q1 t1 does not hold because t2 is placed behind t1 in the stream TTitle.
  Neither s1 ≤Q1 s2 nor s2 ≤Q1 s1.
Under PP Streaming
  s1 ≤Q1 s2 and s2 ≤Q1 s1.

Intuitively, a ≤Q b means it is possible to know that a is in a match to QA without skipping any match involving b to QB.


22

Properties of Prefix Path Streaming

Why can Tag Streaming provide an optimal solution for A-D edge only twig patterns?

PP Streaming has a similar property for a larger class of twig patterns.

Informally, referring back to slide 20, Lemma 1 says that under Tag Streaming there is no problem of “tangle” for A-D only twig patterns;
Lemma 2 says that under PP Streaming there is no problem of “tangle” for all three classes of twig patterns.

Lemma 2 makes it possible to devise optimal algorithms for a larger class of twig patterns on PP streams.


23

Prefix Path Stream Pruning

Using the set of PP stream labels, we can prune away many PP streams without accessing the data.
E.g. for Q1: Section[Title]/Paragraph//Figure, the PP stream TBook/Preface/Paragraph in document D1 can be pruned.

For details, please refer to our paper [3].
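One simple way to see why a stream label alone can rule a stream out is to compare it with the root-to-node path of the corresponding query node. The sketch below is a simplified necessary condition written for illustration and is not the pruning procedure of [3]: it assumes the twig root may occur at any depth, encodes Q1 by hand as lists of (axis, tag) steps, and uses hypothetical function names.

```python
import re

# Root-to-node step lists for Q1 = Section[Title]/Paragraph//Figure.
# Assumption: the twig root may occur at any depth, hence the leading "//".
Q1_PATHS = {
    "section": [("//", "section")],
    "title": [("//", "section"), ("/", "title")],
    "paragraph": [("//", "section"), ("/", "paragraph")],
    "figure": [("//", "section"), ("/", "paragraph"), ("//", "figure")],
}


def path_regex(steps):
    """Translate a step list into a regex over '/'-separated prefix paths."""
    parts = []
    for i, (axis, tag) in enumerate(steps):
        sep = "" if i == 0 else "/"
        gap = "(?:[^/]+/)*" if axis == "//" else ""  # any intermediate tags
        parts.append(sep + gap + re.escape(tag))
    return re.compile("^" + "".join(parts) + "$")


def keep_stream(label):
    """Keep a PP stream only if its label can match the query node's path."""
    tag = label.rsplit("/", 1)[-1]
    if tag not in Q1_PATHS:
        return False  # no query node has this tag
    return path_regex(Q1_PATHS[tag]).match(label) is not None


pruned = not keep_stream("book/preface/paragraph")
```

The stream for book/preface/paragraph is dropped because no section occurs directly above the paragraph, which agrees with the example above; the check never reads any element data.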


24

PP Stream Based Twig Pattern Matching
PPS-TwigStack:

Similar to TwigStack, a stack Sq is coupled with each node q in the twig pattern.
Using stacks, an exponential number of matches can be encoded in linear space. For more details, see [2].

[Figure: twig pattern with nodes Section, Title, Paragraph and Figure, and the corresponding stacks SSection, STitle, SParagraph and SFigure.]


25

PP Stream Based Twig Pattern Matching

Overview
Similar to TwigStack, our algorithm operates in two phases (see slide 7).
Our goal is to reduce the number of intermediate paths output in Phase 1, Twig Join (see slide 7).
In each step getNext() finds a “potential” matching element e.
Depending on the current stack states, e is then either pushed into its stack SE or used only for updating the stack matching states (see TwigStack for details).
Phase 1 ends when all leaf streams (streams whose tag is a leaf node of Q) end.
Phase 2, “Merge”, of TwigStack remains unchanged.

Flow chart of the procedure of Phase 1 (input: streams TA, TB, ...):
  e := getNext()
  Pop the elements from stacks SE and Sparent(E) which are not ancestors of e.
  If e should be pushed into its stack*:
    YES: push e to its stack and output a path if e is a leaf element.
    NO: continue.
  Advance the stream e belongs to.
  If all leaf streams end:
    YES: END.
    NO: go back to e := getNext().

*: e is pushed into its stack SE if it has a parent or ancestor element in stack Sparent(E)


26

PP Stream Based Twig Pattern Matching

Which element will getNext() select among the remaining elements?
1. e of tag E is in a match to QE and e ≤Q e’ for each remaining element e’; and
2. e does not have a parent (ancestor)* element p of tag parent(E) such that p is in a match to Qparent(E).
Additionally, to correctly maintain the stack states, we require:
3. e has the smallest startPos among the elements which satisfy conditions (1) and (2) and have tag E or Sibling(E). Sibling(E) is a node which is a sibling node of E in Q.

Why?
(1) ensures that no element appearing in matches is skipped while finding e, and thus we don’t need to store any elements.
(2) ensures that an ancestor element in a match to Q will always be found before its descendant element.
(3) ensures that when an element is popped out of its stack, all of its matches have been output.

*: depending on the edge type of <E,parent(E)>


27

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

1st call: getNext() returns s1
• s1 is then pushed into SSection.
• Note that s1 is in a match to QSection, which is Q1 itself.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1.]


28

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

2nd call: getNext() returns t1
• t1 is then pushed into STitle.
• Path s1/t1 is output; since s1 is in a match to Q1, s1/t1 will be in the final match.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, t1.]


29

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

3rd call: getNext() returns s2
• s2 is then pushed into SSection.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, s2.]


30

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

4th call: getNext() returns t2
• t2 is then pushed into STitle.
• Path s2/t2 is output.
• Note that p4 is not selected: it satisfies all conditions except (3). Selecting p4 now would pop s2 prematurely.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, s2, t2.]


31

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

5th call: getNext() returns p2
• p2 is pushed into SParagraph.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, s2, p2.]


32

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

6th call: getNext() returns f1
• f1 is pushed into SFigure.
• Path s2/p2/f1 is output.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, s2, p2, f1.]


33

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

7th call: getNext() returns p3
• s3 is skipped because it is not in any match to QSection.
• p3 is selected but discarded because it doesn’t have a parent element in SSection.
• p2 and s2 are popped out because they are not ancestors of p3.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, s2, p2.]


34

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

8th call: getNext() returns f2
• f2 is then discarded because it does not have an ancestor element in SParagraph.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1.]


35

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

9th call: getNext() returns p4
• p4 is pushed into SParagraph.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, p4.]


36

PP Stream Based Twig Pattern Matching
Example: PPS-TwigStack

Q1: Section[Title]/Paragraph//Figure

10th call: getNext() returns f3
• f3 is pushed into SFigure.
• Path s1/p4/f3 is output.
• Matching ends because all streams end.

[Figure: the PP streams of slide 20 with their current cursor positions, and the four stacks; stack contents shown: s1, p4, f3.]
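The stack handling in the ten steps above can be replayed in a few lines. The sketch below does not implement getNext(); it takes the sequence of elements that getNext() returns in the walkthrough as given and applies only the pop, push and path-output rules of the Phase 1 flow chart. The region labels are inferred from the regional coding example, and the data layout and function names are hypothetical.

```python
# id -> (id, tag, startPos, endPos, level); labels inferred from the D1 example.
E = {e[0]: e for e in [
    ("s1", "section", 5, 21, 3), ("t1", "title", 6, 6, 4),
    ("s2", "section", 7, 12, 4), ("t2", "title", 8, 8, 5),
    ("p2", "paragraph", 9, 11, 5), ("f1", "figure", 10, 10, 6),
    ("p3", "paragraph", 14, 16, 5), ("f2", "figure", 15, 15, 6),
    ("p4", "paragraph", 18, 20, 4), ("f3", "figure", 19, 19, 5),
]}

# Q1 = Section[Title]/Paragraph//Figure
PARENT = {"section": None, "title": "section",
          "paragraph": "section", "figure": "paragraph"}
PC_EDGE = {"title": True, "paragraph": True, "figure": False}  # edge to parent
LEAVES = {"title", "figure"}


def is_ancestor(a, d):
    return a[2] < d[2] and d[3] < a[3]


def related(anc, e):
    """anc and e satisfy the query edge between parent(E) and E."""
    if not is_ancestor(anc, e):
        return False
    return anc[4] + 1 == e[4] if PC_EDGE[e[1]] else True


def paths_to(e, stacks):
    """Root-to-e paths that can be built from the current stack contents."""
    parent_tag = PARENT[e[1]]
    if parent_tag is None:
        return [[e[0]]]
    return [path + [e[0]]
            for a in stacks[parent_tag] if related(a, e)
            for path in paths_to(a, stacks)]


def replay(order):
    stacks = {tag: [] for tag in PARENT}
    output = []
    for eid in order:
        e = E[eid]
        tag, parent_tag = e[1], PARENT[e[1]]
        # Pop elements of SE and Sparent(E) that are not ancestors of e.
        for t in (tag, parent_tag):
            if t is not None:
                stacks[t] = [x for x in stacks[t] if is_ancestor(x, e)]
        # Push e if it is the root node or has a suitable element in Sparent(E).
        if parent_tag is None or any(related(a, e) for a in stacks[parent_tag]):
            stacks[tag].append(e)
            if tag in LEAVES:
                output.extend("/".join(p) for p in paths_to(e, stacks))
    return output


# Order in which getNext() returns elements in the walkthrough above.
paths = replay(["s1", "t1", "s2", "t2", "p2", "f1", "p3", "f2", "p4", "f3"])
```

The replay discards p3 and f2 and outputs exactly the four paths reported in the walkthrough, each of which belongs to one of the two final matches.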


37

PP Stream Based Twig Pattern Matching

Just like TwigStack for A-D only twig patterns, for the three classes of twig patterns PPS-TwigStack:
  pushes into stacks only elements which appear in the final results;
  outputs no redundant paths as intermediate results.

getNext() in PPS-TwigStack selects an element e of tag E such that it is in a match to sub-twig QE.
  getNext() is a recursive method, like its counterpart in TwigStack.
  It works by ensuring that, for each child node Ei of E in Q, there exists an element ei such that:
    ei is in a match to sub-twig QEi;
    e and ei satisfy the A-D or P-C relationship of edge <E,Ei>.

getNext() in PPS-TwigStack is more complex than that of TwigStack because the descendant elements of an element e can be scattered over several PP streams.
  For each child node Ei of E in Q, we consider all possible PP streams of class Ei to find the suitable child/descendant element ei of e.


38

PP Stream Based Twig Pattern Matching

Example: call stack of the 1st getNext(); element s1 is returned.
The current cursor position of each stream is denoted in parentheses, e.g. TB/C/S(s1).

getNext(Section): <section, {TB/C/S(s1), TB/C/S/S(s2)}>
  getNext(Title): <title, {TB/C/S/T(t1), TB/C/S/S/T(t2)}>
  getNext(Paragraph): <paragraph, {TB/C/S/S/P(p2), TB/C/S/P(p4)}>
    getNext(Figure): <figure, {TB/C/S/S/P/F(f1), TB/C/S/P/F(f3)}>


39

PP Stream Based Twig Pattern Matching

Analysis of PPS-TwigStack
  Performs a single forward scan of all useful PP streams.
  No redundant intermediate paths.
  The worst-case I/O complexity is linear in the sum of the sizes of the useful PP streams and the output list.
  CPU complexity is O((|no. of PP streams after pruning|) * (|Input list| + |Output list|)), which is also independent of the size of partial matches to any root-to-leaf path of the twig.
  The space complexity is the maximum length of a root-to-leaf path.

If a twig pattern is not in any of the three aforementioned classes, we can simply run TwigStack by simulating a single stream Tq of Tag Streaming using the multiple PP streams of class q.


40

Performance

The impact of Prefix-Path Streaming on XML document storage.
In most practical cases, the total number of PP streams in XML documents can be considered linear in the number of different tags.
Compared with the space used by the Tag Streaming approach, PP Streaming requires little extra space.
Fragmentation factor = space used by PPS / space used by Tag Streaming

Data Set        Size (MB)   No. of tags   No. of PP streams   Max no. of PP streams per tag   Fragmentation factor
XMark xmark1    50          77            536                 96                              1.08
XMark xmark2    130         77            548                 99                              1.06
XMark xmark3    500         77            548                 99                              1.02
DBLP            100         27            63                  6                               1.01


41

Performance

Effects of Prefix-Path stream pruning.
We measure the effect of PPS pruning on several twig queries in the XMark benchmark.
Pruning can reduce the I/O cost significantly.

Query 1: site/people/person/name
Query 2: site/open_auctions/open_auction/bidder/increase/
Query 3: site/closed_auctions/closed_auction/price
Query 4: site/regions//item
Query 5: site/people/person[//age and //income]/name

Query   No. of pages before pruning   No. of PP streams before pruning   No. of pages after pruning   No. of PP streams after pruning   Page pruning ratio
Q1      293                           17                                 60                           4                                 4.9
Q2      180                           6                                  68                           5                                 2.6
Q3      14                            4                                  14                           4                                 1
Q4      40                            10                                 11                           8                                 3.6
Q5      316                           19                                 84                           6                                 3.76


42

Performance

Performance Comparison of PPS-TwigStack and TwigStack
The running times of PPS-TwigStack (with pruning) and TwigStack on the five queries, on an XMark document of size 200 MB.
PPS-TwigStack outperforms TwigStack in all the queries except Q3, for which PPS pruning cannot find any irrelevant PP stream.

[Figure: bar chart of execution time (seconds, axis from 0 to 8) for Q1 to Q5, comparing PPS-TwigStack and TwigStack.]


43

Performance

Performance Comparison of PPS-TwigStack and TwigStack
Even when no PP stream can be pruned, PPS-TwigStack still outperforms TwigStack because it outputs fewer intermediate paths.
We synthesize XML documents which have many partial matches to branches of the twig pattern and in which no PP stream can be pruned.
We control and vary the ratio between the portions of the XML document that contain partial matches and those that contain full matches.
We run PPS-TwigStack and TwigStack on a 300 MB document with twig pattern A/B[C//D]//E. Notice that PPS-TwigStack is optimal for this query.

[Figure: execution time (seconds, axis from 0 to 100) of PPS-TwigStack and TwigStack against the portion of the document having matches (5% to 30%).]


44

Conclusion

We propose a novel clustering method Prefix-Path Streaming (PPS) for XML twig pattern matching.

PP Streaming can reduce I/O cost by avoiding scanning of irrelevant PP streams.

PP Streaming allows optimal processing of a large class of twig patterns with both A-D and P-C edges.

The work is published in [3].


45

References:

1. M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.

2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

3. T. Chen, T. W. Ling, and C. Y. Chan. Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching. In Proceedings of DEXA, 2004.