<AUTHORS> <NAME ID=1>QUANZHONG LI</NAME> <NAME ID=2>BONGKI MOON</NAME> <AUTHORS> <TITLE> Indexing & Querying XML Data for ../Regular Path Expressions/* </TITLE> <PRESENTERS> <NAME UFID=1234567>SUNDAR</NAME> <NAME UFID=7654321>SUPRIYA</NAME> <PRESENTERS>

Indexing & Querying XML Data for ../Regular Path Expressions/*


Page 1: Indexing & Querying XML Data for  ../Regular Path Expressions/*

<AUTHORS> <NAME ID=1>QUANZHONG LI</NAME> <NAME ID=2>BONGKI MOON</NAME><AUTHORS>

<TITLE>Indexing & Querying XML Data for

../Regular Path Expressions/*</TITLE>

<PRESENTERS> <NAME UFID=1234567>SUNDAR</NAME> <NAME UFID=7654321>SUPRIYA</NAME><PRESENTERS>

Page 2: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Need for this paper

XML – emerged as a popular standard for data representation and data exchange on the Internet

XML Query Languages use Regular Path Expressions to query the data

Conventional approaches to indexing & searching this data, based on tree traversals, go for a toss under heavy access requests!

Traversing the hierarchy of XML data becomes an overhead if the path lengths are long or unknown

What can be done???

Page 3: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Try our System and the Algorithms !!!

New system for indexing & storing XML data – XISS

New numbering scheme for elements and attributes
 - Quick in figuring out the ‘ancestor-descendant’ relationship

New index structures
 - Easier to find all elements and attributes with a given name string

Join algorithms for processing Reg-Path-Exp queries
 - EE-Join – to search paths from element to element
 - EA-Join – to find element-attribute pairs
 - KC-Join – to find the Kleene closure (*) on repeated paths or elements

Page 4: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Go XISS!!!

In general, XML data can be queried for a particular value (or) a structure

By Value: get me “document”; get me “element=‘node1’ ” or “attribute=10”

By Structure: get me parent and child elements/attributes for a given element

Components:
 - Index structures: element index, attribute index and structure index
 - Data Loader
 - Query Processor

Numbering Scheme first…..

Page 5: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Dietz vs. Li-Moon…

Dietz says, “If x and y are nodes of a tree T, x is an ancestor of y iff x comes before y when I climb down the tree (pre-order), and after y when I climb up (post-order)” and shows us his scheme.

Ancestor-descendant relationship is determined in constant time

Li-Moon says, “but this lacks flexibility”

It leads to many re-computations when a new node is inserted.

Hmm… let us check out Li-Moon’s….
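A minimal sketch of the pre-order/post-order test described on this slide. The `children` dictionary, the node names and the helper names are made up for illustration; they are not from the paper.

```python
def number_pre_post(children, root):
    """Return (pre, post) rank dictionaries for the tree rooted at `root`.

    `children` maps a node name to the list of its child names
    (a hypothetical input format chosen for this sketch).
    """
    pre, post = {}, {}

    def visit(node):
        pre[node] = len(pre) + 1          # rank on the way down
        for child in children.get(node, []):
            visit(child)
        post[node] = len(post) + 1        # rank on the way back up

    visit(root)
    return pre, post


def is_ancestor(pre, post, x, y):
    """x is an ancestor of y iff x precedes y in pre-order and follows it in post-order."""
    return pre[x] < pre[y] and post[x] > post[y]


tree = {"book": ["chapter1", "chapter2"], "chapter1": ["figure1"]}
pre, post = number_pre_post(tree, "book")
```

The test itself is two comparisons, but inserting a node shifts the ranks of every node that follows it, which is the inflexibility Li-Moon point at.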

Page 6: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Li-Moon’s Numbering…

Hey folks, we are going to extend this preorder so that each node covers a range of descendants

Just associate a pair of numbers <order, size> with each node

Parent node x says to its child node y, “I came before you, so my order is less than yours & my (order + size) is >= (your order + your size), and so your interval is always contained in my interval”

If there are siblings x & y (same parent), say, x is before y, then order(x) + size(x) < order(y)
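One way to hand out <order, size> pairs that satisfy the parent and sibling rules, as a rough sketch. The `Node` class, the `slack` parameter and the "reserve a fixed number of spare slots per node" policy are assumptions for illustration; the slides do not say how much space XISS reserves.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0


def number_tree(node, order=1, slack=2):
    """Assign <order, size> in pre-order, leaving `slack` spare slots per node.

    The slack policy is hypothetical; any size large enough to cover the
    descendants keeps the rules on this slide true.
    """
    node.order = order
    next_free = order + 1
    for child in node.children:
        number_tree(child, next_free, slack)
        # The next sibling starts strictly after this child's interval.
        next_free = child.order + child.size + 1
    # Cover every descendant interval, plus reserved room for insertions.
    node.size = (next_free - 1 - order) + slack
    return node


book = Node("book", [
    Node("chapter", [Node("title"), Node("figure")]),
    Node("chapter", [Node("figure")]),
])
number_tree(book)
```

With `slack=2` the root ends up as <1, 17> and its two chapters as <2, 8> and <11, 5>, so each child interval sits inside its parent's and siblings do not overlap.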

Page 7: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Voila!

Here it goes,

So, for any node x, size(x) >= the sum of the sizes of all its direct children [ size(x) can be Laarrrge!]

That being said, “Given nodes x and y of a tree T, x is an ancestor of y iff

order(x) < order(y) <= order(x) + size(x)”
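The ancestor condition is a pair of integer comparisons. A small sketch, with made-up <order, size> values standing in for real nodes:

```python
def is_ancestor(x, y):
    """x and y are (order, size) pairs; True iff x is an ancestor of y."""
    x_order, x_size = x
    y_order, _ = y
    return x_order < y_order <= x_order + x_size


# Hypothetical numbering of a small document (values chosen for illustration).
book = (1, 17)
chapter1 = (2, 8)
figure1 = (6, 2)
chapter2 = (11, 5)
```

Here `is_ancestor(chapter1, figure1)` holds because 2 < 6 <= 10, while `is_ancestor(chapter2, figure1)` fails because 11 is not less than 6.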

Page 8: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Good news!

Easy accommodation of future insertions – more flexible

Global reordering is not necessary until the reserved space runs out

order in the <order, size> pair is a unique identifier for each element and attribute in the document

Attribute nodes are placed before their sibling elements in the order – why?

How does this scheme help? – wait till the algorithms!

Switching back to XISS…

Page 9: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Internals of XISS

Index Structure Overview

Page 10: Indexing & Querying XML Data for  ../Regular Path Expressions/*

More structures…

Element Index

Structure Index

Page 11: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Path Join Algorithms

Conventional approaches (top down, bottom up and hybrid traversals) – not effective

Main idea of the proposed algorithm: for a given query “chapter/-*/figure”,
 - find all ‘chapter’ elements
 - find all ‘figure’ elements
 - join the qualified ‘chapter-figure’ pairs

without traversing XML data trees (if the ancestor-descendant relationship is obtained quickly)
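The three steps can be sketched with a name index and the interval test. A plain dictionary stands in for the element index here, and the records and function names are invented for illustration; this is not XISS's actual storage layout.

```python
from collections import defaultdict

# Hypothetical records: (doc_id, name, order, size).
RECORDS = [
    (1, "book", 1, 17),
    (1, "chapter", 2, 8),
    (1, "figure", 6, 2),
    (1, "chapter", 11, 5),
    (1, "figure", 12, 2),
    (2, "figure", 3, 1),
]


def build_name_index(records):
    """Map each name to its sorted list of (doc_id, order, size) entries."""
    index = defaultdict(list)
    for doc_id, name, order, size in records:
        index[name].append((doc_id, order, size))
    for postings in index.values():
        postings.sort()
    return index


def ancestor_descendant_pairs(index, anc_name, desc_name):
    """Look up both names, then keep the pairs that pass the interval test."""
    return [
        (a, d)
        for a in index.get(anc_name, [])
        for d in index.get(desc_name, [])
        if a[0] == d[0] and a[1] < d[1] <= a[1] + a[2]
    ]


index = build_name_index(RECORDS)
pairs = ancestor_descendant_pairs(index, "chapter", "figure")
```

No tree is walked: both lists come straight from the index and the pairing only compares numbers.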

Page 12: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Complex -> Simple

A complex path expression is decomposed into many simple path expressions

Intermediate results are joined to get the final result.

Different types of sub-expressions

Page 13: Indexing & Querying XML Data for  ../Regular Path Expressions/*

EA-Join Algorithm

To join intermediate results from sub-expressions with a list of elements and a list of attributes

E.g. “figure[@caption=‘flowchart’]”

Attributes should be placed before sibling elements in the order by the numbering scheme

Page 14: Indexing & Querying XML Data for  ../Regular Path Expressions/*

EA-Join Algorithm

Input: List of “figure” elements and List of “caption” attributes grouped by documents

Steps: (2 stages)
 - Element sets and attribute sets are merged by doc. Id (single scan)
 - Elements and attributes are merged by figuring out the parent-child relationship using the <order> value (single scan)

Output: A set of (e, a) pairs where e is the parent of a
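A rough sketch of the single-scan merge. It assumes each attribute record carries the order of its parent element (`parent_order`); that field, the tuple layouts and the sample values are assumptions made for this example, since the slides do not show the record format. Comparing the (doc id, order) key in one pass covers both stages at once.

```python
def ea_join(elements, attributes):
    """Merge-join elements with their attributes in one pass.

    elements:   list of (doc_id, order), sorted.
    attributes: list of (doc_id, parent_order, order, value), sorted.
    `parent_order` is a hypothetical field assumed for this sketch.
    """
    pairs = []
    i = j = 0
    while i < len(elements) and j < len(attributes):
        e_key = elements[i][:2]       # (doc_id, order)
        a_key = attributes[j][:2]     # (doc_id, parent_order)
        if e_key < a_key:
            i += 1                    # this element has no matching attribute
        elif e_key > a_key:
            j += 1                    # this attribute belongs to another element
        else:
            pairs.append((elements[i], attributes[j]))
            j += 1
    return pairs


figures = [(1, 6), (1, 12), (2, 4)]
captions = [(1, 6, 7, "flowchart"), (2, 4, 5, "graph"), (2, 9, 10, "flowchart")]
pairs = ea_join(figures, captions)
flowcharts = [(e, a) for e, a in pairs if a[3] == "flowchart"]
```

Each list is read once, front to back, which is what makes the single scan possible.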

Page 15: Indexing & Querying XML Data for  ../Regular Path Expressions/*

EE-Join Algorithm

To join intermediate results, each of which is a list of elements from a sub-expression

E.g. “chapter/-*/figure”

Input: List of “chapter” elements and List of “figure” elements

Steps (2 stages) are similar to EA-Algorithm
 - Both element sets are merged by doc. Id (single scan)
 - Chapter element and Figure element are merged by finding the ancestor-descendant relationship using <order, size> values

Output: A set of (e, f) pairs where e is the ancestor of f

Page 16: Indexing & Querying XML Data for  ../Regular Path Expressions/*

EE-Algorithm

The second stage cannot be done in a single scan

In this E.g., a “figure” element can be a descendant of more than one “chapter” element (see book1.xml)

order(figure) will lie in more than one chapter interval ([order(chapter), order(chapter) + size(chapter)])

This multiple-times scan is still highly effective in searching long or unknown length paths when compared to the conventional tree traversals.
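A sketch of the two stages, under assumptions: records are (doc_id, order, size) tuples sorted by document and order, and a dictionary grouping stands in for the merge by doc. Id. The sample values are invented, with one chapter nested inside another so that a figure matches twice.

```python
from itertools import groupby


def ee_join(ancestors, descendants):
    """Pair each ancestor with the descendants inside its interval.

    Both inputs are lists of (doc_id, order, size), sorted by (doc_id, order).
    """
    # Stage 1: line the two lists up document by document.
    desc_by_doc = {doc: list(group)
                   for doc, group in groupby(descendants, key=lambda r: r[0])}
    pairs = []
    for doc, group in groupby(ancestors, key=lambda r: r[0]):
        candidates = desc_by_doc.get(doc, [])
        # Stage 2: the descendant list is rescanned for every ancestor,
        # because one descendant may fall inside several nested intervals.
        for anc in group:
            low, high = anc[1], anc[1] + anc[2]
            for desc in candidates:
                if desc[1] > high:
                    break             # sorted by order, so nothing further matches
                if low < desc[1]:
                    pairs.append((anc, desc))
    return pairs


chapters = [(1, 2, 20), (1, 5, 8)]             # second chapter nested in the first
figures = [(1, 7, 1), (1, 15, 1), (2, 3, 1)]
pairs = ee_join(chapters, figures)
```

The figure at order 7 is paired with both chapters, which is why one pass over the figure list is not enough.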

Page 17: Indexing & Querying XML Data for  ../Regular Path Expressions/*

KC-Algorithm

Processes a regular path expression with zero, one or more occurrences of a subexpression

E.g. “chapter*”, “chapter+”

Input: Set of elements from an XML document

Steps:
 - In each stage applies EE-Algorithm to previous stage’s result
 - Repeat until no change in result

Output: Kleene Closure of all elements in the given input set
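The "repeat until no change" loop can be sketched as a fixed-point iteration. Several things here are assumptions: records are (order, size, depth) tuples from a single document, the `depth` field is used to tell a parent from a more distant ancestor, and a plain nested loop stands in for the EE-Join step.

```python
def is_parent(x, y):
    """x, y are (order, size, depth); `depth` is an assumed field for this sketch."""
    return x[0] < y[0] <= x[0] + x[1] and y[2] == x[2] + 1


def kc_join(elements):
    """Return all (start, end) pairs linked by one or more parent-child steps."""
    # Single occurrence of the sub-expression: direct parent-child pairs.
    step = {(x, y) for x in elements for y in elements if is_parent(x, y)}
    closure = set(step)
    frontier = set(step)
    while frontier:
        # Extend the previous stage's result by one more step.
        frontier = {(a, d)
                    for (a, b) in frontier
                    for (c, d) in step
                    if b == c} - closure
        closure |= frontier           # stops once nothing new is produced
    return closure


a, b, c, d = (1, 10, 0), (2, 5, 1), (3, 2, 2), (9, 1, 1)
closure = kc_join([a, b, c, d])
```

Stage one finds the pairs (a, b), (a, d) and (b, c); stage two adds (a, c); stage three adds nothing, so the loop ends.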

Page 18: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Experiments..

A prototype of XISS was implemented
 - Query Interface – C++
 - Parse XML – Gnome XML Parser
 - B+-Tree – GiST C++ Library

Workstation: Sun UltraSPARC-II running Solaris 2.7
 - RAM: 256 MB
 - Hard disk: 20 GB

Data Sets
 - Shakespeare’s Plays
 - SIGMOD Record
 - NITF100 and NITF1

Page 19: Indexing & Querying XML Data for  ../Regular Path Expressions/*

Performance Comparison

EE-Join Query: Outperformed the bottom-up method by a wide margin
 - Real-world data set: an order of magnitude faster
 - Synthetic data set: 6 to 10 times faster
 - Disk I/O was the dominant cost factor – 60% to 90% of total elapsed time

EA-Join Query: It was comparatively better than the top-down and bottom-up approaches

KC-Join Query: Performance was not measured; dependent on EE’s performance

Page 20: Indexing & Querying XML Data for  ../Regular Path Expressions/*

THE END!

Hope this presentation was useful

THANKS!