A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML Represented by: Ai Mu Based on the paper written by Ning Zhang, Varun Kacholia, M.Tamer Ozsu.


Page 1: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Represented by: Ai Mu

Based on the paper written by Ning Zhang, Varun Kacholia, M.Tamer Ozsu.

Page 2: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Outline

• Introduction
• Preliminaries
• NoK pattern matching at the logical level
• Physical storage
• XML path queries at the physical level
• Experimental evaluation
• Conclusion

Page 3: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Introduction

The increasingly wide use of XML leads to:
• the need to store large volumes of data encoded in XML
• the need to query XML data more efficiently

Path expressions are the most natural way to query tree-structured data such as an XML tree.

Evaluating a path expression against an XML tree is tree pattern matching (TPM):
• a path expression: a pattern tree that specifies a set of constraints
• TPM problem: find the nodes in the XML tree that satisfy all the constraints

Page 4: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Existing evaluation approaches

Navigational approach
• traverse the tree structure
• test whether each tree node satisfies the constraints specified by the path expression

Join-based approach
• select, for each pattern tree node, a list of XML tree nodes that satisfy its node-associated constraints
• join the lists based on their structural relationships

However, neither approach adapts to streaming XML data.

Page 5: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

A Novel Approach

Define a special pattern tree and pattern matching:
• Next-of-Kin (NoK) pattern tree: a pattern tree in which nodes are connected by parent-child and following/preceding-sibling relationships only
• Next-of-Kin pattern matching, which speeds up the node selection step and reduces the join size in the second step

Design a novel, succinct physical storage scheme that supports efficient NoK query evaluation.

Page 6: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Preliminaries

Consider the bibliography XML:

<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the Unix</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publisher</publisher>
    <price>39.93</price>
  </book>
  <book year="1999">
    <title>The Economics of Technology</title>
    <editor><last>Gerbarg</last><first>Darcy</first><affiliation>CiTI</affiliation></editor>
    <publisher>Kluwer Academic Publisher</publisher>
    <price>129.95</price>
  </book>
</bib>

Page 7: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Subject tree (XML tree)

Level 1: a
Level 2: b b b b
Level 3: z e c i j | z e c i j | z e c c c i j | z e d i j
Level 4: f g under each c node and under the d node

Note: tag names are abbreviated as bib -> a, book -> b, @year -> z, author -> c, title -> e, publisher -> i, price -> j, first -> f, last -> g, editor -> d.

Page 8: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Pattern tree

Query: "find all books written by Stevens whose price is less than 100".

Path expression: //book[author/last="Stevens"][price<100]

Pattern tree: a graphical representation of the constraints specified in a path expression.

[Figure: pattern tree for the path expression above, with nodes root, book, author, price<100, and last="Stevens", connected by / and // edges]

Page 9: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

NoK pattern matching at the logical level

Next-of-Kin pattern tree: consists of edges whose labels are in {parent-child relationship, following-sibling relationship}.

Matching a NoK pattern tree against the subject tree takes two steps:
• locate the nodes in the subject tree at which to start pattern matching;
• perform NoK pattern matching from each starting node.

Page 10: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Locate the starting node

There are several options for locating the starting points:

• Naïve approach: traverse the whole subject tree in document order and try to match each node with the root of the NoK pattern tree.
• Index on tag names: if we have a B+ tree on tag names, an index lookup for the root of the NoK pattern tree generates all possible starting points.
• Index on data values: if there are value constraints in the NoK pattern tree (such as last="Stevens") and we have a B+ tree over all values in the XML document, we can use that value-based index to locate all nodes having the particular value and use them as the starting points.
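The three options can be compared on a toy tree. The sketch below is only illustrative: plain Python dicts stand in for the B+ trees, and the node layout and the names make_node, document_order, naive_starts and build_indexes are assumptions, not part of the original scheme.

```python
from collections import defaultdict

def make_node(tag, value=None, children=()):
    # Hypothetical in-memory node: tag name, optional text value, child list.
    return {"tag": tag, "value": value, "children": list(children)}

def document_order(root):
    """Yield the nodes of the tree in document (pre-)order."""
    stack = [root]
    while stack:
        n = stack.pop()
        yield n
        stack.extend(reversed(n["children"]))  # leftmost child is popped first

def naive_starts(root, tag):
    """Naive approach: scan every node and keep those matching the tag."""
    return [n for n in document_order(root) if n["tag"] == tag]

def build_indexes(root):
    """Tag-name and value indexes; dicts stand in for the B+ trees."""
    by_tag, by_value = defaultdict(list), defaultdict(list)
    for n in document_order(root):
        by_tag[n["tag"]].append(n)
        if n["value"] is not None:
            by_value[n["value"]].append(n)
    return by_tag, by_value

bib = make_node("bib", children=[
    make_node("book", children=[
        make_node("author", children=[make_node("last", "Stevens")]),
        make_node("price", "65.95"),
    ]),
    make_node("book", children=[
        make_node("author", children=[make_node("last", "Abiteboul")]),
        make_node("price", "39.93"),
    ]),
])
by_tag, by_value = build_indexes(bib)
```

Both naive_starts(bib, "book") and by_tag["book"] yield the same starting points; the index simply avoids the full scan at query time.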

Page 11: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Example

Consider the subject tree and the NoK pattern tree with abbreviated tag names: b[c/g="Stevens"][j<100].

Suppose the starting point snode is the first b node of the subject tree. It matches the pattern root proot and is appended to the result set R.

• Iterate over b's children to check whether they match any node in the set {c, j}.
• The third child of snode matches c, so a recursive call is invoked to match the NoK pattern tree c/g="Stevens" against the subject subtree rooted at snode/c.
• The recursive call returns True. The remaining children are then checked and eventually j is matched, leaving the set empty.
• The result R contains the starting point b.

[Figure: subtree rooted at the first b, with children z, e, c, i, j; c has children f and g]
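The walk-through above can be sketched as a small recursive function. This is a simplified, greedy version under stated assumptions: it handles parent-child edges only (no sibling-order constraints), and the dict-based node and pattern representations (make_node, make_pattern, nok_match) are hypothetical.

```python
def make_node(tag, value=None, children=()):
    # Hypothetical subject tree node.
    return {"tag": tag, "value": value, "children": list(children)}

def make_pattern(tag, test=None, children=()):
    # Hypothetical pattern node; `test` is an optional predicate on the value.
    return {"tag": tag, "test": test, "children": list(children)}

def nok_match(pnode, snode):
    """Return True if the pattern rooted at pnode matches at snode."""
    if pnode["tag"] != snode["tag"]:
        return False
    if pnode["test"] is not None:
        if snode["value"] is None or not pnode["test"](snode["value"]):
            return False
    remaining = list(pnode["children"])  # pattern children still unmatched
    for child in snode["children"]:
        if not remaining:
            break
        for p in remaining:
            if nok_match(p, child):      # recursive call on the subtree
                remaining.remove(p)
                break
    return not remaining                 # success once the set is empty

# Pattern b[c/g="Stevens"][j<100]
pattern = make_pattern("b", children=[
    make_pattern("c", children=[make_pattern("g", lambda v: v == "Stevens")]),
    make_pattern("j", lambda v: float(v) < 100),
])

first_book = make_node("b", children=[
    make_node("z", "1994"), make_node("e", "TCP/IP Illustrated"),
    make_node("c", children=[make_node("f", "W."), make_node("g", "Stevens")]),
    make_node("i", "Addison-Wesley"), make_node("j", "65.95"),
])
last_book = make_node("b", children=[
    make_node("z", "1999"), make_node("e", "The Economics of Technology"),
    make_node("d", children=[make_node("f", "Darcy"), make_node("g", "Gerbarg")]),
    make_node("i", "Kluwer Academic Publisher"), make_node("j", "129.95"),
])

R = [b for b in (first_book, last_book) if nok_match(pattern, b)]
```

Here R ends up containing only the first book, mirroring the example: c and j are both matched under the first b, while the last book has no c child.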

Page 12: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Physical storage

Desiderata for designing the physical storage scheme:

• Structural information should be stored separately from the value information.
• The subject tree should be "materialized" to fit into the paged I/O model.
• The storage scheme should have enough auxiliary information (e.g. indexes on values and tag names) to speed up NoK pattern matching.
• The storage scheme should be adaptable to support updates.

Page 13: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Value information storage

Two observations suggest that value information and structural information should be stored separately:

• An XML document is a mixture of structural information and value information.
• Any path query can be divided into two subqueries: pattern matching on the tree structure and selection based on values.

Example:
Path expression: //book[author/last="Stevens"][price<100]
Structural constraints: //book[author/last][price]
Value constraints: last="Stevens" and price<100

Separating structural and value information separates the different concerns so that each can be addressed appropriately:

• a B+ tree on the value information;
• a path index or tag name index on the structural information.

Page 14: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Value information storage (cont.)

Maintain the connection between structural and value information:

• Use the Dewey ID as the key of each tree node to reconnect the two, e.g. the Dewey ID of root a is 0 and the Dewey ID of its second child b is 0.2.
• Given a Dewey ID, another B+ tree locates the value of the node in the data file.

[Figure: B+ tree (hashed value -> Dewey ID); B+ tree (Dewey ID -> pointer to value in data file); data file]
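A minimal sketch of Dewey ID assignment and of the hashed-value lookup follows. It assumes children are numbered from 1 (consistent with the second child of root 0 being 0.2); the dicts stand in for the B+ trees, and zlib.crc32 is only a placeholder for whatever hash function is actually used.

```python
import zlib
from collections import defaultdict

def make_node(tag, value=None, children=()):
    # Hypothetical in-memory node.
    return {"tag": tag, "value": value, "children": list(children)}

def assign_dewey(node, dewey="0", out=None):
    """Map Dewey ID -> node; the root is 0 and the i-th child appends '.i'."""
    if out is None:
        out = {}
    out[dewey] = node
    for i, child in enumerate(node["children"], start=1):
        assign_dewey(child, f"{dewey}.{i}", out)
    return out

def value_index(dewey_to_node):
    """Map hashed value -> Dewey IDs (stand-in for the first B+ tree)."""
    index = defaultdict(list)
    for dewey, n in dewey_to_node.items():
        if n["value"] is not None:
            index[zlib.crc32(n["value"].encode("utf-8"))].append(dewey)
    return index

a = make_node("a", children=[
    make_node("b", children=[make_node("g", "Stevens")]),
    make_node("b", children=[make_node("g", "Stevens")]),
])
ids = assign_dewey(a)
idx = value_index(ids)
stevens_nodes = idx[zlib.crc32(b"Stevens")]
```

Looking up the hash of "Stevens" returns the Dewey IDs of both g nodes, which can then be used as keys into the second index to reach the data file.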

Page 15: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Value information storage(cont)

In the data file, each element's content can be represented by a binary tuple (len, value),

e.g. (4, "1994"), (7, "Stevens"), (5, "69.95").

The Dewey ID B+ tree stores the position of these records in the data file.

If more than one node has the same value, keep only one copy and let all of these nodes point to the same position.
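The (len, value) layout with shared copies can be sketched as follows. The 2-byte big-endian length prefix, the UTF-8 encoding, and the function names are assumptions made for illustration; the slides do not specify the byte-level format.

```python
def build_data_file(values_by_dewey):
    """Return (data, positions).

    data      : bytes of the data file, a sequence of (len, value) records
    positions : Dewey ID -> byte offset of the node's record
    Equal values are stored once and shared by all nodes that carry them.
    """
    data = bytearray()
    offset_of_value = {}
    positions = {}
    for dewey, value in values_by_dewey.items():
        if value not in offset_of_value:
            raw = value.encode("utf-8")
            offset_of_value[value] = len(data)
            data += len(raw).to_bytes(2, "big") + raw  # assumed 2-byte length
        positions[dewey] = offset_of_value[value]
    return bytes(data), positions

def read_value(data, offset):
    """Decode the (len, value) record starting at offset."""
    length = int.from_bytes(data[offset:offset + 2], "big")
    return data[offset + 2:offset + 2 + length].decode("utf-8")

data, positions = build_data_file({
    "0.1.1": "1994",
    "0.1.3.1": "Stevens",
    "0.2.3.1": "Stevens",   # same value: shares the record above
})
```

The two Stevens nodes map to the same offset, so the data file holds only two records.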

Page 16: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Structural information storage

Store the nodes in pre-order and keep the tree structure by inserting pairs of parentheses.

E.g. (a(b)(c)) represents the tree that has a root a and two children b and c.
• "(" indicates the beginning of a subtree;
• ")" indicates the end of the subtree.

Each node implies an open parenthesis, so the open parentheses can be omitted: (a(b)(c)) becomes a b) c)).

Page 17: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

String representation

[Figure: the string representation of an XML tree, annotated with the depth of each node from the root]

Page 18: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Structural information storage(cont)

For each page, an extra tuple (st, lo, hi) is stored, where

• st: the level of the last node in the previous page,
• lo and hi: the minimum and maximum levels of all nodes in that page.

Page layout for structural info:

• Header: (st, lo, hi), next page
• String representation: a b z) e) c f ) g ) ) i) j ) ) b z ) e ) c f
• Space reserved for updates
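The header tuple can be computed in one left-to-right scan. The sketch below makes several assumptions: tags are single characters, pages hold a fixed number of characters, the root is at level 1, and st is taken as the nesting depth at the start of the page (which equals the level of the previous page's last node when that page ends on a node). The name page_headers is hypothetical.

```python
def page_headers(s, page_size):
    """Split the string s into pages and compute (st, lo, hi) for each.

    Returns a list of (page_content, (st, lo, hi)) pairs. lo and hi are None
    for a page that contains only closing parentheses.
    """
    pages = []
    depth = 0                      # number of currently open nodes
    for start in range(0, len(s), page_size):
        chunk = s[start:start + page_size]
        st = depth                 # depth carried over from the previous page
        levels = []
        for ch in chunk:
            if ch == ")":
                depth -= 1         # a subtree ends
            else:
                depth += 1         # a node opens at level depth
                levels.append(depth)
        lo = min(levels) if levels else None
        hi = max(levels) if levels else None
        pages.append((chunk, (st, lo, hi)))
    return pages

# The string from the page layout above, with spaces removed.
s = "abz)e)cf)g))i)j))bz)e)cf"
pages = page_headers(s, page_size=12)
```

With these assumptions the first page gets (0, 1, 4) and the second (2, 2, 4), so a reader looking for a level-2 sibling knows from lo that the second page can contain one.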

Page 19: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Advantages of the page layout

• The extra tuple (st, lo, hi) makes it possible to guess the page in which the following sibling or parent of a node is located.
• It is easy to insert nodes into the string representation of the tree.

E.g. to insert a b) c)) as a subtree of the first f node in page 1:

• Allocate a new page with the content a b) c));
• Cut-and-paste the content after f in page 1 to the end of the content of the new page;
• Insert the new page between page 1 and page 2;
• Update the tuple (st, lo, hi) information for page 1.

Before the insertion:
page 1: a b z) e) c f ) g))
page 2: i) j)) b z) e) d f

After the insertion:
page 1: a b z) e) c f
new page: a b) c)) ) g))
page 2: i) j)) b z) e) d f
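The cut-and-paste steps can be sketched on a list of page strings. This is a simplified illustration: pages are plain Python strings without spaces, the function name insert_subtree is hypothetical, and recomputing the (st, lo, hi) headers is left out.

```python
def insert_subtree(pages, page_no, offset, subtree):
    """Insert `subtree` as the first child of the node at pages[page_no][offset].

    A new page is allocated for the subtree, the content after the node is
    moved to the end of the new page, and the new page is linked in right
    after the original one. Returns a new list of pages.
    """
    page = pages[page_no]
    head, tail = page[:offset + 1], page[offset + 1:]
    new_page = subtree + tail
    return pages[:page_no] + [head, new_page] + pages[page_no + 1:]

before = ["abz)e)cf)g))", "i)j))bz)e)df"]
# The first f node sits at offset 7 of page 1 (index 0 here).
after = insert_subtree(before, 0, 7, "ab)c))")
```

Only page 1 and the new page are written; page 2 is untouched, which is the point of reserving the update at page granularity.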

Page 20: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

XML path queries at the physical level

In NoK pattern matching, the only operation on the subject tree is the iteration over the children of a specific node.

With the physical storage scheme, this operation is divided into:

• find the first child of a specific node
• find the following sibling of a node

Because the string representation is in pre-order, these two operations can be performed by looking at the node level information of each page from left to right, without reconstructing the tree structure.

Page 21: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Example

Find the first child of character b in the first page.

The first child of b must be the next character, provided that character is not ")".

If b is at level L, its first child is at level L+1.

Answer: the right neighbor z.

Page 22: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Example

Find b’s following sibling.

The following sibling must be located to the right of b in the string and its level must be the same as b’s.

Answer: b in page 2.
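Both operations can be sketched directly on the string representation. For brevity this version scans one concatenated string character by character and does not use the per-page (st, lo, hi) headers to skip pages; single-character tags and the function names are assumptions.

```python
def first_child(s, i):
    """Index of the first child of the node at s[i], or None if it is a leaf."""
    j = i + 1
    return j if j < len(s) and s[j] != ")" else None

def following_sibling(s, i):
    """Index of the following sibling of the node at s[i], or None."""
    depth = 0                          # open descendants of node i
    for j in range(i + 1, len(s)):
        if s[j] == ")":
            if depth == 0:             # this ")" closes node i itself
                nxt = j + 1
                # A tag right after it is the sibling; a ")" closes the parent.
                return nxt if nxt < len(s) and s[nxt] != ")" else None
            depth -= 1
        else:
            depth += 1
    return None

# Pages 1 and 2 of the running example, concatenated without spaces.
s = "abz)e)cf)g))i)j))bz)e)cf"
child_of_b = first_child(s, 1)           # first b is at index 1
sibling_of_b = following_sibling(s, 1)
```

For the first b the first child is the z at index 2 and the following sibling is the b at index 17, which falls in the second page of the example.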

Page 23: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Experimental Setting

The queries are selected according to the following three properties of path expressions:

• Selectivity: a path expression returning a small number of results should be evaluated faster than one returning a large number.
• Topology: the shape of the pattern tree can be a single path or bushy.
• Value constraints: the existence of value constraints and an index on values can be used to quickly locate the starting point for NoK pattern matching.

Page 24: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Performance

Page 25: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Conclusion

• Defined a special type of pattern tree: the NoK pattern tree.
• Proposed a novel approach for efficiently evaluating path expressions by NoK pattern matching.
• NoK pattern matching can be evaluated efficiently using the physical storage scheme.
• Performance evaluation has shown that this system performs better than, or comparably to, existing systems.

Page 26: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Limitation

More optimization on the locating step of NoK pattern tree matching process.

Use path index instead of tag-name index.

Consider how to employ concurrency control and how it affects the update process.

Page 27: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Reference

• Ning Zhang, Varun Kacholia, M. Tamer Ozsu. A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML.
• D. Chamberlin, P. Fankhauser, M. Marchiori, and J. Robie. XML Query Use Case. Available at http://www.w3.org/TR/xmlquery-use-case.
• E. Cohen, H. Kaplan, S. Padmanabhan, and R. Bordawekar. Labeling Your XML. Preliminary version presented at CASCON'02, October 2002.
• N. Zhang and M. T. Ozsu. Optimizing Correlated Path Expressions in XML Languages. Technical Report CS-2002-36, University of Waterloo, November 2002. Available at http://db.uwaterloo.ca/~ddbms/publications/xml/TR-CS-2002-36.pdf.

Page 28: A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML

Thank You! Questions?