XML Data Management Ning Zhang University of Waterloo

Page 1: XML Data Management Ning Zhang University of Waterloo

XML Data Management

Ning Zhang

University of Waterloo

Page 2: XML Data Management Ning Zhang University of Waterloo

What is XML?

XML documents have elements and attributes
Elements (indicated by begin & end tags)
    can be nested but cannot interleave each other
    can have an arbitrary number of sub-elements
    can have free text as values

<chap title="Introduction To XML">
    some free text
    <sect title="What is XML?"> … </sect>
    <sect title="Elements"> … </sect>
    <sect title="Why XML?"> … </sect>
    … possibly more free text
</chap>

Callouts: <chap …> is the begin element, title is an attribute, </chap> is the end element. Elements with the same name can be nested.

Page 3: XML Data Management Ning Zhang University of Waterloo

Why XML?

Database Side: XML is a new way to organize data
    Relational databases organize data in tables
    XML documents organize data in ordered trees
Document Side: XML is a semantic markup language
    HTML focuses on presentation
    XML focuses on semantics/structure in the data

[Figure: ordered tree with root chap and nested sect nodes]

<html>
    <h1> Chapter 1… </h1>
    some free text
    <h2> Section 1… </h2>
    some more free text
    <h3> Section 1.1 </h3>
</html>

Page 4: XML Data Management Ning Zhang University of Waterloo

Data Management: Relational vs. XML

Relational data are well organized – fully structured (more strict):
    E-R modeling is used to model the data structures in the application;
    the E-R diagram is converted to relational tables and integrity constraints (relational schemas)
XML data are semi-structured (more flexible):
    Schemas may be unfixed or unknown (flexible – anyone can author a document);
    Suitable for data integration (data on the web, data exchange between different enterprises).

Page 5: XML Data Management Ning Zhang University of Waterloo

More about Relational vs. XML

XML is not meant to replace relational database systems
RDBMSs are well suited to OLTP applications (e.g., electronic banking), which have 1000+ small transactions per minute.
XML is suitable for data exchange between heterogeneous data sources (e.g., Web services), allowing them to “talk” to each other.

Page 6: XML Data Management Ning Zhang University of Waterloo

When should we use XML? (1)

Document representation language
    XML can be transformed to other formats (e.g., by XSLT)
        XML → HTML
        XML → LaTeX, BibTeX
        XML → PDF
    DocBook (standard schema for authoring documents/books)

Page 7: XML Data Management Ning Zhang University of Waterloo

When should we use XML? (2)

Data integration and exchange language
    Web services (SOAP, WSDL, UDDI)
        Amazon.com, eBay, Microsoft MapPoint, …
    Domain-specific data exchange schemas (>1000)
        legal document exchange language
        business information exchange
        …
    RSS XML news feed
        CNN, slashdot, blogs, …

Page 8: XML Data Management Ning Zhang University of Waterloo

When should we use XML? (3)

Any data having hierarchical structure
    Email
        Header – from, to, cc, bcc…
        Body – my message, replied email …
    Network log file
        IP address, time, request type, error code
Advantages of translating to XML
    Exploit high-level declarative XML query languages

Page 9: XML Data Management Ning Zhang University of Waterloo

XML Databases

Advantages:
    Manage large volumes of XML data
    Provide a high-level declarative language
    Efficiently evaluate complex queries
XML Data Management Issues:
    XML Data Model
    XML Query Languages
    XML Query Processing and Optimization

Page 10: XML Data Management Ning Zhang University of Waterloo

XML Data Model

Hierarchical data model
    An XML document is an ordered tree;
    Nodes in the tree are labeled with element names;
    Element nesting relationship corresponds to parent-child relationship.

[Figure: tree rooted at chap with @title “Introduction to XML”; its sect children carry @title attributes (e.g., “What is XML?”) and free text]

Page 11: XML Data Management Ning Zhang University of Waterloo

XML Schema Languages

Schema languages define the structure:
    Document Type Definition (DTD)
        Context-free grammar
        Structurally richer than relational schema definition language because of recursion.
    XML Schema
        Also context-free
        Richer than DTD because of data type definitions (integer, date, sequence).

Page 12: XML Data Management Ning Zhang University of Waterloo

XML Query Languages

XPath
    13 axes (navigation directions in the tree): child (/), descendant (//), following-sibling, following…
    NameTest, predicates
    E.g.,
        doc("bib.xml")//book[title="Harry Potter"]/ISBN
XQuery (superset of XPath)
    FLWOR expression
        for $x in doc("bib.xml")//book[title = "Harry Potter"]/ISBN,
            $y in doc("imdb.xml")//movie
        where $y//novel/ISBN = $x
        return $y//title
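The XPath example above can be tried on a small document. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath; the sample data in `BIB_XML` and the helper name `isbns_for_title` are made up for illustration and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; element names follow the query on the slide.
BIB_XML = """
<bib>
  <book><title>Harry Potter</title><ISBN>111</ISBN></book>
  <book><title>Other Book</title><ISBN>222</ISBN></book>
</bib>
"""

def isbns_for_title(xml_text, title):
    """Evaluate //book[title=...]/ISBN and return the text of each matching ISBN element."""
    root = ET.fromstring(xml_text)
    # './/book' is the descendant step; [title='...'] is the predicate on a child element.
    return [e.text for e in root.findall(f".//book[title='{title}']/ISBN")]

print(isbns_for_title(BIB_XML, "Harry Potter"))
```

The FLWOR query on the slide additionally joins two documents, which this XPath subset cannot express directly.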

Page 13: XML Data Management Ning Zhang University of Waterloo

Important Problems in XML Data Management

1. How to store XML data?
2. How to efficiently evaluate XPath/XQuery languages?
    • Efficient physical operators
    • Query optimization
3. How to support XML update languages?
4. How to support transaction management?
5. Recovery management?

What follows is not covered in COSC 3480!!

Page 14: XML Data Management Ning Zhang University of Waterloo

Agenda

1. XML Storage

2. XML Path Query Processing

3. XML Optimization

Page 15: XML Data Management Ning Zhang University of Waterloo

XML Storage

Extended Relational Storage
    Convert XML documents to relational tables
Native Storage
    Treat XML elements as first-class citizens
Hybrid of Relational and Native Storage
    XML documents can be stored in columns of relational tables (XML typed column)

Page 16: XML Data Management Ning Zhang University of Waterloo

Extended Relational Storage

Edge-based Storage Scheme (Florescu and Kossmann ’99)
    Each node has an ID
    Each tuple in the edge table consists of: (parentID, childID, type of data, reference to data)
    Pro: easy to convert XML to relational tables
    Con: path queries such as //a//b cannot be answered in SQL without a transitive closure operator
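A minimal sketch of the edge-based idea, written in Python rather than SQL. It keeps only (parentID, childID) plus a label per node and drops the "type of data" and "reference to data" columns, so it is a simplification and not the scheme from the cited paper. The function names are hypothetical. Answering //a//b requires following edges repeatedly, which is the transitive closure the slide refers to.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred_to_edges(xml_text):
    """Return (labels, edges): labels maps node ID -> tag, edges lists (parentID, childID)."""
    root = ET.fromstring(xml_text)
    ids = count(1)                     # node IDs are assigned in document order
    labels, edges = {}, []

    def visit(elem, parent_id):
        node_id = next(ids)
        labels[node_id] = elem.tag
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return labels, edges

def descendants_named(labels, edges, anc_tag, desc_tag):
    """Answer //anc_tag//desc_tag by following edges until no new node is reached."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    result, seen = set(), set()
    frontier = [n for n, tag in labels.items() if tag == anc_tag]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
                if labels[child] == desc_tag:
                    result.add(child)
    return sorted(result)

labels, edges = shred_to_edges("<a><x><b/></x><b/><c><a><b/></a></c></a>")
print(descendants_named(labels, edges, "a", "b"))
```

A single join on the edge table only covers one parent-child step; the loop over `frontier` is what a fixed number of joins cannot express for paths of unknown depth.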

Page 17: XML Data Management Ning Zhang University of Waterloo

Extended Relational Storage

Path-based Storage Scheme: XRel (Yoshikawa et al. ’01)
    Each node corresponds to a tuple in the table
    Each tuple keeps a rooted path to the node (e.g., /article/chap/sec/sec/@title)
    Pro: also easy to convert XML to tables
    Con: answering path queries, such as //a//b, needs expensive string pattern matching
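A minimal sketch of the path-based idea: every element becomes a (nodeID, rooted path) tuple, and //a//b becomes a string pattern evaluated against every stored path. The table layout and function names are assumptions for illustration. XRel's actual schema is richer, and attributes are ignored here.

```python
import re
import xml.etree.ElementTree as ET

def rooted_paths(xml_text):
    """Return a list of (nodeID, rooted path) tuples, one per element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        rows.append((len(rows) + 1, path))
        for child in elem:
            visit(child, path)

    visit(root, "")
    return rows

def match_descendant_path(rows, anc_tag, desc_tag):
    """Answer //anc_tag//desc_tag by matching a pattern against every stored path."""
    step = r"(/[^/]+)*"   # zero or more intermediate steps
    pattern = re.compile(
        rf"^{step}/{re.escape(anc_tag)}{step}/{re.escape(desc_tag)}$"
    )
    return [node_id for node_id, path in rows if pattern.match(path)]

rows = rooted_paths("<a><x><b/></x><b/><c><a><b/></a></c></a>")
print(match_descendant_path(rows, "a", "b"))
```

Every row is tested against the pattern, which illustrates why the slide calls this kind of matching expensive.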

Page 18: XML Data Management Ning Zhang University of Waterloo

Extended Relational Storage

Node-based Storage Scheme: Niagara, TIMBER etc. (Zhang et al. ’01)
    Each node is encoded with a “begin” and an “end” integer.
    Begin corresponds to the node's order in a pre-order traversal of the tree; end corresponds to its order in a post-order traversal.
    Pro: checking parent-child/ancestor-descendant relationships is efficient (constant time using begin and end)
    Con: inefficient for updating XML
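A minimal sketch of node labeling, assuming "begin" is the pre-order rank and "end" is the post-order rank of each node. That reading, and the function names, are assumptions of this example. Under it, node a is a proper ancestor of node d exactly when a is visited before d in pre-order and after d in post-order, which is a constant-time comparison.

```python
import xml.etree.ElementTree as ET

def label_nodes(xml_text):
    """Return a list of (tag, begin, end) in document order.

    begin = pre-order rank, end = post-order rank (both start at 1).
    """
    root = ET.fromstring(xml_text)
    labels = []
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        begin = counters["pre"]
        index = len(labels)
        labels.append(None)            # reserve the slot so output stays in document order
        for child in elem:
            visit(child)
        counters["post"] += 1
        labels[index] = (elem.tag, begin, counters["post"])

    visit(root)
    return labels

def is_ancestor(a, d):
    """True if a is a proper ancestor of d: a starts before d and finishes after d."""
    return a[1] < d[1] and a[2] > d[2]

a, b, c, d = label_nodes("<a><b><c/></b><d/></a>")
print(is_ancestor(a, c), is_ancestor(b, d))
```

Inserting a node shifts the ranks of every node after it, which is one way to see why updates are costly. A parent-child test typically also needs a depth value per node, which this sketch does not store.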

Page 19: XML Data Management Ning Zhang University of Waterloo

Native Storage

Subtree partition-based scheme: Natix (Kanne and Moerkotte ’00)
    A large XML tree is partitioned into small subtrees, each of which fits into one disk page
    Proxy and aggregate nodes are introduced to connect different subtrees
    Pro: easy to update and traverse
    Con: complex update algorithm; frequent deletion/addition may deteriorate page usage ratio

Page 20: XML Data Management Ning Zhang University of Waterloo

Native Storage

Binary tree-based scheme: Arb (Koch ’03)
    Convert a tree with an arbitrary number of children to a binary tree (first child translates to left child; next sibling translates to right child)
    Tree nodes are stored in document order
    Each node has 2 bits indicating whether it has a left & right child
    Pro: easy to do depth-first search (DFS) traversal
    Con: inefficient to do next_sibling navigation and hard to update
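A minimal sketch of the first-child/next-sibling encoding described above, with one record per node in document order and two flags per record. The record layout and function names are assumptions for illustration and not Arb's on-disk format. The `next_sibling` helper shows why sibling navigation is slow: it has to scan past the node's entire subtree.

```python
import xml.etree.ElementTree as ET

def encode_binary(xml_text):
    """Return records (tag, has_left, has_right) in document order.

    has_left  = the node has a first child  (left child in the binary tree)
    has_right = the node has a next sibling (right child in the binary tree)
    """
    root = ET.fromstring(xml_text)
    records = []

    def visit(elem, has_next_sibling):
        children = list(elem)
        records.append((elem.tag, bool(children), has_next_sibling))
        for i, child in enumerate(children):
            visit(child, i < len(children) - 1)

    visit(root, False)
    return records

def next_sibling(records, i):
    """Return the index of record i's next sibling, or None if it has none."""
    _, has_left, has_right = records[i]
    if not has_right:
        return None
    j = i + 1
    pending = 1 if has_left else 0     # binary-tree nodes still to be skipped
    while pending:
        _, left, right = records[j]
        pending += int(left) + int(right) - 1
        j += 1
    return j

records = encode_binary("<a><b><c/><e><f/></e></b><d/></a>")
print(records[next_sibling(records, 1)][0])
```

The depth-first order of the records makes a DFS traversal a plain sequential scan, which matches the "Pro" on the slide.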

Page 21: XML Data Management Ning Zhang University of Waterloo

Native Storage

String-based scheme: NoK (Zhang ’04)
    Convert a tree to a parenthesized string
        E.g., a having b and c as children is converted to ab)c)), by DFS of the tree with ‘)’ representing “end-of-subtree”
    Tree can be reconstructed from the string
    A long string can be cut into substrings that fit into disk pages
    Page header can contain simple statistics to expedite next_sibling navigation
    Pro: particularly optimized for DFS navigational evaluation plan
    Con: inefficient for breadth-first search (BFS)
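A minimal sketch of the string encoding and its inverse, reproducing the ab)c)) example from the slide. It assumes single-character labels and a tree written as (label, [children]). Both are simplifications chosen for this example, and paging and page-header statistics are left out.

```python
def encode(tree):
    """Encode (label, [children]) by DFS; ')' marks the end of each subtree."""
    label, children = tree
    return label + "".join(encode(child) for child in children) + ")"

def decode(s):
    """Rebuild the (label, [children]) tree from its parenthesized string."""
    stack = [("", [])]                 # sentinel parent that collects the root
    for ch in s:
        if ch == ")":
            node = stack.pop()         # subtree finished: attach it to its parent
            stack[-1][1].append(node)
        else:
            stack.append((ch, []))     # a new label opens a new subtree
    return stack[0][1][0]

tree = ("a", [("b", []), ("c", [])])
print(encode(tree))
```

Decoding needs only a stack and one left-to-right pass, which is why a DFS-style navigational plan fits this layout well.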

Page 22: XML Data Management Ning Zhang University of Waterloo

Hybrid of Relational and Native Storage

All major commercial RDBMS vendors (IBM, Oracle, Microsoft and Sybase) support an XML type in their RDBMS
    A table can have a column whose type is “XML”
    When inserting a tuple in the table, the XML field could be an XML document
    XML documents are stored natively

Page 23: XML Data Management Ning Zhang University of Waterloo

Hybrid of Relational and Native Storage

IBM DB2 UDB
    System RX – XML storage is similar to Natix
Microsoft SQL Server
    Uses BLOB (binary large object) to represent XML documents
Oracle
    Can use multiple formats:
        CLOB (character large object)
        Serialized object
        Shredded relational table

Page 24: XML Data Management Ning Zhang University of Waterloo

Agenda

1. XML Storage

2. XML Path Query Processing

3. XML Optimization

Page 25: XML Data Management Ning Zhang University of Waterloo

XML Path Processing

Extended Relational Approach
    Translate XML queries to SQL statements
Native Approach (may be based on extended relational storage)
    Join-based approach
    Navigational approach
    Hybrid approach

Page 26: XML Data Management Ning Zhang University of Waterloo

Extended Relational Query Processing

Regular expression based approach: XRel (Yoshikawa et al. ’01)
    Linear path expressions (without branches) are translated to regular expressions on strings (rooted paths)
    Use the “like” predicate in SQL to evaluate regular expressions
    Pro: easy to implement
    Con: cannot answer branching path queries

Page 27: XML Data Management Ning Zhang University of Waterloo

Extended Relational Query Processing

Dynamic Interval based approach: DI (DeHaan et al. ’03)
    Use the node labeling (begin,end) interval storage scheme
    Dynamically calculate (begin,end) intervals for resulting nodes given a path/FLWOR expression
    Pro: can handle all types of queries including FLWOR expressions
    Con: inefficient for answering complex path queries

Page 28: XML Data Management Ning Zhang University of Waterloo

Native Path Query Processing

Merge-Join based approach: Multi-predicate Merge Join (MPMGJN) algorithm (Zhang et al. ’01)
    Modify the merge join algorithm to reduce unnecessary comparisons
    Keep the position p of the last successful comparison in the right input stream
    The next item from the left input stream starts scanning from position p.
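A simplified sketch of a merge join with a saved scan position. It is written in the spirit of the description above and is not a faithful reproduction of MPMGJN. It assumes each node carries a (begin, end) interval drawn from a single counter that advances at every start tag and end tag, so an ancestor's interval strictly contains its descendants' intervals. That encoding, the sample intervals and the function name are assumptions of this example.

```python
def merge_join(ancestors, descendants):
    """Both inputs are lists of (begin, end) sorted by begin.

    Returns every (ancestor, descendant) pair whose intervals are nested.
    """
    pairs = []
    p = 0                              # saved scan position in the right (descendant) input
    for a in ancestors:
        # Descendants that start before a cannot match a or any later ancestor.
        while p < len(descendants) and descendants[p][0] < a[0]:
            p += 1
        q = p                          # rescan from p, not from the start of the list
        while q < len(descendants) and descendants[q][0] < a[1]:
            if descendants[q][1] < a[1]:
                pairs.append((a, descendants[q]))
            q += 1
    return pairs

# Intervals for <a><x><b/></x><b/><c><a><b/></a></c></a>, counting start and end tags.
a_nodes = [(1, 14), (9, 12)]
b_nodes = [(3, 4), (6, 7), (10, 11)]
print(merge_join(a_nodes, b_nodes))
```

Nested ancestors still cause part of the right input to be scanned more than once, since the inner loop restarts at `p` for each ancestor.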

Page 29: XML Data Management Ning Zhang University of Waterloo

Native Path Query Processing

Stack-based Structural Join (Wu et al. ’02)
    Improve the MPMGJN algorithm
    Do not look back but keep all ancestors in a stack
    When comparing the new item, just compare it with the top of the stack
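A minimal sketch of a stack-based structural join for the ancestor-descendant case. It assumes (begin, end) intervals taken from a single counter over start and end tags, with both inputs sorted by begin. The function name and sample data are illustrative, and the published algorithm has variants this sketch does not cover. Each input is read once, and the stack always holds a chain of nested ancestors.

```python
def stack_structural_join(ancestors, descendants):
    """Inputs: lists of (begin, end) sorted by begin. Returns (ancestor, descendant) pairs."""
    pairs, stack = [], []
    i = j = 0
    while j < len(descendants):
        d = descendants[j]
        if i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:   # pop ancestors that ended before a starts
                stack.pop()
            stack.append(a)
            i += 1
        else:
            while stack and stack[-1][1] < d[0]:   # pop ancestors that ended before d starts
                stack.pop()
            pairs.extend((a, d) for a in stack)    # every node left on the stack contains d
            j += 1
    return pairs

# Intervals for <a><x><b/></x><b/><c><a><b/></a></c></a>, counting start and end tags.
a_nodes = [(1, 14), (9, 12)]
b_nodes = [(3, 4), (6, 7), (10, 11)]
print(stack_structural_join(a_nodes, b_nodes))
```

Only the top of the stack is compared with the incoming item; once the top contains it, everything below the top does too.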

Page 30: XML Data Management Ning Zhang University of Waterloo

Native Path Query Processing

Holistic Twig Join (Bruno et al. ’02)
    Improve the stack-based structural join algorithm
    Use one join algorithm for the whole path expression instead of one join for each step
    Reduce the overhead of producing and storing intermediate results

Page 31: XML Data Management Ning Zhang University of Waterloo

Native Path Query Processing

Natix (Brantner et al. ’05)
    Translate each step into a logical navigational operator, Unnest-Map
    Each Unnest-Map operator is translated into a physical operator that performs tree traversal on the Natix storage
    Physical optimization can be performed on the physical navigational operators to reduce cross-cluster I/O.

Page 32: XML Data Management Ning Zhang University of Waterloo

Native Path Query Processing

IBM DB2 XNav (Josifovski et al. ’04)
    XML path expressions are translated into automata
    The automaton is constructed dynamically while traversing the XML tree in DFS
    Physical I/O can be optimized by navigating to next_sibling without traversing the whole subtree

Page 33: XML Data Management Ning Zhang University of Waterloo

Native Path Query Processing

Tree automata (Koch ’03)
    The tree automaton needs two passes over the tree
    The first traversal runs a bottom-up deterministic tree automaton to determine which states are reachable
    The second traversal runs a top-down tree automaton to prune the reachable states and compute predicates.

Page 34: XML Data Management Ning Zhang University of Waterloo

Hybrid Processing

BlossomTree (Zhang ’04, Zhang ’05)
    Navigational approach is efficient for parent-child navigation
    Join-based approach is efficient for ancestor-descendant
    BlossomTree approach identifies sub-expressions, Next-of-Kin (NoK), that are efficient for navigational approach.
    Use navigational approach for NoK subexpressions and use structural joins to join intermediate results

Page 35: XML Data Management Ning Zhang University of Waterloo

XML Indexing

Structural Index
    Clustering tree nodes by their structural similarity (e.g., bisimilarity and F&B bisimilarity)
    Index is a graph, in which each vertex is an equivalence class of similar XML tree nodes
    Path query evaluation amounts to navigational evaluation on the graph

Page 36: XML Data Management Ning Zhang University of Waterloo

Agenda

1. XML Storage

2. XML Path Query Processing

3. XML Optimization

Page 37: XML Data Management Ning Zhang University of Waterloo

Overview of Cost-based Optimization

Query Optimization depends on:
    1. How much knowledge about the data do we have?
    2. How intelligently can we make use of the knowledge (within a time constraint)?
The cost of a plan is heavily dependent on:
    The cost model of each operator
    The cardinality/selectivity of each operator

Page 38: XML Data Management Ning Zhang University of Waterloo

Cardinality Estimation

Full path summarization: DataGuide (Goldman ’97) and PathTree (Aboulnaga ’01)
    Summarize all distinct paths in XML documents in a graph
    Cardinality information is annotated on graph vertices
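A minimal sketch of full path summarization. For brevity it keeps a flat table from each distinct rooted path to its cardinality instead of building the summary graph, so it is not the DataGuide or PathTree structure itself. The sample document and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_summary(xml_text):
    """Map each distinct rooted path to the number of elements reached by it."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(root, "")
    return counts

summary = path_summary(
    "<bib><book><title/><author/><author/></book><book><title/></book></bib>"
)
print(summary["/bib/book/author"])
```

Because every distinct rooted path is kept, cardinalities of simple rooted paths are exact. The price is a summary that grows with the number of distinct paths.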

Page 39: XML Data Management Ning Zhang University of Waterloo

Cardinality Estimation

Partial path summarization: Markov Table (Aboulnaga ’01)
    Keep sub-paths and cardinality information in a table
    Cardinality for longer paths is calculated using partial paths.
    Can use additional compression methods to accommodate Internet-scale databases
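A minimal sketch of estimating a longer path from partial paths. It assumes the table stores counts only for paths of length 1 and 2, and uses a Markov-style estimate f(t1/t2/…/tn) ≈ f(t1/t2) · Π f(ti/ti+1) / f(ti) for i = 2 … n-1. The table layout, the formula as written here and the function names are assumptions of this example; compression of the table is not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def markov_table(xml_text):
    """Count every path of length 1 (tag) and length 2 (parent/child), wherever it occurs."""
    root = ET.fromstring(xml_text)
    table = Counter()

    def visit(elem):
        table[(elem.tag,)] += 1
        for child in elem:
            table[(elem.tag, child.tag)] += 1
            visit(child)

    visit(root)
    return table

def estimate(table, tags):
    """Estimate the cardinality of tags[0]/tags[1]/.../tags[-1] from the stored counts."""
    if len(tags) == 1:
        return float(table[(tags[0],)])
    result = float(table[(tags[0], tags[1])])
    for parent, child in zip(tags[1:], tags[2:]):
        if table[(parent,)] == 0:
            return 0.0
        # Fraction of child steps per parent node, assumed independent of earlier steps.
        result *= table[(parent, child)] / table[(parent,)]
    return result

table = markov_table("<a><b><c/><c/></b><b><c/></b><d><b/></d></a>")
print(estimate(table, ["a", "b", "c"]))
```

In this document the true cardinality of a/b/c is 3, so the estimate is only approximate: the b under d has no c children, which lowers the average that the independence assumption relies on.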

Page 40: XML Data Management Ning Zhang University of Waterloo

Cardinality Estimation

Structural clustered summarization: XSketch (Polyzotis ’02) and TreeSketch (Polyzotis ’04)
    Similar idea to the clustering-based index
    XSketch uses forward and backward stability, and TreeSketch uses count stability as similarity measurement
    Heuristics to reduce graph to fit memory budget

Page 41: XML Data Management Ning Zhang University of Waterloo

Cardinality Estimation

Decompression-based approach: XSEED (Zhang ’06)
    XML documents are compressed into a small kernel with edge cardinality labels
    Kernel can be decompressed into an XML document with cardinality annotations
    Navigational path operator can be reused on the decompressed XML document for cardinality estimation

Page 42: XML Data Management Ning Zhang University of Waterloo

Cost Modeling

Statistical Learning Cost Model: COMET (Zhang ’05)
    Relational operator cost modeling is performed by analyzing the source code
    XML operators are much more complex than relational operators; therefore the analytical approach is too time-consuming
    Statistical learning approach needs a training set of queries and learns the cost model from the input parameters and real cost.
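A toy sketch of learning a cost model from training data. It fits a straight line from one input parameter to observed cost by ordinary least squares. This is a deliberately simple stand-in and not the learning method used by COMET; the single feature (for example, the number of nodes visited), the training numbers and the function names are all hypothetical.

```python
def fit_linear_cost_model(training):
    """training: list of (feature_value, observed_cost). Returns (intercept, slope)."""
    n = len(training)
    mean_x = sum(x for x, _ in training) / n
    mean_y = sum(y for _, y in training) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in training)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training)
    slope = sxy / sxx                  # requires at least two distinct feature values
    return mean_y - slope * mean_x, slope

def predict_cost(model, feature_value):
    """Apply the learned model to a new query's input parameter."""
    intercept, slope = model
    return intercept + slope * feature_value

# Made-up training set: (input parameter, measured cost) for three training queries.
model = fit_linear_cost_model([(1, 5.0), (2, 7.0), (3, 9.0)])
print(predict_cost(model, 10))
```

The point of the sketch is the workflow: run training queries, record input parameters and real cost, fit a model, then use the model to predict the cost of unseen queries.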