Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December

Streaming XPath / XQuery Evaluation and Course Wrap-Up

Zachary G. Ives
University of Pennsylvania

CIS 650 – Implementing Data Management Systems

December 2, 2008

Administrivia

Recall that the final project is due – with a write-up and a 10-minute demo presentation – on Tuesday 12/16, 9-11AM

Also: course evaluations (at end)


XML: Its Roles

Perhaps used as a superset of HTML for documents, but…

Most successful as a transport format for sending data between systems
* SOAP, WSDL, etc.
* Data interchange formats like ebXML, MAGE-ML, …

So why would we want to store it in a database to query it, when we could query over XML as it streams across the network?
* (Note: not infinite streams, as in DSMSs, and it’s hierarchical)


Streaming XPaths and XQueries

Suppose I give an XPath expression (essentially a restricted form of regular path expression). Can I match it against the parse tree of the data?

An XQuery takes multiple XPaths in the FOR clause, and iterates over the elements matching each XPath (binding the variable to each):
    FOR $i in doc("abc")/xyz, $j in $i/def

We can think of an XQuery as doing tree matching, which returns a tuple ($i, $j) for each pair of subtrees matching $i and $j
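The tuple view of the FOR clause can be sketched in a few lines of Python. This is only an illustration of the binding semantics, not a streaming evaluator: it loads a small document into memory with the standard `xml.etree.ElementTree` module. The sample document, its `root` wrapper element, and the `id` attributes are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for doc("abc"); the <root> wrapper and id attributes
# are assumptions added so that $i has several bindings.
DOC = """
<root>
  <xyz id="1"><def>a</def><def>b</def></xyz>
  <xyz id="2"><def>c</def></xyz>
  <xyz id="3"/>
</root>
"""

def bindings(root):
    """Yield one ($i, $j) tuple per combination, like nested FOR clauses."""
    for i in root.findall("xyz"):      # FOR $i in .../xyz
        for j in i.findall("def"):     # $j in $i/def
            yield (i, j)

root = ET.fromstring(DOC)
tuples = [(i.get("id"), j.text) for i, j in bindings(root)]
print(tuples)
```

Note that the third `xyz` element produces no tuple, since it has no `def` child to bind `$j` to.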


Where This Leads

An XQuery can be broken into two operations:

A parsing / tree matching stage (FOR and also LET)
* Finds matches to the variables
* Returns a tuple of trees

A (mostly) pipelined SPJ / union / group by / order by engine (WHERE, ORDER BY, nesting in RETURN)
* Like a regular relational engine extended with an XML tree datatype!

The first engine to put these things together: Tukwila (Ives+ 2000, 2002)

IBM DB2 was built upon a nearly identical model – TurboXPath (Josifovski 2004)


The Key: SAX (Simple API for XML)

If we are to match XPaths in streaming fashion, we need a stream of data items

The original parser model: DOM (Document Object Model)
* Builds an entire object hierarchy in memory, which is traversable
* Not incremental! (Until later versions)

SAX: a series of event notifications
* open-tag, close-tag, character data
* Idea: build a state machine (or similar mechanism) to match on the events!
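A minimal sketch of the state-machine idea, using Python's standard `xml.sax` module. It handles only child-axis paths such as /db/customer/name, and it is not the algorithm of any particular system. The class name `PathMatcher`, the sample document, and the element names are assumptions for illustration.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Match a child-axis-only path over SAX events.

    The state is the number of leading path steps matched by the currently
    open elements, so this behaves like a small deterministic automaton.
    """

    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.depth = 0        # number of currently open elements
        self.matched = 0      # number of path steps matched so far
        self.buffer = None    # character data of the current full match
        self.results = []

    def startElement(self, name, attrs):
        # Advance only if every ancestor matched and this tag is the next step.
        can_advance = self.matched == self.depth and self.matched < len(self.steps)
        if can_advance and name == self.steps[self.matched]:
            self.matched += 1
            if self.matched == len(self.steps):
                self.buffer = []
        self.depth += 1

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        self.depth -= 1
        if self.matched > self.depth:
            # The element being closed was part of the match: emit and back up.
            if self.buffer is not None:
                self.results.append("".join(self.buffer))
                self.buffer = None
            self.matched = self.depth

# Hypothetical sample data: the <name> under <order> must not match.
DATA = (b"<db><customer><name>Ann</name><order><name>x</name></order></customer>"
        b"<customer><name>Bob</name></customer></db>")

matcher = PathMatcher(["db", "customer", "name"])
xml.sax.parseString(DATA, matcher)
print(matcher.results)
```

The handler never holds more than the text of one matching element, which is the point of working from events rather than from a DOM tree.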


Different Options

Many different “streaming XPath” matching algorithms were developed, with some differences:
* What to match with (DFA, NFA, lazy DFA, PDA, proprietary format)
* Complexity of the path language (regular path expressions, XPath), axes (downwards, upwards, sideways), internal references (IDREFs, foreign keys), recursive patterns
* Which operations can be pushed into the operator (selection predicates, joins, position indices)

We’ll consider TurboXPath, highlighted in red above (Tukwila’s x-scan is highlighted in green)
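To make the NFA option concrete, here is a rough sketch that supports only the child (/) and descendant (//) axes with plain element names. It simulates an NFA by keeping one set of active states per open element on a stack. This is a generic textbook-style construction, not TurboXPath's or x-scan's actual data structure; `parse_path`, `NfaMatcher`, and the sample document are hypothetical names and data.

```python
import xml.sax

def parse_path(path):
    """Split a path such as '//a/b' into [(axis, name), ...] steps."""
    steps, i = [], 0
    while i < len(path):
        if path.startswith("//", i):
            axis, i = "desc", i + 2
        elif path.startswith("/", i):
            axis, i = "child", i + 1
        else:
            raise ValueError("expected '/' or '//' at offset %d" % i)
        j = i
        while j < len(path) and path[j] != "/":
            j += 1
        if j == i:
            raise ValueError("empty step at offset %d" % i)
        steps.append((axis, path[i:j]))
        i = j
    return steps

class NfaMatcher(xml.sax.ContentHandler):
    """State s in an element's set means: step s may be tried on its children."""

    def __init__(self, path):
        super().__init__()
        self.steps = parse_path(path)
        self.stack = [{0}]     # active NFA states, one set per open element
        self.open_tags = []    # names of the currently open elements
        self.matches = []      # absolute paths of the matched elements

    def startElement(self, name, attrs):
        nxt = set()
        for s in self.stack[-1]:
            if s == len(self.steps):
                continue                 # accepting state has no outgoing step
            axis, want = self.steps[s]
            if want == name:
                nxt.add(s + 1)           # this element satisfies step s
            if axis == "desc":
                nxt.add(s)               # '//' may still match deeper down
        self.open_tags.append(name)
        if len(self.steps) in nxt:
            self.matches.append("/" + "/".join(self.open_tags))
        self.stack.append(nxt)

    def endElement(self, name):
        self.stack.pop()                 # state for this subtree is discarded
        self.open_tags.pop()

def match(path, data):
    handler = NfaMatcher(path)
    xml.sax.parseString(data, handler)
    return handler.matches

# Hypothetical sample document.
DATA = (b"<db><customer><order/><info><order/></info></customer>"
        b"<archive><customer><order/></customer></archive><order/></db>")

child_matches = match("//customer/order", DATA)
desc_matches = match("//customer//order", DATA)
print(child_matches)
print(desc_matches)
```

Popping the state set on each close-tag keeps memory proportional to document depth; a lazy DFA would instead cache each distinct state set as a single DFA state.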


From XPath Patterns to Tuples and A Normal Query Plan

    for $c in doc("d1")//customer
    for $p in doc("d2")//profiles[cid/text() = $c/cid/text()]
    for $o in $c/order[date = '12/12/01']
    return <result>
             {$c/name} {$p/status} {$o/amount}
           </result>

Query plan (bottom-up):
* TurboXPath over “d1” produces tuples ($c/cid/text(), $c/name, $o/amount)
* TurboXPath over “d2” produces tuples ($p/status, $p/cid/text())
* ⋈ Pipelined join of the two tuple streams produces ($c/name, $p/status, $o/amount)
* XML tagger (add “result”)
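The upper part of this plan can be sketched as a symmetric hash join feeding a tagger. This is a generic pipelined join written for illustration, not the operator used in Tukwila or DB2. The dict field names (`c_cid`, `p_cid`, and so on) and the sample tuples are assumptions standing in for the output of the two XPath matchers.

```python
from collections import defaultdict
from itertools import zip_longest
from xml.sax.saxutils import escape

def pipelined_join(left, right, left_key, right_key):
    """Symmetric hash join: emit each joined tuple as soon as both halves arrive.

    Both inputs are iterators of dicts; they are read alternately here to
    mimic two streams arriving at the same time.
    """
    left_table, right_table = defaultdict(list), defaultdict(list)
    for l, r in zip_longest(left, right):
        if l is not None:
            left_table[l[left_key]].append(l)
            for other in right_table.get(l[left_key], ()):
                yield {**l, **other}
        if r is not None:
            right_table[r[right_key]].append(r)
            for other in left_table.get(r[right_key], ()):
                yield {**other, **r}

def tag(t):
    """XML tagger: wrap the projected fields in a <result> element."""
    return ("<result><name>%s</name><status>%s</status><amount>%s</amount></result>"
            % (escape(t["c_name"]), escape(t["p_status"]), escape(t["o_amount"])))

# Hypothetical outputs of the matchers over "d1" and "d2".
customers = [
    {"c_cid": "1", "c_name": "Ann", "o_amount": "10"},
    {"c_cid": "2", "c_name": "Bob", "o_amount": "25"},
]
profiles = [
    {"p_cid": "2", "p_status": "gold"},
    {"p_cid": "1", "p_status": "silver"},
    {"p_cid": "3", "p_status": "basic"},
]

results = [tag(t) for t in pipelined_join(iter(customers), iter(profiles), "c_cid", "p_cid")]
for line in results:
    print(line)
```

Because each side is both inserted and probed on arrival, no input has to be read to completion before output starts, which is what makes the join suitable for data streaming in over the network.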

XPath Processing in TurboXPath


Performance Issues

Predicate pushdown
* Similar to “sargable predicates” – reduces the internal state that must be run through a cross-product to produce tuples

“Smart” memory management
* Want to deallocate space from partial pattern matches as early as possible

Parser efficiency
* We found that Xerces-C (validating C++ parser used by TurboXPath) was 10x slower than expat (non-validating C parser)
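A small sketch of what predicate pushdown buys, again with Python's `xml.sax`. The matcher either emits every order and leaves filtering to a downstream operator, or evaluates the date predicate itself and drops non-qualifying orders as soon as they close. The class `OrderMatcher`, its `wanted_date` parameter, and the sample data are hypothetical; this does not reproduce TurboXPath's internals.

```python
import xml.sax

class OrderMatcher(xml.sax.ContentHandler):
    """Emit (date, amount) per <order>; optionally filter on date in the matcher."""

    def __init__(self, wanted_date=None):
        super().__init__()
        self.wanted_date = wanted_date   # None means no pushdown
        self.current = None              # fields of the open <order>, if any
        self.field = None                # field currently collecting text
        self.emitted = []

    def startElement(self, name, attrs):
        if name == "order":
            self.current = {"date": "", "amount": ""}
        elif self.current is not None and name in self.current:
            self.field = name

    def characters(self, content):
        if self.field is not None:
            self.current[self.field] += content

    def endElement(self, name):
        if name == self.field:
            self.field = None
        elif name == "order":
            t, self.current = self.current, None   # release the partial match
            if self.wanted_date is None or t["date"] == self.wanted_date:
                self.emitted.append((t["date"], t["amount"]))

# Hypothetical sample data.
DATA = (b"<orders>"
        b"<order><date>12/12/01</date><amount>10</amount></order>"
        b"<order><date>01/05/02</date><amount>20</amount></order>"
        b"<order><date>12/12/01</date><amount>30</amount></order>"
        b"</orders>")

no_pushdown = OrderMatcher()
xml.sax.parseString(DATA, no_pushdown)
filtered_later = [t for t in no_pushdown.emitted if t[0] == "12/12/01"]

pushdown = OrderMatcher(wanted_date="12/12/01")
xml.sax.parseString(DATA, pushdown)

print(len(no_pushdown.emitted), len(pushdown.emitted))
```

Both variants give the same final answer, but with pushdown the matcher hands fewer tuples to the rest of the plan, so less state reaches the cross-product that builds output tuples.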


Wrapping up…

This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
* Storage
* Concurrency control
* Query processing
* Data distribution and streams
* Heterogeneity, mappings, and reformulation (and the limitations thereof)
* Many styles of data integration
* XML processing

I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…

Where There Is Room for More Work (Among Many Topics)

Storage: rows versus columns
Concurrency control
Query processing
* Is there a theory of adaptivity, and an optimal scheme?
Data distribution, networks, and streams
* How do we distribute to 10,000 nodes? What is the relationship between network communication and query processing?
Data integration, better support for collaboration
* How can we make it less human-intensive?
“Lightweight databases”
Probabilistic databases
Visualization and interfaces
Databases meets machine learning and info retrieval


A Sampler of Some of the Systems Work by (Some) Major DB Groups

Washington: Mystiq – probabilistic databases; distrib. streams
Stanford: Trio – probabilities and “lineage” meets databases
Cornell: databases meets games; probabilistic databases
Wisconsin: Cimple; database support for monitoring clusters
MIT: Sensor query processing; signal processing; column stores
Berkeley: Data management for sensors and networks
Maryland: Querying data models; learning and probabilities meets databases
Penn: Orchestra; data and workflow provenance; keyword querying with learned ranks over databases; lightweight data integration; networking meets databases; sensor integration


Thanks!!!

I had a great time this semester – I hope you learned a lot and found it enjoyable.

I’m looking forward to seeing your projects!
