Streaming XPath / XQuery Evaluation and Course Wrap-Up
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
December 2, 2008
Administrivia
Recall that the final project is due – with a write-up and a 10-minute demo presentation – on Tuesday 12/16, 9-11AM
Also: course evaluations (at end)
XML: Its Roles
Perhaps used as a superset of HTML for documents, but…
Most successful as a transport format for sending data between systems:
  SOAP, WSDL, etc.
  Data interchange formats like ebXML, MAGE-ML, …
So why would we want to store it in a database to query it, when we could query over XML as it streams across the network?
  (Note: not infinite streams, as in DSMSs, and it’s hierarchical)
Streaming XPaths and XQueries
Suppose I give an XPath expression (which is essentially a restricted regular expression over paths). Can I match it against the parse tree of the data?
An XQuery takes multiple XPaths in the FOR clause, and iterates over the elements of each XPath (binding the variable to each):
  FOR $i in doc("abc")/xyz, $j in $i/def
We can think of an XQuery as doing tree matching, which returns tuples ($i, $j) for each tree matching $i and $j
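To make the "tuples of bindings" idea concrete, here is a minimal Python sketch of the FOR clause above. It is only an illustration: it parses the whole document with the standard `xml.etree.ElementTree` module rather than streaming, and the function name `for_bindings` is made up for this example.

```python
import xml.etree.ElementTree as ET


def for_bindings(xml_text):
    """Yield ($i, $j) tuples for: FOR $i in doc(...)/xyz, $j in $i/def.

    Assumption: the document is small enough to parse into a tree first;
    a streaming engine would produce the same tuples incrementally.
    """
    root = ET.fromstring(xml_text)
    # doc(...)/xyz matches only a root element named "xyz".
    outer = [root] if root.tag == "xyz" else []
    for i in outer:
        # $i/def matches the direct "def" children of each $i binding.
        for j in i.findall("def"):
            yield (i, j)


if __name__ == "__main__":
    sample = "<xyz><def>a</def><other/><def>b</def></xyz>"
    for i, j in for_bindings(sample):
        print(i.tag, j.text)
```

Each yielded pair is one row of the binding table; the rest of the query (WHERE, RETURN) can then treat these rows like relational tuples whose attributes happen to be trees.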
Where This Leads
An XQuery can be broken into two operations:
A parsing / tree matching stage (FOR and also LET)
  * Finds matches to the variables
  * Returns a tuple of trees
A (mostly) pipelined SPJ / union / group by / order by engine (WHERE, ORDER BY, nesting in RETURN)
  * Like a regular relational engine extended with an XML tree datatype!
The first engine to put these things together: Tukwila (Ives+ 2000, 2002)
IBM DB2 was built upon a nearly identical model – TurboXPath (Josifovski 2004)
The Key: SAX (Simple API for XML)
If we are to match XPaths in streaming fashion, we need a stream of data items
The original parser model: DOM (Document Object Model)
  Builds an entire object hierarchy in memory, which is traversable
  Not incremental! (Until later versions)
SAX: a series of event notifications
  open-tag, close-tag, character data
  Idea: build a state machine (or similar mechanism) to match on the events!
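A minimal sketch of the "state machine over SAX events" idea, using Python's standard `xml.sax` module. It handles only an absolute path of child steps (no `//`, predicates, or other axes), and the class and function names (`PathMatcher`, `match_path`) are invented for this illustration; it is not the automaton used by x-scan or TurboXPath.

```python
import xml.sax


class PathMatcher(xml.sax.ContentHandler):
    """Match a child-axis-only path such as /db/customer/name over SAX events.

    The stack holds one state per open element: the number of path steps
    matched so far, or -1 if this subtree can no longer match.
    """

    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.stack = [0]      # document node: zero steps matched
        self.text = []        # character data of the current match
        self.matches = []     # text of every fully matched element

    def startElement(self, name, attrs):
        parent = self.stack[-1]
        if 0 <= parent < len(self.steps) and self.steps[parent] == name:
            state = parent + 1    # advance the automaton by one step
        else:
            state = -1            # dead state for this subtree
        self.stack.append(state)
        if state == len(self.steps):
            self.text = []        # full match: start collecting text

    def characters(self, content):
        # Collect only text directly inside a fully matched element.
        if self.stack[-1] == len(self.steps):
            self.text.append(content)

    def endElement(self, name):
        state = self.stack.pop()
        if state == len(self.steps):
            self.matches.append("".join(self.text))


def match_path(xml_bytes, steps):
    handler = PathMatcher(steps)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches


if __name__ == "__main__":
    doc = b"<db><customer><name>Ann</name></customer></db>"
    print(match_path(doc, ["db", "customer", "name"]))
```

The stack is what lets the matcher cope with hierarchy: a close-tag event simply pops back to the parent's state, so memory is proportional to document depth rather than document size.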
Different Options
Many different “streaming XPath” matching algorithms were developed, with some differences:
  What to match with (DFA, NFA, lazy DFA, PDA, proprietary format)
  Complexity of the path language (regular path expressions, XPath), axes (downwards, upwards, sideways), internal references (IDREFs, foreign keys), recursive patterns
  Which operations can be pushed into the operator (selection predicates, joins, position indices)
We’ll consider TurboXPath, highlighted in red above (Tukwila’s x-scan is highlighted in green)
From XPath Patterns to Tuples and A Normal Query Plan
for $c in doc("d1")//customer
for $p in doc("d2")//profiles[cid/text() = $c/cid/text()]
for $o in $c/order[date = '12/12/01']
return <result> {$c/name} {$p/status} {$o/amount} </result>

Query plan (bottom to top):
  TurboXPath over "d1" produces ($c/cid/text(), $c/name, $o/amount)
  TurboXPath over "d2" produces ($p/status, $p/cid/text())
  ⋈ Pipelined join on $p/cid/text() = $c/cid/text()
  Join output: ($c/name, $p/status, $o/amount)
  XML tagger (add "result")
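The plan above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DB2 or Tukwila implementation: the two TurboXPath operators are replaced by plain lists of dicts, bindings are simplified to strings rather than XML trees, and the symmetric hash join is just one common way to build a pipelined join (the slide does not say which algorithm is used). The names `symmetric_hash_join` and `tag_result` are made up for this example.

```python
from itertools import zip_longest
from xml.sax.saxutils import escape


def symmetric_hash_join(left, right, key):
    """Pipelined equi-join: emit matches as soon as both sides have arrived.

    Tuples are read alternately from both inputs; each new tuple is stored
    in its own hash table and probed against the other side's table.
    """
    left_table, right_table = {}, {}
    for l, r in zip_longest(left, right):
        if l is not None:
            left_table.setdefault(l[key], []).append(l)
            for m in right_table.get(l[key], []):
                yield {**l, **m}
        if r is not None:
            right_table.setdefault(r[key], []).append(r)
            for m in left_table.get(r[key], []):
                yield {**m, **r}


def tag_result(t):
    """XML tagger: wrap one joined tuple in a <result> element."""
    return ("<result><name>%s</name><status>%s</status><amount>%s</amount></result>"
            % (escape(t["name"]), escape(t["status"]), escape(t["amount"])))


if __name__ == "__main__":
    # Stand-ins for the tuple streams from the two XPath matchers.
    d1 = [{"cid": "1", "name": "Ann", "amount": "10"}]
    d2 = [{"cid": "1", "status": "gold"}]
    for joined in symmetric_hash_join(d1, d2, "cid"):
        print(tag_result(joined))
```

Because neither input has to be fully read before output starts, results can be produced while both documents are still arriving over the network.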
XPath Processing in TurboXPath
Performance Issues
Predicate pushdown
  Similar to “sargable predicates” – reduces the internal state that must be run through a cross-product to produce tuples
“Smart” memory management
  Want to deallocate space from partial pattern matches as early as possible
Parser efficiency
  We found that Xerces-C (validating C++ parser used by TurboXPath) was 10x slower than expat (non-validating C parser)
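A toy illustration of why predicate pushdown shrinks internal state. The setup is an assumption made for this sketch: for each customer, the matcher buffers the bindings found for each variable and takes their cross-product when the customer element closes. The function `match_customer` and its arguments are hypothetical and do not reflect TurboXPath's actual data structures.

```python
from itertools import product


def match_customer(names, orders, predicate, pushdown):
    """Return (tuples, buffered) for the bindings found under one customer.

    With pushdown, the predicate is applied while orders are being matched,
    so non-qualifying orders are never buffered. Without it, every order is
    buffered and the predicate is applied after the cross-product.
    """
    if pushdown:
        orders = [o for o in orders if predicate(o)]
    buffered = len(names) + len(orders)   # bindings held until the close tag
    tuples = [(n, o) for n, o in product(names, orders)
              if pushdown or predicate(o)]
    return tuples, buffered


if __name__ == "__main__":
    names = ["Ann"]
    orders = [{"date": "12/12/01", "amount": 10},
              {"date": "01/05/02", "amount": 5}]
    on_date = lambda o: o["date"] == "12/12/01"
    print(match_customer(names, orders, on_date, pushdown=True))
    print(match_customer(names, orders, on_date, pushdown=False))
```

Both settings return the same tuples; only the amount of buffered state differs, which is the point of pushing the selection into the operator.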
Wrapping up…
This semester has been a whirlwind tour of many different aspects of the “data ecosystem”:
  Storage
  Concurrency control
  Query processing
  Data distribution and streams
  Heterogeneity, mappings, and reformulation (and the limitations thereof)
  Many styles of data integration
  XML processing
I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…
Where There Is Room for More Work (Among Many Topics)
Storage: rows versus columns
Concurrency control
Query processing
  Is there a theory of adaptivity, and an optimal scheme?
Data distribution, networks, and streams
  How do we distribute to 10,000 nodes?
  What is the relationship between network communication and query processing?
Data integration, better support for collaboration
  How can we make it less human-intensive?
“Lightweight databases”
Probabilistic databases
Visualization and interfaces
Databases meets machine learning and info retrieval
A Sampler of Some of the Systems Work by (Some) Major DB Groups
Washington: Mystiq – probabilistic databases; distributed streams
Stanford: Trio – probabilities and “lineage” meets databases
Cornell: databases meets games; probabilistic databases
Wisconsin: Cimple; database support for monitoring clusters
MIT: Sensor query processing; signal processing; column stores
Berkeley: Data management for sensors and networks
Maryland: Querying data models; learning and probabilities meets databases
Penn: Orchestra; data and workflow provenance; keyword querying with learned ranks over databases; lightweight data integration; networking meets databases; sensor integration
Thanks!!!
I had a great time this semester – I hope you learned a lot and found it to be enjoyable.
I’m looking forward to seeing your projects!