Xpath Query Evaluation
Goal
• Evaluating an Xpath query against a given document– To find all matches
• We will also consider the use of types
• Complexity is important– Huge Documents
Data complexity vs. Combined Complexity
• Two inputs to the query evaluation problem– Data (XML document) of size |D|– Query (Xpath expression) of size |Q|– Usually |Q| << |D|
• Polynomial data complexity– Complexity that is polynomial in |D|, possibly exponential in |Q|
• Polynomial combined complexity– Complexity that is polynomial in |D| and |Q|
• Fixed Parameter Tractable complexity – Complexity Poly(|D|)*f(|Q|)
Xpath Query Evaluation
• Input: XML Document D, Xpath query Q
• Output: A subset of the nodes of D, as defined by Q
• We will follow Efficient Algorithms for Processing Xpath Queries / Gottlob, Koch, Pichler, TODS 2005
Simple algorithm
process-location-step(n,Q) { S:-= Apply Q.first to n; If |Q|> 1 For each node n’ in s do process-location-step(n’,Q.next)}
Complexity
• Worst case: in each step of Q the axis is “following”
• So we apply the query in each step on O(|D|) nodes
• And we get Time(|Q|)= |D|*Time(|Q|-1)
• I.e. the complexity is O(|D|^|Q|)
Early Systems Performance
Figure taken from Gottlob, Koch, Pichler ‘05
Internet Explorer 6
Figure taken from Gottlob, Koch, Pichler ‘05
IE6 – performance as a function of document size
Figure taken from Gottlob, Koch, Pichler ‘05
Polynomial data complexity
• Poly data complexity is sometimes considered good even if exponential in the query size
• But can we have polynomial combined complexity for Xpath query evaluation?
• Yes!
Two main principles
• Query parse trees: the query is divided to parts according to its structure (not to be confused with the XML tree structure)
• Context-value tables: for every expression e occurring in the parse tree, compute a table of all valid combinations of context c and value v such that e evaluates to v in c.
Xpath query parse tree
descendant::b/following-sibling::* [position() != last()]
Bottom-up vs. Top-down evaluation
• We will discuss two kinds of query evaluation algorithms:– Bottom-up means that the query parse tree is
processed from the leaves up to the root– Top-down means that the parse tree is processed
from the root to the leaves
• When processing we will fill in the context-value table
Bottom-up evaluation
• Main idea: compute the value for each leaf for every possible context
• Propagate upwards until the root
• Dynamic programming algorithm to avoid re-evaluation of queries in the same context
Operational semantics
• Needed as a first step for evaluation algorithms
• Similar ideas used in compilers design
• Here the semantics is based on the notion of contexts
Contexts
• The domain of contexts is C= dom X {<k,n> | 1<k<n< |dom|} A context is c=<x,k,n> where x is a context node k is a context position n is the context size
Semantics for Xpath expressions
• The semantics of evaluating an expression is a 4-tuple where the first 3 elements are the context, and the fourth is the value obtained by evaluation in the context
Some notations
• T(t): all nodes satisfying a predicate t
• E(e): all nodes satisfying a regular exp. e (applied with respect to a given axis)
• Idxx(x,S) is the index of a node x in the set s with respect to a given axis and the document order
Context-value Table
• Given a query sub-expression e, the context-value table of e specifies all combinations of context c and value v, such that computing e on the context c results in v
• Bottom-up algorithm follows: compute the context-value table in a bottom-up fashion with respect to the query
Bottom-up algorithm
Example
4 times
Complexity
• O(|D|^3*|Q|) space ignoring strings and numbers– O(|Q|) tables, with 3 columns, each including values
in 1…|D| thus O(|D|^3*|Q|)– An extra O(|D|*|Q|) multiplicative factor for strings
and numbers
• O(|D|^5*|Q|) time ignoring strings and numbers– It can take O(|D|^2) to combine two nodesets– Extra O(|Q|) in case of strings and numbers
Optimization
• Represent contexts as pairs of current and previous node
• Allows to get the time complexity down to O(|D|^4* |Q|^2)
• Space complexity can be brought down to O(|D|^2*|Q|^2) via more optimizations
Top-down evaluation
• Similar idea
• But allows to compute only values for contexts that are needed
• Same worst-case bounds
Top-down or bottom-up?
• General question in processing XML trees• The tradeoff:
– Usually easier to combine results computed in children to obtain the result at the parent
• So bottom-up traversal is usually easier to design
– On the other hand, some of the computation is redundant since we don’t know if it will become relevant
• So top-down traversal may be more efficient
Linear-time fragment• Core Xpath includes only navigation
– \ and \\
• Core Xpath can be evaluated in O(|D|*|Q|)
• Observtion: no need to consider the entire triple, only current context node
• Top-down or bottom-up evaluation with essentially the same algorithm
• But smaller tables (for every query node, all document nodes and values of evaluation) are maintained.
Types are helpful
• Can direct the search– In some parts of the tree there is no hope to get a
match to a given sub-expression of the query– As a result we may have tables with less entries.
• Whiteboard discussion
Type Checking and Inference
• Type checking a single document: straightforward– Polynomial combined complexity if automaton
representing type is deterministic, exponential in automaton size but polynomial in document size otherwise
• Type checking the results of a (Xpath) query• Inferring the results of a query
Type Inference
• An (incomplete) algorithm for type inference can work its way to the top of the query parse tree to infer a type in a bottom-up fashion – Start by inferring a type for the leaves (simple
queries), then use it for their parents
• Type Inference is inherently incomplete.• Can be performed for some languages that
are “regular” in a sense.
Restricted language allowing for type inference
• Axes: child, descendant, parent, ancestor, following-sibling, etc.
• variables can be bound to nodes in the input tree= then passed as parameters
• An equality test can be performed between node ID's, but not between node values.
Type Checking
• In addition to inferring a type we need to verify containment in another type.
• Type Inference can be used as a tool for Type Checking.
• Type Checking was shown to be decidable for the same language fragment, but with high complexity.
Intuitive connection to text
• Queries => regular expressions• Types (tree automata) => context free
languages• Type Inference => intersection of context free
and regular languages, resulting in a context free one
• Type checking => Type Inference + inclusion of context free languages (with some restrictions to guarantee decidability)