Selectivity Estimation of Twig Queries on Cyclic Graphs
Department of Computer Science
Hong Kong Baptist University
Speaker: Byron Choi
Joint work with *Yun Peng and Jianliang Xu
(to appear ICDE 2011)
March 21 2011 @ COMP630Q
Agenda
Background
Problem Statement
Overview of our Framework
Matrix and its Transformations
Histograms and Estimation
Experimental Evaluation
Future Works
Graph Data is Ubiquitous
Navigational Queries
SELECT a set of nodes via a user-specified path
◦ //person[//open_auction//person]
◦ //: ancestor-descendant axes
CONNECT in logic (reachability tests)
There are evidently many other query formalisms
Selectivity Estimation
A classical problem: Given a query, estimate the count of the results efficiently
Requirements
◦ accurate
◦ efficient estimation time
◦ small overhead in terms of size
XMark, used in this Presentation
Selectivity Estimation (cont’)
Query optimizers rely on the counts to evaluate the costs of query plans
Example:
◦ XMark 1.0 (> 180,000 nodes)
◦ Query: //person[//open_auction//person]
◦ 25,500 person’s
◦ 12,000 open_auction’s
◦ 13,192 open_auction//person’s
◦ //open_auction → //person → ↑↑person
Problem Statement
Data: A rooted directed labeled graph (i.e., possibly cyclic)
Query: Twig queries (i.e., parent-child and ancestor-descendant axes and branches)
Problem statement: given a cyclic graph G and a twig Q, estimate the result count of Q on G.
[Figure: Graph G (vertices labeled department, faculty, name, RA, TA) and Twig Query Q (department, faculty, RA, TA)]
Our Position Relative to the Current State-of-the-Art
[Figure: prior work positioned by Graph Complexity (Tree (XML) vs. Cyclic Graph) and Query Complexity (Path Query vs. Twig Query): XSketch ’06, XSeed ’06, TreeSketch ’06, CST ’04, XPathLearner ’02, DataGuide ’99, and Our Work]
Related Work (Graph-based approaches)
DataGuide – Automata theories
◦ J. McHugh and J. Widom. Query optimization for xml. In VLDB, pages 315–326, 1999.
TreeSketch and XSketch – Bisimulation
◦ N. Polyzotis and M. Garofalakis. Xsketch synopses for xml data graphs. ACM Trans. Database Syst., 31(3):1014–1063, 2006.
◦ N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate xml query answers. In SIGMOD, pages 263–274, 2004.
Correlated Subpath Tree (CST)
◦ Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. Counting twig matches in a tree. In ICDE, pages 595–604, 2001.
Straight-Line Grammar
◦ D. K. Fisher and S. Maneth. Structural selectivity estimation for xml documents. In ICDE, pages 626–635, 2007.
2-dimensional histograms on TREEs
◦ Y. Wu, J. M. Patel, and H. V. Jagadish. Using histograms to estimate answer sizes for xml queries. Inf. Syst., 28(1-2):33–59, 2003.
Hidden Markov Model
◦ A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the selectivity of xml path expressions for internet scale applications. In VLDB, pages 591–600, 2001.
A novel bloom filter – two 1-dimensional histograms
◦ W. Wang, H. Jiang, H. Lu, and J. X. Yu. Bloom histogram: Path selectivity estimation for xml data with updates. In VLDB, pages 240–251, 2004.
Related Work (Relational approaches)
Technical Challenges
Interactions between cyclic graphs and recursion (i.e., //) in twig queries
Branches of twigs
A Typical Framework
[Figure: pipeline from Graph to Graph Representation, then (via a summarization technique) to Summarization of Graph’s Representation, then (via a selectivity estimation technique applied to a Query) to Selectivity]
• Previous approaches differ from each other in one or more of these steps
• We also follow this general framework
Framework – with Our Solution Now
Graph Representation
Summarization of Graph Rep.
Selectivity Estimation
Summary of Contributions
1. Cyclic graph representation
◦ Prime labeling (vs. other representations)
◦ Matrix representation of prime labeling
◦ Matrix transformation to C1P matrix
2. Summarization of graph’s representation
◦ 2-dimensional histogram for cyclic graph
3. Algorithms for selectivity estimation
Characteristics of our Contributions
Matrix representation of cyclic graphs
◦ Reuse some research from matrices
Histogram-based selectivity estimation
◦ No uniform distribution assumption
One data node/vertex – one 2-dimensional point
One query step (child or descendant) – multiple 2-dimensional points
Alternative Representations for Cyclic Graphs
Adjacency matrix/list
◦ Easy to construct
◦ Inefficient in determining ancestors/descendants
Transitive closure
◦ Efficient in ancestors/descendants
◦ Inefficient in terms of space
Prime labeling
◦ Smaller than transitive closure but larger than adjacency matrix
◦ Query efficiency better than adjacency matrix but worse than transitive closure
…
Prime Labeling
Originally proposed for tree data [X. Wu, ICDE’04]
◦ An update-friendly XML index for reachability tests
Later extended to DAGs [G. Wu, DASFAA’06]
◦ Each vertex is assigned a prime number
Our extension to cyclic graphs
◦ Applied to cyclic graphs
◦ Reduced labeling size further
Not every vertex is labeled with a unique prime number → smaller than G. Wu et al.
Prime Labeling (con’t)
Large prime numbers near the root of the graph
• assign each leaf vertex a prime number
• assign each intermediate vertex the product of the labels of its children
• label the root
Prime Labeling (our Def.)
G. Wu et al
Yun Peng
Querying with Prime Labeling
Reachability ≡ Divisibility
◦ c → d: 7 * 11 * 3 / 3 = 7 * 11 (an integer, so d is reachable from c)
◦ c → e: 7 * 11 * 3 / 5 = 46.2 (not an integer, so e is not reachable from c)
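The divisibility test can be sketched in a few lines. This is a minimal illustration, assuming each vertex's label is the product of the primes marked in its row of the matrix slide that follows; the `labels` dictionary and the `reachable` helper are names chosen here, not part of the original work.

```python
# Assumed labels: product of the primes marked in each vertex's matrix row.
labels = {
    "a": 3 * 5 * 7 * 11,
    "b": 3 * 5,
    "c": 3 * 7 * 11,
    "d": 3,
    "e": 5,
    "f": 7,
    "g": 3 * 11,
}

def reachable(src, dst):
    """dst is reachable from src iff label(dst) divides label(src)."""
    return labels[src] % labels[dst] == 0

print(reachable("c", "d"))  # 231 / 3 = 77, an integer
print(reachable("c", "e"))  # 231 / 5 = 46.2, not an integer
```

Note that under this test a vertex trivially divides its own label, so a caller who wants proper descendants has to exclude `src == dst` separately.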
Matrix Representation
     3  5  7  11
a    1  1  1  1
b    1  1  0  0
c    1  0  1  1
d    1  0  0  0
e    0  1  0  0
f    0  0  1  0
g    1  0  0  1
Columns: Prime numbers
Rows: Vertices
Reachability: Divisibility ≡ Logic ops
Experiments: Often just a constant factor smaller than the adjacency matrix!
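One way to read "Divisibility ≡ Logic ops" is that a row can be stored as a bit mask and divisibility becomes a subset test with bitwise AND. The sketch below assumes that reading and reuses the 7 × 4 matrix from this slide; `ROWS`, `PRIMES`, and both helpers are illustrative names.

```python
PRIMES = [3, 5, 7, 11]

# One bit string per vertex, in column order 3, 5, 7, 11.
ROWS = {
    "a": "1111", "b": "1100", "c": "1011", "d": "1000",
    "e": "0100", "f": "0010", "g": "1001",
}
MASKS = {v: int(bits, 2) for v, bits in ROWS.items()}

def reachable(src, dst):
    """dst's primes must all appear in src's row: dst AND src == dst."""
    return MASKS[dst] & MASKS[src] == MASKS[dst]

def label(vertex):
    """Product of the primes whose column holds a 1 (for cross-checking)."""
    product = 1
    for prime, bit in zip(PRIMES, ROWS[vertex]):
        if bit == "1":
            product *= prime
    return product

print(reachable("c", "d"), reachable("c", "e"))
```

Because the primes are distinct and each appears at most once per label, the bitwise test agrees with the divisibility test for every pair of vertices in this example.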
Where are we?
Experiments from XMark
◦ It is just a constant factor smaller than adjacency matrix
◦ How on earth would this be summarized?
Consecutive Ones Property (C1P)
A Consecutive Ones Matrix (C1P matrix) is a 0/1 matrix, in which 1s of each row are consecutive.
Since 1s are consecutive, each row of a C1P matrix can be summarized by an interval: [start column id of 1s, end column id of 1s]
One row → One vertex → One interval
non-consecutive ones matrix
r1   1 1 1 0 1 1
r2   0 0 1 1 1 1
consecutive ones matrix (column ids start at 0)
r1   0 1 1 1 1 1   → [1,5]
r2   0 0 1 1 1 0   → [2,4]
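The row-to-interval summary can be sketched as follows. This is a minimal illustration that assumes column ids start at 0, which is what the intervals [1,5] and [2,4] on the slide imply; `row_to_interval` is a name chosen here.

```python
def row_to_interval(row):
    """Summarize a 0/1 row whose 1s are consecutive as (start, end) column ids.

    Returns None for an all-zero row and raises ValueError if the 1s are
    not consecutive, since such a row has no single-interval summary.
    """
    ones = [i for i, bit in enumerate(row) if bit == 1]
    if not ones:
        return None
    start, end = ones[0], ones[-1]
    if end - start + 1 != len(ones):
        raise ValueError("row does not have consecutive ones")
    return (start, end)

print(row_to_interval([0, 1, 1, 1, 1, 1]))
print(row_to_interval([0, 0, 1, 1, 1, 0]))
```

The first row of the non-consecutive example, 1 1 1 0 1 1, raises `ValueError` here because its 1s are split by a 0.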
What do we get from C1P?
What do we get from C1P? (cont’)
Adopting a property of intervals
◦ Vertex w is reachable from vertex v if w lies in the bottom-right region of v on the plane
◦ For example, dot (2,4) is in the bottom-right region of dot (1,5), so r2 is reachable from r1
[Figure: rows r1 = 0 1 1 1 1 1 and r2 = 0 0 1 1 1 0 map to intervals [1,5] and [2,4], plotted as dots r1(1,5) and r2(2,4)]
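The bottom-right test amounts to interval containment. A minimal sketch, assuming the x coordinate of a dot is the interval start and the y coordinate is the interval end; `in_bottom_right` is an illustrative name.

```python
def in_bottom_right(v, w):
    """True if dot w = (start, end) lies to the bottom-right of dot v.

    "Right" means w starts no earlier than v, "bottom" means w ends no
    later than v, so w's interval is contained in v's interval.
    """
    return w[0] >= v[0] and w[1] <= v[1]

r1, r2 = (1, 5), (2, 4)
print(in_bottom_right(r1, r2))  # r2 is reachable from r1
print(in_bottom_right(r2, r1))  # r1 is not reachable from r2
```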
Complexities related to C1P
C1P matrix detection
◦ Linear time solvable [Hsu, Algorithms’02]
Transform a non-C1P matrix to a C1P matrix
◦ NP-hard [Tan, Algorithmica’07]
◦ No polynomial time approximation [Tan, Algorithmica’07]
Our Heuristic Algorithm
Main Idea: given any m × n matrix with r 1s, extract C1P submatrices (using the C1P matrix detection algorithm) and then concatenate them one by one
Time complexity: $O(m^2(m+n+r))$
Pseudo-code of the Matrix Transformation
Extract a submatrix for this iteration
Adding one row at a time
C1P detection – linear time
Transform to C1P – linear time
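The annotations above can be illustrated with a toy greedy loop. This is only a sketch under stated assumptions, not the authors' pseudo-code: the linear-time C1P detection is replaced by a brute-force search over column orders (exponential, usable only on tiny inputs), and the final concatenation of blocks and the column mappings are omitted. All function names are hypothetical.

```python
from itertools import permutations

def consecutive(bits):
    """True if the 1s in `bits` form one unbroken run (or there are none)."""
    ones = [i for i, b in enumerate(bits) if b]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def find_c1p_order(rows, n_cols):
    """Brute-force stand-in for linear-time C1P detection.

    Returns a column order under which every row is consecutive, or None.
    """
    for order in permutations(range(n_cols)):
        if all(consecutive([row[c] for c in order]) for row in rows):
            return order
    return None

def greedy_c1p_blocks(matrix):
    """Split row indices into blocks, each with a column order giving C1P."""
    n_cols = len(matrix[0])
    remaining = list(range(len(matrix)))
    blocks = []
    while remaining:                      # extract one submatrix per iteration
        block, order, rejected = [], None, []
        for r in remaining:               # add one row at a time
            rows = [matrix[i] for i in block + [r]]
            candidate = find_c1p_order(rows, n_cols)
            if candidate is None:
                rejected.append(r)        # defer this row to a later block
            else:
                block.append(r)
                order = candidate
        blocks.append((block, order))
        remaining = rejected
    return blocks

# Rows {0,1}, {1,2}, {0,2} cannot all be consecutive under one column order.
matrix = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
blocks = greedy_c1p_blocks(matrix)
print(blocks)
```

A single row always has the consecutive ones property under some column order, so every iteration places at least one row and the loop terminates.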
Optimizations for C1P Trans.
Horizontal matrix decomposition prior to C1P heuristics
◦ $O(m^2(m+n+r))$
◦ Use the three-sigma rule on the number of 1s in rows
Common pattern extraction
◦ Done by an intersection of the rows
Compressed (extensible) hash mappings
◦ One column in the original matrix may be mapped to multiple positions in a C1P matrix
◦ Support mapping ops in the compressed domain
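The common pattern extraction step is described only as an intersection of the rows. A minimal sketch of that intersection, assuming rows are 0/1 lists of equal length; how the extracted pattern is used afterwards is not stated in the slides, so the residual rows returned here are purely an illustrative assumption.

```python
def common_pattern(rows):
    """Column-wise AND of all rows: 1 only where every row has a 1."""
    return [int(all(col)) for col in zip(*rows)]

def residuals(rows):
    """Assumed follow-up: each row with the common 1s cleared."""
    pattern = common_pattern(rows)
    return [[bit & (1 - p) for bit, p in zip(row, pattern)] for row in rows]

# Rows a, c and g of the prime-labeling matrix (columns 3, 5, 7, 11).
rows = [[1, 1, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1]]
print(common_pattern(rows))
print(residuals(rows))
```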
What do we get from C1P? (Recall)
2-Dimensional Histogram
Recall that we summarize rows of a C1P matrix by intervals and then plot them as dots on the 2-d plane
The plane is divided into cells. For each cell, we record the number of dots located within it.
Given a vertex v, the set of vertices reachable from v must be located in the bottom-right region of v
Sum up the counts of the cells located in the bottom-right region of v, e.g., 1 + 2 = 3
[Figure: 2-d plane divided into cells labeled with dot counts 1, 2, 1, 1]
We build a 2-dimensional histogram for each kind of node
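The cell-counting idea can be sketched as below. This is a coarse illustration with assumed details: square cells of a fixed width, a dot (start, end) assigned to cell (start // width, end // width), and every cell that touches the bottom-right region counted in full. The sample dots and function names are invented for the example.

```python
from collections import Counter

def build_histogram(dots, width):
    """Count dots per grid cell; dot (start, end) falls in (start // width, end // width)."""
    return Counter((s // width, e // width) for s, e in dots)

def estimate_reachable(hist, v, width):
    """Sum the counts of all cells at or to the bottom-right of v's cell."""
    cx, cy = v[0] // width, v[1] // width
    return sum(n for (x, y), n in hist.items() if x >= cx and y <= cy)

def exact_reachable(dots, v):
    """Exact count of dots in the bottom-right region of v (v itself included)."""
    return sum(1 for s, e in dots if s >= v[0] and e <= v[1])

dots = [(1, 5), (2, 4), (3, 3), (0, 1)]
hist = build_histogram(dots, width=2)
print(estimate_reachable(hist, (1, 5), width=2), exact_reachable(dots, (1, 5)))
```

Counting boundary cells in full overestimates here (4 against an exact 3), which is one motivation for treating different classes of cells with different estimation rules.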
2-Dimensional Histogram -- Observations
Data dots are always above the diagonal line
Data dots are often skewed towards the diagonal line
◦ This is consistent with an observation from XML research
There are different types of cells w.r.t. a query → there should be different estimation rules
Our 2-Dimensional Histograms
More histogram/structure in a cell
Different estimation rules for different classes of cells
Schematics of Our Estimation
Estimation Details that have been Skipped in this Talk
A top-down recursive estimation algorithm based on the (syntactic) structure of twigs
Details on handling branches
◦ A bottom-up recursive algorithm
estimate_intermediate: generating next query dots
estimate_count: generating count from a query dot
top_down (very briefly)
A rule in estimate_intermediate
Illustration of estimation rules in estimate_intermediate
A rule in estimate_count
Query-dot generation
Compress f and f^-1
Generate query dots in the compressed domain in one scan
1. They can be large, sometimes
2. Many query dots have 0 count
Experiments
Datasets
◦ XMark; DBLP; Treebank.05
Queries
◦ Skewed queries based on the tags’ popularities
Optimizations
◦ Used all optimizations unless specified otherwise
Error Metrics
◦ Relative error: from XSketch/TreeSketch, $\frac{1}{n}\sum \frac{|est - real|}{real}$
◦ Root Mean Square Error (RMSE): from XSeed, $\sqrt{\frac{\sum (est - real)^2}{n}}$
◦ Normalized RMSE: from XSeed, $\mathrm{RMSE} \,/\, \frac{\sum real}{n}$
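The three metrics named on this slide can be computed as follows. This is a sketch under assumptions: relative error is averaged over the n queries, and NRMSE divides RMSE by the mean true count; the sample numbers are invented for illustration and are not experimental results.

```python
import math

def relative_error(est, real):
    """Mean of |est - real| / real over all queries."""
    return sum(abs(e - r) / r for e, r in zip(est, real)) / len(real)

def rmse(est, real):
    """Root mean square error over all queries."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(est, real)) / len(real))

def nrmse(est, real):
    """RMSE normalized by the mean true count."""
    return rmse(est, real) / (sum(real) / len(real))

est = [110, 90]     # hypothetical estimated counts
real = [100, 100]   # hypothetical true counts
print(relative_error(est, real), rmse(est, real), nrmse(est, real))
```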
Our Est. Error (relative error)
Our Est. Error (RMSE & NRMSE)
Ours vs. XSeed
              RMSE               NRMSE
XMark         7.1 times better   6.9 times better
Treebank.05   6.8 times better   6.8 times better
Ours vs. XSketch/TreeSketch (indirect)
XSketch focuses on path queries on cyclic graphs and keeps the error under 10%
TreeSketch focuses on twig queries on trees and keeps the error under 5%
Our Est. Time on XMark Graph
Performance of C1P Matrix Transformation Optimization
Query Dot Gen. Optimization
Conclusions
This is the first work on selectivity estimation of twig queries on cyclic graphs
We propose a new graph representation technique
◦ Extend prime labeling to cyclic graphs
◦ Transform prime labeling to a C1P matrix for summarization
We extend the 2-dimensional histogram selectivity estimation technique to cyclic graphs
Experimental results show that we outperform previous work
◦ Our ~1.3% error vs. XSketch/TreeSketch’s 5% error
◦ Errors are at least 6.8 times smaller than XSeed’s
Future Works
Incorporating this technique with estimation on
◦ Data values
◦ Queries with negations
External implementation
◦ For quick implementation, we put almost all data structures in main memory
Estimation performance guarantees