Selectivity Estimation of Twig Queries on Cyclic Graphs Department of Computer Science Hong Kong Baptist University Speaker: Byron Choi Joint work with *Yun Peng and Jianliang Xu (to appear ICDE 2011) March 21 2011 @ COMP630Q


Selectivity Estimation of Twig Queries on Cyclic Graphs

Department of Computer Science

Hong Kong Baptist University

Speaker: Byron Choi

Joint work with *Yun Peng and Jianliang Xu

(to appear ICDE 2011)

March 21 2011 @ COMP630Q


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


Graph Data is Ubiquitous


Navigational Queries

• SELECT a set of nodes via a user-specified path
  ◦ //person[//open_auction//person]
  ◦ //: ancestor-descendant axes; CONNECT in logic (reachability tests)
• There are evidently many other query formalisms


Selectivity Estimation

A classical problem: Given a query, estimate the count of the results efficiently

• Requirements
  ◦ Accurate
  ◦ Efficient estimation time
  ◦ Small overhead in terms of size


XMark, used in this Presentation


Selectivity Estimation (cont’)

Query optimizers rely on the counts to evaluate the costs of query plans

• Example:
  ◦ XMark 1.0 (> 180,000 nodes)
  ◦ Query: //person[//open_auction//person]
  ◦ 25,500 person nodes
  ◦ 12,000 open_auction nodes
  ◦ 13,192 matches of open_auction//person
  ◦ //open_auction → //person → ↑↑person


Problem Statement

• Data: a rooted directed labeled graph (i.e., possibly cyclic)
• Query: twig queries (i.e., parent-child and ancestor-descendant axes and branches)
• Problem statement: given a cyclic graph G and a twig Q, estimate the result count of Q on G.

[Figure: example Graph G with vertices labeled department, faculty, name, RA, and TA, next to a Twig Query Q over department, faculty, RA, and TA]


Our Position Relative to the Current State-of-the-Art

[Figure: chart with axes Graph Complexity (Tree (XML), Cyclic Graph) and Query Complexity (Path Query, Twig Query), positioning XSketch ’06, XSeed ’06, TreeSketch ’06, CST ’04, XPathLearner ’02, DataGuide ’99, and Our Work]


Related Work (Graph-based approaches)

• DataGuide – automata theories
  ◦ J. McHugh and J. Widom. Query optimization for XML. In VLDB, pages 315–326, 1999.
• TreeSketch and XSketch – bisimulation
  ◦ N. Polyzotis and M. Garofalakis. XSketch synopses for XML data graphs. ACM Trans. Database Syst., 31(3):1014–1063, 2006.
  ◦ N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate XML query answers. In SIGMOD, pages 263–274, 2004.
• Correlated Subpath Tree (CST)
  ◦ Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. Counting twig matches in a tree. In ICDE, pages 595–604, 2001.
• Straight-Line Grammar
  ◦ D. K. Fisher and S. Maneth. Structural selectivity estimation for XML documents. In ICDE, pages 626–635, 2007.


Related Work (Relational approaches)

• 2-dimensional histograms on TREEs
  ◦ Y. Wu, J. M. Patel, and H. V. Jagadish. Using histograms to estimate answer sizes for XML queries. Inf. Syst., 28(1-2):33–59, 2003.
• Hidden Markov Model
  ◦ A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In VLDB, pages 591–600, 2001.
• A novel bloom filter – two 1-dimensional histograms
  ◦ W. Wang, H. Jiang, H. Lu, and J. X. Yu. Bloom histogram: Path selectivity estimation for XML data with updates. In VLDB, pages 240–251, 2004.


Technical Challenges

Interactions between cyclic graphs and recursions (i.e. //) in twig queries

Branches of twigs


A Typical Framework

[Figure: framework diagram with the labels Graph, Graph Representation, Summarization of Graph’s Representation, Selectivity, Query, Summarization technique, and Selectivity estimation technique]

• Previous approaches differ from one another in one or more of these steps

• We also follow this general framework


Framework – with Our Solution Now

Graph Representation

Summarization of Graph Rep.

Selectivity Estimation


Summary of Contributions

1. Cyclic graph representation
  ◦ Prime labeling (vs. other representations)
  ◦ Matrix representation of prime labeling
  ◦ Matrix transformation to C1P matrix
2. Summarization of graph’s representation
  ◦ 2-dimensional histogram for cyclic graph
3. Algorithms for selectivity estimation


Characteristics of our Contributions

• Matrix representation of cyclic graphs
  ◦ Reuse some research from matrices
• Histogram-based selectivity estimation
  ◦ No uniform distribution assumption
• One data node/vertex – one 2-dimensional point
• One query step (child or descendant) – multiple 2-dimensional points


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


Alternative Representations for Cyclic Graphs

• Adjacency matrix/list
  ◦ Easy to construct
  ◦ Inefficient in determining ancestors/descendants
• Transitive closure
  ◦ Efficient in determining ancestors/descendants
  ◦ Inefficient in terms of space
• Prime labeling
  ◦ Smaller than transitive closure but larger than adjacency matrix
  ◦ Query efficiency better than adjacency matrix but worse than transitive closure


Prime Labeling

• Originally proposed for tree data [X. Wu, ICDE’04]
  ◦ To provide an update-friendly XML index for reachability tests
• Later extended to DAGs [G. Wu, DASFAA’06]
  ◦ Each vertex is assigned a prime number
• Our extension to cyclic graphs
  ◦ Applied to cyclic graphs
  ◦ Reduced labeling size further: not every vertex is labeled with a unique prime number → smaller than G. Wu et al.


Prime Labeling (con’t)

Large prime numbers near the root of the graph

• assign each leaf vertex a prime number

• assign each intermediate vertex the product of the labels of its children

• label the root


Prime Labeling (our Def.)

G. Wu et al

Yun Peng


Querying with Prime Labeling

• Reachability ≡ Divisibility
  ◦ c → d: 7 * 11 * 3 / 3 = 7 * 11 (divisible, so d is reachable from c)
  ◦ c → e: 7 * 11 * 3 / 5 = 46.2 (not divisible, so e is not reachable from c)
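The divisibility test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the labels for c, d, and e are read off the example above, and the function name `reachable` is hypothetical, not taken from the paper.

```python
# Labels taken from the example on this slide: c = 7 * 11 * 3, d = 3, e = 5.
LABELS = {"c": 7 * 11 * 3, "d": 3, "e": 5}

def reachable(src, dst, labels=LABELS):
    """Return True if the label of dst divides the label of src.

    Under prime labeling, this divisibility test stands in for a
    reachability test from src to dst.
    """
    return labels[src] % labels[dst] == 0

if __name__ == "__main__":
    print(reachable("c", "d"))  # 231 is divisible by 3
    print(reachable("c", "e"))  # 231 is not divisible by 5
```

The labels grow quickly as products accumulate, which is one reason a 0/1 matrix over the primes is a convenient alternative encoding.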


Matrix Representation

Columns: prime numbers. Rows: vertices.

     3  5  7  11
a    1  1  1  1
b    1  1  0  0
c    1  0  1  1
d    1  0  0  0
e    0  1  0  0
f    0  0  1  0
g    1  0  0  1

Reachability: divisibility ≡ logical operations

Experiments: often just a constant factor smaller than the adjacency matrix!
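A minimal sketch of how divisibility turns into a logical operation on matrix rows, assuming each row is packed into an integer bitmask. The rows are copied from the matrix above; the helper names `to_mask`, `label`, and `reachable` are illustrative assumptions.

```python
PRIMES = [3, 5, 7, 11]
ROWS = {
    "a": [1, 1, 1, 1],
    "b": [1, 1, 0, 0],
    "c": [1, 0, 1, 1],
    "d": [1, 0, 0, 0],
    "e": [0, 1, 0, 0],
    "f": [0, 0, 1, 0],
    "g": [1, 0, 0, 1],
}

def to_mask(bits):
    """Pack a 0/1 row into an integer bitmask."""
    mask = 0
    for bit in bits:
        mask = (mask << 1) | bit
    return mask

def label(vertex):
    """Prime label of a vertex: product of the primes marked in its row."""
    product = 1
    for prime, bit in zip(PRIMES, ROWS[vertex]):
        if bit:
            product *= prime
    return product

def reachable(src, dst):
    """dst's row is a subset of src's row, i.e. label(dst) divides label(src)."""
    s, d = to_mask(ROWS[src]), to_mask(ROWS[dst])
    return (s & d) == d
```

Because every label here is a product of distinct primes, the bitwise subset test and the divisibility test agree on every pair of vertices.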


Where are we?

• Experiments from XMark
  ◦ It is just a constant factor smaller than the adjacency matrix
  ◦ How on earth would this be summarized?


Consecutive Ones Property (C1P)

A Consecutive Ones Matrix (C1P matrix) is a 0/1 matrix, in which 1s of each row are consecutive.

Since 1s are consecutive, each row of a C1P matrix can be summarized by an interval: [start column id of 1s, end column id of 1s]

One row → One vertex → One interval

Non-consecutive ones matrix:

r1   1 1 1 0 1 1
r2   0 0 1 1 1 1

Consecutive ones matrix (column ids start at 0):

r1   0 1 1 1 1 1   → [1,5]
r2   0 0 1 1 1 0   → [2,4]
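The row-to-interval summarization can be sketched as follows. The function name `row_to_interval` is an assumption for illustration, and column ids are taken to start at 0, which matches the intervals [1,5] and [2,4] shown above.

```python
def row_to_interval(row):
    """Summarize a 0/1 row whose 1s are consecutive as (start, end).

    Column ids are 0-based. Returns None for an all-zero row and raises
    ValueError if the 1s are not consecutive.
    """
    ones = [i for i, bit in enumerate(row) if bit == 1]
    if not ones:
        return None
    start, end = ones[0], ones[-1]
    if end - start + 1 != len(ones):
        raise ValueError("row does not have consecutive ones")
    return (start, end)

if __name__ == "__main__":
    print(row_to_interval([0, 1, 1, 1, 1, 1]))
    print(row_to_interval([0, 0, 1, 1, 1, 0]))
```

A row such as `1 1 1 0 1 1` from the non-consecutive matrix is rejected, since no single interval describes it.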


What do we get from C1P?


What do we get from C1P? (cont’)

• Adopting a property of intervals
  ◦ Vertex w is reachable from vertex v if w lies within the bottom-right region of v on the plane
  ◦ For example, dot (2,4) is in the bottom-right part of dot (1,5), so r2 is reachable from r1

r1   0 1 1 1 1 1   → [1,5] → dot r1(1,5)
r2   0 0 1 1 1 0   → [2,4] → dot r2(2,4)
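Read as interval containment, the bottom-right rule is a two-comparison test. The sketch below assumes a dot is a `(start, end)` pair with start on the horizontal axis and end on the vertical axis, and that boundaries are inclusive; both are assumptions made for illustration.

```python
def reachable(v, w):
    """Return True if dot w lies in the bottom-right region of dot v.

    "Right" means w starts at or after v starts; "bottom" means w ends
    at or before v ends, so the interval of w is contained in that of v.
    """
    return w[0] >= v[0] and w[1] <= v[1]

if __name__ == "__main__":
    r1, r2 = (1, 5), (2, 4)
    print(reachable(r1, r2))  # r2 is reachable from r1
    print(reachable(r2, r1))  # but not the other way round
```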


Complexities related to C1P

• C1P matrix detection
  ◦ Solvable in linear time [Hsu, Algorithms’02]
• Transforming a non-C1P matrix to a C1P matrix
  ◦ NP-hard [Tan, Algorithmica’07]
  ◦ No polynomial-time approximation [Tan, Algorithmica’07]


Our Heuristic Algorithm

Main idea: given any m × n matrix with r 1s, extract C1P submatrices (using the C1P matrix detection algorithm) and then concatenate them one by one

Time complexity: $O(m^2 (m + n + r))$


Pseudo-code of the Matrix Transformation

Extract a submatrix for this iteration

Adding one row at a time

C1P detection – linear time

Transform to C1P – linear time
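The annotations above can be read as a greedy loop: start a submatrix, add one row at a time, and keep a row only if the submatrix still has the consecutive ones property. The sketch below is a loose illustration of that reading, not the authors' algorithm. In particular, the brute-force search over column permutations in `find_c1p_order` is a stand-in for the linear-time detection and transformation steps and is only usable on tiny matrices; all function names are hypothetical.

```python
from itertools import permutations

def ones_are_consecutive(bits):
    """True if the 1s in a 0/1 sequence form one contiguous run."""
    ones = [i for i, b in enumerate(bits) if b]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def find_c1p_order(rows, n_cols):
    """Brute force: a column order giving every row consecutive 1s, or None."""
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[c] for c in order]) for row in rows):
            return list(order)
    return None

def greedy_c1p_groups(matrix):
    """Split row indices into groups, each with its own C1P column order."""
    n_cols = len(matrix[0])
    remaining = list(range(len(matrix)))
    groups = []
    while remaining:
        # Extract one submatrix per iteration, adding one row at a time.
        group, order = [], list(range(n_cols))
        for r in remaining:
            candidate = find_c1p_order([matrix[i] for i in group + [r]], n_cols)
            if candidate is not None:
                group.append(r)
                order = candidate
        remaining = [r for r in remaining if r not in group]
        groups.append((group, order))
    return groups
```

Each group carries its own column order, so a column of the original matrix can end up at a different position in each extracted submatrix.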


Optimizations for C1P Trans.

• Horizontal matrix decomposition prior to the C1P heuristic
  ◦ Time complexity of the heuristic: $O(m^2 (m + n + r))$
  ◦ Use the three-sigma rule on the number of 1s in rows
• Common pattern extraction
  ◦ Done by an intersection of the rows
• Compressed (extensible) hash mappings
  ◦ One column in the original matrix may be mapped to multiple positions in a C1P matrix
  ◦ Support mapping operations in the compressed domain


What do we get from C1P? (Recall)


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


2-Dimensional Histogram

• Recall that we summarize the rows of a C1P matrix by intervals and then plot them as dots on the 2-d plane
• The plane is divided into cells. For each cell, we record the number of dots located within it.
• Given a vertex v, the set of vertices reachable from v must be located in the bottom-right part of v
• Sum up the counts of the cells located in the bottom-right part of v, e.g., 1 + 2 = 3
• We build a 2-dimensional histogram for each kind of node

[Figure: 2-dimensional histogram whose cells hold the counts 1, 2, 1, 1]
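A minimal sketch of the cell-counting idea, assuming square cells of a fixed width and dots given as `(start, end)` pairs. The names `build_histogram` and `estimate_reachable` and the sample dots are assumptions for illustration. The estimate simply sums every cell at or to the bottom-right of the query cell, so it counts the query dot itself and can overestimate in boundary cells.

```python
from collections import Counter

def build_histogram(points, cell_width):
    """Count dots per cell; a cell is identified by (start // w, end // w)."""
    return Counter((s // cell_width, e // cell_width) for s, e in points)

def estimate_reachable(hist, v, cell_width):
    """Sum the counts of the cells at or to the bottom-right of v's cell."""
    qx, qy = v[0] // cell_width, v[1] // cell_width
    return sum(n for (cx, cy), n in hist.items() if cx >= qx and cy <= qy)

if __name__ == "__main__":
    dots = [(1, 5), (2, 4), (3, 3), (4, 5)]  # hypothetical interval dots
    hist = build_histogram(dots, cell_width=2)
    print(estimate_reachable(hist, (2, 4), cell_width=2))
```

With these sample dots the estimate for (2, 4) is 3, while only two intervals are actually contained in [2, 4]; finer rules for boundary cells would be needed to close that kind of gap.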


2-Dimensional Histogram -- Observations

• Data dots are always above the diagonal line
• Data dots are often skewed towards the diagonal line
  ◦ This is consistent with an observation from XML research
• There are different types of cells w.r.t. a query → there should be different estimation rules


Our 2-Dimensional Histograms

• More histogram/structure in a cell
• Different estimation rules for different classes of cells


Schematics of Our Estimation


Estimation Details that have been Skipped in this Talk

• A top-down recursive estimation algorithm based on the (syntactic) structure of twigs
• Details on handling branches
  ◦ A bottom-up recursive algorithm
• estimate_intermediate: generating next query dots
• estimate_count: generating count from a query dot


top_down (very briefly)


A rule in estimate_intermediate


Illustration of estimation rules in estimate_intermediate


A rule in estimate_count


Query-dot generation

• Compress f and f^-1
• Generate query dots in the compressed domain in one scan

1. They can be large, sometimes
2. Many query dots have 0 count


Agenda

• Background
• Problem Statement
• Overview of our Framework
• Matrix and its Transformations
• Histograms and Estimation
• Experimental Evaluation
• Future Works


Experiments

• Datasets
  ◦ XMark; DBLP; Treebank.05
• Queries
  ◦ Skewed queries based on the tags’ popularities
• Optimizations
  ◦ Used all optimizations unless specified otherwise


Error Metrics

◦ Relative error (from XSketch/TreeSketch): $\frac{1}{n} \sum \frac{|est. - real|}{real}$

◦ Root Mean Square Error (RMSE) (from XSeed): $\sqrt{\frac{\sum (est. - real)^2}{n}}$

◦ Normalized RMSE (NRMSE) (from XSeed): $\frac{RMSE}{\sum real / n}$
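The three metrics can be computed as in the sketch below. It assumes that each metric is taken over n queries, that relative error is averaged per query, and that NRMSE divides RMSE by the mean real count; the function names and the sample numbers are illustrative, not results from the talk.

```python
import math

def relative_error(est, real):
    """Mean of |est - real| / real over all queries."""
    return sum(abs(e - r) / r for e, r in zip(est, real)) / len(real)

def rmse(est, real):
    """Root mean square error over all queries."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(est, real)) / len(real))

def nrmse(est, real):
    """RMSE normalized by the mean real count."""
    return rmse(est, real) / (sum(real) / len(real))

if __name__ == "__main__":
    est, real = [110, 90], [100, 100]  # made-up counts for two queries
    print(relative_error(est, real), rmse(est, real), nrmse(est, real))
```

Relative error weights every query equally, whereas RMSE and NRMSE are dominated by queries with large absolute deviations, which is why the two families of metrics are reported separately.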


Our Est. Error (relative error)


Our Est. Error (RMSE & NRMSE)


Ours vs. XSeed

              RMSE               NRMSE
XMark         7.1 times better   6.9 times better
Treebank.05   6.8 times better   6.8 times better


Ours vs. XSketch/TreeSketch (indirect)

XSketch focuses on path queries on cyclic graphs and keeps the error under 10%

TreeSketch focuses on twig queries on trees and keeps the error under 5%


Our Est. Time on XMark Graph


Performance of C1P Matrix Transformation Optimization


Query Dot Gen. Optimization


Conclusions

• This is the first work on selectivity estimation of twig queries on cyclic graphs
• We propose a new graph representation technique
  ◦ Extend prime labeling to cyclic graphs
  ◦ Transform prime labeling to a C1P matrix for summarization
• We extend the 2-dimensional histogram selectivity estimation technique to cyclic graphs
• Experimental results show that we outperform previous works
  ◦ Our ~1.3% error vs. XSketch/TreeSketch’s 5% error
  ◦ Errors are at least 6.8 times smaller than XSeed’s


Future Works

• Incorporating this technique with estimation on
  ◦ Data values
  ◦ Queries with negations
• External implementation
  ◦ For quick implementation, we put almost all data structures in main memory
• Estimation performance guarantees
