
Selectivity Estimation of Twig Queries on Cyclic Graphs

Department of Computer Science

Hong Kong Baptist University

Speaker: Byron Choi

Joint work with *Yun Peng and Jianliang Xu

(to appear in ICDE 2011)

March 21 2011 @ COMP630Q

Agenda

Background
Problem Statement
Overview of our Framework
Matrix and its Transformations
Histograms and Estimation
Experimental Evaluation
Future Works

Graph Data is Ubiquitous

Navigational Queries

SELECT a set of nodes via a user-specified path
◦ //person[//open_auction//person]
◦ // denotes ancestor-descendant axes

CONNECT in logic (reachability tests)

There are evidently many other query formalisms

Selectivity Estimation

A classical problem: Given a query, estimate the count of the results efficiently

Requirements
◦ accurate
◦ efficient estimation time
◦ small overhead in terms of size

XMark, used in this Presentation

Selectivity Estimation (cont’d)

Query optimizers rely on the counts to evaluate the costs of query plans

Example:
◦ XMark 1.0 (> 180,000 nodes)
◦ Query: //person[//open_auction//person]
◦ 25,500 person’s
◦ 12,000 open_auction’s
◦ 13,192 open_auction//person’s
◦ //open_auction → //person → ↑↑person

Problem Statement

Data: A rooted directed labeled graph (i.e., possibly cyclic)
Query: Twig queries (i.e., parent-child and ancestor-descendant axes and branches)
Problem statement: given a cyclic graph G and a twig Q, estimate the result count of Q on G.

[Figure: Graph G (vertices labeled department, faculty, name, RA, TA) and Twig Query Q (department, faculty, RA, TA)]
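To make the quantity being estimated concrete, here is a minimal brute-force sketch in Python that computes an exact count on a tiny cyclic graph. Everything in it is an assumption made for illustration: the vertex ids, the TA → department back edge that creates the cycle, the tuple encoding of twigs, and the choice to count the vertices matching the twig root. It is not the estimation method of the talk, only a reference point for what "result count" could mean.

```python
def descendants(adj, v):
    """All vertices reachable from v via one or more edges (safe on cycles)."""
    seen, stack = set(), list(adj.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj.get(u, ()))
    return seen

def matches(labels, adj, v, twig):
    """True if vertex v carries the twig node's label and satisfies every branch."""
    label, branches = twig
    if labels[v] != label:
        return False
    for axis, sub in branches:
        # "/" is the child axis, "//" the descendant axis.
        candidates = adj.get(v, ()) if axis == "/" else descendants(adj, v)
        if not any(matches(labels, adj, u, sub) for u in candidates):
            return False
    return True

def count_root_matches(labels, adj, twig):
    """Number of vertices that match the root of the twig (an assumed semantics)."""
    return sum(matches(labels, adj, v, twig) for v in labels)

# Hypothetical graph: one department, three faculty vertices, and a back edge t3 -> d.
labels = {"d": "department", "f1": "faculty", "f2": "faculty", "f3": "faculty",
          "r1": "RA", "t1": "TA", "r2": "RA", "t3": "TA"}
adj = {"d": ["f1", "f2", "f3"], "f1": ["r1", "t1"], "f2": ["r2"],
       "f3": ["t3"], "t3": ["d"]}

child_twig = ("faculty", [("/", ("RA", [])), ("/", ("TA", []))])
desc_twig = ("faculty", [("//", ("RA", [])), ("//", ("TA", []))])
child_count = count_root_matches(labels, adj, child_twig)
desc_count = count_root_matches(labels, adj, desc_twig)
```

With the child axis only f1 qualifies; with the descendant axis f3 also qualifies because the cycle lets it reach an RA, which illustrates why cycles and // interact.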

Our Position Relative to the Current State-of-the-Art

[Figure: prior work positioned by Graph Complexity (Tree (XML) vs. Cyclic Graph) and Query Complexity (Path Query vs. Twig Query): XSketch ’06, XSeed ’06, TreeSketch ’06, CST ’04, XPathLearner ’02, DataGuide ’99, and Our Work]

Related Work (Graph-based approaches)

DataGuide – Automata theories
◦ J. McHugh and J. Widom. Query optimization for xml. In VLDB, pages 315–326, 1999.
TreeSketch and XSketch – Bisimulation
◦ N. Polyzotis and M. Garofalakis. Xsketch synopses for xml data graphs. ACM Trans. Database Syst., 31(3):1014–1063, 2006.
◦ N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate xml query answers. In SIGMOD, pages 263–274, 2004.
Correlated Subpath Tree (CST)
◦ Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. Counting twig matches in a tree. In ICDE, pages 595–604, 2001.
Straight-Line Grammar
◦ D. K. Fisher and S. Maneth. Structural selectivity estimation for xml documents. In ICDE, pages 626–635, 2007.
2-dimensional histograms on TREEs
◦ Y. Wu, J. M. Patel, and H. V. Jagadish. Using histograms to estimate answer sizes for xml queries. Inf. Syst., 28(1-2):33–59, 2003.
Hidden Markov Model
◦ A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the selectivity of xml path expressions for internet scale applications. In VLDB, pages 591–600, 2001.
A novel bloom filter – two 1-dimensional histograms
◦ W. Wang, H. Jiang, H. Lu, and J. X. Yu. Bloom histogram: Path selectivity estimation for xml data with updates. In VLDB, pages 240–251, 2004.

Related Work (Relational approaches)

Technical Challenges

Interactions between cyclic graphs and recursions (i.e. //) in twig queries

Branches of twigs

A Typical Framework

[Figure: Graph → Graph Representation → Summarization of Graph’s Representation → Selectivity, with a summarization technique producing the summary and a selectivity estimation technique applied to the Query]

• Previous approaches differ from one another in one or more steps

• We also follow this general framework

Framework – with Our Solution Now

Graph Representation

Summarization of Graph Rep.

Selectivity Estimation

Summary of Contributions

1. Cyclic graph representation
◦ Prime labeling (vs. other representations)
◦ Matrix representation of prime labeling
◦ Matrix transformation to C1P matrix
2. Summarization of graph’s representation
◦ 2-dimensional histogram for cyclic graph
3. Algorithms for selectivity estimation

Characteristics of our Contributions

Matrix representation of cyclic graphs
◦ Reuse some research from matrices
Histogram-based selectivity estimation
◦ No uniform distribution assumption
One data node/vertex – one 2-dimensional point
One query step (child or descendant) – multiple 2-dimensional points



Alternative Representations for Cyclic Graphs

Adjacency matrix/list
◦ Easy to construct
◦ Inefficient in determining ancestors/descendants
Transitive closure
◦ Efficient in ancestors/descendants
◦ Inefficient in terms of space
Prime labeling
◦ Smaller than transitive closure but larger than adjacency matrix
◦ Query efficiency better than adjacency matrix but worse than transitive closure

…

Prime Labeling

Originally proposed for tree data [X. Wu, ICDE’04]
◦ To provide an update-friendly XML index for reachability tests
Later extended to DAGs [G. Wu, DASFAA’06]
◦ Each vertex is assigned a prime number
Our extension to cyclic graphs
◦ Applied to cyclic graphs
◦ Reduced labeling size further: not every vertex is labeled with a unique prime number → smaller than G. Wu et al.

Prime Labeling (cont’d)

Large prime numbers near the root of the graph

• assign each leaf vertex a prime number

• assign each intermediate vertex the product of the labels of its children

• label the root

Prime Labeling (our Def.)

G. Wu et al

Yun Peng

Querying with Prime Labeling

Reachability ≡ Divisibility
◦ c → d: 7 * 11 * 3 / 3 = 7 * 11 (an integer, so d is reachable from c)
◦ c → e: 7 * 11 * 3 / 5 = 46.2 (not an integer, so e is not reachable from c)
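The divisibility test can be sketched in a few lines of Python. The labels below are the ones implied by the example (c = 3 · 7 · 11, d = 3, e = 5); the `labels` dictionary and the `reachable` helper are hypothetical names introduced here, not the authors' implementation.

```python
# Labels implied by the example: c = 3 * 7 * 11 = 231, d = 3, e = 5.
labels = {"c": 3 * 7 * 11, "d": 3, "e": 5}

def reachable(src: str, dst: str) -> bool:
    """dst is reachable from src iff the label of dst divides the label of src."""
    return labels[src] % labels[dst] == 0

c_reaches_d = reachable("c", "d")  # 231 / 3 = 77, an integer
c_reaches_e = reachable("c", "e")  # 231 / 5 = 46.2, not an integer
```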

Matrix Representation

Columns: Prime numbers; Rows: Vertices

     3   5   7   11
a    1   1   1   1
b    1   1   0   0
c    1   0   1   1
d    1   0   0   0
e    0   1   0   0
f    0   0   1   0
g    1   0   0   1

Reachability: Divisibility ≡ Logic ops

Experiments: Often just a constant factor smaller than the adjacency matrix!
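A small Python sketch of how divisibility turns into a logic operation on the matrix rows. It assumes that the seven rows of the example correspond to vertices a to g in order and that every label is a product of distinct primes, so that "label(w) divides label(v)" is the same as "the 1s of row w are a subset of the 1s of row v". The names `ROWS`, `to_mask`, `reachable` and `descendants` are illustrative only.

```python
PRIMES = (3, 5, 7, 11)  # column order of the example matrix

# Rows of the example matrix, assumed to map to vertices a..g in order.
ROWS = {
    "a": (1, 1, 1, 1),
    "b": (1, 1, 0, 0),
    "c": (1, 0, 1, 1),
    "d": (1, 0, 0, 0),
    "e": (0, 1, 0, 0),
    "f": (0, 0, 1, 0),
    "g": (1, 0, 0, 1),
}

def to_mask(row):
    """Pack a 0/1 row into an integer bitmask (column i becomes bit i)."""
    mask = 0
    for i, bit in enumerate(row):
        if bit:
            mask |= 1 << i
    return mask

MASKS = {v: to_mask(r) for v, r in ROWS.items()}

def reachable(src, dst):
    """dst is reachable from src iff dst's primes are a subset of src's primes."""
    return (MASKS[src] & MASKS[dst]) == MASKS[dst]

def descendants(src):
    """All other vertices whose row is covered by the row of src."""
    return sorted(v for v in MASKS if v != src and reachable(src, v))
```

A single AND plus a comparison replaces the big-integer division, which is why the matrix form is convenient even though it is still large.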

Where are we?

Experiments from XMark
◦ It is just a constant factor smaller than adjacency matrix
◦ How on earth would this be summarized?

Consecutive Ones Property (C1P)

A Consecutive Ones Matrix (C1P matrix) is a 0/1 matrix, in which 1s of each row are consecutive.

Since 1s are consecutive, each row of a C1P matrix can be summarized by an interval: [start column id of 1s, end column id of 1s]

One row → One vertex → One interval

non-consecutive ones matrix
r1: 1 1 1 0 1 1
r2: 0 0 1 1 1 1

consecutive ones matrix
r1: 0 1 1 1 1 1 → [1,5]
r2: 0 0 1 1 1 0 → [2,4]

What do we get from C1P?

What do we get from C1P? (cont’d)

Adopting a property of intervals
◦ Vertex w is reachable from vertex v if w lies within the bottom-right region of v on the plane
◦ For example, dot (2,4) is in the bottom-right region of dot (1,5), so r2 is reachable from r1

r1: 0 1 1 1 1 1 → [1,5] → dot r1(1,5)
r2: 0 0 1 1 1 0 → [2,4] → dot r2(2,4)
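The row-to-interval step and the bottom-right test can be sketched as follows. This is a minimal Python illustration assuming 0-based column ids (which is what the intervals [1,5] and [2,4] in the example imply); `row_to_interval` and `reachable` are hypothetical helper names.

```python
def row_to_interval(row):
    """Return (start, end) column ids of the 1s, or None if the 1s are not consecutive."""
    ones = [i for i, bit in enumerate(row) if bit]
    if not ones:
        return None
    start, end = ones[0], ones[-1]
    if end - start + 1 != len(ones):
        return None  # a 0 interrupts the run of 1s
    return (start, end)

def reachable(v, w):
    """w lies in the bottom-right region of v: start_w >= start_v and end_w <= end_v."""
    return w[0] >= v[0] and w[1] <= v[1]

r1 = row_to_interval([0, 1, 1, 1, 1, 1])     # (1, 5)
r2 = row_to_interval([0, 0, 1, 1, 1, 0])     # (2, 4)
broken = row_to_interval([1, 1, 1, 0, 1, 1])  # None: not a consecutive run
```

Plotting each interval as the point (start, end) makes "contained in the interval of v" equivalent to "to the right of and below v", which is the property used in the rest of the talk.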

Complexities related to C1P

C1P matrix detection
◦ Linear time solvable [Hsu, Algorithms’02]
Transform a non-C1P matrix to a C1P matrix
◦ NP-hard [Tan, Algorithmica’07]
◦ No polynomial time approximation [Tan, Algorithmica’07]

Our Heuristic Algorithm

Main Idea: given any m × n matrix with r 1s, extract C1P submatrices (using the C1P matrix detection algorithm) and then concatenate them one by one

Time complexity: $O(m^2 (m + n + r))$

Pseudo-code of the Matrix Transformation

Extract a submatrix for this iteration

Adding one row at a time

C1P detection – linear time

Transform to C1P – linear time

Optimizations for C1P Trans.

Horizontal matrix decomposition prior to C1P heuristics
◦ $O(m^2 (m + n + r))$
◦ Use the 3 sigmas rule on the number of 1’s in rows
Common pattern extraction
◦ Done by an intersection of the rows
Compressed (extensible) hash mappings
◦ One column in the original matrix may be mapped to multiple positions in a C1P matrix
◦ Support mapping ops in the compressed domain

What do we get from C1P? (Recall)



2-Dimensional Histogram

Recall we summarize rows of a C1P matrix by intervals and then plot them as dots on the 2-d plane
The plane is divided into cells. For each cell, we record the number of dots located within it.
Given a vertex v, the set of vertices reachable from v must be located in the bottom-right region of v
Sum up the counts of the cells located in the bottom-right region of v, e.g., 1+2 = 3

[Figure: 2-d plane divided into cells with dot counts 1, 2, 1, 1]

We build a 2-dimensional histogram for each kind of node
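The cell-counting idea can be sketched in Python as below. This is a deliberately simplified illustration, not the histogram of the talk: it assumes a uniform grid of square cells, counts every cell that touches the bottom-right region in full (so it can overestimate), and includes the query vertex itself. The dots, the cell width and the function names are made up for the example.

```python
from collections import Counter

def build_histogram(dots, width):
    """Count dots (start, end) per square cell of the given width."""
    hist = Counter()
    for start, end in dots:
        hist[(start // width, end // width)] += 1
    return hist

def estimate_reachable(hist, v, width):
    """Sum the counts of all cells that touch the bottom-right region of v."""
    cx, cy = v[0] // width, v[1] // width
    return sum(n for (x, y), n in hist.items() if x >= cx and y <= cy)

# Hypothetical dots, one per vertex, as (start, end) intervals.
dots = [(1, 5), (2, 4), (3, 3), (6, 9), (7, 8), (0, 9)]
hist = build_histogram(dots, width=5)

est_a = estimate_reachable(hist, (6, 9), width=5)  # exact here: 2 dots
est_b = estimate_reachable(hist, (1, 5), width=5)  # 6, while only 3 dots truly qualify
```

The second query shows the weakness of treating all cells alike: cells only partly covered by the region are counted in full, which motivates using different rules for different kinds of cells.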

2-Dimensional Histogram -- Observations

Data dots are always on or above the diagonal line
Data dots are often skewed towards the diagonal line
◦ This is consistent with an observation from XML research
There are different types of cells w.r.t. a query → there should be different estimation rules

Our 2-Dimensional Histograms

More histogram/structure in a cell
Different estimation rules for different classes of cells

Schematics of Our Estimation

Estimation Details that have been Skipped in this Talk

A top-down recursive estimation algorithm based on the (syntactic) structure of twigs
Details on handling branches
◦ A bottom-up recursive algorithm
estimate_intermediate: generating next query dots
estimate_count: generating count from a query dot

top_down (very briefly)

A rule in estimate_intermediate

Illustration of estimation rules in estimate_intermediate

A rule in estimate_count

Query-dot generation

Compress f and f^-1
Generate query dots in the compressed domain in one scan
1. They can be large, sometimes
2. Many query dots have 0 count



Experiments

Datasets
◦ XMark; DBLP; Treebank.05
Queries
◦ Skewed queries based on the tags’ popularities
Optimizations
◦ Used all optimizations unless specified otherwise

Error Metrics

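◦ Relative error: from XSketch/TreeSketch
$$\frac{1}{n} \sum \frac{|\mathit{est.} - \mathit{real}|}{\mathit{real}}$$
◦ Root Mean Square Error (RMSE): from XSeed
$$\sqrt{\frac{\sum (\mathit{est.} - \mathit{real})^2}{n}}$$
◦ Normalized RMSE: from XSeed
$$\frac{\mathit{RMSE}}{\sum \mathit{real} / n}$$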
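For readers who want to reproduce the three metrics, here is a minimal Python sketch. It assumes the usual definitions: relative error is the mean of |est − real| / real over the n queries, RMSE is the square root of the mean squared difference, and NRMSE is RMSE divided by the mean real count. The sample numbers are invented and real counts are assumed to be non-zero.

```python
import math

def relative_error(est, real):
    """Mean of |est - real| / real over all queries (real counts assumed non-zero)."""
    return sum(abs(e - r) / r for e, r in zip(est, real)) / len(real)

def rmse(est, real):
    """Root mean square error between estimated and real counts."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(est, real)) / len(real))

def nrmse(est, real):
    """RMSE normalized by the mean of the real counts."""
    return rmse(est, real) / (sum(real) / len(real))

# Invented sample: three queries with estimated and real result counts.
est = [90, 220, 50]
real = [100, 200, 50]
rel = relative_error(est, real)
err = rmse(est, real)
norm = nrmse(est, real)
```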

Our Est. Error (relative error)

Our Est. Error (RMSE & NRMSE)

Ours vs. XSeed

              RMSE               NRMSE
XMark         7.1 times better   6.9 times better
Treebank.05   6.8 times better   6.8 times better

Ours vs. XSketch/TreeSketch (indirect)

XSketch focuses on path queries on cyclic graphs and keeps the error under 10%

TreeSketch focuses on twig queries on trees and keeps the error under 5%

Our Est. Time on XMark Graph

Performance of C1P Matrix Transformation Optimization

Query Dot Gen. Optimization

Conclusions

This is the first work on selectivity estimation of twig queries on cyclic graphs
We propose a new graph representation technique
◦ Extend prime labeling to cyclic graphs
◦ Transform prime labeling to a C1P matrix for summarization
We extend the 2-dimensional histogram selectivity estimation technique to cyclic graphs
Experimental results show that we outperform previous works
◦ Our ~1.3% error vs. XSketch/TreeSketch’s 5% error
◦ Errors are at least 6.8 times smaller than XSeed

Future Works

Incorporating this technique with estimation on
◦ Data values
◦ Queries with negations
External implementation
◦ For quick implementation, we put almost all data structures in main memory
Estimation performance guarantees
