Selectivity Estimation of XPath for Cyclic Graphs

Yun Peng

Outline

Motivation
Problem definition
Prime number labeling
Selectivity estimation
Implementation

Motivation

Selectivity estimation is one of the most important query optimization techniques for retrieving subgraphs from large graph databases efficiently

An Example

Query q=//faculty[//RA][//TA] lists all faculty nodes that have both an RA and a TA. There are two evaluation plans for this query

Plan 1: Find the faculty nodes that have an RA (intermediate result size 3), then keep those that also have a TA

Plan 2: Find the faculty nodes that have a TA (intermediate result size 2), then keep those that also have an RA

[Figure: example graph with a department node, four faculty nodes, and name, RA, and TA nodes]
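The choice between the two plans can be sketched as follows. This is a minimal illustration, assuming the optimizer simply evaluates the predicate with the smaller estimated result first. The function name `order_predicates` and the dictionary of estimates are hypothetical; the sizes 3 and 2 come from the example above.

```python
def order_predicates(estimates):
    """Return predicates sorted by estimated result size, smallest first."""
    return sorted(estimates, key=estimates.get)


# Intermediate result sizes from the example:
# 3 faculty nodes have an RA, 2 faculty nodes have a TA.
estimates = {"//RA": 3, "//TA": 2}

plan = order_predicates(estimates)
print(plan)  # the predicate listed first is evaluated first
```

Under this assumption the TA predicate is evaluated first, because it leaves fewer intermediate results to check against the second predicate.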

Problem Definition

Given a query, selectivity estimation predicts how many results the query produces, without performing the costly evaluation

[Figure: example graph with a department node, four faculty nodes, and name, RA, and TA nodes]

q=//faculty[//RA]

Selectivity(q) = 3

Our methodology skeleton

Step 1: Label the graph nodes (precomputed, before any query arrives)

Step 2: Estimate query selectivity from the precomputed labels (at query time)

Prime number labeling

Label each graph node with an integer that is the product of some prime numbers

Prime number labeling (cont.)

Divisibility of labels implies an ancestor-descendant relationship

For example, 3*5*7*11 is divisible by 11, so node g is a descendant of node a
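The divisibility test can be sketched in a few lines. This is an illustration only: the label of node a is the product from the example, while the label 11 for node g and the extra label 13 are assumptions made for the demonstration, and `is_descendant` is a hypothetical helper name.

```python
def is_descendant(ancestor_label, node_label):
    """True if node_label divides ancestor_label (the divisibility test)."""
    return ancestor_label % node_label == 0


label_a = 3 * 5 * 7 * 11  # label of node a, from the example
label_g = 11              # assumed label of node g (hypothetical)
label_x = 13              # hypothetical node that is not below a

print(is_descendant(label_a, label_g))
print(is_descendant(label_a, label_x))
```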

Optimization

Replace integers by vectors

Node  Vector
a     1 1 1 1
b     1 1 0 0
c     1 0 1 1
d     1 0 0 0
e     0 1 0 0
f     0 0 1 0
g     1 0 0 1

Optimization (cont.)

$VL(a) - VL(b) \geq 0$ (component-wise) implies node b is a descendant of node a
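A minimal sketch of the vector test, assuming the comparison is component-wise (every 1 in VL(b) is also a 1 in VL(a)), which mirrors divisibility of the prime products. The vectors are the ones listed for nodes a to g; the names `VL` and `is_descendant` are illustrative.

```python
# Vector labels of nodes a..g, as listed on the slide.
VL = {
    "a": (1, 1, 1, 1),
    "b": (1, 1, 0, 0),
    "c": (1, 0, 1, 1),
    "d": (1, 0, 0, 0),
    "e": (0, 1, 0, 0),
    "f": (0, 0, 1, 0),
    "g": (1, 0, 0, 1),
}


def is_descendant(vl_a, vl_b):
    """True if VL(a) - VL(b) >= 0 in every component (assumed test)."""
    return all(x >= y for x, y in zip(vl_a, vl_b))


print(is_descendant(VL["a"], VL["g"]))  # g's ones are all present in a
print(is_descendant(VL["b"], VL["c"]))  # c has ones that b lacks
```

Replacing integer division by a component-wise comparison avoids working with very large products when many primes are involved.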

Our methodology skeleton

Step 1: Label the graph nodes (precomputed, before any query arrives)

Step 2: Estimate query selectivity from the precomputed labels (at query time)

Selectivity Estimation

Two-dimensional histogram

Originally designed for selectivity estimation on trees [Jagadish 2004]

Label each tree node with an interval, e.g. (l, r)

Represent the interval as a point (l, r) in the xOy coordinate system

Partition the xOy plane into grid cells that serve as buckets

Estimate results using this histogram
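The bucketing step can be sketched as below. This is a simplified illustration, not the cited method: the interval labels, the cell size, and the name `build_histogram` are all assumptions, and the estimation step that reads the histogram is not shown.

```python
from collections import Counter


def build_histogram(intervals, cell_size):
    """Count interval labels (l, r), viewed as points, per grid cell."""
    return Counter((l // cell_size, r // cell_size) for l, r in intervals)


# Hypothetical interval labels for a five-node tree:
# a root, two children, and one grandchild under each child.
intervals = [(0, 9), (1, 4), (2, 3), (5, 8), (6, 7)]

histogram = build_histogram(intervals, cell_size=5)
print(dict(histogram))
```

Each key is a grid cell (column, row) and each value is the number of node labels that fall into that bucket.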

Selectivity Estimation (cont.)

Optimization

Replace integers by vectors

Node  Vector
a     1 1 1 1
b     1 1 0 0
c     1 0 1 1
d     1 0 0 0
e     0 1 0 0
f     0 0 1 0
g     1 0 0 1

Consecutive Ones Property Matrix

Given a 0/1 matrix, if we can find an order of columns such that the 1s in every row are consecutive, the matrix is called a consecutive ones property matrix (C1P matrix)

Reordering the columns takes linear time

Finding the largest C1P submatrix is NP-hard, and if the number of 1s in each column is larger than 3, it cannot be approximated in polynomial time
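The definition can be illustrated with a brute-force check. This sketch tries every column order, so it is exponential and only suitable for tiny matrices; it is not the linear-time algorithm mentioned above. The two small matrices and the function names are hypothetical.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if all 1s in the row form one contiguous run (or there are none)."""
    idx = [i for i, v in enumerate(row) if v]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)


def find_c1p_order(matrix):
    """Return a column order that makes every row consecutive, or None."""
    for order in permutations(range(len(matrix[0]))):
        if all(ones_are_consecutive([row[j] for j in order]) for row in matrix):
            return order
    return None


# Hypothetical examples: the first matrix has a valid order, the second has none.
print(find_c1p_order([[1, 0, 1], [0, 1, 1]]))
print(find_c1p_order([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))
```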

Add extra columns

Original matrix:

      0 1 2 3
a     1 1 1 1
b     1 1 0 0
c     1 0 1 1
d     1 0 0 0
e     0 1 0 0
f     0 0 1 0
g     1 0 0 1

Matrix with extra column 4 (Map: 4 → 0):

      0 1 2 3 4
a     1 1 1 1 0
b     1 1 0 0 0
c     0 0 1 1 1
d     1 0 0 0 0
e     0 1 0 0 0
f     0 0 1 0 0
g     0 0 0 1 1
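The effect of the extra column can be checked with a short sketch. It assumes that the slide's map entry means extra column 4 stands for original column 0, so that folding column 4 back onto column 0 recovers the original vectors. The names `fold_back` and `COLUMN_MAP` are hypothetical.

```python
ORIGINAL = {
    "a": (1, 1, 1, 1), "b": (1, 1, 0, 0), "c": (1, 0, 1, 1), "d": (1, 0, 0, 0),
    "e": (0, 1, 0, 0), "f": (0, 0, 1, 0), "g": (1, 0, 0, 1),
}
EXTENDED = {
    "a": (1, 1, 1, 1, 0), "b": (1, 1, 0, 0, 0), "c": (0, 0, 1, 1, 1),
    "d": (1, 0, 0, 0, 0), "e": (0, 1, 0, 0, 0), "f": (0, 0, 1, 0, 0),
    "g": (0, 0, 0, 1, 1),
}
COLUMN_MAP = {4: 0}  # assumption: extra column 4 represents original column 0


def ones_are_consecutive(row):
    """True if all 1s in the row form one contiguous run."""
    idx = [i for i, v in enumerate(row) if v]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)


def fold_back(vector, column_map, width):
    """OR each extra column into the original column it maps to."""
    out = list(vector[:width])
    for extra, target in column_map.items():
        out[target] |= vector[extra]
    return tuple(out)


print(all(ones_are_consecutive(v) for v in EXTENDED.values()))
print(all(fold_back(EXTENDED[n], COLUMN_MAP, 4) == ORIGINAL[n] for n in ORIGINAL))
```

In the given column order, rows c and g of the original matrix have gaps between their 1s; in the extended matrix every row is a single run.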

Add extra columns

Open question: given a 0/1 matrix, is adding the minimum number of extra columns so that the resulting matrix is a C1P matrix NP-hard?

Heuristic algorithm

Duplicate Merge

      1 2 3 4 5 6
r1    1 1 1 0 1 1
r2    0 1 1 0 1 0
r3    0 0 1 1 1 1

Heuristic Algorithm (cont.)

Before reordering the columns:

      1 2 3 4 5 6
r1    1 1 1 0 1 1
r2    0 1 1 0 1 0
r3    0 0 1 1 1 1

After reordering the columns:

      1 2 3 6 5 4
r1    1 1 1 1 1 0
r2    0 1 1 0 1 0 0
r3    0 0 1 1 1 1
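The second matrix appears to be the first one with its columns reordered as 1 2 3 6 5 4. The sketch below checks that reading and reports which rows have consecutive 1s afterwards. It only illustrates the column reordering, not the heuristic itself, and the helper names are hypothetical.

```python
matrix = [
    [1, 1, 1, 0, 1, 1],  # r1
    [0, 1, 1, 0, 1, 0],  # r2
    [0, 0, 1, 1, 1, 1],  # r3
]
order = [1, 2, 3, 6, 5, 4]  # 1-based column order from the slide


def reorder_columns(rows, column_order):
    """Rearrange every row according to a 1-based column order."""
    return [[row[c - 1] for c in column_order] for row in rows]


def ones_are_consecutive(row):
    """True if all 1s in the row form one contiguous run."""
    idx = [i for i, v in enumerate(row) if v]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)


reordered = reorder_columns(matrix, order)
for name, row in zip(["r1", "r2", "r3"], reordered):
    print(name, row, ones_are_consecutive(row))
```

With this order, r1 and r3 become single runs of 1s while r2 still has a gap, so the reordering alone does not make every row consecutive.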

Selectivity Estimation (cont.)

Implementation

Documents
