Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Page 1: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Histograms for Selectivity Estimation

Speaker: Ho Wai Shing

Page 2: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Contents
- Introduction: What is a histogram? How to use a histogram?
- A taxonomy of single-dimensional histograms
- Some experimental results
- Some approaches for multi-dimensional histograms
- Conclusions
- Future Work

Page 3: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Introduction
Many modules of a DB require selectivity estimation (estimating the query result size), e.g.,
- query optimizer -- determine the nesting in an indexed nested loop join
- user interface -- return a rough answer to the users

Page 4: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Introduction
- We need to store some statistics of the database to estimate the selectivity.
- Histograms are among the most common statistics stored in practice.
- They are quite accurate and need reasonably small space.

Page 5: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

What is a Histogram?
Histograms approximate the frequency distribution of an attribute (or a set of attributes):
- group attribute values into "buckets"
- approximate the actual frequencies by the statistical information stored in each bucket

Page 6: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Histogram: Example
Consider the following distribution:

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

This is an "equi-width" histogram:

value  1--3  4--6  7--9
freq.  3     4     2.7

freq(1) ≈ 3
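A minimal sketch of how the equi-width histogram above could be used for estimation. The function names `estimate_point` and `estimate_range` are illustrative, and the range estimate assumes integer values that each carry their bucket's average frequency, which is a common simplification rather than something the slide states.

```python
# Buckets of the equi-width histogram above: (low, high, average frequency).
buckets = [(1, 3, 9 / 3), (4, 6, 12 / 3), (7, 9, 8 / 3)]

def estimate_point(buckets, value):
    """Estimated frequency of one value: the average of its bucket."""
    for low, high, avg in buckets:
        if low <= value <= high:
            return avg
    return 0.0

def estimate_range(buckets, lo, hi):
    """Estimated number of tuples with lo <= value <= hi (integer values assumed)."""
    total = 0.0
    for low, high, avg in buckets:
        # number of integer values shared by the query range and the bucket
        overlap = min(hi, high) - max(lo, low) + 1
        if overlap > 0:
            total += overlap * avg
    return total

print(estimate_point(buckets, 1))     # 3.0 (actual frequency: 0)
print(estimate_range(buckets, 2, 5))  # 14.0 (actual: 4 + 5 + 2 + 1 = 12)
```

The point estimate for value 1 shows the cost of averaging: the bucket reports 3 although the value never occurs.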

Page 7: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Histogram: the problem
A pair of dual problems [1]:
- Given a data distribution, a limit B on the length of H, and an error metric E(), find the histogram H of length at most B that minimizes E(H).
- Given a data distribution, a limit ε on the error, and an error metric E(), find the histogram H of smallest length for which E(H) is at most ε.

Page 8: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Two Goals for Histograms
- for selections (exact or range queries)
- for joins
- focus on histograms for selections in this talk

Page 9: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Taxonomy of Histograms
- based on the paper by Poosala et al. [2] in SIGMOD'96 on single-dimensional histograms
- the paper proposed a generalized histogram-generating algorithm
- different decisions in each step result in different histograms

Page 10: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Generalized Histogram Generating Algorithm
- consider the data distribution as a two-column table T(value, frequency)
- create a third attribute a3 (sort parameter) based on the first two attributes, and sort the table according to a3
- specify a subclass of histogram
- create a 4th attribute a4 (source parameter)
- partition T into B buckets s.t. the partition satisfies some constraints on a4
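The steps above can be sketched as a single function with pluggable parameters. This is only an illustration under stated assumptions: `build_histogram`, `sort_param`, and `source_param` are hypothetical names, the partitioning step is fixed to an equi-sum constraint on a4 (bucket k ends at the first row whose cumulative a4 reaches k/B of the total), the histogram subclass step is left out, and the data is assumed to be spread evenly enough that no bucket ends up empty.

```python
from itertools import accumulate

def build_histogram(dist, sort_param, source_param, num_buckets):
    """dist: rows (value, frequency) of the table T. Returns a list of buckets,
    each bucket being the list of rows assigned to it."""
    # Create a3 and sort T by it.
    rows = sorted(dist, key=sort_param)
    # Create a4 for every row.
    a4 = [source_param(v, f) for v, f in rows]
    # Partition into B buckets under an equi-sum constraint on a4.
    cum = list(accumulate(a4))
    total = cum[-1]
    buckets, start = [], 0
    for k in range(1, num_buckets + 1):
        if k == num_buckets:
            end = len(rows) - 1  # last bucket takes whatever is left
        else:
            target = k * total / num_buckets
            end = next(i for i, c in enumerate(cum) if c >= target)
        buckets.append(rows[start:end + 1])
        start = end + 1
    return buckets

dist = list(zip(range(1, 10), [0, 4, 5, 2, 1, 9, 1, 4, 3]))

# a3 = value, a4 = spread (taken as 1 per value, since values are consecutive)
equi_width = build_histogram(dist, lambda row: row[0], lambda v, f: 1, 3)
# a3 = value, a4 = frequency
equi_depth = build_histogram(dist, lambda row: row[0], lambda v, f: f, 3)

print([[v for v, _ in b] for b in equi_width])  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print([[v for v, _ in b] for b in equi_depth])  # [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
```

Changing only the source parameter switches between the two histograms, which is the point of the generalized formulation.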

Page 11: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Example: Equi-Width Histograms
- a3 = value
- all histograms are possible
- a4 = value
- constraint: every bucket should contain the same number of data values

Page 12: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Example: End-Biased Equi-Depth Histograms
- a3 = value
- all but one bucket must be singletons
- a4 = frequency
- constraint: all buckets should have the same total frequency count

Page 13: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Taxonomy
Dimensions:
- partition classes -- serial, end-biased
- a3 -- value (V), frequency (F), area (A)
- a4 -- spread (S), frequency (F), cum. freq. (C), area (A)
- constraints -- equi-sum, v-optimal, max-diff, compressed, spline-based

Page 14: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Constraints
- Equi-sum: each bucket should have the same sum of a4
- V-Optimal: divide the buckets so that the variance of the overall frequency approximation is minimized
- Spline-based: the cumulative freq. satisfies a piece-wise linear approximation

Page 15: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Constraints (cont.)
- Max-diff: bucket boundaries are placed at the top-(B-1) differences between adjacent a4 values
- Compressed (comp.): the top-n entries with the highest a4 values are stored exactly; the others are stored using equi-sum

Page 16: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

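Taxonomy
(rows: sort parameter a3; columns: source parameter a4)

a3 \ a4 | spread   | freq.                                | area                       | cum. freq.
value   | equi-sum | equi-sum, v-optimal, max-diff, comp. | v-optimal, max-diff, comp. | spline, v-optimal
freq.   |          | v-optimal, max-diff                  |                            |
area    |          |                                      | v-optimal, max-diff        |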

Page 17: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Equi-Width Histograms
- discussed in Kooi's thesis (1980) [3]
- denoted by "Equi-Sum(V, S)" in the taxonomy
- mergeable buckets must have contiguous values
- merge criterion is about the spread, based on equi-sum

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

value  1--3  4--6  7--9
count  3     3     3
freq.  3     4     2.7

Page 18: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Equi-Depth Histograms
- proposed by Piatetsky-Shapiro and Connell in SIGMOD'84 [4]
- denoted by "Equi-Sum(V, F)" in the taxonomy
- mergeable buckets must have contiguous values
- merge criterion is about the frequencies, based on equi-sum

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

value  1--4  5--6  7--9
count  4     2     3
freq.  2.8   5     2.7

Page 19: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

V-Optimal(F, F) Histograms
- proposed by Ioannidis and Christodoulakis in 1993 [5]
- mergeable buckets must have contiguous frequencies
- merge criterion is to minimize the sum-squared error on frequencies within a bucket

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

sorted by frequency:

value  6  3  2  8  9  4  5  7  1
freq.  9  5  4  4  3  2  1  1  0

value  6  2--9  1--7
count  1  4     4
freq.  9  4     1

Page 20: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

V-Optimal(V, F) Histograms
- proposed by Poosala et al. in 1996 [2]
- mergeable buckets must have contiguous values
- merge criterion is to minimize the sum-squared error on frequencies

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

value  1--5  6  7--9
count  5     1  3
freq.  2.4   9  2.7
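One way to find such a partition is dynamic programming over prefixes of the frequency list. The sketch below is an assumption about how this could be done, not the construction the cited papers prescribe; `v_optimal` and `sse` are illustrative names. It returns bucket boundaries as half-open index ranges together with the total sum-squared error.

```python
def sse(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for freqs[i:j]."""
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

def v_optimal(freqs, num_buckets):
    n = len(freqs)
    prefix, prefix_sq = [0], [0]
    for f in freqs:
        prefix.append(prefix[-1] + f)
        prefix_sq.append(prefix_sq[-1] + f * f)
    inf = float("inf")
    # best[b][j]: minimal error when the first j entries form b buckets
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # the last bucket covers freqs[i:j]
                cost = best[b - 1][i] + sse(prefix, prefix_sq, i, j)
                if cost < best[b][j]:
                    best[b][j], cut[b][j] = cost, i
    # Walk the cut table backwards to recover the bucket boundaries.
    bounds, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1], best[num_buckets][n]

freqs = [0, 4, 5, 2, 1, 9, 1, 4, 3]  # ordered by value: V-Optimal(V, F)
bounds, error = v_optimal(freqs, 3)
print(bounds)  # [(0, 5), (5, 6), (6, 9)], i.e. values 1--5, 6, 7--9

by_freq = sorted(freqs, reverse=True)  # ordered by frequency: V-Optimal(F, F)
bounds_ff, error_ff = v_optimal(by_freq, 3)
print(bounds_ff)  # [(0, 1), (1, 5), (5, 9)]
```

The triple loop costs O(B n^2) bucket evaluations, which is fine for a slide-sized example but would need care on large value domains.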

Page 21: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Max-Diff(V, F) Histograms
- proposed by Poosala et al. in 1996 [2]
- mergeable buckets must have contiguous values
- bucket boundaries are placed at the B-1 largest differences between adjacent frequencies

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

value  1--5  6  7--9
count  5     1  3
freq.  2.4   9  2.7
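A minimal sketch of the max-diff rule on this distribution, assuming a4 = frequency and absolute differences between adjacent entries; the name `max_diff` is illustrative. Ties between equal differences are broken in favour of the earlier position.

```python
def max_diff(values, freqs, num_buckets):
    # Absolute difference between each pair of adjacent frequencies.
    diffs = [abs(freqs[i + 1] - freqs[i]) for i in range(len(freqs) - 1)]
    # Positions of the B-1 largest differences; a boundary goes after each.
    order = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    cuts = sorted(order[:num_buckets - 1])
    buckets, start = [], 0
    for c in cuts + [len(freqs) - 1]:
        chunk = freqs[start:c + 1]
        buckets.append((values[start], values[c], sum(chunk) / len(chunk)))
        start = c + 1
    return buckets

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
freqs = [0, 4, 5, 2, 1, 9, 1, 4, 3]
hist = max_diff(values, freqs, 3)
print([(low, high) for low, high, _ in hist])  # [(1, 5), (6, 6), (7, 9)]
```

The two largest jumps (1 to 9 and 9 to 1) isolate value 6, so here max-diff arrives at the same buckets as V-Optimal(V, F) with a single pass over the data.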

Page 22: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Compressed(V, F) Histograms
- proposed by Poosala et al. in 1996 [2]
- mergeable buckets must have contiguous values
- merge criterion is equi-depth, except that the n most frequent values are stored exactly

value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

value  1--3  6  4--9
count  3     1  5
freq.  3     9  2.2

Page 23: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Summary

Data Distribution
value  1  2  3  4  5  6  7  8  9
freq.  0  4  5  2  1  9  1  4  3

equi-width
value  1--3  4--6  7--9
count  3     3     3
freq.  3     4     2.7

equi-depth
value  1--4  5--6  7--9
count  4     2     3
freq.  2.8   5     2.7

V-optimal(F, F)
value  6  2--9  1--7
count  1  4     4
freq.  9  4     1

V-optimal(V, F)
value  1--5  6  7--9
count  5     1  3
freq.  2.4   9  2.7

Max-Diff(V, F)
value  1--5  6  7--9
count  5     1  3
freq.  2.4   9  2.7

Compressed(V, F)
value  1--3  6  4--9
count  3     1  5
freq.  3     9  2.2

Page 24: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Estimation Example
freq(4) = 2 (actual value)
- equi-width: freq(4) ≈ 4
- equi-depth: freq(4) ≈ 2.8
- V-optimal(F,F): freq(4) ≈ 9(0) + 4(1) + 1(0) = 4
- V-optimal(V,F): freq(4) ≈ 2.4(1) + 9(0) + 2.7(0) = 2.4
- Max-Diff(V,F): freq(4) ≈ 2.4
- Compressed(V,F): freq(4) ≈ 2.2

Page 25: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Experimental Results
- 100000 tuples
- 200 attribute values
- 2000 samples for construction

Page 26: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Experimental Results
- cusp_max value distribution
- random value & freq. relation
- frequencies fit to Zipf distribution

Page 27: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Other Experiment Parameters
- Skew of frequency
- Skew of data value distribution
- Sample size in construction
- Accuracy vs. Storage
- Data Distributions (freq., values, correlations)
- Queries

Page 28: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Conclusions
- Histograms are useful in estimating the selectivity of a query
- Different techniques exist for approximating the data with a histogram
- v-optimal or MaxDiff histograms can have good accuracy for the 1-D case

Page 29: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Future Work
- The methods presented can't solve the n-D histogram problem completely
- Try to apply SF-Tree to store and retrieve the buckets in a multi-dimensional histogram efficiently

Page 30: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

References
[1] H.V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel, Optimal Histograms with Quality Guarantees, VLDB'98
[2] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita, Improved Histograms for Selectivity Estimation of Range Predicates, SIGMOD'96
[3] R. P. Kooi, The Optimization of Queries in Relational Databases, PhD Thesis, Case Western Reserve University, 1980

Page 31: Histograms for Selectivity Estimation Speaker: Ho Wai Shing

References (cont.)
[4] M. Muralikrishna and D. DeWitt, Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries, SIGMOD'88