Histograms for Selectivity Estimation
Speaker: Ho Wai Shing
Contents
  Introduction: What is a histogram? How to use a histogram?
  A taxonomy of single-dimensional histograms
  Some experimental results
  Some approaches for multi-dimensional histograms
  Conclusions
  Future Work
Introduction
  Many modules of a DB require selectivity estimation (estimating the query result size), e.g.:
    query optimizer -- determine the nesting order in an indexed nested-loop join
    user interface -- return a rough answer to the users
Introduction
  We need to store some statistics about the database to estimate the selectivity.
  The histogram is one of the most common statistics stored in practice.
  It is quite accurate and needs reasonably little space.
What is a Histogram?
  Histograms approximate the frequency distribution of an attribute (or a set of attributes):
    group attribute values into "buckets"
    approximate the actual frequencies by the statistical information stored in each bucket
Histogram: Example
  Consider the following distribution:

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

  This is an "equi-width" histogram of it:

    value   1--3  4--6  7--9
    freq.   3     4     2.7

  Estimated frequency of value 1: freq(1) ≈ 3
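The example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `equi_width` and `estimate` are made up for illustration, and the code assumes the number of distinct values divides evenly into the number of buckets.

```python
# Data distribution from the example: value -> frequency.
freqs = {1: 0, 2: 4, 3: 5, 4: 2, 5: 1, 6: 9, 7: 1, 8: 4, 9: 3}

def equi_width(freqs, num_buckets):
    """Split the sorted values into buckets holding equally many values.

    Returns a list of (low, high, average_frequency) triples.
    Assumes len(freqs) is divisible by num_buckets.
    """
    values = sorted(freqs)
    width = len(values) // num_buckets
    buckets = []
    for start in range(0, len(values), width):
        chunk = values[start:start + width]
        avg = sum(freqs[v] for v in chunk) / len(chunk)
        buckets.append((chunk[0], chunk[-1], avg))
    return buckets

def estimate(buckets, value):
    """Estimate the frequency of one value as its bucket's average."""
    for low, high, avg in buckets:
        if low <= value <= high:
            return avg
    return 0.0

histogram = equi_width(freqs, 3)
print([(low, high, round(avg, 1)) for low, high, avg in histogram])
# -> [(1, 3, 3.0), (4, 6, 4.0), (7, 9, 2.7)]
print(estimate(histogram, 1))  # -> 3.0, although the actual frequency is 0
```

Only three averages are stored instead of nine frequencies, which is where both the space saving and the estimation error come from.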
Histogram: the problem
  A pair of dual problems [1]:
    Given a data distribution, a limit B on the length of H, and an error metric E(), find the histogram H that minimizes E(H).
    Given the data distribution, a limit ε on the error, and an error metric E(), find the histogram H of smallest length for which E(H) is at most ε.
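The slides leave the error metric E() abstract. As one possible choice, the sketch below assumes E is the sum of squared differences between each actual frequency and the average of its bucket; the function name `sum_squared_error` and the bucket representation (lists of values) are illustrative assumptions, not taken from [1].

```python
def sum_squared_error(freqs, buckets):
    """E(H) under the assumed metric: sum of (actual - bucket average)^2.

    `freqs` maps value -> frequency; `buckets` is a list of lists of
    values, each bucket approximating its values by their average.
    """
    error = 0.0
    for bucket in buckets:
        avg = sum(freqs[v] for v in bucket) / len(bucket)
        error += sum((freqs[v] - avg) ** 2 for v in bucket)
    return error

freqs = {1: 0, 2: 4, 3: 5, 4: 2, 5: 1, 6: 9, 7: 1, 8: 4, 9: 3}
three_equal_buckets = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(round(sum_squared_error(freqs, three_equal_buckets), 2))  # -> 56.67
```

With such a function, the first problem fixes the number of buckets and searches for the partition with the smallest error, while the dual fixes the error bound and searches for the fewest buckets.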
Two Goals for Histograms
  for selections (exact or range queries)
  for joins
  This talk focuses on histograms for selections.
Taxonomy of Histogram
  based on the paper by Poosala et al. [2] in SIGMOD'96 on single-dimensional histograms
  the paper proposed a generalized histogram-generating algorithm
  different decisions in each step result in different histograms
Generalized Histogram Generating Algorithm
  consider the data distribution as a two-column table T(value, frequency)
  create a third attribute a3 (sort parameter) based on the first two attributes, and sort the table according to a3
  specify a subclass of histograms (the partition class)
  create a fourth attribute a4 (source parameter)
  partition T into B buckets such that the partition satisfies some constraints on a4
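The steps above can be sketched as one generic function whose behaviour is fixed by three plug-in choices. This is a rough illustration, not the algorithm as written in [2]: the names `build_histogram`, `spread`, `frequency`, and `equi_sum` are hypothetical, the spread of the last value is assumed to be 1, and the equi-sum rule (close a bucket as soon as its share of the total is reached) is just one way to approximate "equal sums".

```python
def build_histogram(table, sort_key, source, partition):
    """Sort T by a3, derive a4, then cut the sorted rows into buckets.

    table     -- list of (value, frequency) rows
    sort_key  -- function row -> a3 (sort parameter)
    source    -- function (sorted_rows, index) -> a4 (source parameter)
    partition -- function a4_list -> indexes where a new bucket starts
    """
    rows = sorted(table, key=sort_key)
    a4 = [source(rows, i) for i in range(len(rows))]
    cuts = [0] + list(partition(a4)) + [len(rows)]
    return [rows[lo:hi] for lo, hi in zip(cuts, cuts[1:])]

def spread(rows, i):
    """Distance to the next value (assumed to be 1 for the last row)."""
    if i + 1 < len(rows):
        return rows[i + 1][0] - rows[i][0]
    return 1

def frequency(rows, i):
    return rows[i][1]

def equi_sum(num_buckets):
    """Constraint: every bucket gets roughly the same sum of a4."""
    def partition(a4):
        total, running, cuts = sum(a4), 0, []
        for i, x in enumerate(a4[:-1]):
            running += x
            share = total * (len(cuts) + 1) / num_buckets
            if len(cuts) < num_buckets - 1 and running >= share:
                cuts.append(i + 1)
        return cuts
    return partition

table = [(1, 0), (2, 4), (3, 5), (4, 2), (5, 1), (6, 9), (7, 1), (8, 4), (9, 3)]
by_value = lambda row: row[0]
equi_width = build_histogram(table, by_value, spread, equi_sum(3))
equi_depth = build_histogram(table, by_value, frequency, equi_sum(3))
print([[v for v, _ in b] for b in equi_width])  # -> [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print([[v for v, _ in b] for b in equi_depth])  # -> [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
```

Changing only the source parameter from spread to frequency turns an equi-width histogram into an equi-depth one, which is the point of the generalized formulation.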
Example: Equi-Width Histograms
  a3 = value
  all histograms are possible
  a4 = spread
  constraint: every bucket should contain the same number of data values
Example: End-Biased Equi-Depth Histograms
  a3 = value
  all but one bucket must be singletons
  a4 = frequency
  constraint: all buckets should have the same total frequency count
Taxonomy
  Dimensions:
    partition classes -- serial, end-biased
    a3 -- value (V), frequency (F), area (A)
    a4 -- spread (S), frequency (F), cum. freq. (C), area (A)
    constraints -- equi-sum, v-optimal, max-diff, compressed, spline-based
Constraints
  Equi-sum: each bucket should have the same sum of a4.
  V-Optimal: choose the bucket boundaries so that the variance of the overall frequency approximation is minimized.
  Spline-based: the cumulative freq. satisfies a piece-wise linear approximation.
Constraints (cont.)
  Max-diff: bucket boundaries are placed at the B-1 largest differences between adjacent a4 values.
  Compressed (comp.): the n entries with the highest a4 values are stored exactly; the others are stored using equi-sum.
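Taxonomy
  Constraints for each combination of sort parameter (a3, rows) and source parameter (a4, columns):

    a3 \ a4   spread     freq.                                   area                          cum. freq.
    value     equi-sum   equi-sum, v-optimal, max-diff, comp.    v-optimal, max-diff, comp.    spline, v-optimal
    freq.     -          v-optimal, max-diff                     -                             -
    area      -          -                                       v-optimal, max-diff           -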
Equi-Width Histograms
  discussed in Kooi's thesis (1980) [3]
  denoted by "Equi-Sum(V, S)" in the taxonomy
  mergeable buckets must have contiguous values
  merge criterion is about the spread, based on equi-sum

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

    value   1--3  4--6  7--9
    count   3     3     3
    freq.   3     4     2.7
Equi-Depth Histograms
  proposed by Piatetsky-Shapiro and Connell in SIGMOD'84 [4]
  denoted by "Equi-Sum(V, F)" in the taxonomy
  mergeable buckets must have contiguous values
  merge criterion is about the frequencies, based on equi-sum

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

    value   1--4  5--6  7--9
    count   4     2     3
    freq.   2.8   5     2.7
V-Optimal(F, F) Histograms
  proposed by Ioannidis and Christodoulakis in 1993 [5]
  mergeable buckets must have contiguous frequencies
  merge criterion is to minimize the sum-squared error on frequencies within a bucket

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

  sorted by frequency:

    value   6  3  2  8  9  4  5  7  1
    freq.   9  5  4  4  3  2  1  1  0

    value   6     2--9  1--7
    count   1     4     4
    freq.   9     4     1
V-Optimal(V, F) Histograms
  proposed by Poosala et al. in 1996 [2]
  mergeable buckets must have contiguous values
  merge criterion is to minimize the sum-squared error on frequencies

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

    value   1--5  6     7--9
    count   5     1     3
    freq.   2.4   9     2.7
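One straightforward way to find such a partition is dynamic programming over the value-sorted frequency list. The sketch below is an illustrative implementation, not necessarily the construction used in [2]; the name `v_optimal` and the return format (indexes at which a new bucket starts, plus the total squared error) are assumptions.

```python
from functools import lru_cache

def v_optimal(freqs, num_buckets):
    """Return (cuts, error) minimizing the within-bucket squared error.

    freqs is the frequency list in sort-parameter order; cuts are the
    indexes at which a new bucket starts.
    """
    prefix, prefix_sq = [0.0], [0.0]
    for f in freqs:
        prefix.append(prefix[-1] + f)
        prefix_sq.append(prefix_sq[-1] + f * f)

    def sse(lo, hi):
        """Squared error of approximating freqs[lo:hi] by its mean."""
        s = prefix[hi] - prefix[lo]
        return (prefix_sq[hi] - prefix_sq[lo]) - s * s / (hi - lo)

    @lru_cache(maxsize=None)
    def best(hi, k):
        """Best (error, cuts) for freqs[:hi] using exactly k buckets."""
        if k == 1:
            return sse(0, hi), ()
        options = []
        for lo in range(k - 1, hi):  # lo = start of the last bucket
            err, cuts = best(lo, k - 1)
            options.append((err + sse(lo, hi), cuts + (lo,)))
        return min(options)

    error, cuts = best(len(freqs), num_buckets)
    return list(cuts), error

freqs = [0, 4, 5, 2, 1, 9, 1, 4, 3]  # frequencies of values 1..9
cuts, error = v_optimal(freqs, 3)
print(cuts, round(error, 2))  # -> [5, 6] 21.87
```

Cuts at indexes 5 and 6 correspond to the buckets 1--5, 6, and 7--9 shown in the table: isolating the outlier frequency 9 removes most of the error.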
Max-Diff(V, F) Histograms
  proposed by Poosala et al. in 1996 [2]
  mergeable buckets must have contiguous values
  merge criterion is to place the bucket boundaries between the adjacent values whose frequencies differ the most

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

    value   1--5  6     7--9
    count   5     1     3
    freq.   2.4   9     2.7
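Following the max-diff constraint stated earlier (boundaries at the top B-1 adjacent a4 differences), the bucket boundaries for this example can be computed directly. This is a small sketch; the function name `max_diff_cuts` and the tie-breaking rule (earlier position first) are assumptions, since the slides do not specify them.

```python
def max_diff_cuts(freqs, num_buckets):
    """Place boundaries at the (B-1) largest adjacent frequency differences.

    freqs is the frequency list in value order. Returns the sorted
    indexes at which a new bucket starts.
    """
    diffs = [(abs(freqs[i] - freqs[i - 1]), i) for i in range(1, len(freqs))]
    # Largest difference first; ties broken by position (assumed rule).
    largest = sorted(diffs, key=lambda d: (-d[0], d[1]))[:num_buckets - 1]
    return sorted(i for _, i in largest)

freqs = [0, 4, 5, 2, 1, 9, 1, 4, 3]  # frequencies of values 1..9
print(max_diff_cuts(freqs, 3))  # -> [5, 6]
```

The two largest jumps (1 to 9 and 9 to 1) surround value 6, so the result is again 1--5, 6, 7--9, without the search that a v-optimal construction needs.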
Compressed(V, F) Histograms
  proposed by Poosala et al. in 1996 [2]
  mergeable buckets must have contiguous values
  merge criterion is equi-depth, except for the n most frequent values

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

    value   1--3  6     4--9
    count   3     1     5
    freq.   3     9     2.2
Summary
  Data distribution:

    value   1  2  3  4  5  6  7  8  9
    freq.   0  4  5  2  1  9  1  4  3

  Histograms, each bucket given as value range (count, freq.):

    equi-width         1--3 (3, 3)     4--6 (3, 4)     7--9 (3, 2.7)
    equi-depth         1--4 (4, 2.8)   5--6 (2, 5)     7--9 (3, 2.7)
    V-optimal(F, F)    6 (1, 9)        2--9 (4, 4)     1--7 (4, 1)
    V-optimal(V, F)    1--5 (5, 2.4)   6 (1, 9)        7--9 (3, 2.7)
    Max-Diff(V, F)     1--5 (5, 2.4)   6 (1, 9)        7--9 (3, 2.7)
    Compressed(V, F)   1--3 (3, 3)     6 (1, 9)        4--9 (5, 2.2)
Estimation Example
  freq(4) = 2 (actual value)
  equi-width: freq(4) ≈ 4
  equi-depth: freq(4) ≈ 2.8
  V-optimal(F,F): freq(4) ≈ 9(0)+4(1)+1(0) = 4
  V-optimal(V,F): freq(4) ≈ 2.4(1)+9(0)+2.7(0) = 2.4
  Max-Diff(V,F): freq(4) ≈ 2.4
  Compressed(V,F): freq(4) ≈ 2.2
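For the value-sorted histograms, these estimates amount to a bucket lookup. The sketch below hard-codes the buckets from the summary (with the rounded averages shown there) as sets of values; the helper names `estimate` and `estimate_range` are hypothetical, and the range estimate assumes frequencies are uniform within a bucket.

```python
# Each histogram: list of (set_of_values, average_frequency) buckets.
histograms = {
    "equi-width": [({1, 2, 3}, 3.0), ({4, 5, 6}, 4.0), ({7, 8, 9}, 2.7)],
    "equi-depth": [({1, 2, 3, 4}, 2.8), ({5, 6}, 5.0), ({7, 8, 9}, 2.7)],
    "V-optimal(V,F)": [({1, 2, 3, 4, 5}, 2.4), ({6}, 9.0), ({7, 8, 9}, 2.7)],
    "Max-Diff(V,F)": [({1, 2, 3, 4, 5}, 2.4), ({6}, 9.0), ({7, 8, 9}, 2.7)],
    "Compressed(V,F)": [({1, 2, 3}, 3.0), ({6}, 9.0), ({4, 5, 7, 8, 9}, 2.2)],
}

def estimate(buckets, value):
    """Exact-match query: the average frequency of the value's bucket."""
    for values, avg in buckets:
        if value in values:
            return avg
    return 0.0

def estimate_range(buckets, low, high):
    """Range query: every covered value contributes its bucket average."""
    return sum(avg * sum(1 for v in values if low <= v <= high)
               for values, avg in buckets)

for name, buckets in histograms.items():
    print(name, estimate(buckets, 4))
# equi-width 4.0, equi-depth 2.8, V-optimal(V,F) 2.4,
# Max-Diff(V,F) 2.4, Compressed(V,F) 2.2

print(estimate_range(histograms["equi-width"], 4, 6))  # -> 12.0
```

The same lookup extends to range selections, the other query type this talk focuses on, by summing the per-value estimates over the range.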
Experimental Results
  100000 tuples
  200 attribute values
  2000 samples for construction
Experimental Results
  cusp_max value distribution
  random relation between values and frequencies
  frequencies fit to Zipf distribution
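As an illustration of what "frequencies fit to Zipf distribution" can mean, the sketch below generates frequencies proportional to 1/rank^z and scales them to the table size. The function name `zipf_frequencies` and the skew value z = 1.0 are assumptions; the slides do not state the skew used in the experiments. The counts of 200 values and 100000 tuples are taken from the setup listed earlier.

```python
def zipf_frequencies(num_values, num_tuples, skew):
    """Frequencies proportional to 1 / rank**skew, scaled to num_tuples.

    Returns real-valued frequencies ordered from most to least frequent.
    """
    weights = [1.0 / rank ** skew for rank in range(1, num_values + 1)]
    total = sum(weights)
    return [num_tuples * w / total for w in weights]

# skew=1.0 is an assumed value chosen only for illustration.
freqs = zipf_frequencies(num_values=200, num_tuples=100000, skew=1.0)
print(len(freqs), round(sum(freqs)))  # -> 200 100000
print(freqs[0] / freqs[1])            # the top value is twice as frequent as the second
```

A skew of 0 gives a uniform distribution, and larger skews concentrate more tuples in a few values, which is the kind of case where singleton buckets pay off.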
Other Experiment Parameters
  Skew of frequency
  Skew of data value distribution
  Sample size in construction
  Accuracy vs. Storage
  Data Distributions (freq., values, correlations)
  Queries
Conclusions
  Histograms are useful in estimating the selectivity of a query.
  Different techniques exist for approximating the data with histograms.
  V-optimal or MaxDiff histograms can have good accuracy in the 1-D case.
Future Work
  The methods presented can't solve the n-D histogram problem completely.
  Try to apply SF-Tree to store and retrieve the buckets of multi-dimensional histograms efficiently.
References
  [1] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel, Optimal Histograms with Quality Guarantees, VLDB'98
  [2] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita, Improved Histograms for Selectivity Estimation of Range Predicates, SIGMOD'96
  [3] R. P. Kooi, The Optimization of Queries in Relational Databases, PhD Thesis, Case Western Reserve University, 1980
  [4] M. Muralikrishna and D. DeWitt, Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries, SIGMOD'88