21
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP Kamel Aouiche and Daniel Lemire Universit´ e du Qu´ ebec ` a Montr´ eal (UQAM), Canada November 11, 2008 K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................1/19

A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Embed Size (px)

DESCRIPTION

A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliability very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.

Citation preview

Page 1: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

A Comparison of Five Probabilistic View-SizeEstimation Techniques in OLAP

Kamel Aouiche and Daniel Lemire

Universite du Quebec a Montreal (UQAM), Canada

November 11, 2008

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................1/19

Page 2: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Online Analytical Processing (OLAP)

(Source: dwreview.com)

What is OLAP?

multidimensional model, several(hierarchical) dimensions

measures aggregated (SUM,MIN, MAX, AVERAGE,. . . )

a set of standard operations:drill-down, roll-up, slice, dice

answers expected in nearconstant time

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................2/19

Page 3: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

The View-Size Materialization problem

Storing the result of a query (a view) is important

You trade storage for (future) speed.

Storage is cheap, faster CPUs are expensive.

Materialized views can be used to compute other views faster.

From sales per (time,store) it is faster to computesales per store, than to go back to transactions!

For aggressive aggregation (coarse views), materialized viewsare unbeatable!

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................3/19

Page 4: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

The data cube

Time - Store - Product

Time - Store Time -Product Store - Product

Time Store Product

none

The Data Cube

A d dimensional datacube is made of 2d − 1cuboids.

Typical values for drange from 10 to 20.

You also havedimensional hierarchiesto handle.

It would take too longto materialize them alleven if you hadenough storage andthe data neverchanged.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................4/19

Page 5: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

View Selection Heuristics

The Data Cube

given a data warehouse, heuristics can be used to determinewhich views to aggregate (the problem itself is typicallyNP-hard);

many heuristics assume reliable and accurate estimates of theview sizes;

even if the choice is done by hand, the analyst needs guidance;

finding optimally fast, accurate and reliable estimates is stillan open problem;

there has been little experimental work to compare thealternatives!

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................5/19

Page 6: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

What is view-size estimation?

Algorithmically?

Views are typically result ofGROUP BYs;

The size of the view is thenumber of distinct elements inthe GROUP BY;

thus, in a simplistic sense,view-size estimation isequivalent to finding thenumber of distinct elementsin a sequence, using littlememory.

A B CA B DA A CA B E

A BA A

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................6/19

Page 7: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Summary of the methods under review

Sampling Method

Sampling and Multifractal models [Faloutsos et al., 1996]

Probabilistic Methods

Adaptive counting [Cai et al., 2005];

LogLog probabilistic counting [Durand and Flajolet, 2003];

Gibbons-Tirthapura [Gibbons and Tirthapura, 2001]

Generalized counting [Bar-Yossef et al., 2002]

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................7/19

Page 8: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Unassuming Probabilistic Techniques

x

x

x

x

x

x

xx

x

xx

x

x

x

x

x

x

xx

x

x

x

UNIFORM RANDOM HASHING

What is the big idea?

It is difficult to workwith the original databecause it hasunknown bias.

Instead of learning thedistribution, just hashevery element, andwork into hashedspace.

Suddenly the datadistribution is known:it is uniform!

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................8/19

Page 9: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

k-wise independent hashing

What is it?

hash to [0, 2L)

uniform hashing: P(h(x) = y) = 1/2L

pairwise independent hashing:P(h(x) = y ∧ h(x ′) = y ′) = 1/4L.

3-wise independent hashing:P(h(x) = y ∧ h(x ′) = y ′ ∧ h(x ′′) = y ′′) = 1/8L

pairwise independence implies uniformity.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................9/19

Page 10: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Multidimensional Hashing

TIME

MarchApril

AugustMarchMarch

October

STORE

MontrealNew YorkMontrealToronto

Los AngelesMiami

PRODUCT

hamjelly

breadmilk

March→ 0111April →1010

August→1110October→0101

...ham→ 0111jelly →1010bread→1110milk→0101

XOR

How to hash facts?

Use a random numbergenerator, andgenerate independenthashed values for eachdimensions.

XOR the hashedvalues.

If you have kdimensions, get k-wiseindependent hashing.

Scales well if you storethe dimension-wisehash functions.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................10/19

Page 11: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Stochastic (LogLog) Probabilistic Counting

000001011011010010110101101010111010111101100010101011000011010101

000000

00000

COUNTING

LOGLOG

rejects outlier

only keep maximum

The counting trick

Hash to [0, 2L)

Keep track of numberof leading zeroes t,estimate ≈ 2t

LogLog variant onlyseek max leading zero(outliers)

Stochastic: hash xrandomly to one of Mintervals [0, 2L), keeptrack of M lesservalues, do some sort ofgeometric average ofthe M estimates

LogLogK. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................11/19

Page 12: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Adaptive counting

[Cai et al., 2005]

Probablistic counting schemes require the view size to be verylarge.

A small view compared to the available memory (M), willleave several of the M counters unused.

When more than 5% of the counters are unused we return alinear counting estimate [Whang et al., 1990] instead of theLogLog estimate.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................12/19

Page 13: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Generalized counting

[Bar-Yossef et al., 2002]

The tuples and hashed values are stored in an ordered set M.

For small M with respect to the view size, most tuples arenever inserted since their hashed value is larger than thesmallest M hashed values.

Estimate is 2Lsize(M)/max(M) where max(M) returns anelement with the largest hashed value.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................13/19

Page 14: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Gibbons-Tirthapura

[Gibbons and Tirthapura, 2001]

Keep track of all items hashed to 1/2t of the hashing space

Estimate is 2tm where m is number of items tracked.

x

x

x

x

x

x

xx

x

xx

x

x

x

x

x

x

xx

x

x

x

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................14/19

Page 15: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Experimental results

Benchmark the accuracy and speed for the five algorithmsover:

synthetic data set (derived by DBGEN)real data set (US Census 1990)

US Census 1990 DBGEN# of facts 2 458 285 13 977 981# of views 20 8

# of attributes 69 16Data size 360 MiB 1.5 GiB

Table: Characteristic of data sets.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................15/19

Page 16: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Experimental results: small memory budgets

0.01

0.1

1

10

100

1000

100 1000 10000 100000 1e+06 1e+07

ε, s

tan

da

rd e

rro

r (%

)

View size

M=16M=64

M=256M=2048

(a) Gibbons-Tirthapura

0.01

0.1

1

10

100

1000

100 1000 10000 100000 1e+06 1e+07

ε, s

tan

da

rd e

rro

r (%

)

View size

M=16M=64

M=256M=2048

theory M=2048

(b) Probabilistic Count-ing

0.01

0.1

1

10

100

1000

100 1000 10000 100000 1e+06 1e+07

ε, s

tan

da

rd e

rro

r (%

)

View size

M=16M=64

M=256M=2048

theory M=2048

(c) LogLog

0.01

0.1

1

10

100

1000

100 1000 10000 100000 1e+06 1e+07

ε, s

tan

da

rd e

rro

r (%

)

View size

p=0.1%p=0.3%p=0.5%p=0.7%

(d) Multifractal

0.01

0.1

1

10

100

1000

100 1000 10000 100000 1e+06 1e+07

ε, s

tan

da

rd e

rro

r (%

)

View size

M=16 M=64 M=256 M=2048

(e) Generalized Counting

0.01

0.1

1

10

100

1000

100 1000 10000 100000 1e+06 1e+07

ε, s

tan

da

rd e

rro

r (%

)

View size

M=16 M=64 M=256 M=2048

(f) Adaptive Counting

Figure: Standard error of estimation as a function of exact view size forincreasing values of M (US Census 1990).

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................16/19

Page 17: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Experimental results: large memory budgets

0.01

0.1

1

10

100

1000

10 100 1000 10000 100000 1e+006 1e+007 1e+008

ε, s

tand

ard

erro

r (%

)

Memory budget

Gibbons-TirthapuraGeneralized Counting

LogLogCounting

Adaptive Counting

(a) L=32

0.01

0.1

1

10

100

1000

10 100 1000 10000 100000 1e+006 1e+007 1e+008

ε, s

tand

ard

erro

r (%

)

Memory budget

Gibbons-TirthapuraGeneralized Counting

LogLogCounting

Adaptive Counting

(b) L=64

Figure: Standard error of estimation for a given view (four dimensionsand 1.18× 107 distinct tuples) as a function of memory budgets M(synthetic data set).

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................17/19

Page 18: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Experimental results: speed

0

50

100

150

200

250

10 100 1000 10000 100000 1e+006

Tim

e (s

econ

ds)

Memory budget

Gibbons-TirthapuraGeneralized Counting

LogLogCounting

Adaptive Counting

Figure: Estimation time for a given view (four dimensions and 1.18× 107

distinct tuples) as a function of memory budgets M (synthetic data set).

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................18/19

Page 19: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Conclusion

Main Points

Sampling can be quite unreliable, but very fast.

Processing time of probabilistic methods is dominated byhashing.

For small view-sizes relative to the available memory budget,the accuracy of Probablistic Counting and LogLog can bevery low.

However, as you increase the memory budget,Gibbons-Tirthapura, Generalized Counting and Adaptivecounting systematically improve, but they also become slower.

Adaptive Counting remains constantly fast.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................19/19

Page 20: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., andTrevisan, L. (2002).Counting distinct elements in a data stream.In RANDOM’02, pages 1–10.

Cai, M., Pan, J., Kwok, Y.-K., and Hwang, K. (2005).Fast and accurate traffic matrix measurement using adaptivecardinality counting.In MineNet’05, pages 205–206.

Durand, M. and Flajolet, P. (2003).Loglog counting of large cardinalities.In ESA’03, volume 2832 of LNCS, pages 605–617.

Faloutsos, C., Matias, Y., and Silberschatz, A. (1996).Modeling skewed distribution using multifractals and the 80-20law.In VLDB’96, pages 307–317.

Gibbons, P. B. and Tirthapura, S. (2001).

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................19/19

Page 21: A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

Estimating simple functions on the union of data streams.In SPAA’01, pages 281–291.

Whang, K.-Y., Vander-Zanden, B. T., and Taylor, H. M.(1990).A linear-time probabilistic counting algorithm for databaseapplications.ACM Trans. Database Syst., 15(2):208–229.

K. Aouiche and D. Lemire Probabilistic View-Size Estimation.................................19/19