
1

Algorithms for massive data sets

Lecture 3 (March 2, 2003)

Synopses, Samples & Sketches

2

Synopses

• Synopsis (from Webster) : a condensed statement or outline (as of a narrative or treatise)

• Synopsis (here) : A succinct data structure that lets us answer queries efficiently

3

Typical Queries

Statistics (count, median, variance, aggregates)

Patterns (clustering, associations, classification)

Nearest Neighbors (L1, L2, Hamming norm)

Property Testing (Skewness, Independence)

etc.

4

Why use Synopses?

• Can’t store the whole data : E.g. Web Data

• Resides in main memory : fast query response. E.g. OLAP Data

• Remote transmission at minimal cost

• Minimal effect on storage cost

5

Classification of Synopses

• Are they useful for more than one kind of query?
  – General purpose: E.g. samples

  – Specific purpose: E.g. Distinct Values Estimator

• What granularity?
  – One per database: E.g. sample of the whole relation

– One per distinct value of attribute : E.g. Profiles for customers in a call database

6

Some Numbers

• AQUA Project (Bell Labs):
  – DB Size : 420 MB

– Synopsis Size : 420 KB (0.1%) to 12.5 MB (3%)

– Accuracy : Within 10% for 0.1% of DB size

– Running Time : Less than 0.3% of the time for full query

• Quantile Summary (Khanna et al):
  – DB Size : 10^9 tuples

– Synopsis Size : 1249 tuples

– Accuracy : 1%

7

Synopses need not be fancy!

• Maintaining Mean (μ) of numbers

• What about variance? Maintain the count n and the sums Σxᵢ and Σ(xᵢ)²; then σ² = (1/n)·Σ(xᵢ)² − μ²
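As a quick illustration, here is a minimal Python sketch of such a "non-fancy" synopsis (the class and method names are made up): three running counters are enough to answer mean and variance queries at any time.

```python
class MeanVarianceSynopsis:
    """Tiny synopsis: a count, a running sum, and a running sum of squares."""

    def __init__(self):
        self.n = 0          # number of values seen so far
        self.sum_x = 0.0    # running sum of the x_i
        self.sum_x2 = 0.0   # running sum of the (x_i)^2

    def insert(self, x: float) -> None:
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mean(self) -> float:
        return self.sum_x / self.n

    def variance(self) -> float:
        mu = self.mean()
        return self.sum_x2 / self.n - mu * mu   # (1/n)·Σ(x_i)² − μ²
```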

8

Objectives

• Small Size

• Fast Update and Query

• Provable error guarantees (Need not give exact answers)

• Composable : Useful for distributed scenarios

9

A coarse classification

• Sampling based : This lecture

• Sketches

• Histograms

10

Sampling

• Where and how are samples used

• How are samples maintained
  – Single relation

• Types of samples:
  – Oblivious

– Value based

• Limitations of oblivious samples

11

Samples in DSS

• Exact answers NOT always required

– DSS applications usually exploratory: early feedback to help identify “interesting” regions

– Aggregate queries: precision to “last decimal” not needed
  • e.g., “What percentage of the US sales are in NJ?” (display as a bar graph)

– Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option

[Diagram: SQL Query → Decision Support System (DSS) → Exact Answer; long response times!]

12

Sampling: Basics

• Idea: A small random sample S of the data often well-represents all the data

– For a fast approx answer, apply the query to S & “scale” the result

– E.g., R.a is {0,1}, S is a 20% sample

select count(*) from R where R.a = 0

select 5 * count(*) from S where S.a = 0

[Example: a column of 0/1 values for R.a, with the rows belonging to the 20% sample S highlighted. Est. count = 5*2 = 10, Exact count = 10]

• Leverage extensive literature on confidence intervals for sampling

Actual answer is within the interval [a,b] with a given probability

E.g., 54,000 ± 600 with prob 90%
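A minimal Python sketch of this scale-up idea (the function and relation names are made up; a Bernoulli sample stands in for S): apply the predicate to the sample and divide the sample count by the sampling rate.

```python
import random

def estimate_count(relation, predicate, rate=0.2):
    """Estimate how many rows satisfy `predicate` from a `rate`-fraction sample,
    mirroring the 'select 5 * count(*) from S' rewrite above (rate = 20%)."""
    sample = [row for row in relation if random.random() < rate]
    sample_count = sum(1 for row in sample if predicate(row))
    return sample_count / rate

# Example: R.a is {0,1}; estimate how many rows have a = 0.
R = [{"a": random.randint(0, 1)} for _ in range(10_000)]
print(estimate_count(R, lambda row: row["a"] == 0))
```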

13

The Aqua Architecture

[Diagram: a Browser / Excel client sends SQL Query Q over the network to the Data Warehouse (e.g., Oracle); the warehouse executes Q and returns the result as HTML/XML; warehouse data updates arrive separately]

Picture without Aqua:

• User poses a query Q

• Data Warehouse executes Q and returns result

• Warehouse is periodically updated with new data

14

The Aqua Architecture

Picture with Aqua:

• Aqua is middleware, between the user and the warehouse

• Aqua Synopses are stored in the warehouse

• Aqua intercepts the user query and rewrites it as a query Q’ on the synopses; the data warehouse then returns an approximate answer

[Diagram: the Browser / Excel client sends SQL Query Q over the network to Aqua's Rewriter, which rewrites it as Q’ over the AQUA Synopses stored in the Data Warehouse (e.g., Oracle); the warehouse returns the result (with error bounds) as HTML/XML; the AQUA Tracker maintains the synopses as warehouse data updates arrive. Example rewrite, Q: select count(*) from R where R.a = 0 → Q’: select 5 * count(*) from S where S.a = 0]

15

Schema & Queries

• Most queries involve foreign key joins between tables followed by (grouping and) aggregation.

[Diagram: foreign-key DAG over tables L, O, PS, P, S, C, N, R, with edges labeled order, part/supp, cust, and nation]



17

Example Query

18

What samples are right?

• Naïve approach : maintain samples of each relation in the schema

• Problem : sample of the join is not a join of the samples, even for foreign key joins

• Example :

[Example diagram: small tables A and B with tuples a1, a2 and b1, illustrating that a join of samples of A and B is not a sample of the join of A and B]

19

Foreign Key Joins

• Foreign Key Join : Effectively a central “fact” table is appended with columns from the dimension tables.

• Sampling from the join is the same as sampling from the “fact” table itself.

• Synopsis : For every table that may be a “fact” table for a certain join, sample from the table and join the sample with the dimension tables.
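A minimal Python sketch of that synopsis for a single foreign-key join (the table and column names here are hypothetical, not from the slides): sample the fact table, then widen each sampled row with its matching dimension row, so the result is itself a uniform sample of the full join.

```python
import random

def join_synopsis(fact_rows, dim_rows, fk_col, dim_key, sample_size):
    """Uniform sample of fact_rows, each joined with its (unique) dimension row."""
    dim_index = {d[dim_key]: d for d in dim_rows}               # key -> dimension row
    sample = random.sample(fact_rows, min(sample_size, len(fact_rows)))
    return [{**f, **dim_index[f[fk_col]]} for f in sample]      # widen sampled rows

# Hypothetical lineitem (fact) and orders (dimension) tables.
orders = [{"o_id": i, "cust": f"c{i % 3}"} for i in range(100)]
lineitem = [{"l_id": j, "o_id": j % 100, "price": j} for j in range(1000)]
synopsis = join_synopsis(lineitem, orders, "o_id", "o_id", sample_size=50)
```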

20

Synopsis

• For every node in the DAG:
  – Maintain a sample corresponding to that table.

  – Join the sample with the tables corresponding to all its descendants in the graph.

  – This is the maximal join for which the table is a “fact” table.

[Diagram: the same foreign-key DAG (L, O, PS, P, S, C, N, R) with edges labeled order, part/supp, cust, and nation]

21

Bells and whistles!

• How to allocate memory across samples of different “fact” tables

• Group-By Queries:
  – Are uniform samples best or can we do better?

• Aggregate attribute may be skewed
  – Are uniform samples best or can we do better?

• We may revisit these issues later
  – Have not seen some equations for a while!

22

How to sample?

• Consider a single table with only insertions

• Want to maintain a sample of this table

• Three semantics of sampling:
  – Coin flip

– Fixed size without replacement

– Fixed size with replacement

• The first one (coin flip) is easy to maintain under insertions

• Exercise : Can we switch between the different kinds of samples? If so, how?

23

Reservoir Sampling

• Given : A stream of elements (tuples), viewed as insertions into a relation

• Aim : At every instant maintain a uniform random sample of size n without replacement

• Method : (Accept the first n elements)
  – Let t be the number of elements seen so far

  – On seeing the (t+1)st element, include it with probability n/(t+1)

– If it is included, evict one of the current sample elements uniformly at random
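A minimal Python rendering of this method (variable names are ours):

```python
import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample of size n, without replacement,
    over a stream whose length is not known in advance."""
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(item)                 # accept the first n elements
        elif random.random() < n / t:              # include the t-th with probability n/t
            reservoir[random.randrange(n)] = item  # evict a uniformly chosen victim
    return reservoir

print(reservoir_sample(range(1_000_000), 10))
```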

24

Proof of Correctness

• Easy to see that at every instant the size of the sample is exactly n

• Claim : After seeing t elements, every element belongs to the sample with probability n/t

• Exercise : Using induction prove the last claim
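For reference, a sketch of how the inductive step can go (assume the claim holds after t elements):

```latex
% An element already in the sample survives step t+1 unless the new element is
% accepted (prob. n/(t+1)) AND this element is the evicted victim (prob. 1/n):
\Pr[\text{old element in sample after } t+1]
   = \frac{n}{t}\left(1 - \frac{n}{t+1}\cdot\frac{1}{n}\right)
   = \frac{n}{t}\cdot\frac{t}{t+1}
   = \frac{n}{t+1},
% while the new element itself is kept with probability n/(t+1), completing the induction.
```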

25

Efficiency

• Let N be the number of records seen

• Each record (beyond the first n records) is added to the reservoir with probability n/t

• The average number of records added is

n + Σ_{t=n+1..N} n/t = n(1 + H_N − H_n) ≈ n(1 + ln(N/n))

• Consider any reservoir sample.

• The t-th element has to be a part of the sample with probability no less than n/t.

• Thus, the quantity above is also a lower bound on the additions made to the reservoir (time spent).

26

Efficiency

• The naïve algorithm makes N calls to RANDOM() and takes time O(N)

• Consider the following random variable: Let S(n,t) denote the number of elements skipped before the next insertion into the reservoir, where n is the size of the reservoir and t is the number of elements processed so far.

• Aim: Study this random variable and sample from its distribution using O(1) operations.

• Idea : Generate S(n,t) and skip those many records doing nothing

27

Observations

• S(n,t) is non-negative

• Let F(s) denote Prob {S(n,t) ≤ s}, for s≥ 0

F(s) = 1 − t^(n) / (t+s+1)^(n) = 1 − (t+1−n)^[s+1] / (t+1)^[s+1]

Where a^(b) denotes the falling power a(a−1)(a−2)…(a−b+1) and a^[b] denotes the rising power a(a+1)(a+2)…(a+b−1)

28

Observations

• Subtracting the two terms corresponding to s and s−1, we get the probability distribution function f(s) as

f(s) = F(s) − F(s−1) = (n / (t+s+1)) · t^(n) / (t+s)^(n)

• We can compute the expected value of S(n,t), which is (t−n+1)/(n−1)

• Here is a simple way to sample from the distribution of S(n,t). We already calculated its CDF F(s): generate a random number U between 0 and 1 and find the smallest s such that U ≤ F(s), i.e.

t^(n) / (t+s+1)^(n) ≤ 1 − U
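One way to turn this inverse-CDF idea into code (a Python sketch; helper names are made up, and for clarity the smallest s is found by a plain linear scan rather than the faster searches discussed on the next slide):

```python
import random

def prob_skip_greater(t, s, n):
    """Prob{S(n,t) > s} = t^(n) / (t+s+1)^(n) with falling powers,
    computed as the product of (t+j-n)/(t+j) for j = 1..s+1."""
    ratio = 1.0
    for j in range(1, s + 2):
        ratio *= (t + j - n) / (t + j)
    return ratio

def sample_skip(n, t):
    """Draw S(n,t): the smallest s with U <= F(s), i.e. Prob{S > s} <= 1 - U."""
    u = random.random()
    s = 0
    while prob_skip_greater(t, s, n) > 1.0 - u:
        s += 1
    return s

def reservoir_sample_with_skips(stream, n):
    reservoir, t, skip = [], 0, 0
    for item in stream:
        t += 1
        if t <= n:
            reservoir.append(item)
            if t == n:
                skip = sample_skip(n, t)            # gap before the next inclusion
        elif skip > 0:
            skip -= 1                               # skip this record, doing nothing
        else:
            reservoir[random.randrange(n)] = item   # record t enters the reservoir
            skip = sample_skip(n, t)                # draw the next gap
    return reservoir
```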

29

Observations

• Have reduced the number of calls to RANDOM() to optimal : One per insertion into the reservoir

• There are two ways to find the smallest s that satisfies the previous inequality
  – Linear scan : Gives an O(N) time algorithm

  – Binary search / Newton’s interpolation method, to get a running time of O(n²(1 + log(N/n) log log(N/n)))

• Note: This is still not optimal. Read the paper for an optimal (up to constants) algorithm.

30

What have we seen so far?

• How to sample efficiently (Reservoir Sampling)

– A method to sample without replacement by making a single scan

– Optimized the calls to RANDOM()

– Overall processing time can also be optimized

• How samples are used in DSS and what are the different samples that should be kept in order to answer queries

• What next?
  – Queries in DSS are not simple counts over the entire relation

– Typically they have grouping followed by aggregation of an attribute that may have high variance

31

Error using sampling

R = {y1, y2, …, yN}, sample size n

Variance in data values:

S² = (1/(N−1)) · Σᵢ (yᵢ − Ȳ)²

Error = Std Dev = √E[(μ − μ*)²] = (S/√n) · √(1 − n/N)
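A tiny Python helper (the name is ours) that evaluates this error formula directly, e.g. for checking against a simulated sample:

```python
from math import sqrt

def sampling_std_error(values, n):
    """Standard deviation of the sample-mean estimator for a size-n sample
    drawn without replacement: (S / sqrt(n)) * sqrt(1 - n/N)."""
    N = len(values)
    mean = sum(values) / N
    S2 = sum((y - mean) ** 2 for y in values) / (N - 1)   # variance in data values
    return sqrt(S2 / n) * sqrt(1 - n / N)
```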

32

Group-By Queries

• SELECT avg (salary) FROM census GROUP BY state

• Some of the states have very few tuples as compared to others. E.g. CA has 70 times more people than WY

• If we sample uniformly from the entire relation then there will be very few tuples corresponding to WY and hence a large error in its avg(salary) estimate

33

Error Metric (Group-By)

• Let c*_i be the true answer (aggregate) corresponding to group i

• Let c_i be the estimate obtained from sample

• The error e_i is given by |c*_i – c_i|/|c_i|

• The cumulative error is the L1,L2, L∞ norm of the error vector {e_i}

34

Optimal sampling strategy

• For every group the error is inversely proportional to √n where n is the number of tuples in the sample from this group

• In order to reduce the maximum error among all groups we should have an equal number of samples from each group (Senate)

• But this strategy is not optimal if the query does not have a group by and is over the entire relation. In that case a uniform sample of the entire relation is optimal (House)

35

Basic-Congress Sampling

• Unfortunately, unlike the U.S. Congress, we don’t have room to seat both Senators and House Representatives!

• Hence we do the following:
  – Let X be the total number of seats allotted to Congress

  – For a state, say CA, let CA_S (resp. CA_H) be the seats allotted to it assuming Congress were made up only of the Senate (resp. the House)

  – The final seat allocation to each state CA is proportional to max(CA_S, CA_H), subject to the total number of seats being X
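A minimal Python sketch of this max-then-rescale allocation (the function name and example numbers are made up; the Congressional sampling paper refines this further):

```python
def basic_congress_allocation(group_sizes, X):
    """Allocate X sample slots across groups: take the larger of the 'senate'
    (equal) and 'house' (proportional) shares, then rescale so the total is X."""
    G = len(group_sizes)
    total = sum(group_sizes.values())
    senate = {g: X / G for g in group_sizes}                       # equal per group
    house = {g: X * sz / total for g, sz in group_sizes.items()}   # proportional to size
    raw = {g: max(senate[g], house[g]) for g in group_sizes}
    scale = X / sum(raw.values())
    return {g: raw[g] * scale for g in group_sizes}

# Hypothetical state populations (millions): big states get house-style shares,
# tiny states are protected by the senate-style floor.
print(basic_congress_allocation({"CA": 35.0, "NY": 19.0, "WY": 0.5}, X=1000))
```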

36

Comments

• No error guarantees
  – Only a best-effort solution

• Cannot use Reservoir sampling anymore
  – The full paper talks about one-pass algorithms, but admits that they don’t work in all cases

• What if the variance in values (S) is large?
  – Outlier indexing

37

Error using sampling

R = {y1, y2, …, yN}, sample size n

Variance in data values:

S² = (1/(N−1)) · Σᵢ (yᵢ − Ȳ)²

Error = Std Dev = (S/√n) · √(1 − n/N)

38

Presence of Data Skew

Outliers (deviant tuples):

• 9,950 tuples with value = 1 and 50 tuples with value = 1000; exact answer (sum) = 59,950

• Uniform sample of size 100:
  – Case 1 (no outlier in the sample): sum estimate = 10,000

  – Case 2 (an outlier in the sample): sum estimate > 109,900

• Either way, error > 83%

39

Outlier Indexing Scheme

• Preprocessing: split R into R_O (the outliers, stored exactly) and R_NO (the rest), and keep a uniform sample of R_NO

• Query: run Q exactly on R_O to get A1; run Q on the sample of R_NO and extrapolate to get A2; return A = A1 + A2
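A minimal Python sketch of the scheme for a SUM query (the names and the thresholding rule used to pick outliers are ours, just for illustration):

```python
import random

def outlier_indexed_sum(values, outlier_threshold, sample_size):
    """Exact sum over the stored outliers plus an extrapolated sum
    over a uniform sample of the remaining tuples."""
    outliers = [v for v in values if v >= outlier_threshold]       # R_O, answered exactly
    rest = [v for v in values if v < outlier_threshold]            # R_NO, sampled
    a1 = sum(outliers)
    sample = random.sample(rest, min(sample_size, len(rest)))
    a2 = sum(sample) * len(rest) / len(sample) if sample else 0.0  # extrapolate
    return a1 + a2

# The skewed example from the earlier slide: 9,950 ones and 50 values of 1000.
data = [1] * 9950 + [1000] * 50
print(outlier_indexed_sum(data, outlier_threshold=1000, sample_size=100))  # close to 59,950
```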

40

Selection of Outlier Index

Objective: Remove at most n outliers such that the non-outliers have the least variance.

Theorem: For a sorted (multi)set of values v1 ≤ v2 ≤ … ≤ vN, the optimal outlier set is a prefix together with a suffix of the sorted order, i.e. it looks like {v1, …, vk} ∪ {vm, …, vN}.
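Assuming the prefix-plus-suffix form stated above, selecting the outlier index reduces to trying each way of splitting the removal budget between the two ends of the sorted order. A Python sketch (the budget is called tau here to avoid clashing with the sample size n; for simplicity it always removes exactly tau values):

```python
def select_outliers(values, tau):
    """Pick tau outliers so that the remaining values have the least variance,
    using the fact that the optimal outlier set is a prefix plus a suffix of
    the sorted order. O(tau * N) as written; a sliding-window, one-pass
    computation of the candidate variances is possible."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    v = sorted(values)
    best = None
    for k in range(tau + 1):                    # remove k smallest and tau-k largest
        rest = v[k: len(v) - (tau - k)]
        candidate = (variance(rest), k)
        if best is None or candidate < best:
            best = candidate
    k = best[1]
    return v[:k] + v[len(v) - (tau - k):]       # the chosen outliers

print(select_outliers([1] * 20 + [1000] * 3, tau=3))   # -> [1000, 1000, 1000]
```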

41

Comments

• Cannot do reservoir sampling

• One pass algorithm for selection of outliers

42

Types of Samples

• Oblivious samples: We do not look at the value of attribute while sampling

• Value based sampling : The distinct sampling of Gibbons et al

• Limitations of oblivious sampling:
  – Please refer to: Sampling algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar, and D. Sivakumar. STOC 2001.

43

Summary

• Obvious type of synopsis: samples

• Use of samples in DBs, in particular DSS
  – Idea of maintaining samples of ‘fact’ tables

• How to sample without replacement in a single pass, not knowing the size of the relation a priori
  – Reservoir sampling and tricks to make it efficient

• Shortcomings of sampling in DBs
  – Group-By queries : Congressional samples

– High Skew in Data : Outlier indexing, stratified sampling

44

References

• Join Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. SIGMOD 1999.

• Congressional Samples for Approximate Answering of Group-By Queries, S. Acharya, P. Gibbons, and V. Poosala. SIGMOD 2000.

• Overcoming Limitations of Sampling for Aggregation Queries, S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. ICDE 2001.

• A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, S. Chaudhuri, G. Das and V. Narasayya. SIGMOD 2001.

• Random Sampling with a Reservoir, J. S. Vitter. Trans. on Mathematical Software 11(1):37-57 (1985).

45

Sampling over Sliding Windows

• Samples of streaming data

• Need to account for staleness of data

• A data element is fresh if it belongs to the last N elements

• Problem statement : Given a stream of elements, maintain a uniform random sample of size k from among the last N elements

46

A Simple, Unsatisfying Approach

• Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n−1}

• The sample always consists of the non-expired elements whose indexes are equal to x1, …,xk (modulo n)

• Only uses O(k) memory

• Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic

• Unsuitable for many real applications, particularly those with periodicity in the data

47

Reservoir Sampling: Why It Doesn’t Work

• Suppose an element in the reservoir expires

• Need to replace it with a randomly-chosen element from the current window

• However, in the data stream model we have no access to past data

• Could store the entire window but this would require O(n) memory

48

Chain-Sample

• Include each new element in the sample with probability 1/min(i, n)

• As each element is added to the sample, choose the index of the element that will replace it when it expires

• When the ith element expires, the window will be (i+1…i+n), so choose the index from this range

• Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements

• When an element is chosen to be discarded from the sample, discard its “chain” as well
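One possible rendering of a single chain-sample (sample size 1) over a window of the last n elements, in Python; for a sample of size k, keep k independent copies. The names are ours and this is only a sketch of the bookkeeping described above.

```python
import random

class ChainSample:
    """One chain-sample over a sliding window of the last n elements.
    self.chain holds (index, value) pairs; the head is the current sample."""

    def __init__(self, n):
        self.n = n
        self.chain = []            # current sample followed by its replacement chain
        self.successor = None      # index that will extend the chain when it arrives

    def insert(self, i, value):
        # If this element was pre-chosen as the chain's next link, store it.
        if self.chain and i == self.successor:
            self.chain.append((i, value))
            self.successor = random.randint(i + 1, i + self.n)
        # Drop the head once it falls out of the window (its successor takes over).
        while self.chain and self.chain[0][0] <= i - self.n:
            self.chain.pop(0)
        # With probability 1/min(i, n), element i becomes the sample;
        # the old sample and its whole chain are discarded.
        if random.random() < 1.0 / min(i, self.n):
            self.chain = [(i, value)]
            self.successor = random.randint(i + 1, i + self.n)

    def sample(self):
        return self.chain[0][1] if self.chain else None

cs = ChainSample(n=8)
for i, x in enumerate([3, 5, 1, 4, 6, 2, 8, 5, 2, 3, 5, 4], start=1):
    cs.insert(i, x)
print(cs.sample())
```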

49

Example

[Example: four snapshots of the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, highlighting the current sample element and its chain of replacement candidates as the window slides]

50

Memory Usage of Chain-Sample

• Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x

• T(x) = 0 for x < 0, and T(x) = 1 + (1/n) · Σ_{x−n ≤ j < x} T(j) for x ≥ 0

• The expected length of each chain is less than T(n) ≤ e ≈ 2.718

• Expected memory usage is O(k)
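A small numerical check of that bound, assuming the recurrence as reconstructed above: iterating T(x) shows T(n) approaching but staying below e.

```python
import math

def expected_chain_length(n):
    """Iterate T(x) = 1 + (1/n) * sum_{x-n <= j < x} T(j), with T(j) = 0 for j < 0."""
    T = [0.0] * (n + 1)
    for x in range(n + 1):
        T[x] = 1.0 + sum(T[max(0, x - n):x]) / n
    return T[n]

print(expected_chain_length(100), math.e)   # about 2.705 vs e = 2.71828...
```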

51

Memory Usage of Chain-Sample

• A chain consists of “hops” with lengths 1…n

• A chain of length j can be represented by a partition of n into j ordered integer parts
  – j−1 hops with sum less than n, plus a remainder

• Each such partition has probability n^(−j)

• The number of such partitions is (n choose j) < (ne/j)^j

• The probability of any such partition is small, O(n^(−c)), when j = O(k log n)

• Uses O(k log n) memory whp

52

Comparison of Algorithms

• Chain-sample is preferable to oversampling:
  – Better expected memory usage: O(k) vs. O(k log n)

– Same high-probability memory bound of O(k log n)

– No chance of failure due to sample size shrinking below k

Algorithm     | Expected   | High-Probability
Periodic      | O(k)       | O(k)
Oversample    | O(k log n) | O(k log n)
Chain-Sample  | O(k)       | O(k log n)