1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research Center


Maintaining Bernoulli Samples Over Evolving Multisets

Rainer Gemulla Wolfgang Lehner

Technische Universität Dresden

Peter J. Haas

IBM Almaden Research Center


Motivation

• Sampling: crucial for information systems
  – Externally: quick approximate answers to user queries
  – Internally: speed up design and optimization tasks

• Incremental sample maintenance
  – Key to instant availability
  – Should avoid base-data accesses

[Figure: (updates), deletes, and inserts are applied to the “local” sample; access to the “remote” data is crossed out as too expensive.]


What Kind of Samples?

• Uniform sampling
  – Samples of equal size have equal probability
  – Popular and flexible, used in more complex schemes

• Bernoulli
  – Each item independently included (prob. = q)
  – Easy to subsample, parallelize (i.e., merge)

• Multiset sampling (most previous work is on sets)
  – Compact representation
  – Used in network monitoring, schema discovery, etc.

[Figure: a Bern(q) sample S drawn from a multiset dataset R.]


Outline

• Background
  – Classical Bernoulli sampling on sets
  – A naïve multiset Bernoulli sampling algorithm

• New sampling algorithm + proof sketch
  – Idea: augment sample with “tracking counters”

• Exploiting tracking counters for unbiased estimation
  – For dataset frequencies (reduced variance)
  – For # of distinct items in the dataset

• Subsampling algorithm
• Negative result on merging
• Related work


Classical Bernoulli sampling

• Bern(q) sampling of sets
  – Uniform scheme
  – Binomial sample size:
    $$P\{|S| = n\} = B(n; |R|, q) = \binom{|R|}{n} q^{n} (1-q)^{|R|-n}$$

• Originally designed for insertion-only
• But handling deletions from R is easy
  – Remove deleted item from S if present
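The set-based scheme on this slide can be sketched in a few lines. This is only an illustration: the class name `BernoulliSetSample` and its methods are hypothetical names chosen here, not taken from the talk.

```python
import random


class BernoulliSetSample:
    """Bern(q) sample of a set, maintained without touching the base data.

    Hypothetical helper for illustration; not the authors' code.
    """

    def __init__(self, q, seed=None):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must lie in [0, 1]")
        self.q = q
        self.items = set()
        self._rng = random.Random(seed)

    def insert(self, item):
        # Each item inserted into R enters S independently with probability q.
        if self._rng.random() < self.q:
            self.items.add(item)

    def delete(self, item):
        # A deletion from R only requires dropping the item from S if present.
        self.items.discard(item)
```

Because every item appears at most once in a set, a deletion never needs to know anything about R, which is what breaks down for multisets.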


Multiset Bernoulli Sampling

• In a Bern(q) multiset sample:
  – Frequency X(t) of item t ∈ T is Binomial(N(t), q)
  – Item frequencies are mutually independent

• Handling insertions is easy
  – Insert item t into R and (with probability q) into S
  – I.e., increment counters (or create new counters)

• Deletions from a multiset: not obvious
  – Multiple copies of item t in both S and R
  – Only local information is available


A Naïve Algorithm

• Insertion of t: insert t into sample with prob. q
• Deletion of t: delete t from sample with prob. X(t) / N(t)
  – X(t) copies of item t in sample
  – N(t) copies of item t in dataset

• Problem: must know N(t)
  – Impractical to track N(t) for every distinct t in dataset
  – Track N(t) only for distinct t in sample?
    • No: when t first enters, must access dataset to compute N(t)
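A minimal sketch of the naïve algorithm makes its dependency on the base data visible. The function names and the `n_t` parameter are assumptions introduced here; `n_t` stands for N(t) just before the deletion and has to be supplied by the caller.

```python
import random
from collections import Counter


def naive_insert(sample, t, q, rng):
    """Insert t into the sample (a Counter of frequencies) with probability q."""
    if rng.random() < q:
        sample[t] += 1


def naive_delete(sample, t, n_t, rng):
    """Process a deletion of t from R.

    n_t is N(t) before the deletion. It is not available locally, which is
    the drawback the slide points out.
    """
    x_t = sample[t]  # Counter returns 0 for absent items without adding a key
    if x_t > 0 and rng.random() < x_t / n_t:
        sample[t] -= 1
        if sample[t] == 0:
            del sample[t]
```

Every call to `naive_delete` needs `n_t`, so either all N(t) are tracked or the dataset is accessed.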


New Algorithm

• Key idea: use tracking counters (GM98)
  – After the j-th transaction, the augmented sample Sj is
      Sj = { (Xj(t), Yj(t)) : t ∈ T and Xj(t) > 0 }
    • Xj(t) = frequency of item t in the sample
    • Yj(t) = net # of insertions of t into R since t joined the sample

• Insertion of t: insert t into sample with prob. q
• Deletion of t: delete t from sample with prob. (Xj(t) – 1) / (Yj(t) – 1)

[Figure: the sample holds Xj(t) copies of item t; the dataset holds Nj(t) copies of item t.]
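The tracking-counter scheme can be sketched as follows. The class name and data layout are hypothetical. The slide gives only the deletion probability, so the handling of Y(t) = 1 (the item leaves the sample) is an assumption made here so that the ratio is always well defined.

```python
import random


class TrackedBernoulliSample:
    """Bern(q) multiset sample with tracking counters (illustrative sketch)."""

    def __init__(self, q, seed=None):
        self.q = q
        self.counters = {}  # t -> [X(t), Y(t)], present only while X(t) > 0
        self._rng = random.Random(seed)

    def insert(self, t):
        entry = self.counters.get(t)
        if entry is None:
            # t is not in the sample: it joins with probability q.
            if self._rng.random() < self.q:
                self.counters[t] = [1, 1]
            return
        entry[1] += 1  # Y counts every insertion into R since t joined
        if self._rng.random() < self.q:
            entry[0] += 1

    def delete(self, t):
        entry = self.counters.get(t)
        if entry is None:
            return  # t is not sampled, nothing to maintain
        x, y = entry
        if y == 1:
            # Assumption: the only tracked copy is deleted, so t leaves S.
            del self.counters[t]
            return
        # Remove a sampled copy with probability (X - 1) / (Y - 1).
        if self._rng.random() < (x - 1) / (y - 1):
            entry[0] -= 1
        entry[1] -= 1
```

Neither method reads N(t); only the local pair (X(t), Y(t)) is used.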


A Correctness Proof

i       Ri                Si
i*-1    {t t}             { }
i*      {t t t}           {t}
i*+1    {t t t t}         {t t}
…       …                 …
j       {t t t t … t}     {t t … t}

Row j: Yj - 1 items are marked in Rj and Xj - 1 items in Sj.


A Correctness Proof

$$P(X_j = k \mid Y_j = m) = P(X_j - 1 = k - 1 \mid Y_j - 1 = m - 1) = B(k-1; m-1, q)$$

i       Ri                Si
i*-1    {t t}             { }
i*      {t t t}           {t}
i*+1    {t t t t}         {t t}
…       …                 …
j       {t t t t … t}     {t t … t}

Row j: Yj - 1 items are marked (red) in Rj and Xj - 1 items in Sj.

Red sample obtained from red dataset via naïve algorithm, hence Bern(q)


Proof (Continued)

• Can show (by induction):
  $$P(Y_j = m) = \begin{cases} (1-q)^{N_j} & \text{if } m = 0 \\ q(1-q)^{N_j - m} & \text{otherwise} \end{cases}$$
  – Intuitive when insertions only (Nj = j)

• Uncondition on Yj to finish proof:
  $$P(X_j = k) = \sum_{m} P(X_j = k \mid Y_j = m)\, P(Y_j = m)$$
  where P(Xj = k | Yj = m) = B(k-1; m-1, q) by previous slide
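The unconditioning step can be written out explicitly. The sketch below writes N for N_j, plugs in P(Y_j = m) = q(1-q)^{N-m} for m ≥ 1 and B(k-1; m-1, q) for the conditional probability, and uses the hockey-stick identity \sum_{m=k}^{N} \binom{m-1}{k-1} = \binom{N}{k}, which is not stated on the slide.

```latex
% For k >= 1 only the terms m = k, ..., N contribute.
\begin{align*}
P(X_j = k)
  &= \sum_{m=k}^{N} \binom{m-1}{k-1} q^{k-1} (1-q)^{m-k} \cdot q\,(1-q)^{N-m} \\
  &= q^{k} (1-q)^{N-k} \sum_{m=k}^{N} \binom{m-1}{k-1} \\
  &= \binom{N}{k} q^{k} (1-q)^{N-k} = B(k; N, q), \\
P(X_j = 0) &= P(Y_j = 0) = (1-q)^{N} = B(0; N, q).
\end{align*}
```

So X_j is Binomial(N_j, q), which is the Bern(q) property required of the sample.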


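Frequency Estimation

• Naïve (Horvitz-Thompson) unbiased estimator:
  $$\hat{N}_X = \frac{X_i}{q}$$

• Exploit tracking counter:
  $$\hat{N}_Y = \begin{cases} Y_i - 1 + \frac{1}{q} & \text{if } Y_i > 0 \\ 0 & \text{if } Y_i = 0 \end{cases}$$

• Theorem:
  $$E[\hat{N}_Y] = N_i \quad \text{and} \quad V[\hat{N}_Y] \le V[\hat{N}_X]$$

• Can extend to other aggregates (see paper)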
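Both estimators are one-liners, and the unbiasedness of the tracking-counter estimator can be checked exactly for small cases. The function names below are hypothetical, and the exact check assumes the distribution P(Y = m) = q(1-q)^(n-m) for m ≥ 1 from the proof slide.

```python
from fractions import Fraction


def ht_estimate(x, q):
    """Horvitz-Thompson estimate of N(t) from the sample frequency X(t)."""
    return x / q


def tracking_estimate(y, q):
    """Estimate of N(t) from the tracking counter Y(t); 0 if t is not sampled."""
    return 0 if y == 0 else y - 1 + 1 / q


def expected_tracking_estimate(n, q):
    """Exact expectation of tracking_estimate when N(t) = n.

    Sums over m = 1..n with P(Y = m) = q(1-q)^(n-m); the m = 0 term is zero.
    """
    q = Fraction(q)
    return sum(
        q * (1 - q) ** (n - m) * tracking_estimate(m, q) for m in range(1, n + 1)
    )
```

For example, `expected_tracking_estimate(7, Fraction(1, 4))` evaluates the sum in exact rational arithmetic, so the comparison with N(t) = 7 involves no rounding.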


Estimating Distinct-Value Counts

• If usual DV estimators unavailable (BH+07)
• Obtain S’ from S: insert t ∈ D(S) with probability
  $$p(t) = \begin{cases} 1 & \text{if } Y(t) = 1 \\ q & \text{if } Y(t) > 1 \end{cases}$$
• Can show: P(t ∈ S’) = q for t ∈ D(R)
• HT unbiased estimator: $\hat{D}_{HT} = |S'| / q$
• Improve via conditioning (Var[E[U|V]] ≤ Var[U]):
  $$\hat{D}_Y = E[\hat{D}_{HT} \mid S] = \sum_{t \in D(S)} p(t) / q$$
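The conditioned estimator needs only the tracking counters of the sampled items. A small sketch, assuming the sample is available as a mapping from each distinct sampled item to Y(t); the function names are hypothetical.

```python
def inclusion_probability(y, q):
    """p(t): 1 if the tracking counter Y(t) equals 1, otherwise q."""
    return 1.0 if y == 1 else q


def distinct_estimate(tracking_counters, q):
    """Sum of p(t) / q over the distinct items t in the sample.

    tracking_counters maps each t in D(S) to Y(t) >= 1 (assumed layout).
    """
    return sum(inclusion_probability(y, q) for y in tracking_counters.values()) / q
```

With counters `{"a": 1, "b": 3, "c": 1}` and q = 0.5, the estimate is (1 + 0.5 + 1) / 0.5 = 5.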


Subsampling

• Why needed?
  – Sample is too large
  – For merging

• Challenge:
  – Generate statistically correct tracking-counter value Y’

• New algorithm
  – See paper

[Figure: dataset R, a Bern(q) sample of R, and a smaller Bern(q’) sample of R, with q’ < q.]


Merging

• Easy case
  – Set sampling or no further maintenance
  – S = S1 ⊎ S2

• Otherwise:
  – If R1 ∩ R2 ≠ Ø and 0 < q < 1, then there exists no statistically correct merging algorithm

[Figure: samples S1 and S2 are drawn from datasets R1 and R2; merging yields a sample S of R = R1 ⊎ R2.]
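For the easy case only (no further maintenance), merging amounts to adding the sample frequencies. The sketch below assumes samples are stored as `Counter` objects and ignores tracking counters entirely; `merge_samples` is a hypothetical name.

```python
from collections import Counter


def merge_samples(s1, s2):
    """Multiset sum of two Bern(q) samples drawn with the same q."""
    return Counter(s1) + Counter(s2)
```

With sample contents such as {a, a, c} and {a, d, b}, this gives `merge_samples(Counter("aac"), Counter("adb"))`, a multiset with three copies of a and one each of b, c, and d. It does not produce tracking counters for the merged sample, in line with the negative result above.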


Related Work on Multiset Sampling

• Gibbons and Matias [1998]
  – Concise samples: maintain Xi(t), handles inserts only
  – Counting samples: maintain Yi(t), compute Xi(t) on demand
  – Frequency estimator for hot items: Yi(t) – 1 + 0.418 / q
    • Biased, higher mean-squared error than new estimator

• Distinct-item sampling [CMR05,FIS05,Gi01]
  – Simple random sample of (t, N(t)) pairs
  – High space, time overhead




Backup Slides


Subsampling

• Easy case (no further maintenance)
  – Take Bern(q*) subsample, where q* = q’ / q
  – Actually, just generate X’ directly as Binomial(X, q*)

• Hard case (to continue maintenance)
  – Must also generate new tracking-counter value Y’

• Approach: generate X’ then Y’ | {X’,Y}
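The easy case reduces to one binomial draw per item. A minimal sketch, generating the binomial variate as a sum of Bernoulli trials to stay within the standard library; `subsample_frequency` and its parameters are names assumed here.

```python
import random


def subsample_frequency(x, q, q_new, rng):
    """Draw X' ~ Binomial(x, q*) with q* = q_new / q, for 0 <= q_new <= q."""
    q_star = q_new / q
    # Each of the x sampled copies is retained independently with prob. q*.
    return sum(1 for _ in range(x) if rng.random() < q_star)
```

Applying this to every X(t) in a Bern(q) sample yields a Bern(q’) sample, but without valid tracking counters, so the result cannot be maintained further.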


Subsampling: The Hard Case

• Generate X’ = Φ + Ψ
  – Φ = 1 iff item included in S at time i* is retained
    • P(Φ = 1) = 1 - P(Φ = 0) = q*
  – Ψ is # of other items in S that are retained in S’
    • Ψ is Binomial(X-1, q*)

• Generate Y’ according to “correct” distribution
  – P(Y’ = m | X’, Y, Φ)


Subsampling (Continued)

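$$P(Y' = m \mid X', Y, \Phi = 0) = \frac{X'}{m} \prod_{i=m+1}^{Y-1} \left(1 - \frac{X'}{i}\right) \quad \text{when } X' > 0$$

$$P(Y' = m \mid X', Y, \Phi = 1) = \begin{cases} 1 & \text{if } m = Y \\ 0 & \text{otherwise} \end{cases}$$

Generate Y - Y’ using acceptance/rejection (Vitter 1984)

[Figure: example over transactions i = 1, …, 6. Original sample: Y6 = 5, X6 = 3. Subsample with Φ = 1: Y6’ = 5, X6’ = 2. Subsample with Φ = 0: Y6’ = 3, X6’ = 2.]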
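The hard case can be sketched for a single item as follows. Note the substitution: instead of the acceptance/rejection method cited on the slide, this sketch draws Y’ by a plain sequential search that stops at m with probability X’/m, which reproduces the product-form distribution above. The function name, the variable names `phi` and `psi`, and the convention of returning (0, 0) when the item drops out are assumptions.

```python
import random


def subsample_counters(x, y, q, q_new, rng):
    """Return (X', Y') for one item with counters 1 <= x <= y.

    Thins a Bern(q) sample to Bern(q_new), 0 <= q_new <= q (sketch only).
    """
    q_star = q_new / q
    # phi: is the copy that made the item join the sample retained?
    phi = 1 if rng.random() < q_star else 0
    # psi: number of the other x - 1 sampled copies that are retained.
    psi = sum(1 for _ in range(x - 1) if rng.random() < q_star)
    x_new = phi + psi
    if x_new == 0:
        return 0, 0  # item leaves the sample
    if phi == 1:
        return x_new, y  # tracking counter is unchanged
    # phi == 0: scan m = y-1, y-2, ... and stop at m with probability x_new/m.
    for m in range(y - 1, 0, -1):
        if rng.random() < x_new / m:
            return x_new, m
    raise AssertionError("unreachable: the scan stops at m == x_new at the latest")
```

The sequential scan costs O(Y) per item in the worst case, which is presumably why a faster generation method is cited on the slide.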


Related Work on Set-Based Sampling

• Methods that access dataset
  – CAR, CARWOR [OR86], backing samples [GMP02]

• Methods (for bounded samples) that do not access dataset
  – Reservoir sampling [FMR62, Vi85] (inserts only)
  – Stream sampling [BDM02] (sliding window)
  – Random pairing [GLH06] (resizing also discussed)


More Motivation: A Sample Warehouse

[Figure: a full-scale warehouse of data partitions; each partition is sampled, giving a warehouse of samples S1,1, S1,2, …, Sn,m; samples are merged into S*,*, S1-2,3-7, etc.]