General Database Statistics Using Maximum Entropy Raghav Kaushik 1 , Christopher Ré 2 , and Dan Suciu 3 1 Microsoft Research 2 University of Wisconsin--Madison 3 University of Washington


Page 1: General Database  Statistics  Using  Maximum Entropy

General Database Statistics Using Maximum Entropy

Raghav Kaushik (1), Christopher Ré (2), and Dan Suciu (3)

(1) Microsoft Research
(2) University of Wisconsin--Madison
(3) University of Washington

Page 2: General Database  Statistics  Using  Maximum Entropy

2

Study Cardinality Estimation

1. Model: Information that the optimizer knows

Propose a declarative language with statistical assertions, e.g. “We estimate that distinct # of Employees is 10”

2. Prediction: use the model to estimate the cardinality of future queries

Contribution: A principled, declarative approach to cardinality estimation based on Entropy Maximization.

Page 3: General Database  Statistics  Using  Maximum Entropy

3

Motivating Applications

1. Incorporate query feedback records

Underutilized: No general purpose mechanism

2. Optimizers for new domains (DB Kit 2.0)

Cloud Computing, Information Extraction

3. Data generation and description

Page 4: General Database  Statistics  Using  Maximum Entropy

4

Outline

• Statistical programs and desiderata

• Semantics of Statistical Programs

• Two examples

• Conclusions

Page 5: General Database  Statistics  Using  Maximum Entropy

5

Statistical Assertions

An assertion is a conjunctive query (CQ) view plus a sharp (#) statement:

V1(x) :- R(x,-)    #V1 = 20

“The number of values in the output of V1 is 20”

V2(y) :- R(-,y),S(y)    #V2 = 50

“The number of values in the output of V2 is 50”

A program is a set of assertions:

V(x) :- R(x,y), ….    #V = 106
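To make the # notation concrete, the sketch below evaluates the two views on a single fixed instance. The relation contents and the helper names `count_v1` and `count_v2` are invented for illustration; they are not from the slides, and the counts they give are unrelated to the 20 and 50 asserted above.

```python
# Hypothetical toy instance: the tuples below are made up for illustration.
R = {("a", 1), ("a", 2), ("b", 2), ("c", 3)}
S = {(2,), (3,), (4,)}

def count_v1(R):
    """#V1 for V1(x) :- R(x,-): number of distinct first components of R."""
    return len({x for (x, _) in R})

def count_v2(R, S):
    """#V2 for V2(y) :- R(-,y), S(y): distinct second components of R that are in S."""
    s_values = {y for (y,) in S}
    return len({y for (_, y) in R if y in s_values})

print(count_v1(R), count_v2(R, S))
```

A statistical assertion such as #V1 = 20 then states what this count should be, without storing the instance itself.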

Page 6: General Database  Statistics  Using  Maximum Entropy

6

Model as a Probabilistic Database

Intuitively, # is “Expected Value”

V1(x) :- R(x,-)    #V1 = 20

“The number of values in the output of V1 is 20”

A model is a probabilistic database s.t. the expected number of tuples in V1 is 20.

OK, but which probabilistic database?

Page 7: General Database  Statistics  Using  Maximum Entropy

7

Desiderata for our solution

• Two desiderata for the distribution:
  – (D1): Should agree with provided statistics
  – (D2): Should assume nothing else

Approach: maximize entropy subject to D1

Challenge: Compute params of MaxEnt Distribution

Technical Desideratum: want params analytically

Page 8: General Database  Statistics  Using  Maximum Entropy

8

Outline

• Statistical programs and desiderata

• Semantics of Statistical Programs

• Two examples

• Conclusions

Page 9: General Database  Statistics  Using  Maximum Entropy

9

Notation for Probabilistic Databases

• Consider a domain D of size n.
• Fix a schema R = R1, R2, …
• Let Inst(n) = all instances over R on D.
• An element I of Inst(n) is called a world.

A probabilistic database is a pair (Inst(n), p), where

$$p : \mathrm{Inst}(n) \to [0,1] \quad \text{s.t.} \quad \sum_{I \in \mathrm{Inst}(n)} p(I) = 1$$

Essentially, any discrete probability distribution on relations

Page 11: General Database  Statistics  Using  Maximum Entropy

11

The semantics of #

V1(x) :- R(x,-)    #V1 = 20

“The number of values in the output of V1 is 20”

# means “expected value”:

$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,\lvert V_1^I \rvert = 20$$

Achieving (D1): Stats must agree

NB: In truth, we let n tend to infinity, and settle for asymptotically equal.
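The expected-value reading of # can be checked by brute force on a tiny domain. The sketch below assumes a single binary relation R over a domain of size n and, purely as an example distribution, a tuple-independent p in which each tuple is present with probability q. The function names and the choice of distribution are illustrative assumptions, not part of the slides.

```python
from itertools import combinations, product

def worlds(n):
    """Yield every instance (world) of a binary relation R over domain {0..n-1}."""
    tuples = list(product(range(n), repeat=2))
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

def v1_size(world):
    """|V1^I| for V1(x) :- R(x,-)."""
    return len({x for (x, _) in world})

def tuple_independent(n, q):
    """Example distribution (an assumption): each of the n*n tuples is present w.p. q."""
    m = n * n
    return lambda world: q ** len(world) * (1 - q) ** (m - len(world))

def expected_v1(n, p):
    """Sum over all worlds of p(I) * |V1^I|, i.e. the value that #V1 constrains."""
    return sum(p(world) * v1_size(world) for world in worlds(n))

n, q = 2, 0.5
p = tuple_independent(n, q)
total_mass = sum(p(world) for world in worlds(n))
print(total_mass, expected_v1(n, p))
```

Enumeration is only feasible for very small n, since Inst(n) has 2^(n*n) worlds here; that is one reason analytic solutions for the parameters matter.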

Page 12: General Database  Statistics  Using  Maximum Entropy

12

Multiple Views

• Given V1, V2, … with #Vi = di for i = 1,…,t:

$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,\lvert V_i^I \rvert = d_i \quad \text{for } i = 1,\dots,t$$

If p satisfies these equations, we’ve achieved (D1): Should agree with provided statistics

Achieving (D1): Stats must agree

Many such distributions exist. How do we pick one?

Page 13: General Database  Statistics  Using  Maximum Entropy

13

Selecting the best one

• Maximize Entropy subject to constraints:

$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$

One can show that p has the following form:

$$p(I) = \frac{1}{Z}\prod_{i=1}^{t} \alpha_i^{\lvert V_i^I \rvert}$$

Z is a normalizing constant and $\alpha_i$ is a positive parameter for i = 1,…,t.

Achieving (D2): No ad-hoc assumptions

NB: p is only a function of the stats, and so we have achieved (D2)
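The product form can be recovered with the standard Lagrange-multiplier argument. The derivation below is a sketch, not taken from the slides: the multipliers $\lambda_i$ and $\mu$ are names introduced here, and $\alpha_i$ denotes the positive parameters of the distribution.

```latex
\begin{aligned}
\mathcal{L}(p) &= -\sum_{I} p(I)\ln p(I)
  + \sum_{i=1}^{t} \lambda_i\Big(\sum_{I} p(I)\,\lvert V_i^I\rvert - d_i\Big)
  + \mu\Big(\sum_{I} p(I) - 1\Big) \\
0 = \frac{\partial \mathcal{L}}{\partial p(I)}
  &= -\ln p(I) - 1 + \sum_{i=1}^{t} \lambda_i \lvert V_i^I\rvert + \mu \\
\Rightarrow\quad p(I) &= e^{\mu - 1}\prod_{i=1}^{t} e^{\lambda_i \lvert V_i^I\rvert}
  = \frac{1}{Z}\prod_{i=1}^{t} \alpha_i^{\lvert V_i^I\rvert},
  \qquad \alpha_i = e^{\lambda_i} > 0,\quad Z = e^{1-\mu}.
\end{aligned}
```

The parameters $\alpha_i$ are then fixed by substituting this form back into the constraints $\#V_i = d_i$.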

Page 15: General Database  Statistics  Using  Maximum Entropy

15

Benefits of MaxEnt

A statistical program

$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$

induces the distribution

$$p(I) = \frac{1}{Z}\prod_{i=1}^{t} \alpha_i^{\lvert V_i^I \rvert}$$

• Every (consistent) statistical program induces a well-defined distribution
  – Every query has a well-defined cardinality estimate
• Statistics as a whole, not as individual stats.
• Can add new statistics to our heart’s content

Technical Challenge: compute $\alpha_i$ analytically

Page 16: General Database  Statistics  Using  Maximum Entropy

16

Outline

• Statistical programs and desiderata

• Semantics of Statistical Programs

• Two examples

• Conclusions

Page 17: General Database  Statistics  Using  Maximum Entropy

17

Two quick Examples

• I: A material random Graph
  – Even simple EM solutions have interesting theory

• II: Intersection Models
  – Generating function, and
  – Different, analytic technique

Page 18: General Database  Statistics  Using  Maximum Entropy

18

Example I: Random Graphs are EM

V(x,y) :- R(x,y)    #V = d

Random Graph: Add edges independently at random

Let $v = \lvert V^I \rvert$ and $x = d/n^2$:

$$p(I) = x^{v}(1-x)^{n^2 - v}$$

By Linearity, $E[\lvert V \rvert] = x n^2 = d$

This is MaxEnt… write:

$$p(I) = \frac{1}{Z}\,\alpha^{v}, \qquad \alpha = \frac{x}{1-x}, \qquad Z = (1-x)^{-n^2}$$
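The claim that the random graph is the MaxEnt distribution can be sanity-checked numerically. The sketch below assumes the parametrization alpha = x/(1-x) and Z = (1-x)^(-n^2) with x = d/n^2, and groups worlds by their edge count v (there are C(n^2, v) worlds with v edges). The function names are illustrative.

```python
from math import comb

def random_graph_maxent(n, d):
    """Return (x, alpha, Z) for the single-assertion program #V = d (requires d < n*n)."""
    m = n * n                  # number of possible edges (ordered pairs)
    x = d / m                  # per-edge probability
    alpha = x / (1 - x)        # MaxEnt parameter
    Z = (1 - x) ** (-m)        # normalizing constant
    return x, alpha, Z

def check(n, d):
    """Total probability mass and expected edge count under p(I) = alpha**v / Z."""
    m = n * n
    _, alpha, Z = random_graph_maxent(n, d)
    total = sum(comb(m, v) * alpha ** v / Z for v in range(m + 1))
    expected = sum(v * comb(m, v) * alpha ** v / Z for v in range(m + 1))
    return total, expected

total, expected = check(n=3, d=2.0)
print(total, expected)
```

The total mass comes out as 1 because the sum of C(m, v) * alpha^v over v equals (1 + alpha)^m, which is exactly Z.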

Page 21: General Database  Statistics  Using  Maximum Entropy

21

Example II: an intersection model

V(x) :- R1(x), R2(x)    #R1 = d1, #R2 = d2, #V = d3

$$Z_n = (1 + x_1 + x_2 + x_1 x_2 x_3)^n$$

Read: Each element is in neither relation (term 1), only in R1 (term $x_1$), only in R2 (term $x_2$), or in all three of R1, R2, and V (term $x_1 x_2 x_3$).

e.g., a term with $x_1^k$ is an instance where k distinct values are in R1.

The key identity is

$$x \frac{d}{dx} x^k = k x^k$$

so each expected count is obtained by differentiating the generating function:

$$d_3 = \frac{x_3}{Z_n}\frac{\partial Z_n}{\partial x_3} = \frac{n\,x_1 x_2 x_3}{1 + x_1 + x_2 + x_1 x_2 x_3}$$

$$d_i = \frac{x_i}{Z_n}\frac{\partial Z_n}{\partial x_i} = \frac{n\,(x_i + x_1 x_2 x_3)}{1 + x_1 + x_2 + x_1 x_2 x_3} = \frac{n\,x_i}{1 + x_1 + x_2 + x_1 x_2 x_3} + d_3 \quad \text{for } i = 1, 2$$

Solving asymptotically, as $n \to \infty$ (the denominator tends to 1):

$$x_i \approx \frac{d_i - d_3}{n} \quad \text{for } i = 1, 2, \qquad x_3 \approx \frac{d_3}{n\,x_1 x_2}$$
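The parameter formulas for the intersection model can be checked against the generating function directly. In the sketch below, `expected_counts` evaluates d_i = (x_i / Z_n) * dZ_n/dx_i in closed form, and `asymptotic_params` uses the large-n formulas. `exact_params` is an exact finite-n solution worked out here by solving the same three equations; it is an addition for comparison, not something stated on the slides. All function names are illustrative, and the inputs must satisfy d1 > d3 and d2 > d3.

```python
def expected_counts(n, x1, x2, x3):
    """Expected (#R1, #R2, #V) implied by Z_n = (1 + x1 + x2 + x1*x2*x3)**n."""
    both = x1 * x2 * x3                 # weight of "element in R1, R2 and V"
    base = 1 + x1 + x2 + both           # per-element generating function
    return (n * (x1 + both) / base, n * (x2 + both) / base, n * both / base)

def asymptotic_params(n, d1, d2, d3):
    """Large-n solution: x_i = (d_i - d3)/n for i = 1, 2 and x3 = d3/(n*x1*x2)."""
    x1 = (d1 - d3) / n
    x2 = (d2 - d3) / n
    x3 = d3 / (n * x1 * x2)
    return x1, x2, x3

def exact_params(n, d1, d2, d3):
    """Finite-n solution (derived for this sketch) of the same three equations."""
    c = n - d1 - d2 + d3                # expected number of elements in neither relation
    x1 = (d1 - d3) / c
    x2 = (d2 - d3) / c
    x3 = d3 * c / ((d1 - d3) * (d2 - d3))
    return x1, x2, x3

targets = (20.0, 50.0, 5.0)
print(expected_counts(10**6, *asymptotic_params(10**6, *targets)))
print(expected_counts(1000, *exact_params(1000, *targets)))
```

With these example targets, the asymptotic parameters reproduce the statistics closely at n = 10^6 but are visibly off at n = 1000, which matches the slides' remark that equality is only asymptotic.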

Page 27: General Database  Statistics  Using  Maximum Entropy

27

Results in the paper

• Normal Form for statistical programs

• Syntactic classes that we can solve analytically
  – “Project-Semijoin” queries (previous slide)

• A general technique, conditioning:
  – Start with tuple independent prior, and condition
  – Introduces inclusion constraints

• Extensions to handle histograms

Page 28: General Database  Statistics  Using  Maximum Entropy

28

Conclusion

• Showed a principled, general model for database statistics based on MaxEnt

• Analytically solved syntactic classes of statistics

• Applications: Query Feedback and the Cloud