General Database Statistics Using Maximum Entropy

Raghav Kaushik¹, Christopher Ré², and Dan Suciu³

¹Microsoft Research
²University of Wisconsin-Madison
³University of Washington

2

Study Cardinality Estimation

1. Model: the information that the optimizer knows

2. Prediction: use the model to estimate the cardinality of future queries

Contribution: A principled, declarative approach to cardinality estimation based on Entropy Maximization.

Propose a declarative language with statistical assertions, e.g., “We estimate that the number of distinct Employees is 10”

3

Motivating Applications

1. Incorporate query feedback records

2. Optimizers for new domains (DB Kit 2.0)

Cloud Computing, Information Extraction

3. Data generation and description

Underutilized: No general-purpose mechanism

4

Outline

• Statistical programs and desiderata

• Semantics of Statistical Programs

• Two examples

• Conclusions

5

Statistical Assertions

An assertion is a CQ view plus a sharp (#) statement:

V1(x) :- R(x,-)     #V1 = 20

“The number of values in the output of V1 is 20”

V2(y) :- R(-,y), S(y)     #V2 = 50

“The number of values in the output of V2 is 50”

A program is a set of assertions

V(x) :- R(x,y), …. #V= 106
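To make the notation concrete, here is a minimal sketch of the two views evaluated on one fixed instance. The relation contents are made up for illustration, and the helper names (`v1`, `v2`, `sharp`) are hypothetical. On a single instance the quantity that # talks about is simply the size of the view's output; an assertion such as #V1 = 20 then constrains that size.

```python
# Hypothetical toy instance; the tuples are invented for illustration only.
R = {(1, "a"), (1, "b"), (2, "b"), (3, "c")}
S = {("b",), ("c",), ("d",)}

def v1(R):
    """V1(x) :- R(x,-): project R onto its first column."""
    return {x for (x, _) in R}

def v2(R, S):
    """V2(y) :- R(-,y), S(y): second column of R, kept only if it is in S."""
    s_values = {y for (y,) in S}
    return {y for (_, y) in R if y in s_values}

def sharp(view_output):
    """Size of a view's output: the number of distinct values it returns."""
    return len(view_output)

print(sharp(v1(R)))      # 3  (values 1, 2, 3)
print(sharp(v2(R, S)))   # 2  (values "b", "c")
```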

6

Model as a Probabilistic Database

Intuitively, # means “expected value”

V1(x) :- R(x,-)     #V1 = 20

A model is a probabilistic database (pdb) such that the expected number of tuples in V1 is 20.

OK, but which pdb?


7

Desiderata for our solution

• Two desiderata for the distribution:
  (D1): Should agree with provided statistics
  (D2): Should assume nothing else

Approach: maximize entropy subject to D1

Challenge: Compute the parameters of the MaxEnt distribution

Technical Desideratum: we want the parameters analytically (in closed form)


10

Notation for Probabilistic Databases

• Consider a domain D of size n.
• Fix a schema R = R1, R2, …
• Let Inst(n) = all instances over R on D.
• An element I of Inst(n) is called a world.

A probabilistic database is a pair (Inst(n), p), where

$$p : \mathrm{Inst}(n) \to [0,1] \quad \text{s.t.} \quad \sum_{I \in \mathrm{Inst}(n)} p(I) = 1$$

Essentially, any discrete probability distribution on relations.

11

The semantics of #

V1(x) :- R(x,-)     #V1 = 20

# means “expected value”:

$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,|V_1^I| = 20$$

Achieving (D1): Stats must agree

NB: In truth, we let n tend to infinity, and settle for asymptotically equal.
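The expected-value reading of # can be checked by brute force on a tiny domain. The sketch below assumes a single binary relation R over the domain {0, ..., n-1} and, purely as an illustrative choice of p, the uniform distribution over all worlds; the function names are hypothetical.

```python
from itertools import chain, combinations, product

def all_instances(n):
    """Every instance of a binary relation R over the domain {0, ..., n-1}."""
    tuples = list(product(range(n), repeat=2))
    subsets = chain.from_iterable(
        combinations(tuples, k) for k in range(len(tuples) + 1))
    return [frozenset(s) for s in subsets]

def v1_size(instance):
    """|V1^I| for V1(x) :- R(x,-)."""
    return len({x for (x, _) in instance})

def expected_v1(n, p_of):
    """Sum over all worlds I of p(I) * |V1^I|."""
    return sum(p_of(I) * v1_size(I) for I in all_instances(n))

n = 2
worlds = all_instances(n)                 # 2^(n*n) = 16 worlds
uniform = lambda I: 1.0 / len(worlds)     # assumed p: uniform over worlds
print(expected_v1(n, uniform))            # 1.5
```

Under the uniform distribution each tuple is present with probability 1/2, so each of the two domain values appears in V1 with probability 3/4, which gives the expected size 1.5. A pdb that satisfied #V1 = 20 would be one for which this sum equals 20.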


12

Multiple Views

• Given V1, V2, … with #Vi = di for i = 1,…,t:

$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,|V_i^I| = d_i \quad \text{for } i = 1,\dots,t$$

If p satisfies these equations, we’ve achieved (D1): Should agree with provided statistics

Many such distributions exist. How do we pick one?

Achieving (D1): Stats must agree

14

Selecting the best one

• Maximize Entropy subject to constraints:

$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$

Achieving (D2): No ad-hoc assumptions

One can show that p has the following form:

$$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|}$$

Z is a normalizing constant and α_i is a positive parameter for i = 1,…,t

NB: p is only a function of the stats, and so we have achieved (D2)
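For a domain small enough to enumerate, the parameter of the MaxEnt form can be found numerically. The sketch below is a brute-force illustration, not the analytic technique the talk is after: it assumes a single view V1(x) :- R(x,-), weights each world by α raised to the view size, and uses bisection (a swapped-in numerical method) to find the α that meets a target statistic d. All function names are hypothetical.

```python
from itertools import combinations, product

def view_sizes(n):
    """|V1^I| for every instance I of a binary relation R over {0..n-1},
    where V1(x) :- R(x,-)."""
    tuples = list(product(range(n), repeat=2))
    return [len({x for (x, _) in world})
            for k in range(len(tuples) + 1)
            for world in combinations(tuples, k)]

def expected_size(alpha, sizes):
    """E[|V1|] under p(I) = alpha^{|V1^I|} / Z."""
    weights = [alpha ** s for s in sizes]
    Z = sum(weights)                      # normalizing constant
    return sum(w * s for w, s in zip(weights, sizes)) / Z

def solve_alpha(d, sizes, lo=1e-9, hi=1e9, iters=200):
    """Bisection on a log scale; E[|V1|] is increasing in alpha."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5            # geometric midpoint
        if expected_size(mid, sizes) < d:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

sizes = view_sizes(2)                     # 16 worlds over a 2-element domain
alpha = solve_alpha(1.0, sizes)           # target statistic: #V1 = 1
print(alpha, expected_size(alpha, sizes))
```

This case can be checked by hand: for n = 2 the normalizing constant is Z = (1 + 3α)² and the expected size is 6α/(1 + 3α), so the target d = 1 gives α = 1/3. Enumeration is exponential in the domain size, which is why closed-form parameters matter.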

15

Benefits of MaxEnt

A statistical program

$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$

induces the distribution

$$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|}$$

• Every (consistent) statistical program induces a well-defined distribution
  – Every query has a well-defined cardinality estimate
• Statistics as a whole, not as individual stats.
• Can add new statistics to our heart’s content

Technical Challenge: compute α_i analytically

17

Two quick Examples

• I: A material random Graph
  – Even simple EM solutions have interesting theory

• II: Intersection Models
  – Generating function, and
  – Different, analytic technique

20

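Example I: Random Graphs are EM

V(x,y) :- R(x,y)     #V = d

Random Graph: Add edges independently at random

$$p(I) = x^{v} (1 - x)^{n^2 - v}, \quad \text{where } v = |V^I| \text{ and } x = d/n^2$$

By linearity, E[|V|] = x n² = d

This is MaxEnt… write:

$$p(I) = \frac{1}{Z}\,\alpha^{v}, \qquad \alpha = \frac{x}{1 - x}, \qquad Z = (1 - x)^{-n^2}$$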
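The claim that the independent-edge random graph has the MaxEnt form can be checked numerically. The sketch below assumes the reading of the slide in which each of the n² possible edges is present with probability x = d/n², α = x/(1 - x), and Z = (1 - x)^(-n²); the function names and the sample values of n and d are hypothetical.

```python
from math import comb, isclose

def p_independent(v, n, x):
    """Probability of one particular graph with v edges, edges independent."""
    return x ** v * (1 - x) ** (n * n - v)

def p_maxent(v, n, x):
    """The same probability written in MaxEnt form, alpha^v / Z."""
    alpha = x / (1 - x)
    Z = (1 - x) ** (-(n * n))
    return alpha ** v / Z

n, d = 3, 2.0
x = d / n ** 2

# The two expressions agree for every possible edge count v.
same = all(isclose(p_independent(v, n, x), p_maxent(v, n, x))
           for v in range(n * n + 1))

# There are C(n^2, v) graphs with v edges; summing gives E[|V|].
expected = sum(comb(n * n, v) * p_independent(v, n, x) * v
               for v in range(n * n + 1))
print(same, expected)
```

The expected edge count comes out equal to d (up to floating-point error), matching the linearity argument on the slide.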

21

Example II: an intersection model

V(x) :- R1(x), R2(x)     #R1 = d1, #R2 = d2, #V = d3

$$Z_n = (1 + x_1 + x_2 + x_1 x_2 x_3)^n$$

Read: each element is in neither relation, in R1 only, in R2 only, or in all three (R1, R2, and V)

e.g., a term with x_1^k corresponds to an instance with k distinct values in R1
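The generating-function reading can be verified by enumeration on a small domain. The sketch below assumes unary relations R1 and R2 over {0, ..., n-1}, weights each instance by x1^|R1| · x2^|R2| · x3^|R1 ∩ R2|, and compares the total with the closed form (1 + x1 + x2 + x1·x2·x3)^n. The function names and the integer test values are hypothetical.

```python
from itertools import combinations

def subsets(domain):
    """All subsets of the domain, as sets."""
    return [set(c) for k in range(len(domain) + 1)
            for c in combinations(domain, k)]

def z_brute_force(n, x1, x2, x3):
    """Sum of instance weights over every pair (R1, R2) of unary relations."""
    domain = range(n)
    total = 0
    for R1 in subsets(domain):
        for R2 in subsets(domain):
            # V = R1 ∩ R2, so x3 is raised to the size of the intersection.
            total += x1 ** len(R1) * x2 ** len(R2) * x3 ** len(R1 & R2)
    return total

def z_closed_form(n, x1, x2, x3):
    """Each element independently contributes 1 + x1 + x2 + x1*x2*x3."""
    return (1 + x1 + x2 + x1 * x2 * x3) ** n

print(z_brute_force(3, 2, 3, 5), z_closed_form(3, 2, 3, 5))   # 46656 46656
```

Integer values of x1, x2, x3 are used only so that the comparison is exact; the parameters of an actual model would be positive reals.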

26


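Use the identity

$$x \frac{d}{dx} x^k = k x^k$$

V(x) :- R1(x), R2(x)     #R1 = d1, #R2 = d2, #V = d3

Write Z = 1 + x_1 + x_2 + x_1 x_2 x_3, so that Z_n = Z^n. Then:

$$d_3 = \frac{x_3}{Z_n} \frac{\partial Z_n}{\partial x_3} = \frac{n\, x_1 x_2 x_3}{Z}$$

$$d_i = \frac{x_i}{Z_n} \frac{\partial Z_n}{\partial x_i} = \frac{n\, x_i}{Z} + d_3 \quad \text{for } i = 1, 2$$

Solving asymptotically (for fixed d_i, Z tends to 1 as n tends to infinity):

$$x_i = \frac{d_i - d_3}{n} \quad \text{for } i = 1, 2, \qquad x_3 = \frac{d_3}{n\, x_1 x_2}$$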
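The equations for Example II can also be solved exactly and checked with rational arithmetic. The sketch below is derived here from the generating function Z_n = (1 + x1 + x2 + x1·x2·x3)^n rather than copied from the slide: the expected number of elements in neither relation is n/Z = n - d1 - d2 + d3, which fixes Z and then each parameter. It assumes d3 < d1, d3 < d2, and d1 + d2 - d3 < n; the function names and sample statistics are hypothetical.

```python
from fractions import Fraction

def solve_intersection(n, d1, d2, d3):
    """Exact x1, x2, x3 such that E|R1| = d1, E|R2| = d2, E|V| = d3."""
    neither = Fraction(n - d1 - d2 + d3)   # expected count in neither relation
    x1 = (d1 - d3) / neither
    x2 = (d2 - d3) / neither
    x3 = d3 * neither / ((d1 - d3) * (d2 - d3))
    return x1, x2, x3

def expected_sizes(n, x1, x2, x3):
    """(E|R1|, E|R2|, E|V|) obtained from x * dZ_n/dx / Z_n."""
    Z = 1 + x1 + x2 + x1 * x2 * x3
    both = n * x1 * x2 * x3 / Z            # elements in R1, R2, and V
    return n * x1 / Z + both, n * x2 / Z + both, both

x1, x2, x3 = solve_intersection(100, 20, 30, 5)
print(x1, x2, x3)                          # 3/11 5/11 11/15
print(expected_sizes(100, x1, x2, x3))
```

The recovered expected sizes are exactly (20, 30, 5). The asymptotic formulas x_i = (d_i - d_3)/n are the limit of these exact values when the d_i are small relative to n, since then n - d1 - d2 + d3 is close to n.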

27

Results in the paper

• Normal Form for statistical programs

• Syntactic classes that we can solve analytically
  – “Project-Semijoin” queries (previous slide)

• A general technique, conditioning:
  – Start with a tuple-independent prior, and condition
  – Introduces inclusion constraints

• Extensions to handle histograms

28

Conclusion

• Showed a principled, general model for database statistics based on MaxEnt

• Analytically solved syntactic classes of statistics

• Applications: Query Feedback and the Cloud
