General Database Statistics Using Maximum Entropy
Raghav Kaushik (1), Christopher Ré (2), and Dan Suciu (3)
(1) Microsoft Research, (2) University of Wisconsin-Madison, (3) University of Washington
Study Cardinality Estimation
1. Model: the information that the optimizer knows
Propose a declarative language with statistical assertions, e.g. “We estimate that the distinct # of Employees is 10”
2. Prediction: use the model to estimate the cardinality of future queries
Contribution: A principled, declarative approach to cardinality estimation based on Entropy Maximization.
Motivating Applications
1. Incorporate query feedback records
Underutilized: no general-purpose mechanism
2. Optimizers for new domains (DB Kit 2.0)
Cloud Computing, Information Extraction
3. Data generation and description
Outline
• Statistical programs and desiderata
• Semantics of Statistical Programs
• Two examples
• Conclusions
Statistical Assertions
An assertion is a conjunctive query (CQ) view plus a sharp (#) statement:
V1(x) :- R(x,-)
“The number of values in the output of V1 is 20”
#V1 = 20
V2(y) :- R(-,y),S(y)
“The number of values in the output of V2 is 50”
#V2 = 50
A program is a set of assertions
V(x) :- R(x,y), …. #V= 106
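As a rough illustration of what the two assertions count, the sketch below evaluates V1 and V2 on one concrete instance. The relation contents and the function names `v1` and `v2` are made up for this example and do not come from the talk.

```python
# Hypothetical toy instance; the tuples are invented for illustration only.
R = {(1, "a"), (1, "b"), (2, "b"), (3, "c")}
S = {("b",), ("c",), ("d",)}

def v1(R):
    """V1(x) :- R(x,-): project R onto its first column (set semantics)."""
    return {x for (x, _) in R}

def v2(R, S):
    """V2(y) :- R(-,y), S(y): second column of R, kept only if y is in S."""
    s_vals = {y for (y,) in S}
    return {y for (_, y) in R if y in s_vals}

# '#V' on a single instance is just the number of distinct output values.
count_v1 = len(v1(R))
count_v2 = len(v2(R, S))
print(count_v1, count_v2)
```

On a single instance, # is just a distinct count; a statistical assertion constrains that count without fixing the instance.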
Model as a Probabilistic Database
V1(x) :- R(x,-)
#V1 = 20
“The number of values in the output of V1 is 20”
Intuitively, # is “expected value”.
A model is a probabilistic database (pdb) such that the expected number of tuples in V1 is 20.
OK, but which pdb?
Desiderata for our solution
• Two desiderata for the distribution:
(D1): Should agree with provided statistics
(D2): Should assume nothing else
Approach: maximize entropy subject to D1
Challenge: compute the parameters of the MaxEnt distribution
Technical desideratum: want the parameters analytically
Notation for Probabilistic Databases
• Consider a domain D of size n.
• Fix a schema R = R1, R2, …
• Let Inst(n) = all instances over R on D
• An element I of Inst(n) is called a world
A probabilistic database is a pair (Inst(n), p), where
$$p : \mathrm{Inst}(n) \to [0,1] \quad \text{s.t.} \quad \sum_{I \in \mathrm{Inst}(n)} p(I) = 1$$
Essentially, any discrete probability distribution on relations
The semantics of #
V1(x) :- R(x,-)
#V1 = 20
“The number of values in the output of V1 is 20”
# means “expected value”:
$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,|V_1^I| = 20$$
Achieving (D1): Stats must agree
NB: In truth, we let n tend to infinity, and settle for asymptotic equality.
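To make the “expected value” reading of # concrete, here is a minimal sketch that enumerates every world of a binary relation R over a two-element domain and computes the expected size of V1(x) :- R(x,-). The uniform distribution `p` and the helper names are assumptions chosen for illustration; the talk leaves the distribution open at this point.

```python
from fractions import Fraction
from itertools import chain, combinations, product

def worlds(n):
    """All instances of one binary relation R over the domain {0, ..., n-1}."""
    tuples = list(product(range(n), repeat=2))
    subsets = chain.from_iterable(
        combinations(tuples, k) for k in range(len(tuples) + 1))
    return [frozenset(s) for s in subsets]

def v1(I):
    """V1(x) :- R(x,-) evaluated on world I."""
    return {x for (x, _) in I}

def expected_size(p, view):
    """Sum over worlds of p(I) * |view(I)|, i.e. the value that '#' constrains."""
    return sum(prob * len(view(I)) for I, prob in p.items())

inst = worlds(2)                                  # 2^4 = 16 worlds
p = {I: Fraction(1, len(inst)) for I in inst}     # uniform p: an assumption
e = expected_size(p, v1)
print(e)
```

Under the uniform distribution each domain value appears in V1 with probability 3/4, so the expectation is 3/2. A program asserting a different value of #V1 would rule this distribution out.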
Multiple Views
• Given V1, V2, … with #Vi = di for i=1,…,t:
$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,|V_i^I| = d_i \quad \text{for } i = 1,\dots,t$$
If p satisfies these equations, we’ve achieved (D1): should agree with provided statistics.
Many such distributions exist. How do we pick one?
Achieving (D1): Stats must agree
Selecting the best one
• Maximize entropy subject to the constraints:
$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$
One can show that p has the following form:
$$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|}$$
Z is a normalizing constant and $\alpha_i$ is a positive parameter for i=1,…,t.
NB: p is only a function of the stats, and so we have achieved (D2).
Achieving (D2): No ad-hoc assumptions
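The product form can be checked by brute force on a tiny case. The sketch below assumes a single view V1(x) :- R(x,-) over a binary relation on a two-element domain, weights each world by α^|V1|, and searches numerically for the α that gives a made-up target #V1 = 1. The bisection is only a stand-in for illustration, not the analytic technique of the talk, and all function names are hypothetical.

```python
from itertools import chain, combinations, product

def worlds(n):
    """All instances of one binary relation R over the domain {0, ..., n-1}."""
    tuples = list(product(range(n), repeat=2))
    subsets = chain.from_iterable(
        combinations(tuples, k) for k in range(len(tuples) + 1))
    return [frozenset(s) for s in subsets]

def v1(I):
    """V1(x) :- R(x,-) evaluated on world I."""
    return {x for (x, _) in I}

def maxent(alpha, inst):
    """p(I) = alpha^|V1(I)| / Z, with Z the sum of all unnormalized weights."""
    weights = {I: alpha ** len(v1(I)) for I in inst}
    Z = sum(weights.values())
    return {I: w / Z for I, w in weights.items()}

def expected_v1(alpha, inst):
    p = maxent(alpha, inst)
    return sum(prob * len(v1(I)) for I, prob in p.items())

def solve_alpha(d, inst, lo=1e-9, hi=1e9, iters=200):
    """Geometric bisection; relies on E[|V1|] being increasing in alpha."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if expected_v1(mid, inst) < d:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

inst = worlds(2)
alpha = solve_alpha(1.0, inst)    # target #V1 = 1 is a made-up statistic
print(alpha, expected_v1(alpha, inst))
```

For this toy case Z = (1 + 3α)^2 and the expectation is 6α/(1 + 3α), so the search should land near α = 1/3. Numeric search of this kind does not scale beyond tiny domains, which is why the talk asks for the parameters analytically.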
Benefits of MaxEnt
A statistical program:
$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$
and the distribution it induces:
$$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|}$$
• Every (consistent) statistical program induces a well-defined distribution
– Every query has a well-defined cardinality estimate
• Statistics as a whole, not as individual stats.
• Can add new statistics to our heart’s content
Technical challenge: compute $\alpha_i$ analytically
Two quick Examples
• I: A material random graph
– Even simple EM solutions have interesting theory
• II: Intersection models
– Generating function, and
– Different, analytic technique
Example I: Random Graphs are EM
V(x,y) :- R(x,y)   #V = d
Random graph: add edges independently at random.
Let $v = |V^I|$ and $x = d/n^2$. Then
$$p(I) = x^{v} (1-x)^{n^2 - v}$$
By linearity, $E[|V|] = x n^2 = d$.
This is MaxEnt… write:
$$p(I) = \frac{1}{Z}\,\alpha^{v}, \qquad \alpha = \frac{x}{1-x}, \qquad Z = (1-x)^{-n^2} = (1+\alpha)^{n^2}$$
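The rewriting of the random-graph distribution in MaxEnt form can be checked numerically on a small domain. The sketch below assumes n = 2 and a made-up statistic d = 1, enumerates all 16 edge sets, and compares the independent-edge probability with α^v / Z; the variable and function names are illustrative only.

```python
from itertools import chain, combinations, product
from math import isclose

n, d = 2, 1.0                     # toy domain size and made-up statistic #V = d
x = d / n**2                      # each of the n^2 possible edges has probability x
alpha = x / (1 - x)               # MaxEnt parameter
Z = (1 + alpha) ** (n * n)        # equals (1 - x) ** (-(n * n))

edges = list(product(range(n), repeat=2))
worlds = [frozenset(s) for s in chain.from_iterable(
    combinations(edges, k) for k in range(len(edges) + 1))]

def p_indep(I):
    """Independent-edge form: x^v (1 - x)^(n^2 - v) with v = |V^I|."""
    v = len(I)
    return x**v * (1 - x) ** (n * n - v)

def p_maxent(I):
    """MaxEnt form: alpha^v / Z."""
    return alpha ** len(I) / Z

same = all(isclose(p_indep(I), p_maxent(I)) for I in worlds)
total = sum(p_maxent(I) for I in worlds)
expected = sum(p_maxent(I) * len(I) for I in worlds)
print(same, total, expected)
```

The two forms agree world by world because x^v (1-x)^(n^2 - v) = (1-x)^(n^2) (x/(1-x))^v, and the expected number of edges comes out as d.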
Example II: an intersection model
V(x) :- R1(x), R2(x)   #R1 = d1, #R2 = d2, #V = d3
$$Z_n = (1 + x_1 + x_2 + x_1 x_2 x_3)^n$$
Read: each element is in neither relation (term 1), in R1 only ($x_1$), in R2 only ($x_2$), or in all three of R1, R2 and V ($x_1 x_2 x_3$).
E.g., a term with $x_1^k$ is an instance with k distinct values in R1.
Key identity:
$$x \frac{d}{dx} x^k = k x^k$$
Writing $Z = 1 + x_1 + x_2 + x_1 x_2 x_3$, so that $Z_n = Z^n$:
$$d_3 = \frac{x_3}{Z_n} \frac{\partial Z_n}{\partial x_3} = n\,\frac{x_1 x_2 x_3}{Z}$$
$$d_i = \frac{x_i}{Z_n} \frac{\partial Z_n}{\partial x_i} = n\,\frac{x_i + x_1 x_2 x_3}{Z} = n\,\frac{x_i}{Z} + d_3 \quad \text{for } i = 1,2$$
Since $Z \to 1$ as $n \to \infty$, solving gives:
$$x_i \approx \frac{d_i - d_3}{n} \quad \text{for } i = 1,2, \qquad x_3 \approx \frac{d_3}{n\,x_1 x_2} = \Theta(n)$$
Results in the paper
• Normal Form for statistical programs
• Syntactic classes that we can solve analytically
– “Project-Semijoin” queries (previous slide)
• A general technique, conditioning:
– Start with tuple independent prior, and condition
– Introduces inclusion constraints
• Extensions to handle histograms
Conclusion
• Showed a principled, general model for database statistics based on MaxEnt
• Analytically solved syntactic classes of statistics
• Applications: Query Feedback and the Cloud