Variational Inference
Note: Much (meaning almost all) of this has been liberated from John Winn's and Matthew Beal's theses, and David MacKay's book.
Overview
• Probabilistic models & Bayesian inference
• Variational Inference
• Univariate Gaussian Example
• GMM Example
• Variational Message Passing
Bayesian networks
• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:

P(C, L, S, I) = P(C) P(L) P(S|C) P(I|L,S)
[Figure: Bayes network with nodes C (object class), L (lighting color), S (surface color), I (image color) and factors P(C), P(L), P(S|C), P(I|L,S).]
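The factorisation above can be sketched directly in code: the joint over all four variables is never stored, only the per-node conditional tables. The CPT numbers below are made up for illustration (the slides give no values), with all variables binary.

```python
# Hypothetical conditional probability tables for the network
# P(C, L, S, I) = P(C) P(L) P(S|C) P(I|L,S); all values invented.
P_C = {0: 0.7, 1: 0.3}                       # object class
P_L = {0: 0.6, 1: 0.4}                       # lighting color
P_S_given_C = {0: {0: 0.9, 1: 0.1},          # surface color | class
               1: {0: 0.2, 1: 0.8}}
P_I_given_LS = {(0, 0): {0: 0.95, 1: 0.05},  # image color | lighting, surface
                (0, 1): {0: 0.30, 1: 0.70},
                (1, 0): {0: 0.50, 1: 0.50},
                (1, 1): {0: 0.10, 1: 0.90}}

def joint(c, l, s, i):
    """One multiplication per factor -- the full joint table is never built."""
    return P_C[c] * P_L[l] * P_S_given_C[c][s] * P_I_given_LS[(l, s)][i]

# Sanity check: the joint sums to 1 over all 16 configurations.
total = sum(joint(c, l, s, i)
            for c in (0, 1) for l in (0, 1) for s in (0, 1) for i in (0, 1))
```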
Bayesian inference
• Observed variables D and hidden variables H.
• Hidden variables include parameters and latent variables.
• Learning/inference involves finding:
  • P(H1, H2, … | D), or
  • P(H|D, M) explicitly for a generative model.
[Figure: the same network (C, L, S, I), with the hidden variables shaded and the observed variable marked.]
Bayesian inference vs. ML/MAP

• Consider learning one parameter θ
• How should we represent the posterior distribution P(θ|D) ∝ P(D|θ) P(θ)?
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) against θ, with θMAP at the maximum.]
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) showing θMAP at a narrow peak of high probability density, while most of the probability mass lies elsewhere.]
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) with samples drawn over θ and the maximum-likelihood estimate θML marked.]
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) with a variational approximation Q(θ) fitted to the posterior; θML marked.]
Variational Inference (in three easy steps…)

1. Choose a family of variational distributions Q(H).
2. Use Kullback-Leibler divergence KL(Q||P) as a measure of ‘distance’ between P(H|D) and Q(H).
3. Find the Q which minimizes the divergence.
Choose Variational Distribution
• Approximate P(H|D) by Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so it can get close to P and remains tractable – e.g. factorised, conjugate.
Kullback-Leibler Divergence
• Derived from the variational free energy of Feynman and Bogoliubov
• Relative entropy between two probability distributions
• KL(Q||P) ≥ 0 for any Q (Jensen’s inequality)
• KL(Q||P) = 0 iff P = Q
• Not a true distance measure – not symmetric

KL(Q||P) = Σx Q(x) ln [Q(x) / P(x)]
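The three properties on this slide are easy to check numerically for small discrete distributions. The vectors q and p below are arbitrary examples, not values from the slides:

```python
import math

def kl(q, p):
    """KL(Q||P) = sum_x Q(x) ln(Q(x)/P(x)) for discrete distributions.
    Terms with Q(x) = 0 contribute nothing (0 ln 0 = 0)."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]

assert kl(q, q) == 0           # KL(Q||P) = 0 iff P = Q
assert kl(q, p) > 0            # non-negative for any Q (Jensen's inequality)
assert kl(q, p) != kl(p, q)    # not symmetric, so not a true distance
```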
Kullback-Leibler Divergence
Minimising KL(Q||P) – exclusive:

KL(Q||P) = ΣH Q(H) ln [Q(H) / P(H|D)]

Minimising KL(P||Q) – inclusive:

KL(P||Q) = ΣH P(H|D) ln [P(H|D) / Q(H)]

[Figure: a multimodal P with a unimodal Q; the exclusive form locks Q onto one mode, the inclusive form spreads Q to cover all modes.]
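The exclusive/inclusive contrast can be shown numerically. Below, a hypothetical bimodal target P over nine states is fitted by a one-bump Q family via grid search; all numbers are invented for illustration and the grid search merely stands in for proper optimisation:

```python
import math

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def bump(c, w, n=9):
    """Discretised Gaussian 'bump' over n states -- a hypothetical Q family."""
    q = [math.exp(-(x - c) ** 2 / (2 * w * w)) for x in range(n)]
    z = sum(q)
    return [v / z for v in q]

# Hypothetical bimodal target P with peaks at states 1 and 7.
P = [0.02, 0.45, 0.02, 0.01, 0.01, 0.01, 0.02, 0.44, 0.02]

grid = [(c, w) for c in range(9)
        for w in (0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0)]
c_ex, w_ex = min(grid, key=lambda cw: kl(bump(*cw), P))  # exclusive KL(Q||P)
c_in, w_in = min(grid, key=lambda cw: kl(P, bump(*cw)))  # inclusive KL(P||Q)

# Exclusive is mode-seeking: a narrow Q locked onto one peak.
# Inclusive is mass-covering: a wide Q spanning both peaks.
```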
Kullback-Leibler Divergence
KL(Q||P) = ΣH Q(H) ln [Q(H) / P(H|D)]

(Bayes rule: P(H|D) = P(H,D) / P(D))

= ΣH Q(H) ln [Q(H) P(D) / P(H,D)]

(log property)

= ΣH Q(H) ln [Q(H) / P(H,D)] + ΣH Q(H) ln P(D)

(sum over H: ΣH Q(H) = 1)

= ΣH Q(H) ln [Q(H) / P(H,D)] + ln P(D)
Kullback-Leibler Divergence
DEFINE  L(Q) = ΣH Q(H) ln P(H,D) − ΣH Q(H) ln Q(H)

• L(Q) is the expectation under Q of the log joint ln P(H,D), plus the entropy of Q.
• Maximizing L(Q) is equivalent to minimizing the KL divergence, since

KL(Q||P) = ΣH Q(H) ln [Q(H) / P(H,D)] + ln P(D)

so that

KL(Q||P) = ln P(D) − L(Q)

• We could not do the same trick for KL(P||Q); thus we approximate the likelihood with a function that puts its mass where the likelihood is most probable (the exclusive form).
Summarize

• For arbitrary Q(H):  ln P(D) = L(Q) + KL(Q||P)
• ln P(D) is fixed, so maximising L(Q) minimises KL(Q||P).
• We choose a family of Q distributions for which L(Q) is tractable to compute.
• Still difficult in general to calculate.
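The decomposition ln P(D) = L(Q) + KL(Q||P) can be verified exactly on a toy discrete model. The joint values below are arbitrary; Q is any distribution over the hidden variable:

```python
import math

joint = [0.10, 0.05, 0.25]          # hypothetical P(H=h, D) for the observed D
pD = sum(joint)                      # marginal likelihood P(D)
post = [j / pD for j in joint]       # exact posterior P(H|D)

Q = [0.2, 0.3, 0.5]                  # an arbitrary approximating distribution

L = sum(q * math.log(j) for q, j in zip(Q, joint)) \
    - sum(q * math.log(q) for q in Q)                    # L(Q)
KL = sum(q * math.log(q / p) for q, p in zip(Q, post))   # KL(Q||P)

# The decomposition holds exactly: ln P(D) = L(Q) + KL(Q||P).
assert abs(math.log(pD) - (L + KL)) < 1e-12

# And the bound is tight (KL = 0) exactly when Q equals the posterior:
L_star = sum(p * math.log(j) for p, j in zip(post, joint)) \
         - sum(p * math.log(p) for p in post)
assert abs(L_star - math.log(pD)) < 1e-12
```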
Minimising the KL divergence

[Figure, repeated across several slides: a bar of fixed height ln P(D) split into L(Q) and KL(Q||P); as L(Q) is maximised, KL(Q||P) shrinks.]
Factorised Approximation
• Assume Q factorises: Q(H) = Πi Qi(Hi)
• The optimal solution for one factor is given by

Qj*(Hj) = (1/Z) exp( ΣHi≠j [Πi≠j Qi(Hi)] ln P(H,D) ) = (1/Z) exp( ⟨ln P(H,D)⟩i≠j )

• Given the form of Q, find the best H in the KL sense.
• Choose conjugate priors P(H) to give the form of Q.
• Iterate over each Qi(Hi).
Derivation
Qj*(Hj) = (1/Z) exp( ⟨ln P(H,D)⟩i≠j )

Idea: use the factorisation of Q to isolate Qj and maximise L with respect to Qj.

L(Q) = ΣH Q(H) ln P(H,D) − ΣH Q(H) ln Q(H)

(substitution: Q(H) = Πi Qi(Hi))

= ΣH Πi Qi(Hi) ln P(H,D) − ΣH Πi Qi(Hi) ln Πj Qj(Hj)

(log property: ln Πj Qj = Σj ln Qj)

= ΣH Πi Qi(Hi) ln P(H,D) − ΣH Πi Qi(Hi) Σj ln Qj(Hj)

(factor out one term Qj; terms not involving Qj are constant)

= ΣHj Qj(Hj) ⟨ln P(H,D)⟩i≠j − ΣHj Qj(Hj) ln Qj(Hj) + const

Up to constants this is L(Q) = −KL(Qj || Qj*) + log Z, so L is maximised when Qj = Qj*.
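The Qj* update can be sketched as coordinate ascent on a toy posterior over two binary hidden variables. The joint table is invented for illustration; this is the standard mean-field iteration, not code from the slides:

```python
import math

# Hypothetical joint P(H1, H2, D) for the observed D (unnormalised posterior).
logP = {(a, b): math.log(p) for (a, b), p in
        {(0, 0): 0.20, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.30}.items()}

def update_Q1(Q2):
    """Q1*(h1) proportional to exp( E_{Q2}[ ln P(h1, H2, D) ] )."""
    w = [math.exp(sum(Q2[b] * logP[(a, b)] for b in (0, 1))) for a in (0, 1)]
    z = sum(w)
    return [v / z for v in w]

def update_Q2(Q1):
    w = [math.exp(sum(Q1[a] * logP[(a, b)] for a in (0, 1))) for b in (0, 1)]
    z = sum(w)
    return [v / z for v in w]

def L(Q1, Q2):
    """Lower bound: E_Q[ln P(H,D)] plus the entropy of Q = Q1 Q2."""
    s = sum(Q1[a] * Q2[b] * logP[(a, b)] for a in (0, 1) for b in (0, 1))
    ent = -sum(q * math.log(q) for q in Q1 + Q2 if q > 0)
    return s + ent

Q1, Q2 = [0.5, 0.5], [0.5, 0.5]
bounds = [L(Q1, Q2)]
for _ in range(20):
    Q1 = update_Q1(Q2)
    Q2 = update_Q2(Q1)
    bounds.append(L(Q1, Q2))

# Each coordinate update can only increase the bound,
# and the bound never exceeds ln P(D) = ln 0.6.
```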
Example: Univariate Gaussian

• Normal distribution
• Find P(μ, τ | x)
• Conjugate priors
• Factorised variational distribution Q(μ, τ) = Qμ(μ) Qτ(τ)
• Each Q distribution takes the same form as the corresponding prior
• Inference involves updating these hidden parameters
Example: Univariate Gaussian

• Use Q* to derive coupled updates for Qμ(μ) and Qτ(τ)
• Here ⟨·⟩ denotes expectation with respect to Q
• Iteratively solve the updates until convergence
Example: Univariate Gaussian
• An estimate of the log evidence can be found by calculating L(Q):
• Here ⟨·⟩ are expectations with respect to Q(·)
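A sketch of the coupled updates for this example, assuming the standard mean-field derivation with independent priors P(μ) = N(0, 1000) and P(τ) = Gamma(0.001, 0.001) as quoted on the next slide; Qμ is Gaussian, Qτ is Gamma, and the four data points are made up:

```python
# VB for a univariate Gaussian with unknown mean mu and precision tau.
# Update formulas follow the standard mean-field recipe (Q* above);
# the data values are invented for illustration.
x = [1.2, 0.8, 1.1, 0.9]
N, sx = len(x), sum(x)

mu0, prec0 = 0.0, 1.0 / 1000.0     # prior on mu: N(0, 1000)
a0, b0 = 0.001, 0.001              # prior on tau: Gamma(0.001, 0.001)

E_tau = 1.0                        # initial guess for <tau>
for _ in range(100):
    # Q(mu) = N(m, v): precision-weighted combination of prior and data
    r = prec0 + N * E_tau
    m = (prec0 * mu0 + E_tau * sx) / r
    v = 1.0 / r
    # Q(tau) = Gamma(a, b), using <(x_i - mu)^2> = (x_i - m)^2 + v
    a = a0 + N / 2.0
    b = b0 + 0.5 * (sum((xi - m) ** 2 for xi in x) + N * v)
    E_tau = a / b

# At convergence the posterior mean of mu hugs the sample mean,
# and <tau> = a/b settles near the precision implied by the data.
```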
Example
Take four data samples from a Gaussian (thick line) to find the posterior. Dashed lines: distributions sampled from the variational posterior.
Variational and true posteriors for a Gaussian given four samples, with priors P(μ) = N(0, 1000) and P(τ) = Gamma(0.001, 0.001).
VB with Image Segmentation
RGB histogram of two pixel locations.
“VB at the pixel level will give better results.”
A feature vector (x, y, Vx, Vy, r, g, b) will have issues with data association.
VB with a GMM will be complex – doing this in real time will be painfully slow.
Lower Bound for GMM – ugly
Variational Equations for GMM – ugly
Brings up VMP – efficient computation
[Figure: the Bayes network from earlier – C, L, S, I with factors P(C), P(L), P(S|C), P(I|L,S) – revisited for Variational Message Passing.]