Variational Inference
Note: Much (meaning almost all) of this has been liberated from John Winn's and Matthew Beal's theses, and David MacKay's book.
Overview
• Probabilistic models & Bayesian inference
• Variational Inference
• Univariate Gaussian Example
• GMM Example
• Variational Message Passing
Bayesian networks
• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:

P(C, L, S, I) = P(C) P(L) P(S|C) P(I|L,S)
[Figure: Bayes network with nodes C (object class), L (lighting color), S (surface color), I (image color) and factors P(C), P(L), P(S|C), P(I|L,S).]
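The factorisation above can be sketched directly in code: the joint over all four variables is never stored, only the per-node conditional tables. The CPT numbers below are made up for illustration (the slides give no values), with all variables binary.

```python
# Hypothetical conditional probability tables for the network
# P(C, L, S, I) = P(C) P(L) P(S|C) P(I|L,S); all values invented.
P_C = {0: 0.7, 1: 0.3}                       # object class
P_L = {0: 0.6, 1: 0.4}                       # lighting color
P_S_given_C = {0: {0: 0.9, 1: 0.1},          # surface color | class
               1: {0: 0.2, 1: 0.8}}
P_I_given_LS = {(0, 0): {0: 0.95, 1: 0.05},  # image color | lighting, surface
                (0, 1): {0: 0.30, 1: 0.70},
                (1, 0): {0: 0.50, 1: 0.50},
                (1, 1): {0: 0.10, 1: 0.90}}

def joint(c, l, s, i):
    """One multiplication per factor -- the full joint table is never built."""
    return P_C[c] * P_L[l] * P_S_given_C[c][s] * P_I_given_LS[(l, s)][i]

# Sanity check: the joint sums to 1 over all 16 configurations.
total = sum(joint(c, l, s, i)
            for c in (0, 1) for l in (0, 1) for s in (0, 1) for i in (0, 1))
```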
Bayesian inference
• Observed variables D and hidden variables H.
• Hidden variables include parameters and latent variables.
• Learning/inference involves finding:
  • P(H1, H2, … | D), or
  • P(H|D, M) explicitly for a generative model.
[Figure: the same network (C, L, S, I), with the hidden variables shaded and the observed variable marked.]
Bayesian inference vs. ML/MAP

• Consider learning one parameter θ
• How should we represent the posterior distribution P(θ|D) ∝ P(D|θ) P(θ)?
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) against θ, with θMAP at the maximum.]
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) showing θMAP at a narrow peak of high probability density, while most of the probability mass lies elsewhere.]
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) with samples drawn over θ and the maximum-likelihood estimate θML marked.]
Bayesian inference vs. ML/MAP
• Consider learning one parameter θ

[Figure: plot of P(D|θ) P(θ) with a variational approximation Q(θ) fitted to the posterior; θML marked.]
Variational Inference (in three easy steps…)

1. Choose a family of variational distributions Q(H).
2. Use Kullback-Leibler divergence KL(Q||P) as a measure of ‘distance’ between P(H|D) and Q(H).
3. Find the Q which minimizes the divergence.
Choose Variational Distribution
• Approximate P(H|D) by Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so it can get close to P and remains tractable – e.g. factorised, conjugate.
Kullback-Leibler Divergence
• Derived from the variational free energy of Feynman and Bogoliubov
• Relative entropy between two probability distributions
• KL(Q||P) ≥ 0 for any Q (Jensen’s inequality)
• KL(Q||P) = 0 iff P = Q
• Not a true distance measure – not symmetric

KL(Q||P) = Σx Q(x) ln [Q(x) / P(x)]
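The three properties on this slide are easy to check numerically for small discrete distributions. The vectors q and p below are arbitrary examples, not values from the slides:

```python
import math

def kl(q, p):
    """KL(Q||P) = sum_x Q(x) ln(Q(x)/P(x)) for discrete distributions.
    Terms with Q(x) = 0 contribute nothing (0 ln 0 = 0)."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]

assert kl(q, q) == 0           # KL(Q||P) = 0 iff P = Q
assert kl(q, p) > 0            # non-negative for any Q (Jensen's inequality)
assert kl(q, p) != kl(p, q)    # not symmetric, so not a true distance
```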
Kullback-Leibler Divergence
Minimising KL(Q||P) – exclusive:

KL(Q||P) = ΣH Q(H) ln [Q(H) / P(H|D)]

Minimising KL(P||Q) – inclusive:

KL(P||Q) = ΣH P(H|D) ln [P(H|D) / Q(H)]

[Figure: a multimodal P with a unimodal Q; the exclusive form locks Q onto one mode, the inclusive form spreads Q to cover all modes.]
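The exclusive/inclusive contrast can be shown numerically. Below, a hypothetical bimodal target P over nine states is fitted by a one-bump Q family via grid search; all numbers are invented for illustration and the grid search merely stands in for proper optimisation:

```python
import math

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def bump(c, w, n=9):
    """Discretised Gaussian 'bump' over n states -- a hypothetical Q family."""
    q = [math.exp(-(x - c) ** 2 / (2 * w * w)) for x in range(n)]
    z = sum(q)
    return [v / z for v in q]

# Hypothetical bimodal target P with peaks at states 1 and 7.
P = [0.02, 0.45, 0.02, 0.01, 0.01, 0.01, 0.02, 0.44, 0.02]

grid = [(c, w) for c in range(9)
        for w in (0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0)]
c_ex, w_ex = min(grid, key=lambda cw: kl(bump(*cw), P))  # exclusive KL(Q||P)
c_in, w_in = min(grid, key=lambda cw: kl(P, bump(*cw)))  # inclusive KL(P||Q)

# Exclusive is mode-seeking: a narrow Q locked onto one peak.
# Inclusive is mass-covering: a wide Q spanning both peaks.
```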
Kullback-Leibler Divergence
KL(Q||P) = ΣH Q(H) ln [Q(H) / P(H|D)]

(Bayes rule: P(H|D) = P(H,D) / P(D))

= ΣH Q(H) ln [Q(H) P(D) / P(H,D)]

(log property)

= ΣH Q(H) ln [Q(H) / P(H,D)] + ΣH Q(H) ln P(D)

(sum over H: ΣH Q(H) = 1)

= ΣH Q(H) ln [Q(H) / P(H,D)] + ln P(D)
Kullback-Leibler Divergence
DEFINE  L(Q) = ΣH Q(H) ln P(H,D) − ΣH Q(H) ln Q(H)

• L(Q) is the expectation under Q of the log joint ln P(H,D), plus the entropy of Q.
• Maximizing L(Q) is equivalent to minimizing the KL divergence, since

KL(Q||P) = ΣH Q(H) ln [Q(H) / P(H,D)] + ln P(D)

so that

KL(Q||P) = ln P(D) − L(Q)

• We could not do the same trick for KL(P||Q); thus we approximate the likelihood with a function that puts its mass where the likelihood is most probable (the exclusive form).
Summarize

• For arbitrary Q(H):  ln P(D) = L(Q) + KL(Q||P)
• ln P(D) is fixed, so maximising L(Q) minimises KL(Q||P).
• We choose a family of Q distributions for which L(Q) is tractable to compute.
• Still difficult in general to calculate.
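The decomposition ln P(D) = L(Q) + KL(Q||P) can be verified exactly on a toy discrete model. The joint values below are arbitrary; Q is any distribution over the hidden variable:

```python
import math

joint = [0.10, 0.05, 0.25]          # hypothetical P(H=h, D) for the observed D
pD = sum(joint)                      # marginal likelihood P(D)
post = [j / pD for j in joint]       # exact posterior P(H|D)

Q = [0.2, 0.3, 0.5]                  # an arbitrary approximating distribution

L = sum(q * math.log(j) for q, j in zip(Q, joint)) \
    - sum(q * math.log(q) for q in Q)                    # L(Q)
KL = sum(q * math.log(q / p) for q, p in zip(Q, post))   # KL(Q||P)

# The decomposition holds exactly: ln P(D) = L(Q) + KL(Q||P).
assert abs(math.log(pD) - (L + KL)) < 1e-12

# And the bound is tight (KL = 0) exactly when Q equals the posterior:
L_star = sum(p * math.log(j) for p, j in zip(post, joint)) \
         - sum(p * math.log(p) for p in post)
assert abs(L_star - math.log(pD)) < 1e-12
```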
Minimising the KL divergence

[Figure, repeated across several slides: a bar of fixed height ln P(D) split into L(Q) and KL(Q||P); as L(Q) is maximised, KL(Q||P) shrinks.]
Factorised Approximation
• Assume Q factorises: Q(H) = Πi Qi(Hi)
• The optimal solution for one factor is given by

Qj*(Hj) = (1/Z) exp( ΣHi≠j [Πi≠j Qi(Hi)] ln P(H,D) ) = (1/Z) exp( ⟨ln P(H,D)⟩i≠j )

• Given the form of Q, find the best H in the KL sense.
• Choose conjugate priors P(H) to give the form of Q.
• Iterate over each Qi(Hi).
Derivation
Qj*(Hj) = (1/Z) exp( ⟨ln P(H,D)⟩i≠j )

Idea: use the factorisation of Q to isolate Qj and maximise L with respect to Qj.

L(Q) = ΣH Q(H) ln P(H,D) − ΣH Q(H) ln Q(H)

(substitution: Q(H) = Πi Qi(Hi))

= ΣH Πi Qi(Hi) ln P(H,D) − ΣH Πi Qi(Hi) ln Πj Qj(Hj)

(log property: ln Πj Qj = Σj ln Qj)

= ΣH Πi Qi(Hi) ln P(H,D) − ΣH Πi Qi(Hi) Σj ln Qj(Hj)

(factor out one term Qj; terms not involving Qj are constant)

= ΣHj Qj(Hj) ⟨ln P(H,D)⟩i≠j − ΣHj Qj(Hj) ln Qj(Hj) + const

Up to constants this is L(Q) = −KL(Qj || Qj*) + log Z, so L is maximised when Qj = Qj*.
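The Qj* update can be sketched as coordinate ascent on a toy posterior over two binary hidden variables. The joint table is invented for illustration; this is the standard mean-field iteration, not code from the slides:

```python
import math

# Hypothetical joint P(H1, H2, D) for the observed D (unnormalised posterior).
logP = {(a, b): math.log(p) for (a, b), p in
        {(0, 0): 0.20, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.30}.items()}

def update_Q1(Q2):
    """Q1*(h1) proportional to exp( E_{Q2}[ ln P(h1, H2, D) ] )."""
    w = [math.exp(sum(Q2[b] * logP[(a, b)] for b in (0, 1))) for a in (0, 1)]
    z = sum(w)
    return [v / z for v in w]

def update_Q2(Q1):
    w = [math.exp(sum(Q1[a] * logP[(a, b)] for a in (0, 1))) for b in (0, 1)]
    z = sum(w)
    return [v / z for v in w]

def L(Q1, Q2):
    """Lower bound: E_Q[ln P(H,D)] plus the entropy of Q = Q1 Q2."""
    s = sum(Q1[a] * Q2[b] * logP[(a, b)] for a in (0, 1) for b in (0, 1))
    ent = -sum(q * math.log(q) for q in Q1 + Q2 if q > 0)
    return s + ent

Q1, Q2 = [0.5, 0.5], [0.5, 0.5]
bounds = [L(Q1, Q2)]
for _ in range(20):
    Q1 = update_Q1(Q2)
    Q2 = update_Q2(Q1)
    bounds.append(L(Q1, Q2))

# Each coordinate update can only increase the bound,
# and the bound never exceeds ln P(D) = ln 0.6.
```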
Example: Univariate Gaussian

• Normal distribution
• Find P(μ, τ | x)
• Conjugate priors
• Factorised variational distribution Q(μ, τ) = Qμ(μ) Qτ(τ)
• Each Q distribution takes the same form as the corresponding prior
• Inference involves updating these hidden parameters
Example: Univariate Gaussian

• Use Q* to derive coupled updates for Qμ(μ) and Qτ(τ)
• Here ⟨·⟩ denotes expectation with respect to Q
• Iteratively solve the updates until convergence
Example: Univariate Gaussian
• An estimate of the log evidence can be found by calculating L(Q):
• Here ⟨·⟩ are expectations with respect to Q(·)
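A sketch of the coupled updates for this example, assuming the standard mean-field derivation with independent priors P(μ) = N(0, 1000) and P(τ) = Gamma(0.001, 0.001) as quoted on the next slide; Qμ is Gaussian, Qτ is Gamma, and the four data points are made up:

```python
# VB for a univariate Gaussian with unknown mean mu and precision tau.
# Update formulas follow the standard mean-field recipe (Q* above);
# the data values are invented for illustration.
x = [1.2, 0.8, 1.1, 0.9]
N, sx = len(x), sum(x)

mu0, prec0 = 0.0, 1.0 / 1000.0     # prior on mu: N(0, 1000)
a0, b0 = 0.001, 0.001              # prior on tau: Gamma(0.001, 0.001)

E_tau = 1.0                        # initial guess for <tau>
for _ in range(100):
    # Q(mu) = N(m, v): precision-weighted combination of prior and data
    r = prec0 + N * E_tau
    m = (prec0 * mu0 + E_tau * sx) / r
    v = 1.0 / r
    # Q(tau) = Gamma(a, b), using <(x_i - mu)^2> = (x_i - m)^2 + v
    a = a0 + N / 2.0
    b = b0 + 0.5 * (sum((xi - m) ** 2 for xi in x) + N * v)
    E_tau = a / b

# At convergence the posterior mean of mu hugs the sample mean,
# and <tau> = a/b settles near the precision implied by the data.
```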
Example
Take four data samples from a Gaussian (thick line) to find the posterior. Dashed lines: distributions sampled from the variational posterior.
Variational and true posteriors for a Gaussian given four samples, with priors P(μ) = N(0, 1000) and P(τ) = Gamma(0.001, 0.001).
VB with Image Segmentation
RGB histogram of two pixel locations.
“VB at the pixel level will give better results.”
A feature vector (x, y, Vx, Vy, r, g, b) will have issues with data association.
VB with a GMM will be complex – doing this in real time will be painfully slow.
Lower Bound for GMM – ugly
Variational Equations for GMM – ugly
Brings up VMP – efficient computation
[Figure: the Bayes network from earlier – C, L, S, I with factors P(C), P(L), P(S|C), P(I|L,S) – revisited for Variational Message Passing.]