Frank Wood - [email protected]
Training Products of Experts by Minimizing Contrastive Divergence
Geoffrey E. Hinton
presented by Frank Wood
Goal
• Learn parameters for probability distribution models of high-dimensional data (images, population firing rates, securities data, NLP data, etc.)

Product of Experts: use contrastive divergence to learn parameters.

  p(d | θ_1, …, θ_n) = ∏_m f_m(d | θ_m) / Σ_c ∏_m f_m(c | θ_m)

Mixture Model: use EM to learn parameters.

  p(d | θ_1, …, θ_n) = Σ_m α_m f_m(d | θ_m)
Take Home
• Contrastive divergence is a general MCMC gradient-ascent learning algorithm particularly well suited to learning Product of Experts (PoE) and energy-based (Gibbs distributions, etc.) model parameters.
• The general algorithm:
  – Repeat until "convergence":
    • Draw samples from the current model, starting from the training data.
    • Compute the expected gradient of the log probability w.r.t. all model parameters over both the samples and the training data.
    • Update the model parameters according to the gradient.
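The loop above can be sketched for a small restricted Boltzmann machine, which is itself a product of experts (each hidden unit is one expert). This is a minimal illustrative sketch, not code from the slides: the toy data, learning rate, iteration count, and the omission of bias terms are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, data, lr=0.1):
    """One CD-1 update for a bias-free RBM. W has shape (visible, hidden)."""
    # Positive phase: expectations under the training data (P^0).
    h_prob = sigmoid(data @ W)
    positive = data.T @ h_prob
    # One step of block Gibbs sampling starting from the training data (P^1).
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_prob = sigmoid(h_sample @ W.T)
    h_prob1 = sigmoid(v_prob @ W)
    negative = v_prob.T @ h_prob1
    # Gradient ascent on the approximate log-likelihood gradient.
    return W + lr * (positive - negative) / len(data)

def recon_err(W):
    """Mean squared one-step reconstruction error on the training data."""
    h = sigmoid(data @ W)
    return float(np.mean((sigmoid(h @ W.T) - data) ** 2))

# Toy training data: two repeated binary patterns.
data = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]] * 10)
W = 0.01 * rng.standard_normal((4, 3))
err_init = recon_err(W)
for _ in range(200):
    W = cd1_update(W, data)
err_trained = recon_err(W)
```

The two phases of `cd1_update` are exactly the two terms of the gradient: an expectation over the training data minus an expectation over samples one step away from it.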
Sampling – Critical to Understanding
• Uniform: rand(), a linear congruential generator
  – x(n) = a * x(n-1) + b mod M
  – e.g. 0.2311, 0.6068, 0.4860, 0.8913, 0.7621, 0.4565, 0.0185
• Normal: randn(), Box-Muller
  – x1, x2 ~ U(0,1) -> y1, y2 ~ N(0,1)
    • y1 = sqrt(-2 ln(x1)) cos(2 pi x2)
    • y2 = sqrt(-2 ln(x1)) sin(2 pi x2)
• Binomial(p): if (rand() < p)
• More complicated distributions
  – Mixture model
    • Sample from a Gaussian
    • Sample from a multinomial (CDF + uniform)
  – Product of Experts
    • Metropolis and/or Gibbs sampling
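The primitives above can be sketched directly; the LCG constants below are one common choice (not specified on the slide), and the mixture-sampling helper is an illustrative assumption combining the multinomial and Box-Muller steps.

```python
import math
import random

def lcg(x, a=1103515245, b=12345, m=2**31):
    """Linear congruential generator: x(n) = a * x(n-1) + b mod M."""
    return (a * x + b) % m

def box_muller(u1, u2):
    """Map two U(0,1) draws to two independent N(0,1) draws."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

def sample_bernoulli(p):
    """Binomial(p) with a single trial: compare a uniform draw to p."""
    return 1 if random.random() < p else 0

def sample_multinomial(weights):
    """Invert the CDF with a single uniform draw."""
    u = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return len(weights) - 1

def sample_mixture(means, stds, weights):
    """Mixture model: pick a component, then sample its Gaussian."""
    k = sample_multinomial(weights)
    # 1 - random() keeps the argument of log strictly positive.
    y, _ = box_muller(1.0 - random.random(), random.random())
    return means[k] + stds[k] * y
```

Sampling a product of experts has no such direct recipe, which is why the next slide turns to Metropolis.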
The Flavor of Metropolis Sampling
• Given some distribution p(d | θ), a random starting point d_{t-1}, and a symmetric proposal distribution J(d_t | d_{t-1}).
• Calculate the ratio of densities

  r = p(d_t | θ) / p(d_{t-1} | θ)

  where d_t is sampled from the proposal distribution.
• With probability min(r, 1), accept d_t.
• Given sufficiently many iterations,

  {d_{n+1}, d_{n+2}, d_{n+3}, …} ~ p(d | θ)

We only need to know the distribution up to a proportionality constant!
Contrastive Divergence (Final Result!)
  Δθ_m ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^1}

  Δθ_m ∝ (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − (1/N) Σ_{c ∈ D^1 ~ P^1_θ} ∂ log f_m(c | θ_m)/∂θ_m

• θ_m: model parameters.
• d ∈ D: training data (empirical distribution P^0).
• c ∈ D^1: samples from the model, one sampling step away from the data (P^1_θ).
• Law of large numbers: compute the expectations using samples.

Now you know how to do it; let's see why this works!
But First: The last vestige of concreteness.
• Looking towards the future: take f to be a Student-t.
• Then (for instance)

  f_m(d | θ_m) = f_m(j_m^T d; α_m) = 1 / (1 + ½ (j_m^T d)^2)^{α_m}

  ∂ log f_m(d | θ_m) / ∂α_m = ∂/∂α_m [ −α_m log(1 + ½ (j_m^T d)^2) ] = −log(1 + ½ (j_m^T d)^2)

[Figure: 1-D Student-t expert, 1/(1 + ½ (j^T x)^2)^α plotted for x in [-10, 10] with α = .6, j = 5.]

Dot product → projection → 1-D marginal.
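The expert and its α-derivative are easy to sketch in code; the parameter values in the check below are arbitrary, and the finite-difference comparison simply confirms the derivative written above.

```python
import numpy as np

def f(d, j, alpha):
    """Unnormalized 1-D Student-t expert: (1 + (j^T d)^2 / 2)^(-alpha)."""
    return (1.0 + 0.5 * np.dot(j, d) ** 2) ** (-alpha)

def dlogf_dalpha(d, j, alpha):
    """d/d alpha of log f, i.e. -log(1 + (j^T d)^2 / 2)."""
    return -np.log(1.0 + 0.5 * np.dot(j, d) ** 2)

# The expert from the figure: alpha = 0.6, j = 5, in one dimension.
value_at_1 = f(np.array([1.0]), np.array([5.0]), 0.6)
```

Because log f is linear in α, the derivative does not depend on α at all, which a central-difference check confirms exactly.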
Maximizing the training data log likelihood
• We want the maximizing parameters:

  argmax_{θ_1,…,θ_n} log p(D | θ_1, …, θ_n) = argmax_{θ_1,…,θ_n} log ∏_{d ∈ D} [ ∏_m f_m(d | θ_m) / Σ_c ∏_m f_m(c | θ_m) ]

  (assuming the d's are drawn independently from p(); the inner ratio is the standard PoE form, and the outer product runs over all training data)

• Differentiate w.r.t. all parameters and perform gradient ascent to find optimal parameters.
• The derivation is somewhat nasty.
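The objective can be made concrete on a tiny discrete domain where the normalizing sum over c is an explicit enumeration; the two experts below are hypothetical scoring functions invented for illustration.

```python
import itertools
import numpy as np

def poe_prob(d, experts):
    """p(d | theta) for a PoE over 3-bit vectors: product of expert
    values, normalized by an explicit sum over every configuration c
    (tractable only because the domain has 2^3 = 8 elements)."""
    num = np.prod([f(d) for f in experts])
    z = sum(np.prod([f(c) for f in experts])
            for c in itertools.product([0, 1], repeat=3))
    return num / z

# Two hypothetical experts scoring 3-bit vectors.
experts = [lambda d: np.exp(sum(d)),             # favors many 1s
           lambda d: np.exp(-(d[0] - d[2]) ** 2)]  # favors d[0] == d[2]

total = sum(poe_prob(c, experts)
            for c in itertools.product([0, 1], repeat=3))
```

For real, high-dimensional data the sum over c is exactly what cannot be enumerated, which is why the derivation that follows replaces it with sampling.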
Maximizing the training data log likelihood
  ∂/∂θ_m log p(D | θ_1, …, θ_n) = ∂/∂θ_m log ∏_{d ∈ D} p(d | θ_1, …, θ_n)

    = Σ_{d ∈ D} ∂/∂θ_m log p(d | θ_1, …, θ_n)

    = N ⟨∂ log p(d | θ_1, …, θ_n)/∂θ_m⟩_{P^0}

Remember this equivalence!
Maximizing the training data log likelihood
  (1/N) ∂/∂θ_m log p(D | θ_1, …, θ_n)

    = (1/N) Σ_{d ∈ D} ∂/∂θ_m log [ ∏_m f_m(d | θ_m) / Σ_c ∏_m f_m(c | θ_m) ]

    = (1/N) Σ_{d ∈ D} [ ∂ log f_m(d | θ_m)/∂θ_m − ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m) ]

    = (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − (1/N) Σ_{d ∈ D} ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m)

(only the m-th expert depends on θ_m, so ∂/∂θ_m log ∏_m f_m(d | θ_m) = ∂ log f_m(d | θ_m)/∂θ_m)
Maximizing the training data log likelihood
    = (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] ∂/∂θ_m Σ_c ∏_m f_m(c | θ_m)

(using log(x)' = x'/x)
Maximizing the training data log likelihood
    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] ∂/∂θ_m Σ_c ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] Σ_c [ ∂ f_m(c | θ_m)/∂θ_m ] ∏_{j ≠ m} f_j(c | θ_j)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] Σ_c [ ∂ log f_m(c | θ_m)/∂θ_m ] ∏_m f_m(c | θ_m)

(using log(x)' = x'/x, i.e. x' = x · log(x)')
Maximizing the training data log likelihood
    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] Σ_c [ ∂ log f_m(c | θ_m)/∂θ_m ] ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − Σ_c [ ∏_m f_m(c | θ_m) / Σ_c ∏_m f_m(c | θ_m) ] ∂ log f_m(c | θ_m)/∂θ_m

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − Σ_c p(c | θ_1, …, θ_n) ∂ log f_m(c | θ_m)/∂θ_m
Maximizing the training data log likelihood
    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − Σ_c p(c | θ_1, …, θ_n) ∂ log f_m(c | θ_m)/∂θ_m

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞}

Phew! We're done! So, since N ⟨∂ log p(d | θ_1, …, θ_n)/∂θ_m⟩_{P^0} ⇔ ∂/∂θ_m log p(D | θ_1, …, θ_n):

  ∂/∂θ_m log p(D | θ_1, …, θ_n) ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞}
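The end result can be checked numerically on a tiny discrete PoE where the P^∞ expectation is an exact sum over the domain. The exponential-family experts below (f_m(d | θ_m) = exp(θ_m g_m(d)), so ∂ log f_m/∂θ_m = g_m(d)), their features, and the data are all hypothetical choices made for this check.

```python
import itertools
import numpy as np

# Hypothetical features g_m over 2-bit vectors.
features = [lambda d: float(d[0]), lambda d: float(d[0] == d[1])]
domain = list(itertools.product([0, 1], repeat=2))
data = [(1, 1), (0, 1), (1, 0)]

def log_p(d, theta):
    """Normalized PoE log probability: sum_m theta_m g_m(d) - log Z(theta)."""
    un = lambda x: sum(t * g(x) for t, g in zip(theta, features))
    z = sum(np.exp(un(c)) for c in domain)
    return un(d) - np.log(z)

theta = np.array([0.7, -0.3])
m, eps = 0, 1e-6

# Left side: (1/N) d/dtheta_m log p(D | theta), via central differences.
def avg_ll(theta):
    return sum(log_p(d, theta) for d in data) / len(data)

e = np.zeros_like(theta)
e[m] = eps
lhs = (avg_ll(theta + e) - avg_ll(theta - e)) / (2 * eps)

# Right side: <g_m(d)>_{P^0} - <g_m(c)>_{P^inf}, with the model
# expectation computed exactly by enumerating the domain.
g = features[m]
data_term = sum(g(d) for d in data) / len(data)
model_term = sum(np.exp(log_p(c, theta)) * g(c) for c in domain)
rhs = data_term - model_term
```

The enumeration over `domain` is exactly the part that becomes infeasible in high dimensions, which is the problem the next slide raises.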
Equilibrium Is Hard to Achieve
• With

  ∂/∂θ_m log p(D | θ_1, …, θ_n) ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞}

  we can now train our PoE model.
• But... there's a problem:
  – ⟨·⟩_{P^∞} is computationally infeasible to obtain (especially in an inner gradient-ascent loop).
  – The sampling Markov chain must converge to the target distribution, and often this takes a very long time!
Solution: Contrastive Divergence!
• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after one iteration (or, more typically, perhaps a few iterations):

  ∂/∂θ_m log p(D | θ_1, …, θ_n) ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^1}

• Why does this work?
  – It attempts to minimize the ways that the model distorts the data.
Equivalence of argmax log P() and argmin KL()

  P^0 ‖ P^∞_θ = Σ_d P^0(d) log [ P^0(d) / P^∞_θ(d) ]

    = Σ_d P^0(d) log P^0(d) − Σ_d P^0(d) log P^∞_θ(d)

    = −H(P^0) − ⟨log P^∞_θ(d)⟩_{P^0}

  ∂/∂θ_m (P^0 ‖ P^∞_θ) = −⟨∂ log P^∞_θ(d)/∂θ_m⟩_{P^0}

This is what we got out of the nasty derivation!
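The decomposition above is easy to verify numerically on small discrete distributions; the two distributions below are hypothetical stand-ins for the empirical distribution P^0 and the model's equilibrium distribution P^∞.

```python
import numpy as np

# Hypothetical stand-ins for P^0 (data) and P^inf (model equilibrium).
P0 = np.array([0.5, 0.3, 0.2])
Pinf = np.array([0.4, 0.4, 0.2])

kl = np.sum(P0 * np.log(P0 / Pinf))          # P^0 || P^inf
entropy = -np.sum(P0 * np.log(P0))           # H(P^0)
expected_loglik = np.sum(P0 * np.log(Pinf))  # <log P^inf(d)>_{P^0}

# KL = -H(P^0) - <log P^inf>_{P^0}. Only the second term depends on
# the model parameters, so maximizing the expected log likelihood is
# the same as minimizing the KL divergence.
```

This is why the gradient of the KL divergence reduces to (the negative of) the expected log-likelihood gradient from the derivation.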
Contrastive Divergence
  −∂/∂θ_m (P^0 ‖ P^∞_θ − P^1_θ ‖ P^∞_θ) = ⟨∂ log P^∞_θ(d)/∂θ_m⟩_{P^0} − ⟨∂ log P^∞_θ(d)/∂θ_m⟩_{P^1_θ}

    ∝ [ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞} ] − [ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^1} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞} ]

    ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^1}

  (the two ⟨·⟩_{P^∞} terms cancel)

• We want to "update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step".
Contrastive Divergence (Final Result!)
  Δθ_m ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^1}

  Δθ_m ∝ (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − (1/N) Σ_{c ∈ D^1 ~ P^1_θ} ∂ log f_m(c | θ_m)/∂θ_m

• θ_m: model parameters.
• d ∈ D: training data (empirical distribution P^0).
• c ∈ D^1: samples from the model, one sampling step away from the data (P^1_θ).
• Law of large numbers: compute the expectations using samples.

Now you know how to do it and why it works!