Frank Wood - [email protected]
Training Products of Experts by Minimizing Contrastive Divergence
Geoffrey E. Hinton
presented by Frank Wood
Goal
• Learn parameters for probability distribution models of high-dimensional data (images, population firing rates, securities data, NLP data, etc.)

Product of Experts: use contrastive divergence to learn parameters.

  p(d | θ_1, …, θ_n) = ∏_m f_m(d | θ_m) / Σ_c ∏_m f_m(c | θ_m)

Mixture Model: use EM to learn parameters.

  p(d | θ_1, …, θ_n) = Σ_m α_m f_m(d | θ_m)
Take Home
• Contrastive divergence is a general MCMC gradient-ascent learning algorithm particularly well suited to learning Product of Experts (PoE) and energy-based (Gibbs distributions, etc.) model parameters.
• The general algorithm:
  – Repeat until "convergence":
    • Draw samples from the current model, starting from the training data.
    • Compute the expected gradient of the log probability w.r.t. all model parameters over both the samples and the training data.
    • Update the model parameters according to the gradient.
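The loop above can be sketched for a small restricted Boltzmann machine, which is itself a product of experts (each hidden unit is one expert). This is a minimal illustrative sketch, not code from the slides: the toy data, learning rate, iteration count, and the omission of bias terms are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, data, lr=0.1):
    """One CD-1 update for a bias-free RBM. W has shape (visible, hidden)."""
    # Positive phase: expectations under the training data (P^0).
    h_prob = sigmoid(data @ W)
    positive = data.T @ h_prob
    # One step of block Gibbs sampling starting from the training data (P^1).
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_prob = sigmoid(h_sample @ W.T)
    h_prob1 = sigmoid(v_prob @ W)
    negative = v_prob.T @ h_prob1
    # Gradient ascent on the approximate log-likelihood gradient.
    return W + lr * (positive - negative) / len(data)

def recon_err(W):
    """Mean squared one-step reconstruction error on the training data."""
    h = sigmoid(data @ W)
    return float(np.mean((sigmoid(h @ W.T) - data) ** 2))

# Toy training data: two repeated binary patterns.
data = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]] * 10)
W = 0.01 * rng.standard_normal((4, 3))
err_init = recon_err(W)
for _ in range(200):
    W = cd1_update(W, data)
err_trained = recon_err(W)
```

The two phases of `cd1_update` are exactly the two terms of the gradient: an expectation over the training data minus an expectation over samples one step away from it.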
Sampling – Critical to Understanding
• Uniform: rand(), a linear congruential generator
  – x(n) = a * x(n-1) + b mod M
  – e.g. 0.2311, 0.6068, 0.4860, 0.8913, 0.7621, 0.4565, 0.0185
• Normal: randn(), Box-Muller
  – x1, x2 ~ U(0,1) -> y1, y2 ~ N(0,1)
    • y1 = sqrt(-2 ln(x1)) cos(2 pi x2)
    • y2 = sqrt(-2 ln(x1)) sin(2 pi x2)
• Binomial(p): if (rand() < p)
• More complicated distributions
  – Mixture model
    • Sample from a Gaussian
    • Sample from a multinomial (CDF + uniform)
  – Product of Experts
    • Metropolis and/or Gibbs sampling
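The primitives above can be sketched directly; the LCG constants below are one common choice (not specified on the slide), and the mixture-sampling helper is an illustrative assumption combining the multinomial and Box-Muller steps.

```python
import math
import random

def lcg(x, a=1103515245, b=12345, m=2**31):
    """Linear congruential generator: x(n) = a * x(n-1) + b mod M."""
    return (a * x + b) % m

def box_muller(u1, u2):
    """Map two U(0,1) draws to two independent N(0,1) draws."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

def sample_bernoulli(p):
    """Binomial(p) with a single trial: compare a uniform draw to p."""
    return 1 if random.random() < p else 0

def sample_multinomial(weights):
    """Invert the CDF with a single uniform draw."""
    u = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return len(weights) - 1

def sample_mixture(means, stds, weights):
    """Mixture model: pick a component, then sample its Gaussian."""
    k = sample_multinomial(weights)
    # 1 - random() keeps the argument of log strictly positive.
    y, _ = box_muller(1.0 - random.random(), random.random())
    return means[k] + stds[k] * y
```

Sampling a product of experts has no such direct recipe, which is why the next slide turns to Metropolis.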
The Flavor of Metropolis Sampling
• Given some distribution p(d | θ), a random starting point d_{t-1}, and a symmetric proposal distribution J(d_t | d_{t-1}).
• Calculate the ratio of densities

  r = p(d_t | θ) / p(d_{t-1} | θ)

  where d_t is sampled from the proposal distribution.
• With probability min(r, 1), accept d_t.
• Given sufficiently many iterations,

  {d_{n+1}, d_{n+2}, d_{n+3}, …} ~ p(d | θ)

We only need to know the distribution up to a proportionality constant!
Contrastive Divergence (Final Result!)
  Δθ_m ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^1}

  Δθ_m ∝ (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − (1/N) Σ_{c ∈ D^1 ~ P^1_θ} ∂ log f_m(c | θ_m)/∂θ_m

• θ_m: model parameters.
• d ∈ D: training data (empirical distribution P^0).
• c ∈ D^1: samples from the model, one sampling step away from the data (P^1_θ).
• Law of large numbers: compute the expectations using samples.

Now you know how to do it; let's see why this works!
But First: The last vestige of concreteness.
• Looking towards the future: take f to be a Student-t.
• Then (for instance)

  f_m(d | θ_m) = f_m(j_m^T d; α_m) = 1 / (1 + ½ (j_m^T d)^2)^{α_m}

  ∂ log f_m(d | θ_m) / ∂α_m = ∂/∂α_m [ −α_m log(1 + ½ (j_m^T d)^2) ] = −log(1 + ½ (j_m^T d)^2)

[Figure: 1-D Student-t expert, 1/(1 + ½ (j^T x)^2)^α plotted for x in [-10, 10] with α = .6, j = 5.]

Dot product → projection → 1-D marginal.
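The expert and its α-derivative are easy to sketch in code; the parameter values in the check below are arbitrary, and the finite-difference comparison simply confirms the derivative written above.

```python
import numpy as np

def f(d, j, alpha):
    """Unnormalized 1-D Student-t expert: (1 + (j^T d)^2 / 2)^(-alpha)."""
    return (1.0 + 0.5 * np.dot(j, d) ** 2) ** (-alpha)

def dlogf_dalpha(d, j, alpha):
    """d/d alpha of log f, i.e. -log(1 + (j^T d)^2 / 2)."""
    return -np.log(1.0 + 0.5 * np.dot(j, d) ** 2)

# The expert from the figure: alpha = 0.6, j = 5, in one dimension.
value_at_1 = f(np.array([1.0]), np.array([5.0]), 0.6)
```

Because log f is linear in α, the derivative does not depend on α at all, which a central-difference check confirms exactly.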
Maximizing the training data log likelihood
• We want the maximizing parameters:

  argmax_{θ_1,…,θ_n} log p(D | θ_1, …, θ_n) = argmax_{θ_1,…,θ_n} log ∏_{d ∈ D} [ ∏_m f_m(d | θ_m) / Σ_c ∏_m f_m(c | θ_m) ]

  (assuming the d's are drawn independently from p(); the inner ratio is the standard PoE form, and the outer product runs over all training data)

• Differentiate w.r.t. all parameters and perform gradient ascent to find optimal parameters.
• The derivation is somewhat nasty.
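The objective can be made concrete on a tiny discrete domain where the normalizing sum over c is an explicit enumeration; the two experts below are hypothetical scoring functions invented for illustration.

```python
import itertools
import numpy as np

def poe_prob(d, experts):
    """p(d | theta) for a PoE over 3-bit vectors: product of expert
    values, normalized by an explicit sum over every configuration c
    (tractable only because the domain has 2^3 = 8 elements)."""
    num = np.prod([f(d) for f in experts])
    z = sum(np.prod([f(c) for f in experts])
            for c in itertools.product([0, 1], repeat=3))
    return num / z

# Two hypothetical experts scoring 3-bit vectors.
experts = [lambda d: np.exp(sum(d)),             # favors many 1s
           lambda d: np.exp(-(d[0] - d[2]) ** 2)]  # favors d[0] == d[2]

total = sum(poe_prob(c, experts)
            for c in itertools.product([0, 1], repeat=3))
```

For real, high-dimensional data the sum over c is exactly what cannot be enumerated, which is why the derivation that follows replaces it with sampling.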
Maximizing the training data log likelihood
  ∂/∂θ_m log p(D | θ_1, …, θ_n) = ∂/∂θ_m log ∏_{d ∈ D} p(d | θ_1, …, θ_n)

    = Σ_{d ∈ D} ∂/∂θ_m log p(d | θ_1, …, θ_n)

    = N ⟨∂ log p(d | θ_1, …, θ_n)/∂θ_m⟩_{P^0}

Remember this equivalence!
Maximizing the training data log likelihood
  (1/N) ∂/∂θ_m log p(D | θ_1, …, θ_n)

    = (1/N) Σ_{d ∈ D} ∂/∂θ_m log [ ∏_m f_m(d | θ_m) / Σ_c ∏_m f_m(c | θ_m) ]

    = (1/N) Σ_{d ∈ D} [ ∂ log f_m(d | θ_m)/∂θ_m − ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m) ]

    = (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − (1/N) Σ_{d ∈ D} ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m)

(only the m-th expert depends on θ_m, so ∂/∂θ_m log ∏_m f_m(d | θ_m) = ∂ log f_m(d | θ_m)/∂θ_m)
Maximizing the training data log likelihood
    = (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ∂/∂θ_m log Σ_c ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] ∂/∂θ_m Σ_c ∏_m f_m(c | θ_m)

(using log(x)' = x'/x)
Maximizing the training data log likelihood
    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] ∂/∂θ_m Σ_c ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] Σ_c [ ∂ f_m(c | θ_m)/∂θ_m ] ∏_{j ≠ m} f_j(c | θ_j)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] Σ_c [ ∂ log f_m(c | θ_m)/∂θ_m ] ∏_m f_m(c | θ_m)

(using log(x)' = x'/x, i.e. x' = x · log(x)')
Maximizing the training data log likelihood
    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − [1 / Σ_c ∏_m f_m(c | θ_m)] Σ_c [ ∂ log f_m(c | θ_m)/∂θ_m ] ∏_m f_m(c | θ_m)

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − Σ_c [ ∏_m f_m(c | θ_m) / Σ_c ∏_m f_m(c | θ_m) ] ∂ log f_m(c | θ_m)/∂θ_m

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − Σ_c p(c | θ_1, …, θ_n) ∂ log f_m(c | θ_m)/∂θ_m
Maximizing the training data log likelihood
    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − Σ_c p(c | θ_1, …, θ_n) ∂ log f_m(c | θ_m)/∂θ_m

    = ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞}

Phew! We're done! So, since N ⟨∂ log p(d | θ_1, …, θ_n)/∂θ_m⟩_{P^0} ⇔ ∂/∂θ_m log p(D | θ_1, …, θ_n):

  ∂/∂θ_m log p(D | θ_1, …, θ_n) ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞}
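The end result can be checked numerically on a tiny discrete PoE where the P^∞ expectation is an exact sum over the domain. The exponential-family experts below (f_m(d | θ_m) = exp(θ_m g_m(d)), so ∂ log f_m/∂θ_m = g_m(d)), their features, and the data are all hypothetical choices made for this check.

```python
import itertools
import numpy as np

# Hypothetical features g_m over 2-bit vectors.
features = [lambda d: float(d[0]), lambda d: float(d[0] == d[1])]
domain = list(itertools.product([0, 1], repeat=2))
data = [(1, 1), (0, 1), (1, 0)]

def log_p(d, theta):
    """Normalized PoE log probability: sum_m theta_m g_m(d) - log Z(theta)."""
    un = lambda x: sum(t * g(x) for t, g in zip(theta, features))
    z = sum(np.exp(un(c)) for c in domain)
    return un(d) - np.log(z)

theta = np.array([0.7, -0.3])
m, eps = 0, 1e-6

# Left side: (1/N) d/dtheta_m log p(D | theta), via central differences.
def avg_ll(theta):
    return sum(log_p(d, theta) for d in data) / len(data)

e = np.zeros_like(theta)
e[m] = eps
lhs = (avg_ll(theta + e) - avg_ll(theta - e)) / (2 * eps)

# Right side: <g_m(d)>_{P^0} - <g_m(c)>_{P^inf}, with the model
# expectation computed exactly by enumerating the domain.
g = features[m]
data_term = sum(g(d) for d in data) / len(data)
model_term = sum(np.exp(log_p(c, theta)) * g(c) for c in domain)
rhs = data_term - model_term
```

The enumeration over `domain` is exactly the part that becomes infeasible in high dimensions, which is the problem the next slide raises.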
Equilibrium Is Hard to Achieve
• With

  ∂/∂θ_m log p(D | θ_1, …, θ_n) ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞}

  we can now train our PoE model.
• But... there's a problem:
  – ⟨·⟩_{P^∞} is computationally infeasible to obtain (especially in an inner gradient-ascent loop).
  – The sampling Markov chain must converge to the target distribution, and often this takes a very long time!
Solution: Contrastive Divergence!
• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after one iteration (or, more typically, perhaps a few iterations):

  ∂/∂θ_m log p(D | θ_1, …, θ_n) ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^1}

• Why does this work?
  – It attempts to minimize the ways that the model distorts the data.
Equivalence of argmax log P() and argmin KL()

  P^0 ‖ P^∞_θ = Σ_d P^0(d) log [ P^0(d) / P^∞_θ(d) ]

    = Σ_d P^0(d) log P^0(d) − Σ_d P^0(d) log P^∞_θ(d)

    = −H(P^0) − ⟨log P^∞_θ(d)⟩_{P^0}

  ∂/∂θ_m (P^0 ‖ P^∞_θ) = −⟨∂ log P^∞_θ(d)/∂θ_m⟩_{P^0}

This is what we got out of the nasty derivation!
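The decomposition above is easy to verify numerically on small discrete distributions; the two distributions below are hypothetical stand-ins for the empirical distribution P^0 and the model's equilibrium distribution P^∞.

```python
import numpy as np

# Hypothetical stand-ins for P^0 (data) and P^inf (model equilibrium).
P0 = np.array([0.5, 0.3, 0.2])
Pinf = np.array([0.4, 0.4, 0.2])

kl = np.sum(P0 * np.log(P0 / Pinf))          # P^0 || P^inf
entropy = -np.sum(P0 * np.log(P0))           # H(P^0)
expected_loglik = np.sum(P0 * np.log(Pinf))  # <log P^inf(d)>_{P^0}

# KL = -H(P^0) - <log P^inf>_{P^0}. Only the second term depends on
# the model parameters, so maximizing the expected log likelihood is
# the same as minimizing the KL divergence.
```

This is why the gradient of the KL divergence reduces to (the negative of) the expected log-likelihood gradient from the derivation.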
Contrastive Divergence
  −∂/∂θ_m (P^0 ‖ P^∞_θ − P^1_θ ‖ P^∞_θ) = ⟨∂ log P^∞_θ(d)/∂θ_m⟩_{P^0} − ⟨∂ log P^∞_θ(d)/∂θ_m⟩_{P^1_θ}

    ∝ [ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞} ] − [ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^1} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^∞} ]

    ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^1}

  (the two ⟨·⟩_{P^∞} terms cancel)

• We want to "update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step".
Contrastive Divergence (Final Result!)
  Δθ_m ∝ ⟨∂ log f_m(d | θ_m)/∂θ_m⟩_{P^0} − ⟨∂ log f_m(c | θ_m)/∂θ_m⟩_{P^1}

  Δθ_m ∝ (1/N) Σ_{d ∈ D} ∂ log f_m(d | θ_m)/∂θ_m − (1/N) Σ_{c ∈ D^1 ~ P^1_θ} ∂ log f_m(c | θ_m)/∂θ_m

• θ_m: model parameters.
• d ∈ D: training data (empirical distribution P^0).
• c ∈ D^1: samples from the model, one sampling step away from the data (P^1_θ).
• Law of large numbers: compute the expectations using samples.

Now you know how to do it and why it works!