Probabilistic modelling for NLP, powered by deep learning

Wilker Aziz, University of Amsterdam

April 3, 2018
Outline

1 Deep generative models
2 Variational inference
3 Variational auto-encoder
Problems

Supervised problems
"learn a distribution over observed data"
  sentences in natural language
  images, ...

Unsupervised problems
"learn a distribution over observed and unobserved data"
  sentences in natural language + parse trees
  images + bounding boxes, ...
Supervised problems

We have data $x^{(1)}, \ldots, x^{(N)}$, e.g.
  sentences, images, ...
generated by some unknown procedure,
which we assume can be captured by a probabilistic model
with known probability (mass/density) function, e.g.

  $\underbrace{X \sim \mathrm{Cat}(\pi_1, \ldots, \pi_K)}_{\text{e.g. nationality}}$  or  $\underbrace{X \sim \mathcal{N}(\mu, \sigma^2)}_{\text{e.g. height}}$

We estimate the parameters that assign maximum likelihood to the observations.
Multiple problems, same language

[Figure: plate diagram with side information $\phi$ and observation $x$, replicated N times]

(Conditional) density estimation

  Task         Side information ($\phi$)      Observation ($x$)
  Parsing      a sentence                   parse tree
  Translation  a sentence in French         translation in English
  Captioning   an image                     caption in English
  Entailment   a text and hypothesis        entailment relation
Where does deep learning kick in?

Let $\phi$ be all side information available, e.g. deterministic inputs/features.

Have neural networks predict the parameters of our probabilistic model

  $X \mid \phi \sim \mathrm{Cat}(\pi_\theta(\phi))$  or  $X \mid \phi \sim \mathcal{N}(\mu_\theta(\phi), \sigma_\theta(\phi)^2)$

and proceed to estimate the parameters $\theta$ of the NNs (see the sketch below).
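To make the parametrisation concrete, here is a minimal sketch, assuming PyTorch; the class name, layer sizes and feature dimensions are made up for illustration and are not taken from the lecture. A network maps side information $\phi$ to the parameters of a Categorical observation model.

```python
import torch
import torch.nn as nn

# Minimal sketch: a NN maps side information phi to the parameters of a
# Categorical observation model X | phi ~ Cat(pi_theta(phi)).
# Layer sizes and names are illustrative only.
class CategoricalObservationModel(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_classes),  # unnormalised scores (logits)
        )

    def forward(self, phi):
        logits = self.net(phi)
        # The distribution object gives us log p(x | phi) for MLE training.
        return torch.distributions.Categorical(logits=logits)

phi = torch.randn(8, 64)          # a batch of side-information vectors
x = torch.randint(0, 10, (8,))    # observed class labels
px = CategoricalObservationModel()(phi)
loss = -px.log_prob(x).mean()     # negative log-likelihood
```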
NN as efficient parametrisation

From the statistical point of view, NNs do not generate data:
  they parametrise distributions that, by assumption, govern the data
  they are a compact and efficient way to map from complex side information to parameter space

Prediction is done by a decision rule outside the statistical model,
e.g. beam search.
Maximum likelihood estimation

Let $p(x \mid \theta)$ be the probability of an observation $x$, and let $\theta$ refer to all of its parameters, e.g. the parameters of the NNs involved.

Given a dataset $x^{(1)}, \ldots, x^{(N)}$ of i.i.d. observations, the (log-)likelihood function

  $\mathcal{L}(\theta \mid x^{(1:N)}) = \log \prod_{s=1}^{N} p(x^{(s)} \mid \theta) = \sum_{s=1}^{N} \log p(x^{(s)} \mid \theta)$

quantifies the fit of our model to the data (see the sketch below).
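As a concrete illustration, here is a minimal sketch, assuming NumPy; the data and parameters are made up for the example. It evaluates the log-likelihood of i.i.d. observations under a univariate Gaussian, the "height" example above.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x^(s) | mu, sigma^2) over i.i.d. observations."""
    log_probs = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
    return log_probs.sum()

# toy "heights"; a parameter setting closer to the data-generating one scores higher
x = np.random.default_rng(0).normal(loc=1.7, scale=0.1, size=100)
print(gaussian_log_likelihood(x, mu=1.7, sigma=0.1))
print(gaussian_log_likelihood(x, mu=1.5, sigma=0.1))
```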
MLE via gradient-based optimisation

If the log-likelihood is differentiable and assessing it is tractable, then backpropagation can give us the gradient

  $\nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)}) = \nabla_\theta \sum_{s=1}^{N} \log p(x^{(s)} \mid \theta) = \sum_{s=1}^{N} \nabla_\theta \log p(x^{(s)} \mid \theta)$

and we can update $\theta$ in the direction

  $\gamma \, \nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)})$

to attain a local optimum of the likelihood function.
Stochastic optimisation

We can also use a gradient estimate

  $\nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)}) = \nabla_\theta \underbrace{\mathbb{E}_{S \sim \mathcal{U}(1..N)}\left[N \log p(x^{(S)} \mid \theta)\right]}_{\mathcal{L}(\theta \mid x^{(1:N)})}$
  $\quad = \underbrace{\mathbb{E}_{S \sim \mathcal{U}(1..N)}\left[N \nabla_\theta \log p(x^{(S)} \mid \theta)\right]}_{\text{expected gradient :)}}$
  $\quad \overset{\text{MC}}{\approx} \frac{1}{M} \sum_{m=1}^{M} N \nabla_\theta \log p(x^{(s_m)} \mid \theta), \qquad S_m \sim \mathcal{U}(1..N)$

and take steps in the direction

  $\frac{\gamma N}{M} \nabla_\theta \mathcal{L}(\theta \mid x^{(s_1:s_M)})$

where $x^{(s_1)}, \ldots, x^{(s_M)}$ is a random mini-batch of size $M$ (see the sketch below).
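Below is a minimal sketch, assuming NumPy and a toy model invented for the example: stochastic gradient ascent on the log-likelihood of $x \sim \mathcal{N}(\theta, 1)$, whose per-example gradient with respect to $\theta$ is simply $(x - \theta)$.

```python
import numpy as np

# Toy stochastic optimisation of the MLE for x ~ N(theta, 1).
rng = np.random.default_rng(0)
N = 10_000
data = rng.normal(loc=3.0, scale=1.0, size=N)

theta, gamma, M = 0.0, 0.1, 64
for step in range(1000):
    batch = rng.choice(data, size=M, replace=False)   # indices S_m ~ U(1..N)
    # (N / M) * sum_m grad log p(x^(s_m) | theta) estimates the full-data gradient;
    # we fold the 1/N into the learning rate to keep the step size well scaled.
    grad_estimate = (batch - theta).mean()
    theta += gamma * grad_estimate                     # gradient ascent step
print(theta)  # close to the sample mean, i.e. the MLE
```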
DL in NLP recipe

Maximum likelihood estimation
  tells you which loss to optimise (i.e. the negative log-likelihood)

Automatic differentiation (backprop)
  chain rule of derivatives: "give me a tractable forward pass and I will give you gradients"

Stochastic optimisation powered by backprop
  general-purpose gradient-based optimisers
Tractability is central

The likelihood gives us a differentiable objective to optimise,
but we need to stick to tractable likelihood functions.
When do we have an intractable likelihood?

Unsupervised problems contain unobserved random variables:

  $p_\theta(x, z) = \overbrace{p(z)\, \underbrace{p_\theta(x \mid z)}_{\text{observation model}}}^{\text{latent variable model}}$

thus assessing the marginal likelihood requires marginalising the latent variables:

  $p_\theta(x) = \int p_\theta(x, z)\, \mathrm{d}z = \int p(z)\, p_\theta(x \mid z)\, \mathrm{d}z$
Examples of latent variable models

Discrete latent variable, continuous observation (too many forward passes):

  $p_\theta(x) = \sum_{c=1}^{K} \mathrm{Cat}(c \mid \pi_1, \ldots, \pi_K)\, \underbrace{\mathcal{N}(x \mid \mu_\theta(c), \sigma_\theta(c)^2)}_{\text{forward pass}}$

Continuous latent variable, discrete observation (infinitely many forward passes):

  $p_\theta(x) = \int \mathcal{N}(z \mid 0, I)\, \underbrace{\mathrm{Cat}(x \mid \pi_\theta(z))}_{\text{forward pass}}\, \mathrm{d}z$

(the discrete case is sketched in code below)
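For the discrete case, here is a minimal sketch, assuming NumPy and SciPy; the mixture parameters are made up, and the networks $\mu_\theta$ and $\sigma_\theta$ are replaced by simple lookup tables. Each latent value requires one "forward pass", so the sum becomes expensive when $K$ is large.

```python
import numpy as np
from scipy.stats import norm

# Marginal likelihood of a mixture model with a discrete latent variable:
# p(x) = sum_c Cat(c | pi) N(x | mu(c), sigma(c)^2)
pi = np.array([0.3, 0.5, 0.2])       # Cat(pi_1, ..., pi_K)
mu = np.array([-2.0, 0.0, 3.0])      # stands in for mu_theta(c)
sigma = np.array([0.5, 1.0, 0.8])    # stands in for sigma_theta(c)

def marginal_likelihood(x):
    # enumerate the K values of the latent variable and sum them out
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(marginal_likelihood(0.1))
```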
Deep generative models

Joint distribution with a deep observation model:

  $p_\theta(x, z) = \underbrace{p(z)}_{\text{prior}}\, \underbrace{p_\theta(x \mid z)}_{\text{likelihood}}$

The mapping from the latent variable $z$ to $p(x \mid z)$ is a NN with parameters $\theta$.

Marginal likelihood (or evidence):

  $p_\theta(x) = \int p_\theta(x, z)\, \mathrm{d}z = \int p(z)\, p_\theta(x \mid z)\, \mathrm{d}z$

intractable in general.
Gradient

The exact gradient is intractable:

  $\nabla_\theta \log p_\theta(x) = \nabla_\theta \log \underbrace{\int p_\theta(x, z)\, \mathrm{d}z}_{\text{marginal}}$
  $\quad= \underbrace{\frac{1}{\int p_\theta(x, z)\, \mathrm{d}z} \int \nabla_\theta\, p_\theta(x, z)\, \mathrm{d}z}_{\text{chain rule}}$
  $\quad= \frac{1}{p_\theta(x)} \int \underbrace{p_\theta(x, z)\, \nabla_\theta \log p_\theta(x, z)}_{\text{log-identity for derivatives}}\, \mathrm{d}z$
  $\quad= \int \underbrace{\frac{p_\theta(x, z)}{p_\theta(x)}}_{\text{posterior}}\, \nabla_\theta \log p_\theta(x, z)\, \mathrm{d}z$
  $\quad= \int p_\theta(z \mid x)\, \nabla_\theta \log p_\theta(x, z)\, \mathrm{d}z$
  $\quad= \underbrace{\mathbb{E}_{p_\theta(z \mid x)}\left[\nabla_\theta \log p_\theta(x, Z)\right]}_{\text{expected gradient :)}}$
Can we get an estimate?

  $\nabla_\theta \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\left[\nabla_\theta \log p_\theta(x, Z)\right] \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x, z^{(k)}), \qquad z^{(k)} \sim p_\theta(Z \mid x)$

The MC estimate of the gradient requires sampling from the posterior

  $p_\theta(z \mid x) = \frac{p(z)\, p_\theta(x \mid z)}{p_\theta(x)}$

which is unavailable due to the intractability of the marginal.
Summary

We like probabilistic models because we can make explicit modelling assumptions.
We want complex observation models parameterised by NNs.
But we cannot use backprop alone for parameter estimation.
We need approximate inference techniques!
Outline

1 Deep generative models
2 Variational inference
3 Variational auto-encoder
The Basic Problem

The marginal likelihood

  $p(x) = \int p(x, z)\, \mathrm{d}z$

is generally intractable, which prevents us from computing quantities that depend on the posterior $p(z \mid x)$,
  e.g. gradients in MLE
  e.g. the predictive distribution in Bayesian modelling
Strategy

Accept that $p(z \mid x)$ is not computable.
  Approximate it by an auxiliary distribution $q(z \mid x)$ that is computable.
  Choose $q(z \mid x)$ as close as possible to $p(z \mid x)$ to obtain a faithful approximation.
Evidence lower bound

  $\log p(x) = \log \int p(x, z)\, \mathrm{d}z$
  $\quad= \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, \mathrm{d}z$
  $\quad= \log \left(\mathbb{E}_{q(z \mid x)}\left[\frac{p(x, Z)}{q(Z \mid x)}\right]\right)$
  $\quad\geq \underbrace{\mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, Z)}{q(Z \mid x)}\right]}_{\text{ELBO}}$  (by Jensen's inequality)
  $\quad= \mathbb{E}_{q(z \mid x)}\left[\log p(x, Z)\right] - \mathbb{E}_{q(z \mid x)}\left[\log q(Z \mid x)\right]$
  $\quad= \mathbb{E}_{q(z \mid x)}\left[\log p(x, Z)\right] + \mathbb{H}\left(q(z \mid x)\right)$

(a numerical check of this bound is sketched below)
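Here is a minimal numerical sketch, assuming NumPy and SciPy and a toy conjugate model chosen so that $\log p(x)$ is available in closed form; the numbers are made up. It estimates the ELBO by Monte Carlo and checks that it lower-bounds the log-evidence, with the bound tight when $q$ equals the true posterior.

```python
import numpy as np
from scipy.stats import norm

# Toy model: Z ~ N(0, 1), X | z ~ N(z, 1), so p(x) = N(x | 0, 2) in closed form.
rng = np.random.default_rng(0)
x = 1.3

def elbo(q_mu, q_sigma, num_samples=100_000):
    z = rng.normal(q_mu, q_sigma, size=num_samples)          # z ~ q(z | x)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # log p(x, z)
    log_q = norm.logpdf(z, q_mu, q_sigma)                     # log q(z | x)
    return np.mean(log_joint - log_q)

print(np.log(norm.pdf(x, 0, np.sqrt(2))))       # log p(x)
print(elbo(q_mu=x / 2, q_sigma=np.sqrt(0.5)))    # q = true posterior: bound is tight
print(elbo(q_mu=0.0, q_sigma=1.0))               # a poorer q: smaller ELBO
```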
An approximate posterior

  $\log p(x) \geq \underbrace{\mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, Z)}{q(Z \mid x)}\right]}_{\text{ELBO}}$
  $\quad= \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(Z \mid x)\, p(x)}{q(Z \mid x)}\right]$
  $\quad= \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(Z \mid x)}{q(Z \mid x)}\right] + \underbrace{\log p(x)}_{\text{constant}}$
  $\quad= -\underbrace{\mathbb{E}_{q(z \mid x)}\left[\log \frac{q(Z \mid x)}{p(Z \mid x)}\right]}_{\mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x))} + \log p(x)$

We have derived a lower bound on the log-evidence whose gap is exactly $\mathrm{KL}\left(q(z \mid x) \,\|\, p(z \mid x)\right)$.
Variational Inference

Objective:

  $\max_{q(z \mid x)}\; \mathbb{E}_{q(z \mid x)}\left[\log p(x, Z)\right] + \mathbb{H}\left(q(z \mid x)\right)$

The ELBO is a lower bound on $\log p(x)$.

Blei et al. (2016)
Mean field assumption

Suppose we have $N$ latent variables:
  assume the posterior factorises into $N$ independent terms,
  each with an independent set of parameters

  $q(z_1, \ldots, z_N) = \underbrace{\prod_{i=1}^{N} q_{\lambda_i}(z_i)}_{\text{mean field}}$
Amortised variational inference

Amortise the cost of inference using NNs:

  $q(z_1, \ldots, z_N \mid x_1, \ldots, x_N) = \prod_{i=1}^{N} q_\lambda(z_i \mid x_i)$

with a shared set of parameters, e.g.

  $Z \mid x \sim \mathcal{N}(\underbrace{\mu_\lambda(x), \sigma_\lambda(x)^2}_{\text{inference network}})$

(see the sketch below)
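A minimal sketch of such an inference network, assuming PyTorch; the class name and layer sizes are hypothetical. A single network with shared parameters $\lambda$ maps each observation to the mean and standard deviation of its own Gaussian approximate posterior.

```python
import torch
import torch.nn as nn

# Amortised inference: x  ->  parameters of q_lambda(z | x) = N(mu_lambda(x), sigma_lambda(x)^2)
class InferenceNetwork(nn.Module):
    def __init__(self, x_dim=784, hidden=256, z_dim=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_sigma = nn.Linear(hidden, z_dim)  # predict log sigma so sigma stays positive

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_sigma(h).exp()

mu, sigma = InferenceNetwork()(torch.randn(16, 784))  # one q(z | x_i) per data point
```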
Outline

1 Deep generative models
2 Variational inference
3 Variational auto-encoder
Variational auto-encoder

Generative model with NN likelihood.

[Figure: graphical model with latent $z$, observation $x$, generative parameters $\theta$ and variational parameters $\lambda$, replicated N times]

  complex (non-linear) observation model $p_\theta(x \mid z)$
  complex (non-linear) mapping from data to latent variables $q_\lambda(z \mid x)$

Jointly optimise the generative model $p_\theta(x \mid z)$ and the inference model $q_\lambda(z \mid x)$ under the same objective (the ELBO).

Kingma and Welling (2013)
Objective

  $\log p_\theta(x) \geq \overbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x, Z)\right] + \mathbb{H}\left(q_\lambda(z \mid x)\right)}^{\text{ELBO}}$
  $\quad= \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid Z) + \log p(Z)\right] + \mathbb{H}\left(q_\lambda(z \mid x)\right)$
  $\quad= \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid Z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)$

Parameter estimation:

  $\arg\max_{\theta, \lambda}\; \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid Z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)$

  assume $\mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)$ is analytical: true for exponential families
  approximate $\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right]$ by sampling: feasible because we design $q_\lambda(z \mid x)$ to be simple

(see the sketch below)
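Here is a minimal sketch of evaluating this objective, assuming PyTorch and two hypothetical modules not defined in the lecture: `inference_net`, which returns $(\mu, \sigma)$ for $q_\lambda(z \mid x)$, and `decoder`, which maps $z$ to a torch distribution over $x$. How to push gradients with respect to $\lambda$ through the sampling step is exactly the problem addressed on the following slides.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Evaluate E_q[log p(x | z)] - KL(q(z | x) || p(z)) for one batch of observations x.
def elbo_estimate(x, inference_net, decoder, num_samples=1):
    mu, sigma = inference_net(x)                  # parameters of q_lambda(z | x)
    q = Normal(mu, sigma)
    p = Normal(torch.zeros_like(mu), torch.ones_like(sigma))   # prior p(z) = N(0, I)
    z = q.sample((num_samples,))                  # z^(k) ~ q_lambda(z | x); note: no gradient wrt lambda here
    log_px_z = decoder(z).log_prob(x).sum(-1).mean(0)          # MC estimate of E_q[log p(x | z)]
    kl = kl_divergence(q, p).sum(-1)              # analytical KL between Gaussians
    return log_px_z - kl
```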
Generative Network Gradient

  $\frac{\partial}{\partial \theta}\left(\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \overbrace{\mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)}^{\text{constant wrt } \theta}\right)$
  $\quad= \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[\frac{\partial}{\partial \theta} \log p_\theta(x \mid z)\right]}_{\text{expected gradient :)}}$
  $\quad\overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \frac{\partial}{\partial \theta} \log p_\theta(x \mid z^{(k)}), \qquad z^{(k)} \sim q_\lambda(Z \mid x)$

Note: $q_\lambda(z \mid x)$ does not depend on $\theta$.
Inference Network Gradient

  $\frac{\partial}{\partial \lambda}\left(\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \overbrace{\mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)}^{\text{analytical}}\right)$
  $\quad= \frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \underbrace{\frac{\partial}{\partial \lambda}\mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)}_{\text{analytical computation}}$

The first term again requires approximation by sampling, but there is a problem.
Inference Network Gradient

  $\frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right]$
  $\quad= \frac{\partial}{\partial \lambda}\int q_\lambda(z \mid x)\, \log p_\theta(x \mid z)\, \mathrm{d}z$
  $\quad= \underbrace{\int \frac{\partial}{\partial \lambda}\left(q_\lambda(z \mid x)\right) \log p_\theta(x \mid z)\, \mathrm{d}z}_{\text{not an expectation}}$

The MC estimator is non-differentiable: we cannot sample first.
Differentiating the expression does not yield an expectation: we cannot approximate it via MC.
Score function estimator

We can again use the log-identity for derivatives:

  $\frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right]$
  $\quad= \frac{\partial}{\partial \lambda}\int q_\lambda(z \mid x)\, \log p_\theta(x \mid z)\, \mathrm{d}z$
  $\quad= \underbrace{\int \frac{\partial}{\partial \lambda}\left(q_\lambda(z \mid x)\right) \log p_\theta(x \mid z)\, \mathrm{d}z}_{\text{not an expectation}}$
  $\quad= \int q_\lambda(z \mid x)\, \frac{\partial}{\partial \lambda}\left(\log q_\lambda(z \mid x)\right) \log p_\theta(x \mid z)\, \mathrm{d}z$
  $\quad= \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\, \frac{\partial}{\partial \lambda}\log q_\lambda(z \mid x)\right]}_{\text{expected gradient :)}}$
Score function estimator: high variance

We can now build an MC estimator

  $\frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\, \frac{\partial}{\partial \lambda}\log q_\lambda(z \mid x)\right]$
  $\quad\overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \log p_\theta(x \mid z^{(k)})\, \frac{\partial}{\partial \lambda}\log q_\lambda(z^{(k)} \mid x), \qquad z^{(k)} \sim q_\lambda(Z \mid x)$

but
  the magnitude of $\log p_\theta(x \mid z)$ varies widely
  the model likelihood does not contribute to the direction of the gradient
  too much variance to be useful (see the sketch below)
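A minimal sketch of this estimator, assuming NumPy and a toy problem invented for the example: $q_\lambda(z) = \mathcal{N}(\lambda, 1)$ and $f(z) = (z - 3)^2$ standing in for $\log p_\theta(x \mid z)$, so the true gradient $\frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda}[f(z)] = 2(\lambda - 3)$ is known in closed form. The estimate is unbiased but noisy.

```python
import numpy as np

# Score-function (a.k.a. REINFORCE-style) estimate of d/dlambda E_{q_lambda}[f(z)].
rng = np.random.default_rng(0)
f = lambda z: (z - 3.0) ** 2
lam, K = 0.0, 100
true_grad = 2 * (lam - 3.0)                              # -6 for this toy problem

estimates = []
for _ in range(1000):
    z = rng.normal(lam, 1.0, size=K)                     # z^(k) ~ q_lambda
    # d/dlambda log N(z | lambda, 1) = (z - lambda)
    estimates.append(np.mean(f(z) * (z - lam)))
estimates = np.array(estimates)
print(true_grad, estimates.mean(), estimates.std())      # unbiased, but high variance
```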
When variance is high we can

  sample more: won't scale
  use variance reduction techniques (e.g. baselines and control variates): an excellent idea, but not just yet
  stare at $\frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right]$ until we find a way to rewrite the expectation in terms of a density that does not depend on $\lambda$
Reparametrisation

Find a transformation $h : z \mapsto \varepsilon$ that expresses $z$ through a random variable $\varepsilon$ such that $q(\varepsilon)$ does not depend on $\lambda$:
  $h(z, \lambda)$ needs to be invertible
  $h(z, \lambda)$ needs to be differentiable

Invertibility implies
  $h(z, \lambda) = \varepsilon$
  $h^{-1}(\varepsilon, \lambda) = z$

(Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lazaro-Gredilla, 2014)
Gaussian Transformation

If $Z \sim \mathcal{N}(\mu_\lambda(x), \sigma_\lambda(x)^2)$ then

  $h(z, \lambda) = \frac{z - \mu_\lambda(x)}{\sigma_\lambda(x)} = \varepsilon \sim \mathcal{N}(0, I)$
  $h^{-1}(\varepsilon, \lambda) = \mu_\lambda(x) + \sigma_\lambda(x) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$

(see the sketch below)
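A minimal sketch continuing the toy problem from the score-function example above, assuming NumPy: writing $z = h^{-1}(\varepsilon, \lambda) = \lambda + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$ moves $\lambda$ inside $f$, so we differentiate $f$ instead of $q$, and the estimate has far lower variance.

```python
import numpy as np

# Reparametrised estimate of d/dlambda E_{q_lambda}[f(z)] with q_lambda(z) = N(lambda, 1).
rng = np.random.default_rng(0)
f_prime = lambda z: 2 * (z - 3.0)        # derivative of f(z) = (z - 3)^2
lam, K = 0.0, 100

estimates = []
for _ in range(1000):
    eps = rng.normal(0.0, 1.0, size=K)   # eps ~ N(0, 1), does not depend on lambda
    z = lam + eps                        # z = h^{-1}(eps, lambda)
    # chain rule: f'(z) * dz/dlambda, and dz/dlambda = 1 for this location-only case
    estimates.append(np.mean(f_prime(z)))
estimates = np.array(estimates)
print(estimates.mean(), estimates.std())  # same mean as the score-function estimate, much smaller std
```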
Inference Network – Reparametrised Gradient

  $\frac{\partial}{\partial \lambda}\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \frac{\partial}{\partial \lambda}\int q_\lambda(z \mid x)\, \log p_\theta(x \mid z)\, \mathrm{d}z$
  $\quad= \frac{\partial}{\partial \lambda}\int q(\varepsilon)\, \log p_\theta(x \mid \overbrace{h^{-1}(\varepsilon, \lambda)}^{= z})\, \mathrm{d}\varepsilon$
  $\quad= \int q(\varepsilon)\, \frac{\partial}{\partial \lambda}\log p_\theta(x \mid \overbrace{h^{-1}(\varepsilon, \lambda)}^{= z})\, \mathrm{d}\varepsilon$
  $\quad= \underbrace{\mathbb{E}_{q(\varepsilon)}\left[\frac{\partial}{\partial \lambda}\log p_\theta(x \mid h^{-1}(\varepsilon, \lambda))\right]}_{\text{expected gradient :D}}$
Reparametrised gradient estimate

  $\underbrace{\mathbb{E}_{q(\varepsilon)}\left[\frac{\partial}{\partial \lambda}\log p_\theta(x \mid h^{-1}(\varepsilon, \lambda))\right]}_{\text{expected gradient :D}}$
  $\quad= \mathbb{E}_{q(\varepsilon)}\Big[\underbrace{\frac{\partial}{\partial z}\log p_\theta(x \mid \overbrace{h^{-1}(\varepsilon, \lambda)}^{= z}) \times \frac{\partial}{\partial \lambda}h^{-1}(\varepsilon, \lambda)}_{\text{chain rule}}\Big]$
  $\quad\overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \underbrace{\frac{\partial}{\partial z}\log p_\theta(x \mid \overbrace{h^{-1}(\varepsilon^{(k)}, \lambda)}^{= z}) \times \frac{\partial}{\partial \lambda}h^{-1}(\varepsilon^{(k)}, \lambda)}_{\text{backprop's job}}, \qquad \varepsilon^{(k)} \sim q(\varepsilon)$

Note that both models contribute gradients.
Gaussian KL

ELBO:

  $\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)$

Analytical computation of $-\mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)$:

  $\frac{1}{2}\sum_{i=1}^{d}\left(1 + \log\left(\sigma_i^2\right) - \mu_i^2 - \sigma_i^2\right)$

Thus backprop will compute $-\frac{\partial}{\partial \lambda}\mathrm{KL}\left(q_\lambda(z \mid x) \,\|\, p(z)\right)$ for us (see the sketch below).
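A minimal numerical check of this formula, assuming NumPy and SciPy with made-up values of $\mu$ and $\sigma$: the analytical $-\mathrm{KL}$ between a diagonal Gaussian $q$ and the standard normal prior agrees with a Monte Carlo estimate of $\mathbb{E}_q[\log p(z) - \log q(z)]$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([0.8, 1.2, 0.5])

# analytical -KL(q || p) for q = N(mu, diag(sigma^2)) and p = N(0, I)
neg_kl_analytic = 0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo estimate of E_q[log p(z) - log q(z)]
z = rng.normal(mu, sigma, size=(100_000, 3))
neg_kl_mc = np.mean(np.sum(norm.logpdf(z, 0, 1) - norm.logpdf(z, mu, sigma), axis=1))

print(neg_kl_analytic, neg_kl_mc)   # should agree closely
```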
Computation Graph

[Figure: computation graph of the VAE. The inference model (parameters $\lambda$) maps $x$ to $\mu$ and $\sigma$; together with $\varepsilon \sim \mathcal{N}(0, 1)$ these produce $z$; the generative model (parameters $\theta$) maps $z$ back to $x$; the KL term is computed from $\mu$ and $\sigma$.]
Example

[Figure: graphical model with latent $z$ and observed sentence $x_1^m$, generative parameters $\theta$ and variational parameters $\lambda$, replicated N times]

Generative model:
  $Z \sim \mathcal{N}(0, I)$
  $X_i \mid z, x_{<i} \sim \mathrm{Cat}(f_\theta(z, x_{<i}))$

Inference model:
  $Z \sim \mathcal{N}(\mu_\lambda(x_1^m), \sigma_\lambda(x_1^m)^2)$

Bowman et al. (2016)
VAEs – Summary

Advantages
  Backprop training
  Easy to implement
  Posterior inference possible
  One objective for both NNs

Drawbacks
  Discrete latent variables are difficult
  Optimisation may be difficult with several latent variables
  Location-scale families only, but see Ruiz et al. (2016) and Kucukelbir et al. (2017)
Summary

Deep learning in NLP
  task-driven feature extraction
  models with more realistic assumptions

Probabilistic modelling
  better (or at least more explicit) statistical assumptions
  compact models
  semi-supervised learning
Literature

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. 2014. URL http://arxiv.org/abs/1409.0473.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. 2016. URL https://arxiv.org/abs/1601.00670.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10-21, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/K16-1002.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. 2013. URL http://arxiv.org/abs/1312.6114.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1-45, 2017. URL http://jmlr.org/papers/v18/16-107.html.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278-1286, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.pdf.

Francisco R. Ruiz, Michalis Titsias, and David Blei. The generalized reparameterization gradient. In NIPS, pages 460-468, 2016. URL http://papers.nips.cc/paper/6328-the-generalized-reparameterization-gradient.pdf.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1009.

Michalis Titsias and Miguel Lazaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, pages 1971-1979, 2014. URL http://jmlr.org/proceedings/papers/v32/titsias14.pdf.