Deep Generative Models: Variational Autoencoders
Sudeshna Sarkar
5 April 2017
Generative Nets
• Generative models that represent probability distributions over multiple variables in some way.
• Directed Generative Nets
– Differentiable Generator Nets
Differentiable Generator Nets
• Many generative models are based on the idea of using a differentiable generator network.
• The model transforms samples of latent variables z to samples x, or to distributions over samples x, using a differentiable function $g(z; \theta^{(g)})$, typically represented by a neural network.
1. Variational autoencoders - which pair the generator net with an inference net
2. Generative adversarial networks - which pair the generator network with a discriminator network
3. Techniques that train generator networks in isolation.
Generator Networks
• Generator networks are essentially just parameterized computational procedures for generating samples
– the architecture provides the family of possible distributions to sample from
– the parameters select a distribution from within that family.
• For example, the standard procedure for drawing samples from a normal distribution with mean µ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network.
– This generator network contains just one affine layer:
$x = g(z) = \mu + Lz$
where $L$ is given by the Cholesky decomposition of $\Sigma$ (so that $\Sigma = LL^\top$).
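A minimal NumPy sketch of this one-layer generator (the particular µ and Σ values are illustrative, not from the slides): draw z from a standard normal and apply the affine map $\mu + Lz$.

```python
import numpy as np

# Illustrative target distribution: mean mu, covariance Sigma
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

L = np.linalg.cholesky(Sigma)   # lower-triangular factor with Sigma = L @ L.T

def generator(z):
    """The single affine layer: x = g(z) = mu + L z."""
    return mu + L @ z

z = np.random.randn(2)          # z ~ N(0, I)
x = generator(z)                # x ~ N(mu, Sigma)
```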
Generator networks
• To generate samples from more complicated distributions, we may use a feedforward network to represent a parametric family of nonlinear functions 𝑔, and use training data to infer the parameters selecting the desired function.
• We can think of g as providing a nonlinear change of variables that transforms the distribution over 𝑧 into the desired distribution over 𝑥.
• We often use indirect means of learning $g$.
• In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions
$p(x_i = 1 \mid z) = g(z)_i$
• In this case, when we use g to define p(x | z), we impose a distribution over x by marginalizing z:
$p(x) = \mathbb{E}_z\, p(x \mid z)$
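As a hedged illustration (the layer sizes and the latent dimension are assumptions, not from the slides), here is a PyTorch sketch of a generator net whose sigmoid outputs are the Bernoulli means $p(x_i = 1 \mid z)$:

```python
import torch
import torch.nn as nn

# Generator net: sigmoid outputs give the Bernoulli mean parameters p(x_i = 1 | z).
# Layer sizes and the latent dimension (20) are illustrative.
generator = nn.Sequential(
    nn.Linear(20, 128),
    nn.ReLU(),
    nn.Linear(128, 784),   # one output per pixel of a 28x28 image
    nn.Sigmoid(),
)

z = torch.randn(1, 20)            # z ~ N(0, I)
probs = generator(z)              # Bernoulli means g(z)_i
x = torch.bernoulli(probs)        # a binary sample from p(x | z)
```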
• The two different approaches to formulating generator nets have complementary strengths and weaknesses:
– emitting the parameters of a conditional distribution, versus
– directly emitting samples.
1. emitting the parameters of a conditional distribution
2. directly emitting the samples
• When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data.
• When the generator net provides samples directly, it is capable of generating only continuous data.
• The advantage of direct sampling is that we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer.
• Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria.
• In differentiable generator nets, the criteria are intractable because the data does not specify both the inputs z and the outputs x.
• The learning procedure needs to determine how to arrange z space in a useful way and, additionally, how to map from z to x.
• There are several approaches to training differentiable generator nets given only training samples of x.
Variational Autoencoder
• Graphical models + Neural networks
• A directed model that uses learned approximate inference and can be trained purely with gradient-based methods.
• Lets us design complex generative models of data and fit them to large datasets.
• They can be used to learn a low-dimensional representation Z of high-dimensional data X, such as images (e.g., of faces).
• X and Z are random variables. It is therefore possible to sample X from the distribution P(X|Z), thus creating, e.g., images of faces, MNIST digits, or speech.
VAE History
• Simultaneously discovered by:
– Kingma and Welling. “Auto-Encoding Variational Bayes.” ICLR, 2014.
– Rezende, Mohamed, and Wierstra. “Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models.” ICML, 2014.
Manifold Hypothesis
Variational autoencoders (idea of a low-dimensional manifold)
The neural net perspective
The encoder compresses data into a latent space (z). The decoder reconstructs the data given the hidden representation.
Example: x is a 28-by-28-pixel photo of a handwritten number.
• The encoder ‘encodes’ the data into a latent (hidden) representation space z of lower dimension; the encoder must therefore learn an efficient compression of the data.
• The lower-dimensional space is stochastic: the encoder outputs parameters to $q_\theta(z \mid x)$, which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representation z.
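A minimal PyTorch sketch of such an encoder (layer sizes, names, and the latent dimension are illustrative assumptions, not from the slides): it maps a flattened 28x28 image to the mean and log-variance of the Gaussian $q_\theta(z \mid x)$.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x (a 784-dim flattened image) to the parameters (mu, log sigma^2) of q_theta(z | x)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):   # sizes are illustrative
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)   # parameters of a diagonal Gaussian over z
```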
The neural net perspective
Example: x is a 28-by-28-pixel photo of a handwritten number.
• The decoder is a neural net denoted by $p_\phi(x \mid z)$, with weights and biases φ.
• Input: the latent representation z; output: the parameters of the probability distribution of the data.
• The decoder gets as input the latent representation of a digit z and outputs 784 parameters, one for each of the pixels in the image.
• Information is lost because the representation goes from a smaller to a larger dimensionality; how much is lost is measured by the reconstruction log-likelihood $\log p_\phi(x \mid z)$.
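A matching PyTorch sketch of the decoder (again with illustrative sizes): it maps a latent code z to 784 Bernoulli mean parameters, one per pixel.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent code z to 784 Bernoulli means, the parameters of p_phi(x | z)."""
    def __init__(self, z_dim=20, h_dim=256, x_dim=784):   # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim),
            nn.ReLU(),
            nn.Linear(h_dim, x_dim),
            nn.Sigmoid(),               # each output is p(x_i = 1 | z)
        )

    def forward(self, z):
        return self.net(z)
```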
Variational Autoencoders
• To generate a sample from the model,
– the VAE first draws a sample z from the code distribution $p_\text{model}(z)$.
– The sample is then run through a differentiable generator network g(z).
– Finally, x is sampled from a distribution $p_\text{model}(x; g(z)) = p_\text{model}(x \mid z)$.
• During training, the approximate inference network (or encoder) $q(z \mid x)$ is used to obtain z, and $p_\text{model}(x \mid z)$ is then viewed as a decoder network.
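Putting the generation steps above together, ancestral sampling from the model might look like the following sketch (reusing the illustrative Decoder and imports from earlier; the model is untrained here, so the samples are only structural placeholders).

```python
decoder = Decoder()                # the illustrative decoder sketched above
z = torch.randn(16, 20)            # z ~ p_model(z) = N(0, I)
probs = decoder(z)                 # g(z): Bernoulli means of p_model(x | z)
x = torch.bernoulli(probs)         # sample 16 binary 28x28 images (flattened)
```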
• The loss function of the variational autoencoder is the negative log-likelihood with a regularizer.
• The loss function $l_i$ for datapoint $x_i$ is:
$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\left[\log p_\phi(x_i \mid z)\right] + \mathrm{KL}\left(q_\theta(z \mid x_i)\,\|\,p(z)\right)$
and the total loss is the sum over datapoints, $L = \sum_i l_i$.
• The first term is the reconstruction loss, or expected negative log-likelihood of the i-th datapoint. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data.
• The second term is a regularizer: the KL divergence between the encoder’s distribution $q_\theta(z \mid x)$ and $p(z)$.
• In the variational autoencoder, p is specified as a standard Normal distribution with mean zero and variance one.
• This has the effect of keeping similar numbers’ representations close together.
• We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder
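A hedged PyTorch sketch of this loss for a Bernoulli decoder and a Gaussian encoder, using the closed-form KL divergence against the standard normal prior (the function and argument names are assumptions):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_probs, mu, logvar):
    """Batch VAE loss: reconstruction term plus KL(q_theta(z|x) || N(0, I)).

    x       : target images in [0, 1], shape (batch, 784)
    x_probs : decoder outputs (Bernoulli means), shape (batch, 784)
    mu, logvar : encoder outputs parameterizing q_theta(z|x), shape (batch, z_dim)
    """
    # Expected negative log-likelihood of x under the Bernoulli decoder
    recon = F.binary_cross_entropy(x_probs, x, reduction='sum')
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```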
The probability model perspective
• A variational autoencoder contains a specific probability model of data x and latent variables z.
• The joint probability of the model is $p(x, z) = p(x \mid z)\, p(z)$.
• The generative process for each datapoint $i$:
– Draw latent variables $z_i \sim p(z)$
– Draw datapoint $x_i \sim p(x \mid z)$
Inference in this model:
• Goal: to infer good values of the latent variables given observed data, i.e., the posterior
$p(z \mid x) = \dfrac{p(x \mid z)\, p(z)}{p(x)}$
• The evidence $p(x) = \int p(x \mid z)\, p(z)\, dz$ requires exponential time to compute, as it needs to be evaluated over all configurations of the latent variables. We therefore need to approximate this posterior distribution.
• Variational inference approximates the posterior with a family of distributions $q_\lambda(z \mid x)$.
• λ indexes the family of distributions. For example, if q were Gaussian, $\lambda_{x_i} = (\mu_{x_i}, \sigma^2_{x_i})$.
• How well does our variational posterior $q_\lambda(z \mid x)$ approximate the true posterior $p(z \mid x)$?
• The KL divergence measures the information lost when using q to approximate p:
$\mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z \mid x)\right) = \mathbb{E}_q\left[\log q_\lambda(z \mid x)\right] - \mathbb{E}_q\left[\log p(x, z)\right] + \log p(x)$
• This is intractable, since it involves the evidence $\log p(x)$. Consider instead the function
$\mathrm{ELBO}(\lambda) = \mathbb{E}_q\left[\log p(x, z)\right] - \mathbb{E}_q\left[\log q_\lambda(z \mid x)\right]$
• Combining this with the KL divergence, we can rewrite the evidence as
$\log p(x) = \mathrm{ELBO}(\lambda) + \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z \mid x)\right)$
• By Jensen’s inequality, the KL divergence is always greater than or equal to zero, so minimizing the KL divergence is equivalent to maximizing the ELBO (Evidence Lower Bound), which is tractable.
• In the variational autoencoder model there are only local latent variables: each $z_i$ is associated with a single datapoint $x_i$.
• So we can decompose the ELBO into a sum where each term depends on a single datapoint:
$\mathrm{ELBO}_i(\lambda) = \mathbb{E}_{q_\lambda(z \mid x_i)}\left[\log p(x_i \mid z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x_i)\,\|\,p(z)\right)$
• This allows us to use stochastic gradient descent with respect to the parameters $\lambda$.
Reparameterization
• Backpropagation is not possible through random sampling!
• How can we take derivatives with respect to the parameters of a stochastic variable?
• If we are given z drawn from a distribution $q_\theta(z \mid x)$, and we want to take derivatives of a function of z with respect to θ, how do we do that?
• Reparameterize the samples in a clever way, such that the stochasticity is independent of the parameters; e.g., for the normal distribution,
$z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim N(0, 1)$
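A short PyTorch sketch of this reparameterization (assuming the encoder outputs a mean and a log-variance, as in the earlier sketches):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    All randomness lives in eps, so gradients can flow back
    through mu and logvar (the encoder's outputs).
    """
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```

During training, z = reparameterize(mu, logvar) would replace direct sampling from $q_\theta(z \mid x)$, so the loss sketched earlier can be optimized end to end with gradient descent.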