Deep Generative Models: Variational Autoencoders
Sudeshna Sarkar
5 April 2017
Generative Nets
• Generative models that represent probability distributions over multiple variables in some way.
• Directed Generative Nets
– Differentiable Generator Nets
Differentiable Generator Nets
• Many generative models are based on the idea of using a differentiable generator network.
• The model transforms samples of latent variables z to samples x, or to distributions over samples x, using a differentiable function $g(z; \theta^{(g)})$, typically represented by a neural network.
1. Variational autoencoders - which pair the generator net with an inference net
2. Generative adversarial networks - which pair the generator network with a discriminator network
3. Techniques that train generator networks in isolation.
Generator Networks
• Generator networks are essentially just parameterized computational procedures for generating samples
– the architecture provides the family of possible distributions to sample from
– the parameters select a distribution from within that family.
• For example, the standard procedure for drawing samples from a normal distribution with mean µ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network.
– This generator network contains just one affine layer:
$x = g(z) = \mu + Lz$
where $L$ is given by the Cholesky decomposition of $\Sigma$ (so that $\Sigma = LL^\top$).
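A minimal NumPy sketch of this one-layer generator (the particular µ and Σ values are illustrative, not from the slides): draw z from a standard normal and apply the affine map $\mu + Lz$.

```python
import numpy as np

# Illustrative target distribution: mean mu, covariance Sigma
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

L = np.linalg.cholesky(Sigma)   # lower-triangular factor with Sigma = L @ L.T

def generator(z):
    """The single affine layer: x = g(z) = mu + L z."""
    return mu + L @ z

z = np.random.randn(2)          # z ~ N(0, I)
x = generator(z)                # x ~ N(mu, Sigma)
```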
Generator networks
• To generate samples from more complicated distributions, we may use a feedforward network to represent a parametric family of nonlinear functions 𝑔, and use training data to infer the parameters selecting the desired function.
• We can think of g as providing a nonlinear change of variables that transforms the distribution over 𝑧 into the desired distribution over 𝑥.
• We often use indirect means of learning $g$.
• In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions
$p(x_i = 1 \mid z) = g(z)_i$
• In this case, when we use g to define p(x | z), we impose a distribution over x by marginalizing z:
$p(x) = \mathbb{E}_z\, p(x \mid z)$
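As a hedged illustration (the layer sizes and the latent dimension are assumptions, not from the slides), here is a PyTorch sketch of a generator net whose sigmoid outputs are the Bernoulli means $p(x_i = 1 \mid z)$:

```python
import torch
import torch.nn as nn

# Generator net: sigmoid outputs give the Bernoulli mean parameters p(x_i = 1 | z).
# Layer sizes and the latent dimension (20) are illustrative.
generator = nn.Sequential(
    nn.Linear(20, 128),
    nn.ReLU(),
    nn.Linear(128, 784),   # one output per pixel of a 28x28 image
    nn.Sigmoid(),
)

z = torch.randn(1, 20)            # z ~ N(0, I)
probs = generator(z)              # Bernoulli means g(z)_i
x = torch.bernoulli(probs)        # a binary sample from p(x | z)
```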
• The two different approaches to formulating generator nets have complementary strengths and weaknesses:
– emitting the parameters of a conditional distribution, versus
– directly emitting samples.
1. emitting the parameters of a conditional distribution
2. directly emitting the samples
• When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data.
• When the generator net provides samples directly, it is capable of generating only continuous data.
• The advantage of direct sampling is that we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer.
• Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria.
• In differentiable generator nets, the criteria are intractable because the data does not specify both the inputs z and the outputs x.
• The learning procedure needs to determine how to arrange z space in a useful way and, additionally, how to map from z to x.
• There are several approaches to training differentiable generator nets given only training samples of x.
Variational Autoencoder
• Graphical models + Neural networks
• A directed model that uses learned approximate inference and can be trained purely with gradient-based methods.
• Lets us design complex generative models of data and fit them to large datasets.
• They can be used to learn a low-dimensional representation Z of high-dimensional data X, such as images (e.g., of faces).
• X and Z are random variables. It is therefore possible to sample X from the distribution P(X|Z), thus creating, e.g., images of faces, MNIST digits, or speech.
VAE History
• Simultaneously discovered by:
– Kingma and Welling. “Auto-Encoding Variational Bayes.” ICLR, 2014.
– Rezende, Mohamed, and Wierstra. “Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models.” ICML, 2014.
Manifold Hypothesis
Variational autoencoders (idea of a low-dimensional manifold)
The neural net perspective
The encoder compresses data into a latent space (z). The decoder reconstructs the data given the hidden representation.
Example: x is a 28-by-28-pixel photo of a handwritten number.
• The encoder ‘encodes’ the data into a latent (hidden) representation space z of lower dimension; the encoder must therefore learn an efficient compression of the data.
• The lower-dimensional space is stochastic: the encoder outputs parameters to $q_\theta(z \mid x)$, which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representation z.
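A minimal PyTorch sketch of such an encoder (layer sizes, names, and the latent dimension are illustrative assumptions, not from the slides): it maps a flattened 28x28 image to the mean and log-variance of the Gaussian $q_\theta(z \mid x)$.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x (a 784-dim flattened image) to the parameters (mu, log sigma^2) of q_theta(z | x)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):   # sizes are illustrative
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)   # parameters of a diagonal Gaussian over z
```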
The neural net perspective
Example: x is a 28-by-28-pixel photo of a handwritten number.
• The decoder is a neural net denoted by $p_\phi(x \mid z)$, with weights and biases φ.
• Input: the latent representation z; output: the parameters of the probability distribution of the data.
• The decoder gets as input the latent representation of a digit z and outputs 784 parameters, one for each of the pixels in the image.
• Information is lost because the representation goes from a smaller to a larger dimensionality; how much is lost is measured by the reconstruction log-likelihood $\log p_\phi(x \mid z)$.
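A matching PyTorch sketch of the decoder (again with illustrative sizes): it maps a latent code z to 784 Bernoulli mean parameters, one per pixel.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent code z to 784 Bernoulli means, the parameters of p_phi(x | z)."""
    def __init__(self, z_dim=20, h_dim=256, x_dim=784):   # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim),
            nn.ReLU(),
            nn.Linear(h_dim, x_dim),
            nn.Sigmoid(),               # each output is p(x_i = 1 | z)
        )

    def forward(self, z):
        return self.net(z)
```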
Variational Autoencoders
• To generate a sample from the model,
– the VAE first draws a sample z from the code distribution $p_\text{model}(z)$.
– The sample is then run through a differentiable generator network g(z).
– Finally, x is sampled from a distribution $p_\text{model}(x; g(z)) = p_\text{model}(x \mid z)$.
• During training, the approximate inference network (or encoder) $q(z \mid x)$ is used to obtain z, and $p_\text{model}(x \mid z)$ is then viewed as a decoder network.
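Putting the generation steps above together, ancestral sampling from the model might look like the following sketch (reusing the illustrative Decoder and imports from earlier; the model is untrained here, so the samples are only structural placeholders).

```python
decoder = Decoder()                # the illustrative decoder sketched above
z = torch.randn(16, 20)            # z ~ p_model(z) = N(0, I)
probs = decoder(z)                 # g(z): Bernoulli means of p_model(x | z)
x = torch.bernoulli(probs)         # sample 16 binary 28x28 images (flattened)
```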
• The loss function of the variational autoencoder is the negative log-likelihood with a regularizer.
• The loss function $l_i$ for datapoint $x_i$ is:
$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\left[\log p_\phi(x_i \mid z)\right] + \mathrm{KL}\left(q_\theta(z \mid x_i)\,\|\,p(z)\right)$
and the total loss is the sum over datapoints, $L = \sum_i l_i$.
• The first term is the reconstruction loss, or expected negative log-likelihood of the i-th datapoint. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data.
• The second term is a regularizer: the KL divergence between the encoder’s distribution $q_\theta(z \mid x)$ and $p(z)$.
• In the variational autoencoder, p is specified as a standard Normal distribution with mean zero and variance one.
• This has the effect of keeping similar numbers’ representations close together.
• We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder
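A hedged PyTorch sketch of this loss for a Bernoulli decoder and a Gaussian encoder, using the closed-form KL divergence against the standard normal prior (the function and argument names are assumptions):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_probs, mu, logvar):
    """Batch VAE loss: reconstruction term plus KL(q_theta(z|x) || N(0, I)).

    x       : target images in [0, 1], shape (batch, 784)
    x_probs : decoder outputs (Bernoulli means), shape (batch, 784)
    mu, logvar : encoder outputs parameterizing q_theta(z|x), shape (batch, z_dim)
    """
    # Expected negative log-likelihood of x under the Bernoulli decoder
    recon = F.binary_cross_entropy(x_probs, x, reduction='sum')
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```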
The probability model perspective
• A variational autoencoder contains a specific probability model of data x and latent variables z.
• The joint probability of the model is $p(x, z) = p(x \mid z)\, p(z)$.
• The generative process for each datapoint $i$:
– Draw latent variables $z_i \sim p(z)$
– Draw datapoint $x_i \sim p(x \mid z)$
Inference in this model:
• Goal: to infer good values of the latent variables given observed data, i.e., the posterior
$p(z \mid x) = \dfrac{p(x \mid z)\, p(z)}{p(x)}$
• The evidence $p(x) = \int p(x \mid z)\, p(z)\, dz$ requires exponential time to compute, as it needs to be evaluated over all configurations of the latent variables. We therefore need to approximate this posterior distribution.
• Variational inference approximates the posterior with a family of distributions $q_\lambda(z \mid x)$.
• λ indexes the family of distributions. For example, if q were Gaussian, $\lambda_{x_i} = (\mu_{x_i}, \sigma^2_{x_i})$.
• How well does our variational posterior $q_\lambda(z \mid x)$ approximate the true posterior $p(z \mid x)$?
• The KL divergence measures the information lost when using q to approximate p:
$\mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z \mid x)\right) = \mathbb{E}_q\left[\log q_\lambda(z \mid x)\right] - \mathbb{E}_q\left[\log p(x, z)\right] + \log p(x)$
• This is intractable, since it involves the evidence $\log p(x)$. Consider instead the function
$\mathrm{ELBO}(\lambda) = \mathbb{E}_q\left[\log p(x, z)\right] - \mathbb{E}_q\left[\log q_\lambda(z \mid x)\right]$
• Combining this with the KL divergence, we can rewrite the evidence as
$\log p(x) = \mathrm{ELBO}(\lambda) + \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z \mid x)\right)$
• By Jensen’s inequality, the KL divergence is always greater than or equal to zero, so minimizing the KL divergence is equivalent to maximizing the ELBO (Evidence Lower Bound), which is tractable.
• In the variational autoencoder model there are only local latent variables: each $z_i$ is associated with a single datapoint $x_i$.
• So we can decompose the ELBO into a sum where each term depends on a single datapoint:
$\mathrm{ELBO}_i(\lambda) = \mathbb{E}_{q_\lambda(z \mid x_i)}\left[\log p(x_i \mid z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x_i)\,\|\,p(z)\right)$
• This allows us to use stochastic gradient descent with respect to the parameters $\lambda$.
Reparameterization
• Backpropagation is not possible through random sampling!
• How can we take derivatives with respect to the parameters of a stochastic variable?
• If we are given z drawn from a distribution $q_\theta(z \mid x)$, and we want to take derivatives of a function of z with respect to θ, how do we do that?
• Reparameterize the samples in a clever way, such that the stochasticity is independent of the parameters; e.g., for the normal distribution,
$z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim N(0, 1)$
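A short PyTorch sketch of this reparameterization (assuming the encoder outputs a mean and a log-variance, as in the earlier sketches):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    All randomness lives in eps, so gradients can flow back
    through mu and logvar (the encoder's outputs).
    """
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```

During training, z = reparameterize(mu, logvar) would replace direct sampling from $q_\theta(z \mid x)$, so the loss sketched earlier can be optimized end to end with gradient descent.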