
(DL hacks reading group) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks



Page 1:

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Page 2:

About the paper
• ICML 2016
• Authors: Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, Ole Winther (Technical University of Denmark)
• Maaløe and his co-authors are prominent in recent VAE research (presenter's own survey)
  • They proposed VAE+GAN
  • They are also known for state-of-the-art semi-supervised learning results
• This paper experiments extensively with making VAEs deep (stacking stochastic layers)
  • It also proposes a Ladder-network-style VAE
  • It studies training techniques such as warm-up and batch normalization in depth
• Many related VAE works are also introduced briefly in this presentation
  • Fair warning: the coverage goes fairly deep into niche details

Page 3:

Outline
• Overview of VAEs
• Related work on VAEs
  • Semi-supervised learning
  • Methods for obtaining flexible distributions
• Proposed methods
  • Probabilistic ladder network
  • Warm-up
  • Batch normalization
• Experiments
• Summary
• Extras

Page 4:

Outline
• Overview of VAEs
• Related work on VAEs
  • Semi-supervised learning
  • Methods for obtaining flexible distributions
• Proposed methods
  • Probabilistic ladder network
  • Warm-up
  • Batch normalization
• Experiments
• Summary
• Extras

Page 5:

Variational Autoencoder
• Variational Autoencoder [Kingma+ 13][Rezende+ 14]
• Learns a probability distribution from the data
• Introduces an approximate distribution q(z|x)
  • The true posterior p(z|x) cannot be computed directly, so it is approximated by another distribution that is trained to approach it
• The distribution parameters to be learned are θ and φ

Generative model: z ∼ p_θ(z), x ∼ p_θ(x|z)
Inference model: q_φ(z|x)

Page 6:

Definition of the model distributions
• The inference model q_φ(z|x) is specified by two outputs, a mean and a variance
• By splitting the latent variables into layers, the inference model can be written as a product of per-layer distributions
• The latent layers of the generative model are likewise Gaussian, and the number of layers can be increased as follows

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

layer or above. To alleviate these problems we propose (1) the probabilistic ladder network, (2) a warm-up period to support stochastic units staying active in early training, and (3) use of batch normalization, for optimizing DGMs and show that the likelihood can increase for up to five layers of stochastic latent variables. These models, consisting of deep hierarchies of latent variables, are both highly expressive while maintaining the computational efficiency of fully factorized models. We first show that these models have competitive generative performance, measured in terms of test log likelihood, when compared to equally or more complicated methods for creating flexible variational distributions such as the Variational Gaussian Processes (Tran et al., 2015), Normalizing Flows (Rezende & Mohamed, 2015) or Importance Weighted Autoencoders (Burda et al., 2015). We find that batch normalization and warm-up always increase the generative performance, suggesting that these methods are broadly useful. We also show that the probabilistic ladder network performs as good or better than strong VAEs, making it an interesting model for further studies. Secondly, we study the learned latent representations. We find that the methods proposed here are necessary for learning rich latent representations utilizing several layers of latent variables. A qualitative assessment of the latent representations further indicates that the multi-layered DGMs capture high level structure in the datasets which is likely to be useful for semi-supervised learning.

In summary, our contributions are:

• A new parametrization of the VAE inspired by the Ladder network performing as well or better than the current best models.

• A novel warm-up period in training increasing both the generative performance across several different datasets and the number of active stochastic latent variables.

• We show that batch normalization is essential for training VAEs with several layers of stochastic latent variables.

2. Methods
Variational autoencoders simultaneously train a generative model $p_\theta(x, z) = p_\theta(x|z)\,p_\theta(z)$ for data x using auxiliary latent variables z, and an inference model $q_\phi(z|x)$¹ by optimizing a variational lower bound to the likelihood $p_\theta(x) = \int p_\theta(x, z)\,dz$.

¹The inference model is also known as the recognition model or the encoder, and the generative model as the decoder.

Figure 2. Flow of information in the inference and generative models of a) the probabilistic ladder network and b) the VAE. The probabilistic ladder network allows direct integration (+ in the figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model. (Figure labels: deterministic bottom-up pathway in the inference model; stochastic top-down pathway through KL-divergences in the generative model; "Likelihood", "Posterior", "Prior", "Copy"; direct flow of information vs. indirect top-down information through the prior.)

The generative model $p_\theta$ is specified as follows:

$p_\theta(x|z_1) = \mathcal{N}\!\left(x \mid \mu_\theta(z_1), \sigma^2_\theta(z_1)\right)$  or  (1)
$p_\theta(x|z_1) = \mathcal{B}\!\left(x \mid \mu_\theta(z_1)\right)$  (2)

for continuous-valued (Gaussian $\mathcal{N}$) or binary-valued (Bernoulli $\mathcal{B}$) data, respectively. The latent variables z are split into L layers $z_i$, $i = 1 \dots L$:

$p_\theta(z_i \mid z_{i+1}) = \mathcal{N}\!\left(z_i \mid \mu_{\theta,i}(z_{i+1}), \sigma^2_{\theta,i}(z_{i+1})\right)$  (3)
$p_\theta(z_L) = \mathcal{N}(z_L \mid 0, I).$  (4)

The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models.

Each layer in the inference model $q_\phi(z|x)$ is specified using a fully factorized Gaussian distribution:

$q_\phi(z_1|x) = \mathcal{N}\!\left(z_1 \mid \mu_{\phi,1}(x), \sigma^2_{\phi,1}(x)\right)$  (5)
$q_\phi(z_i \mid z_{i-1}) = \mathcal{N}\!\left(z_i \mid \mu_{\phi,i}(z_{i-1}), \sigma^2_{\phi,i}(z_{i-1})\right)$  (6)

for $i = 2 \dots L$.

Functions $\mu(\cdot)$ and $\sigma^2(\cdot)$ in both the generative and the inference models are implemented as:

$d(y) = \mathrm{MLP}(y)$  (7)
$\mu(y) = \mathrm{Linear}(d(y))$  (8)
$\sigma^2(y) = \mathrm{Softplus}(\mathrm{Linear}(d(y))),$  (9)

where MLP is a two-layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the $\log(1 + \exp(\cdot))$ nonlinearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP part is shared between $\mu$ and $\sigma^2$ whereas the last Linear layer is not shared.
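The following minimal numpy sketch illustrates the shared-MLP parameterization of Eqs. (7)-(9). It is my own illustration, not the paper's code: the layer sizes, the leaky-ReLU nonlinearity inside the MLP, and the random initialization are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Parameters of a single Linear layer."""
    return {"W": rng.normal(0.0, 0.01, (in_dim, out_dim)), "b": np.zeros(out_dim)}

def apply_linear(p, y):
    return y @ p["W"] + p["b"]

def leaky_relu(a):
    return np.maximum(0.1 * a, a)      # nonlinearity inside the MLP (an assumption)

def softplus(a):
    return np.log1p(np.exp(a))         # log(1 + exp(.)), element-wise, as in Eq. (9)

def make_block(in_dim, hidden, z_dim):
    """One mu/sigma^2 block: a shared two-layer MLP d(y) plus two separate Linear heads."""
    return {"mlp1": linear(in_dim, hidden), "mlp2": linear(hidden, hidden),
            "mu": linear(hidden, z_dim), "var": linear(hidden, z_dim)}

def apply_block(p, y):
    d = leaky_relu(apply_linear(p["mlp2"], leaky_relu(apply_linear(p["mlp1"], y))))  # d(y) = MLP(y)
    mu = apply_linear(p["mu"], d)                          # mu(y)      = Linear(d(y))
    var = softplus(apply_linear(p["var"], d))              # sigma^2(y) = Softplus(Linear(d(y)))
    return mu, var

block = make_block(in_dim=784, hidden=512, z_dim=64)       # sizes are illustrative only
mu, var = apply_block(block, rng.normal(size=(8, 784)))    # a batch of 8 inputs
print(mu.shape, var.shape)                                 # (8, 64) (8, 64)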



Page 7:

Representing distributions with neural networks
• The Gaussian distribution N(x|µ, σ²) can be parameterized by a neural network
• The Bernoulli distribution B(x|µ) can likewise be parameterized (with a Sigmoid output for the mean)
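As a small sketch of the two output choices, the per-example log-likelihoods of a Gaussian and a Bernoulli decoder can be evaluated as below; the shapes and the toy data are my own assumptions, not taken from the paper.

import numpy as np

def gaussian_log_lik(x, mu, var):
    """log N(x | mu, var), summed over dimensions (diagonal covariance)."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var), axis=-1)

def bernoulli_log_lik(x, logits):
    """log B(x | sigmoid(logits)), summed over dimensions; x is 0/1-valued.
    Uses log-sigmoid in a numerically stable form."""
    return np.sum(x * -np.logaddexp(0.0, -logits)
                  + (1 - x) * -np.logaddexp(0.0, logits), axis=-1)

x_bin = (np.random.rand(8, 784) > 0.5).astype(float)
print(bernoulli_log_lik(x_bin, np.zeros((8, 784))).shape)   # (8,)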


How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Casper Kaae Sønderby¹, Tapani Raiko², Lars Maaløe³, Søren Kaae Sønderby¹, Ole Winther¹,³
¹[...] Centre, Department of Biology, University of Copenhagen, Denmark
²Department of Computer Science, Aalto University, Finland
³Department of Applied Mathematics and Computer Science, Technical University of Denmark

Abstract
Variational autoencoders are a powerful framework for unsupervised learning. However, previous work has been restricted to shallow models with one or two layers of fully factorized stochastic latent variables, limiting the flexibility of the latent representation. We propose three advances in training algorithms of variational autoencoders, for the first time allowing to train deep models of up to five stochastic layers, (1) using a structure similar to the Ladder network as the inference model, (2) warm-up period to support stochastic units staying active in early training, and (3) use of batch normalization. Using these improvements we show state-of-the-art log-likelihood results for generative modeling on several benchmark datasets.

1. Introduction
The recently introduced variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) provides a framework for deep generative models (DGM). DGMs have later been shown to be a powerful framework for semi-supervised learning (Kingma et al., 2014; Maaløe et al., 2016). Another line of research starting from denoising autoencoders introduced the Ladder network for unsupervised learning (Valpola, 2014) which has also been shown to perform very well in the semi-supervised setting (Rasmus et al., 2015).

Here we study DGMs with several layers of latent variables, each conditioned on the layer above, which allows highly flexible latent distributions. We study two different model parameterizations: the first is a simple extension of the VAE to multiple layers of latent variables and the second is parameterized in such a way that it can be regarded as a probabilistic variational variant of the Ladder network which, contrary to the VAE, allows interactions between a bottom-up and top-down inference signal.

Figure 1. Inference (or encoder/recognition) and generative (or decoder) models. a) VAE inference model, b) probabilistic ladder inference model and c) generative model. The z's are latent variables sampled from the approximate posterior distribution q with mean and variances parameterized using neural networks.

Previous work on DGMs have been restricted to shallow models with one or two layers of stochastic latent variables, constraining the performance by the restrictive mean field approximation of the intractable posterior distribution. Using only gradient descent optimization we show that DGMs are only able to utilize the stochastic latent variables in the second layer to a limited degree and not at all in the third

arXiv:1602.02282v1 [stat.ML] 6 Feb 2016

Page 8:

Deriving the variational lower bound
• The lower bound on the marginal likelihood is derived as follows (detailed derivation omitted; see previous slides)
• The expectation is estimated by sampling via the reparameterization trick
  • Why the reparameterization trick? Rewriting the sample as a transformation of a parameter-free Gaussian gives an unbiased estimate of the gradient of the lower bound that is easy to obtain by sampling

$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathcal{L}(\theta, \phi; x)$

$\nabla_{\theta,\phi}\,\mathcal{L}(\theta, \phi; x) = \nabla_{\theta,\phi}\,\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\mathcal{N}(\epsilon|0,I)}\!\left[\nabla_{\theta,\phi}\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$  (the gradient moves inside the expectation)

$\approx \frac{1}{L}\sum_{l=1}^{L} \nabla_{\theta,\phi}\log \frac{p_\theta(x, z^{(l)})}{q_\phi(z^{(l)}|x)}$
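To make the reparameterization-trick estimator above concrete, here is a minimal numpy sketch (my own illustration, not the paper's code; the toy joint distribution and the sample count are assumptions):

import numpy as np

rng = np.random.default_rng(0)

def gaussian_log_pdf(z, mu, var):
    """log N(z | mu, var), summed over the latent dimensions (diagonal covariance)."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var), axis=-1)

def reparam_sample(mu, var, eps):
    """z = mu + sigma * eps: a deterministic, differentiable function of (mu, var)."""
    return mu + np.sqrt(var) * eps

def elbo_estimate(x, mu_q, var_q, log_joint, n_samples=5):
    """(1/L) sum_l [ log p(x, z^(l)) - log q(z^(l)|x) ], z^(l) drawn via reparameterization."""
    eps = rng.standard_normal((n_samples,) + mu_q.shape)
    z = reparam_sample(mu_q, var_q, eps)
    return np.mean(log_joint(x, z) - gaussian_log_pdf(z, mu_q, var_q))

# Toy joint: p(z) = N(0, I), p(x|z) = N(x | z, I), for illustration only.
def toy_log_joint(x, z):
    return gaussian_log_pdf(z, 0.0, 1.0) + gaussian_log_pdf(x, z, 1.0)

x = rng.standard_normal(4)
print(elbo_estimate(x, mu_q=np.zeros(4), var_q=np.ones(4), log_joint=toy_log_joint))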

Page 9:

Two forms of the lower bound
• There are two ways to compute the lower bound:
  • A: estimate the bound in its original form (a single expectation)
  • B: split it into a KL term and a reconstruction term
• This paper uses form B
  • Fewer terms have to be estimated by sampling, so the variance is lower and training is easier
• Many recent works (e.g. IWAE) instead use form A
  • In the presenter's reproduction experiments, form A actually tended to work better

$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$  (form A)

$\mathcal{L}(\theta, \phi; x) = -\mathrm{KL}\!\left[q_\phi(z|x)\,\|\,p_\theta(z)\right] + \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]$  (form B)
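As a sketch of form B under the standard single-layer assumptions of a N(0, I) prior and a diagonal Gaussian posterior (a textbook closed form, not something specific to this paper), the KL term can be computed analytically while only the reconstruction term is sampled:

import numpy as np

def kl_diag_gaussian_vs_std_normal(mu, var):
    """KL[ N(mu, var) || N(0, I) ] = 0.5 * sum( mu^2 + var - log(var) - 1 ), per example."""
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0, axis=-1)

def elbo_form_b(kl, recon_log_lik_samples):
    """Form B: -KL + E_q[log p(x|z)], with only the reconstruction expectation sampled.
    Form A would instead sample the whole integrand log p(x,z) - log q(z|x)."""
    return -kl + np.mean(recon_log_lik_samples, axis=0)

mu, var = np.zeros((8, 16)), np.ones((8, 16)) * 0.5
recon = np.random.normal(-90.0, 1.0, size=(5, 8))   # 5 toy MC samples of log p(x|z) per example
print(elbo_form_b(kl_diag_gaussian_vs_std_normal(mu, var), recon).shape)   # (8,)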

Page 10:

The lower bound in the hierarchical case
• What happens when the latent variables are arranged in layers?
• Form A of the bound follows straightforwardly
• Form B can also be obtained if each KL term is estimated by sampling
• The presenter had assumed form B was not usable in the hierarchical case, but working it out shows that it is

$p_\theta(x, z) = p_\theta(z_L)\,p_\theta(z_{L-1}|z_L)\cdots p_\theta(x|z_1)$
$q_\phi(z|x) = q_\phi(z_1|x)\,q_\phi(z_2|z_1)\cdots q_\phi(z_L|z_{L-1})$

$\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(z_L)\,p_\theta(z_{L-1}|z_L)\cdots p_\theta(x|z_1)}{q_\phi(z_1|x)\,q_\phi(z_2|z_1)\cdots q_\phi(z_L|z_{L-1})}\right]$

$-\mathrm{KL}\!\left[q_\phi(z|x)\,\|\,p_\theta(z)\right] + \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] = -\mathrm{KL}\!\left[q_\phi(z|x)\,\|\,p_\theta(z_L)\right] + \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(z_{L-1}|z_L)\cdots p_\theta(x|z_1)\right]$

(the q factors are sampled up to the corresponding layer)
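The following toy sketch estimates form A of the bound for a two-layer hierarchy. The stand-ins for the mean/variance networks and all dimensions are assumptions for illustration only, not the paper's architecture:

import numpy as np

rng = np.random.default_rng(0)

def log_gauss(z, mu, var):
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var), axis=-1)

def sample(mu, var):
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

# Form A for a 2-layer hierarchy:
#   inference  q(z1|x) q(z2|z1),   generative  p(z2) p(z1|z2) p(x|z1)
def elbo_two_layer(x, q1, q2, p1, px):
    mu1, var1 = q1(x);  z1 = sample(mu1, var1)          # bottom-up: q(z1|x)
    mu2, var2 = q2(z1); z2 = sample(mu2, var2)          # bottom-up: q(z2|z1)
    log_q = log_gauss(z1, mu1, var1) + log_gauss(z2, mu2, var2)
    pmu1, pvar1 = p1(z2)                                # top-down conditional prior p(z1|z2)
    pxmu, pxvar = px(z1)                                # likelihood p(x|z1)
    log_p = log_gauss(z2, 0.0, 1.0) + log_gauss(z1, pmu1, pvar1) + log_gauss(x, pxmu, pxvar)
    return log_p - log_q                                # single-sample estimate of L(theta, phi; x)

# Toy stand-ins for the mean/variance networks (not the paper's MLPs).
toy_net = lambda d_out: (lambda y: (np.full(d_out, np.tanh(y).mean()), np.full(d_out, 0.3)))
x = rng.standard_normal(6)
print(elbo_two_layer(x, q1=toy_net(4), q2=toy_net(2), p1=toy_net(4), px=toy_net(6)))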

Page 11:

Outline
• Overview of VAEs
• Related work on VAEs
  • Semi-supervised learning
  • Methods for obtaining flexible distributions
• Proposed methods
  • Probabilistic ladder network
  • Warm-up
  • Batch normalization
• Experiments
• Summary
• Extras

Page 12:

Semi-supervised learning with VAEs
• VAEs are known to be effective for semi-supervised learning
  • The inference model in particular lends itself naturally to semi-supervised learning
• Semi-supervised learning with the CVAE [Kingma+ 2014]
  (for details, see […]'s slides)
• Variational Fair Autoencoder [Louizos+ 15]
  • Treats X as a sensitive attribute and adds an MMD penalty so that the posterior over the latent code becomes (approximately) independent of X

where $\sigma_\phi(x)$ is a vector of standard deviations, $\pi_\phi(x)$ is a probability vector, and the functions $\mu_\phi(x)$, $\sigma_\phi(x)$ and $\pi_\phi(x)$ are represented as MLPs.

3.1.1 Latent Feature Discriminative Model Objective

For this model, the variational bound $\mathcal{J}(x)$ on the marginal likelihood for a single data point is:

$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left[q_\phi(z|x)\,\|\,p_\theta(z)\right] = -\mathcal{J}(x),$  (5)

The inference network $q_\phi(z|x)$ (3) is used during training of the model using both the labelled and unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled data set, and the features used for training the classifier.

3.1.2 Generative Semi-supervised Model Objective

For this model, we have two cases to consider. In the first case, the label corresponding to a data point is observed and the variational bound is a simple extension of equation (5):

$\log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z|x,y)}\!\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z|x,y)\right] = -\mathcal{L}(x, y),$  (6)

For the case where the label is missing, it is treated as a latent variable over which we perform posterior inference and the resulting bound for handling data points with an unobserved label y is:

$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(y,z|x)}\!\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p(z) - \log q_\phi(y,z|x)\right] = \sum_y q_\phi(y|x)\left(-\mathcal{L}(x,y)\right) + \mathcal{H}\!\left(q_\phi(y|x)\right) = -\mathcal{U}(x).$  (7)

The bound on the marginal likelihood for the entire dataset is now:

$\mathcal{J} = \sum_{(x,y)\sim \widetilde{p}_l} \mathcal{L}(x,y) + \sum_{x \sim \widetilde{p}_u} \mathcal{U}(x)$  (8)

The distribution $q_\phi(y|x)$ (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also used at test time for predictions of any unseen data.

In the objective function (8), the label predictive distribution $q_\phi(y|x)$ contributes only to the second term relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy this, we add a classification loss to (8), such that the distribution $q_\phi(y|x)$ also learns from labelled data. The extended objective function is:

$\mathcal{J}^\alpha = \mathcal{J} + \alpha \cdot \mathbb{E}_{\widetilde{p}_l(x,y)}\!\left[-\log q_\phi(y|x)\right],$  (9)

where the hyper-parameter $\alpha$ controls the relative weight between generative and purely discriminative learning. We use $\alpha = 0.1 \cdot N$ in all experiments. While we have obtained this objective function by motivating the need for all model components to learn at all times, the objective (9) can also be derived directly using the variational principle by instead performing inference over the parameters $\pi$ of the categorical distribution, using a symmetric Dirichlet prior over these parameters.

3.2 Optimisation

The bounds in equations (5) and (9) provide a unified objective function for optimisation of both the parameters $\theta$ and $\phi$ of the generative and inference models, respectively. This optimisation can be done jointly, without resort to the variational EM algorithm, by using deterministic reparameterisations of the expectations in the objective function, combined with Monte Carlo approximation, referred to in previous work as stochastic gradient variational Bayes (SGVB) (Kingma and Welling, 2014) or as stochastic backpropagation (Rezende et al., 2014). We describe the core strategy for the latent-feature discriminative model M1, since the same computations are used for the generative semi-supervised model.

When the prior p(z) is a spherical Gaussian distribution $p(z) = \mathcal{N}(z|0, I)$ and the variational distribution $q_\phi(z|x)$ is also a Gaussian distribution as in (3), the KL term in equation (5) can be computed

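To make the extended objective of Eq. (9) concrete, here is a toy sketch of how the labelled and unlabelled bounds and the weighted classification loss combine; all numbers are made up and this is not the authors' code:

import numpy as np

def semi_supervised_objective(L_xy, U_x, logq_y_given_x, alpha):
    """J^alpha = sum_labelled L(x,y) + sum_unlabelled U(x) + alpha * E_labelled[-log q(y|x)].
    L_xy, U_x: per-example bounds (negative ELBOs); logq_y_given_x: classifier log-probability
    of the true label for each labelled example."""
    J = np.sum(L_xy) + np.sum(U_x)
    return J + alpha * np.mean(-logq_y_given_x)

# Toy numbers only; in [Kingma+ 2014] alpha = 0.1 * N, with N the dataset size.
L_xy = np.array([120.3, 98.7])          # bounds for labelled examples
U_x = np.array([130.1, 115.2, 140.8])   # bounds for unlabelled examples
logq = np.log(np.array([0.7, 0.9]))     # q(y|x) evaluated at the true labels
print(semi_supervised_objective(L_xy, U_x, logq, alpha=0.1 * 5))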

Page 13:

State-of-the-art semi-supervised learning
• Auxiliary Deep Generative Models [Maaløe+ 16]
• Introduce auxiliary variables [Agakov+ 2004] as an additional set of latent variables
  • This lets the approximate distribution represent more complex posteriors
• This model achieves state-of-the-art semi-supervised learning results

Auxiliary Deep Generative Models

2. Auxiliary deep generative models
Recently Kingma (2013); Rezende et al. (2014) have coupled the approach of variational inference with deep learning, giving rise to powerful probabilistic models constructed by an inference neural network q(z|x) and a generative neural network p(x|z). This approach can be perceived as a variational equivalent to the deep auto-encoder, in which q(z|x) acts as the encoder and p(x|z) the decoder. However, the difference is that these models ensure efficient inference over various continuous distributions in the latent space z and complex input datasets x, where the posterior distribution p(x|z) is intractable. Furthermore, the gradients of the variational upper bound are easily defined by backpropagation through the network(s). To keep the computational requirements low the variational distribution q(z|x) is usually chosen to be a diagonal Gaussian, limiting the expressive power of the inference model.

In this paper we propose a variational auxiliary variable approach (Agakov and Barber, 2004) to improve the variational distribution: The generative model is extended with variables a to p(x, z, a) such that the original model is invariant to marginalization over a: $p(x, z, a) = p(a|x, z)\,p(x, z)$. In the variational distribution, on the other hand, a is used such that the marginal $q(z|x) = \int q(z|a, x)\,p(a|x)\,da$ is a general non-Gaussian distribution.

This hierarchical specification allows the latent variables to be correlated through a, while maintaining the computational efficiency of fully factorized models (cf. Fig. 1). In Sec. 4.1 we demonstrate the expressive power of the inference model by fitting a complex multimodal distribution.

2.1. Variational auto-encoder

The variational auto-encoder (VAE) has recently been introduced as a powerful method for unsupervised learning. Here a latent variable generative model $p_\theta(x|z)$ for data x is parameterized by a deep neural network with parameters $\theta$. The parameters are inferred by maximizing the variational lower bound of the likelihood $-\sum_i \mathcal{U}_{\mathrm{VAE}}(x_i)$ with

$\log p_\theta(x) = \log \int_z p(x, z)\,dz \;\ge\; \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\,p_\theta(z)}{q_\phi(z|x)}\right] \equiv -\mathcal{U}_{\mathrm{VAE}}(x).$  (1)

The inference model $q_\phi(z|x)$ is parameterized as a second deep neural network. The inference and generative parameters, $\theta$ and $\phi$, are jointly trained by optimizing Eq. 1 with an optimization framework (e.g. stochastic gradient descent), where we use the reparameterization trick for backpropagation through the Gaussian latent variables (Kingma, 2013; Rezende et al., 2014).

(a) Generative model P. (b) Inference model Q. (Graph nodes: x, y, z, a.)
Figure 1. Probabilistic graphical model of the ADGM for semi-supervised learning. The incoming joint connections to each variable are deep neural networks with parameters $\theta$ and $\phi$.

2.2. Auxiliary variables

We propose to extend the variational distribution with auxiliary variables a: $q(a, z|x) = q(z|a, x)\,q(a|x)$ such that the marginal distribution $q(z|x)$ can fit more complicated posteriors $p(z|x)$. In order to have an unchanged generative model, $p(x|z)$, it is required that the joint model $p(x, z, a)$ gives back the original $p(x, z)$ under marginalization over a, thus $p(x, z, a) = p(a|x, z)\,p(x, z)$. Auxiliary variables are used in the EM algorithm and Gibbs sampling and have previously been considered for variational learning by Agakov and Barber (2004). Recently, Ranganath et al. (2015) have proposed to make the parameters of the variational distribution stochastic, which leads to a similar model. It is important to note that in order not to fall back to the original VAE model one has to require $p(a|x, z) \ne p(a)$, see Agakov and Barber (2004) and App. B. The auxiliary VAE lower bound becomes

$\log p_\theta(x) = \log \int_a \int_z p(x, a, z)\,da\,dz$  (2)
$\ge \mathbb{E}_{q_\phi(a,z|x)}\!\left[\log \frac{p_\theta(a|z,x)\,p_\theta(x|z)\,p(z)}{q_\phi(a|x)\,q_\phi(z|a,x)}\right]$  (3)
$\equiv -\mathcal{U}_{\mathrm{AVAE}}(x),$  (4)

with $p_\theta(a|z,x)$ and $q_\phi(a|x)$ diagonal Gaussian distributions parameterized by deep neural networks.

2.3. Semi-supervised learning

The main focus of this paper is to use the auxiliary approach to build semi-supervised models that learn classifiers from labeled and unlabeled data. To encompass the class information we introduce an extra latent variable y. The generative model P is defined as
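To connect Eqs. (2)-(4) to code, here is a toy single-sample Monte Carlo sketch of the auxiliary VAE bound. The stand-in mean/variance networks and all dimensions are my own assumptions, not the ADGM architecture:

import numpy as np

rng = np.random.default_rng(0)

def log_gauss(v, mu, var):
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var), axis=-1)

# Single-sample estimate of
#   E_{q(a,z|x)}[ log p(a|z,x) + log p(x|z) + log p(z) - log q(a|x) - log q(z|a,x) ]
def auxiliary_elbo(x, q_a, q_z, p_a, p_x):
    mu_a, var_a = q_a(x);     a = mu_a + np.sqrt(var_a) * rng.standard_normal(mu_a.shape)
    mu_z, var_z = q_z(a, x);  z = mu_z + np.sqrt(var_z) * rng.standard_normal(mu_z.shape)
    log_q = log_gauss(a, mu_a, var_a) + log_gauss(z, mu_z, var_z)
    log_p = (log_gauss(a, *p_a(z, x)) +      # p(a|z,x)
             log_gauss(x, *p_x(z)) +         # p(x|z)
             log_gauss(z, 0.0, 1.0))         # p(z) = N(0, I)
    return log_p - log_q

# Toy mean/variance networks standing in for the deep networks of the ADGM.
toy = lambda d: (lambda *ins: (np.full(d, np.tanh(np.concatenate([np.atleast_1d(i) for i in ins])).mean()),
                               np.full(d, 0.4)))
x = rng.standard_normal(6)
print(auxiliary_elbo(x, q_a=toy(3), q_z=toy(4), p_a=toy(3), p_x=toy(6)))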


Page 14:

Outline
• Overview of VAEs
• Related work on VAEs
  • Semi-supervised learning
  • Methods for obtaining flexible distributions
• Proposed methods
  • Probabilistic ladder network
  • Warm-up
  • Batch normalization
• Experiments
• Summary
• Extras

Page 15:

Importance Weighted AE
• Importance Weighted Autoencoder [Burda+ 15]
• Proposes a lower bound that is closer to the log-likelihood than the standard variational lower bound
• Increasing the number of samples k gives a bound closer to the log-likelihood
• A further generalization: the variational Rényi bound [Li+ 16]
  • Making α smaller tightens the bound (for details, see previous slides)

$\log p_\theta(x) \;\ge\; \mathbb{E}_{z^{(1)},\dots,z^{(k)} \sim q_\phi(z|x)}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p_\theta(x, z^{(i)})}{q_\phi(z^{(i)}|x)}\right] \;\ge\; \mathcal{L}(\theta, \phi; x)$
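The importance-weighted bound above can be estimated with a few lines of numpy; the toy joint distribution and sample counts below are my own assumptions, not from [Burda+ 15]:

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def log_gauss(v, mu, var):
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var), axis=-1)

def iwae_bound(x, mu_q, var_q, log_joint, k=50):
    """log (1/k) sum_i p(x, z_i) / q(z_i | x), with z_i ~ q; equals the ELBO when k = 1."""
    z = mu_q + np.sqrt(var_q) * rng.standard_normal((k,) + mu_q.shape)
    log_w = log_joint(x, z) - log_gauss(z, mu_q, var_q)   # log importance weights, shape (k,)
    return logsumexp(log_w) - np.log(k)

# Toy joint p(z) = N(0, I), p(x|z) = N(x|z, I) for illustration only.
toy_joint = lambda x, z: log_gauss(z, 0.0, 1.0) + log_gauss(x, z, 1.0)
x = rng.standard_normal(4)
print(iwae_bound(x, np.zeros(4), np.ones(4), toy_joint, k=1),
      iwae_bound(x, np.zeros(4), np.ones(4), toy_joint, k=50))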

where $p(\mathcal{D}) = \int p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)\,d\theta$ is often called marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation $q(\theta)$ to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family $\mathcal{Q}$:

$q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta)\,\|\,p(\theta|\mathcal{D})].$  (8)

However the KL divergence in (8) is also intractable, mainly because of the difficult term $p(\mathcal{D})$. Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

$q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}),$  (9)

where the variational lower-bound or evidence lower-bound (ELBO) $\mathcal{L}_{VI}(q; \mathcal{D})$ is defined by

$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta)\,\|\,p(\theta|\mathcal{D})] = \mathbb{E}_q\!\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right].$  (10)

3 Variational Renyi Bound

Recall from Section 2.1 that the family of Renyi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the Renyi case? Consider approximating the true posterior $p(\theta|\mathcal{D})$ by minimizing Renyi's $\alpha$-divergence for some selected $\alpha \ge 0$:

$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta)\,\|\,p(\theta|\mathcal{D})].$  (11)

Now we verify the alternative optimization problem

$q(\theta) = \arg\max_{q \in \mathcal{Q}} \log p(\mathcal{D}) - D_\alpha[q(\theta)\,\|\,p(\theta|\mathcal{D})].$  (12)

When $\alpha \ne 1$, the objective can be rewritten as

$\log p(\mathcal{D}) - \frac{1}{\alpha - 1}\log \int q(\theta)^\alpha p(\theta|\mathcal{D})^{1-\alpha}\,d\theta = \log p(\mathcal{D}) - \frac{1}{\alpha - 1}\log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\,p(\mathcal{D})}\right)^{1-\alpha}\right] = \frac{1}{1 - \alpha}\log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; \mathcal{D}).$  (13)

We name this new objective the variational Renyi bound (VR). Importantly the following theorem is a direct result of Proposition 1.

Theorem 1. The objective $\mathcal{L}_\alpha(q; \mathcal{D})$ is continuous and non-increasing on $\alpha \in [0, 1] \cup \{\alpha : |\mathcal{L}_\alpha| < +\infty\}$. Especially for all $0 < \alpha < 1$,

$\mathcal{L}_{VI}(q; \mathcal{D}) = \lim_{\alpha \to 1} \mathcal{L}_\alpha(q; \mathcal{D}) \le \mathcal{L}_\alpha(q; \mathcal{D}) \le \mathcal{L}_0(q; \mathcal{D}).$

The last bound $\mathcal{L}_0(q; \mathcal{D}) = \log p(\mathcal{D})$ if and only if the support $\mathrm{supp}(p(\theta|\mathcal{D})) \subseteq \mathrm{supp}(q(\theta))$.

The above theorem indicates a smooth interpolation between the traditional variational lower-bound $\mathcal{L}_{VI}$ and the exact log marginal $\log p(\mathcal{D})$ under the mild assumption that q is supported where the true posterior is supported. This assumption holds for a lot of commonly used distributions, e.g. Gaussian, and in the following we assume this condition is satisfied.

Similar to traditional VI, the VR bound can also be applied as a surrogate objective to the MLE problem. In the following we detail the derivations of log-likelihood approximation using the variational auto-encoder [Kingma and Welling, 2014, Rezende et al., 2014] as an example.
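As a sketch of how the VR bound of Eq. (13) can be estimated in practice, here is a simple Monte Carlo version with a toy joint distribution; the estimator form and all settings are my own illustration, not code from [Li+ 16]:

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def log_gauss(v, mu, var):
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var), axis=-1)

def vr_bound(x, mu_q, var_q, log_joint, alpha, k=100):
    """Monte Carlo estimate of L_alpha = 1/(1-alpha) * log E_q[(p(x,z)/q(z|x))^(1-alpha)].
    alpha -> 1 recovers the ELBO; alpha = 0 with many samples approaches log p(x)."""
    z = mu_q + np.sqrt(var_q) * rng.standard_normal((k,) + mu_q.shape)
    log_w = log_joint(x, z) - log_gauss(z, mu_q, var_q)
    if np.isclose(alpha, 1.0):
        return np.mean(log_w)                                  # ELBO limit
    return (logsumexp((1.0 - alpha) * log_w) - np.log(k)) / (1.0 - alpha)

toy_joint = lambda x, z: log_gauss(z, 0.0, 1.0) + log_gauss(x, z, 1.0)
x = rng.standard_normal(4)
for a in (1.0, 0.5, 0.0):
    print(a, vr_bound(x, np.zeros(4), np.ones(4), toy_joint, alpha=a))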

Page 16:

Other approaches
• Normalizing flows [Rezende+ 15]
  • To make the approximate distribution more complex, the latent variables are passed through a sequence of invertible transformations
• Variational Gaussian Process [Tran+ 15]
  • Makes the variational approximation a Gaussian process: a nonparametric Bayesian approach

Variational Inference with Normalizing Flows

and involve matrix inverses that can be numerically unstable. We therefore require normalizing flows that allow for low-cost computation of the determinant, or where the Jacobian is not needed at all.

4.1. Invertible Linear-time Transformations

We consider a family of transformations of the form:

$f(z) = z + u\,h(w^\top z + b),$  (10)

where $\lambda = \{w \in \mathbb{R}^D, u \in \mathbb{R}^D, b \in \mathbb{R}\}$ are free parameters and $h(\cdot)$ is a smooth element-wise non-linearity, with derivative $h'(\cdot)$. For this mapping we can compute the logdet-Jacobian term in O(D) time (using the matrix determinant lemma):

$\psi(z) = h'(w^\top z + b)\,w$  (11)
$\left|\det \frac{\partial f}{\partial z}\right| = \left|\det(I + u\,\psi(z)^\top)\right| = \left|1 + u^\top \psi(z)\right|.$  (12)

From (7) we conclude that the density $q_K(z)$ obtained by transforming an arbitrary initial density $q_0(z)$ through the sequence of maps $f_k$ of the form (10) is implicitly given by:

$z_K = f_K \circ f_{K-1} \circ \dots \circ f_1(z)$
$\ln q_K(z_K) = \ln q_0(z) - \sum_{k=1}^{K} \ln \left|1 + u_k^\top \psi_k(z_k)\right|.$  (13)

The flow defined by the transformation (13) modifies the initial density $q_0$ by applying a series of contractions and expansions in the direction perpendicular to the hyperplane $w^\top z + b = 0$, hence we refer to these maps as planar flows.

As an alternative, we can consider a family of transformations that modify an initial density $q_0$ around a reference point $z_0$. The transformation family is:

$f(z) = z + \beta\,h(\alpha, r)(z - z_0),$  (14)
$\left|\det \frac{\partial f}{\partial z}\right| = \left[1 + \beta h(\alpha, r)\right]^{d-1}\left[1 + \beta h(\alpha, r) + \beta h'(\alpha, r)\,r\right],$

where $r = |z - z_0|$, $h(\alpha, r) = 1/(\alpha + r)$, and the parameters of the map are $\lambda = \{z_0 \in \mathbb{R}^D, \alpha \in \mathbb{R}, \beta \in \mathbb{R}\}$. This family also allows for linear-time computation of the determinant. It applies radial contractions and expansions around the reference point and is thus referred to as radial flows. We show the effect of expansions and contractions on a uniform and Gaussian initial density using the flows (10) and (14) in figure 1. This visualization shows that we can transform a spherical Gaussian distribution into a bimodal distribution by applying two successive transformations.

Not all functions of the form (10) or (14) will be invertible. We discuss the conditions for invertibility and how to satisfy them in a numerically stable way in the appendix.

K=1 K=2Planar Radial

q0 K=1 K=2K=10 K=10

Uni

t Gau

ssia

nU

nifo

rm

Figure 1. Effect of normalizing flow on two distributions.

[Figure 2 panels: Inference network | Generative model.]
Figure 2. Inference and generative models. Left: Inference network maps the observations to the parameters of the flow; Right: generative model which receives the posterior samples from the inference network during training time. Round containers represent layers of stochastic variables whereas square containers represent deterministic layers.

4.2. Flow-Based Free Energy Bound

If we parameterize the approximate posterior distribution with a flow of length K, $q_\phi(z|x) := q_K(z_K)$, the free energy (3) can be written as an expectation over the initial distribution $q_0(z)$:

$\mathcal{F}(x) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(x, z)]$
$= \mathbb{E}_{q_0(z_0)}[\ln q_K(z_K) - \log p(x, z_K)]$
$= \mathbb{E}_{q_0(z_0)}[\ln q_0(z_0)] - \mathbb{E}_{q_0(z_0)}[\log p(x, z_K)] - \mathbb{E}_{q_0(z_0)}\Big[\sum_{k=1}^{K} \ln|1 + u_k^\top \psi_k(z_k)|\Big]$.   (15)

Normalizing flows and this free energy bound can be used with any variational optimization scheme, including generalized variational EM. For amortized variational inference, we construct an inference model using a deep neural network to build a mapping from the observations x to the parameters of the initial density $q_0 = \mathcal{N}(\mu, \sigma)$ ($\mu \in \mathbb{R}^D$ and $\sigma \in \mathbb{R}^D$) as well as the parameters of the flow $\lambda$.
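The bound in Eq. (15) can be estimated by plain Monte Carlo once samples are pushed through the flow. The sketch below does this for a single planar map, with a stand-in Gaussian log p(x, z), since no generative model is specified in this excerpt.

```python
# Monte Carlo sketch of the flow-based free energy bound in Eq. (15), using a single
# planar flow and a stand-in Gaussian log p(x, z) purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
D, N = 2, 20000

# Stand-in joint density log p(x, z) for a fixed observation x (an assumption, not from the paper).
def log_p_xz(z):
    return -0.5 * ((z - 1.0)**2).sum(1) - D * np.log(2 * np.pi) - 1.3

w, u, b = np.ones(D), 0.3 * np.ones(D), 0.0        # one planar map f(z) = z + u tanh(w^T z + b)

z0 = rng.normal(size=(N, D))
log_q0 = -0.5 * (z0**2).sum(1) - 0.5 * D * np.log(2 * np.pi)

lin = z0 @ w + b
psi = (1.0 - np.tanh(lin)**2)[:, None] * w
log_det = np.log(np.abs(1.0 + psi @ u))
zK = z0 + np.outer(np.tanh(lin), u)

# F(x) = E_q0[ln q0(z0)] - E_q0[log p(x, z_K)] - E_q0[sum_k ln|1 + u_k^T psi_k(z_k)|]
free_energy = log_q0.mean() - log_p_xz(zK).mean() - log_det.mean()
print("estimated free energy F(x):", free_energy)
```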

4.3. Algorithm Summary and Complexity

The resulting algorithm is a simple modification of the amortized inference algorithm for DLGMs described by (Kingma & Welling, 2014; Rezende et al., 2014), which we summarize in algorithm 1. By using an inference net-

Variational Inference with Normalizing Flows

distribution:

$q(z') = q(z)\left|\det \frac{\partial f^{-1}}{\partial z'}\right| = q(z)\left|\det \frac{\partial f}{\partial z}\right|^{-1}$,   (5)

where the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying (5). The density $q_K(z)$ obtained by successively transforming a random variable $z_0$ with distribution $q_0$ through a chain of K transformations $f_k$ is:

$z_K = f_K \circ \ldots \circ f_2 \circ f_1(z_0)$   (6)

$\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K} \ln \left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|$,   (7)

where equation (6) will be used throughout the paper as a shorthand for the composition $f_K(f_{K-1}(\ldots f_1(x)))$. The path traversed by the random variables $z_k = f_k(z_{k-1})$ with initial distribution $q_0(z_0)$ is called the flow and the path formed by the successive distributions $q_k$ is a normalizing flow. A property of such transformations, often referred to as the law of the unconscious statistician (LOTUS), is that expectations w.r.t. the transformed density $q_K$ can be computed without explicitly knowing $q_K$. Any expectation $\mathbb{E}_{q_K}[h(z)]$ can be written as an expectation under $q_0$ as:

$\mathbb{E}_{q_K}[h(z)] = \mathbb{E}_{q_0}[h(f_K \circ f_{K-1} \circ \ldots \circ f_1(z_0))]$,   (8)

which does not require computation of the logdet-Jacobian terms when h(z) does not depend on $q_K$.

We can understand the effect of invertible flows as a sequence of expansions or contractions on the initial density. For an expansion, the map $z' = f(z)$ pulls the points z away from a region in $\mathbb{R}^d$, reducing the density in that region while increasing the density outside the region. Conversely, for a contraction, the map pushes points towards the interior of a region, increasing the density in its interior while reducing the density outside.

The formalism of normalizing flows now gives us a systematic way of specifying the approximate posterior distributions q(z|x) required for variational inference. With an appropriate choice of transformations $f_K$, we can initially use simple factorized distributions such as an independent Gaussian, and apply normalizing flows of different lengths to obtain increasingly complex and multi-modal distributions.

3.2. Infinitesimal Flows

It is natural to consider the case in which the length of the normalizing flow tends to infinity. In this case, we obtain an infinitesimal flow, that is described not in terms of a finite sequence of transformations (a finite flow), but as a partial differential equation describing how the initial density $q_0(z)$ evolves over 'time': $\frac{\partial}{\partial t} q_t(z) = T_t[q_t(z)]$, where T describes the continuous-time dynamics.

Langevin Flow. One important family of flows is given by the Langevin stochastic differential equation (SDE):

$dz(t) = F(z(t), t)\,dt + G(z(t), t)\,d\xi(t)$,   (9)

where $d\xi(t)$ is a Wiener process with $\mathbb{E}[\xi_i(t)] = 0$ and $\mathbb{E}[\xi_i(t)\xi_j(t')] = \delta_{i,j}\,\delta(t - t')$, F is the drift vector and $D = GG^\top$ is the diffusion matrix. If we transform a random variable z with initial density $q_0(z)$ through the Langevin flow (9), then the rules for the transformation of densities are given by the Fokker-Planck equation (or Kolmogorov equations in probability theory). The density $q_t(z)$ of the transformed samples at time t will evolve as:

$\frac{\partial}{\partial t} q_t(z) = -\sum_i \frac{\partial}{\partial z_i}\big[F_i(z, t)\, q_t\big] + \frac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial z_i \partial z_j}\big[D_{ij}(z, t)\, q_t\big]$.

In machine learning, we most often use the Langevin flow with $F(z, t) = -\nabla_z \mathcal{L}(z)$ and $G(z, t) = \sqrt{2}\,\delta_{ij}$, where $\mathcal{L}(z)$ is an unnormalised log-density of our model.

Importantly, in this case the stationary solution for $q_t(z)$ is given by the Boltzmann distribution: $q_\infty(z) \propto e^{-\mathcal{L}(z)}$. That is, if we start from an initial density $q_0(z)$ and evolve its samples $z_0$ through the Langevin SDE, the resulting points $z_\infty$ will be distributed according to $q_\infty(z) \propto e^{-\mathcal{L}(z)}$, i.e. the true posterior. This approach has been explored for sampling from complex densities by Welling & Teh (2011); Ahn et al. (2012); Suykens et al. (1998).

Hamiltonian Flow. Hamiltonian Monte Carlo can also be described in terms of a normalizing flow on an augmented space $\tilde{z} = (z, \omega)$ with dynamics resulting from the Hamiltonian $H(z, \omega) = -\mathcal{L}(z) - \frac{1}{2}\omega^\top M \omega$; HMC is also widely used in machine learning, e.g., Neal (2011). We will use the Hamiltonian flow to make a connection to the recently introduced Hamiltonian variational approach from Salimans et al. (2015) in section 5.

4. Inference with Normalizing Flows

To allow for scalable inference using finite normalizing flows, we must specify a class of invertible transformations that can be used and an efficient mechanism for computing the determinant of the Jacobian. While it is straightforward to build invertible parametric functions for use in equation (5), e.g., invertible neural networks (Baird et al., 2005; Rippel & Adams, 2013), such approaches typically have a complexity for computing the Jacobian determinant that scales as O(LD^3), where D is the dimension of the hidden layers and L is the number of hidden layers used. Furthermore, computing the gradients of the Jacobian determinant involves several additional operations that are also O(LD^3).


[Figure 1 (VGP) panels: (a) variational model; (b) generative model.]
Figure 1: (a) Graphical model of the variational Gaussian process. The VGP generates samples of latent variables z by evaluating random non-linear mappings of latent inputs ξ, and then drawing mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior distribution for a generative model (b), conditioned on data x.

"likelihood" $\prod_i q(z_i \mid \lambda_i)$. This specifies the variational model,

$q(z; \theta) = \int \Big[\prod_i q(z_i \mid \lambda_i)\Big]\, q(\lambda; \theta)\, d\lambda$,

which is governed by prior hyperparameters θ. Hierarchical variational models are richer than classical variational families: their expressiveness is determined by the complexity of the prior q(λ). Many expressive variational approximations can be viewed under this construct (Jaakkola & Jordan, 1998; Lawrence, 2000; Salimans et al., 2015; Tran et al., 2015).

2.2 GAUSSIAN PROCESSES

We now review the Gaussian process (GP) (Rasmussen & Williams, 2006). Consider a data set of m source-target pairs $D = \{(s_n, t_n)\}_{n=1}^m$, where each source $s_n$ has c covariates paired with a multi-dimensional target $t_n \in \mathbb{R}^d$. We aim to learn a function over all source-target pairs, $t_n = f(s_n)$, where $f : \mathbb{R}^c \to \mathbb{R}^d$ is unknown. Let the function f decouple as $f = (f_1, \ldots, f_d)$, where each $f_i : \mathbb{R}^c \to \mathbb{R}$. GP regression estimates the functional form of f by placing a prior,

$p(f) = \prod_{i=1}^{d} \mathcal{GP}(f_i; 0, K_{ss})$,

where $K_{ss}$ denotes a covariance function $k(s, s')$ over pairs of inputs $s, s' \in \mathbb{R}^c$. In this paper, we consider automatic relevance determination (ARD) kernels

$k(s, s') = \sigma^2_{ARD} \exp\Big(-\frac{1}{2}\sum_{j=1}^{c} \omega_j (s_j - s'_j)^2\Big)$,   (2)

with parameters $\theta = (\sigma^2_{ARD}, \omega_1, \ldots, \omega_c)$. The individual weights $\omega_j$ tune the importance of each dimension. They can be driven to zero, leading to automatic dimensionality reduction.
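A small sketch of the ARD kernel of Eq. (2); sigma2_ard and the weights omega are illustrative values, and a weight of zero switches the corresponding input dimension off.

```python
# Sketch of the ARD kernel in Eq. (2).
import numpy as np

def ard_kernel(S, S2, sigma2_ard, omega):
    """k(s, s') = sigma2_ard * exp(-0.5 * sum_j omega_j * (s_j - s'_j)^2)."""
    d = S[:, None, :] - S2[None, :, :]               # pairwise differences, shape (m, m', c)
    return sigma2_ard * np.exp(-0.5 * (omega * d**2).sum(-1))

rng = np.random.default_rng(0)
m, c = 5, 3
S = rng.normal(size=(m, c))
omega = np.array([1.0, 0.1, 0.0])                    # omega_j -> 0 switches a dimension off
K = ard_kernel(S, S, sigma2_ard=1.0, omega=omega)
print(K.shape, np.allclose(K, K.T))                  # (5, 5) True: a symmetric Gram matrix
```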

Given data D, the conditional distribution of the GP forms a distribution over mappings which interpolate between input-output pairs,

$p(f \mid D) = \prod_{i=1}^{d} \mathcal{GP}\big(f_i;\ K_{\xi s} K_{ss}^{-1} t_i,\ K_{\xi\xi} - K_{\xi s} K_{ss}^{-1} K_{\xi s}^\top\big)$.   (3)

Here, $K_{\xi s}$ denotes the covariance function $k(\xi, s)$ for an input ξ and over all data inputs $s_n$, and $t_i$ represents the ith output dimension.
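The conditional in Eq. (3) is the usual GP posterior over mappings; the sketch below computes its mean and covariance for toy data. The jitter added to K_ss is only for numerical stability and is not part of the equation.

```python
# Sketch of the GP conditional in Eq. (3): posterior mean/covariance of the mappings
# given data D = {(s_n, t_n)}; the toy data and kernel parameters are made up.
import numpy as np

rng = np.random.default_rng(0)
m, c, d, n_xi = 8, 2, 1, 4

def ard_kernel(A, B, sigma2=1.0, omega=None):
    omega = np.ones(A.shape[1]) if omega is None else omega
    diff = A[:, None, :] - B[None, :, :]
    return sigma2 * np.exp(-0.5 * (omega * diff**2).sum(-1))

S  = rng.normal(size=(m, c))            # data inputs s_n
T  = rng.normal(size=(m, d))            # data targets t_n
Xi = rng.normal(size=(n_xi, c))         # latent inputs xi

K_ss = ard_kernel(S, S) + 1e-6 * np.eye(m)       # jitter for numerical stability
K_xs = ard_kernel(Xi, S)
K_xx = ard_kernel(Xi, Xi)

K_ss_inv = np.linalg.inv(K_ss)
mean_i = K_xs @ K_ss_inv @ T[:, 0]               # K_{xi s} K_ss^{-1} t_i (output dim i = 0)
cov    = K_xx - K_xs @ K_ss_inv @ K_xs.T         # K_{xi xi} - K_{xi s} K_ss^{-1} K_{xi s}^T
print(mean_i.shape, cov.shape)                   # (4,) (4, 4)
```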

2.3 VARIATIONAL GAUSSIAN PROCESSES

We describe the variational Gaussian process (VGP), a Bayesian nonparametric variational model that adapts its structure to match complex posterior distributions. The VGP generates z by generating latent inputs, warping them with random non-linear mappings, and using the warped inputs


Page 17: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Outline
¤ Overview of VAE
¤ Related research on VAE ¤ semi-supervised learning ¤ methods for constructing flexible approximate distributions
¤ Proposed methods ¤ probabilistic ladder network ¤ warm-up ¤ batch normalization
¤ Experiments
¤ Summary
¤ Extras

Page 18: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Outline
¤ Overview of VAE
¤ Related research on VAE ¤ semi-supervised learning ¤ methods for constructing flexible approximate distributions
¤ Proposed methods ¤ probabilistic ladder network ¤ warm-up ¤ batch normalization
¤ Experiments
¤ Summary
¤ Extras

Page 19: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Proposed methods
¤ In previous work, models were limited to one or two layers of stochastic latent variables
¤ Simply making the hierarchy deeper does not work well with the usual optimization procedure
¤ This paper therefore proposes the following three techniques ¤ probabilistic ladder network ¤ warm-up ¤ batch normalization

Page 20: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Probabilistic ladder network
¤ A probabilistic ladder network is proposed, based on the Ladder network [Valpola, 14][Rasmus+ 15]
¤ The inference model takes the form of a ladder network
¤ The generative model is unchanged

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks


In summary our contributions are:

• A new parametrization of the VAE inspired by the Ladder network performing as well or better than the current best models.

• A novel warm-up period in training increasing both the generative performance across several different datasets and the number of active stochastic latent variables.

• We show that batch normalization is essential for training VAEs with several layers of stochastic latent variables.

2. Methods

Variational autoencoders simultaneously train a generative model $p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z)$ for data x using auxiliary latent variables z, and an inference model $q_\phi(z|x)$¹ by optimizing a variational lower bound to the likelihood $p_\theta(x) = \int p_\theta(x, z)\, dz$.

¹ The inference model is also known as the recognition model or the encoder, and the generative model as the decoder.

[Figure 2 schematic labels: a) probabilistic ladder network, b) VAE; deterministic bottom-up pathway, stochastic top-down pathway, "Likelihood", "Prior", "Posterior", "Copy"; bottom-up pathway in inference model, top-down pathway through KL-divergences in generative model, indirect top-down information through prior, direct flow of information.]

Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.

The generative model $p_\theta$ is specified as follows:

$p_\theta(x|z_1) = \mathcal{N}\big(x \mid \mu_\theta(z_1), \sigma^2_\theta(z_1)\big)$   (1)
or
$p_\theta(x|z_1) = \mathcal{B}\big(x \mid \mu_\theta(z_1)\big)$   (2)

for continuous-valued (Gaussian N) or binary-valued (Bernoulli B) data, respectively. The latent variables z are split into L layers $z_i$, i = 1 ... L:

$p_\theta(z_i|z_{i+1}) = \mathcal{N}\big(z_i \mid \mu_{\theta,i}(z_{i+1}), \sigma^2_{\theta,i}(z_{i+1})\big)$   (3)
$p_\theta(z_L) = \mathcal{N}(z_L \mid 0, I)$.   (4)

The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models.

Each layer in the inference model $q_\phi(z|x)$ is specified using a fully factorized Gaussian distribution:

$q_\phi(z_1|x) = \mathcal{N}\big(z_1 \mid \mu_{\phi,1}(x), \sigma^2_{\phi,1}(x)\big)$   (5)
$q_\phi(z_i|z_{i-1}) = \mathcal{N}\big(z_i \mid \mu_{\phi,i}(z_{i-1}), \sigma^2_{\phi,i}(z_{i-1})\big)$   (6)

for i = 2 ... L.

Functions µ(·) and σ²(·) in both the generative and the inference models are implemented as:

$d(y) = \mathrm{MLP}(y)$   (7)
$\mu(y) = \mathrm{Linear}(d(y))$   (8)
$\sigma^2(y) = \mathrm{Softplus}(\mathrm{Linear}(d(y)))$,   (9)

where MLP is a two layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies a log(1 + exp(·)) non-linearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP-part is shared between µ and σ² whereas the last Linear layer is not shared.
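As a concrete reading of Eqs. (7)-(9), the sketch below builds one such Gaussian output head with a shared two-layer MLP and separate Linear/Softplus branches. The sizes, random weights and the leaky rectifier are illustrative stand-ins, not the authors' Theano/Lasagne implementation.

```python
# Sketch of the shared-MLP parametrization in Eqs. (7)-(9): d(y) = MLP(y), mu = Linear(d),
# sigma^2 = Softplus(Linear(d)). Layer sizes and weights here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    W = rng.normal(scale=0.1, size=(dim_in, dim_out))
    b = np.zeros(dim_out)
    return lambda y: y @ W + b

def mlp(dim_in, dim_hidden):
    l1, l2 = linear(dim_in, dim_hidden), linear(dim_hidden, dim_hidden)
    leaky = lambda x: np.maximum(x, 0.1 * x)          # leaky rectifier used in the paper
    return lambda y: leaky(l2(leaky(l1(y))))

softplus = lambda x: np.log1p(np.exp(x))

def gaussian_head(dim_in, dim_hidden, dim_z):
    shared = mlp(dim_in, dim_hidden)                  # the MLP part shared by mu and sigma^2
    lin_mu, lin_var = linear(dim_hidden, dim_z), linear(dim_hidden, dim_z)
    def forward(y):
        d = shared(y)                                 # Eq. (7)
        return lin_mu(d), softplus(lin_var(d))        # Eqs. (8)-(9)
    return forward

head = gaussian_head(dim_in=784, dim_hidden=512, dim_z=64)
mu, var = head(rng.normal(size=(3, 784)))
print(mu.shape, var.shape, (var > 0).all())           # (3, 64) (3, 64) True
```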

Probabilistic ladder network / VAE (figure panel labels)
¤ Probabilistic ladder network: the bottom-up and top-down flows can be integrated directly
¤ VAE: no direct top-down flow enters the inference model

Page 21: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

The problem that no top-down information flows into the inference model
¤ Since the figure alone is hard to follow, here is the same point in terms of the equations
¤ The inference model samples layer by layer, so it only has a bottom-up flow
¤ The generative model only evaluates densities at each layer, so no direct top-down flow arises

For an L-layer model the training criterion is the expectation

$\mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log \frac{p(z_L)\, p(z_{L-1}|z_L)\cdots p(x|z_1)}{q(z_1|x)\, q(z_2|z_1)\cdots q(z_L|z_{L-1})}\right]$, with $z_1 \sim q(z_1|x),\ z_2 \sim q(z_2|z_1), \ldots$

so the samples are always drawn bottom-up through the q-terms, while the p-terms of the generative model only enter the objective as densities evaluated on those samples.

Page 22: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

How the model maps onto a neural network
¤ The inference model first makes a deterministic bottom-up pass
¤ "Deterministic" means that no random noise is used in this pass
¤ The generative model is a stochastic top-down flow

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Figure 3. MNIST test-set log-likelihood values for VAEs and the probabilistic ladder networks with different number of latent layers, batch normalization BN and warm-up WU.

The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion $\mathcal{L}$:

$\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = -\mathcal{L}(\theta, \phi; x)$   (10)

$= -KL\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$,   (11)

where KL is the Kullback-Leibler divergence.

A strictly tighter bound on the likelihood may be obtained at the expense of a K-fold increase of samples by using the importance weighted bound (Burda et al., 2015):

$\log p(x) \geq \mathbb{E}_{q_\phi(z^{(1)}|x)} \cdots \mathbb{E}_{q_\phi(z^{(K)}|x)}\!\left[\log \frac{1}{K}\sum_{k=1}^{K} \frac{p_\theta(x, z^{(k)})}{q_\phi(z^{(k)}|x)}\right] \geq -\mathcal{L}(\theta, \phi; x)$.   (12)

The inference and generative parameters, θ and φ, are jointly trained by optimizing Eq. (11) using stochastic gradient descent where we use the reparametrization trick for stochastic backpropagation through the Gaussian latent variables (Kingma & Welling, 2013; Rezende et al., 2014). All expectations are approximated using Monte Carlo sampling by drawing from the corresponding q distribution.
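The sketch below estimates the importance weighted bound of Eq. (12) for a linear-Gaussian toy model where log p(x) is known, showing how the bound tightens as K grows; the inference model q is deliberately crude and everything here is illustrative.

```python
# Monte Carlo sketch of the importance weighted bound in Eq. (12) for a toy model
# with a known marginal likelihood; K > 1 gives a tighter bound than the ELBO.
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ N(0,1), x | z ~ N(z, 1)  =>  p(x) = N(x; 0, 2).
x = 1.5
log_p_xz = lambda z: (-0.5 * z**2 - 0.5 * np.log(2 * np.pi)
                      - 0.5 * (x - z)**2 - 0.5 * np.log(2 * np.pi))

mu_q, sig_q = 0.5, 1.2                                 # a crude inference model q(z|x)
log_q = lambda z: -0.5 * ((z - mu_q) / sig_q)**2 - np.log(sig_q) - 0.5 * np.log(2 * np.pi)

def iw_bound(K, n_batches=20000):
    z = rng.normal(mu_q, sig_q, size=(n_batches, K))   # K importance samples per estimate
    log_w = log_p_xz(z) - log_q(z)
    m = log_w.max(axis=1, keepdims=True)               # stable log-mean-exp over the K samples
    return (m.squeeze() + np.log(np.exp(log_w - m).mean(axis=1))).mean()

for K in [1, 5, 50]:
    print(f"K={K:3d}  bound ~ {iw_bound(K):.4f}")
print("log p(x) =", -0.5 * x**2 / 2 - 0.5 * np.log(2 * np.pi * 2))
```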

Previous work has been restricted to shallow models (L <= 2). We propose two improvements to VAE training and a new variational model structure allowing us to train deep models with up to at least L = 5 stochastic layers.

2.1. Probabilistic Ladder Network

We propose a new inference model where we use the structure of the Ladder network (Valpola, 2014; Rasmus et al., 2015) as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before.

[Figure 4 axes: bound L(x) versus training epoch (200-2000); curves: VAE, VAE+BN, VAE+BN+WU, Prob. Ladder+BN+WU.]
Figure 4. MNIST train (full lines) and test (dashed lines) performance during training. The test set performance was estimated using 5000 importance weighted samples providing a tighter bound than the training bound explaining the better performance here.

The inference is constructed to first make a deterministic upward pass:

$d_1 = \mathrm{MLP}(x)$   (13)
$\mu_{d,i} = \mathrm{Linear}(d_i), \quad i = 1 \ldots L$   (14)
$\sigma^2_{d,i} = \mathrm{Softplus}(\mathrm{Linear}(d_i)), \quad i = 1 \ldots L$   (15)
$d_i = \mathrm{MLP}(\mu_{d,i-1}), \quad i = 2 \ldots L$   (16)

followed by a stochastic downward pass:

$q_\phi(z_L|x) = \mathcal{N}\big(\mu_{d,L}, \sigma^2_{d,L}\big)$   (17)
$t_i = \mathrm{MLP}(z_{i+1}), \quad i = 1 \ldots L-1$   (18)
$\mu_{t,i} = \mathrm{Linear}(t_i)$   (19)
$\sigma^2_{t,i} = \mathrm{Softplus}(\mathrm{Linear}(t_i))$   (20)
$q_\phi(z_i|z_{i+1}, x) = \mathcal{N}\!\left(\frac{\mu_{t,i}\,\sigma^{-2}_{t,i} + \mu_{d,i}\,\sigma^{-2}_{d,i}}{\sigma^{-2}_{t,i} + \sigma^{-2}_{d,i}},\ \frac{1}{\sigma^{-2}_{t,i} + \sigma^{-2}_{d,i}}\right)$.   (21)

Here $\mu_d$ and $\sigma^2_d$ carry the bottom-up information and $\mu_t$ and $\sigma^2_t$ carry top-down information. This parametrization has a probabilistic motivation by viewing $\mu_d$ and $\sigma^2_d$ as the Gaussian likelihood that is combined with a Gaussian prior $\mu_t$ and $\sigma^2_t$ from the top-down connections, together forming the approximate posterior distribution $q_\phi(z_i|z_{i+1}, x)$, see Figure 2 a).
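Eq. (21) is just the precision-weighted combination of two Gaussians, which the following sketch makes explicit (the numbers are made up; in the model the four quantities come from the upward and downward passes).

```python
# Sketch of the top-down/bottom-up combination in Eq. (21): the bottom-up "likelihood"
# (mu_d, sigma2_d) and the top-down "prior" (mu_t, sigma2_t) are merged with precision weighting.
import numpy as np

def combine(mu_d, sigma2_d, mu_t, sigma2_t):
    prec_d, prec_t = 1.0 / sigma2_d, 1.0 / sigma2_t
    sigma2_q = 1.0 / (prec_d + prec_t)                       # posterior variance
    mu_q = (mu_d * prec_d + mu_t * prec_t) * sigma2_q        # precision-weighted mean
    return mu_q, sigma2_q

mu_d, sigma2_d = np.array([0.0, 2.0]), np.array([1.0, 0.5])  # bottom-up estimate
mu_t, sigma2_t = np.array([1.0, 0.0]), np.array([4.0, 2.0])  # top-down estimate
mu_q, sigma2_q = combine(mu_d, sigma2_d, mu_t, sigma2_t)
print(mu_q, sigma2_q)        # the mean lies closer to the lower-variance source

# Sampling z_i ~ q(z_i | z_{i+1}, x) then proceeds with the reparametrization trick:
eps = np.random.default_rng(0).normal(size=mu_q.shape)
z_i = mu_q + np.sqrt(sigma2_q) * eps
```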

A line of motivation, already noted by Dayan et al. (1995), is that a purely bottom-up inference process as in i.e. VAEs

(Slide annotation on the upward pass: no random noise is used here.)


(Slide annotations on Eq. (21): this is the approximate posterior distribution; the combination is the standard posterior for Gaussian distributions, see e.g. PRML.)

Page 23: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Outline
¤ Overview of VAE
¤ Related research on VAE ¤ semi-supervised learning ¤ methods for constructing flexible approximate distributions
¤ Proposed methods ¤ probabilistic ladder network ¤ warm-up ¤ batch normalization
¤ Experiments
¤ Summary
¤ Extras

Page 24: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

The problem of units being "pruned away" during training
¤ Besides the reconstruction term, the objective contains a variational regularization term
¤ Because of this regularization, some latent units can become inactive [MacKay 01]: the approximate posterior of a unit is pulled towards its own prior (which can also be seen as automatic relevance determination)
¤ If this happens early in training, units are pruned away before they have learned a useful representation; this is indeed observed in the experiments
¤ The optimization then ends up stuck in a local minimum or saddle point

$\mathcal{L}(\theta, \phi; x) = -KL\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

does not correspond well with real perception, where iterative interaction between bottom-up and top-down signals produces the final activity of a unit², see Figure 2. Notably it is difficult for the purely bottom-up inference networks to model the explaining away phenomenon, see van den Broeke (2016, Chapter 5) for a recent discussion on this phenomenon. The probabilistic ladder network provides a framework with the wanted interaction, while keeping complications manageable. A further extension could be to make the inference in k steps over an iterative inference procedure (Raiko et al., 2014).

2.2. Warm-up from deterministic to variational autoencoder

The variational training criterion in Eq. (11) contains the reconstruction term $p_\theta(x|z)$ and the variational regularization term. The variational regularization term causes some of the latent units to become inactive during training (MacKay, 2001) because the approximate posterior for unit k, $q(z_{i,k}|\ldots)$, is regularized towards its own prior $p(z_{i,k}|\ldots)$, a phenomenon also recognized in the VAE setting (Burda et al., 2015). This can be seen as a virtue of automatic relevance determination, but also as a problem when many units are pruned away early in training before they learned a useful representation. We observed that such units remain inactive for the rest of the training, presumably trapped in a local minima or saddle point at $KL(q_{i,k}|p_{i,k}) \approx 0$, with the optimization algorithm unable to re-activate them.

We propose to alleviate the problem by initializing training using the reconstruction error only (corresponding to training a standard deterministic auto-encoder), and then gradually introducing the variational regularization term:

$-\mathcal{L}(\theta, \phi; x)_T = -\beta\, KL\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$,   (22)

where β is increased linearly from 0 to 1 during the first $N_t$ epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective goes from having a delta-function solution (corresponding to zero temperature) and then moves towards the fully stochastic variational objective. A similar idea has previously been considered in Raiko et al. (2007, Section 6.2), however here used for Bayesian models trained with a coordinate descent algorithm.
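One simple way to implement the warm-up of Eq. (22) is a linear ramp of the KL weight, as sketched below; the schedule details (e.g. whether the ramp starts at epoch 0 or 1) are an assumption, only the linear increase over the first N_t epochs is stated in the text.

```python
# Sketch of the warm-up schedule around Eq. (22): beta rises linearly from 0 to 1 over the
# first N_t epochs and then stays at 1, scaling only the KL term of the objective.
def warmup_beta(epoch, n_t):
    return min(1.0, epoch / float(n_t))

def objective(recon_term, kl_term, epoch, n_t):
    # -L_T = E_q[log p(x|z)] - beta * KL(q(z|x) || p(z))
    return recon_term - warmup_beta(epoch, n_t) * kl_term

# Example with the paper's MNIST setting N_t = 200 and made-up term values:
recon, kl = -85.0, 10.0
for epoch in [0, 50, 100, 200, 500]:
    print(epoch, warmup_beta(epoch, 200), objective(recon, kl, epoch, 200))
```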

2.3. Batch Normalization

Batch normalization (Ioffe & Szegedy, 2015) is a recent innovation that improves convergence speed and stabilizes training in deep neural networks by normalizing the outputs from each layer. We show that batch normalization (abbreviated BN in tables and graphs), applied to all layers except the output layers, is essential for learning deep hierarchies of latent variables for L > 2.

² The idea was dismissed at the time, since it could introduce substantial theoretical complications.

Table 1. Fine-tuned test log-likelihood values for 5-layered VAE and probabilistic ladder networks trained on MNIST. Ann. LR: annealed learning rate, MC: Monte Carlo samples to approximate E_q(·)[·], IW: importance weighted samples.

Finetuning       None      Ann. LR    Ann. LR    Ann. LR    Ann. LR
                           MC=1       MC=10      MC=1       MC=10
                           IW=1       IW=1       IW=10      IW=10
VAE             -82.14    -81.97     -81.84     -81.41     -81.30
Prob. Ladder    -81.87    -81.54     -81.46     -81.35     -81.20
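For reference, a minimal batch-normalization step as used here: each layer's outputs are normalized over the mini-batch and then rescaled. The learned scale/shift values and the use of running statistics at test time are standard details omitted from this sketch.

```python
# Minimal sketch of batch normalization: normalize each layer's outputs over the
# mini-batch, then apply a scale and shift (gamma, beta are illustrative).
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    mean = h.mean(axis=0, keepdims=True)               # per-feature mini-batch mean
    var = h.var(axis=0, keepdims=True)                 # per-feature mini-batch variance
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(256, 128))    # a mini-batch of layer outputs
out = batch_norm(h, gamma=np.ones(128), beta=np.zeros(128))
print(out.mean(), out.std())                            # ~0 and ~1 after normalization
```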

3. Experiments

To test our models we use the standard benchmark datasets MNIST, OMNIGLOT (Lake et al., 2013) and NORB (LeCun et al., 2004). The largest models trained used a hierarchy of five layers of stochastic latent variables of sizes 64, 32, 16, 8 and 4, going from bottom to top. We implemented all mappings using two-layered MLP's. In all models the MLP's between x and $z_1$ or $d_1$ were of size 512. Subsequent layers were connected by MLP's of sizes 256, 128, 64 and 32 for all connections in both the VAE and probabilistic ladder network. Shallower models were created by removing latent variables from the top of the hierarchy. We sometimes refer to the five layer models as 64-32-16-8-4, the four layer models as 64-32-16-8 and so forth. The models were trained end-to-end using the Adam (Kingma & Ba, 2014) optimizer with a mini-batch size of 256. The reported test log-likelihoods were approximated using Eq. (12) with 5000 importance weighted samples as in Burda et al. (2015). The models were implemented in the Theano (Bastien et al., 2012), Lasagne (Dieleman et al., 2015) and Parmesan³ frameworks.

For MNIST we used a sigmoid output layer to predict the mean of a Bernoulli observation model and leaky rectifiers (max(x, 0.1x)) as nonlinearities in the MLP's. The models were trained for 2000 epochs with a learning rate of 0.001 on the complete training set. Models using warm-up used $N_t = 200$. Similarly to Burda et al. (2015) we resample the binarized training values from the real-valued images using a Bernoulli distribution after each epoch which prevents the models from over-fitting. Some of the models were fine-tuned by continuing training for 2000 epochs while multiplying the learning rate with 0.75 after every 200 epochs and increase the number of Monte Carlo and

³ github.com/casperkaae/parmesan


Page 25: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Warm-up and batch normalization
¤ Training therefore starts with the reconstruction term only
¤ This is equivalent to a deterministic autoencoder
¤ The weight of the regularization term is then increased gradually
→ Warm-up
¤ In addition, batch normalization [Ioffe+ 15] is applied ¤ it speeds up and stabilizes convergence ¤ it is essential when there are more than two layers of latent variables


Page 26: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Experimental setup
¤ Three datasets are used
¤ MNIST (Bernoulli observation model) ¤ OMNIGLOT [Lake+ 13] (Bernoulli) ¤ NORB [LeCun+ 04] (Gaussian)
¤ The latent variables form five layers of sizes 64-32-16-8-4
¤ Each mapping is a two-layer MLP ¤ the MLP below the first latent layer has 512 hidden units, the subsequent ones 256, 128, 64 and 32
¤ tanh is used as the nonlinearity for the Gaussian (NORB) models, leaky rectifiers otherwise
¤ Test likelihoods are estimated with k=5000 importance weighted samples
¤ See the paper for the remaining settings (a rough sketch of these sizes follows below)
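A rough outline of the sizes listed above (not the authors' code):

```python
# Sketch of the architecture sizes on this slide: five stochastic layers 64-32-16-8-4,
# two-layer MLPs of hidden sizes 512, 256, 128, 64, 32 between them, and leaky rectifiers.
latent_sizes = [64, 32, 16, 8, 4]          # bottom-to-top stochastic layers
mlp_sizes    = [512, 256, 128, 64, 32]     # hidden size of the MLP below each latent layer

def leaky_rectifier(x):
    return max(x, 0.1 * x)                 # max(x, 0.1x), used for the MNIST/OMNIGLOT models

for depth in range(1, len(latent_sizes) + 1):
    name = "-".join(str(s) for s in latent_sizes[:depth])
    print(f"{depth}-layer model: latents {name}, MLP hidden sizes {mlp_sizes[:depth]}")
```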

Page 27: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: MNIST test log-likelihood
¤ The plain VAE does not improve beyond two stochastic layers ¤ batch normalization & warm-up improve the likelihood ¤ the probabilistic ladder network gives a further slight improvement


Page 28: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: effect of fine-tuning on the test log-likelihood
¤ Comparison of the test log-likelihood for different numbers of Monte Carlo (MC) and importance weighted (IW) samples
¤ Increasing the MC and IW samples improves the test log-likelihood ¤ for reference, reported test log-likelihoods on permutation invariant MNIST are
¤ -82.90 [Burda+ 15] ¤ -81.90 [Tran+ 15] ¤ though these numbers are not strictly comparable


Page 29: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: number of active latent units
¤ A unit is counted as active if its KL divergence exceeds 0.01 (see the sketch below)
¤ In the plain VAE, the number of active units collapses as more latent layers are added ¤ batch normalization (BN) increases it ¤ warm-up (WU) by itself has a less clear effect here ¤ the probabilistic ladder network improves it slightly further
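A sketch of the active-unit count used in Table 2: a unit is active if its KL term exceeds 0.01. Here both the posterior and the prior are taken as diagonal Gaussians and the values are synthetic; in the paper the prior is the conditional prior p(z_i|z_{i+1}).

```python
# Sketch of counting "active" latent units: unit k in layer i is active if
# KL(q_{i,k} || p_{i,k}) > 0.01, here for diagonal Gaussians averaged over a batch.
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) per unit, averaged over the batch
    kl = 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)
    return kl.mean(axis=0)

rng = np.random.default_rng(0)
batch, n_units = 256, 64
mu_q = rng.normal(scale=[1.0] * 8 + [0.01] * 56, size=(batch, n_units))  # most units collapsed
var_q = np.full((batch, n_units), 0.9)
mu_p, var_p = np.zeros(n_units), np.ones(n_units)

kl_per_unit = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
print("active units:", int((kl_per_unit > 0.01).sum()), "of", n_units)
```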

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Table 2. Number of active latent units in five layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if KL(q_{i,k}||p_{i,k}) > 0.01.

           VAE    VAE+BN   VAE+BN+WU   Prob. Ladder+BN+WU
Layer 1     20      20        34              46
Layer 2      1       9        18              22
Layer 3      0       3         6               8
Layer 4      0       3         2               3
Layer 5      0       2         1               2
Total       21      37        61              81

importance weighted samples to 10 to reduce the variance in the approximation of the expectations in Eq. (10) and improve the inference model, respectively.

Models trained on the OMNIGLOT dataset⁴, consisting of 28x28 binary images, were trained similarly to above except that the number of training epochs was 1500.

Models trained on the NORB dataset⁵, consisting of 32x32 gray-scale images with color-coding rescaled to [0, 1], used a Gaussian observation model with mean and variance predicted using a linear and a softplus output layer respectively. The settings were similar to the models above except that: hyperbolic tangent was used as nonlinearity in the MLP's, the learning rate was 0.002, $N_t = 1000$ and the number of training epochs was 4000.

4. Results & Discussion

To assess the performance of the VAE and the probabilistic ladder networks with different numbers of stochastic latent variable layers and with or without batch normalization and warm-up we use the MNIST, OMNIGLOT and NORB datasets. Firstly we study the generative model log-likelihood for the different datasets and show that models using our contributions perform better than the vanilla VAE models⁶. Secondly we study the latent space representation and how batch normalization, warm-up and the probabilistic ladder network affect these compared to the vanilla VAE.

Generative log-likelihood performance

In Figure 3 we show the test set log-likelihood for a series of different models with varying number of stochas-

⁴ The OMNIGLOT data was partitioned and preprocessed as in Burda et al. (2015), https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT
⁵ The NORB dataset was downloaded in resized format from github.com/gwtaylor/convnet_matlab
⁶ We use vanilla to refer to VAE models without batch normalization and warm-up.

Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows the dataset and the number of latent variables in each model.

                    VAE       VAE+BN    VAE+BN+WU   Prob. Ladder+BN+WU
OMNIGLOT
64                -114.45    -108.79    -104.63          -
64-32             -112.60    -106.86    -102.03       -102.12
64-32-16          -112.13    -107.09    -101.60       -101.26
64-32-16-8        -112.49    -107.66    -101.68       -101.27
64-32-16-8-4      -112.10    -107.94    -101.86       -101.59
NORB
64                 2630.8     3263.7     3481.5           -
64-32              2830.8     3140.1     3532.9        3522.7
64-32-16           2757.5     3247.3     3346.7        3458.7
64-32-16-8         2832.0     3302.3     3393.6        3499.4
64-32-16-8-4       3064.1     3258.7     3393.6        3430.3

tic layers. The performance of the vanilla VAE model did not improve with more than two layers of stochastic latent variables. Contrary to this, models trained with batch normalization and warm-up consistently increase the model performance for additional layers of stochastic latent variables. As expected the improvement in performance is decreasing for each additional layer, but we emphasize that the improvements are consistent even for the addition of the top-most layers. The best performance is achieved using the probabilistic ladder network slightly outperforming the best VAE models trained with batch normalization and temperature. We emphasize that these VAE models already have a very strong performance showing that the probabilistic ladder network has competitive generative performance.

The models in Figure 3 were trained using fixed learning rate and one Monte Carlo (MC) and one importance weighted (IW) sample. To improve performance we fine-tuned the best performing five layer models by training these for a further 2000 epochs with annealed learning rate and increasing the number of MC and IW samples. In Table 1 we see that by annealing the learning rate and drawing 10 MC and IW samples we can improve the log-likelihood of the VAE and probabilistic ladder network to -81.30 and -81.20 respectively.

Comparing the results obtained here with current state-of-the-art results on permutation invariant MNIST, Burda et al. (2015) report a test-set performance of -82.90 using 50 IW samples and a two layer model with 100-50 latent units. Using a similar model Tran et al. (2015) achieves -81.90 by adding a variational Gaussian Process prior for each layer of latent variables, however here we note that our results are not directly comparable to these due to differences

Page 30: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: OMNIGLOT and NORB
¤ BN and WU improve the likelihood ¤ on NORB the ladder does not help
¤ Also, on NORB adding more layers does not improve the likelihood
¤ For the Gaussian (NORB) case, lowering the learning rate and using tanh makes training work

How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Table 2. Number of active latent units in five layer VAE and prob-abilistic ladder networks trained on MNIST. A unit was definedas active if KL(qi,k||pi,k) > 0.01

VAE VAE+BN

VAE+BN+WU

PROB. LADDER+BN+WU

LAYER 1 20 20 34 46

LAYER 2 1 9 18 22

LAYER 3 0 3 6 8

LAYER 4 0 3 2 3

LAYER 5 0 2 1 2

TOTAL 21 37 61 81

importance weighted samples to 10 to reduce the variancein the approximation of the expectations in Eq. (10) andimprove the inference model, respectively.

Models trained on the OMNIGLOT dataset4, consisting of28x28 binary images images were trained similar to aboveexcept that the number of training epochs was 1500.

Models trained on the NORB dataset5, consisting of 32x32grays-scale images with color-coding rescaled to [0, 1],used a Gaussian observation model with mean and vari-ance predicted using a linear and a softplus output layerrespectively. The settings were similar to the models aboveexcept that: hyperbolic tangent was used as nonlinearitiesin the MLP’s, the learning rate was 0.002, N

t

= 1000 andthe number of training epochs were 4000.

4. Results & DiscussionTo asses the performance of the VAE and the probabilis-tic ladder networks with different numbers of stochastic la-tent variable layers and with or without batch normaliza-tion and warm-up we use the MNIST, OMNIGLOT andNORB datasets. Firstly we study the generative model log-likelihood for the different datasets and show that modelsusing our contributions perform better than the vanilla VAEmodels6. Secondly we study the latent space representa-tion and how batch normalization, warm-up and the proba-bilistic ladder network affect these compared to the vanillaVAE.

Generative log-likelihood performance

In Figure 3 we show the test set log-likelihood for a se-ries of different models with varying number of stochas-

[4] The OMNIGLOT data was partitioned and preprocessed as in Burda et al. (2015), https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT

[5] The NORB dataset was downloaded in resized format from github.com/gwtaylor/convnet matlab

[6] We use vanilla to refer to VAE models without batch normalization and warm-up.

Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows the dataset and the number of latent variables in each model.

                     VAE     VAE+BN   VAE+BN+WU   PROB. LADDER+BN+WU
OMNIGLOT
64               -114.45    -108.79     -104.63                    -
64-32            -112.60    -106.86     -102.03              -102.12
64-32-16         -112.13    -107.09     -101.60              -101.26
64-32-16-8       -112.49    -107.66     -101.68              -101.27
64-32-16-8-4     -112.10    -107.94     -101.86              -101.59
NORB
64                2630.8     3263.7      3481.5                    -
64-32             2830.8     3140.1      3532.9               3522.7
64-32-16          2757.5     3247.3      3346.7               3458.7
64-32-16-8        2832.0     3302.3      3393.6               3499.4
64-32-16-8-4      3064.1     3258.7      3393.6               3430.3


Page 31: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: behavior of the latent variables
¤ With a deep hierarchy of stochastic layers, the VAE can learn rich latent representations

¤ However, the representation can also become too sparse, with many latent units left unused

¤ For each latent unit, look at the KL divergence between q(z_{i,k} | z_{i-1,k}) and p(z_{i,k} | z_{i+1})
¤ A KL divergence of zero means that the unit carries no information about the data, i.e. it has collapsed onto the prior

¤ BN, WU and the ladder parametrization clearly alleviate this collapse (the warm-up schedule is sketched below)
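For reference, warm-up (WU) as used here scales the KL term of the variational bound by a coefficient that grows linearly from 0 to 1 over the first N_t epochs, so the reconstruction term dominates early training and units are less likely to collapse. A minimal sketch of such a schedule, with n_warmup_epochs standing in for N_t:

def kl_weight(epoch, n_warmup_epochs=1000):
    """Linear warm-up: the KL weight rises from 0 to 1 over the first
    n_warmup_epochs epochs (N_t, e.g. 1000 for NORB above) and stays at 1 afterwards."""
    return min(1.0, epoch / float(n_warmup_epochs))

# Schematically, the annealed objective for one example is then
#   loss = -E_q[log p(x|z)] + kl_weight(epoch) * KL(q(z|x) || p(z))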

To study this effect we calculated the KL-divergence between q(z_{i,k} | z_{i-1,k}) and p(z_i | z_{i+1}) for each stochastic latent variable k during training, as seen in the corresponding figure. This term is zero if the inference model is independent of the data, i.e. q(z_{i,k} | z_{i-1,k}) = q(z_{i,k}), and hence collapsed onto the prior.

Page 32: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: visualizing the latent representations
¤ PCA plots of the latent variables z_i ~ q(z_i | z_{i-1})
¤ Without BN, the layers above the second collapse onto the standard normal prior
¤ With the ladder parametrization (and BN/WU), the representations become clustered by class, which suggests they would be useful for semi-supervised learning (see the PCA sketch below)
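A qualitative check like these PCA plots can be reproduced with scikit-learn by projecting one layer's latent codes to two dimensions and coloring the points by digit class. z_layer and labels below are placeholders for the latent samples of a given layer and the corresponding MNIST labels; this is an illustrative sketch, not the authors' plotting code.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_pca(z_layer, labels):
    """Scatter the 2-D PCA projection of one latent layer, colored by class."""
    z_2d = PCA(n_components=2).fit_transform(z_layer)
    plt.scatter(z_2d[:, 0], z_2d[:, 1], c=labels, cmap='tab10', s=5)
    plt.colorbar(label='class')
    plt.title('PCA of latent samples')
    plt.show()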

Page 33: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: flexibility of the latent distributions
¤ Compare the distribution of each layer, conditioned on the layers above, with the restrictive mean-field approximation
¤ Compute the KL divergence between q(z_i | z_{i-1}) and a standard normal distribution

¤ In the vanilla VAE, the lower layers have clearly non-Gaussian conditional distributions, but above the second layer the distributions are just the standard normal prior

¤ The probabilistic ladder network has more active intermediate layers than the VAE with BN and WU
¤ This is probably because the deterministic upward pass eases the flow of information to the intermediate layers


from the input data to all latent distributions in the inference model.

To qualitatively study the latent representations, PCA plots of z_i ~ q(z_i | z_{i-1}) are seen in Figure 7. It is clear that the VAE model without batch normalization is completely collapsed on a standard normal prior above the second layer. The latent representations show progressively more clustering according to class, which is best seen in the topmost layer of the probabilistic ladder network. These findings indicate that hierarchical latent variable models produce structured high level latent representations that are likely useful for semi-supervised learning.

The hierarchical latent variable models used here allow highly flexible distributions of the lower layers conditioned on the layers above. We measure the divergence between these conditional distributions and the restrictive mean field approximation by calculating the KL-divergence between q(z_i | z_{i-1}) and a standard normal distribution for several models trained on MNIST, see Figure 6 a). As expected, the lower layers have highly non (standard) Gaussian distributions when conditioned on the layers above. Interestingly, the probabilistic ladder network seems to have more active intermediate layers than the VAE with batch normalization and warm-up. Again this might be explained by the deterministic upward pass easing the flow of information to the intermediate and upper layers. We further note that the KL-divergence is approximately zero in the vanilla VAE model above the second layer, confirming the inactivity of these layers. Figure 6 b) shows generative samples from the probabilistic ladder network created by injecting z* ~ N(0, I) at different layers in the model. Sampling from the top-most layer, having a standard normal prior, produces nice looking samples, whereas the samples get progressively worse when we inject z* in the lower layers, supporting the finding above.
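To make the "inject z* at layer l" experiment concrete, the sketch below performs ancestral sampling down a stack of conditional Gaussian layers, starting from a standard normal sample at the chosen layer. decode_layers, observation_model and latent_dims are hypothetical stand-ins for the trained generative network, not the authors' API.

import numpy as np

def sample_with_injection(decode_layers, observation_model, latent_dims,
                          inject_at, n_samples=16):
    """Ancestral sampling that starts from z* ~ N(0, I) injected at layer inject_at.

    decode_layers[i](z) is assumed to return (mu, sigma) of p(z_i | z_{i+1});
    observation_model(z1) returns the mean of p(x | z_1); latent_dims maps a
    layer index to its dimensionality. Layers are numbered 1 (bottom) to L (top).
    """
    z = np.random.randn(n_samples, latent_dims[inject_at])  # z* ~ N(0, I)
    for i in range(inject_at - 1, 0, -1):                    # walk down the hierarchy
        mu, sigma = decode_layers[i](z)
        z = mu + sigma * np.random.randn(*mu.shape)          # sample z_i given z_{i+1}
    return observation_model(z)                              # e.g. Bernoulli means of p(x | z_1)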

5. Conclusion

We presented three methods for optimization of VAEs using deep hierarchies of latent stochastic variables. Using these models we demonstrated state-of-the-art generative performance for the MNIST and OMNIGLOT datasets and showed that a new parametrization, the probabilistic ladder network, performed as well as or better than current models. Further, we demonstrated that VAEs with up to five-layer hierarchies of active stochastic units can be trained, and that especially the new probabilistic ladder network was able to utilize many stochastic units. Future work includes extending these models to semi-supervised learning and studying more elaborate iterative inference schemes in the probabilistic ladder network.

Acknowledgments

This research was supported by the Novo Nordisk Foundation, Danish Innovation Foundation and the NVIDIA Corporation with the donation of TITAN X and Tesla K40 GPUs.

References

Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Dayan, Peter, Hinton, Geoffrey E, Neal, Radford M, and Zemel, Richard S. The Helmholtz machine. Neural computation, 7(5):889–904, 1995.

Dieleman, Sander, Schlüter, Jan, Raffel, Colin, Olson, Eben, Sønderby, Søren Kaae, Nouri, Daniel, van den Oord, Aaron, and Battenberg, Eric. Lasagne: First release, August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

Lake, Brenden M, Salakhutdinov, Ruslan R, and Tenenbaum, Josh. One-shot learning by inverting a compositional causal process. In Advances in neural information processing systems, pp. 2526–2534, 2013.

LeCun, Yann, Huang, Fu Jie, and Bottou, Leon. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II–97. IEEE, 2004.

Page 34: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Results: samples generated when injecting noise at different layers
¤ Inject standard normal noise z* into the latent variables of each layer and compare the generated images

¤ Experiment with the probabilistic ladder network

¤ The higher the layer at which the noise is injected, the nicer the generated samples

Page 35: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Outline
¤ Review of the VAE
¤ Recent developments of the VAE
  ¤ Semi-supervised learning
  ¤ Methods for inferring flexible distributions
¤ Proposed methods
  ¤ Probabilistic ladder network
  ¤ Warm-up
  ¤ Batch normalization
¤ Experiments
¤ Summary
¤ Extras

Page 36: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Summary
¤ Proposed three techniques for optimizing VAEs with deep hierarchies of latent variables
  ¤ Probabilistic ladder network
  ¤ Warm-up
  ¤ Batch normalization
¤ The experiments confirmed the following:
  ¤ State-of-the-art log-likelihoods on MNIST and OMNIGLOT
  ¤ The probabilistic ladder network performed as well as or better than previous models
  ¤ The probabilistic ladder network can make use of many more latent units

Page 37: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Outline
¤ Review of the VAE
¤ Recent developments of the VAE
  ¤ Semi-supervised learning
  ¤ Methods for inferring flexible distributions
¤ Proposed methods
  ¤ Probabilistic ladder network
  ¤ Warm-up
  ¤ Batch normalization
¤ Experiments
¤ Summary
¤ Extras

Page 38: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Extras: a VAE library
¤ Motivation

¤ As we have seen, many variants of the VAE have been proposed
¤ However, actually implementing them is painful for the following reasons:

¤ Papers rarely spell out the probability distributions in detail (while the neural network part is …)
¤ The models are deep and increasingly complicated

→ We would like a framework that lets us specify the graphical model freely, without worrying about the neural network details

¤ However, there is currently no VAE implementation that fits this purpose
¤ This paper uses Parmesan, a Lasagne-based library

¤ But it is hard to understand and hard to extend
¤ So I tried writing my own small library

Page 39: (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Usage of the VAE library
¤ Example: implementing a basic VAE

¤ It is intended to cover most of the existing VAE variants (Ladder and RNN-based models are not supported yet)

¤ It may be made public via the lab's DL group before long

# (Dense, Gaussian, Bernoulli and VAE_z_x are classes from the presenter's own
#  library; rng, rseed, train_x, test_x and sample_z are assumed to be defined.)
import theano.tensor as T

# recognition model (encoder)
q_h = [Dense(rng, 28*28, 200, activation=T.nnet.relu),
       Dense(rng, 200, 200, activation=T.nnet.relu)]
q_mean = [Dense(rng, 200, 50, activation=None)]
q_sigma = [Dense(rng, 200, 50, activation=T.nnet.softplus)]
q = [Gaussian(q_h, q_mean, q_sigma)]

# generative model (decoder)
p_h = [Dense(rng, 50, 200, activation=T.nnet.relu),
       Dense(rng, 200, 200, activation=T.nnet.relu)]
p_mean = [Dense(rng, 200, 28*28, activation=T.nnet.sigmoid)]
p = [Bernoulli(p_h, p_mean)]

# build the VAE model
model = VAE_z_x(q, p, k=1, alpha=1, random=rseed)

# train the model
model.train(train_x)

# estimate the test set log-likelihood (1000 importance weighted samples)
log_likelihood_test = model.log_likelihood_test(test_x, k=1000, mode='iw')

# sample x from the latent variables
sample_x = model.p_sample_mean_x(sample_z)

The objective and the hierarchy of distributions can also be changed freely

Rényi divergence bounds etc. are also supported

CNNs etc. can also be added freely (a move to a Lasagne base is planned)