
Learning to Generate Chairs, Tables and Cars with Convolutional Networks

Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox

Abstract—We train generative ’up-convolutional’ neural networks which are able to generate images of objects given object style, viewpoint, and color. We train the networks on rendered 3D models of chairs, tables, and cars. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of different models, interpolate between given views to generate the missing ones, extrapolate views, and invent new objects not present in the training set by recombining training instances, or even two different object classes. Moreover, we show that such generative networks can be used to find correspondences between different objects from the dataset, outperforming existing approaches on this task.

Index Terms—Convolutional networks, generative models, image generation, up-convolutional networks


1 INTRODUCTION

Generative modeling of natural images is a long-standing and difficult task. The problem naturally falls into two components: learning the distribution from which the images are to be generated, and learning the generator which produces an image conditioned on a vector from this distribution. In this paper we approach the second subproblem. We suppose we are given high-level descriptions of a set of images, and only train the generator. We propose to use an ’up-convolutional’ generative network for this task and show that it is capable of generating realistic images.

In recent years, convolutional neural networks (CNNs, ConvNets) have become a method of choice in many areas of computer vision, especially recognition [1]. Recognition can be posed as a supervised learning problem and ConvNets are known to perform well given a large enough labeled dataset. In this work, we stick with supervised training, but we turn the standard classification CNN upside down and use it to generate images given high-level information. This way, instead of learning a mapping from raw sensor inputs to a condensed, abstract representation, such as object identity or position, we generate images from their high-level descriptions.

Given a set of 3D models (of chairs, tables, or cars), we train a neural network capable of generating 2D projections of the models given the model number (defining the style), viewpoint, and, optionally, additional transformation parameters such as color, brightness, saturation, zoom, etc. Our generative networks accept these high-level values as input and produce RGB images. We train them with standard backpropagation to minimize the Euclidean reconstruction error of the generated image.

A large enough neural network can learn to perform perfectly on the training set. That is, a network potentially could just learn all examples by heart and generate these perfectly, but would fail to produce reasonable results when confronted with inputs it has not seen during training. We show that the networks we train do generalize to previously unseen data in various ways. Namely, we show that these networks are capable of: 1) knowledge transfer within an object class: given a limited number of views of a chair, the network can use the knowledge learned from other chairs to infer the remaining viewpoints; 2) knowledge transfer between classes (chairs and tables): the network can transfer the knowledge about views from tables to chairs; 3) feature arithmetics: addition and subtraction of feature vectors leads to interpretable results in the image space; 4) interpolation between different objects within a class and between classes; 5) randomly generating new object styles.

All authors are with the Computer Science Department at the University of Freiburg. E-mail: {dosovits, springj, tatarchm, brox}@cs.uni-freiburg.de

After a review of related work in Section 2, we describe the network architecture and training process in Section 3. In Section 4 we compare different network architectures and dataset sizes, then in Section 5 we test the generalization abilities of the networks and apply them to a practical task of finding correspondences between different objects. Finally, in Section 6 we analyze the internal representation of the networks.

2 RELATED WORK

Work on generative models of images typically addresses the problem of unsupervised learning of a data model which can generate samples from a latent representation. Prominent examples from this line of work are restricted Boltzmann machines (RBMs) [2] and Deep Boltzmann Machines (DBMs) [3], as well as the plethora of models derived from them [4, 5, 6, 7, 8]. RBMs and DBMs are undirected graphical models which aim to build a probabilistic model of the data and treat encoding and generation as an (intractable) joint inference problem.

A different approach is to train directed graphical models of the data distribution. This includes a wide variety of methods ranging from Gaussian mixture models [9, 10] to autoregressive models [11] and stochastic variations of neural networks [12, 13, 14, 15, 16].


Among them, Rezende et al. [14] developed an approach for training a generative model with variational inference by performing (stochastic) backpropagation through a latent Gaussian representation. Goodfellow et al. [13] model natural images using a ”deconvolutional” generative network that is similar to our architecture.

Most unsupervised generative models can be extended to incorporate label information, forming semi-supervised and conditional generative models which lie between fully unsupervised approaches and our work. Examples include: gated conditional RBMs [5] for modeling image transformations, training RBMs to disentangle face identity and pose information using conditional RBMs [17], and learning a generative model of digits conditioned on digit class using variational autoencoders [18]. In contrast to our work, these approaches are typically restricted to small models and images, and they often require an expensive inference procedure – both during training and for generating images.

The general difference of our approach to prior work on learning generative models is that we assume a high-level latent representation of the images is given and use supervised training. This allows us 1) to generate relatively large high-quality images of 128 × 128 pixels (as compared to a maximum of 48 × 48 pixels in the aforementioned works) and 2) to completely control which images to generate rather than relying on random sampling. The downside is, of course, the need for a label that fully describes the appearance of each image.

Modeling of viewpoint variation is often considered in the context of pose-invariant face recognition [19, 20]. In a recent work Zhu et al. [21] approached this task with a neural network: their network takes a face image as input and generates a random view of this face together with the corresponding viewpoint. The network is fully connected and hence restricted to small images and, similarly to generative models, requires random sampling to generate a desired view. This makes it inapplicable to modeling large and diverse images, such as the chair images we model.

Our work is also loosely related to applications of CNNs to non-discriminative tasks, such as super-resolution [22] or inferring depth from a single image [23].

3 MODEL DESCRIPTION

Our goal is to train a neural network to generate accurate images of objects from a high-level description: style, orientation with respect to the camera, and additional parameters such as color, brightness, etc. The task at hand hence is the inverse of a typical recognition task: rather than converting an image to a compressed high-level representation, we need to generate an image given the high-level parameters.

Formally, we assume that we are given a dataset of examples D = {(c^1, v^1, θ^1), . . . , (c^N, v^N, θ^N)} with targets O = {(x^1, s^1), . . . , (x^N, s^N)}. The input tuples consist of three vectors: c is the one-hot encoding of the model identity (defining the style), v contains the azimuth and elevation of the camera position (represented by their sine and cosine, to deal with the periodicity of the angles: if we simply used the number of degrees, the network would have no way to understand that 0 and 359 degrees are in fact very close), and θ contains the parameters of additional artificial transformations applied to the images. The targets are the RGB output image x and the segmentation mask s.
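As a concrete illustration, a minimal NumPy sketch of how such an input triple could be assembled is given below; the ordering of the sine/cosine entries and the dimensionality of θ are assumptions (loosely based on Figure 1), not something the text specifies exactly.

import numpy as np

def encode_input(model_idx, num_models, azimuth_deg, elevation_deg, transf_params):
    """Assemble the (c, v, theta) input triple described above.

    c     -- one-hot encoding of the model identity (style)
    v     -- sine and cosine of azimuth and elevation, handling the periodicity of the angles
    theta -- parameters of the additional artificial transformations
    """
    c = np.zeros(num_models, dtype=np.float32)
    c[model_idx] = 1.0
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    v = np.array([np.sin(az), np.cos(az), np.sin(el), np.cos(el)], dtype=np.float32)
    theta = np.asarray(transf_params, dtype=np.float32)
    return c, v, theta

# e.g. chair style 42 of 809, azimuth 121 degrees, elevation 20 degrees, neutral transformation
c, v, theta = encode_input(42, 809, 121.0, 20.0, np.zeros(8))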

We include artificial transformations T_θ described by the randomly generated parameter vector θ to increase the amount of variation in the training data and reduce overfitting, analogous to data augmentation in discriminative CNN training [1]. Each T_θ is a combination of the following transformations: in-plane rotation, translation, zoom, stretching horizontally or vertically, changing hue, changing saturation, changing brightness.
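A hedged sketch of how such a random parameter vector θ could be sampled is shown below; only the list of transformation types comes from the text, while the ranges are illustrative placeholders and not the values used in the paper.

import numpy as np

def sample_transformation_params(rng=np.random):
    """Sample a random parameter vector theta for the artificial transformations T_theta.
    The transformation types follow the text; all ranges are illustrative placeholders."""
    return np.array([
        rng.uniform(-10, 10),     # in-plane rotation (degrees)
        rng.uniform(-0.1, 0.1),   # horizontal translation (fraction of image size)
        rng.uniform(-0.1, 0.1),   # vertical translation
        rng.uniform(0.85, 1.15),  # zoom factor
        rng.uniform(0.9, 1.1),    # horizontal stretch
        rng.uniform(0.9, 1.1),    # vertical stretch
        rng.uniform(-0.1, 0.1),   # hue shift
        rng.uniform(0.8, 1.2),    # saturation / brightness factor
    ], dtype=np.float32)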

3.1 Network architectures

Conceptually the generative network, which we formally refer to as g(c, v, θ), looks like a usual CNN turned upside down. It can be thought of as the composition of two processing steps g = u ◦ h. We experimented with several architectures; one of them is shown in Figure 1.

Layers FC-1 to FC-4 first build a shared, high-dimensional hidden representation h(c, v, θ) from the input parameters. Within these layers the three input vectors are first independently fed through two fully connected layers with 512 neurons each, and then the outputs of these three streams are concatenated. This independent processing is followed by two fully connected layers with 1024 neurons each, yielding the response of the fourth fully connected layer (FC-4).

The layers FC-5 and uconv-1 to uconv-4 then generate an image and an object segmentation mask from this shared hidden representation. We denote this expanding part of the network by u. We experimented with two architectures for u. In one of them, as depicted in Figure 1, the image and the segmentation mask are generated by two separate streams u_RGB(·) and u_segm(·). Each of these consists of a fully connected layer, the output of which is reshaped to an 8 × 8 multichannel image and fed through four ’unpooling+convolution’ layers with 5 × 5 filters and 2 × 2 unpooling. Alternatively, there is just one stream, which is used to predict both the image and the segmentation mask.
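A PyTorch sketch of this two-stream architecture is given below. It is not the authors' caffe implementation: the overall layout (two 512-unit layers per input stream, two joint 1024-unit layers, an 8 × 8 reshape, and four 5 × 5 up-convolutions with 2 × 2 unpooling per stream) follows Figure 1, while the channel widths of the up-convolutional layers and the LReLU slope are assumptions.

import torch
import torch.nn as nn

class ChairGenerator(nn.Module):
    """Sketch of the two-stream generator g = u ∘ h from Figure 1 (PyTorch, not the
    authors' caffe code). Up-convolution channel widths are placeholders."""

    def __init__(self, num_styles, view_dim=4, theta_dim=8):
        super().__init__()
        lrelu = lambda: nn.LeakyReLU(0.2)          # slope is an assumption

        def stream(in_dim):                        # FC-1, FC-2: independent processing per input
            return nn.Sequential(nn.Linear(in_dim, 512), lrelu(),
                                 nn.Linear(512, 512), lrelu())
        self.f_c, self.f_v, self.f_t = stream(num_styles), stream(view_dim), stream(theta_dim)

        self.fc34 = nn.Sequential(                 # FC-3, FC-4: joint representation h(c, v, theta)
            nn.Linear(3 * 512, 1024), lrelu(),
            nn.Linear(1024, 1024), lrelu())

        def expanding(out_channels):               # FC-5 + uconv-1..4 ('unpooling+convolution')
            return nn.Sequential(
                nn.Linear(1024, 256 * 8 * 8), lrelu(), nn.Unflatten(1, (256, 8, 8)),
                nn.ConvTranspose2d(256, 256, 5, stride=2, padding=2, output_padding=1), lrelu(),
                nn.ConvTranspose2d(256, 92, 5, stride=2, padding=2, output_padding=1), lrelu(),
                nn.ConvTranspose2d(92, 48, 5, stride=2, padding=2, output_padding=1), lrelu(),
                nn.ConvTranspose2d(48, out_channels, 5, stride=2, padding=2, output_padding=1))
        self.u_rgb, self.u_segm = expanding(3), expanding(1)   # RGB stream and segmentation stream

    def forward(self, c, v, theta):
        h = self.fc34(torch.cat([self.f_c(c), self.f_v(v), self.f_t(theta)], dim=1))
        return self.u_rgb(h), self.u_segm(h)       # 128 x 128 RGB image and segmentation mask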

In order to map the dense 8 × 8 representation to a high-dimensional image, we need to unpool the feature maps (i.e. increase their spatial span), as opposed to the pooling (shrinking the feature maps) implemented by usual CNNs. This is similar to the “deconvolutional” layers used in previous work [13, 24, 25]. As illustrated in Figure 2 (left), we perform unpooling by simply replacing each entry of a feature map by an s × s block with the entry value in the top left corner and zeros elsewhere. This increases the width and the height of the feature map s times. We used s = 2 in our networks. When a convolutional layer is preceded by such an unpooling operation we can think of unpooling+convolution (“up-convolution”) as the inverse operation of the convolution+pooling steps performed in a standard CNN, see Figure 2 right. This figure also illustrates how in practice the unpooling and convolution steps do not have to be performed sequentially, but can be combined in a single operation. Implementation-wise this operation is equivalent to a backward pass through a usual convolutional layer with stride s.
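This equivalence can be checked numerically; the PyTorch snippet below is a generic illustration (not code from the paper) of zero-insertion unpooling followed by convolution, and of the same operation expressed as a stride-2 transposed convolution.

import torch
import torch.nn.functional as F

def unpool(x, s=2):
    # zero-insertion unpooling: each entry becomes the top-left corner of an s x s block of zeros
    n, c, h, w = x.shape
    out = x.new_zeros(n, c, h * s, w * s)
    out[:, :, ::s, ::s] = x
    return out

x = torch.randn(1, 8, 8, 8)       # a dense 8 x 8 feature map with 8 channels
w = torch.randn(16, 8, 5, 5)      # 5 x 5 filters mapping 8 -> 16 channels

# 'up-convolution' as 2 x 2 unpooling followed by an ordinary 5 x 5 convolution ...
y1 = F.conv2d(unpool(x), w, padding=2)

# ... equals a stride-2 transposed convolution (the backward pass of a strided convolution);
# the kernel is flipped and its channel axes swapped to match PyTorch's weight convention.
y2 = F.conv_transpose2d(x, w.flip([2, 3]).transpose(0, 1),
                        stride=2, padding=2, output_padding=1)
print(torch.allclose(y1, y2, atol=1e-4))   # should print True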

In all networks each layer, except the output, is followed by a leaky rectified linear (LReLU) nonlinearity.



Fig. 1. Architecture of a 2-stream network that generates 128 × 128 pixel images. Layer names are shown above: FC - fully connected, uconv - unpooling+convolution.


Fig. 2. Illustration of unpooling (left) and unpooling+convolution (right) as used in the generative network.

In most experiments we generated 128 × 128 pixel images, but we also experimented with 64 × 64 and 256 × 256 pixel images. The only difference in architecture in these cases is one less or one more up-convolution, respectively. We found that adding a convolutional layer after each up-convolution significantly improves the quality of the generated images. We compare different architectures in Section 4.2.

3.2 Network training

The network parameters W, consisting of all layer weights and biases, are trained by minimizing the error of reconstructing the segmented-out chair image and the segmentation mask (the weights W are omitted from the arguments of h and u for brevity of notation):

\min_W \sum_{i=1}^{N} L_{RGB}\big(T_{\theta^i}(x^i \cdot s^i),\, u_{RGB}(h(c^i, v^i, \theta^i))\big) + \lambda \cdot L_{segm}\big(T_{\theta^i} s^i,\, u_{segm}(h(c^i, v^i, \theta^i))\big),   (1)

where L_RGB and L_segm are loss functions for the RGB image and for the segmentation mask respectively, and λ is a weighting term, trading off between the two. In our experiments L_RGB was always squared Euclidean distance, while for L_segm we tried two choices: squared Euclidean distance and negative log-likelihood loss preceded by a softmax layer. We set λ = 0.1 in the first case and λ = 100 in the second case.
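A minimal PyTorch sketch of this objective is shown below; for the binary mask, sigmoid plus binary cross-entropy stands in for the two-class softmax/NLL variant, which is an implementation choice rather than the paper's exact formulation.

import torch.nn.functional as F

def generator_loss(rgb_pred, segm_logits, rgb_target, segm_target, lam=100.0):
    """Sketch of objective (1) for one mini-batch.

    rgb_target  -- T_theta(x * s): the transformed, segmented-out RGB image
    segm_target -- T_theta(s):     the transformed binary segmentation mask
    """
    # L_RGB: squared Euclidean distance between generated and target RGB image
    l_rgb = F.mse_loss(rgb_pred, rgb_target)
    # L_segm: per-pixel negative log-likelihood of the mask (lambda = 100 in this case;
    # lambda = 0.1 when L_segm is also squared Euclidean distance)
    l_segm = F.binary_cross_entropy_with_logits(segm_logits, segm_target)
    return l_rgb + lam * l_segm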

Fig. 3. Representative images used for training the networks. Top: chairs, middle: tables, bottom: cars.

Note that although the segmentation mask could be inferred indirectly from the RGB image by exploiting the monotonous background, we do not rely on this but rather require the network to explicitly output the mask. We never directly show these generated masks in the following, but we use them to add a white background to the generated examples in almost all figures.

3.3 Probabilistic generative modeling

As we show below, the problem formulation from Section 3 can be used to learn a generator network which can interpolate between different objects. The learned network, however, implements a deterministic function from a high-level description (including the object identity as given by the one-hot encoding c) to a generated image.


The shared structure between multiple objects is only implicitly learned by the network and not explicitly accessible. It can be used to generate new “random” objects by blending between objects from the training set. However, this heuristic comes with no guarantees regarding the quality of the generated images. To get a more principled way of generating new objects we train a probabilistic generative model with an intermediate Gaussian representation. We replace the independent processing stream for the class identity c in layer FC-2 (the magenta part in Figure 1) with random samples drawn from an inference network q(z | c^i) = N(µ_z, Σ_z) that learns to capture the underlying (latent) structure of different chair images. This inference network can be learned alongside the generator described above by maximizing a variational bound on the sample log-likelihood. A full description of the probabilistic generative model training is given in Appendix A.

3.4 Datasets

As training data for the generative networks we used renderings of 3D models of different objects: chairs, made public by Aubry et al. [26], as well as car and table models from the ShapeNet [27] dataset.

Aubry et al. provide renderings of 1393 chair models, each rendered from 62 viewpoints: 31 azimuth angles (with a step of 11◦) and 2 elevation angles (20◦ and 30◦), with a fixed distance to the chair. We found that the dataset includes many near-duplicate models, models differing only by color, or low-quality models. After removing these we ended up with a reduced dataset of 809 models, which we used in our experiments. We cropped the renders to have a small border around the chair and resized to a common size of 128 × 128 pixels, padding with white where necessary to keep the aspect ratio. Example images are shown in Figure 3. For training the network we also used segmentation masks of all training examples, which we produced by subtracting the monotonous white background.
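The following Python sketch illustrates this preprocessing; the border width and the whiteness threshold are assumptions, since the paper does not state them.

import numpy as np
from PIL import Image

def preprocess_render(path, out_size=128, white_thresh=250, border=4):
    """Crop with a small border around the object, pad to a square with white, resize,
    and derive a segmentation mask from the monotonous white background (a sketch;
    threshold and border values are assumptions)."""
    img = np.asarray(Image.open(path).convert('RGB'))
    fg = (img < white_thresh).any(axis=2)                 # non-white pixels belong to the object
    ys, xs = np.where(fg)
    y0, y1 = max(ys.min() - border, 0), min(ys.max() + border, img.shape[0])
    x0, x1 = max(xs.min() - border, 0), min(xs.max() + border, img.shape[1])
    crop = img[y0:y1, x0:x1]

    h, w = crop.shape[:2]                                 # pad to a square with white background
    side = max(h, w)
    canvas = np.full((side, side, 3), 255, dtype=np.uint8)
    oy, ox = (side - h) // 2, (side - w) // 2
    canvas[oy:oy + h, ox:ox + w] = crop

    rgb = Image.fromarray(canvas).resize((out_size, out_size), Image.BILINEAR)
    mask = (np.asarray(rgb) < white_thresh).any(axis=2).astype(np.float32)
    return np.asarray(rgb, dtype=np.float32) / 255.0, mask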

We took models of cars and tables from ShapeNet, a dataset containing tens of thousands of consistently aligned 3D models of multiple classes. We rendered a turntable of each model in Blender using 36 azimuth angles (from 0◦ to 350◦ with a step of 10◦) and 5 elevation angles (from 0◦ to 40◦ with a step of 10◦), which resulted in 180 images per model. Positions of the camera and the light source were fixed during rendering. For the experiments in this paper we used renderings of 7124 car models and 1000 table models. All renderings are 256 × 256 pixels, and we additionally directly rendered the corresponding segmentation masks. Example renderings are shown in Figure 3.

4 TRAINING PARAMETERS

In this section we describe the details of training, compare different network architectures, and analyze the effect of dataset size and data augmentation on training.

4.1 Training details

For training the networks we built on top of the caffe CNN implementation [28]. We used Adam [29] with momentum parameters β1 = 0.9 and β2 = 0.999, and regularization parameter ε = 10^−6. Mini-batch size was 128.

Fig. 4. Qualitative results with different networks trained on chairs (columns, left to right: GT, 2s-E, 2s-S, 1s-S, 1s-S-wide, 1s-S-deep). See the description of the architectures in Section 4.2.

TABLE 1
Per-pixel mean squared error of the generated images with different network architectures and the number of parameters in the expanding parts of these networks.

Net            2s-E   2s-S   1s-S   1s-S-wide   1s-S-deep
MSE (·10^−3)   3.43   3.44   3.51   3.41        2.90
#param         27M    27M    18M    23M         19M

We started with a learning rate of 0.0005 for 250,000 mini-batch iterations, and then divided the learning rate by 2 after every 100,000 iterations, stopping after 500,000 iterations. We initialized the weights with Gaussian noise with variance computed based on the input dimensionality, as suggested by Sussillo [30] and He et al. [31].
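In PyTorch, the same schedule could look roughly as follows (a sketch reusing the hypothetical ChairGenerator module from Section 3.1; the paper's actual training code is built on caffe).

import torch

net = ChairGenerator(num_styles=809)       # hypothetical generator module from the Section 3.1 sketch

for m in net.modules():                    # Gaussian init with variance based on input dimensionality
    if isinstance(m, (torch.nn.Linear, torch.nn.ConvTranspose2d)):
        torch.nn.init.kaiming_normal_(m.weight)
        torch.nn.init.zeros_(m.bias)

optimizer = torch.optim.Adam(net.parameters(), lr=5e-4, betas=(0.9, 0.999), eps=1e-6)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250_000, 350_000, 450_000], gamma=0.5)   # halve at 250k, then every 100k
# train for 500,000 mini-batch iterations (batch size 128), calling scheduler.step() once per iteration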

In most experiments we used networks generating 128 × 128 pixel images. In the viewpoint interpolation experiments in Section 5.2 we generated 64 × 64 pixel images to reduce the computation time, since multiple networks had to be trained for those experiments. When working with cars, we generated 256 × 256 images to check if deeper networks capable of generating these larger images can be successfully trained. We did not observe any complications during training of these deeper networks.

4.2 Comparing network architectures

As mentioned above, we experimented with different network architectures. These include:

“2s-E” – a network shown in Figure 1 with two streams and squared Euclidean loss on the segmentation mask;

“2s-S” – same network with softmax and negative log-likelihood (NLL) loss on the segmentation mask;

“1s-S” – network with only one stream, same architecture as the RGB stream in Figure 1, which in the end predicts both the RGB image and the segmentation mask, with NLL loss on the segmentation mask;

“1s-S-wide” – same as “1s-S”, but with roughly 1.3 times more channels in each up-convolutional layer;

“1s-S-deep” – same as “1s-S”, but with an extra convolutional layer after each up-convolutional one. These convolutional layers have 3 × 3 kernels and the same number of channels as the preceding up-convolutional layer.

Images generated with these architectures, as well as the ground truth (GT), are shown in Figure 4. Reconstruction errors are shown in Table 1.


Fig. 5. Qualitative results for different numbers of car models in the training set (columns, left to right: GT, 500, 1000, 3000, 7124, 1000+aug).

TABLE 2
Per-pixel mean squared error of image generation with varying number of car models in the training set.

Num models     500    1000   3000   7124   1000+aug
MSE (·10^−3)   0.48   0.66   0.84   0.97   1.18

Clearly, the deeper “1s-S-deep” network is significantly better than the others, both qualitatively and quantitatively. For this reason we used this network in most experiments.

4.3 Training set size and data augmentation

We experimented with the training set size and analyzed the effect of data augmentation. We used cars for these experiments, since we have the most car models available. While keeping the network architecture fixed, we varied the training set size. Example generated images are shown in Figure 5. Each column corresponds to a different number of models in the training set, and all networks except the one in the rightmost column were trained without data augmentation. While for a standard car model (top row) there is not much difference, for difficult models (other rows) a smaller training set leads to better reconstruction of fine details. The effect of data augmentation is qualitatively very similar to increasing the training set size. Reconstruction errors shown in Table 2 support these observations.

Data augmentation leads to worse reconstruction of fine details, but it is expected to lead to better generalization. To check this, we tried to morph one model into another by linearly interpolating between their one-hot input style vectors. The result is shown in Figure 6. Note how the network trained without augmentation (top row) better models the images from the training set, but fails to interpolate smoothly.

5 EXPERIMENTS

We show how the networks successfully model the complex data and demonstrate their generalization abilities by generating images unseen during training: new viewpoints and object styles. We also show an application of generative networks to finding correspondences between objects from the training set.

Fig. 6. Interpolation between two car models. Top: without data augmentation, bottom: with data augmentation.

5.1 Modeling transformations

Figure 7 shows how a network is able to generate chairs that are significantly transformed relative to the original images. Each row shows a different type of transformation. Images in the central column are non-transformed. Even in the presence of large transformations, the quality of the generated images is basically as good as without transformation. The image quality typically degrades a little in case of unusual chair shapes (such as rotating office chairs) and chairs including fine details such as armrests (see e.g. one of the armrests in the second to last row in Figure 7). Interestingly, the network successfully models zoom-out (row 3 of Figure 7), even though it has never been presented any zoomed-out images during training.

The network easily deals with extreme color-related transformations, but has some problems representing large spatial changes, especially translations. The generation quality in such cases could likely be improved with a more complex architecture, which would allow transformation parameters to explicitly affect the feature maps of convolutional layers (by translating, rotating, zooming them), perhaps in a fashion similar to Karol et al. [32] or Jaderberg et al. [33].

5.2 Interpolation between viewpoints

In this section we show that the generative network is able to generate previously unseen views by interpolating between views present in the training data. This demonstrates that the network internally learns a representation of chairs which enables it to judge chair similarity and use the known examples to generate previously unseen views.

In this experiment we used a 64 × 64 network to reduce computational costs. We randomly separated the chair styles into two subsets: the ’source set’ with 90% of the styles and the ’target set’ with the remaining 10% of the chairs. We then varied the number of viewpoints per style either in both these datasets together (’no transfer’) or just in the target set (’with transfer’) and then trained a generative network as before. In the second setup the idea is that the network may use the knowledge about chairs learned from the source set (which includes all viewpoints) to generate the missing viewpoints of the chairs from the target set.

Figure 8 shows some representative examples of angle interpolation. For 15 views in the target set (first pair of rows) the effect of the knowledge transfer is already visible: interpolation is smoother and fine details are preserved better, for example a leg in the middle column.


Fig. 7. Generation of chair images while activating various transformations. Each row shows one transformation: translation, rotation, zoom, stretch, saturation, brightness, color. The middle column shows the reconstruction without any transformation.

Starting from 8 views (second pair of rows and below), the network without knowledge transfer fails to produce satisfactory interpolation, while the one with knowledge transfer works reasonably well even with just one view presented during training (bottom pair of rows). However, in this case some fine details, such as the armrest shape, are lost.

In Figure 9 we plot the average Euclidean error of the generated missing viewpoints from the target set, both with and without transfer (blue and green curves). Clearly, the presence of all viewpoints in the source dataset dramatically improves the performance on the target set, especially for small numbers of available viewpoints.

One might suppose (for example looking at the bottom pair of rows of Figure 8) that the network simply learns all the views of the chairs from the source set and then, given a limited number of views of a new chair, finds the most similar one, in some sense, among the known models and simply returns the images of that chair. To check if this is the case, we evaluated the performance of such a naive nearest neighbor approach. For each image in the target set we found the closest match in the source set for each of the given views and interpolated the missing views by linear combinations of the corresponding views of the nearest neighbors. For finding nearest neighbors we tried two similarity measures: Euclidean distance between RGB images and between HOG descriptors. The results are shown in Figure 9. Interestingly, although HOG yields semantically much more meaningful nearest neighbors (not shown in figures), RGB similarity performs much better numerically. The performance of this nearest neighbor method is always worse than that of the network, suggesting that the network learns more than just linearly combining the known chairs, especially when many viewpoints are available in the target set.


Fig. 8. Examples of view interpolation (azimuth angle). In each pair of rows the top row is with knowledge transfer and the second row is without it. In each row the leftmost and the rightmost images are the views presented to the network during training while all intermediate ones are new to the network and, hence, are the result of interpolation. The number of different views per chair available during training is 15, 8, 4, 2, 1 (top-down). Image quality is worse than in other figures because we used the 64 × 64 network.

[Figure 9 plot: average squared error per pixel vs. the number of viewpoints in the target set (1, 2, 4, 8, 15); curves: no knowledge transfer, with knowledge transfer, nearest neighbor HOG, nearest neighbor RGB.]

Fig. 9. Reconstruction error for unseen views of chairs from the target set depending on the number of viewpoints present during training. Blue: all viewpoints available in the source dataset (knowledge transfer), green: the same number of viewpoints are available in the source and target datasets (no knowledge transfer).



Fig. 10. Elevation angle knowledge transfer (columns: elevation angles 0◦, 10◦, 20◦, 30◦, 40◦, 50◦, 70◦). In each pair of rows, top row: trained only on chairs (no knowledge transfer), bottom row: trained both on chairs and tables (with knowledge transfer). Green background denotes elevations not presented during training.

5.3 Elevation transfer and extrapolation

The chairs dataset only contains renderings with elevation angles 20◦ and 30◦, while for tables elevations between 0◦ and 40◦ are available. We show that we can transfer information about elevations from one class to another. To this end we trained a network on both chairs and tables, and then generated images of chairs with elevations not present during training. As a baseline we use a network trained solely on chairs. The results are shown in Figure 10. While the network trained only on chairs hardly generalizes to unseen elevation angles at all, the one trained with tables is able to generate unseen views of chairs very well. The only drawback is that the generated images do not always precisely correspond to the desired elevation, for example 0◦ and 10◦ for the second model in Figure 10. Still, this result suggests that the network is able to transfer the understanding of 3D object structure from one object class to another.

The network trained both on chairs and tables can very well predict views of tables from previously unseen elevation angles. Figure 11 shows how the network can generate images with previously unseen elevations from 50◦ to 90◦. Interestingly, the presence of chairs in the training set helps better extrapolate views of tables. Our hypothesis is that the network trained on both object classes is forced to model not only one kind of objects, but also the general 3D geometry. This helps generate reasonable views from new elevation angles. We hypothesize that modeling even more object classes with a single network would allow learning a universal class-independent representation of 3D shapes.

5.4 Interpolation between styles

Remarkably, the generative network can not only imagine previously unseen views of a given object, but also invent new objects by interpolating between given ones. To obtain such interpolations, we simply linearly change the input label vector from one class to another. Some representative examples of such morphings for chairs and cars are shown in Figures 12 and 13 respectively. The morphings of each object class are sorted by subjective morphing quality (decreasing from top to bottom).
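A sketch of this interpolation procedure, using the hypothetical ChairGenerator module from the Section 3.1 sketch, is given below; it simply blends two one-hot style vectors while keeping viewpoint and transformation parameters fixed.

import numpy as np
import torch

def morph(net, style_a, style_b, num_styles, v, theta, steps=8):
    """Linearly interpolate between the one-hot input vectors of two training styles."""
    frames = []
    with torch.no_grad():
        for t in np.linspace(0.0, 1.0, steps):
            c = torch.zeros(1, num_styles)
            c[0, style_a], c[0, style_b] = 1.0 - float(t), float(t)
            rgb, _ = net(c, v, theta)
            frames.append(rgb)
    return torch.cat(frames)   # (steps, 3, H, W) image sequence from style_a to style_b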


Fig. 11. Elevation extrapolation (columns: elevation angles 0◦, 10◦, 20◦, 30◦, 40◦, 50◦, 70◦, 90◦). In each pair of rows, top row: trained only on tables, bottom row: trained both on chairs and tables. Green background denotes elevations not presented during training.

The networks produce very natural-looking morphings even in challenging cases. Even when the two morphed objects are very dissimilar, for example the last two rows in Figure 12, the intermediate chairs look very realistic.

It is also possible to interpolate between more than two objects. Figure 14 shows morphing between three chairs: one triple in the upper triangle and one in the lower triangle of the table. The network successfully combines the features of three chairs.

The networks can interpolate between objects of the same class, but can they morph objects of different classes into each other? The inter-class difference is larger than the intra-class variance, hence to successfully interpolate between classes the network has to close this large gap between different classes. We check if this large gap can be closed by a network trained on chairs and tables. Results are shown in Figure 15. The quality of intermediate images is somewhat worse than for intra-class morphings shown above, but overall very good, especially considering that the network has never seen anything intermediate between a chair and a table.

5.5 Feature space arithmetics

In the previous section we have seen that the feature representation learned by the network allows for smooth transitions between two or even three different objects. Can this property be used to transfer properties of one object onto another by performing simple arithmetics in the feature space? Figure 16 shows that this is indeed possible. By simple subtraction and addition in the feature space (FC-2 features in this case) we can change an armchair into a chair with similar style, or a chair with a stick back into an identical chair with a solid back. We found that the exact layer where the arithmetic is performed does not matter: the results are basically identical when we manipulate the input style vectors, or the outputs of layers FC-1 or FC-2 (not shown).
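A sketch of such a feature-space operation is given below; fc2 and decode are hypothetical handles on the class-stream FC-2 output and on the remaining layers of the generator, not functions named in the paper.

def transfer_property(fc2, decode, c_target, c_with, c_without, v, theta):
    """Add the FC-2 difference 'without - with' of a property (e.g. armrests, a solid back)
    to a target object's FC-2 features and decode the result (sketch)."""
    f = fc2(c_target) + (fc2(c_without) - fc2(c_with))
    return decode(f, v, theta)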

5.6 Random chair generation

In this section we show results on generating random chair images using the ideas briefly outlined in Section 3.3.


Fig. 12. Examples of morphing different chairs, one morphing per row. Leftmost and rightmost chairs in each row are present in the training set, all intermediate ones are generated by the network. Rows are ordered by decreasing subjective quality of the morphing, from top to bottom.

Fig. 13. Examples of morphing different cars, one morphing per row. Leftmost and rightmost cars in each row are present in the training set, all intermediate ones are generated by the network. Rows are ordered by decreasing subjective quality of the morphing, from top to bottom.

Fig. 14. Interpolation between three chairs. On the main diagonal the interpolation between the top left and bottom right chairs is shown. In the upper and lower triangles interpolations between three chairs are shown: top left, bottom right, and respectively bottom left and top right.

Fig. 15. Interpolation between chairs and tables. The chairs on the left and the tables on the right are present in the training set.

In particular, we use both networks trained in a fully supervised manner using the training objective from Section 3, as well as networks trained with the variational bound objective described in Appendix A.

As mentioned before, there is no principled way to perform sampling for networks trained in a supervised manner. Nonetheless there are some natural heuristics that can be used to obtain “quasi-random” generations. We can first observe that the style input of the network is a probability distribution over styles, which at training time is concentrated on a single style (i.e. c is a one-hot encoding of the chair style).


Fig. 16. Feature arithmetics: simple operations in the feature space lead to interpretable changes in the image space.

However, in the interpolation experiments we have seen that the network also generates plausible images given inputs with several non-zero entries. This suggests generating random images by using random distributions as input for the network. We tried two families of distributions: (1) we computed the softmax of a Gaussian noise vector with the same size as c and with standard deviation σ, and (2) we first randomly selected M styles, then sampled a coefficient for each of them from uniform([0, 1]), and then normalized to unit sum.
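Both families of input distributions are easy to reproduce; a NumPy sketch is given below (the choice of random number generator and the exact softmax implementation are incidental).

import numpy as np

def random_style_softmax(num_styles, sigma, rng=np.random):
    """Method (1): softmax of a Gaussian noise vector with standard deviation sigma."""
    z = rng.normal(0.0, sigma, size=num_styles)
    e = np.exp(z - z.max())
    return e / e.sum()

def random_style_mixture(num_styles, M, rng=np.random):
    """Method (2): pick M random styles, draw uniform [0, 1] coefficients, normalize to unit sum."""
    c = np.zeros(num_styles)
    idx = rng.choice(num_styles, size=M, replace=False)
    w = rng.uniform(0.0, 1.0, size=M)
    c[idx] = w / w.sum()
    return c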

Exemplary results of these two experiments are shown in Figure 17 (a)-(d). For each generated image the closest chair from the dataset, according to Euclidean distance, is shown. (a) and (b) are generated with method (1) with σ = 2 and σ = 4 respectively. The results are not good: when σ is very low the generated chairs are all similar, while with higher σ they just copy chairs from the training set. (c) and (d) are produced with method (2) with M = 3 and M = 5 respectively. Here the generated chairs are quite diverse and not too similar to the chairs from the training set.

The model which was trained with the variational bound objective (as described in Appendix A) directly allows us to sample from the assumed prior (a multivariate Gaussian distribution with unit covariance) and generate images from these draws. To that end we simply input random Gaussian noise into the network instead of using the FC-2 activations of the style stream. The results are shown in Figure 17 (e)-(f). The difference between (e) and (f) is that in (e) the KL-divergence term in the loss function was weighted 10 times higher than in (f). This leads to much more diverse chairs being generated.

As a control, in Figure 17 (g)-(h) we also show chairs generated in the same way (but with adjusted standard deviations) from a network trained without the variational bound objective. While such a procedure is not guaranteed to result in any visually appealing images, since the hidden layer activations are not restricted to a particular regime during training, we found that it does result in sharp chair images. However, both for low (g) and high (h) standard deviations the generated chairs are not very diverse.

Overall, we see that the heuristic of combining several chairs and the variational-bound-based training lead to generating images of roughly similar quality and diversity.


Fig. 17. Random chairs, generated with different methods. In each pair of rows top: generated chairs, bottom: nearest neighbors from the training set. (a),(b): from softmax of a Gaussian in the input layer, (c),(d): interpolations between several chairs from the training set, (e),(f): from Gaussian noise in FC-2 of stochastic networks trained with the variational bound loss, (g),(h): from Gaussian noise in FC-2 of networks trained with the usual loss. See the text for more details.


However, the second approach is advantageous in that it allows generating images simply from a Gaussian distribution, and it is more principled, potentially promising further improvement when better optimized or combined with other kinds of stochastic networks.

5.7 Correspondences

The ability of the generative CNN to interpolate between different chairs allows us to find dense correspondences between different object instances, even if their appearance is very dissimilar.

Given two chairs from the training dataset, we used the “1s-S-deep” network to generate a morphing consisting of 64 images (with fixed view). We then computed the optical flow in the resulting image sequence using the code of Brox et al. [34]. To compensate for the drift, we refined the computed optical flow by recomputing it with a step of 9 frames, initialized by concatenated per-frame flows. Concatenation of these refined optical flows gives the global vector field that connects corresponding points in the two chair images.
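The chaining step can be illustrated with a short NumPy/SciPy sketch: the flow from frame A to frame C is the flow A to B plus the flow B to C sampled at the positions reached by the first flow. The actual flows in the paper come from the method of Brox et al. [34]; the code below only shows the concatenation, not the flow estimation.

import numpy as np
from scipy.ndimage import map_coordinates

def concat_flows(flow_ab, flow_bc):
    """Concatenate two dense flow fields of shape (H, W, 2) storing (dx, dy) per pixel."""
    h, w = flow_ab.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xb = xs + flow_ab[..., 0]                      # positions reached in the intermediate frame
    yb = ys + flow_ab[..., 1]
    dx_bc = map_coordinates(flow_bc[..., 0], [yb, xb], order=1, mode='nearest')
    dy_bc = map_coordinates(flow_bc[..., 1], [yb, xb], order=1, mode='nearest')
    return np.stack([flow_ab[..., 0] + dx_bc, flow_ab[..., 1] + dy_bc], axis=-1)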

In order to numerically evaluate the quality of the correspondences, we created a small test set of 30 image pairs. To analyze the performance in more detail, we separated these into 10 ’simple’ pairs (two chairs are quite similar in appearance) and 20 ’difficult’ pairs (two chairs differ a lot in appearance). Exemplar pairs are shown in Figure 18. We manually annotated several keypoints in the first image of each pair (in total 295 keypoints in all images) and asked 9 people to manually mark corresponding points in the second image of each pair. We then used mean keypoint positions in the second images as ground truth. At test time we measured the performance of different methods by computing average displacement of predicted keypoints in the second images given keypoints in the first images. We also manually annotated an additional validation set of 20 image pairs to tune the parameters of all methods (however, we were not able to search the parameters exhaustively because some methods have many).

In Table 3 we show the performance of our algorithm compared to human performance and two baselines: SIFT flow [35] and Deformable Spatial Pyramid (DSP) [36]. On average the very basic approach we used outperforms both baselines thanks to the intermediate samples produced by the generative neural network. More interestingly, while SIFT flow and DSP have problems with the difficult pairs, our algorithm does not. This suggests that errors of our method are largely due to contrast changes and drift in the optical flow, which does not depend on the difficulty of the image pair. The approaches are hence complementary: while for similar objects direct matching is fairly accurate, for more dissimilar ones intermediate morphings are very helpful.

6 ANALYSIS OF THE NETWORK

We have shown that the networks can model objects extremely well. We now analyze the inner workings of the networks, trying to get some insight into the source of their success. The “2s-E” network was used in this section.

Fig. 18. Exemplar image pairs from the test set with ground truth correspondences. Left: ’simple’ pairs, right: ’difficult’ pairs.

TABLE 3
Average displacement (in pixels) of corresponding keypoints found by different methods on the whole test set and on the ’simple’ and ’difficult’ subsets.

Method           All   Simple   Difficult
DSP [36]         5.2   3.3      6.3
SIFT flow [35]   4.0   2.8      4.8
Ours             3.4   3.1      3.5
Human            1.1   1.1      1.1

6.1 Activating single units

One way to analyze a neural network (artificial or real) is to visualize the effect of single neuron activations. Although this method does not allow us to judge the network’s actual functioning, which involves a clever combination of many neurons, it still gives a rough idea of what kind of representation is created by the different network layers.

Activating single neurons of uconv-3 feature maps (the last feature maps before the output) is equivalent to simply looking at the filters of these layers, which are shown in Figure 19. The final output of the network at each position is a linear combination of these filters. As is to be expected, they include edges and blobs.

Our model is tailored to generate images from high-level neuron activations, which allows us to activate a single neuron in some of the higher layers and forward-propagate down to the image. The results of this procedure for different layers of the network are shown in Figures 20 and 23. Each row corresponds to a different network layer. The leftmost image in each row is generated by setting all neurons of the layer to zero, and the other images – by activating one randomly selected neuron.
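The probing procedure itself is simple; a sketch is shown below, where decoder is a hypothetical handle on the part of the network above the probed layer.

import torch

def image_from_single_unit(decoder, layer_dim, unit, value=1.0):
    """Set one neuron of a hidden layer to a fixed value, all others to zero, and
    propagate through the remaining layers of the generator (sketch)."""
    h = torch.zeros(1, layer_dim)
    h[0, unit] = value
    with torch.no_grad():
        return decoder(h)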



Fig. 19. Output layer filters of the “2s-E” network. Top: RGB stream. Bottom: Segmentation stream.

Fig. 20. Images generated from single unit activations in feature maps of different fully connected layers of the “2s-E” network. From top to bottom: FC-1 and FC-2 of the class stream, FC-3, FC-4.

Fig. 21. The effect of specialized neurons in the layer FC-4. Each row shows the result of increasing the value of a single FC-4 neuron given the feature maps of a real chair. Effects of all neurons, top to bottom: translation upwards, zoom, stretch horizontally, stretch vertically, rotate counter-clockwise, rotate clockwise, increase saturation, decrease saturation, make violet.

Fig. 22. Some of the neural network weights corresponding to the transformation neurons shown in Figure 21. Each row shows weights connected to one neuron, in the same order as in Figure 21. Only a selected subset of the most interesting channels is shown.

Fig. 23. Images generated from single neuron activations in feature maps of some layers of the “2s-E” network. From top to bottom: uconv-2, uconv-1, FC-5 of the RGB stream. The relative scale of the images is correct. The bottom images are 57 × 57 pixels, approximately half of the chair size.

In Figure 20 the first two rows show images produced when activating neurons of the FC-1 and FC-2 feature maps of the class stream while keeping viewpoint and transformation inputs fixed. The results clearly look chair-like but do not show much variation (the most visible difference is chair vs armchair), which suggests that larger variations are achievable by activating multiple neurons. The last two rows show results of activating neurons of the FC-3 and FC-4 feature maps. These feature maps contain joint class-viewpoint-transformation representations, hence the viewpoint is not fixed anymore. The generated images still resemble chairs but get much less realistic. This is to be expected: the further away from the inputs, the less semantic meaning there is in the activations.

In the middle of the bottom row in Figure 20 one can notice a neuron which seems to generate a zoomed chair. By looking at FC-4 neurons more carefully we found that this is indeed a ’zoom neuron’, and, moreover, for each transformation there is a specialized neuron in FC-4. The effect of these is shown in Figure 21. Increasing the activation of one of these neurons while keeping all other activations in FC-4 fixed results in a transformation of the generated image. It is quite surprising that information about transformations is propagated without change through all fully connected layers. It seems that all potential transformed versions of a chair are already contained in the FC-4 features, and the ’transformation neurons’ only modify them to give more relevance to the activations corresponding to the required transformation. The corresponding weights connecting these specialized neurons to the next layer are shown in Figure 22 (one neuron per row). Some output channels are responsible for spatial transformations of the image, while others deal with the color and brightness.

Images generated from single neurons of the convolutional layers are shown in Figure 23.


A somewhat disappointing observation is that while single neurons in later layers (uconv-2 and uconv-3) produce edge-like images, the neurons of higher deconvolutional layers generate only blurry ’clouds’, as opposed to the results of Zeiler and Fergus [24] with a classification network and max-unpooling. Our explanation is that because we use naive regular-grid unpooling, the network cannot slightly shift small parts to precisely arrange them into larger meaningful structures. Hence it must find another way to generate fine details. In the next subsection we show that this is achieved by a combination of spatially neighboring neurons.

6.2 Analysis of the hidden layers

Rather than just activating single neurons while keeping all others fixed to zero, we can use the network to normally generate an image and then analyze the hidden layer activations by either looking at them or modifying them and observing the results. An example of this approach was already used above in Figure 21 to understand the effect of the ’transformation neurons’. We present two more results in this direction here.

In order to find out how the blurry ’clouds’ generated by single high-level deconvolutional neurons (Figure 23) form perfectly sharp chair images, we smoothly interpolate between a single activation and the whole chair. Namely, we start with the FC-5 feature maps of a chair, which have a spatial extent of 8 × 8. Next we only keep active neurons in a region around the center of the feature map (setting all other activations to zero), gradually increasing the size of this region from 2 × 2 to 8 × 8. Hence, we can see the effect of going from an almost single-neuron activation level to the whole image level. The outcome is shown in Figure 24. Clearly, the interaction of neighboring neurons is very important: in the central region, where many neurons are active, the image is sharp, while in the periphery it is blurry. One interesting effect that is visible in the images is how sharply the legs of the chair end in the second to last image but appear in the larger image. This suggests highly non-linear suppression effects between activations of neighboring neurons.
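A sketch of this masking experiment is given below; uconv_stack is a hypothetical handle on the up-convolutional layers above FC-5.

import torch

def masked_decodings(uconv_stack, fc5_maps):
    """Zero out FC-5 activations outside a centred k x k window of the 8 x 8 feature maps
    and decode, for k = 2, 4, 6, 8 (sketch)."""
    _, _, h, w = fc5_maps.shape        # expected spatial size 8 x 8
    images = []
    with torch.no_grad():
        for k in (2, 4, 6, 8):
            mask = torch.zeros_like(fc5_maps)
            y0, x0 = (h - k) // 2, (w - k) // 2
            mask[:, :, y0:y0 + k, x0:x0 + k] = 1.0
            images.append(uconv_stack(fc5_maps * mask))
    return images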

Lastly, some interesting observations can be made by taking a closer look at the feature maps of the uconv-3 layer (the last pre-output layer). Some of them exhibit regular patterns, shown in Figure 25. These feature maps correspond to filters which look near-empty in Figure 19 (such as the 3rd and 10th filters in the first row). Our explanation of these patterns is that they compensate for high-frequency artifacts originating from fixed filter sizes and regular-grid unpooling. This is supported by the last row of Figure 25, which shows what happens to the generated image when these feature maps are set to zero.

7 CONCLUSIONS

We have shown that supervised training of convolutional neural networks can be used not only for discriminative tasks, but also for generating images given high-level style, viewpoint, and lighting information.

Fig. 24. Chairs generated from spatially masked FC-5 feature maps (the feature map size is 8 × 8). The size of the non-zero region increases left to right: 2 × 2, 4 × 4, 6 × 6, 8 × 8.

Fig. 25. Top: Selected feature maps from the pre-output layer (uconv-3) of the RGB stream. These feature maps correspond to the filters which look near-empty in Figure 19. Middle: Close-ups of the feature maps. Bottom: Generation of a chair with these feature maps set to zero (left image pair) or left unchanged (right). Note the high-frequency artifacts in the left pair of images.

A network trained for such a generative task does not merely learn to generate the training samples, but learns a generic implicit representation, which allows it to smoothly morph between different object views or object instances with all intermediate images being meaningful. Moreover, when trained in a stochastic regime, it can be creative and invent new chair styles based on random noise. Our experiments suggest that networks trained to reconstruct objects of multiple classes develop some understanding of 3D shape and geometry. It is fascinating that the relatively simple architecture we proposed is already able to learn all these complex behaviors.

ACKNOWLEDGMENTS

AD, MT and TB acknowledge funding by the ERC Starting Grant VideoLearn (279401). JTS is supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (EXC 1086).

APPENDIX A: TRAINING USING A VARIATIONAL BOUND

Here we give additional details on the training objective used in the experiment on random chair generation in Section 5.6. We phrase the problem of generating chairs as that of learning a (conditional) probabilistic generative model. Formally, we can derive such a generative model from the definitions in Section 3 as follows. As before, let u_RGB and u_segm denote the “expanding” part of the generator, and let us assume full knowledge about the view v, transformation parameters θ and chair identity c. We further assume there exists a distribution p(z | c) over latent states z ∈ R^512 – which capture the underlying manifold of chair images – such that the generators u_RGB and u_segm can generate the corresponding image using h(z, v, θ) as input. We define h as the mapping obtained by replacing the independent processing stream for the class identity c with the 512-dimensional random vector z in layer FC-2 (cf. Figure 1). We can then define the likelihood of a segmentation mask s^i under transformation T_θ^i as

\[
p(T_{\theta^i} s^i \mid z^i, \theta^i, v^i) = \prod u_{segm}\big(h(z^i, v^i, \theta^i)\big), \qquad (2)
\]

where u_segm outputs per-pixel probabilities (i.e. it has a sigmoid output nonlinearity). Assuming the pixels in an image x^i are distributed according to a multivariate Gaussian distribution, the likelihood of x^i can be defined as

\[
p\big(T_{\theta^i}(x^i \cdot s^i) \mid z^i, \theta^i, v^i\big) = \mathcal{N}\big(u_{RGB}(h(z^i, v^i, \theta^i)),\, \Sigma\big), \qquad (3)
\]

where N(µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. We simplify this formulation by assuming a diagonal covariance structure Σ = Iσ. Since all distributions appearing in our derivation are always conditioned on the augmentation parameters θ^i and view parameters v^i, we will omit them in the following to simplify notation, i.e. we define p(T_θ^i s^i | z^i) := p(T_θ^i s^i | z^i, θ^i, v^i) and p(T_θ^i (x^i · s^i) | z^i) := p(T_θ^i (x^i · s^i) | z^i, θ^i, v^i). With these definitions in place we can express the marginal log-likelihood of an image and its segmentation mask as

\[
\log p(s^i, x^i) = \mathbb{E}_{z}\big[\, p(T_{\theta^i} s^i \mid z^i)\; p\big(T_{\theta^i}(x^i \cdot s^i) \mid z^i\big) \,\big], \qquad \text{with } z^i \sim p(z \mid c^i). \qquad (4)
\]

Since we, a priori, have no knowledge regarding the structure of p(z | c^i) – i.e. we do not know which shared underlying structure different chairs possess – we have to replace it with an approximate inference distribution q(z | c^i) = N(µ_z^i, Iσ_z^i). In our case we parameterize q as a two-layer fully connected neural network (with 512 units per layer) predicting the mean µ_z^i and variance σ_z^i of the distribution. We refer to the parameters of this network as Φ. Following recent examples from the neural networks literature [14, 15] we can then jointly train this approximate inference network and the generator networks by maximizing a variational lower bound on the log-likelihood from (4) as

\[
\begin{aligned}
\log p(s^i, x^i) \;\ge\; & \mathbb{E}_{z}\!\left[ \log \frac{p(T_{\theta^i} s^i \mid z^i)\; p\big(T_{\theta^i}(x^i \cdot s^i) \mid z^i\big)}{q_\Phi(z \mid c^i)} \right] \\
\;=\; & \mathbb{E}_{z}\big[ \log p(T_{\theta^i} s^i \mid z^i) + \log p\big(T_{\theta^i}(x^i \cdot s^i) \mid z^i\big) \big] - \mathrm{KL}\big(q(z \mid c^i) \,\|\, p(z)\big), \\
& \quad \text{with } z^i \sim q_\Phi(z \mid c^i) \\
\;=:\; & L_{VB}(x^i, s^i, c^i, v^i, \theta^i), \qquad (5)
\end{aligned}
\]

and where p(z) is a prior on the latent distribution, which we always set to p(z) = N(0, I). The optimization problem we seek to solve can then be expressed through the sum of these losses over all data points (and M independent samples from q_Φ) and is given as

\[
\max_{W, \Phi} \; \sum_{i=1}^{N} \sum_{m=1}^{M} L_{VB}(x^i, s^i, c^i, v^i, \theta^i), \qquad \text{with } z \sim q_\Phi(z \mid c^i, v^i, \theta^i), \qquad (6)
\]

where, for our experiments, we simply take a single sample M = 1 per data point. Equation (6) can then be optimized using standard stochastic gradient descent, since the derivative with respect to all parameters can be computed in closed form. For a detailed explanation of how one can back-propagate through the sampling procedure we refer to Kingma and Welling [15].
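
For concreteness, the sketch below evaluates a single-sample (M = 1) estimate of L_VB from Equation (5) for one data point in NumPy. The inference network q_Φ and the generator parts u_RGB, u_segm appear only as placeholder callables, and the transformation T_θ is assumed to have already been applied to the targets; this is an illustrative sketch, not the actual implementation used for the experiments:

import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    # log N(x | mu, sigma^2 I), summed over pixels (diagonal covariance, cf. Eq. (3))
    return -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma ** 2))

def bernoulli_log_likelihood(s, p, eps=1e-8):
    # log-likelihood of a binary segmentation mask under per-pixel probabilities (cf. Eq. (2))
    return np.sum(s * np.log(p + eps) + (1.0 - s) * np.log(1.0 - p + eps))

def kl_to_standard_normal(mu_z, sigma_z):
    # KL( N(mu_z, diag(sigma_z^2)) || N(0, I) ) in closed form
    return 0.5 * np.sum(sigma_z ** 2 + mu_z ** 2 - 1.0 - 2.0 * np.log(sigma_z))

def variational_bound(x, s, c, v, theta, encode_q, h, u_rgb, u_segm, sigma_x=1.0):
    # encode_q, h, u_rgb and u_segm are placeholder callables standing in for the
    # two-layer inference network q_Phi and the generator described in the text.
    mu_z, sigma_z = encode_q(c)                         # q_Phi(z | c): predicted mean and std
    z = mu_z + sigma_z * np.random.randn(*mu_z.shape)   # reparameterized sample (M = 1)
    code = h(z, v, theta)                               # z replaces the class stream in FC-2
    log_p_segm = bernoulli_log_likelihood(s, u_segm(code))
    log_p_rgb = gaussian_log_likelihood(x * s, u_rgb(code), sigma_x)
    return log_p_segm + log_p_rgb - kl_to_standard_normal(mu_z, sigma_z)

Maximizing the average of this quantity over the training set with stochastic gradient ascent corresponds to the objective in Equation (6).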

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114.

[2] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, pp. 504–507, 2006.

[3] R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, Jul. 2006.

[5] R. Memisevic and G. Hinton, “Unsupervised learning of image transformations,” in CVPR, 2007.

[6] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in ICML, 2009, pp. 609–616.

[7] Y. Tang and A.-R. Mohamed, “Multiresolution deep belief networks,” in AISTATS, 2012.

[8] M. Ranzato, J. Susskind, V. Mnih, and G. E. Hinton, “On deep generative models with applications to recognition,” in CVPR, 2011, pp. 2857–2864.

[9] R. P. Francos, H. Permuter, and J. Francos, “Gaussian mixture models of texture and colour for image database,” in ICASSP, 2003.

[10] L. Theis, R. Hosseini, and M. Bethge, “Mixtures of conditional Gaussian scale mixtures applied to multiscale image representations,” PLoS ONE, 2012.

[11] H. Larochelle and I. Murray, “The neural autoregressive distribution estimator,” JMLR: W&CP, vol. 15, pp. 29–37, 2011.

[12] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” in ICML, 2014.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.

[14] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in ICML, 2014.

[15] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2014.

[16] Y. Tang and R. Salakhutdinov, “Learning stochastic feedforward neural networks,” in Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 530–538. [Online]. Available: http://papers.nips.cc/paper/5026-learning-stochastic-feedforward-neural-networks.pdf

[17] S. Reed, K. Sohn, Y. Zhang, and H. Lee, “Learning to disentangle factors of variation with manifold interaction,” in ICML, 2014.

[18] D. Kingma, D. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in NIPS, 2014.

[19] V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” TPAMI, vol. 25, no. 9, Sep. 2003.

[20] X. Zhang, Y. Gao, and M. K. H. Leung, “Recognizing rotated faces from frontal and side views: An approach toward effective use of mugshot databases,” 2008, pp. 684–697.

[21] Z. Zhu, P. Luo, X. Wang, and X. Tang, “Multi-view perceptron: a deep model for learning face identity and view representations,” in NIPS, 2014.

[22] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014, pp. 184–199.

[23] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in NIPS, 2014.

[24] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014.

[25] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in ICCV, 2011, pp. 2018–2025.

[26] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, “Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models,” in CVPR, 2014.

[27] M. Savva, A. X. Chang, and P. Hanrahan, “Semantically-Enriched 3D Models for Common-sense Knowledge,” CVPR 2015 Workshop on Functionality, Physics, Intentionality and Causality, 2015.

[28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980

[30] D. Sussillo, “Random walks: Training very deep nonlinear feed-forward networks with smart initialization,” CoRR, vol. abs/1412.6558, 2014. [Online]. Available: http://arxiv.org/abs/1412.6558

[31] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” CoRR, vol. abs/1502.01852, 2015. [Online]. Available: http://arxiv.org/abs/1502.01852

[32] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” in ICML, 2015, pp. 1462–1471. [Online]. Available: http://jmlr.org/proceedings/papers/v37/gregor15.html

[33] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in NIPS, 2015.

[34] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in ECCV, 2004.

[35] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, “SIFT flow: Dense correspondence across different scenes,” in ECCV, 2008.

[36] J. Kim, C. Liu, F. Sha, and K. Grauman, “Deformable spatial pyramid matching for fast dense correspondences,” in CVPR, 2013.

Alexey Dosovitskiy received his Specialist (equivalent of MSc, with distinction) and Ph.D. degrees in mathematics from Moscow State University in 2009 and 2012, respectively. His Ph.D. thesis is in the field of functional analysis, related to measures in infinite-dimensional spaces and representation theory. In summer 2012 he spent three months at the Computational Vision and Neuroscience Group at the University of Tübingen. Since September 2012 he has been a postdoctoral researcher with the Computer Vision Group at the University of Freiburg in Germany. His current main research interests are computer vision, machine learning and optimization.

Jost Tobias Springenberg is a PhD student in the machine learning lab at the University of Freiburg, Germany, supervised by Martin Riedmiller. Prior to starting his PhD, Tobias studied Cognitive Science at the University of Osnabrueck, earning his BSc in 2009. From 2009 to 2012 he then obtained an MSc in Computer Science from the University of Freiburg, focusing on representation learning with deep neural networks for computer vision problems. His research interests include machine learning, especially representation learning, and learning efficient control strategies for robotics.

Maxim Tatarchenko received his BSc with honors in Applied Mathematics from Russian State Technological University ”MATI” in 2011. He spent three years developing numerical algorithms for satellite navigation at the R&D company ’GPSCOM’, Moscow. He proceeded with master studies at the University of Freiburg and is now close to obtaining his MSc degree. In January 2016 he will join the Computer Vision Group of Prof. Dr. Thomas Brox as a PhD student. His main research interest is computer vision with a special focus on deep generative models for images.

Thomas Brox received his Ph.D. degree in computer science from Saarland University in Germany in 2005. He spent two years as a postdoctoral researcher at the University of Bonn and two years at the University of California at Berkeley. Since 2010, he has been heading the Computer Vision Group at the University of Freiburg in Germany. His research interests are in computer vision, in particular video analysis and deep learning for computer vision. Prof. Brox is associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He received the Longuet-Higgins Best Paper Award and the Koenderink Prize for Fundamental Contributions in Computer Vision.