An Introduction to Image Synthesis with Generative Adversarial Nets

He Huang, Philip S. Yu and Changhu Wang

Abstract—There has been a drastic growth of research in Generative Adversarial Nets (GANs) in the past few years. Proposed in 2014, GAN has been applied to various applications such as computer vision and natural language processing, and achieves impressive performance. Among the many applications of GAN, image synthesis is the most well-studied one, and research in this area has already demonstrated the great potential of using GAN in image synthesis. In this paper, we provide a taxonomy of methods used in image synthesis, review different models for text-to-image synthesis and image-to-image translation, and discuss some evaluation metrics as well as possible future research directions in image synthesis with GAN.

Index Terms—Deep Learning, Generative Adversarial Nets, Image Synthesis, Computer Vision.


1 INTRODUCTION

With recent advances in deep learning, machine learning algorithms have evolved to such an extent that they can compete with and even defeat humans in some tasks, such as image classification on ImageNet [1], playing Go [2] and Texas Hold'em poker [3]. However, we still cannot conclude that those algorithms have true "intelligence", since knowing how to do something does not necessarily mean understanding it, and it is critical for a truly intelligent agent to understand its tasks. "What I cannot create, I do not understand", said the famous physicist Richard Feynman. To put this quote in the context of machine learning, we can say that, for machines to understand their input data, they need to learn to create the data. The most promising approach is to use generative models that learn to discover the essence of data and find a best distribution to represent it. Also, with a learned generative model, we can even draw samples which are not in the training set but follow the same distribution.

As a new framework of generative model, Generative Adversarial Net (GAN) [4], proposed in 2014, is able to generate better synthetic images than previous generative models, and since then it has become one of the most popular research areas. A Generative Adversarial Net consists of two neural networks, a generator and a discriminator, where the generator tries to produce realistic samples that fool the discriminator, while the discriminator tries to distinguish real samples from generated ones. There are two main threads of research on GAN. One is the theoretical thread that tries to alleviate the instability and mode collapse problems of GAN [5] [6] [7] [8] [9] [10], or reformulate it from different angles like information theory [11] and energy-based models [12]. The other thread focuses on the applications of GAN in computer vision (CV) [5], natural language processing (NLP) [13] and other areas.

• He Huang and Philip S. Yu are with the Department of Computer Science, University of Illinois at Chicago, USA. Emails: {hehuang, psyu}@uic.edu

• Changhu Wang is with ByteDance AI Lab, China. Email: [email protected]

There is a great tutorial given by Goodfellow at NIPS 2016 on GAN [14], where he describes the importance of generative models, explains how GAN works, compares GAN with other generative models and discusses frontier research topics in GAN. Also, there is a recent review paper on GAN [15] which reviews several GAN architectures and training techniques, and introduces some applications of GAN. However, both of these papers are in a general scope, without going into details of specific applications. In this paper, we specifically focus on image synthesis, whose goal is to generate images, since it is by far the most studied area where GAN has been applied. Besides image synthesis, there are many other applications of GAN in computer vision, such as image in-painting [16], image captioning [17] [18] [19], object detection [20] and semantic segmentation [21].

Research on applying GAN in natural language processing is also a growing trend, covering tasks such as text modeling [13] [22] [23], dialogue generation [24], question answering [25] and neural machine translation [26]. However, training GAN on NLP tasks is more difficult and requires more techniques [13], which also makes it a challenging but intriguing research area.

The main goal of this paper is to provide an overview of the methods used in image synthesis with GAN and point out the strengths and weaknesses of current methods. We classify the main approaches in image synthesis into three categories, i.e. direct methods, hierarchical methods and iterative methods. Besides these most commonly used methods, there are also other methods which we will briefly mention. We then give a detailed discussion of two of the most important tasks in image synthesis, i.e. text-to-image synthesis and image-to-image translation. We also discuss the possible reasons why GAN performs so well in certain tasks and its role in our pursuit of artificial intelligence. The goal of this paper is to provide a simple guideline for those who want to apply GAN to their problems and to help further research in GAN.

The rest of this paper is organized as follows. In Section 2 we first review some core concepts of GAN, as well as some variants and training issues.

Then in Section 3 we introduce three main approaches and some other approaches used in image synthesis. In Section 4, we discuss several methods for text-to-image synthesis and some possible research directions for improvements. In Section 5, we first introduce supervised and unsupervised methods for image-to-image translation, and then turn to more specific applications like face editing, video prediction and image super-resolution. In Section 6 we review some evaluation metrics for synthetic images, while in Section 7 we discuss the discriminator's role as a learned loss function. Conclusions are given in Section 8.

2 GAN PRELIMINARIES

In this section, we review some core concepts of Generative Adversarial Nets (GANs) and some improvements.

A generative model G parameterized by θ takes as input a random noise vector z and outputs a sample G(z; θ), so the output can be regarded as a sample drawn from a distribution: G(z; θ) ∼ pg. Meanwhile, we have a lot of training data x drawn from pdata, and the training objective for the generative model G is to approximate pdata using pg.


Fig. 1. General structure of a Generative Adversarial Network, where the generator G takes a noise vector z as input and outputs a synthetic sample G(z), and the discriminator takes both the synthetic sample G(z) and a true sample x as inputs and predicts whether they are real or fake.

Generative Adversarial Net (GAN) [4] consists of two separate neural networks: a generator G that takes a random noise vector z and outputs synthetic data G(z), and a discriminator D that takes an input x or G(z) and outputs a probability D(x) or D(G(z)) to indicate whether it is synthetic or from the true data distribution, as shown in Figure 1. Both the generator and the discriminator can be arbitrary neural networks. The first GAN [4] uses fully connected layers as its building blocks. Later, DCGAN [5] proposes to use fully convolutional neural networks, which achieves better performance, and since then convolution and transposed convolution layers have become the core components in many GAN models. For more details on (transposed) convolution arithmetic, please refer to the report [27].

The original way to train the generator and discriminator is to form a two-player min-max game, where the generator G tries to generate realistic data to fool the discriminator while the discriminator D tries to distinguish between real and synthetic data [4]. The value function to be optimized is shown in Equation 1, where pdata(x) denotes the true data distribution and pz(z) denotes the noise distribution.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

However, when the discriminator is trained much better than the generator, D can reject the samples from G with confidence close to 1, and thus the loss log(1 − D(G(z))) saturates and G cannot learn anything from the zero gradient. To prevent this, instead of training G to minimize log(1 − D(G(z))), we can train it to maximize log D(G(z)) [4]. Although the new loss function for G gives a different scale of gradient than the original one, it still provides the same direction of gradient and does not saturate.
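As a concrete illustration, the following PyTorch-style sketch implements one training step of Equation 1 with the non-saturating generator loss described above. The module names (`G`, `D`) and `noise_dim` are assumptions for illustration, not code from the paper.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross-entropy on D's probability outputs

def gan_training_step(G, D, opt_G, opt_D, real_batch, noise_dim):
    """One alternating update of discriminator and generator."""
    batch_size = real_batch.size(0)
    device = real_batch.device
    real_labels = torch.ones(batch_size, 1, device=device)
    fake_labels = torch.zeros(batch_size, 1, device=device)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch_size, noise_dim, device=device)
    fake_batch = G(z).detach()          # do not backprop into G here
    d_loss = bce(D(real_batch), real_labels) + bce(D(fake_batch), fake_labels)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: non-saturating loss, maximize log D(G(z))
    z = torch.randn(batch_size, noise_dim, device=device)
    g_loss = bce(D(G(z)), real_labels)  # instead of minimizing log(1 - D(G(z)))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```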

2.1 Conditional GAN

In the original GAN, we have no control over what is to be generated, since the output depends only on the random noise. However, we can add a conditional input c to the random noise z so that the generated image is defined by G(c, z) [28]. Typically, the conditional input vector c is concatenated with the noise vector z, and the resulting vector is fed into the generator as it is in the original GAN. Besides, we can perform other data augmentation on c and z, as in [29]. The meaning of the conditional input c is arbitrary; for example, it can be the class of the image, attributes of an object [28] or an embedding of a text description of the image we want to generate [30] [31].
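A minimal sketch of this conditioning scheme, under assumed toy dimensions (the layer sizes and class names are illustrative, not from the paper): the condition vector c is simply concatenated with the noise vector z before being fed to the generator.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: concatenate condition c with noise z."""
    def __init__(self, noise_dim=100, cond_dim=10, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, c):
        # c can be a one-hot class label, an attribute vector or a text embedding
        return self.net(torch.cat([z, c], dim=1))

# usage: G = ConditionalGenerator(); x = G(torch.randn(8, 100), torch.eye(10)[:8])
```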

2.2 GAN with Auxiliary Classifier

In order to feed more side information and to allow for semi-supervised learning, one can add an additional task-specific auxiliary classifier to the discriminator, so that the model is optimized on the original task as well as the additional task [32] [6]. The architecture of such a method is illustrated in Figure 2, where C is the auxiliary classifier. Adding auxiliary classifiers allows us to use pre-trained models (e.g. image classifiers trained on ImageNet), and experiments in AC-GAN [33] demonstrate that such a method can help generate sharper images as well as alleviate the mode collapse problem. Using auxiliary classifiers can also help in applications such as text-to-image synthesis [34] and image-to-image translation [35].


Fig. 2. Architecture of GAN with auxiliary classifier, where y is the conditional input label and C is the classifier that takes the synthetic image G(y, z) as input and predicts its label y.
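A sketch of the two-headed discriminator in Figure 2 (layer sizes and names are illustrative assumptions): a shared feature extractor feeds both the real/fake head D and the auxiliary classification head C, and both losses are optimized together as in AC-GAN [33].

```python
import torch.nn as nn

class AuxClassifierDiscriminator(nn.Module):
    """Discriminator with an auxiliary classifier head (AC-GAN style)."""
    def __init__(self, in_dim=784, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, 256), nn.LeakyReLU(0.2))
        self.real_fake = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())  # D head
        self.classifier = nn.Linear(256, num_classes)                    # C head (logits)

    def forward(self, x):
        h = self.features(x)
        return self.real_fake(h), self.classifier(h)

# Training combines the adversarial loss on the first head with a
# cross-entropy classification loss on the second head, for both real
# and synthetic samples.
```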

2.3 GAN with Encoder

Although GAN can transform a noise vector z into a synthetic data sample G(z), it does not allow the inverse transformation. If we treat the noise distribution as a latent feature space for data samples, GAN lacks the ability to map a data sample x into a latent feature z. In order to allow such a mapping, two concurrent works, BiGAN [36] and ALI [37], propose to add an encoder E to the original GAN framework, as shown in Figure 3. Let Ωx be the data space and Ωz be the latent feature space; the encoder E takes x ∈ Ωx as input and produces a feature vector E(x) ∈ Ωz as output. The discriminator D is modified to take both a data sample and a feature vector as input to calculate P(Y|x, z), where Y = 1 indicates the sample is real and Y = 0 means the data is generated by G.


Fig. 3. Architecture of BiGAN/ALI

The objective is thus defined as:

$$\min_{G, E} \max_D V(G, E, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, E(x))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z), z))] \qquad (2)$$
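A sketch of the losses behind Equation 2 (function and module names are assumptions): the joint discriminator D scores (data, latent) pairs, receiving (x, E(x)) as "real" and (G(z), z) as "fake", while G and E are trained jointly to fool it.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def bigan_losses(G, E, D, x, z):
    """Losses for a BiGAN/ALI-style setup; D takes a (data, latent) pair.
    In practice, each loss is computed with the other parties' outputs
    detached before calling backward()."""
    real_pair = D(x, E(x))          # should be classified as real (Y = 1)
    fake_pair = D(G(z), z)          # should be classified as fake (Y = 0)
    d_loss = bce(real_pair, torch.ones_like(real_pair)) + \
             bce(fake_pair, torch.zeros_like(fake_pair))
    # G and E jointly try to flip D's decisions (the min part of Equation 2)
    ge_loss = bce(real_pair, torch.zeros_like(real_pair)) + \
              bce(fake_pair, torch.ones_like(fake_pair))
    return d_loss, ge_loss
```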

2.4 GAN with Variational Auto-Encoder


Fig. 4. Architecture of VAE-GAN

VAE-GAN [38] proposes to combine the Variational Auto-Encoder (VAE) [39] with GAN [4] to exploit both of their benefits, as GAN can generate sharp images but often misses some modes, while images produced by VAE [39] are blurry but have large variety. The architecture of VAE-GAN is shown in Figure 4. The VAE part regularizes the encoder E by imposing a prior of normal distribution (e.g. z ∼ N(0, 1)), and the VAE loss term is defined as:

$$\mathcal{L}_{VAE} = -\mathbb{E}_{z \sim q(z|x)}[\log p(x|z)] + D_{KL}(q(z|x) \,\|\, p(z)), \qquad (3)$$

where z ∼ E(x) = q(z|x), x ∼ G(z) = p(x|z) and D_{KL} is the Kullback-Leibler divergence.

Also, VAE-GAN [38] proposes to express the reconstruction loss of VAE in terms of the discriminator D. Let D_l(x) denote the representation of the l-th layer of the discriminator; a Gaussian observation model can then be defined as:

$$p(D_l(x) \,|\, z) = \mathcal{N}(D_l(x) \,|\, D_l(\tilde{x}), I), \qquad (4)$$

where x̃ ∼ G(z) is a sample from the generator, and I is the identity matrix. So the new VAE loss is:

$$\mathcal{L}_{VAE} = -\mathbb{E}_{z \sim q(z|x)}[\log p(D_l(x)|z)] + D_{KL}(q(z|x) \,\|\, p(z)), \qquad (5)$$

which is then combined with the GAN loss defined in Equation 1. Experiments demonstrate that VAE-GAN can generate better images than VAE or GAN alone.
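A minimal sketch of the VAE part of this loss, assuming an encoder that outputs the mean and log-variance of q(z|x) and a discriminator whose l-th layer features are available (names are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def vaegan_vae_loss(mu, logvar, d_feat_real, d_feat_recon):
    """Equation 5: feature-wise reconstruction + KL to the N(0, I) prior.

    mu, logvar   : outputs of the encoder q(z|x)
    d_feat_real  : D_l(x), l-th layer discriminator features of the real image
    d_feat_recon : D_l(x_tilde), same features for the reconstruction G(E(x))
    """
    # Gaussian observation model of Eq. 4 -> squared error on D's features
    recon = F.mse_loss(d_feat_recon, d_feat_real, reduction='sum')
    # Closed-form KL(q(z|x) || N(0, I))
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# The reparameterization trick z = mu + exp(0.5 * logvar) * eps is used to
# sample z, and the GAN loss of Equation 1 is added for G and D.
```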

2.5 Handling Mode Collapse

Although GAN is very effective in image synthesis, its training process is very unstable and requires a lot of tricks to get a good result, as pointed out in [4] [5]. Besides its instability in training, GAN also suffers from the mode collapse problem, as discussed in [4] [5] [40]. In the original GAN formulation [4], the discriminator does not need to consider the variety of synthetic samples, but only focuses on telling whether each sample is realistic or not, which makes it possible for the generator to spend its effort on generating a few samples that are good enough to fool the discriminator. For example, although the MNIST [41] dataset contains images of digits from 0 to 9, in an extreme case a generator only needs to learn to generate one of the ten digits perfectly to completely fool the discriminator, and then the generator stops trying to generate the other nine digits. The absence of the other nine digits is an example of inter-class mode collapse. An example of intra-class mode collapse is that there are many writing styles for each of the digits, but the generator only learns to generate one perfect sample for each digit to successfully fool the discriminator.

Many methods have been proposed to address the mode collapse problem. One technique is called minibatch features [6], whose idea is to make the discriminator compare an example to a minibatch of true samples as well as a minibatch of generated samples. In this way, the discriminator can learn to tell if a generated sample is too similar to some other generated samples by measuring the samples' distances in latent space. Although this method works well, as discussed in [14], the performance largely depends on what features are used in the distance calculation. MRGAN [7] proposes to add an encoder which transforms a sample in data space back to latent space, as in BiGAN [36]. The combination of encoder and generator acts as an auto-encoder, whose reconstruction loss is added to the adversarial loss to act as a mode regularizer. Meanwhile, the discriminator is also trained to discriminate reconstructed samples, which acts as another mode regularizer. WGAN [8] proposes to use the Wasserstein distance to measure the similarity between the true data distribution and the learned distribution, instead of the Jensen-Shannon divergence used in the original GAN [4]. Although it theoretically avoids mode collapse, it takes a longer time for the model to converge than previous GANs. To alleviate this problem, WGAN-GP [9] proposes to use a gradient penalty instead of the weight clipping in WGAN. WGAN-GP generally produces good images and greatly alleviates mode collapse, and it is easy to apply this training framework to other GAN models. More tips for training GANs can be found in Soumith's NIPS 2016 tutorial "How to train a GAN"1.

1. https://github.com/soumith/ganhacks
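As an illustration of the gradient-penalty idea in WGAN-GP [9], here is a common sketch (not the authors' code; the `critic` module and image-shaped tensors are assumptions): the penalty pushes the gradient norm of the critic towards 1 on points interpolated between real and fake samples.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP style penalty: lambda * E[(||grad critic(x_hat)||_2 - 1)^2]."""
    batch_size = real.size(0)
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)  # per-sample mixing weight
    x_hat = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Added to the critic loss: E[critic(fake)] - E[critic(real)] + gradient_penalty(...)
```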


Fig. 5. Three approaches to image synthesis using Generative Adversarial Networks. The direct method does everything with only one generator and one discriminator, while the other two methods have multiple generators and discriminators. Hierarchical methods usually use two layers of GANs, where each GAN plays a fundamentally different role from the other one. Iterative methods, however, contain multiple GANs that perform the same task but operate at different resolutions.


Fig. 6. Building blocks of DCGAN, where the generator uses transposed convolution, batch normalization and ReLU activation, while the discriminator uses convolution, batch normalization and LeakyReLU activation.

3 GENERAL APPROACHES OF IMAGE SYNTHESIS WITH GAN

In this section, we summarize the three main approaches used in generating images, i.e. direct methods, hierarchical methods and iterative methods, which form the basis of all applications mentioned in this paper. The overall structures of these three methods are shown in Figure 5.

3.1 Direct Methods

All methods under this category follow the philosophy of using one generator and one discriminator in their models, and the structures of the generator and the discriminator are straightforward, without branches. Many of the earliest GAN models fall into this category, like GAN [4], DCGAN [5], ImprovedGAN [6], InfoGAN [11], f-GAN [42] and GAN-INT-CLS [30]. Among them, DCGAN is one of the most classic, and its structure is used by many later models such as [11] [30] [43] [19]. The general building blocks used in DCGAN are shown in Figure 6, where the generator uses transposed convolution, batch normalization and ReLU activation, while the discriminator uses convolution, batch normalization and LeakyReLU activation.
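The building blocks in Figure 6 translate directly into code; a minimal sketch follows, with channel sizes, kernel size and stride chosen only for illustration.

```python
import torch.nn as nn

def generator_block(in_ch, out_ch):
    """DCGAN generator block: transposed conv -> batch norm -> ReLU."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def discriminator_block(in_ch, out_ch):
    """DCGAN discriminator block: conv -> batch norm -> LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

# A generator typically upsamples a projected noise tensor through a stack of
# generator_block calls, while the discriminator mirrors it with
# discriminator_block calls followed by a final convolution.
```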

This kind of method is relatively more straightforward to design and implement when compared with hierarchical and iterative methods, and it usually achieves good results.

3.2 Hierarchical Methods

Contrary to the Direct Method, algorithms under the Hierarchical Method use two generators and two discriminators in their models, where different generators have different purposes. The idea behind those methods is to separate an image into two parts, like "styles & structure" and "foreground & background". The relation between the two generators may be either parallel or sequential.

SS-GAN [44] proposes to use two GANs, a Structure-GAN for generating a surface normal map from random noise z, and a Style-GAN that takes both the generated surface normal map and a noise vector z as input and outputs an image. The Structure-GAN uses the same building blocks as DCGAN [5], while the Style-GAN is slightly different. For the Style-Generator, the generated surface normal map and the noise vector go through several convolutional and transposed convolutional layers respectively, and then the results are concatenated into a single tensor which goes through the remaining layers of the Style-Generator. As for the Style-Discriminator, each surface normal map and its corresponding image are concatenated along the channel dimension to form a single input to the discriminator. Besides, SS-GAN assumes that a good synthetic image should also be usable to reconstruct a good surface normal map. Under this assumption, SS-GAN designs a fully-connected network that transforms an image back to its surface normal map, and uses a pixel-wise loss that enforces the reconstructed surface normal map to approximate the true one. A main limitation of SS-GAN is that it requires a Kinect to obtain ground truth for the surface normal maps.

As a special example, LR-GAN [45] chooses to generate the foreground and background content using different generators, while only one discriminator is used to judge the images, and its recurrent image generation process is related to the iterative method. Nonetheless, experiments on LR-GAN demonstrate that it is possible to separate the generation of foreground and background content and produce sharper images.

Page 5: 1 An Introduction to Image Synthesis with Generative Adversarial … · 2018. 11. 20. · 1 An Introduction to Image Synthesis with Generative Adversarial Nets He Huang, Philip S

5

3.3 Iterative Methods

This method differentiates itself from Hierarchical Methods in two ways. First, instead of using two different generators that perform different roles, the models in this category use multiple generators that have similar or even the same structures, and they generate images from coarse to fine, with each generator refining the details of the results from the previous generator. Second, when using the same structures in the generators, Iterative Methods can use weight-sharing among the generators [45], while Hierarchical Methods usually cannot.

LAPGAN [40] is the first GAN that uses an iterative method to generate images from coarse to fine using a Laplacian pyramid [46]. The multiple generators in LAPGAN perform the same task: each takes an image from the previous generator and a noise vector as input, and then outputs the details (a residual image) that make the image sharper when added to the input image. The only difference in the structures of those generators is the size of the input/output dimensions, while an exception is that the generator at the lowest level only takes a noise vector as input and outputs an image. LAPGAN outperforms the original GAN [4] and shows that the iterative method can generate sharper images than the direct method.

StackGAN [29], as an iterative method, has only two layers of generators. The first generator takes an input (z, c) and outputs a blurry image that shows a rough shape and blurry details of the objects, while the second generator takes (z, c) and the image generated by the previous generator and outputs a larger image with more photo-realistic details.

Another example of Iterative Methods is SGAN [47], which stacks generators that take lower-level features as input and output higher-level features, while the bottom generator takes a noise vector as input and the top generator outputs an image. The necessity of using separate generators for different levels of features is that SGAN associates an encoder, a discriminator and a Q-network [47] (which is used to predict the posterior probability P(zi|hi) for entropy maximization, where hi is the output feature of the i-th layer) with each generator, so as to constrain and improve the quality of those features.

An example of using weight-sharing is the GRAN [48] model, which is an extension of the DRAW [49] model based on the variational autoencoder [39]. As in DRAW, GRAN generates an image in a recurrent way that feeds the output of the previous step into the model, and the output of the current step is fed back as the input of the next step. All steps use the same generator, so the weights are shared among them, just like in a classic Recurrent Neural Network (RNN).

3.4 Other Methods

PPGN [50] produces impressive images in several tasks, such as class-conditioned image synthesis [28], text-to-image synthesis [30] and image inpainting [16]. Different from the other methods mentioned earlier, PPGN uses activation maximization [51] to generate images, and it is based on sampling with a prior learned with a denoising autoencoder (DAE) [52]. To generate an image conditioned on a certain class label y, instead of using a feed-forward pass (e.g. recurrent methods can be seen as feed-forward if unfolded through time), PPGN runs an optimization process that finds an input z to the generator that makes the output image highly activate a certain neuron in another pretrained classifier (in this case, the neuron in the output layer that corresponds to the class label y).

In order to generate better high-resolution images, ProgressiveGAN [53] proposes to start with training a generator and discriminator on 4×4 pixel images, after which it incrementally adds extra layers that double the output resolution up to 1024×1024. This approach allows the model to learn coarse structure first and then focus on refining details later, instead of having to deal with all details at different scales simultaneously.

4 TEXT-TO-IMAGE SYNTHESIS

When we apply GAN to image synthesis, it is desirable to control the content of the generated images. Although there are label-conditioned GAN models like cGAN [28] which can generate images belonging to a specific class, it remains a great challenge to generate images based on text descriptions. Text-to-image synthesis is something of a holy grail of computer vision, since if an algorithm can generate truly realistic images from mere text descriptions, we can have high confidence that the algorithm actually understands what is in the images, given that computer vision is about teaching computers to see and understand visual content in the real world.

GAN-INT-CLS [30] provides the first attempt at using GAN to generate images from text descriptions. The idea is similar to conditional GAN, which concatenates the condition vector with the noise vector, but with the difference of using the embedding of a text sentence instead of class labels or attributes. The embedding method used in GAN-INT-CLS [30] is from another paper [54] that tries to learn robust embeddings of images and sentences given image-sentence pairs. Except for the sentence embedding method, the generator of GAN-INT-CLS follows the same architecture as DCGAN [5]. As for the discriminator, in order to take into account the text description, the text embedding vector of length K is spatially replicated to become a text embedding tensor of shape [W × H × K], where W and H are the width and height of the generated image's feature tensor after going through several convolutional layers in the discriminator. Then the text embedding tensor is combined with the image feature tensor of shape [W × H × C] along the channel dimension C and forms a new tensor of shape [W × H × (C + K)], which then goes through the remaining layers of the discriminator. The intuition behind this approach is that, by spatial replication and depth concatenation, each "pixel" (of shape [1 × 1 × (C + K)]) of the image's feature tensor contains all the information of the text description, and the convolutional layers can then learn to align the image's content with certain parts of the text feature by using multiple convolution kernels.
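A sketch of this spatial replication and depth concatenation step (the channel-first tensor layout and function name are assumptions for illustration):

```python
import torch

def fuse_text_with_image_features(img_feat, txt_emb):
    """img_feat: [B, C, H, W] image feature map inside the discriminator.
    txt_emb : [B, K] sentence embedding.
    Returns a [B, C + K, H, W] tensor where every spatial location sees
    the full text embedding."""
    B, C, H, W = img_feat.shape
    txt = txt_emb.view(B, -1, 1, 1).expand(-1, -1, H, W)  # spatial replication
    return torch.cat([img_feat, txt], dim=1)               # depth concatenation

# usage: fused = fuse_text_with_image_features(torch.randn(4, 512, 4, 4),
#                                              torch.randn(4, 128))
```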

GAN-INT-CLS [30] also proposes to distinguish between two sources of errors: unrealistic images with any text, and realistic images with mismatched text. To train the discriminator to distinguish these two kinds of errors, three types of input are fed into the discriminator at every training step: {real image, right text}, {real image, wrong text} and {fake image, right text}. The experimental results in [30] show that such a training technique is important in generating high-quality images, since it tells the model not only how to generate realistic images, but also the correspondence between text and images.

In addition, GAN-INT-CLS [30] proposes to use manifold interpolation to obtain more text embeddings, by simply interpolating between captions in the training set. Since the number of captions for each image is usually no more than five and an image can be described in many ways, doing such interpolation allows the model to learn from possible text descriptions that are not in the training set. According to the authors, the interpolated text embeddings need not correspond to actual human-written text, so no extra labeling is required.

TAC-GAN [34] is a combination of GAN-INT-CLS [30] and AC-GAN [33]. With the auxiliary classifier, TAC-GAN is able to achieve a higher Inception Score [6] than GAN-INT-CLS [30] and StackGAN [29] on the Oxford-102 [55] dataset.

4.1 Text-to-Image with Location Constraints

Although GAN-INT-CLS [30] and StackGAN [29] can generate images based on text descriptions, they fail to capture the localization constraints of the objects in images. To allow encoding spatial constraints, Reed et al. propose GAWWN [31], which presents two possible solutions.

The first approach proposed in GAWWN [31] is to learn a bounding box for an object by putting a spatially replicated text embedding tensor through a Spatial Transformer Network [56]. Here the spatial replication is the same process mentioned in GAN-INT-CLS [30]. The output of the spatial transformer network is a tensor with the same dimensions as the input, but the values outside of the bounding box are all zeros. The output tensor of the spatial transformer goes through several convolutional layers to reduce its size back to a 1-dimensional vector, which not only preserves the information of the text but also provides a constraint on the object's location through the bounding box. A benefit of this approach is that it is end-to-end, without requiring additional input.

The second approach proposed in GAWWN [31] is to use user-specified keypoints to constrain the different parts (e.g. head, leg, arm, tail, etc.) of the object in the image. For each keypoint, a mask matrix is generated in which the keypoint position is 1 and the others are 0, and all the matrices are combined through depth concatenation to form a mask tensor of shape [M × M × K], where M is the size of the mask and K is the number of keypoints. The tensor is then flattened into a binary matrix with 1 indicating the presence of a keypoint and 0 otherwise, and then replicated depth-wise to become a tensor that is fed into the remaining layers. Although this approach allows more detailed constraints on the object, it requires extra user input to specify the keypoints. Even if GAWWN can infer unobserved keypoints from a few user-specified keypoints so that the user does not need to specify all keypoints, the cost of extra user input is still non-trivial.
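A sketch of how such a keypoint mask tensor could be built (shapes follow the description above; the function name and coordinate convention are assumptions):

```python
import torch

def keypoint_mask_tensor(keypoints, mask_size):
    """keypoints: list of K (row, col) coordinates on an M x M grid.
    Returns an [M, M, K] binary tensor with a 1 at each keypoint location."""
    K = len(keypoints)
    masks = torch.zeros(mask_size, mask_size, K)
    for k, (r, c) in enumerate(keypoints):
        masks[r, c, k] = 1.0
    return masks

# e.g. keypoint_mask_tensor([(3, 4), (10, 12)], mask_size=16) -> shape [16, 16, 2]
```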

The remaining part of GAWWN [31] is similar to GAN-INT-CLS [30], with the difference of using a separate pathway to process the bounding box tensor or keypoint tensor.

Although GAWWN provides two approaches that can enforce location constraints on the generated images, it only works on images with single objects, since neither of the proposed methods is able to handle several different objects in an image. From the results shown in [31], GAWWN works well on the CUB dataset [57], while the synthetic images generated by the model trained on the MPII Human Pose (MHP) dataset [58] are very blurry and it is hard to tell what the content is. This may be due to the fact that the poses of a standing bird are very similar to each other (note that the birds in the CUB dataset are in standing poses), while a human's poses can be of countless types.

The main benefit of specifying the locations of each part of an object is that it yields more interpretable results; it is also desirable that the model can understand the concepts of different parts of objects, which is one of the ultimate goals of computer vision.

4.2 Text-to-Image with Stacked GANs

Instead of using only one generator, StackGAN [29] proposes to use two different generators for text-to-image synthesis. The first generator is responsible for generating low-resolution images that contain rough shapes and colors of objects, while the second generator takes the output of the first generator and produces images with higher resolution and sharper details. Each generator is associated with its own discriminator. Besides, StackGAN also proposes a conditional data augmentation technique to produce more text embeddings for the generator. StackGAN randomly samples from a Gaussian distribution N(µ(φt), Σ(φt)), where the mean vector µ(φt) and diagonal covariance matrix Σ(φt) are functions of the text embedding φt. To further enforce smoothness over the conditioning manifold, N(µ(φt), Σ(φt)) is constrained to approximate a standard Gaussian distribution N(0, 1) by adding a Kullback-Leibler divergence regularization term.
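A sketch of this conditioning augmentation (layer names and dimensions are assumptions): the text embedding is mapped to a mean and a log-variance, a latent conditioning vector is sampled with the reparameterization trick, and a KL term towards the standard Gaussian is returned as a regularizer.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a conditioning vector c ~ N(mu(phi_t), Sigma(phi_t))."""
    def __init__(self, emb_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(emb_dim, cond_dim * 2)  # predicts mu and log-variance

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl  # kl is added to the generator loss as a regularization term
```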

As an improved version of StackGAN, StackGAN++ [59] proposes to have more pairs of generators and discriminators instead of just two, adds an unconditional image synthesis loss to the discriminators, and uses a color-consistency regularization term calculated as the mean-square loss between the means and variances of real and fake images.

AttnGAN [60] further extends the architecture of Stack-GAN++ [59] by using attention mechanism over image andtext features. In AttnGAM, each sentence is embedded into aglobal sentence vector and each word of the sentence is alsoembedded into a word vector. The global sentence vector isused to generate a low resolution image at the first stage,and then the following stages use the input image featuresfrom the previous stage and the word vectors as input to theattention layer and calculate a word-context vector whichwill be combined with the image features and form theinput to the generator that will generate new image features.Besides, AttnGAN also proposes a Deep Attentional Multi-modal Similarity Model (DAMSM) that uses attention layersto compute the similarity between images and text usingboth global sentence vectors as well as fine-grained wordvectors, which provides an additional fine-grained image-text matching loss for training the generator. Experiments of


AttnGAN not only show the effectiveness of using attention layers in image synthesis, but also make the model more interpretable.
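
A simplified sketch of the word-level attention used to form the word-context vectors is given below; shapes are illustrative, and the linear projection AttnGAN applies to the word features before attention is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def word_context(image_feat, word_embs):
    """image_feat: (B, D, N) region features from the previous stage;
    word_embs: (B, D, T) word vectors. Returns a (B, D, N) word-context
    vector for every image region."""
    attn = torch.bmm(image_feat.transpose(1, 2), word_embs)   # (B, N, T) similarities
    attn = F.softmax(attn, dim=-1)                            # attend over the T words
    return torch.bmm(word_embs, attn.transpose(1, 2))         # (B, D, N) contexts
```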

With stacked generators, StackGAN [29], StackGAN++ [59] and AttnGAN [60] produce sharper images than GAN-INT-CLS [30] and GAWWN [31] on the CUB [57] and Oxford-102 [55] datasets. Although AttnGAN [60] claims a significantly higher Inception Score [6] than PPGN [50] on the COCO [61] dataset, the examples it provides do not look visually better.

4.3 Text-to-Image by Iterative Sampling

Different from previous approaches that directly incorporate the text information in the generation process, PPGN [50] proposes to use the activation maximization [51] method to generate images by iterative sampling. The model contains two separate parts: a pretrained image captioning model, and an image generator. The image generator is a combination of a denoising auto-encoder and GAN, trained independently of the image captioning model. Let p(x) be the distribution of images and p(y) the distribution of text descriptions; we want to sample an image from the joint distribution p(x, y). PPGN factorizes the joint distribution into two factors: p(x, y) = p(x)p(y|x), where p(x) is modeled by the generator and p(y|x) is modeled by the image captioning model. According to the paper [50], given a trained image generator and image captioning model, the following iterative sampling process is performed to obtain an image x based on the description y∗:

xt+1 = xt + ε1 ∂ log p(xt)/∂xt + ε2 ∂ log p(y = y∗|xt)/∂xt + N(0, ε3²),    (6)

where the ε1 term takes a step from the current image xt toward a more realistic image (regardless of the text description), the ε2 term takes a step from the current image toward an image that better matches the description y∗ (here the LRCN model [62] is used for the p(y = y∗|x) term), and the ε3 term adds a small noise to allow a broader search in the latent space.
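
One update of this sampling process can be sketched as below; the step sizes are illustrative, and the two log-probability terms are assumed to be differentiable functions provided by the generator prior and the captioning model, rather than PPGN's exact implementation.

```python
import torch

def ppgn_step(x, log_p_x, log_p_y_given_x, eps1=1e-2, eps2=1.0, eps3=1e-5):
    """One iterative sampling step in the spirit of Equation 6."""
    x = x.detach().requires_grad_(True)
    grad_real = torch.autograd.grad(log_p_x(x), x)[0]          # toward realistic images
    grad_cond = torch.autograd.grad(log_p_y_given_x(x), x)[0]  # toward matching y*
    noise = eps3 * torch.randn_like(x)                         # N(0, eps3^2) exploration
    return (x + eps1 * grad_real + eps2 * grad_cond + noise).detach()
```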

Although such an iterative sampling method takes more time to generate an image at test time, it can generate higher-resolution images with better quality than previous methods like [30] [31] [29], and its performance is among the best in both class-conditioned and text-conditioned image synthesis.

4.4 Limitations of Current Text-to-Image Models

Present text-to-image models perform well on datasets with a single object per image, such as human faces in CelebA [63], birds in CUB [57], flowers in Oxford-102 [55], and some objects in ImageNet [1]. Also, they can synthesize reasonable images for scenes like bedrooms and living rooms in LSUN [64], even though the objects in the scenes lack sharp details. However, all present models work badly in situations where multiple complicated objects are involved in one image, as is the case in the MS-COCO dataset [61].

A plausible reason why current models fail to work well on complicated images is that the models only learn the overall features of an image, instead of learning the concept of each kind of object in it. This explains why synthetic scenes of bedrooms and living rooms lack sharp details: the model does not distinguish between a bed and a desk; all it sees is that some patterns of shapes and colors should be put somewhere in the synthetic image. In other words, the model does not really understand the image, but just remembers where to put some shapes and colors.

Generative Adversarial Nets certainly provide a promising way to do text-to-image synthesis, since they produce sharper images than any other generative method so far. To take a further step in text-to-image synthesis, we need to figure out novel ways to teach the algorithms the concepts of things. One possible way is to train separate models that can generate different kinds of objects, and then train another model that learns how to combine different objects (i.e. the reasonable relations between objects) into one image based on the text descriptions. However, such a method requires large training sets for the different objects, and another large dataset that contains images of those different objects together, which is hard to acquire. Another possible direction may be to make use of the Capsule idea proposed by Hinton et al. [65] [66], since capsules are designed to capture the concepts of objects, but how to efficiently train such a capsule-based network is still a problem to be solved.

5 IMAGE-TO-IMAGE TRANSLATION

In this section, we discuss another type of image synthesis, image-to-image translation, which takes images as conditional input. Image-to-image translation is defined as the problem of translating a possible representation of one scene into another, such as mapping BW images into RGB images, or the other way around [67]. This problem is related to style transfer [68] [69], which takes a content image and a style image and outputs an image with the content of the content image and the style of the style image. Image-to-image translation can be viewed as a generalization of style transfer, since it is not limited to transferring the styles of images, but can also manipulate attributes of objects (as in face editing applications). In this section, we introduce several models for general image-to-image translation, from supervised methods to unsupervised ones. The "supervision" here means that for each image in the source domain, there is a corresponding ground-truth image in the target domain. Later we turn to specific applications such as face editing, image super-resolution, image in-painting and video prediction.

5.1 Supervised Image-to-Image Translation

Pix2Pix [67] proposes to combine the loss of a conditional Generative Adversarial Network (cGAN) with an L1 reconstruction loss, so that the generator is not only trained to fool the discriminator but also to generate images as close to the ground truth as possible. The reason for using L1 instead of L2 is that L1 produces less blurry images [70].

The conditional GAN loss is defined as:

LcGAN(G, D) = Ex,y∼pdata(x,y)[log D(x, y)] + Ex∼pdata(x),z∼pz(z)[log(1 − D(x, G(x, z)))],    (7)

Page 8: 1 An Introduction to Image Synthesis with Generative Adversarial … · 2018. 11. 20. · 1 An Introduction to Image Synthesis with Generative Adversarial Nets He Huang, Philip S

8

where x, y ∼ pdata(x, y) are images of the same scene with different styles, and z ∼ pz(z) is random noise as in the regular GAN [4].

The L1 loss that constrains the output to stay close to the ground truth is defined as:

LL1(G) = Ex,y∼pdata(x,y),z∼pz(z)[||y − G(x, z)||1],    (8)

The overall objective is thus given by:

G∗, D∗ = arg min_G max_D LcGAN(G, D) + λLL1(G),    (9)

where λ is a hyper-parameter that balances the two loss terms. The authors of Pix2Pix [67] find that the noise z does not have an obvious effect on the output, so they provide the noise in the form of dropout at training and test time instead of drawing samples from a random distribution.

The generator structure of Pix2Pix [67] is based on U-Net [71], which belongs to the encoder-decoder framework but adds skip connections from encoder to decoder so that low-level information such as object edges can bypass the bottleneck.

Pix2Pix [67] proposes PatchGAN (the patch-based idea was previously explored in MGAN [72]) as the discriminator, which, instead of classifying the whole image, tries to classify each N × N patch of the image and averages all the patch scores to get the final score for the image. The motivation of this design is that, although L1 and L2 losses produce blurry images and fail to capture high-frequency details, in many cases they can capture the low frequencies quite well. In order to capture the high-frequency details, Pix2Pix [67] argues that it is sufficient to restrict the discriminator to focus only on local patches. Experiments show that, for a 256×256 image, a patch size of 70×70 works best.
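
The discriminator can then be a small fully convolutional network whose output is a grid of patch logits; the depth and channel counts below are illustrative and do not correspond to the exact 70×70 configuration of [67].

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator sketch: each output logit scores one local
    patch of the concatenated (input, output) pair; losses are averaged over
    the logit grid."""
    def __init__(self, in_ch=6):                 # input and output images concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),   # one logit per patch
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))
```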

Although Pix2Pix produces very impressive synthetic images, its major limitation is that it must use paired images as supervision, as shown in Equation 8, where the data pair (x, y) is drawn from the joint distribution pdata(x, y).

5.2 Supervised Image-to-Image Translation with Pair-wise Discrimination

PLDT [35] proposes another method for supervised image-to-image translation, by adding another discriminator Dpair that learns to tell whether a pair of images from different domains is associated with each other. The architecture of PLDT is shown in Figure 7. Given an input image xs from the source domain, its ground-truth image xt in the target domain, an irrelevant image x−t in the target domain, and the image x̂t produced by the generator G transferring xs into the target domain, the loss for Dpair can be defined as:

Lpair = −t · log[Dpair(xs, x)] + (t − 1) · log[1 − Dpair(xs, x)],
s.t.  t = 1 if x = xt,  t = 0 if x = x̂t,  t = 0 if x = x−t.    (10)

The generator of PLDT [35] is implemented in an encoder-decoder fashion using (transposed) convolutions, while the two discriminators are implemented as fully convolutional networks. As shown in the experimental results, PLDT [35] performs domain transfer that modifies the geometric shapes of objects while trying to keep the texture consistent among all associated images.

Fig. 7. Architecture of Pixel-Level Domain Transfer (PLDT) [35]. In this example, the source domain is the BW color space, while the target domain is the RGB space. As explained in [35], this model can perform geometric modifications on the input image, while keeping the object in the output image the same as in the input.

5.3 Unsupervised Image-to-Image Translation with Cyclic Loss

Two concurrent works, CycleGAN [73] and DualGAN [74], propose to add a self-consistency (reconstruction) loss that tries to preserve the input image after a cycle of transformations. CycleGAN and DualGAN share the same framework, which is shown in Figure 8. As we can see, the two generators GAB and GBA perform opposite transformations, which can be seen as a kind of dual learning [75]. Besides, DiscoGAN [76] is another model that utilizes the same cyclic framework as Figure 8.

Here we use CycleGAN as an example. In CycleGAN, there are two generators, GAB, which transfers an image from domain A to B, and GBA, which performs the opposite transformation. There are also two discriminators, DA and DB, that predict whether an image belongs to the corresponding domain. For the pair of GAB and DB, the adversarial loss function is defined as:

LGAN(GAB, DB) = Eb∼pB(b)[log DB(b)] + Ea∼pA(a)[log(1 − DB(GAB(a)))],    (11)

and similarly for the pair GBA and DA, we can define the adversarial loss as LGAN(GBA, DA).

Besides the adversarial loss, a cycle-consistency loss is designed to minimize the reconstruction error after we translate an image of one domain to the other and then translate it back to the original domain, i.e. a → GAB(a) → GBA(GAB(a)) ≈ a. Since this cycle can be defined in two directions, the cycle-consistency loss is defined as:

Lcyc(GAB, GBA) = Ea∼pA(a)[||a − GBA(GAB(a))||1] + Eb∼pB(b)[||b − GAB(GBA(b))||1].    (12)

Then the overall loss function is:

L(GAB, GBA, DA, DB) = LGAN(GAB, DB) + LGAN(GBA, DA) + λLcyc(GAB, GBA),    (13)


Fig. 8. Framework of CycleGAN and DualGAN. A and B are two different domains. There is a discriminator for each domain that judges whether an image belongs to that domain. Two generators are designed to translate an image from one domain to the other. There are two cycles of data flow: the red one performs the sequence of domain transfers A → B → A, while the blue one is B → A → B. An L1 loss is applied on the input a (or b) and the reconstructed input GBA(GAB(a)) (or GAB(GBA(b))) to enforce self-consistency.

where λ is a hyper-parameter to balance the losses. Then the objective is to solve for:

G∗AB, G∗BA = arg min_{GAB,GBA} max_{DA,DB} L(GAB, GBA, DA, DB).    (14)

Although CycleGAN [73] and DualGAN [74] have the same objective, they use different implementations for the generators. CycleGAN uses the generator structure proposed in [68], while DualGAN follows the U-Net [71] structure as in [67]. Both CycleGAN and DualGAN use the PatchGAN of size 70×70 as in [67].
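
A sketch of the generator-side objective combining the adversarial terms with the cycle-consistency loss of Equation 12 is given below; it uses the least-squares form of the adversarial term described next (Equation 15), and λ = 10 is an illustrative weight rather than a setting we take from the papers.

```python
import torch.nn.functional as F

def cyclegan_generator_loss(G_AB, G_BA, D_A, D_B, a, b, lam=10.0):
    """Adversarial terms push each translated image toward its discriminator's
    'real' output; cycle terms reconstruct the input after a round trip."""
    fake_b, fake_a = G_AB(a), G_BA(b)
    adv = ((D_B(fake_b) - 1) ** 2).mean() + ((D_A(fake_a) - 1) ** 2).mean()
    cyc = F.l1_loss(G_BA(fake_b), a) + F.l1_loss(G_AB(fake_a), b)
    return adv + lam * cyc
```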

Besides different generator architectures, CycleGAN [73] and DualGAN [74] also use different techniques to stabilize the training process. DualGAN follows the training procedure proposed in WGAN [8]. CycleGAN applies two techniques. First, it replaces the log loss [77] for LGAN in Equation 11 with a least-squares loss that in practice is more stable and produces higher quality images:

LLSGAN(GAB, DB) = Eb∼pB(b)[(DB(b) − 1)²] + Ea∼pA(a)[(DB(GAB(a)))²].    (15)

The second technique used in CycleGAN [73] is that, in order to reduce model oscillation [4], CycleGAN follows SimGAN's [78] strategy and updates the discriminators DA and DB using a history of 50 previously generated images instead of only the ones produced by the latest generators.

Experiments with CycleGAN demonstrate the potential of performing high-quality image-to-image translation using unpaired data only, even though supervised methods like Pix2Pix [67] still outperform CycleGAN by a noticeable margin. CycleGAN also conducts experiments that show the importance of using both cycles in the cycle-consistency loss defined in Equation 12. However, failure cases provided in [73] show that, just like Pix2Pix [67], CycleGAN does not work well in cases that necessitate geometric transformations, such as apple ↔ orange and cat ↔ dog. More examples are available on CycleGAN's project website2.

2. https://junyanz.github.io/CycleGAN/

5.4 Unsupervised Image-to-Image Translation with Distance Constraint

DistanceGAN [79] discovers that the distance ||xi − xj|| between two images in the source domain A is highly positively correlated with the distance of their counterparts ||GAB(xi) − GAB(xj)|| in the target domain B, which can be seen from Figure 1 of the paper [79]. According to [79], let dk be the distance ||xi − xj|| and d'k be ||GAB(xi) − GAB(xj)||; a high correlation indicates that Σk dk d'k should also be high. The pair-wise distances dk in the source domain are fixed, and maximizing Σk dk d'k causes the dk with large values to dominate the loss, which is undesirable. So the authors propose to minimize Σk |dk − d'k| instead.

DistanceGAN [79] proposes to use a pair-wise distance loss defined as:

Ldist(GAB, pA) = Exi,xj∼pA | (1/σA)(||xi − xj||1 − µA) − (1/σB)(||GAB(xi) − GAB(xj)||1 − µB) |,    (16)

where µA, µB (σA, σB) are the pre-computed means (standard deviations) of pair-wise distances in the training sets of domains A and B, respectively.
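
For a single pair of source-domain images, the loss of Equation 16 can be sketched as follows; the function name and argument layout are illustrative.

```python
import torch

def distance_loss(G_AB, xi, xj, mu_A, sigma_A, mu_B, sigma_B):
    """Absolute difference between the standardized L1 distance of a source
    pair and the standardized L1 distance of its translated counterpart."""
    d_src = torch.sum(torch.abs(xi - xj))                 # ||xi - xj||_1
    d_tgt = torch.sum(torch.abs(G_AB(xi) - G_AB(xj)))     # ||G_AB(xi) - G_AB(xj)||_1
    return torch.abs((d_src - mu_A) / sigma_A - (d_tgt - mu_B) / sigma_B)
```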

In order to support stochastic gradient descent where only one data sample is fed into the model at a time, DistanceGAN proposes another self-distance constraint:

Lself-dist(GAB, pA) = Ex∼pA | (1/σA)(||L(x) − R(x)||1 − µA) − (1/σB)(||GAB(L(x)) − GAB(R(x))||1 − µB) |,    (17)

where L(x) and R(x) denote the left and right halves of the image x, and only the left (right) halves of images are taken into account when calculating µA, σA (µB, σB).


Thus the overall loss of DistanceGAN is given by:

L = α1A LGAN(GAB, DB) + α1B LGAN(GBA, DA) + α2A Ldist(GAB, pA) + α2B Ldist(GBA, pB) + α3A Lself-dist(GAB, pA) + α3B Lself-dist(GBA, pB) + α4 Lcyc(GAB, GBA),    (18)

where Lcyc(GAB, GBA) is defined in Equation 12 and LGAN(·) is defined in Equation 11.

DistanceGAN conducts extensive experiments using different losses: Lcyc alone (DiscoGAN [76] and CycleGAN [73]), the one-sided distance loss Ldist (A → B → A or B → A → B), the combination of Lcyc and one-sided Ldist, and the one-sided self-distance loss Lself-dist alone. The implementation of DistanceGAN [79] is the same as the baseline (DiscoGAN [76] or CycleGAN [73]), depending on which one it is compared with. Results show that the one-sided distance loss or self-distance loss outperforms DiscoGAN and CycleGAN in several tasks, and that the combination of the cyclic loss and the distance loss achieves the best results in some cases. However, the paper does not explore other possible combinations, such as using both distance losses or even the full loss function defined in Equation 18.

One interesting point, as stated in [79], is that DistanceGAN computes the distances in raw RGB space and still achieves better performance than the baselines; it may help, however, if the distances are calculated in the images' latent feature space, where features can be extracted using pre-trained image classifiers.

In DistanceGAN [79], the authors argue that the high positive correlation between dk and d'k implies that Σk dk d'k should be high. However, it is unclear how high d'k should be. For example, if dk = 1 and we know that d'k should be high, we still cannot definitely say that d'k = 3 is more desirable than d'k = 2. The concept of "high" is vague, so it may be better to use the relative concept "higher". An alternative statement could be: let x, y, z be images from domain A; if d(x, y) > d(x, z), then d(GAB(x), GAB(y)) > d(GAB(x), GAB(z)). We could then design a loss like max(δ, d(GAB(x), GAB(y)) − d(GAB(x), GAB(z))) for all triplets x, y, z ∈ A such that d(x, y) > d(x, z).

Regardless of the objective of maximizing Σk dk d'k, the actual loss used in DistanceGAN, i.e. Σk |dk − d'k|, forces d'k to be as close to dk as possible. In other words, how two images of a domain differ from each other should be reflected in the same way when they are translated into another domain, which can be called "equivariance". The distance loss and self-distance loss proposed in DistanceGAN [79] essentially capture this "equivariance" property.

5.5 Unsupervised Image-to-Image Translation with Feature Constancy

As mentioned previously, besides minimizing the reconstruction error at the raw pixel level, we can also do this at a higher feature level, which is explored in DTN [80]. The architecture of DTN is shown in Figure 9, where the generator G is composed of two neural networks, a convolutional network f and a transposed convolutional network g, such that G = g ∘ f.

Fig. 9. Architecture of the domain transfer network (DTN). In this example, the model transfers images from the BW color space to the RGB color space. The generator is expected to act as an identity mapping on images in the target domain, so that in this example an RGB image remains unchanged when put into the generator, while a BW image is transformed into RGB by the generator.

Here f acts as a feature extractor, and DTN [80] tries to preserve the high-level features of an input image from the source domain after it is transferred into the target domain. Let Xs denote the source domain and Xt the target domain. Given an input image x ∈ Xs, the output of the generator is G(x) = g(f(x)); the feature reconstruction error can then be defined with a distance measure d (DTN uses the mean squared error (MSE)):

LCONST = Σx∈Xs d(f(x), f(g(f(x)))).    (19)

Besides, DTN also expects the generator G to act as an identity mapping on images from the target domain. Given a distance measure d2 (DTN uses the mean squared error (MSE)), an identity mapping loss for the target domain is defined as:

LTID = Σx∈Xt d2(x, g(f(x))).    (20)

In addition, DTN uses a multi-class discriminator instead of the binary one in a regular GAN. Here D is a ternary classification function that maps an image to one of three classes 1, 2, 3, where class 1 means that the image is from the source domain and transformed by G, class 2 means that the input is an image from the target domain transformed by G, and class 3 means that the input image is from the target domain without any transformation. Let Di(x) denote the probability that x belongs to class i; the discriminator loss LD is defined as:

LD = −Ex∈Xs log D1(g(f(x))) − Ex∈Xt log D2(g(f(x))) − Ex∈Xt log D3(x).    (21)

Similarly, the generator's adversarial loss LGANG is defined as:

LGANG = −Ex∈Xs log D3(g(f(x))) − Ex∈Xt log D3(g(f(x))).    (22)

In order to slightly smoothen the generated images, DTN adds an anisotropic total variation loss [81] LTV defined on the generated image z = [zij] = G(x):

LTV(z) = Σi,j ((zi,j+1 − zij)² + (zi+1,j − zij)²)^(1/2).    (23)


Then the overall generator loss LG is defined as:

LG = LGANG + αLCONST + βLTID + γLTV . (24)

Experiments show that DTN [80] produces impressive images on the face-to-emoji task, and is competitive with some existing emoji-generating programs.

5.6 Unsupervised Image-to-Image Translation with Auxiliary Classifier

Bousmalis et al. [82] propose to use a task-specific auxiliary classifier to help unsupervised image-to-image translation. We denote their model as DAAC (short for "Domain Adaptation with Auxiliary Classifier") here for convenience. DAAC [82] contains a task-specific classifier C(x) → y that assigns a label vector y to an image in either the source domain or the target domain. This classifier C can, for example, be an image classification model. The architecture of DAAC is similar to Figure 2.

Given a training set Xs in which each image x ∈ Xs has a class label yx, the objective of C is to minimize the cross-entropy loss Lc:

Lc(G, C) = Ex∈Xs[−yx⊤ log[C(G(x))] − yx⊤ log[C(x)]].    (25)

Besides, DAAC [82] also proposes a content-similarity loss for cases with prior knowledge about what information should be preserved after the domain adaptation process. For example, we may expect the hues of the source image and the adapted image to be the same. In [82], the authors consider the case where they render objects on a black background and expect the adapted images to have the same objects but different backgrounds. In this case, each image x is associated with a binary mask mx ∈ R^k (k is the number of pixels in x) to separate foreground and background, and the content-similarity loss Ls, which is a variation of the pairwise mean squared error (PMSE) [83], can be defined as:

Ls(G) = Ex∈Xs[ (1/k)||(x − G(x)) ∘ mx||2² − (1/k²)((x − G(x))⊤mx)² ],    (26)

where ||·||2² is the squared L2 norm and ∘ is the Hadamard product. Note that here the image x is flattened into vector form.
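
For one flattened image, the masked PMSE of Equation 26 can be sketched as follows; x, g_x and mask are assumed to be 1-D tensors of length k, and the names are illustrative.

```python
import torch

def content_similarity_loss(x, g_x, mask):
    """First term: masked mean squared difference; second term: squared mean
    of the masked difference, as in Equation 26."""
    k = x.numel()
    diff = (x - g_x) * mask                        # Hadamard product with the binary mask
    return diff.pow(2).sum() / k - (diff.sum() ** 2) / (k ** 2)
```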

The total objective is thus given by:

arg min_{G,C} max_D αLGAN(G, D) + βLc(G, C) + γLs(G),    (27)

where LGAN(G, D) is the regular GAN objective defined in Equation 1, and α, β, γ are hyper-parameters to balance the different terms of the objective.

Although Lc(G, C) and Ls(G) help in generating better adapted images, the amount of labeled images is limited in the real world, and fine-grained pixel-level masks are even harder to obtain. Nonetheless, using an auxiliary classifier can also help to avoid mode collapse, as in AC-GAN [33].

Fig. 10. Architecture of UNIT. The two encoders share weights at the last few layers, while the two generators share weights at the first few layers, as indicated by the dashed lines.

5.7 Unsupervised Image-to-Image Translation with VAE and Weight Sharing

UNIT [84] proposes to add a VAE to CoGAN [85] for unsupervised image-to-image translation, as illustrated in Figure 10. In addition, UNIT assumes that both encoders share the same latent space: letting xA, xB be the same image in the two different domains, the shared latent space implies that EA(xA) = EB(xB). Based on the shared-latent-space assumption, UNIT enforces weight sharing between the last few layers of the encoders and between the first few layers of the generators. The objective function of UNIT is a combination of the objectives of GAN and VAE, with the difference of using two sets of GANs/VAEs and adding hyper-parameters λ to balance the different loss terms. Also, UNIT states that the shared latent space assumption implies cycle-consistency [73] [76] [74], so it adds another VAE-like constraint to its objective:

Lcc1 = λ3 DKL(qA(zA|xA)||pη(z)) + λ3 DKL(qB(zB|FAB(xA))||pη(z)) − λ4 EzB∼qB(zB|FAB(xA))[log pGA(xA|zB)],    (28)

where qA(zA|xA) = N(zA|EA(xA), I), qB(zB|xB) = N(zB|EB(xB), I), pη(z) = N(z|0, I), and FAB(x) = GB(EA(x)). Lcc2 can be defined similarly by reversing the subscripts A and B.

Although UNIT performs better than models like DTN [80] and CoGAN [85] on the MNIST [41] and Street View House Number (SVHN) [86] datasets in terms of cross-domain classification accuracy, it does not compare with other unsupervised methods such as CycleGAN [73] and DiscoGAN [76], nor does it use other widely adopted evaluation metrics like the Inception Score.


5.8 Unsupervised Multi-domain Image-to-Image Translation

Previous models can only transform images between two domains; if we want to transform an image among several domains, we need to train a separate generator for each pair of domains, which is costly. To deal with this problem, StarGAN [87] proposes to use one generator that can generate images of all domains. Instead of taking only an image as conditional input, StarGAN also takes the label of the target domain as input, and the generator is designed to transform the input image to the target domain indicated by the input label. Similar to DAAC [82] and AC-GAN [33], StarGAN uses an auxiliary domain classifier which classifies an image into the domain it belongs to. In addition, a cycle-consistency loss [73] is used to preserve the content similarity between input and output images. In order to allow StarGAN to train on multiple datasets that may have different sets of labels, StarGAN uses an additional one-hot vector to indicate the dataset and concatenates all label vectors into one vector, setting the unspecified labels to zero for each dataset.
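
Feeding the target-domain label to a single generator can be sketched as below, where the one-hot label is spatially replicated and concatenated with the image channels; this is a hypothetical minimal version rather than the official StarGAN implementation.

```python
import torch

def stargan_input(image, domain_idx, n_domains):
    """image: (B, C, H, W); domain_idx: (B,) target-domain indices.
    Returns a (B, C + n_domains, H, W) tensor for the generator."""
    b, _, h, w = image.shape
    onehot = torch.zeros(b, n_domains, device=image.device)
    onehot[torch.arange(b), domain_idx] = 1.0
    label_maps = onehot.view(b, n_domains, 1, 1).expand(b, n_domains, h, w)
    return torch.cat([image, label_maps], dim=1)

x = torch.randn(2, 3, 128, 128)
g_in = stargan_input(x, torch.tensor([0, 2]), n_domains=5)   # shape (2, 8, 128, 128)
```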

5.9 Summary on General Image-to-Image Translation

So far we have discussed several general image-to-image translation methods; the different losses they use are summarized in Table 1.

The simplest loss is the pixel-wise L1 reconstruction loss, which operates in the target domain and requires paired training samples. Both the one-sided and the bi-directional reconstruction loss can be treated as unsupervised versions of the pixel-wise L1 reconstruction loss, since they enforce cycle-consistency and do not require paired training samples. The additional VAE loss is based on the assumption of a shared latent space between the source and target domains, and it also implies the bi-directional cycle-consistency loss. The equivariance loss, however, does not try to reconstruct images, but to preserve the differences between images across the source and target domains.

Different from the previously mentioned losses that work on the generators directly, the pair-wise discriminator loss, ternary discriminator loss and auxiliary classifier loss work on the discriminator side and make the discriminator better at distinguishing real and fake samples, so that the generator can also learn better from the enhanced discriminator.

Among all the mentioned models, Pix2Pix [67] produces the sharpest images, even though the L1 loss is just a simple add-on to the original GAN model. It may be interesting to combine the L1 loss with the pair-wise discriminator in PLDT [35], which may improve the model's performance on image-to-image translations that involve geometric changes. Also, Pix2Pix may benefit from preserving similarity information between images in the source and target domains, as done in unsupervised methods like CycleGAN [73] and DistanceGAN [79]. As for unsupervised methods, although their results are not as sharp as those of supervised methods like Pix2Pix [67], they are a promising research direction, since they do not require paired data, and collecting labeled data is very costly in the real world.

TABLE 1
Summary of losses used in image-to-image translation; "Sup" is short for "supervision". A "√" in both the "Source domain" and "Target domain" columns indicates that the loss requires images from both domains (but not necessarily paired).

Loss                          | Source domain | Target domain | Sup | Models
Pixel-level L1                | √             | √             | √   | Pix2Pix [67]
Pair-wise discriminator       | √             | √             | √   | PLDT [35]
Bi-directional reconstruction | √             | √             |     | CycleGAN [73], DualGAN [74], DiscoGAN [76], StarGAN [87]
One-sided reconstruction      | √             |               |     | DTN [80]
Ternary discriminator         |               | √             |     | DTN [80]
Equivariance                  | √             | √             |     | DistanceGAN [79]
Auxiliary classifier          | √             | √             |     | DAAC [82], StarGAN [87]
VAE                           | √             | √             |     | UNIT [84]

5.10 Task-Specific Image-to-Image Translation

In this section, we introduce some models that work on more specific image-to-image translation applications.

5.10.1 Face Editing

Face editing is closely related to image-to-image translation, since it also takes images as input and produces images. However, face editing focuses more on manipulating the attributes of human faces, while image-to-image translation has a more general scope.

IcGAN [43] proposes to learn two separate encoders, Ez, which maps an image to its latent vector z, and Ey, which learns the attribute information vector y. Attribute manipulation is performed by tuning the attribute vector y, concatenating it with z and then feeding the combined vector to the generator.

Instead of generating images directly, Shen et al. [88] propose to learn residual images for attributes. To add an attribute to an image x, a residual image G(x) is obtained by putting x through a generator G, and the manipulated image is then obtained as x + G(x). The framework of this model follows the dual learning approach discussed earlier, and the discriminator is similar to the ternary classifier in DTN [80], which distinguishes whether an image is a ground-truth image or has been manipulated by either of the two generators (GAB, GBA).

Recently, Brock et al. proposed the Introspective Adversarial Network (IAN) [89] for neural photo editing, which, similar to VAE-GAN [38], is also a combination of GAN [4] and VAE [39]. However, unlike VAE-GAN, which uses different networks for the discriminator and the encoder, IAN combines the discriminator and encoder into a single network. The intuition behind this is that features learned by a well-trained discriminator tend to be more expressive than those learned by maximum likelihood, which makes them more suitable for inference. Specifically, the encoder is implemented as a fully-connected layer on top of the last convolution layer of the discriminator.


Besides manipulating face attributes, there are also other forms of "face editing", such as generating the frontal view of a person's face based on pictures of its side view. Huang et al. propose the Two-Pathway Generative Adversarial Network (TP-GAN) [90] to perform this task. TP-GAN consists of two pathways, a global structure pathway and a local texture pathway, where each pathway contains a pair of encoder and decoder. The global pathway constructs a blurry frontal view that captures the global structure of the face, while the local pathway, with four sub-networks, attends to local texture details around four facial landmarks: the left eye center, right eye center, nose tip and mouth center. The outputs of the four sub-networks in the local pathway are concatenated with the output of the global pathway, and the combined tensor is then fed into successive convolution layers to produce the final synthetic image. In addition to an L1 pixel loss, TP-GAN proposes to use a symmetry loss to enforce the symmetry of human faces, and an identity preserving loss to preserve the identity of the input face. The identity loss is implemented as a perceptual loss [68].

5.10.2 Image Super-Resolution

Another application of GAN related to image-to-image translation is image super-resolution, which is the task of taking a low-resolution image as input and producing a high-resolution one with sharp details. SRGAN [91] proposes to use a residual block [92] based generator and a fully convolutional discriminator [5] for single-image super-resolution. Besides the adversarial loss, SRGAN also combines a pixel-wise MSE loss, a perceptual loss [68] and a regularization loss [68]. SRGAN outperforms several baselines on some metrics by a small margin, but the difference in the synthetic images is not easy to tell without zooming in.

5.10.3 Video Prediction

A special kind of related application is video prediction, which aims to predict the next frame of a video given the current frame (or a history of frames).

VGAN [93] proposes to use a hierarchical model for video prediction. The generator has two data streams: a foreground stream f consisting of 3D transposed convolutions, and a background stream b consisting of 2D transposed convolutions. The foreground stream is also responsible for generating a mask m that is used to merge the results of the two streams as m ⊙ f + (1 − m) ⊙ b. The discriminator is a set of spatio-temporal convolution layers.

Mathieu et al. propose Adv-GDL [94], which generates future frames in an iterative way, from low to high resolution. Let s1, s2, ..., sk be a set of sizes and uk be the upsampling operator from size sk−1 to sk. Let Xk, Yk be the ground-truth current and future frames of size sk and Gk be the generator that produces images of size sk; the predicted future frame Ŷk is then calculated by Ŷk = uk(Ŷk−1) + Gk(Xk, uk(Ŷk−1)). The discriminator is a series of multi-scale convolutional networks with scalar outputs.

Although both methods produce reasonable output videos when the time interval is short, video quality becomes worse as time increases. Objects in the synthetic videos lose their original shapes and morph into indistinguishable objects in some cases, which may be due to the models' inability to learn legal movements of objects.

6 EVALUATION METRICS ON SYNTHETIC IMAGES

It is very hard to quantify the quality of synthetic images, and metrics like RMSE are not suitable since there is no absolute one-to-one correspondence between synthetic and real images. A commonly used subjective metric is to use Amazon Mechanical Turk (AMT)3, which hires humans to score synthetic and real images according to how realistic they think the images are. However, people often have different opinions of what is good or bad, so we also need objective metrics to evaluate the quality of images.

The Inception score (IS) [6] evaluates an image based on the entropy of the class probability distribution produced when the image is fed into a pre-trained image classifier. One intuition behind the Inception score is that the better an image $x$ is, the lower the entropy of the conditional distribution $p(y|x)$ should be, which means the classifier has high confidence about what the image contains. Also, to encourage the model to generate diverse classes of images, the marginal distribution $p(y) = \int p(y|x = G(z))\,dz$ should have high entropy. Combining these two intuitions, the Inception score is calculated as $\exp\big(\mathbb{E}_{x \sim G(z)} D_{KL}(p(y|x)\,\|\,p(y))\big)$. As discussed in [95], the Inception score is neither sensitive to the prior distribution over labels nor a proper distance measure. Moreover, it suffers from intra-class mode collapse, since a model only needs to generate one perfect sample for each class to obtain a perfect Inception score.
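Given a matrix of predicted class probabilities $p(y|x)$ for a batch of generated images (e.g. softmax outputs of a pre-trained Inception network), the score can be computed roughly as in the sketch below; this simplified version ignores the common practice of averaging the score over several splits of the samples.

import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (num_images, num_classes) array of p(y|x) for generated images.
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # per-image KL(p(y|x) || p(y))
    return float(np.exp(kl.sum(axis=1).mean()))              # exp(E_x[KL(p(y|x) || p(y))])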

Similar to the Inception score, the FCN-score [67] adopts the idea that if synthetic images are realistic, a classifier trained on real images will be able to classify the synthetic images correctly. However, an image classifier does not require the input image to be very sharp in order to give a correct classification, which means that metrics based on image classifiers may not be able to distinguish between two images that differ only in small details. Worse still, research on adversarial examples [96] shows that the decision of a classifier does not necessarily depend on the visual content of an image but can be highly influenced by noise invisible to humans, which raises further questions about this metric.

Frechet Inception Distance (FID) [97] provides a different approach. First, generated images are embedded into the latent feature space of a chosen layer of the Inception Net. Second, the embeddings of generated and real images are treated as samples from two continuous multivariate Gaussians, so that their means and covariances can be calculated. Then the quality of generated images is determined by the Frechet Distance between the two Gaussians:

$\mathrm{FID}(x, g) = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big)$, (29)

where $(\mu_x, \mu_g)$ and $(\Sigma_x, \Sigma_g)$ are the means and covariances of the samples from the true data distribution and from the generator’s learned distribution, respectively. The authors of [97] show that FID is consistent with human judgment and that there is a strong negative correlation between FID and the quality of generated images. Furthermore, FID is less sensitive to noise than IS and can detect intra-class mode collapse.
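Assuming two arrays of Inception-layer embeddings for real and generated images, FID can be estimated as in the following sketch, where scipy’s matrix square root is one possible way to compute $(\Sigma_x\Sigma_g)^{1/2}$.

import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_feats, gen_feats):
    # real_feats, gen_feats: (num_samples, feature_dim) Inception embeddings.
    mu_x, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_x = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts caused by numerical error
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))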

Besides the Inception score (IS), FCN-score and Frechet Inception Distance (FID), there are also other metrics such as the Gaussian Parzen Window [4], the Generative Adversarial Metric (GAM) [48] and the MODE Score [7]. Among all these metrics, the Inception score is the most widely adopted one for quantitatively evaluating synthetic images, and a recent study [98] uses IS to compare several GAN models. Although FID is relatively new, it has been shown to be better than IS [97][95].

3. https://www.mturk.com/

7 DISCRIMINATORS AS LEARNED LOSS FUNCTIONS

Generative adversarial networks are powerful and effective in that the discriminator acts as a learned loss function instead of a fixed one carefully designed for each specific task. This is particularly important for image synthesis tasks whose loss functions are hard to define explicitly in mathematical form. For example, in style transfer, it is hard to write down an equation that evaluates how well an image matches a certain painting style. Moreover, in image synthesis each input may have many legal outputs, and the samples in the training set cannot cover all situations. In this case, it is inappropriate to only minimize the distance between synthetic and ground-truth images, since we want the generator to learn the data distribution instead of memorizing training samples. Although we can design feature-based losses that try to preserve consistency in feature space rather than at the raw pixel level, as done in the perceptual loss [68] for image style, such losses are constrained by the pre-trained image classification models they rely on, and it remains an open question which layers to pick for calculating the feature loss when we switch to another pre-trained model. A discriminator, on the other hand, does not require an explicit definition of the loss, since it learns how to evaluate a data sample as it trains against the generator. Thus, given enough training data, the discriminator is able to learn a better loss function.
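As a small illustration of this contrast, the sketch below combines a hand-designed reconstruction term with a discriminator-provided adversarial term in a single generator update; the network objects, conditioning input and weighting are hypothetical and only meant to show where the learned loss enters.

import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, g_optimizer, x, target, lam=100.0):
    # x: conditioning input (e.g. a label map), target: the single ground-truth output.
    g_optimizer.zero_grad()
    fake = generator(x)
    recon = F.l1_loss(fake, target)        # hand-designed loss against one ground truth
    logits = discriminator(fake)           # learned "loss": how well does the output match the data?
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    loss = adv + lam * recon
    loss.backward()
    g_optimizer.step()
    return float(loss.item())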

The fact that the discriminator acts as a learned loss function has significant meaning for general artificial intelligence. Traditional pattern recognition and machine learning require us to define what features to use (e.g. SIFT [99] and HOG [100] descriptors), to design specific loss functions, and to decide what optimization methods to apply. Deep learning frees us from carefully designing features, by learning low-level and high-level feature representations by itself during training (e.g. CNN kernels), but we still need to work hard at designing loss functions that work well. GAN takes us one step further on the path towards artificial intelligence, in that it learns how to evaluate data samples instead of being told how to do so, although we still need to design the adversarial loss and combine it with other auxiliary losses. In other words, previously we designed how to calculate how close an output is to the corresponding ground truth ($L(\hat{x}, x)$), whereas the discriminator learns how to calculate how well an output matches the true data distribution ($L(\hat{x})$). Such a property allows models to be more flexible and more likely to generalize well. Furthermore, with learn2learn [101], which allows neural networks to learn to optimize themselves, there is a possibility that we may no longer need to choose which optimizers to use (such as RMSprop [102], Adam [103], etc.) and can let models handle everything themselves.

8 DISCUSSION AND CONCLUSION

In this paper, we review some basics of Generative Adversarial Nets (GAN) [4], classify image synthesis methods into three main approaches, i.e. direct methods, hierarchical methods and iterative methods, and mention some other generation methods such as iterative sampling [50]. We also discuss in detail two main forms of image synthesis, i.e. text-to-image synthesis and image-to-image translation.

For text-to-image synthesis, current methods work well on datasets where each image contains a single object, such as CUB [57] and Oxford-102 [55], but the performance on complex datasets such as MSCOCO [61] is much worse. Although some models can produce realistic images of rooms in LSUN [64], it should be noted that rooms do not contain living things, and a living thing is certainly much more complicated than static objects. This limitation probably stems from the models’ inability to learn different concepts of objects. We also propose that one possible way to improve GAN’s performance in this task is to train different models that each generate a single object well, together with another model that learns to combine different objects according to text descriptions, and that CapsNet [66] may be useful in such tasks.

For image-to-image translation, we review some general methods from supervised to unsupervised settings, such as the pixel-wise loss [67], cyclic loss [73] and self-distance loss [79]. Besides, we also introduce some task-specific image-to-image translation models for face editing, video prediction and image super-resolution. Image-to-image translation is certainly an interesting application of GAN, which has great potential to be incorporated into other software products, especially mobile apps. Although research in unsupervised methods seems more popular, supervised methods may be more practical since they still produce better synthetic images than unsupervised methods.

Finally, we review some evaluation metrics for synthetic images, and discuss GAN’s role on our path towards artificial intelligence. The power of GAN largely lies in its discriminator acting as a learned loss function, which makes the model perform better on tasks whose output is hard to evaluate with an explicit, hand-designed equation.

ACKNOWLEDGMENTS

He Huang would like to thank Chenwei Zhang and Bokai Cao for the valuable discussions and feedback.

REFERENCES

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[3] N. Brown and T. Sandholm, “Safe and nested subgame solving for imperfect-information games,” arXiv preprint arXiv:1705.02955, 2017.


[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.

[5] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.

[6] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2226–2234.

[7] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, “Mode regularized generative adversarial networks,” arXiv preprint arXiv:1612.02136, 2016.

[8] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.

[9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of wasserstein gan,” arXiv preprint arXiv:1704.00028, 2017.

[10] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.

[11] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.

[12] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” arXiv preprint arXiv:1609.03126, 2016.

[13] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in AAAI, 2017, pp. 2852–2858.

[14] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.

[15] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” arXiv preprint arXiv:1710.07035, 2017.

[16] R. A. Yeh, C. Chen, T. Lim, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with perceptual and contextual losses,” CoRR, vol. abs/1607.07539, 2016. [Online]. Available: http://arxiv.org/abs/1607.07539

[17] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun, “Show, adapt and tell: Adversarial training of cross-domain image captioner,” arXiv preprint arXiv:1705.00930, 2017.

[18] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, “Recurrent topic-transition gan for visual paragraph generation,” arXiv preprint arXiv:1703.07022, 2017.

[19] W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao, “Dual learning for cross-domain image captioning,” in CIKM, 2017, pp. 29–38.

[20] X. Wang, A. Shrivastava, and A. Gupta, “A-fast-rcnn: Hard positive generation via adversary for object detection,” arXiv preprint arXiv:1704.03414, 2017.

[21] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic segmentation using adversarial networks,” arXiv preprint arXiv:1611.08408, 2016.

[22] S. Rajeswar, S. Subramanian, F. Dutil, C. J. Pal, and A. C. Courville, “Adversarial generation of natural language,” CoRR, vol. abs/1705.10929, 2017. [Online]. Available: http://arxiv.org/abs/1705.10929

[23] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, and J. Wang, “Long text generation via adversarial training with leaked information,” CoRR, vol. abs/1709.08624, 2017. [Online]. Available: http://arxiv.org/abs/1709.08624

[24] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky, “Adversarial learning for neural dialogue generation,” arXiv preprint arXiv:1701.06547, 2017.

[25] Z. Yang, J. Hu, R. Salakhutdinov, and W. W. Cohen, “Semi-supervised qa with generative domain-adaptive nets,” arXiv preprint arXiv:1702.02206, 2017.

[26] Z. Yang, W. Chen, F. Wang, and B. Xu, “Improving neural machine translation with conditional sequence generative adversarial nets,” arXiv preprint arXiv:1703.04887, 2017.

[27] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.

[28] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.

[29] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” arXiv preprint arXiv:1612.03242, 2016.

[30] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.

[31] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Advances in Neural Information Processing Systems, 2016, pp. 217–225.

[32] A. Odena, “Semi-supervised learning with generative adversarial networks,” arXiv preprint arXiv:1606.01583, 2016.

[33] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” arXiv preprint arXiv:1610.09585, 2016.

[34] A. Dash, J. C. B. Gamboa, S. Ahmed, M. Z. Afzal, and M. Liwicki, “Tac-gan: Text conditioned auxiliary classifier generative adversarial network,” arXiv preprint arXiv:1703.06412, 2017.

[35] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon, “Pixel-level domain transfer,” in European Conference on Computer Vision. Springer, 2016, pp. 517–532.

[36] J. Donahue, P. Krahenbuhl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.

[37] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville, “Adversarially learned inference,” arXiv preprint arXiv:1606.00704, 2016.

[38] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” arXiv preprint arXiv:1512.09300, 2015.

[39] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[40] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 1486–1494.

[41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[42] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” arXiv preprint arXiv:1606.00709, 2016.

[43] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Alvarez, “Invertible conditional gans for image editing,” arXiv preprint arXiv:1611.06355, 2016.

[44] X. Wang and A. Gupta, “Generative image modeling using style and structure adversarial networks,” arXiv preprint arXiv:1603.05631, 2016.

[45] J. Yang, A. Kannan, D. Batra, and D. Parikh, “Lr-gan: Layered recursive generative adversarial networks for image generation,” arXiv preprint arXiv:1703.01560, 2017.

[46] P. J. Burt and E. H. Adelson, “The laplacian pyramid as a compact image code,” in Readings in Computer Vision. Elsevier, 1987, pp. 671–679.

[47] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” arXiv preprint arXiv:1612.04357, 2016.

[48] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic, “Generating images with recurrent adversarial networks,” CoRR, vol. abs/1602.05110, 2016. [Online]. Available: http://arxiv.org/abs/1602.05110

[49] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” arXiv preprint arXiv:1502.04623, 2015.

[50] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune, “Plug & play generative networks: Conditional iterative generation of images in latent space,” arXiv preprint arXiv:1612.00005, 2016.

[51] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, “Synthesizing the preferred inputs for neurons in neural networks via deep generator networks,” in Advances in Neural Information Processing Systems, 2016, pp. 3387–3395.

[52] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 1096–1103.


[53] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.

[54] S. Reed, Z. Akata, B. Schiele, and H. Lee, “Learning deep representations of fine-grained visual descriptions,” arXiv preprint arXiv:1605.05395, 2016.

[55] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.

[56] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.

[57] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.

[58] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.

[59] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” CoRR, vol. abs/1710.10916, 2017. [Online]. Available: http://arxiv.org/abs/1710.10916

[60] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” arXiv preprint arXiv:1711.10485, 2017.

[61] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.

[62] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.

[63] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.

[64] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.

[65] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in International Conference on Artificial Neural Networks. Springer, 2011, pp. 44–51.

[66] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017.

[67] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint arXiv:1611.07004, 2016.

[68] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.

[69] Y. Jing, Y. Yang, Z. Feng, J. Ye, and M. Song, “Neural style transfer: A review,” CoRR, vol. abs/1705.04058, 2017. [Online]. Available: http://arxiv.org/abs/1705.04058

[70] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for neural networks for image processing,” CoRR, vol. abs/1511.08861, 2015. [Online]. Available: http://arxiv.org/abs/1511.08861

[71] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[72] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European Conference on Computer Vision. Springer, 2016, pp. 702–716.

[73] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.

[74] Z. Yi, H. Zhang, P. T. Gong et al., “Dualgan: Unsupervised dual learning for image-to-image translation,” arXiv preprint arXiv:1704.02510, 2017.

[75] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma, “Dual learning for machine translation,” in Advances in Neural Information Processing Systems, 2016, pp. 820–828.

[76] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” arXiv preprint arXiv:1703.05192, 2017.

[77] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2813–2821.

[78] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” arXiv preprint arXiv:1612.07828, 2016.

[79] S. Benaim and L. Wolf, “One-sided unsupervised domain mapping,” arXiv preprint arXiv:1706.00826, 2017.

[80] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image generation,” arXiv preprint arXiv:1611.02200, 2016.

[81] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.

[82] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” arXiv preprint arXiv:1612.05424, 2016.

[83] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.

[84] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems, 2017, pp. 700–708.

[85] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in neural information processing systems, 2016, pp. 469–477.

[86] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.

[87] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv preprint arXiv:1711.09020, 2017.

[88] W. Shen and R. Liu, “Learning residual images for face attribute manipulation,” CoRR, vol. abs/1612.05363, 2016. [Online]. Available: http://arxiv.org/abs/1612.05363

[89] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Neural photo editing with introspective adversarial networks,” arXiv preprint arXiv:1609.07093, 2016.

[90] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis,” CoRR, vol. abs/1704.04086, 2017. [Online]. Available: http://arxiv.org/abs/1704.04086

[91] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” CoRR, vol. abs/1609.04802, 2016. [Online]. Available: http://arxiv.org/abs/1609.04802

[92] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[93] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in Advances in Neural Information Processing Systems, 2016, pp. 613–621.

[94] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015.

[95] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, “Are GANs Created Equal? A Large-Scale Study,” ArXiv e-prints, Nov. 2017.

[96] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[97] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a nash equilibrium,” CoRR, vol. abs/1706.08500, 2017. [Online]. Available: http://arxiv.org/abs/1706.08500

[98] W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow, “Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step,” ArXiv e-prints, Oct. 2017.

[99] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer vision, 1999. The proceedings of the seventh IEEE international conference on, vol. 2. IEEE, 1999, pp. 1150–1157.

[100] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[101] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in Advances in Neural Information Processing Systems, 2016, pp. 3981–3989.

[102] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.

[103] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

He Huang is a second-year PhD student in the Department of Computer Science, University of Illinois at Chicago, USA. He received his bachelor’s degree in Software Engineering from Sun Yat-sen University, China. His research interests are GAN, computer vision and data mining, especially cross-modal tasks (such as image captioning, cross-modal retrieval and text-to-image synthesis). He is a Bayesian person, and loves probabilistic graphical models for their elegance.

Philip S. Yu is a Distinguished Professor in the Department of Computer Science at UIC and also holds the Wexler Chair in Information Technology. Before joining UIC, he was with the IBM Thomas J. Watson Research Center, where he was manager of the Software Tools and Techniques department. His main research interests include big data, data mining (especially graph/network mining), social networks, privacy-preserving data publishing, data streams, database systems, and Internet applications and technologies.

Changhu Wang is currently a technical director of Toutiao AI Lab, Beijing, China. He is leading the visual computing team in Toutiao AI Lab, working on computer vision, multimedia analysis, and machine learning. Before joining Toutiao AI Lab, he worked as a lead researcher at Microsoft Research from 2009 to 2017. He was a research engineer at the Department of Electrical and Computer Engineering, National University of Singapore, in 2008.