Triple Generative Adversarial NetsTriple-GAN, the shared discriminator enforces the mixture distributions deﬁned by the classiﬁer and generator to be similar to the true data distribution,

Triple Generative Adversarial Nets

Chongxuan Li [email protected] Xu [email protected] Zhu [email protected] Zhang [email protected] University, China

Abstract

Generative adversarial nets (GANs) are good atgenerating realistic images and have been ex-tended for semi-supervised classification. How-ever, under a two-player formulation, existingwork shares competing roles of identifying fakesamples and predicting labels via a single dis-criminator network, which can lead to unde-sirable incompatibility. We present triple gen-erative adversarial net (Triple-GAN), a flexi-ble game-theoretical framework for classifica-tion and class-conditional generation in semi-supervised learning. Triple-GAN consists ofthree players—a generator, a discriminator and aclassifier, where the generator and classifier char-acterize the conditional distributions between im-ages and labels, and the discriminator solely fo-cuses on identifying fake image-label pairs. Withdesigned utilities, the distributions characterizedby the classifier and generator both concentrateto the data distribution under nonparametric as-sumptions. We further propose unbiased regular-ization terms to make the classifier and genera-tor strongly coupled and some biased techniquesto boost the performance of Triple-GAN in prac-tice. Our results on several datasets demon-strate the promise in semi-supervised learning,where Triple-GAN achieves comparable or su-perior performance than state-of-the-art classifi-cation results among DGMs; it is also able todisentangle the classes and styles and transfersmoothly on the data level via interpolation onthe latent space class-conditionally.

1. IntroductionDeep generative models (DGMs) can capture the underly-ing distributions of the input data and synthesize new sam-ples. Recently, significant progress has been made on gen-erating realistic images based on Generative AdversarialNets (GANs) (Goodfellow et al., 2014; Denton et al., 2015;Radford et al., 2015). GAN is formulated as a two-playergame, where the generator G takes a random noise z as in-put and produces a sampleG(z) in the data space while thediscriminator D identifies whether a certain sample comesfrom the true data distribution p(x) or the generator. BothG and D are parameterized as deep neural networks andthe training procedure is to solve a minimax problem:

minG

maxD

U(D,G) = Ex∼p(x)[log(D(x)) ]

+Ez∼pg(z)[log(1−D(G(z)))],

where pg(z) is a simple distribution (e.g., uniform or nor-mal) and U(·) denotes the utilities. Given a generator andthe defined distribution pg , the optimal discriminator isD(x) = p(x)pg(x)+p(x) under the assumption of infinite ca-pacity, and the global equilibrium of this game achieves ifand only if pg(x) = p(x) (Goodfellow et al., 2014), whichis desired in terms of image generation.

GANs and DGMs in general have also proven effectivein semi-supervised learning (Kingma et al., 2014), whileretaining the generative capability. Under the same two-player game framework, Cat-GAN (Springenberg, 2015)generalizes GANs with a categorical discriminative net-work and an objective function that minimizes the condi-tional entropy of predictions given data while maximizesthe conditional entropy of predictions given generated sam-ples. Odena (2016) and Salimans et al. (2016) augment thecategorical discriminator with one more class, correspond-ing to the fake data generated by the generator. Salimanset al. (2016) further propose two alternative training objec-tives that work well for either semi-supervised classifica-tion or image generation, but not both. Specifically, the ob-jective of feature matching works well in semi-supervised

arX

iv:1

703.

0229

1v2

[cs

.LG

] 3

Apr

201

7


classification but fails to generate indistinguishable sam-ples (See Sec.4.2 for examples), while the other objectiveof minibatch discrimination is good at realistic image gen-eration but cannot predict labels accurately.

This incompatibility essentially arises from their two-player formulation, where a single discriminator networkhas to play two competing roles—identifying fake samplesand predicting labels. Such a shared architecture can pre-vent the model from achieving a desirable equilibrium. As-sume that both G and D converge finally and G concen-trates to the true data distribution, namely p(x) = pg(x).Given a generated sample, as a discriminator, D shouldidentify it as fake data with probability 12 ; while as a clas-sifier, D should also predict it as the correct class confi-dently. Obviously, the two roles conflict, hence either Gor D could not achieve its optimum. In addition, disentan-gling meaningful physical factors like object category fromlatent representations with limited supervision is of gen-eral interest (Yang et al., 2015). However, existing semi-supervised GANs focus on the marginal distribution of dataand hence G cannot generate data in a specific class.

We address these issues by presenting Triple-GAN, a flex-ible game-theoretical framework for both classificationand class-conditional image generation in semi-supervisedlearning, where the data is characterized by a joint distri-bution p(x, y) of input x and label y. Based on the alter-native factorizations p(x, y) = p(x)p(y|x) and p(x, y) =p(y)p(x|y), we build two separate conditional models–agenerator and a classifier, to characterize the conditionaldistributions p(x|y) and p(y|x), respectively. By mildly as-suming that samples can be drawn from the marginal distri-butions p(x) and p(y), we can draw input-label pairs (x, y)from our generator and classifier. Then, we define an ad-versarial game by introducing a discriminator which hasthe single role of determining whether a sample (x, y) isfrom the model distribution or the data distribution. Con-sequently, we naturally arrive at a game with tripartite cor-relations of a generator, a classifier and a discriminator. InTriple-GAN, the shared discriminator enforces the mixturedistributions defined by the classifier and generator to besimilar to the true data distribution, and the equilibrium isunique under a proper regularizer with a supervised lossand a nonparametric assumption (See Sec. 3.2). Therefore,a good classifier will result in a good generator via teach-ing the common discriminator and vice versa. We furtherintroduce some unbiased regularization terms and biasedtechniques to boost the performance of Triple-GAN. Es-pecially, we propose a pseudo discriminative loss, whichenforces the classifier to predict the generated images cor-rectly, to improve the classifier via the generative modelexplicitly (See details in Sec. 3.3).

Our results on the widely adopted MNIST (LeCun

et al., 1998), SVHN (Netzer et al., 2011) and CI-FAR10 (Krizhevsky & Hinton, 2009) datasets demonstratethat Triple-GAN can make accurate predictions withoutsacrificing generation quality. We advance the state-of-the-art semi-supervised classification results on MNIST andSVHN among DGMs. We further show that Triple-GANis able to disentangle classes and styles and perform class-conditional interpolation given partially labeled data.

2. Related WorkVarious approaches have been developed to learn DGMs,including MLE-based models such as Variational Au-toencoders (VAEs) (Kingma & Welling, 2013; Rezendeet al., 2014), Generative Moment Matching Networks(GMMNs) (Li et al., 2015; Dziugaite et al., 2015) and Gen-erative Adversarial Nets (GANs) (Goodfellow et al., 2014),which can be viewed as an instance of Noise ContrastiveEstimation (NCE) (Gutmann & Hyvärinen, 2010). Thesecriteria are systematically compared in (Theis et al., 2015).

One primal goal of DGMs is to generate realistic samples,for which GANs have proven effective. Specifically, LAP-GAN (Denton et al., 2015) leverages a series of GANs toupscale the generated samples to high resolution imagesthrough the Laplacian pyramid framework (Burt & Adel-son, 1983). DCGAN (Radford et al., 2015) adopts (frac-tionally) strided convolution networks and batch normal-ization (Ioffe & Szegedy, 2015) in GANs and generate real-istic natural images. The architecture of DCGAN is widelyadopted in various GANs, including ours.

Recent work has introduced inference networks in GANs,which can infer latent variables given data and train thewhole model jointly in an adversarial process. For in-stance, InfoGAN (Chen et al., 2016) learns interpretablelatent codes from unlabeled data by regularizing the orig-inal GANs via variational mutual information maximiza-tion. In ALI (Dumoulin et al., 2016; Donahue et al., 2016),the inference network infers the latent variables from truedata and the discriminator estimates the probability that apair of latent variable and data comes from the inferencenetwork instead of the generator, which make the joint dis-tributions defined by the generator and inference network tobe same. We draw inspiration from the architecture of ALIbut there exist two main differences in global equilibria andutilities: (1) Triple-GAN focuses on learning good gener-ator and classifier given partially labeled data, while ALIaims to learn latent features from unlabeled data; (2) boththe classifier (inference network) and generator attempt tofool the discriminator in Triple-GAN, while the discrimi-nator would like to accept the samples from the inferencenetwork but reject samples from the generator in ALI.

To handle partially labeled data, the class-conditionalVAE (Kingma et al., 2014) treats the missing labels


as latent variables and infer them for unlabeled data.ADGM (Maaløe et al., 2016) introduces auxiliary vari-ables to build a more expressive variational distributionand improve the predictive performance. The Ladder Net-work (Rasmus et al., 2015) employs lateral connectionsbetween a variation of denoising autoencoders and ob-tains excellent semi-supervised classification results. Cat-GAN (Springenberg, 2015) generalizes GANs with a cat-egorical discriminative network and an objective function.Salimans et al. (2016) propose empirical techniques to sta-bilize the training of GANs and improve the performanceon semi-supervised learning and image generation underincompatible learning criteria. Triple-GAN differs signifi-cantly from these methods, as stated in the introduction.

3. MethodWe consider learning deep generative models in the semi-supervised setting,1 where we have a partially labeleddataset with x denoting input and y denoting the outputlabel. The goal is to predict the labels y for unlabeleddata as well as to generate new samples x. This is differ-ent from the unsupervised learning setting for pure genera-tion, where the primary goal is to identify whether a sam-ple x is generated from the true distribution p(x) of inputor not; thus a two-player game with an input generator anda discriminator is sufficient to describe the process as inGANs. In our setting, as the label information y is incom-plete (thus uncertain), our density model should character-ize the uncertainty of both x and y, therefore a joint distri-bution p(x, y) of input-label pairs.

A straightforward application of the two-player GAN is in-feasible because of the missing values on y. Unlike theprevious work on semi-supervised GANs (Springenberg,2015; Salimans et al., 2016), which is restricted to the two-player framework and can lead to incompatible objectives,we build our game-theoretic objective based on the insightthat the joint distribution can be factorized in two ways,namely, p(x, y) = p(x)p(y|x) and p(x, y) = p(y)p(x|y),and that the conditional distributions p(y|x) and p(x|y) areof interest for classification and class-conditional genera-tion, respectively. To jointly estimate these conditional dis-tributions, which are characterized by a class-conditionalgenerator network and a classifier network, we define a sin-gle discriminator network which has the sole role of distin-guishing whether a sample is from the true data distributionor the models. Hence, we naturally extend GANs to Triple-GAN, a three-player game to characterize the process ofsemi-supervised classification and class-conditional gener-ation, as detailed below.

1Supervised learning is an extreme case, where the training setis fully labeled while the testing set is unlabeled.

D

D

D Y

N / Y

N / Y∼ 𝒑𝒈(𝒙, 𝒚)G

C ∼ 𝒑𝒄(𝒙, 𝒚)

∼ 𝒑(𝒙, 𝒚)C

“0”

“0”

“0”

CE

“0”

Figure 1. An illustration of Triple-GAN (best view in color). Theobjectives of C, G and D are colored in blue, green and red re-spectively, with “N” denoting rejection, “Y” denoting acceptanceand “CE” denoting the cross entropy loss for supervised learning.

3.1. A Game with Three Players

Triple-GAN consists of three components: (1) a classi-fier C that (approximately) characterizes the conditionaldiscriminative distribution pc(y|x) ≈ p(y|x); (2) a class-conditional generator G that (approximately) character-izes the conditional distribution in the other directionpg(x|y) ≈ p(x|y); and (3) a discriminator D that distin-guishes whether a pair of data (x, y) comes from the truedistribution p(x, y). All the components are parameter-ized as neural networks. Our desired equilibrium is thatthe joint distributions defined by the classifier and genera-tor will concentrate to the true data distributions. To thisend, we design a game with compatible utilities for tripleplayers as follows.

We make the mild assumption that the samples from bothp(x) and p(y) can be easily obtained.2 In the game, aftera sample x is drawn from p(x), the classifier C producesa pseudo label y given x following the conditional distri-bution pc(y|x). Hence, the pseudo input-label pair is asample from the joint distribution pc(x, y) = p(x)pc(y|x).Similarly, a pseudo input-label pair can be sampled fromthe generator G by first drawing y ∼ p(y) and thendrawing x|y ∼ pg(x|y); hence from the joint distributionpg(x, y) = p(y)pg(x|y). For pg(x|y), we assume that x istransformed by the latent style variables z given the label y,namely, x = G(y, z), z ∼ pg(z), where pg(z) is a simpledistribution (e.g., uniform or standard normal). Then, thepseudo input-label pairs (x, y) generated by both C and Gare sent to the single discriminator D for judgement. Thediscriminator D can also access the input-label pairs fromthe true data distribution as positive samples. The utilitiescan be formulated as a minimax game:

minC,G

maxD

U(C,G,D) = E(x,y)∼p(x,y)[logD(x, y)]

+(1− α)Ey∼p(y),z∼pg(z)[log(1−D(G(y, z), y))]+αEx∼p(x),y∼pc(y|x)[log(1−D(x, y))], (1)

2In semi-supervised learning, p(x) is the empirical distribu-tion of inputs and p(y) is assumed same to the distribution oflabels on labeled data, which is uniform in our experiment.


where α ∈ (0, 1) is a constant that controls the relative im-portance of generation and classification. Note that pc(y|x)should be a deterministic mapping (or delta distribution) toallow the training signal from D back-propagated to C andwe simply choose the label with the maximum probabilityinstead of sampling one in our experiment.

The minimax game defined in Eqn. (1) achieves the equi-librium if and only if p(x, y) = (1−α)pg(x, y)+αpc(x, y)(See details in Sec. 3.2). The equilibrium indicates that ifone of C and G tends to the data distribution, the other willbecome better. However, unfortunately, it cannot guaranteethat p(x, y) = pg(x, y) = pc(x, y) is the unique global op-timum, which is not desirable. To address this problem, weadd the standard supervised loss (i.e., cross-entropy loss)for the classifier C, RL = E(x,y)∼p(x,y)[− log pc(y|x)],which is equivalent to the KL-divergence between pc(x, y)and p(x, y). Consequently, we define the game as:

minC,G

maxD

Ũ(C,G,D) = E(x,y)∼p(x,y)[logD(x, y)]

+(1− α)Ey∼p(y),z∼pg(z)[log(1−D(G(y, z), y))]+αEx∼p(x),y∼pc(y|x)[log(1−D(x, y))] +RL. (2)

It will be proven that the game trained under Ũ has theunique global optimum for C and G. See Fig. 1 for anillustration of the whole model.

3.2. Theoretical Analysis

We now provide a formal theoretical analysis of Triple-GAN under nonparametric assumptions. For clarity of themain text, we defer the proof details to Appendix A.

First, we can show that the optimalD balances between thetrue data distribution and the mixture distribution definedby C and G, as summarized in Lemma 3.1.

Lemma 3.1. For any fixed C and G, the optimal dis-criminator D of the game defined by the utility functionU(C,G,D) is:

D∗C,G(x, y) =p(x, y)

p(x, y) + pα(x, y), (3)

where pα(x, y) := (1 − α)pg(x, y) + αpc(x, y) is a validdistribution.

Given D∗C,G, we can omit the discriminator and reformu-late the minimax game with value function U as:

V (C,G) = maxD

U(C,G,D),

whose optimum is summarized as in Lemma 3.2.

Lemma 3.2. The global minimum value of V (C,G) is− log 4 and it is achieved if and only if p(x, y) = pα(x, y).

We can further show that C and G can at least capturethe marginal distribution of data, especially for pg(x), even

there may exist multiple global equilibria, as summarizedin Corollary A.Corollary 3.2.1. Given p(x, y) = pα(x, y), the marginaldistributions are the same for any pairs of p, pc and pg .

Given the above result that p(x, y) = pα(x, y), C and Gdo not compete explicitly as in the two-player based for-mulation and it is easy to verify that p(x, y) = pc(x, y) =pg(x, y) is a global equilibrium point. However, it may notbe unique and we should minimize an additional objectiveto ensure the uniqueness. In fact, this is true for the utilityfunction Ũ(C,G,D) in problem (2), as stated below.Theorem 3.3. The equilibrium of Ũ(C,G,D) is achievedif and only if p(x, y) = pg(x, y) = pc(x, y) withD∗C,G(x, y) =

12 and the optimum value is − log 4.

The conclusion essentially motivates our design of Triple-GAN, as we can ensure that both C and G will concentrateto the true data distribution if the model has been trained toachieve the optimum.

We can further show another nice property of Ũ , whichallows us to regularize our model for stable and better con-vergence in practice without bias, as summarized below.Corollary 3.3.1. Any additional regularization on the dis-tances between the marginal, conditional and joint distri-butions of any two players, will not change the global equi-librium of Ũ .

3.3. Unbiased Regularization

Despite the discriminative loss for the classifier, we addother unbiased regularization terms to boost the perfor-mance of our algorithm, following Corollary A.

Pseudo discriminative loss We treat the samples gener-ated by the generator as labeled data and optimize the clas-sifier on the pseudo data to make the generator and clas-sifier strongly coupled. Intuitively, a good generator canprovide meaningful labeled data beyond the training set asextra side information for the classifier, which will boostthe predictive performance. We refer to this regularizationas Pseudo discriminative loss and it is in the form:

RP = Epg [− log pc(y|x)],

and the right hand side can be rewritten as:

DKL(pg(x, y)||pc(x, y))+Hpg (y|x)−DKL(pg(x)||p(x)),

where Hpg (y|x) denotes the conditional entropy over pg(See Appendix A for proof). Note that we do not train thegenerator to minimize this loss and the last two terms canbe viewed as constants with respect to C. Therefore, mini-mizing RP is equivalent to minimizing the KL-divergencebetween pg(x, y) and pc(x, y); hence Corollary A appliesto ensure that the global equilibrium is unchanged.


Algorithm 1 Minibatch stochastic gradient descent training of Triple-GAN in semi-supervised learning.for number of training iterations do• Sample a batch of labels yg ∼ p(y) and a batch of noise zg ∼ pg(z) of size mg , and generate pseudo dataxg ∼ pg(x|y) = G(zg, yg).• Sample a batch of unlabeled data xc ∼ p(x) of size mc, and compute pseudo labels yc = arg maxy pc(y|x) for xc.• Sample a batch of labeled data (xl, yl) ∼ p(x, y) and a batch of unlabeled data xd ∼ p(x).• Get pseudo labels yd=arg maxy pc(y|x) for xd and concatenate (xl,yl) and (xd,yd) together as a batch of size md.• Update D by ascending along its stochastic gradient:

∇θd

1md

(∑

(xl,yl)

logD(xl, yl) +∑

(xd,yd)

logD(xd, yd))+α

mc

∑(xc,yc)

log(1−D(xc, yc))+1− αmg

∑(xg,yg)

log(1−D(xg, yg))

.• Compute the unbiased estimators R̃L, R̃P and R̃U of the corresponding lossesRL,RP andRU respectively.• Update C by descending along its stochastic gradient:

∇θc

αmc

∑(xc,yc)

log(1−D(xc, yc)) + R̃L + αPR̃P + αUR̃U

.• Update G by descending along its stochastic gradient:

∇θg

1− αmg

∑(xg,yg)

log(1−D(xg, yg))

.end for

Regularization on the generator It is natural to regularizethe generator because it will in turn benefit the classifier.In our early experiment, we tried the maximum mean dis-crepancy loss (Gretton et al., 2012) between the marginaldistributions pg(x) and p(x). However, we did not observesignificant improvement and hence omited it for efficiency.Corollary A may explain this as Triple-GAN ensures thatthe players share same marginal distributions.

3.4. Practical Techniques

In this subsection we introduce several practical techniquesfor Triple-GAN, which may lead to a biased solution theo-retically but work well in practice.

One crucial problem of Triple-GAN in semi-supervisedlearning is that the discriminator may collapse to the deltadistribution on the labeled data of small size, and rejectother types of samples, which may come from the true datadistribution indeed. Consequently, the generator wouldalso concentrate to the empirical distribution and generatecertain labeled data no matter what the latent variable is.To address this problem, we sample some unlabeled dataand generate pseudo labels through the classifier and usethese data as true labeled data for the training of the dis-criminator. Note that we just use pc(y|x) to approximatep(y|x) but do not train C to optimize this term. This ap-proach introduces slight bias to Triple-GAN because thetarget distribution shifts a little towards pc(x, y). However,it remains that a good classifier will result in a good gener-

ator and this technique works well in practice.

As only C can leverage the unlabeled data directly, the per-formance of the whole system highly relies on the goodnessof C. Consequently, it is necessary to regularize C heuris-tically as in recent advances (Springenberg, 2015; Laine &Aila, 2016) to make more accurate predictions. Note thatthe separated pathway of C and D in Triple-GAN providesextra freedom to regularize C biased or not without anynegative effect on D intuitively. We consider two alterna-tive losses on the unlabeled data as follows.

Confidence and balance loss Springenberg (2015) mini-mizes the conditional entropy of pc(y|x) and the cross en-tropy between p(y) and pc(y), weighted by a hyperparam-eter αB, as follows:

RU = Hpc(y|x) + αBEp[− log pc(y)

],

which encourages the classifier to make predictions confi-dently and be balanced on unlabeled data. The similar ideahas been proven effective in (Li et al., 2016) with a largemargin classifier.

Consistency loss Laine & Aila (2016) penalize the networkif it predicts the same unlabeled data inconsistently givendifferent noise �, e.g., dropout masks, as follows:

RU ′ = Ex∼p(x)||pc(y|x, �)− pc(y|x, �′)||2,

where ||·||2 is the square of the l2-norm, which is chosen forits smoothness but can also be replaced with other choices.


We use the confidence and balance loss by default excepton the CIFAR10 dataset because the consistency loss worksbetter for complicated data. The whole training procedureof Triple-GAN is presented in Alg. 1.

4. ExperimentsWe now present results on the widely adopted MNIST (Le-Cun et al., 1998), SVHN (Netzer et al., 2011), and CI-FAR10 (Krizhevsky & Hinton, 2009) datasets. MNISTconsists of 60,000 training samples and 10,000 testing sam-ples of handwritten digits of size 28 × 28 pixels. SVHNconsists of 73,257 training samples and 26,032 testing sam-ples and each is a colored image of size 32×32, containinga sequence of digits with various backgrounds. CIFAR10consists of colored images distributed across 10 generalclasses—airplane, automobile, bird, cat, deer, dog, frog,horse, ship and truck. There are 50,000 training samplesand 10,000 samples of size 32 × 32 in CIFAR10. We fol-low (Salimans et al., 2016) to rescale the pixels of SVHNand CIFAR10 data into (−1, 1). The labeled data is dis-tributed equally across classes and the results are averagedover 10 times with different random splits of training data,following (Springenberg, 2015; Salimans et al., 2016).

We implement our method on Theano (Theano Develop-ment Team, 2016) and here we briefly summarize ourexperimental settings.3 Our network architectures arehighly referred to existing work on GANs (Radford et al.,2015; Springenberg, 2015; Salimans et al., 2016) and thedetails are listed in Appendix E. We train G to max-imize logD(G(y, z), y) instead of minimizing log(1 −D(G(y, z), y)) as suggested in (Goodfellow et al., 2014;Radford et al., 2015). We optimize all of our networks withAdam (Kingma & Ba, 2014) where the first order of mo-mentum is 0.5 and others are default values. The modelis trained for 1000 epochs with a global learning rate in{0.0003, 0.001}, which is annealed by a factor of 0.995 af-ter 300 epochs. Each batch of data for C contains an equalnumber of labeled and unlabeled data for fast convergence.However, for the training of D, the unlabeled data (withlabels generated by the classifier) is more than the labeleddata in a batch to avoid collapsing and the ratio betweenthem is nearly proportional to that of the whole dataset.

For the hyperparameters in the objectives, we simply setα = 1/2, which means that D treats C and G equally. Werefer (Li et al., 2016) to set αB = 1/300 and αU = 0.3and fix them across all experiments if not mentioned. Thepseudo discriminative loss is not applied until a thresholdthat the generator could generate meaningful data. Thethreshold is chosen from {200, 300} and αP is chosen from

3See https://github.com/zhenxuan00/triple-gan for the sourcecode.

{0.1, 0.03}, depending on the data. We keep the movingaverage of the parameters in the classifier for stable eval-uation as in (Salimans et al., 2016). In our experiments,we find that these training techniques for the original two-player GANs are sufficient to stabilize the optimization ofTriple-GAN and the convergence speed is comparable toprevious work (Salimans et al., 2016).

4.1. Classification

We first evaluate our method with 20, 50, 100 and 200 la-beled samples on MNIST for a systematical comparisonwith previous methods. We pretrain C solely with αB = 1for 30 epochs to reduce the variance given no more than100 labels. Table 1 summarizes the quantitative results.Given 100 labels, our method is competitive to the state-of-the-art among a large body of approaches. Under othersettings, Triple-GAN consistently outperforms Improved-GAN, as shown in the last two rows. Besides, we can seethat Triple-GAN achieves more significant improvement asthe number of labeled data decreases, suggesting the effec-tiveness of the pseudo discriminative loss. We also evaluateour Triple-GAN in supervised learning, where we omit allregularization terms and techniques except the pseudo dis-criminative loss. The result again confirms that the com-patible utilities help Triple-GAN converge better than two-player GANs, as shown in the last column in Table 1.

Table 2 presents the results on SVHN with 1,000 labels.For a fair comparison with previous GANs, we do notleverage the extra unlabeled data while some baselines(e.g., ADGM, SDGM and MMCVA) do. In this setting,Triple-GAN substantially outperforms the state-of-the-artmethods (e.g., ALI and Improved-GAN).

Table 3 compares the Triple-GAN with baselines on the CI-FAR10 dataset with 4,000 labels. We perform ZCA whiten-ing on the data for C following (Laine & Aila, 2016) butstill generate and estimate the raw images using G andD. We found that the consistency loss used works betterfor Triple-GAN and we implement a simple version of theΠ model in (Laine & Aila, 2016) with the learning rateannealing strategy mentioned in the general setting as thebaseline, which achieves an error rate of 19.2%. The smallmargin between Triple-GAN and the baseline may indicatethat the generator on CIFAR10 is far from optimal and can-not help the classifier as much as it does on MNIST andSVHN. Nevertheless, Triple-GAN is still competitive to thestate-of-the-art DGMs.

4.2. Generation

We demonstrate that Triple-GAN tends to achieve the de-sirable equilibrium by generating samples in various wayswith the exact models used in the semi-supervised classifi-cation. The generative model and the number of labels are


Table 1. Error rates (%) on the (partially) labeled MNIST dataset.Algorithm n = 20 n = 50 n = 100 n = 200 AllM1+M2 (Kingma et al., 2014) 3.33 (±0.14) 0.96VAT (Miyato et al., 2015) 2.33 0.64Ladder (Rasmus et al., 2015) 1.06 (±0.37) 0.57Conv-Ladder (Rasmus et al., 2015) 0.89 (±0.50)ADGM (Maaløe et al., 2016) 0.96 (±0.02)SDGM (Maaløe et al., 2016) 1.32 (±0.07)MMCVA (Li et al., 2016) 1.24 (±0.54) 0.31CatGAN (Springenberg, 2015) 1.39 (±0.28) 0.48Improved-GAN (Salimans et al., 2016) 16.77 (±4.52) 2.21 (±1.36) 0.93 (±0.07) 0.90 (±0.04)Triple-GAN (ours) 5.40 (±6.53) 1.59 (±0.69) 0.92 (±0.58) 0.66 (±0.16) 0.32

Table 2. Error rates (%) on the partially labeled SVHN dataset.The results with † are trained with more than 500,000 extra unla-beled data.

Algorithm n = 1000

M1+M2 (Kingma et al., 2014) 36.02 (±0.10)VAT (Miyato et al., 2015) 24.63

ADGM (Maaløe et al., 2016) 22.86 †

SDGM (Maaløe et al., 2016) 16.61(±0.24)†

MMCVA (Li et al., 2016) 4.95 (±0.18) †

Improved-GAN (Salimans et al., 2016) 8.11 (±1.3)ALI (Dumoulin et al., 2016) 7.3

Triple-GAN (ours) 5.83 (±0.20)

Table 3. Error rates (%) on the partially labeled CIFAR10 dataset.

Algorithm n = 4000

Ladder (Rasmus et al., 2015) 20.40 (±0.47)CatGAN (Springenberg, 2015) 19.58 (±0.58)Improved-GAN (Salimans et al., 2016) 18.63 (±2.32)ALI (Dumoulin et al., 2016) 18.3Triple-GAN (ours) 18.82 (±0.32)

the same to the previous method (Salimans et al., 2016).

In Fig. 6, we first compare the quality of images gener-ated by Triple-GAN on SVHN and the Improved-GANwith feature matching (Salimans et al., 2016)4, whichworks well for semi-supervised classification. We can seethat Triple-GAN outperforms the baseline by generatingfewer meaningless samples and clearer digits. Further-more, Triple-GAN avoids repeating to generate strange pat-terns as shown in the figure. The comparison on MNISTand CIFAR10 is presented in Appendix B. We also evaluate

4Though the Improved-GAN trained with minibatch discrimi-nation (Salimans et al., 2016) can generate good samples, it failsto predict labels accurately.

(a) Feature Matching (b) Triple-GANFigure 2. (a) Samples generated from Improved-GAN trainedwith feature matching. The strange pattern labeled with the redrectangle appears four times in the generation. (b) Samples gen-erated from Triple-GAN. We randomly sample equally distributedlabels and unshared latent variables for fair comparison in vision.

the samples on CIFAR10 quantitatively via inception scorefollowing (Salimans et al., 2016). The value of Triple-GANis 5.08 ± 0.09 while that of the Improved-GAN trainedwithout minibatch discrimination (Salimans et al., 2016) is3.87± 0.03, which agrees with the visual comparison.

Then, we show the ability of Triple-GAN to disentangleclasses and styles in Fig. 3. It can be seen that Triple-GANcan generate realistic data in a specific class and the la-tent factors encode meaningful physical factors like: scale,intensity, orientation, color and so on. DCGAN (Radfordet al., 2015) and ALI (Dumoulin et al., 2016) can gener-ate data class-conditionally given full labels, while Triple-GAN can do similar thing given incomplete label infor-mation. We illustrate images generated from four specificclasses on CIFAR10 in Fig. 4 and see more in Appendix C.In most cases, Triple-GAN is able to generate meaningfulimages with correct semantics.

Finally, we demonstrate the generalization capability of ourTriple-GAN on class-conditional latent space interpolationas in Fig. 5. Triple-GAN can transit smoothly from onesample to another with totally different visual factors with-out losing label semantics, which proves that Triple-GANs


(a) MNIST data (b) MNIST samples

(c) SVHN data (d) SVHN samples

(e) CIFAR10 data (f) CIFAR10 samples

Figure 3. (a), (c) and (e) are randomly selected labeled data. (b),(d) and (f) are samples from Triple-GAN, where each row sharesthe same label and each column shares the same latent variables.

can learn meaningful latent spaces class-conditionally in-stead of overfitting to the training data, especially labeleddata (See results on MNIST in Appendix D).

Overall, these results confirm that Triple-GAN avoid thecompetition between C and G and can lead to a situationwhere both the generation and classification are good insemi-supervised learning.

5. ConclusionsWe present triple generative adversarial networks (Triple-GAN), a unified game-theoretical framework with threeplayers—a generator, a discriminator and a classifier, todo semi-supervised learning with compatible utilities. Thedistributions characterized by the classifier and the class-conditional generator concentrate to the data distribution

(a) Airplane (b) Automobile

(c) Bird (d) Horse

Figure 4. Samples from four specific classes on CIFAR10.

under nonparametric assumptions. The classifier benefitsfrom the generator by the enforce of the discriminator im-plicitly and the pseudo discriminative loss explicitly, andthe generator generates images class-conditionally in semi-supervised learning thanks to the classifier, which can in-fer the labels for unlabeled data. Our empirical resultson several datasets demonstrate the promise—Triple-GANachieves comparable or superior classification performancethan state-of-the-art DGMs, while retaining the capabil-ity to generate meaningful images when the label informa-tion is incomplete. Moreover, Triple-GAN can disentanglestyles and classes and transfer smoothly on the data levelvia interpolation on the latent space.

ReferencesBurt, P. and Adelson, E. The Laplacian pyramid as a com-

pact image code. IEEE Transactions on communica-tions, 1983.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever,I., and Abbeel, P. InfoGAN: Interpretable representa-tion learning by information maximizing generative ad-versarial nets. In NIPS, 2016.

Denton, E. L., Chintala, S., and Fergus, R. Deep generativeimage models using a Laplacian pyramid of adversarialnetworks. In NIPS, 2015.

Donahue, J., Krähenbühl, P., and Darrell, T. Adversar-


(a) SVHN (b) CIFAR10

Figure 5. Class-conditional latent space interpolation. We first sample two random vectors in the latent space and interpolate linearlyfrom one to another. Then, we map these vectors to the data level given a fixed label for each class. Totally, 20 images are shown foreach class. We select two endpoints with clear semantics on CIFAR10 for better illustration.

ial feature learning. arXiv preprint arXiv:1605.09782,2016.

Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky,M., Mastropietro, O., and Courville, A. Adversariallylearned inference. arXiv preprint arXiv:1606.00704,2016.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Traininggenerative neural networks via maximum mean discrep-ancy optimization. arXiv preprint arXiv:1505.03906,2015.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,Warde-Farley, D., Ozair, S., Courville, A., and Bengio,Y. Generative adversarial nets. In NIPS, 2014.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf,B., and Smola, A. A kernel two-sample test. JMLR, 13(Mar):723–773, 2012.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estima-tion: A new estimation principle for unnormalized sta-tistical models. In AISTATS, 2010.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerat-ing deep network training by reducing internal covariateshift. arXiv preprint arXiv:1502.03167, 2015.

Kingma, D. and Ba, J. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variationalBayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling,M. Semi-supervised learning with deep generative mod-els. In NIPS, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers offeatures from tiny images. Citeseer, 2009.

Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242,2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Pro-ceedings of the IEEE, 86(11):2278–2324, 1998.

Li, C., Zhu, J., and Zhang, B. Max-margin deep generativemodels for (semi-) supervised learning. arXiv preprintarXiv:1611.07119, 2016.

Li, Y., Swersky, K., and Zemel, R. S. Generative momentmatching networks. In ICML, 2015.

Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther,O. Auxiliary deep generative models. arXiv preprintarXiv:1602.05473, 2016.

Miyato, T., Maeda, S.-i., Koyama, M., Nakae, K., andIshii, S. Distributional smoothing with virtual adversar-ial training. arXiv preprint arXiv:1507.00677, 2015.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., andNg, A. Y. Reading digits in natural images with unsu-pervised feature learning. In NIPS workshop on deeplearning and unsupervised feature learning, 2011.

Odena, A. Semi-supervised learning with generative adver-sarial networks. arXiv preprint arXiv:1606.01583, 2016.

Radford, A., Metz, L., and Chintala, S. Unsupervised rep-resentation learning with deep convolutional generativeadversarial networks. arXiv preprint arXiv:1511.06434,2015.

Rasmus, A., Berglund, M., Honkala, M., Valpola, H., andRaiko, T. Semi-supervised learning with ladder net-works. In NIPS, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochasticbackpropagation and approximate inference in deep gen-erative models. arXiv preprint arXiv:1401.4082, 2014.


Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,Radford, A., and Chen, X. Improved techniques fortraining GANs. In NIPS, 2016.

Springenberg, J. T. Unsupervised and semi-supervisedlearning with categorical generative adversarial net-works. arXiv preprint arXiv:1511.06390, 2015.

Theano Development Team. Theano: A Python frameworkfor fast computation of mathematical expressions. arXive-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

Theis, L., Oord, A. v. d., and Bethge, M. A note onthe evaluation of generative models. arXiv preprintarXiv:1511.01844, 2015.

Yang, J., Reed, S. E., Yang, M.-H., and Lee, H. Weakly-supervised disentangling with recurrent transformationsfor 3d view synthesis. In NIPS, 2015.

http://arxiv.org/abs/1605.02688http://arxiv.org/abs/1605.02688


A. Detailed Theoretical AnalysisLemma 3.1. For any fixed C and G, the optimal dis-criminator D of the game defined by the utility functionU(C,G,D) is

D∗C,G(x, y) =p(x, y)

p(x, y) + pα(x, y), (4)

where pα(x, y) := (1 − α)pg(x, y) + αpc(x, y) is a validdistribution .

Proof. Given the classifier and generator, the utility func-tion can be rewritten as

U(C,G,D) =

∫∫p(x, y) logD(x, y)dydx

+(1− α)∫∫

p(y)pg(z) log(1−D(G(z, y), y))dydz

+α

∫∫p(x)pc(y|x) log(1−D(x, y))dydx

=

∫∫pα(x, y) log(1−D(x, y))dydx

+

∫∫p(x, y) logD(x, y)dydx = f(D(x, y)).

Note that the function f(D(x, y)) achieves the maximumat p(x,y)p(x,y)+pα(x,y) .

Lemma 3.2. The global minimum value of V (C,G) is− log 4 and it is achieved if and only if p(x, y) = pα(x, y).

Proof. GivenD∗C,G, we can reformulate the minimax gamewith value function U as:

V (C,G)=maxD

U(C,G,D)

=

∫∫p(x, y) log

p(x, y)

p(x, y) + pα(x, y)dydx

+

∫∫pα(x, y) log

pα(x, y)

p(x, y) + pα(x, y)dydx.

Following the proof in GAN, the V (C,G) can be rewrittenas

V (C,G) = − log 4 + 2JSD(p(x, y)||pα(x, y)), (5)

where JSD is the Jensen-Shannon divergence, which isalways non-negative and the unique optimum is achievedif and only if p(x, y) = pα(x, y) = (1 − α)pg(x, y) +αpc(x, y).

Corollary 3.2.1. Given p(x, y) = pα(x, y), the marginaldistributions are the same for any pairs of p, pc and pg .

Proof. Remember that pg(x, y) = p(y)pg(x|y) andpc(x, y) = p(x)pc(y|x). Take integral with respect to xon both sides of p(x, y) = pα(x, y) to get∫

p(x, y)dx = (1− α)∫pg(x, y)dx+ α

∫pc(x, y)dx,

which indicates that

p(y) = (1−α)p(y) +αpc(y), i.e. pc(y) = p(y) = pg(y).

Similarly, it can be shown that pg(x) = p(x) = pc(x) bytaking integral with respect to y.

Theorem 3.3. The equilibrium of Ũ(C,G,D) is achievedif and only if p(x, y) = pg(x, y) = pc(x, y) withD∗C,G(x, y) =

12 and the optimum value is − log 4.

Proof. According to the definition, Ũ(C,G,D) =U(C,G,D) +RL, where

RL = Ep[− log pc(y|x)],

which can be rewritten as:

DKL(p(x, y)||pc(x, y)) +Hp(y|x).

Namely, minimizing RL is equivalent to minimizingDKL(p(x, y)||pc(x, y)), which is always non-negative andzero if and only if p(x, y) = pc(x, y). Besides, the pre-vious lemmas can also be applied to Ũ(C,G,D), whichindicates that p(x, y) = pα(x, y) at the global equilibrium,concluding the proof.

Corollary 3.3.1. Any additional regularization on the dis-tances between the marginal, conditional and joint distri-butions of any two players, will not change the global equi-librium of Ũ .

Proof. This conclusion is straightforward derived by theglobal equilibrium point of Ṽ .

Pseudo data loss We prove the equivalence of the twoforms of pseudo data loss in the main text as follows:

DKL(pg(x, y)||pc(x, y)) +Hpg (y|x)−DKL(pg(x)||p(x))

=

∫∫pg(x, y) log

pg(x, y)

pc(x, y)+ pg(x, y) log

1

pg(y|x)dxdy

−∫pg(x) log

pg(x)

p(x)dx

=

∫∫pg(x, y) log

pg(x, y)

pc(x, y)pg(y|x)dxdy

−∫∫

pg(x, y) logpg(x)

p(x)dxdy

=

∫∫pg(x, y) log

pg(x, y)p(x)

pc(x, y)pg(y|x)pg(x)dxdy

=Epg [− log pc(y|x)].

Note that the last equality holds as pc(x) = p(x).


(a) Feature Matching (b) Triple-GAN

(c) Feature Matching (d) Triple-GAN

Figure 6. (a) and (c): Samples generated from Improved-GANtrained with feature matching on MNIST and CIFAR10 datasets.Strange patterns repeat on CIFAR10. (b) and (d): Samples gener-ated from Triple-GAN.

B. Unconditional GenerationWe compare the samples generated from Triple-GAN andImproved-GAN on the MNIST and CIFAR10 datasets asin Fig. 6, where Triple-GAN shares the same architectureof generator and number of labeled data with the baseline.It can be seen that Triple-GAN outperforms the GANs thatare trained with the feature matching criterion on generat-ing indistinguishable samples.

C. Class-conditional Generation on CIFAR10We show more class-conditional generation results on CI-FAR10 in Fig. 7. Again, we can see that Triple-GAN cangenerate meaningful images in specific classes.

D. Interpolation on the MNIST datasetWe present the class-conditional interpolation on theMNIST dataset as in Fig. 8. We have the same conclusionas in main text that Triple-GAN is able to transfer smoothlyon the data level with clear semantics.

E. Detailed ArchitecturesWe list the detailed architectures of Triple-GAN onMNIST, SVHN and CIFAR10 datasets in Table 4, Table 5

(a) Cat (b) Deer

(c) Dog (d) Frog

(e) Ship (f) Truck

Figure 7. Samples from Triple-GAN given certain class on CI-FAR10.

Figure 8. Class-conditional interpolation for Triple-GAN onMNIST.

and Table 6, respectively.


Table 4. MNIST

Classifier C Discriminator D Generator G

Input 28×28 Gray Image Input 28×28 Gray Image, Ont-hot Class representation Input Class y, Noise z5×5 conv. 32 ReLU MLP 1000 units, lReLU, gaussian noise, weight norm MLP 500 units,

2×2 max-pooling, 0.5 dropout MLP 500 units, lReLU, gaussian noise, weight norm softplus, batch norm3×3 conv. 64 ReLU MLP 250 units, lReLU, gaussian noise, weight norm3×3 conv. 64 ReLU MLP 250 units, lReLU, gaussian noise, weight norm MLP 500 units,

2×2 max-pooling, 0.5 dropout MLP 250 units, lReLU, gaussian noise, weight norm softplus, batch norm3×3 conv. 128 ReLU MLP 1 unit, sigmoid, gaussian noise, weight norm3×3 conv. 128 ReLU MLP 784 units, sigmoid

Global pool10-class Softmax

Table 5. SVHN


Input: 32×32 Colored Image Input: 32×32 colored image, class y Input: Class y, Noise z0.2 dropout 0.2 dropout MP 8192 units,

3×3 conv. 128 lReLU, batch norm 3×3 conv. 32, lReLU, weight norm ReLU, batch norm3×3 conv. 128 lReLU, batch norm 3×3 conv. 32, lReLU, weight norm, stride 2 Reshape 512×4×43×3 conv. 128 lReLU, batch norm 5×5 deconv. 256. stride 2,

2×2 max-pooling, 0.5 dropout 0.2 dropout ReLU, batch norm3×3 conv. 256 lReLU, batch norm 3×3 conv. 64, lReLU, weight norm3×3 conv. 256 lReLU, batch norm 3×3 conv. 64, lReLU, weight norm, stride 23×3 conv. 256 lReLU, batch norm 5×5 deconv. 128. stride 2,

2×2 max-pooling, 0.5 dropout 0.2 dropout ReLU, batch norm3×3 conv. 512 lReLU, batch norm 3×3 conv. 128, lReLU, weight norm

NIN, 256 lReLU, batch norm 3×3 conv. 128, lReLU, weight normNIN, 128 lReLU, batch norm

Global pool Global pool 5×5 deconv. 3. stride 2,10-class Softmax, batch norm MLP 1 unit, sigmoid sigmoid, weight norm

Table 6. CIFAR10


Input: 32×32 Colored Image Input: 32×32 colored image, class y Input: Class y, Noise zGaussian noise 0.2 dropout MLP 8192 units,

3×3 conv. 128 lReLU, weight norm 3×3 conv. 32, lReLU, weight norm ReLU, batch norm3×3 conv. 128 lReLU, weight norm 3×3 conv. 32, lReLU, weight norm, stride 2 Reshape 512×4×43×3 conv. 128 lReLU, weight norm 5×5 deconv. 256.stride 2

2×2 max-pooling, 0.5 dropout 0.2 dropout ReLU, batch norm3×3 conv. 256 lReLU, weight norm 3×3 conv. 64, lReLU, weight norm3×3 conv. 256 lReLU, weight norm 3×3 conv. 64, lReLU, weight norm, stride 23×3 conv. 256 lReLU, weight norm 5×5 deconv. 128. stride 2

2×2 max-pooling, 0.5 dropout 0.2 dropout ReLU, batch norm3×3 conv. 512 lReLU, weight norm 3×3 conv. 128, lReLU, weight norm

NIN, 256 lReLU, weight norm 3×3 conv. 128, lReLU, weight normNIN, 128 lReLU, weight norm

Global pool Global pool 5×5 deconv. 3. stride 210-class Softmax wieh weight norm MLP 1 unit, sigmoid, weight norm tanh, weight norm

Documents

Triple Generative Adversarial NetsTriple-GAN, the shared discriminator enforces the mixture distributions deﬁned by the classiﬁer and generator to be similar to the true data distribution,