AdaGAN: Boosting Generative Models
Ilya Tolstikhin, [email protected]
joint work with Gelly (2), Bousquet (2), Simon-Gabriel (1), Schölkopf (1)

(1) MPI for Intelligent Systems, (2) Google Brain
[Figures: example GAN outputs. Credits: (Radford et al., 2015); (Nguyen et al., 2016); (YouTube: Yota Ishida)]
Motivation

- Generative Adversarial Networks, Variational Autoencoders, and other unsupervised representation learning / generative modelling techniques ROCK!
- Dozens of papers being submitted, discussed, published, ...
- Among papers submitted to ICLR 2017, 15 have titles matching 'Generative Adversarial*'.
- NIPS 2016 Tutorial on GANs.
- The GAN Workshop at NIPS 2016 was the second-largest.

That's all very exciting. However:

- GANs are very hard to train.
- The field is, apparently, in its "trial-and-error" phase. People try all kinds of heuristic hacks. Sometimes they work, sometimes not...
- Very few papers manage to go beyond heuristics and provide a deeper conceptual understanding of what's going on.
AdaGAN: adaptive mixtures of GANs
Contents

1. Generative Adversarial Networks and f-divergences

   GANs are trying to minimize a specific f-divergence,

   $$\min_{P_{model} \in \mathcal{P}} D_f(P_{data} \,\|\, P_{model}).$$

   At least, this is one way to explain what they are doing.

2. Minimizing f-divergences with mixtures of distributions

   We study theoretical properties of the problem

   $$\min_{\alpha_T,\, P_T \in \mathcal{P}} D_f\Big(P_{data} \,\Big\|\, \sum_{t=1}^T \alpha_t P_t\Big)$$

   and show that it can be reduced to

   $$\min_{P_T \in \mathcal{P}} D_f(P_{data} \,\|\, P_T). \quad (*)$$

3. Back to GANs

   Now we use GANs to train $P_T$ in (*), which results in AdaGAN.
f-GAN: Adversarial training for f-divergence minimization

For any probability distributions P and Q, the f-divergence is

$$D_f(Q \,\|\, P) := \int f\!\left(\frac{dQ}{dP}(x)\right) dP(x),$$

where $f : (0, \infty) \to \mathbb{R}$ is convex and satisfies $f(1) = 0$.

Properties of f-divergences:

1. $D_f(P\|Q) \ge 0$, and $D_f(P\|Q) = 0$ when $P = Q$;
2. $D_f(P\|Q) = D_{f^\circ}(Q\|P)$ for $f^\circ(x) := x f(1/x)$;
3. $D_f(P\|Q)$ is symmetric when $f(x) = x f(1/x)$ for all $x$;
4. $D_f(P\|Q) = D_g(P\|Q)$ for any $g(x) = f(x) + C(x - 1)$;
5. Taking $f_0(x) := f(x) - (x - 1) f'(1)$ is a good choice: it is nonnegative, nonincreasing on $(0, 1]$, and nondecreasing on $[1, \infty)$.
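These definitions are easy to play with numerically. Below is a small sketch (not from the slides; the discrete setting and the distributions are illustrative) that evaluates several f-divergences for two discrete distributions:

```python
# Sketch: evaluating f-divergences between two discrete distributions.
# D_f(Q||P) = sum_x f(q(x)/p(x)) * p(x), assuming p(x) > 0 everywhere.
import numpy as np

def f_divergence(q, p, f):
    return np.sum(f(q / p) * p)

p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.10, 0.20, 0.30, 0.40])

kl  = f_divergence(q, p, lambda x: x * np.log(x))      # KL(Q||P)
rkl = f_divergence(q, p, lambda x: -np.log(x))         # KL(P||Q)
tv  = f_divergence(q, p, lambda x: np.abs(x - 1))      # total variation
js  = f_divergence(q, p, lambda x: x * np.log(x) - (x + 1) * np.log((x + 1) / 2))

print(kl, rkl, tv, js)  # all nonnegative, and zero when q == p
```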
f-GAN: Adversarial training for f-divergence minimization

Examples of f-divergences:

1. Kullback-Leibler divergence for $f(x) = x \log x$;
2. Reverse Kullback-Leibler divergence for $f(x) = -\log x$;
3. Total Variation distance for $f(x) = |x - 1|$;
4. Jensen-Shannon divergence for $f(x) = -(x + 1) \log\frac{x + 1}{2} + x \log x$:

$$JS(P\|Q) = KL\!\left(P \,\Big\|\, \frac{P + Q}{2}\right) + KL\!\left(Q \,\Big\|\, \frac{P + Q}{2}\right).$$
f-GAN: Adversarial training for f-divergence minimization

Kullback-Leibler minimization: assume $X_1, \ldots, X_N \sim Q$. Then

$$\min_P KL(Q\|P) \;\Leftrightarrow\; \min_P \int \log\!\left(\frac{dQ}{dP}(x)\right) dQ(x) \;\Leftrightarrow\; \min_P -\!\int \log(dP(x))\, dQ(x)$$

$$\Leftrightarrow\; \max_P \int \log(dP(x))\, dQ(x) \;\approx\; \max_P \frac{1}{N} \sum_{i=1}^N \log(dP(X_i)).$$

Reverse Kullback-Leibler minimization:

$$\min_P KL(P\|Q) \;\Leftrightarrow\; \min_P \int \log\!\left(\frac{dP}{dQ}(x)\right) dP(x) \;\Leftrightarrow\; \min_P -H(P) - \int \log(dQ(x))\, dP(x)$$

$$\Leftrightarrow\; \max_P\; H(P) + \int \log(dQ(x))\, dP(x).$$
f-GAN: Adversarial training for f-divergence minimization

Now we will follow (Nowozin et al., NIPS 2016).

For any probability distributions P and Q over $\mathcal{X}$, the f-divergence is

$$D_f(Q\,\|\,P) := \int f\!\left(\frac{dQ}{dP}(x)\right) dP(x).$$

Variational representation: a very neat result, saying that

$$D_f(Q\,\|\,P) = \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))],$$

where $f^*(t) := \sup_u \{t \cdot u - f(u)\}$ is the convex conjugate of f.

$D_f$ can be computed by solving an optimization problem!

The f-divergence minimization can now be written as:

$$\inf_P D_f(Q\,\|\,P) \;\Leftrightarrow\; \inf_P \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))].$$
f-GAN: Adversarial training for f-divergence minimization

The f-divergence minimization can now be written as:

$$\inf_P D_f(Q\,\|\,P) \;\Leftrightarrow\; \inf_P \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))].$$

1. Take P to be the distribution of $G_\theta(Z)$, where $Z \sim P_Z$ is a latent noise and $G_\theta$ is a deep net;
2. The JS divergence corresponds to $f(x) = -(x + 1)\log\frac{x + 1}{2} + x \log x$ and $f^*(t) = -\log(2 - e^t)$, and thus the domain of $f^*$ is $(-\infty, \log 2)$;
3. Take $g = g_f \circ D_\omega$, where $g_f(v) = \log 2 - \log(1 + e^{-v})$ and $D_\omega$ is a deep net (check that $g(x) < \log 2$ for all $x$).

Then, up to an additive $2 \log 2$ term, we get:

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim Q}\left[\log \frac{1}{1 + e^{-D_\omega(X)}}\right] + E_{Z \sim P_Z}\left[\log\!\left(1 - \frac{1}{1 + e^{-D_\omega(G_\theta(Z))}}\right)\right]$$

Compare to the original GAN objective (Goodfellow et al., NIPS 2014):

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim P_d}[\log D_\omega(X)] + E_{Z \sim P_Z}[\log(1 - D_\omega(G_\theta(Z)))].$$
f-GAN: Adversarial training for f-divergence minimization

Then we get the original GAN objective (Goodfellow et al., NIPS 2014):

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim P_d}[\log D_\omega(X)] + E_{Z \sim P_Z}[\log(1 - D_\omega(G_\theta(Z)))]. \quad (*)$$

Let's denote the distribution of $G_\theta(Z)$ when $Z \sim P_Z$ by $P_\theta$.

- For any fixed generator $G_\theta$, the theoretically optimal $D^*$ is:

  $$D^*_\theta(x) = \frac{dP_d(x)}{dP_d(x) + dP_\theta(x)}.$$

- If we insert this back into (*), we get

  $$\inf_{G_\theta}\; 2 \cdot JSD(P_d \,\|\, P_\theta) - \log 4,$$

  where JSD is the Jensen-Shannon divergence with the standard 1/2 normalization (half of the JS defined earlier).
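This classical identity is easy to sanity-check numerically. A sketch (illustrative Gaussian choices for $P_d$ and $P_\theta$; not from the slides):

```python
# Sketch: with the optimal discriminator D*(x) = p_d / (p_d + p_g), the inner
# GAN value equals 2 * JSD(P_d || P_g) - log 4 (JSD with 1/2 normalization).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p_d = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
p_g = lambda x: norm.pdf(x, loc=1.0, scale=1.0)
d_star = lambda x: p_d(x) / (p_d(x) + p_g(x))

gan_value = quad(lambda x: p_d(x) * np.log(d_star(x))
                 + p_g(x) * np.log(1.0 - d_star(x)), -12, 12)[0]

m = lambda x: 0.5 * (p_d(x) + p_g(x))
jsd = 0.5 * (quad(lambda x: p_d(x) * np.log(p_d(x) / m(x)), -12, 12)[0]
             + quad(lambda x: p_g(x) * np.log(p_g(x) / m(x)), -12, 12)[0])

print(gan_value, 2.0 * jsd - np.log(4.0))  # the two numbers agree
```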
f-GAN: Adversarial training for f-divergence minimization

The f-divergence minimization can now be written as:

$$\inf_P D_f(Q\,\|\,P) \;\Leftrightarrow\; \inf_P \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))].$$

This is GREAT, because:

1. We can estimate $E_{X \sim Q}[g(X)]$ using an i.i.d. sample (the data) $X_1, \ldots, X_N \sim Q$:

   $$E_{X \sim Q}[g(X)] \approx \frac{1}{N} \sum_{i=1}^N g(X_i);$$

2. Often we don't know the analytical form of P, but can only sample from P. No problem! Sample i.i.d. $Y_1, \ldots, Y_M \sim P$ and write:

   $$E_{Y \sim P}[f^*(g(Y))] \approx \frac{1}{M} \sum_{j=1}^M f^*(g(Y_j)).$$

This results in implicit generative modelling: using complex model distributions, which are analytically intractable but easy to sample from.
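A toy illustration of this sample-based estimate (a sketch; here Q and P are Gaussians so the optimal witness, which depends on the density ratio q/p, is known in closed form, but only samples enter the estimate):

```python
# Sketch: sample-based variational estimate of JS(Q||P) for two Gaussians,
# plugging the (here analytically known) optimal witness into the objective.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200000)   # X_i ~ Q (think: the data)
Y = rng.normal(0.5, 1.0, size=200000)   # Y_j ~ P (think: the model)

f_star = lambda t: -np.log(2.0 - np.exp(t))             # conjugate of the JS f
ratio = lambda x: np.exp(-0.5 * x + 0.125)              # q(x)/p(x), closed form
g = lambda x: np.log(2.0) - np.log1p(1.0 / ratio(x))    # optimal witness, < log 2

estimate = g(X).mean() - f_star(g(Y)).mean()
print(estimate)  # ~ JS(Q||P); any other admissible g gives a lower bound
```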
f-GAN: Adversarial training for f-divergence minimization

Note that there is still a difference between running a GAN:

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim P_d}[\log D_\omega(X)] + E_{Z \sim P_Z}[\log(1 - D_\omega(G_\theta(Z)))]$$

and minimizing $JS(P_{data} \,\|\, P_{model})$:

$$\inf_{P_{model}} \sup_{g}\; E_{X \sim P_{data}}[g(X)] - E_{Y \sim P_{model}}[f^*_{JS}(g(Y))],$$

because:

1. We replaced expectations with sample averages. Uniform laws of large numbers may not apply, as our function classes are huge;
2. Instead of taking the supremum over all possible witness functions g, we optimize over a restricted class of DNNs (which is still huge);
3. In practice we never optimize $g$ / $D_\omega$ "to the end". This leads to vanishing gradients and other bad things.
Minimizing f-divergences with mixtures of distributions

Assumption: let's assume we know, more or less, how to solve

$$\min_{Q \in \mathcal{G}} D_f(Q \,\|\, P)$$

for any f and any P, given an i.i.d. sample from P.

New problem: now we want to approximate an unknown $P_d$ in $D_f$ using a mixture model of the form

$$P^T_{model} := \sum_{i=1}^T \alpha_i P_i.$$

Greedy strategy:

1. Train $P_1$ to minimize $D_f(P_1 \,\|\, P_d)$, set $\alpha_1 = 1$, and get $P^1_{model} = P_1$;
2. Assume we have trained $P_1, \ldots, P_t$ with weights $\alpha_1, \ldots, \alpha_t$. We are going to train one more component Q, choose a weight $\beta$, and update (see the sampling sketch below)

   $$P^{t+1}_{model} = \sum_{i=1}^t (1 - \beta)\alpha_i P_i + \beta Q.$$
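Sampling from the updated mixture stays trivial. A minimal sketch (illustrative one-dimensional Gaussian components standing in for trained generators):

```python
# Sketch: sampling from P^{t+1} = sum_i (1 - beta) * alpha_i * P_i + beta * Q.
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(samplers, weights, n):
    """samplers: callables m -> m samples; weights: nonnegative, sum to 1."""
    counts = rng.multinomial(n, weights)
    return np.concatenate([s(m) for s, m in zip(samplers, counts) if m > 0])

beta = 0.3
alphas = [0.5, 0.5]                                  # weights of P_1, P_2
weights = [(1 - beta) * a for a in alphas] + [beta]  # append the new component Q
samplers = [lambda m: rng.normal(-2.0, 1.0, m),      # stand-ins for trained G_1, G_2
            lambda m: rng.normal(+2.0, 1.0, m),
            lambda m: rng.normal(0.0, 1.0, m)]       # the new component Q
x = sample_mixture(samplers, weights, 10000)
print(x.shape, x.mean())
```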
Minimizing f-divergences with mixtures of distributions

The goal: we want to choose $\beta \in [0, 1]$ and Q greedily, so as to minimize

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d),$$

where we denoted $P_g := P^t_{model}$.

A modest goal: find $\beta \in [0, 1]$ and Q such that, for some $c < 1$:

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \le c \cdot D_f(P_g \,\|\, P_d).$$

Inefficient in practice: as $t \to \infty$ we expect $\beta \to 0$. Only a $\beta$ fraction of points gets sampled from Q, so it gets harder and harder to tune Q.
Minimizing f-divergences with mixtures of distributions

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \quad (*)$$

Workaround: instead optimize an upper bound on (*), which contains a term $D_f(Q \,\|\, Q_0)$ for some $Q_0$:

Lemma. Given $P_d$, $P_g$, and $\beta \in [0, 1]$, for any Q and any R with $\beta\, dR \le dP_d$,

$$(*) \le \beta D_f(Q \,\|\, R) + (1 - \beta)\, D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1 - \beta}\right).$$

If furthermore $D_f$ is a Hilbertian metric, then, for any R, we have

$$\sqrt{(*)} \le \sqrt{\beta D_f(Q \,\|\, R)} + \sqrt{D_f((1 - \beta) P_g + \beta R \,\|\, P_d)}.$$

The JS divergence, TV, Hellinger, and others are Hilbertian metrics.
Recipe: choose R to minimize the right-most terms in these bounds, and then minimize the first term $D_f(Q \,\|\, R^*)$, which we can do given an i.i.d. sample from $R^*$.
Minimizing f-divergences with mixtures of distributions

First surrogate upper bound:

Lemma. Given $P_d$, $P_g$, and $\beta \in [0, 1]$, for any Q and R, if $D_f$ is Hilbertian,

$$(*) \le \Big(\sqrt{\beta D_f(Q \,\|\, R)} + \sqrt{D_f((1 - \beta) P_g + \beta R \,\|\, P_d)}\Big)^2.$$

The optimal $R^*$ is given in the next result:

Theorem. Given $P_d$, $P_g$, and $\beta \in (0, 1]$, the solution to

$$\min_{R \in \mathcal{P}} D_f((1 - \beta) P_g + \beta R \,\|\, P_d),$$

where $\mathcal{P}$ is the class of all probability distributions, has the density

$$dR^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1 - \beta)\, dP_g(x)\big)_+$$

for some unique $\lambda^*$ satisfying $\int dR^*_\beta = 1$. Moreover, $\lambda^* = 1$ if and only if $P_d\big((1 - \beta)\, dP_g > dP_d\big) = 0$, which is equivalent to $\beta\, dR^*_\beta = dP_d - (1 - \beta)\, dP_g$.
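$\lambda^*$ has no closed form, but the total mass of $dR^*_\beta$ is continuous and nondecreasing in $\lambda$, so bisection finds it quickly. A discrete-space sketch (illustrative densities, not the authors' code):

```python
# Sketch: solving for lambda* on a discrete space, where
# dR*_beta = (1/beta) * (lambda* * p_d - (1 - beta) * p_g)_+ must sum to 1.
import numpy as np

def lambda_star(p_d, p_g, beta, tol=1e-12):
    mass = lambda lam: np.maximum(lam * p_d - (1 - beta) * p_g, 0.0).sum() / beta
    lo, hi = 0.0, 1.0
    while mass(hi) < 1.0:        # mass is nondecreasing in lambda; grow bracket
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) < 1.0 else (lo, mid)
    return hi

p_d = np.array([0.4, 0.4, 0.1, 0.1])   # data distribution
p_g = np.array([0.1, 0.1, 0.4, 0.4])   # current model: wrong on every state
beta = 0.5
lam = lambda_star(p_d, p_g, beta)
r_star = np.maximum(lam * p_d - (1 - beta) * p_g, 0.0) / beta
print(lam, r_star, r_star.sum())       # lambda* = 0.75 here; r_star sums to 1
```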
Minimizing f-divergences with mixtures of distributions

Second surrogate upper bound:

Lemma. Given $P_d$, $P_g$, and $\beta \in [0, 1]$, for any Q and any R with $\beta\, dR \le dP_d$,

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \le \beta D_f(Q \,\|\, R) + (1 - \beta)\, D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1 - \beta}\right).$$

The optimal $R^\dagger$ is given in the next result:

Theorem. Given $P_d$, $P_g$, and $\beta \in (0, 1]$, assume $P_d(dP_g = 0) < \beta$. Then the solution to

$$\min_{R \colon \beta dR \le dP_d} D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1 - \beta}\right)$$

is given by the density

$$dR^\dagger_\beta(x) = \frac{1}{\beta}\big(dP_d(x) - \lambda^\dagger (1 - \beta)\, dP_g(x)\big)_+$$

for a unique $\lambda^\dagger \ge 1$ satisfying $\int dR^\dagger_\beta = 1$.
Minimizing f-divergences with mixtures of distributions

Two different optimal solutions:

$$dR^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1 - \beta)\, dP_g(x)\big)_+$$

$$dR^\dagger_\beta(x) = \frac{1}{\beta}\big(dP_d(x) - \lambda^\dagger (1 - \beta)\, dP_g(x)\big)_+$$

- Very symmetric-looking, but no obvious connections found...
- Neither depends on f!
- $\lambda^*$ and $\lambda^\dagger$ are implicitly defined via fixed-point equations.
- $dR^*_\beta$ is also a solution to the original problem, that is, of
  $$\min_{Q \in \mathcal{P}} D_f((1 - \beta) P_g + \beta Q \,\|\, P_d).$$
- We see that setting $\beta = 1$ is the best possible choice in theory (with $\beta = 1$ the mixture reduces to the single new component). So in practice we'll have to use various heuristics for choosing $\beta$, including uniform, constant, ...
Minimizing f-divergences with mixtures of distributions

Recipe: choose $R^*$ to minimize the right-most terms in the surrogate upper bounds, and then minimize the first term $D_f(Q \,\|\, R^*)$.

- We found the optimal $R^*$ distributions. Let's now assume we are super powerful and managed to find Q with $D_f(Q \,\|\, R^*) = 0$.
- In other words, we are going to update our model:
  $$P_g \;\Rightarrow\; (1 - \beta) P_g + \beta Q$$
  with Q either $R^*_\beta$ or $R^\dagger_\beta$.
- How well are we doing in terms of the original objective $D_f((1 - \beta) P_g + \beta Q \,\|\, P_d)$?

Theorem. We have:

$$D_f\big((1 - \beta) P_g + \beta R^*_\beta \,\big\|\, P_d\big) \le D_f\big((1 - \beta) P_g + \beta P_d \,\big\|\, P_d\big) \le (1 - \beta)\, D_f(P_g \,\|\, P_d)$$

and

$$D_f\big((1 - \beta) P_g + \beta R^\dagger_\beta \,\big\|\, P_d\big) \le (1 - \beta)\, D_f(P_g \,\|\, P_d).$$

A numeric sanity check of the middle inequality is sketched below.
Minimizing f-divergences with mixtures of distributions

Other highlights:

- A more careful convergence analysis. In particular, we provide a necessary and sufficient condition for the iterative process to converge in a finite number of steps: there exists $M > 0$ such that
  $$P_d\big((1 - \beta)\, dP_1 > M\, dP_d\big) = 0.$$
  In other words, the density ratio $dP_1 / dP_d$ must be bounded $P_d$-almost surely.
- In practice we do not attain $D_f(Q \,\|\, R^*) = 0$. We prove that given
  $$D_f(Q \,\|\, R^*_\beta) \le \gamma\, D_f(P_g \,\|\, P_d) \quad \text{or} \quad D_f(Q \,\|\, R^\dagger_\beta) \le \gamma\, D_f(P_g \,\|\, P_d),$$
  we achieve
  $$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \le c \cdot D_f(P_g \,\|\, P_d)$$
  with $c = c(\gamma, \beta) < 1$. This is similar in spirit to "weak learnability" in AdaBoost.
AdaGAN: adaptive mixtures of GANs

- We want to find Q so as to minimize $D_f(Q \,\|\, R^*_\beta)$, where
  $$dR^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1 - \beta)\, dP_g(x)\big)_+.$$
- We would like to use a GAN for that. But how do we sample from $R^*_\beta$?
- Let's rewrite:
  $$dR^*_\beta(x) = \frac{dP_d(x)}{\beta}\left(\lambda^* - (1 - \beta)\frac{dP_g}{dP_d}(x)\right)_+.$$
- Recall: $E_{P_d}[\log D(X)] + E_{P_g}[\log(1 - D(Y))]$ is maximized by
  $$D^*(x) = \frac{dP_d(x)}{dP_d(x) + dP_g(x)} \;\Leftrightarrow\; \frac{dP_g}{dP_d}(x) = \frac{1 - D^*(x)}{D^*(x)} =: h(x).$$
- Each training sample ends up with a weight
  $$w_i = \frac{p_i}{\beta}\big(\lambda^* - (1 - \beta)\, h(x_i)\big)_+ \approx \frac{1}{N\beta}\big(\lambda^* - (1 - \beta)\, h(x_i)\big)_+. \quad (**)$$
- Finally, we compute $\lambda^*$ using the normalization constraint in (**).
AdaGAN: adaptive mixtures of GANs

Each training sample ends up with a weight

$$w_i = \frac{p_i}{\beta}\big(\lambda^* - (1 - \beta)\, h(x_i)\big)_+ \approx \frac{1}{N\beta}\left(\lambda^* - (1 - \beta)\frac{1 - D(x_i)}{D(x_i)}\right)_+.$$

Now we run a usual GAN with this importance-weighted sample.

- $D(x_i) \approx 1$: our current model $P_g$ is "missing" this point. Big $w_i$.
- $D(x_i) \approx 0$: $x_i$ is covered by $P_g$ nicely. Zero $w_i$.
- The better our model gets, the more data points will be zeroed out.
- This leads to an aggressive, adaptive, iterative procedure.
- $\beta$ can also be chosen so as to force $\#\{w_i : w_i > 0\} \approx r \cdot N$.

A minimal sketch of this reweighting step follows.
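The sketch below is illustrative (empirical weights $p_i = 1/N$, stand-in discriminator outputs; `adagan_weights` is a hypothetical helper, not the authors' code). $\lambda^*$ is found by bisection on the constraint that the weights sum to one:

```python
# Sketch: AdaGAN-style sample weights from discriminator outputs, following
# w_i ≈ (1/(N beta)) * (lambda* - (1 - beta) * h(x_i))_+ with h = (1 - D) / D.
import numpy as np

def adagan_weights(d_out, beta, tol=1e-12):
    n = len(d_out)
    h = (1.0 - d_out) / d_out                     # estimated dP_g/dP_d at x_i
    total = lambda lam: np.maximum(lam - (1 - beta) * h, 0.0).sum() / (n * beta)
    lo, hi = 0.0, 1.0
    while total(hi) < 1.0:                        # total is nondecreasing in lam
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    return np.maximum(hi - (1 - beta) * h, 0.0) / (n * beta)

rng = np.random.default_rng(0)
d = rng.uniform(0.05, 0.95, size=1000)  # stand-in D(x_i) on the training set
w = adagan_weights(d, beta=0.4)
print(w.sum(), (w > 0).mean())  # weights sum to ~1; "missed" points (D near 1)
                                # carry large weights, covered points get zero
```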
AdaGAN: adaptive mixtures of GANs

Coverage metric C: the mass of the true data "covered" by the model:

$$C := P_{data}\big(dP_{model}(X) > t\big), \quad \text{where } t \text{ is such that } P_{model}(dP_{model} > t) = 0.95.$$

Experimental setting: mixture of 5 two-dimensional Gaussians.

[Figure: coverage C for "Best of T GANs", "Bagging", and "AdaGAN". Blue points: medians over 35 random runs; green error bars: 90% intervals (not confidence intervals for the medians).]
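The metric is straightforward to compute when the model density is known. A sketch (an illustrative 1-D analogue of the 5-Gaussian setup; a model covering 2 of the 5 modes scores C ≈ 0.4):

```python
# Sketch: coverage C = P_data(dP_model(X) > t), with t the 5%-quantile of the
# model's own density values, so that P_model(dP_model > t) = 0.95.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

model_pdf = lambda x: 0.5 * norm.pdf(x, -4, 0.25) + 0.5 * norm.pdf(x, 0, 0.25)
model_samples = np.concatenate([rng.normal(-4, 0.25, 5000),
                                rng.normal(0, 0.25, 5000)])
t = np.quantile(model_pdf(model_samples), 0.05)   # P_model(pdf > t) = 0.95

modes = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])     # the true 5-mode mixture
data = rng.normal(modes[rng.integers(0, 5, 20000)], 0.25)
C = (model_pdf(data) > t).mean()
print(C)   # ~0.4: the model covers two of the five true modes
```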
AdaGAN: adaptive mixtures of GANs

Coverage metric C as above. Heuristics for choosing $\beta_t$:

- TopKLast r: $\beta_t$ is chosen to keep $\#\{w_i : w_i > 0\} \approx r \cdot N$, zeroing out the rest of the training points;
- Boosted: AdaGAN with $\beta_t = 1/t$.
AdaGAN: adaptive mixtures of GANs

Coverage metric C as above. Experimental setting: mixture of 10 two-dimensional Gaussians.
AdaGAN: adaptive mixtures of GANs

Conclusions

Pros:

- It seems that AdaGAN is solving the missing-modes problem;
- AdaGAN is compatible with any implementation of GAN, and more generally with any other implicit generative modelling technique;
- The mixture is still easy to sample from;
- Easy to implement, with great freedom in choosing heuristics;
- Theoretically grounded!

Cons:

- We lose the nice "manifold" structure in the latent space;
- Increased computational complexity;
- Individual GANs are still very unstable.
Thank you!

AdaGAN: Boosting Generative Models. Tolstikhin, Gelly, Bousquet, Simon-Gabriel, Schölkopf, NIPS 2017.
https://github.com/tolstikhin/adagan
Minimizing the optimal transport

Optimal transport for a cost function $c(x, y) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ is

$$W_c(P_X, P_G) := \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_G)} E_{(X,Y) \sim \Gamma}[c(X, Y)].$$

If $P_G(Y \mid Z = z) = \delta_{G(z)}$ for all $z \in \mathcal{Z}$, where $G : \mathcal{Z} \to \mathcal{X}$, we have

$$W_c(P_X, P_G) = \inf_{Q \colon Q_Z = P_Z} E_{P_X} E_{Q(Z|X)}\big[c(X, G(Z))\big],$$

where $Q_Z$ is the marginal distribution of Z when $X \sim P_X$ and $Z \sim Q(Z|X)$.
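As a toy illustration of $W_c$ (a sketch; in 1-D with $c(x, y) = |x - y|$ the optimal coupling is the monotone one, and scipy computes the resulting distance directly from samples):

```python
# Sketch: 1-Wasserstein optimal transport between two samples in 1-D.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20000)   # X ~ P_X (data)
y = rng.normal(0.5, 1.0, size=20000)   # Y = G(Z), Z ~ P_Z (model samples)
print(wasserstein_distance(x, y))      # ~0.5 for these equal-variance Gaussians
```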
Relaxing the constraint

$$W_c(P_X, P_G) = \inf_{Q \colon Q_Z = P_Z} E_{P_X} E_{Q(Z|X)}\big[c(X, G(Z))\big].$$

Penalized Optimal Transport (POT) replaces the constraint with a penalty:

$$POT(P_X, P_G) := \inf_{Q} E_{P_X} E_{Q(Z|X)}\big[c(X, G(Z))\big] + \lambda \cdot D(Q_Z, P_Z)$$

and uses adversarial training in the $\mathcal{Z}$ space to approximate D.

- For the 2-Wasserstein distance, $c(X, Y) = \|X - Y\|_2^2$, POT recovers Adversarial Auto-Encoders (Makhzani et al., 2016);
- For the 1-Wasserstein distance, $c(X, Y) = \|X - Y\|_2$, POT and WGAN are solving the same problem from the primal and dual forms, respectively;
- Importantly, unlike VAE, POT does not force $Q(Z \mid X = x)$ to intersect for different $x$, which is known to lead to blurriness.

(Bousquet et al., 2017)
Further details on GAN: "the log trick"

$$\inf_{G_\theta} \sup_{T_\omega}\; E_{X \sim P_d}[\log T_\omega(X)] + E_{Z \sim P_Z}[\log(1 - T_\omega(G_\theta(Z)))]$$

In practice (verify this!) $G_\theta$ is often trained to instead minimize

$$-E_{Z \sim P_Z}[\log(T_\omega(G_\theta(Z)))] \quad \text{("the log trick")}$$

One possible explanation: for any $f_r$ and $P_G$, the optimal $T^*$ in the variational representation of $D_{f_r}(P_X \| P_G)$ has the form

$$T^*(x) := f_r'\!\left(\frac{p_X(x)}{p_G(x)}\right) \;\Leftrightarrow\; \frac{p_X(x)}{p_G(x)} = (f_r')^{-1}\big(T^*(x)\big)$$

1. Approximate $T^* \approx T_{\omega^*}$ using adversarial training with $f_r$;
2. Express $\frac{p_X}{p_G}(x) = (f_r')^{-1}(T_{\omega^*}(x))$;
3. Approximate the desired f-divergence $D_f(P_X \| P_G)$ using

$$\int_{\mathcal{X}} f\!\left(\frac{p_X(x)}{p_G(x)}\right) p_G(x)\, dx \approx \int_{\mathcal{X}} f\big((f_r')^{-1}(T_{\omega^*}(x))\big)\, p_G(x)\, dx.$$

Applying this argument with the JS $f_r$ and $f(x) = \log(1 + x^{-1})$ recovers the log trick.

The corresponding f-divergence is much more mode-seeking than the reverse KL divergence.
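A complementary, more mechanical view of why the trick is used in practice (a sketch, not from the slides; gradients are taken with respect to the discriminator logit $s$, with $T_\omega = \sigma(s)$): the saturating loss $\log(1 - T_\omega)$ has vanishing gradients exactly where a confident discriminator puts generated points ($T_\omega \approx 0$), while $-\log T_\omega$ does not:

```python
# Sketch: gradient magnitudes of the two generator losses w.r.t. the logit s,
# where T = sigmoid(s):  d/ds log(1 - T) = -T,  d/ds (-log T) = -(1 - T).
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
s = np.array([-6.0, -3.0, 0.0, 3.0])   # logits on generated points
T = sigmoid(s)

print(-T)           # saturating loss: ~0 gradient when T ≈ 0 (early training)
print(-(1.0 - T))   # "log trick" loss: gradient stays ~1 when T ≈ 0
```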