AdaGAN: Boosting Generative Models
Ilya Tolstikhin, [email protected]
joint work with Gelly (2), Bousquet (2), Simon-Gabriel (1), Schölkopf (1)

(1) MPI for Intelligent Systems, (2) Google Brain
[Figures: example GAN outputs. Credits: (Radford et al., 2015); (Nguyen et al., 2016); (YouTube: Yota Ishida)]
Motivation

- Generative Adversarial Networks, Variational Autoencoders, and other unsupervised representation learning / generative modelling techniques ROCK!
- Dozens of papers being submitted, discussed, published, ...
- Among papers submitted to ICLR 2017, 15 have titles matching 'Generative Adversarial*'.
- NIPS 2016 Tutorial on GANs.
- The GAN Workshop at NIPS 2016 was the second-largest.

That's all very exciting. However:

- GANs are very hard to train.
- The field is, apparently, in its "trial-and-error" phase. People try all kinds of heuristic hacks. Sometimes they work, sometimes not...
- Very few papers manage to go beyond heuristics and provide a deeper conceptual understanding of what's going on.
AdaGAN: adaptive mixtures of GANs
Contents

1. Generative Adversarial Networks and f-divergences

   GANs are trying to minimize a specific f-divergence,

   $$\min_{P_{model} \in \mathcal{P}} D_f(P_{data} \,\|\, P_{model}).$$

   At least, this is one way to explain what they are doing.

2. Minimizing f-divergences with mixtures of distributions

   We study theoretical properties of the problem

   $$\min_{\alpha_T,\, P_T \in \mathcal{P}} D_f\Big(P_{data} \,\Big\|\, \sum_{t=1}^T \alpha_t P_t\Big)$$

   and show that it can be reduced to

   $$\min_{P_T \in \mathcal{P}} D_f(P_{data} \,\|\, P_T). \quad (*)$$

3. Back to GANs

   Now we use GANs to train $P_T$ in (*), which results in AdaGAN.
f-GAN: Adversarial training for f-divergence minimization

For any probability distributions P and Q, the f-divergence is

$$D_f(Q \,\|\, P) := \int f\!\left(\frac{dQ}{dP}(x)\right) dP(x),$$

where $f : (0, \infty) \to \mathbb{R}$ is convex and satisfies $f(1) = 0$.

Properties of f-divergences:

1. $D_f(P\|Q) \ge 0$, and $D_f(P\|Q) = 0$ when $P = Q$;
2. $D_f(P\|Q) = D_{f^\circ}(Q\|P)$ for $f^\circ(x) := x f(1/x)$;
3. $D_f(P\|Q)$ is symmetric when $f(x) = x f(1/x)$ for all $x$;
4. $D_f(P\|Q) = D_g(P\|Q)$ for any $g(x) = f(x) + C(x - 1)$;
5. Taking $f_0(x) := f(x) - (x - 1) f'(1)$ is a good choice: it is nonnegative, nonincreasing on $(0, 1]$, and nondecreasing on $[1, \infty)$.
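These definitions are easy to play with numerically. Below is a small sketch (not from the slides; the discrete setting and the distributions are illustrative) that evaluates several f-divergences for two discrete distributions:

```python
# Sketch: evaluating f-divergences between two discrete distributions.
# D_f(Q||P) = sum_x f(q(x)/p(x)) * p(x), assuming p(x) > 0 everywhere.
import numpy as np

def f_divergence(q, p, f):
    return np.sum(f(q / p) * p)

p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.10, 0.20, 0.30, 0.40])

kl  = f_divergence(q, p, lambda x: x * np.log(x))      # KL(Q||P)
rkl = f_divergence(q, p, lambda x: -np.log(x))         # KL(P||Q)
tv  = f_divergence(q, p, lambda x: np.abs(x - 1))      # total variation
js  = f_divergence(q, p, lambda x: x * np.log(x) - (x + 1) * np.log((x + 1) / 2))

print(kl, rkl, tv, js)  # all nonnegative, and zero when q == p
```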
f-GAN: Adversarial training for f-divergence minimization

Examples of f-divergences:

1. Kullback-Leibler divergence for $f(x) = x \log x$;
2. Reverse Kullback-Leibler divergence for $f(x) = -\log x$;
3. Total Variation distance for $f(x) = |x - 1|$;
4. Jensen-Shannon divergence for $f(x) = -(x + 1) \log\frac{x + 1}{2} + x \log x$:

$$JS(P\|Q) = KL\!\left(P \,\Big\|\, \frac{P + Q}{2}\right) + KL\!\left(Q \,\Big\|\, \frac{P + Q}{2}\right).$$
f-GAN: Adversarial training for f-divergence minimization

Kullback-Leibler minimization: assume $X_1, \ldots, X_N \sim Q$. Then

$$\min_P KL(Q\|P) \;\Leftrightarrow\; \min_P \int \log\!\left(\frac{dQ}{dP}(x)\right) dQ(x) \;\Leftrightarrow\; \min_P -\!\int \log(dP(x))\, dQ(x)$$

$$\Leftrightarrow\; \max_P \int \log(dP(x))\, dQ(x) \;\approx\; \max_P \frac{1}{N} \sum_{i=1}^N \log(dP(X_i)).$$

Reverse Kullback-Leibler minimization:

$$\min_P KL(P\|Q) \;\Leftrightarrow\; \min_P \int \log\!\left(\frac{dP}{dQ}(x)\right) dP(x) \;\Leftrightarrow\; \min_P -H(P) - \int \log(dQ(x))\, dP(x)$$

$$\Leftrightarrow\; \max_P\; H(P) + \int \log(dQ(x))\, dP(x).$$
f-GAN: Adversarial training for f-divergence minimization

Now we will follow (Nowozin et al., NIPS 2016).

For any probability distributions P and Q over $\mathcal{X}$, the f-divergence is

$$D_f(Q\,\|\,P) := \int f\!\left(\frac{dQ}{dP}(x)\right) dP(x).$$

Variational representation: a very neat result, saying that

$$D_f(Q\,\|\,P) = \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))],$$

where $f^*(t) := \sup_u \{t \cdot u - f(u)\}$ is the convex conjugate of f.

$D_f$ can be computed by solving an optimization problem!

The f-divergence minimization can now be written as:

$$\inf_P D_f(Q\,\|\,P) \;\Leftrightarrow\; \inf_P \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))].$$
f-GAN: Adversarial training for f-divergence minimization

The f-divergence minimization can now be written as:

$$\inf_P D_f(Q\,\|\,P) \;\Leftrightarrow\; \inf_P \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))].$$

1. Take P to be the distribution of $G_\theta(Z)$, where $Z \sim P_Z$ is a latent noise and $G_\theta$ is a deep net;
2. The JS divergence corresponds to $f(x) = -(x + 1)\log\frac{x + 1}{2} + x \log x$ and $f^*(t) = -\log(2 - e^t)$, and thus the domain of $f^*$ is $(-\infty, \log 2)$;
3. Take $g = g_f \circ D_\omega$, where $g_f(v) = \log 2 - \log(1 + e^{-v})$ and $D_\omega$ is a deep net (check that $g(x) < \log 2$ for all $x$).

Then, up to an additive $2 \log 2$ term, we get:

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim Q}\left[\log \frac{1}{1 + e^{-D_\omega(X)}}\right] + E_{Z \sim P_Z}\left[\log\!\left(1 - \frac{1}{1 + e^{-D_\omega(G_\theta(Z))}}\right)\right]$$

Compare to the original GAN objective (Goodfellow et al., NIPS 2014):

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim P_d}[\log D_\omega(X)] + E_{Z \sim P_Z}[\log(1 - D_\omega(G_\theta(Z)))].$$
f-GAN: Adversarial training for f-divergence minimization

Then we get the original GAN objective (Goodfellow et al., NIPS 2014):

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim P_d}[\log D_\omega(X)] + E_{Z \sim P_Z}[\log(1 - D_\omega(G_\theta(Z)))]. \quad (*)$$

Let's denote the distribution of $G_\theta(Z)$ when $Z \sim P_Z$ by $P_\theta$.

- For any fixed generator $G_\theta$, the theoretically optimal $D^*$ is:

  $$D^*_\theta(x) = \frac{dP_d(x)}{dP_d(x) + dP_\theta(x)}.$$

- If we insert this back into (*), we get

  $$\inf_{G_\theta}\; 2 \cdot JSD(P_d \,\|\, P_\theta) - \log 4,$$

  where JSD is the Jensen-Shannon divergence with the standard 1/2 normalization (half of the JS defined earlier).
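This classical identity is easy to sanity-check numerically. A sketch (illustrative Gaussian choices for $P_d$ and $P_\theta$; not from the slides):

```python
# Sketch: with the optimal discriminator D*(x) = p_d / (p_d + p_g), the inner
# GAN value equals 2 * JSD(P_d || P_g) - log 4 (JSD with 1/2 normalization).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p_d = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
p_g = lambda x: norm.pdf(x, loc=1.0, scale=1.0)
d_star = lambda x: p_d(x) / (p_d(x) + p_g(x))

gan_value = quad(lambda x: p_d(x) * np.log(d_star(x))
                 + p_g(x) * np.log(1.0 - d_star(x)), -12, 12)[0]

m = lambda x: 0.5 * (p_d(x) + p_g(x))
jsd = 0.5 * (quad(lambda x: p_d(x) * np.log(p_d(x) / m(x)), -12, 12)[0]
             + quad(lambda x: p_g(x) * np.log(p_g(x) / m(x)), -12, 12)[0])

print(gan_value, 2.0 * jsd - np.log(4.0))  # the two numbers agree
```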
f-GAN: Adversarial training for f-divergence minimization

The f-divergence minimization can now be written as:

$$\inf_P D_f(Q\,\|\,P) \;\Leftrightarrow\; \inf_P \sup_{g\colon \mathcal{X} \to \mathrm{dom}(f^*)} E_{X \sim Q}[g(X)] - E_{Y \sim P}[f^*(g(Y))].$$

This is GREAT, because:

1. We can estimate $E_{X \sim Q}[g(X)]$ using an i.i.d. sample (the data) $X_1, \ldots, X_N \sim Q$:

   $$E_{X \sim Q}[g(X)] \approx \frac{1}{N} \sum_{i=1}^N g(X_i);$$

2. Often we don't know the analytical form of P, but can only sample from P. No problem! Sample i.i.d. $Y_1, \ldots, Y_M \sim P$ and write:

   $$E_{Y \sim P}[f^*(g(Y))] \approx \frac{1}{M} \sum_{j=1}^M f^*(g(Y_j)).$$

This results in implicit generative modelling: using complex model distributions, which are analytically intractable but easy to sample from.
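A toy illustration of this sample-based estimate (a sketch; here Q and P are Gaussians so the optimal witness, which depends on the density ratio q/p, is known in closed form, but only samples enter the estimate):

```python
# Sketch: sample-based variational estimate of JS(Q||P) for two Gaussians,
# plugging the (here analytically known) optimal witness into the objective.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200000)   # X_i ~ Q (think: the data)
Y = rng.normal(0.5, 1.0, size=200000)   # Y_j ~ P (think: the model)

f_star = lambda t: -np.log(2.0 - np.exp(t))             # conjugate of the JS f
ratio = lambda x: np.exp(-0.5 * x + 0.125)              # q(x)/p(x), closed form
g = lambda x: np.log(2.0) - np.log1p(1.0 / ratio(x))    # optimal witness, < log 2

estimate = g(X).mean() - f_star(g(Y)).mean()
print(estimate)  # ~ JS(Q||P); any other admissible g gives a lower bound
```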
f-GAN: Adversarial training for f-divergence minimization

Note that there is still a difference between running a GAN:

$$\inf_{G_\theta} \sup_{D_\omega}\; E_{X \sim P_d}[\log D_\omega(X)] + E_{Z \sim P_Z}[\log(1 - D_\omega(G_\theta(Z)))]$$

and minimizing $JS(P_{data} \,\|\, P_{model})$:

$$\inf_{P_{model}} \sup_{g}\; E_{X \sim P_{data}}[g(X)] - E_{Y \sim P_{model}}[f^*_{JS}(g(Y))],$$

because:

1. We replaced expectations with sample averages. Uniform laws of large numbers may not apply, as our function classes are huge;
2. Instead of taking the supremum over all possible witness functions g, we optimize over a restricted class of DNNs (which is still huge);
3. In practice we never optimize $g$ / $D_\omega$ "to the end". This leads to vanishing gradients and other bad things.
Minimizing f-divergences with mixtures of distributions

Assumption: let's assume we know, more or less, how to solve

$$\min_{Q \in \mathcal{G}} D_f(Q \,\|\, P)$$

for any f and any P, given an i.i.d. sample from P.

New problem: now we want to approximate an unknown $P_d$ in $D_f$ using a mixture model of the form

$$P^T_{model} := \sum_{i=1}^T \alpha_i P_i.$$

Greedy strategy:

1. Train $P_1$ to minimize $D_f(P_1 \,\|\, P_d)$, set $\alpha_1 = 1$, and get $P^1_{model} = P_1$;
2. Assume we have trained $P_1, \ldots, P_t$ with weights $\alpha_1, \ldots, \alpha_t$. We are going to train one more component Q, choose a weight $\beta$, and update (see the sampling sketch below)

   $$P^{t+1}_{model} = \sum_{i=1}^t (1 - \beta)\alpha_i P_i + \beta Q.$$
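Sampling from the updated mixture stays trivial. A minimal sketch (illustrative one-dimensional Gaussian components standing in for trained generators):

```python
# Sketch: sampling from P^{t+1} = sum_i (1 - beta) * alpha_i * P_i + beta * Q.
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(samplers, weights, n):
    """samplers: callables m -> m samples; weights: nonnegative, sum to 1."""
    counts = rng.multinomial(n, weights)
    return np.concatenate([s(m) for s, m in zip(samplers, counts) if m > 0])

beta = 0.3
alphas = [0.5, 0.5]                                  # weights of P_1, P_2
weights = [(1 - beta) * a for a in alphas] + [beta]  # append the new component Q
samplers = [lambda m: rng.normal(-2.0, 1.0, m),      # stand-ins for trained G_1, G_2
            lambda m: rng.normal(+2.0, 1.0, m),
            lambda m: rng.normal(0.0, 1.0, m)]       # the new component Q
x = sample_mixture(samplers, weights, 10000)
print(x.shape, x.mean())
```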
Minimizing f-divergences with mixtures of distributions

The goal: we want to choose $\beta \in [0, 1]$ and Q greedily, so as to minimize

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d),$$

where we denoted $P_g := P^t_{model}$.

A modest goal: find $\beta \in [0, 1]$ and Q such that, for some $c < 1$:

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \le c \cdot D_f(P_g \,\|\, P_d).$$

Inefficient in practice: as $t \to \infty$ we expect $\beta \to 0$. Only a $\beta$ fraction of points gets sampled from Q, so it gets harder and harder to tune Q.
Minimizing f-divergences with mixtures of distributions

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \quad (*)$$

Workaround: instead optimize an upper bound on (*), which contains a term $D_f(Q \,\|\, Q_0)$ for some $Q_0$:

Lemma. Given $P_d$, $P_g$, and $\beta \in [0, 1]$, for any Q and any R with $\beta\, dR \le dP_d$,

$$(*) \le \beta D_f(Q \,\|\, R) + (1 - \beta)\, D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1 - \beta}\right).$$

If furthermore $D_f$ is a Hilbertian metric, then, for any R, we have

$$\sqrt{(*)} \le \sqrt{\beta D_f(Q \,\|\, R)} + \sqrt{D_f((1 - \beta) P_g + \beta R \,\|\, P_d)}.$$

The JS divergence, TV, Hellinger, and others are Hilbertian metrics.
Recipe: choose R to minimize the right-most terms in these bounds, and then minimize the first term $D_f(Q \,\|\, R^*)$, which we can do given an i.i.d. sample from $R^*$.
Minimizing f-divergences with mixtures of distributions

First surrogate upper bound:

Lemma. Given $P_d$, $P_g$, and $\beta \in [0, 1]$, for any Q and R, if $D_f$ is Hilbertian,

$$(*) \le \Big(\sqrt{\beta D_f(Q \,\|\, R)} + \sqrt{D_f((1 - \beta) P_g + \beta R \,\|\, P_d)}\Big)^2.$$

The optimal $R^*$ is given in the next result:

Theorem. Given $P_d$, $P_g$, and $\beta \in (0, 1]$, the solution to

$$\min_{R \in \mathcal{P}} D_f((1 - \beta) P_g + \beta R \,\|\, P_d),$$

where $\mathcal{P}$ is the class of all probability distributions, has the density

$$dR^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1 - \beta)\, dP_g(x)\big)_+$$

for some unique $\lambda^*$ satisfying $\int dR^*_\beta = 1$. Moreover, $\lambda^* = 1$ if and only if $P_d\big((1 - \beta)\, dP_g > dP_d\big) = 0$, which is equivalent to $\beta\, dR^*_\beta = dP_d - (1 - \beta)\, dP_g$.
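$\lambda^*$ has no closed form, but the total mass of $dR^*_\beta$ is continuous and nondecreasing in $\lambda$, so bisection finds it quickly. A discrete-space sketch (illustrative densities, not the authors' code):

```python
# Sketch: solving for lambda* on a discrete space, where
# dR*_beta = (1/beta) * (lambda* * p_d - (1 - beta) * p_g)_+ must sum to 1.
import numpy as np

def lambda_star(p_d, p_g, beta, tol=1e-12):
    mass = lambda lam: np.maximum(lam * p_d - (1 - beta) * p_g, 0.0).sum() / beta
    lo, hi = 0.0, 1.0
    while mass(hi) < 1.0:        # mass is nondecreasing in lambda; grow bracket
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) < 1.0 else (lo, mid)
    return hi

p_d = np.array([0.4, 0.4, 0.1, 0.1])   # data distribution
p_g = np.array([0.1, 0.1, 0.4, 0.4])   # current model: wrong on every state
beta = 0.5
lam = lambda_star(p_d, p_g, beta)
r_star = np.maximum(lam * p_d - (1 - beta) * p_g, 0.0) / beta
print(lam, r_star, r_star.sum())       # lambda* = 0.75 here; r_star sums to 1
```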
Minimizing f-divergences with mixtures of distributions

Second surrogate upper bound:

Lemma. Given $P_d$, $P_g$, and $\beta \in [0, 1]$, for any Q and any R with $\beta\, dR \le dP_d$,

$$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \le \beta D_f(Q \,\|\, R) + (1 - \beta)\, D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1 - \beta}\right).$$

The optimal $R^\dagger$ is given in the next result:

Theorem. Given $P_d$, $P_g$, and $\beta \in (0, 1]$, assume $P_d(dP_g = 0) < \beta$. Then the solution to

$$\min_{R \colon \beta dR \le dP_d} D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1 - \beta}\right)$$

is given by the density

$$dR^\dagger_\beta(x) = \frac{1}{\beta}\big(dP_d(x) - \lambda^\dagger (1 - \beta)\, dP_g(x)\big)_+$$

for a unique $\lambda^\dagger \ge 1$ satisfying $\int dR^\dagger_\beta = 1$.
Minimizing f-divergences with mixtures of distributions

Two different optimal solutions:

$$dR^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1 - \beta)\, dP_g(x)\big)_+$$

$$dR^\dagger_\beta(x) = \frac{1}{\beta}\big(dP_d(x) - \lambda^\dagger (1 - \beta)\, dP_g(x)\big)_+$$

- Very symmetric-looking, but no obvious connections found...
- Neither depends on f!
- $\lambda^*$ and $\lambda^\dagger$ are implicitly defined via fixed-point equations.
- $dR^*_\beta$ is also a solution to the original problem, that is, of
  $$\min_{Q \in \mathcal{P}} D_f((1 - \beta) P_g + \beta Q \,\|\, P_d).$$
- We see that setting $\beta = 1$ is the best possible choice in theory (with $\beta = 1$ the mixture reduces to the single new component). So in practice we'll have to use various heuristics for choosing $\beta$, including uniform, constant, ...
Minimizing f-divergences with mixtures of distributions

Recipe: choose $R^*$ to minimize the right-most terms in the surrogate upper bounds, and then minimize the first term $D_f(Q \,\|\, R^*)$.

- We found the optimal $R^*$ distributions. Let's now assume we are super powerful and managed to find Q with $D_f(Q \,\|\, R^*) = 0$.
- In other words, we are going to update our model:
  $$P_g \;\Rightarrow\; (1 - \beta) P_g + \beta Q$$
  with Q either $R^*_\beta$ or $R^\dagger_\beta$.
- How well are we doing in terms of the original objective $D_f((1 - \beta) P_g + \beta Q \,\|\, P_d)$?

Theorem. We have:

$$D_f\big((1 - \beta) P_g + \beta R^*_\beta \,\big\|\, P_d\big) \le D_f\big((1 - \beta) P_g + \beta P_d \,\big\|\, P_d\big) \le (1 - \beta)\, D_f(P_g \,\|\, P_d)$$

and

$$D_f\big((1 - \beta) P_g + \beta R^\dagger_\beta \,\big\|\, P_d\big) \le (1 - \beta)\, D_f(P_g \,\|\, P_d).$$

A numeric sanity check of the middle inequality is sketched below.
Minimizing f-divergences with mixtures of distributions

Other highlights:

- A more careful convergence analysis. In particular, we provide a necessary and sufficient condition for the iterative process to converge in a finite number of steps: there exists $M > 0$ such that
  $$P_d\big((1 - \beta)\, dP_1 > M\, dP_d\big) = 0.$$
  In other words, the density ratio $dP_1 / dP_d$ must be bounded $P_d$-almost surely.
- In practice we do not attain $D_f(Q \,\|\, R^*) = 0$. We prove that given
  $$D_f(Q \,\|\, R^*_\beta) \le \gamma\, D_f(P_g \,\|\, P_d) \quad \text{or} \quad D_f(Q \,\|\, R^\dagger_\beta) \le \gamma\, D_f(P_g \,\|\, P_d),$$
  we achieve
  $$D_f((1 - \beta) P_g + \beta Q \,\|\, P_d) \le c \cdot D_f(P_g \,\|\, P_d)$$
  with $c = c(\gamma, \beta) < 1$. This is similar in spirit to "weak learnability" in AdaBoost.
AdaGAN: adaptive mixtures of GANs

- We want to find Q so as to minimize $D_f(Q \,\|\, R^*_\beta)$, where
  $$dR^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1 - \beta)\, dP_g(x)\big)_+.$$
- We would like to use a GAN for that. But how do we sample from $R^*_\beta$?
- Let's rewrite:
  $$dR^*_\beta(x) = \frac{dP_d(x)}{\beta}\left(\lambda^* - (1 - \beta)\frac{dP_g}{dP_d}(x)\right)_+.$$
- Recall: $E_{P_d}[\log D(X)] + E_{P_g}[\log(1 - D(Y))]$ is maximized by
  $$D^*(x) = \frac{dP_d(x)}{dP_d(x) + dP_g(x)} \;\Leftrightarrow\; \frac{dP_g}{dP_d}(x) = \frac{1 - D^*(x)}{D^*(x)} =: h(x).$$
- Each training sample ends up with a weight
  $$w_i = \frac{p_i}{\beta}\big(\lambda^* - (1 - \beta)\, h(x_i)\big)_+ \approx \frac{1}{N\beta}\big(\lambda^* - (1 - \beta)\, h(x_i)\big)_+. \quad (**)$$
- Finally, we compute $\lambda^*$ using the normalization constraint in (**).
AdaGAN: adaptive mixtures of GANs

Each training sample ends up with a weight

$$w_i = \frac{p_i}{\beta}\big(\lambda^* - (1 - \beta)\, h(x_i)\big)_+ \approx \frac{1}{N\beta}\left(\lambda^* - (1 - \beta)\frac{1 - D(x_i)}{D(x_i)}\right)_+.$$

Now we run a usual GAN with this importance-weighted sample.

- $D(x_i) \approx 1$: our current model $P_g$ is "missing" this point. Big $w_i$.
- $D(x_i) \approx 0$: $x_i$ is covered by $P_g$ nicely. Zero $w_i$.
- The better our model gets, the more data points will be zeroed out.
- This leads to an aggressive, adaptive, iterative procedure.
- $\beta$ can also be chosen so as to force $\#\{w_i : w_i > 0\} \approx r \cdot N$.

A minimal sketch of this reweighting step follows.
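The sketch below is illustrative (empirical weights $p_i = 1/N$, stand-in discriminator outputs; `adagan_weights` is a hypothetical helper, not the authors' code). $\lambda^*$ is found by bisection on the constraint that the weights sum to one:

```python
# Sketch: AdaGAN-style sample weights from discriminator outputs, following
# w_i ≈ (1/(N beta)) * (lambda* - (1 - beta) * h(x_i))_+ with h = (1 - D) / D.
import numpy as np

def adagan_weights(d_out, beta, tol=1e-12):
    n = len(d_out)
    h = (1.0 - d_out) / d_out                     # estimated dP_g/dP_d at x_i
    total = lambda lam: np.maximum(lam - (1 - beta) * h, 0.0).sum() / (n * beta)
    lo, hi = 0.0, 1.0
    while total(hi) < 1.0:                        # total is nondecreasing in lam
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    return np.maximum(hi - (1 - beta) * h, 0.0) / (n * beta)

rng = np.random.default_rng(0)
d = rng.uniform(0.05, 0.95, size=1000)  # stand-in D(x_i) on the training set
w = adagan_weights(d, beta=0.4)
print(w.sum(), (w > 0).mean())  # weights sum to ~1; "missed" points (D near 1)
                                # carry large weights, covered points get zero
```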
AdaGAN: adaptive mixtures of GANs

Coverage metric C: the mass of the true data "covered" by the model:

$$C := P_{data}\big(dP_{model}(X) > t\big), \quad \text{where } t \text{ is such that } P_{model}(dP_{model} > t) = 0.95.$$

Experimental setting: mixture of 5 two-dimensional Gaussians.

[Figure: coverage C for "Best of T GANs", "Bagging", and "AdaGAN". Blue points: medians over 35 random runs; green error bars: 90% intervals (not confidence intervals for the medians).]
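The metric is straightforward to compute when the model density is known. A sketch (an illustrative 1-D analogue of the 5-Gaussian setup; a model covering 2 of the 5 modes scores C ≈ 0.4):

```python
# Sketch: coverage C = P_data(dP_model(X) > t), with t the 5%-quantile of the
# model's own density values, so that P_model(dP_model > t) = 0.95.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

model_pdf = lambda x: 0.5 * norm.pdf(x, -4, 0.25) + 0.5 * norm.pdf(x, 0, 0.25)
model_samples = np.concatenate([rng.normal(-4, 0.25, 5000),
                                rng.normal(0, 0.25, 5000)])
t = np.quantile(model_pdf(model_samples), 0.05)   # P_model(pdf > t) = 0.95

modes = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])     # the true 5-mode mixture
data = rng.normal(modes[rng.integers(0, 5, 20000)], 0.25)
C = (model_pdf(data) > t).mean()
print(C)   # ~0.4: the model covers two of the five true modes
```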
AdaGAN: adaptive mixtures of GANs

Coverage metric C as above. Heuristics for choosing $\beta_t$:

- TopKLast r: $\beta_t$ is chosen to keep $\#\{w_i : w_i > 0\} \approx r \cdot N$, zeroing out the rest of the training points;
- Boosted: AdaGAN with $\beta_t = 1/t$.
AdaGAN: adaptive mixtures of GANs

Coverage metric C as above. Experimental setting: mixture of 10 two-dimensional Gaussians.
AdaGAN: adaptive mixtures of GANs

Conclusions

Pros:

- It seems that AdaGAN is solving the missing-modes problem;
- AdaGAN is compatible with any implementation of GAN, and more generally with any other implicit generative modelling technique;
- The mixture is still easy to sample from;
- Easy to implement, with great freedom in choosing heuristics;
- Theoretically grounded!

Cons:

- We lose the nice "manifold" structure in the latent space;
- Increased computational complexity;
- Individual GANs are still very unstable.
Thank you!

AdaGAN: Boosting Generative Models. Tolstikhin, Gelly, Bousquet, Simon-Gabriel, Schölkopf, NIPS 2017.
https://github.com/tolstikhin/adagan
Minimizing the optimal transport

Optimal transport for a cost function $c(x, y) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ is

$$W_c(P_X, P_G) := \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_G)} E_{(X,Y) \sim \Gamma}[c(X, Y)].$$

If $P_G(Y \mid Z = z) = \delta_{G(z)}$ for all $z \in \mathcal{Z}$, where $G : \mathcal{Z} \to \mathcal{X}$, we have

$$W_c(P_X, P_G) = \inf_{Q \colon Q_Z = P_Z} E_{P_X} E_{Q(Z|X)}\big[c(X, G(Z))\big],$$

where $Q_Z$ is the marginal distribution of Z when $X \sim P_X$ and $Z \sim Q(Z|X)$.
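As a toy illustration of $W_c$ (a sketch; in 1-D with $c(x, y) = |x - y|$ the optimal coupling is the monotone one, and scipy computes the resulting distance directly from samples):

```python
# Sketch: 1-Wasserstein optimal transport between two samples in 1-D.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20000)   # X ~ P_X (data)
y = rng.normal(0.5, 1.0, size=20000)   # Y = G(Z), Z ~ P_Z (model samples)
print(wasserstein_distance(x, y))      # ~0.5 for these equal-variance Gaussians
```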
Relaxing the constraint

$$W_c(P_X, P_G) = \inf_{Q \colon Q_Z = P_Z} E_{P_X} E_{Q(Z|X)}\big[c(X, G(Z))\big].$$

Penalized Optimal Transport (POT) replaces the constraint with a penalty:

$$POT(P_X, P_G) := \inf_{Q} E_{P_X} E_{Q(Z|X)}\big[c(X, G(Z))\big] + \lambda \cdot D(Q_Z, P_Z)$$

and uses adversarial training in the $\mathcal{Z}$ space to approximate D.

- For the 2-Wasserstein distance, $c(X, Y) = \|X - Y\|_2^2$, POT recovers Adversarial Auto-Encoders (Makhzani et al., 2016);
- For the 1-Wasserstein distance, $c(X, Y) = \|X - Y\|_2$, POT and WGAN are solving the same problem from the primal and dual forms, respectively;
- Importantly, unlike VAE, POT does not force $Q(Z \mid X = x)$ to intersect for different $x$, which is known to lead to blurriness.

(Bousquet et al., 2017)
Further details on GAN: "the log trick"

$$\inf_{G_\theta} \sup_{T_\omega}\; E_{X \sim P_d}[\log T_\omega(X)] + E_{Z \sim P_Z}[\log(1 - T_\omega(G_\theta(Z)))]$$

In practice (verify this!) $G_\theta$ is often trained to instead minimize

$$-E_{Z \sim P_Z}[\log(T_\omega(G_\theta(Z)))] \quad \text{("the log trick")}$$

One possible explanation: for any $f_r$ and $P_G$, the optimal $T^*$ in the variational representation of $D_{f_r}(P_X \| P_G)$ has the form

$$T^*(x) := f_r'\!\left(\frac{p_X(x)}{p_G(x)}\right) \;\Leftrightarrow\; \frac{p_X(x)}{p_G(x)} = (f_r')^{-1}\big(T^*(x)\big)$$

1. Approximate $T^* \approx T_{\omega^*}$ using adversarial training with $f_r$;
2. Express $\frac{p_X}{p_G}(x) = (f_r')^{-1}(T_{\omega^*}(x))$;
3. Approximate the desired f-divergence $D_f(P_X \| P_G)$ using

$$\int_{\mathcal{X}} f\!\left(\frac{p_X(x)}{p_G(x)}\right) p_G(x)\, dx \approx \int_{\mathcal{X}} f\big((f_r')^{-1}(T_{\omega^*}(x))\big)\, p_G(x)\, dx.$$

Applying this argument with the JS $f_r$ and $f(x) = \log(1 + x^{-1})$ recovers the log trick.

The corresponding f-divergence is much more mode-seeking than the reverse KL divergence.
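A complementary, more mechanical view of why the trick is used in practice (a sketch, not from the slides; gradients are taken with respect to the discriminator logit $s$, with $T_\omega = \sigma(s)$): the saturating loss $\log(1 - T_\omega)$ has vanishing gradients exactly where a confident discriminator puts generated points ($T_\omega \approx 0$), while $-\log T_\omega$ does not:

```python
# Sketch: gradient magnitudes of the two generator losses w.r.t. the logit s,
# where T = sigmoid(s):  d/ds log(1 - T) = -T,  d/ds (-log T) = -(1 - T).
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
s = np.array([-6.0, -3.0, 0.0, 3.0])   # logits on generated points
T = sigmoid(s)

print(-T)           # saturating loss: ~0 gradient when T ≈ 0 (early training)
print(-(1.0 - T))   # "log trick" loss: gradient stays ~1 when T ≈ 0
```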