    Wasserstein GAN

    Martin Arjovsky¹, Soumith Chintala², Leon Bottou¹,²

    ¹ Courant Institute of Mathematical Sciences, ² Facebook AI Research

    Presented by Chunyuan Li

    Preliminaries: “vanilla” GAN

    1 Real data distribution $P_r$; generator’s distribution $P_g$, implemented as $x = G(z)$, $z \sim P(z)$

    $$\min_G \max_D V(D, G)$$

    Discriminator

    $$-\mathbb{E}_{x \sim P_r}[\log D(x)] - \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] \qquad (1)$$

    $D(x)$: the probability that $x$ came from the real data rather than the generator.

    Generator

    $$\mathbb{E}_{x \sim P_g}[\log(1 - D(x))] \qquad \text{GAN0} \quad (2)$$

    $$\mathbb{E}_{x \sim P_g}[-\log D(x)] \qquad \text{GAN1} \quad (3)$$

    2 Problems [Goodfellow et al., 2014]:

    P1: “In practice, GAN0 may not provide sufficient gradient for G to learn well”, so GAN1 is used instead (the log D trick).

    P2: “G collapses too many values of z to the same value of x” (mode collapse in GAN1).

    What is the principled interpretation? [Arjovsky and Bottou, 2017]
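A small numeric sketch may help make P1 concrete. It assumes the discriminator ends in a sigmoid, so that D(x) = sigmoid(a) for a logit a (the function names below are illustrative, not from the slides), and compares the gradient of the two generator losses with respect to that logit when the discriminator confidently rejects a fake sample.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_gan0(a):
    """d/da of log(1 - sigmoid(a)): the GAN0 generator loss on one fake sample."""
    return -sigmoid(a)

def grad_gan1(a):
    """d/da of -log(sigmoid(a)): the GAN1 generator loss on one fake sample."""
    return -(1.0 - sigmoid(a))

# As the discriminator becomes more confident that x is fake (logit -> -inf),
# the GAN0 gradient shrinks toward 0 while the GAN1 gradient stays near -1.
for a in (0.0, -2.0, -5.0, -10.0):
    print(f"logit={a:6.1f}  D(x)={sigmoid(a):.5f}  "
          f"GAN0 grad={grad_gan0(a):+.5f}  GAN1 grad={grad_gan1(a):+.5f}")
```

This only illustrates the saturation behind the log D trick; it says nothing about the instability of GAN1 discussed later.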

    P1 on GAN0

    In GAN0, a better discriminator leads to a worse vanishing-gradient problem for the generator.

    Q: Why is GAN difficult to train?

    A: Either our updates to the discriminator will be inaccurate, or they will vanish. This leaves it up to the user to decide the precise amount of training dedicated to the discriminator, which can make GAN training hard.

    P1 on GAN0

    Proof Sketch:

    1 Minimizing the generator loss amounts to minimizing the JS divergence when the discriminator is optimal. For a given $x$, the optimal discriminator is

    $$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} \qquad (4)$$

    The generator loss (after adding a term independent of $P_g$) is

    $$L = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] \qquad (5)$$

    Plugging (4) into (5) gives

    $$L = 2\,JS(P_r \| P_g) - 2\log 2 \qquad (6)$$
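The identity in (6) can be checked numerically on small discrete distributions. The sketch below is only an illustration under that assumption (probability vectors over a finite set; the helper names are hypothetical): it plugs the optimal discriminator (4) into the loss (5) and compares the result with 2 JS − 2 log 2.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def loss_at_optimal_d(p_r, p_g):
    """Loss (5) evaluated at D*(x) = p_r(x) / (p_r(x) + p_g(x))."""
    total = 0.0
    for r, g in zip(p_r, p_g):
        if r + g == 0:
            continue  # x carries no mass under either distribution
        d_star = r / (r + g)
        if r > 0:
            total += r * math.log(d_star)
        if g > 0:
            total += g * math.log(1.0 - d_star)
    return total

p_r = [0.5, 0.3, 0.2]
p_g = [0.1, 0.1, 0.8]
print(loss_at_optimal_d(p_r, p_g), 2 * js(p_r, p_g) - 2 * math.log(2))

# Disjoint supports: JS saturates at log 2 and the loss is 0.
print(js([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]), math.log(2))
```

The second print previews step 2 of the proof sketch: once the supports are disjoint, JS is stuck at log 2 no matter how far apart the distributions are.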

    2 If the supports (underlying low-dimensional manifolds) of $P_r$ and $P_g$ have (almost) no overlap, then $JS(P_r \| P_g) = \log 2$ (Theorem 2.3), and thus the gradient of (5) with respect to $P_g$ vanishes (Theorem 2.4 and Corollary 2.1).

    3 The probability that the supports of $P_r$ and $P_g$ have almost zero overlap is 1 (Lemma 2, Lemma 3 and Theorem 2.2).

    P2 on GAN1

    GAN1 is a conflicting/asymmetric objective, which leads to (1) unstable gradients and (2) mode collapse.

    P2 on GAN1

    Proof Sketch:

    1 (Theorem 2.5) Optimizing GAN1 is equivalent to optimizing

    $$KL(P_g \| P_r) - 2\,JS(P_g \| P_r) \qquad (7)$$

    2 KL and JS enter with opposite signs. (Theorem 2.6: instability of generator gradient updates)

    3 The objective contains $KL(P_g \| P_r)$, NOT $KL(P_r \| P_g)$:
    $KL(P_g \| P_r)$ assigns a high cost to generating fake-looking samples and a low cost to mode dropping;
    $KL(P_r \| P_g)$ assigns a high cost to not covering parts of the data and a low cost to generating fake-looking samples.
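The asymmetry in point 3 can be seen on a toy discrete example. The three bins and all probabilities below are made up for illustration: two real modes plus a "junk" bin that real data almost never occupies. One hypothetical generator drops a mode; the other covers both modes but also emits junk.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Bins: [mode A, mode B, junk]. All numbers are illustrative assumptions.
p_real = [0.4995, 0.4995, 0.001]
p_drop = [0.989, 0.010, 0.001]   # generator that (almost) drops mode B
p_fake = [0.35, 0.35, 0.30]      # generator that covers both modes but emits junk

for name, p_g in (("mode dropping", p_drop), ("fake samples", p_fake)):
    print(f"{name:14s} KL(Pg||Pr)={kl(p_g, p_real):.3f}  "
          f"KL(Pr||Pg)={kl(p_real, p_g):.3f}")
```

Under these assumed numbers, KL(Pg||Pr) penalizes the junk-emitting generator more than the mode-dropping one, while KL(Pr||Pg) ranks them the other way around.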

    Preliminaries: distance measures for distributions

    1 KL

    $$KL(P \| Q) = \mathbb{E}_P \log \frac{P}{Q}$$

    2 JS

    $$JS(P \| Q) = \frac{1}{2} KL\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} KL\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$$

    3 Wasserstein

    $$W(P \| Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$$

    $\Pi(P, Q)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are $P$ and $Q$, respectively.
    $\gamma(x, y)$ indicates a plan to transport “mass” from $x$ to $y$ when deforming $P$ into $Q$.
    The Wasserstein (or Earth-Mover) distance is then the “cost” of the optimal transport plan.
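To make the infimum over transport plans concrete, here is a brute-force sketch for the smallest non-trivial case: two distributions on the real line with two atoms each. With the marginals fixed, every coupling is determined by a single number t = γ(x₁, y₁), so the set Π(P, Q) can simply be scanned. The function names and the grid resolution are assumptions of this illustration, not part of the slides.

```python
def couplings_2x2(p, q, steps=1000):
    """Yield 2x2 joint tables with row marginal (p, 1-p) and column marginal (q, 1-q)."""
    lo, hi = max(0.0, p + q - 1.0), min(p, q)  # range of t keeping all entries >= 0
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        yield [[t, p - t], [q - t, 1.0 - p - q + t]]

def transport_cost(gamma, xs, ys):
    """Expected |x - y| under the transport plan gamma."""
    return sum(gamma[i][j] * abs(xs[i] - ys[j]) for i in range(2) for j in range(2))

def wasserstein_2pt(p, xs, q, ys, steps=1000):
    """W(P, Q) for P = p*delta(xs[0]) + (1-p)*delta(xs[1]), and Q likewise."""
    return min(transport_cost(g, xs, ys) for g in couplings_2x2(p, q, steps))

# P puts mass 1/2 on 0 and 1; Q puts mass 1/2 on 0 and 2.
# The cheapest plan leaves the mass at 0 in place and moves the rest from 1 to 2.
print(wasserstein_2pt(0.5, (0.0, 1.0), 0.5, (0.0, 2.0)))
```

Since the cost is linear in t, the minimum sits at an endpoint of the feasible interval; the scan is kept only to mirror the definition.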

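    $P_0$: distribution of $(0, Z)$, where $Z \sim U[0, 1]$
    $P_\theta$: distribution of $(\theta, Z)$, where $\theta$ is a single real parameter

    $$KL(P_0 \| P_\theta) = KL(P_\theta \| P_0) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$$

    $$JS(P_0 \| P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$$

    $$W(P_0 \| P_\theta) = |\theta|$$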
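The three closed forms above can be reproduced with a few lines of code. As a simplifying assumption of this sketch, the shared coordinate Z is dropped, so the two distributions become point masses at 0 and θ on the first coordinate; this leaves KL, JS and W unchanged in this example because all mass moves horizontally. Distributions are represented as dictionaries from support points to probabilities.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions stored as {point: probability}."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

def js(p, q):
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distances(theta):
    """Return (KL, JS, W) between a point mass at 0 and a point mass at theta."""
    p0 = {0.0: 1.0}
    p_theta = {float(theta): 1.0}
    w = abs(theta)  # all mass travels the same distance |theta|
    return kl(p0, p_theta), js(p0, p_theta), w

for theta in (1.0, 0.5, 0.1, 0.0):
    print(theta, distances(theta))
```

KL and JS jump as soon as θ leaves 0 and carry no information about how far away it is, whereas W shrinks smoothly as θ approaches 0.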




    [Figure: (a) Distributions $P_0$ and $P_\theta$; (b) Output of W and JS]

    1 Observations
    When the distributions are supported by low-dimensional manifolds (such as $P_r$ and $P_g$ in GANs):
    KL and JS are binary, so they give no meaningful gradient;
    W is continuous and differentiable almost everywhere, hence always sensible.

    2 Theoretical support
    Theorem 1 supports the above statement.
    Corollary 1 says Theorem 1 holds when the mapping is a neural network.
    Theorem 2 implies that the TV distance has the same problem as KL and JS.

    Wasserstein GAN

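    The infimum is highly intractable, but the Wasserstein distance has a dual form:

    $$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \qquad (8)$$

    $$K \cdot W(P_r, P_g) = \sup_{\|f\|_L \le K} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \qquad (9)$$

    where the supremum in (9) is over all $K$-Lipschitz functions.
    Consider a $w$-parameterized family of functions $\{f_w\}_{w \in \mathcal{W}}$ that are all $K$-Lipschitz. Up to a multiplicative constant,

    $$W(P_r, P_g) = \max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{x \sim P_g}[f_w(x)] \qquad (10)$$

    For example, $\mathcal{W} = [-c, c]^l$.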
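The role of the Lipschitz constant can be illustrated with the simplest possible critic family. The sketch below assumes a linear critic f_w(x) = w·x on the real line with w restricted to [−c, c], so every f_w is c-Lipschitz; the names and the grid search are choices made for this illustration only.

```python
def dual_objective(w, real, fake):
    """E_real[f_w(x)] - E_fake[f_w(x)] for the linear critic f_w(x) = w * x."""
    mean_real = sum(w * x for x in real) / len(real)
    mean_fake = sum(w * x for x in fake) / len(fake)
    return mean_real - mean_fake

def best_clipped_linear_critic(real, fake, c, steps=200):
    """Grid-search w in [-c, c]; return (best objective value, best w)."""
    ws = [-c + 2.0 * c * i / steps for i in range(steps + 1)]
    return max((dual_objective(w, real, fake), w) for w in ws)

# Samples from point masses at 0 (real) and 2 (fake), so W(P_r, P_g) = 2.
real = [0.0, 0.0, 0.0, 0.0]
fake = [2.0, 2.0, 2.0, 2.0]
for c in (0.5, 1.0, 2.0):
    value, w = best_clipped_linear_critic(real, fake, c)
    print(f"c={c}: max objective={value}, attained at w={w}")
```

For these two point masses the maximum equals c times the true distance, which is the scaling stated in (9); for richer distributions a linear critic would generally only give a lower bound.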

    Wasserstein GAN

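    The objective for the discriminator/critic (maximized over $w$)

    $$\mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{x \sim P_g}[f_w(x)] \qquad (11)$$

    The loss for the generator

    $$-\mathbb{E}_{x \sim P_g}[f_w(x)] = -\mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))] \qquad (12)$$

    Main differences from the vanilla GAN

    Remove the sigmoid of the last layer in D.
    Remove the log in the losses of D and G.
    Clip the parameters of D to an interval centered at 0.
    Avoid momentum-based optimization (the paper uses RMSProp instead of Adam).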
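The recipe above can be sketched end to end on a deliberately tiny problem. Everything in this example is an assumption made for illustration: real data is N(3, 1), the generator is g_θ(z) = z + θ with z ~ N(0, 1), the critic is the linear function f_w(x) = w·x, and plain gradient steps are used in place of RMSProp. It is a toy, not the paper's Algorithm 1.

```python
import random

def train_toy_wgan(real_mean=3.0, c=1.0, n_critic=5, lr_critic=0.5,
                   lr_gen=0.05, batch=64, gen_steps=300, seed=0):
    """Toy WGAN with linear critic f_w(x) = w * x and generator g(z) = z + theta."""
    rng = random.Random(seed)
    w, theta = 0.0, 0.0
    for _ in range(gen_steps):
        # Train the critic for several steps before each generator update.
        for _ in range(n_critic):
            real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
            # Gradient of (11) w.r.t. w: mean(real) - mean(fake). No sigmoid, no log.
            grad_w = sum(real) / batch - sum(fake) / batch
            w += lr_critic * grad_w          # gradient ascent on the critic objective
            w = max(-c, min(c, w))           # clip the critic parameter to [-c, c]
        # Generator loss (12) is -mean(w * (z + theta)); its gradient w.r.t. theta is -w.
        theta -= lr_gen * (-w)
    return theta, w

theta, w = train_toy_wgan()
print(f"learned shift theta={theta:.3f} (real mean is 3.0), critic weight w={w:.3f}")
```

With a linear critic the learned shift tends to hover around the real mean rather than settle exactly on it, so the result should be read as approximate.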

    Meaningful loss metric

    WGAN provides a meaningful loss metric that correlates with the generator’s convergence and sample quality. The WGAN algorithm attempts to train the critic relatively well before each generator update, so the loss function at this point is an estimate of the EM distance. It is NOT meant to quantitatively evaluate generative models.

    Top: DCGAN discriminator; Bottom: MLP discriminator

    Improved stability

    WGAN allows us to train the critic till optimality, so we no longer need to carefully balance the capacity of the generator and the discriminator.

    A generator without batch normalization in DCGAN

    A generator constructed with an MLP

    In no experiment did the authors see evidence of mode collapse.

    Integral Probability Metrics (IPMs)

    $$d_{\mathcal{F}}(P_r, P_g) = \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \qquad (13)$$

    1 Wasserstein distance: $\mathcal{F}$ is the set of $K$-Lipschitz functions

    2 Total variation distance: $\mathcal{F}$ is the set of all measurable functions bounded between -1 and 1

    3 Energy-based GANs: a generative approach to the total variation distance

    4 Maximum Mean Discrepancy (MMD): $\mathcal{F} = \{f \in \mathcal{H} : \|f\|_\infty \le 1\}$ for some RKHS $\mathcal{H}$ [Sutherland et al., 2016]

    5 Kernelized Stein Discrepancy: a special case of MMD, with “Steinalized” kernels depending on $P_g$, i.e., $\kappa(x, x') = \mathcal{T}^{x}_{P_g}(\mathcal{T}^{x'}_{P_g} \otimes k(x, x'))$ [Wang and Liu, 2016]
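As an illustration of item 4, the squared MMD can be estimated from samples using only kernel evaluations. The sketch below uses the standard kernel-mean form of the (biased) estimator with a Gaussian kernel on scalar samples; the kernel choice, the bandwidth, and the function names are assumptions of this example rather than anything specified in the slides.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased sample estimate of MMD^2: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    def mean_k(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

real = [-0.1, 0.0, 0.1]
near = [-0.1, 0.0, 0.1]   # identical sample: the estimate should be 0
far = [2.9, 3.0, 3.1]     # shifted sample: the estimate should be clearly positive
print(mmd2_biased(real, near), mmd2_biased(real, far))
```

No critic has to be trained here: for a fixed kernel the supremum in (13) has a closed form, which is what the estimator exploits.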

