
Bayesian Generative Adversarial Networks

Andrew Gordon Wilson

Assistant Professor
https://people.orie.cornell.edu/andrew

Cornell University

Center for Informatics and Computational Science (CICS), Notre Dame University

February 26, 2018

Joint work with Yunus Saatchi


Bayesian Generative Adversarial Networks

▶ Generative adversarial networks (GANs) (Goodfellow et al., NIPS 2014) learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood.

▶ We introduce a Bayesian GAN, which requires minimal human intervention and provides powerful semi-supervised results.

▶ State-of-the-art predictive accuracy using less than 1% of labels.

▶ Scalable inference with stochastic gradient HMC.

[Figure: sketch of the posterior p(θg|D) over generator parameters θg versus a single maximum-likelihood point estimate (θg)ML.]


Unsupervised Generative Models

Why do we care?

▶ Foundational to intelligent systems

▶ Simulating possible futures in reinforcement learning

▶ Semi-supervised learning

▶ Image super-resolution, inpainting, extrapolation

GANs and VAEs have emerged as exceptionally powerful frameworks for generative unsupervised modelling.

“GANs are the most significant new development in machine learning in the last 10 years!”
Yann LeCun, Cornell CS Colloquium, 2016


What are GANs?

▶ Generative Adversarial Networks (GANs) implicitly perform density estimation.

▶ A generator G proposes samples from the data distribution, attempting to fool a discriminator D. Learning takes place through an adversarial game between G and D.

▶ GANs are very good at learning to sample from a density over images, which had previously been a practically intractable problem!


Classical Density Estimation

▶ Observations y1, . . . , yN drawn from an unknown density p(y).

▶ Specify an observation model. For example, we can let the points be drawn from a mixture of Gaussians:
p(y|θ) = w1 N(y|µ1, σ1²) + w2 N(y|µ2, σ2²),
θ = {w1, w2, µ1, µ2, σ1, σ2}.

▶ Likelihood: p(y|θ) = ∏_{i=1}^{N} p(yi|θ).

Can learn all free parameters θ using maximum likelihood...
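As a concrete illustration (not from the slides), here is a minimal sketch of the maximum-likelihood route for the two-component mixture above, using scikit-learn's EM implementation; the synthetic data and all settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data standing in for draws y_1, ..., y_N from the unknown p(y).
y = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.5, 1.0, 700)])

# EM finds a (local) maximum-likelihood estimate of theta = {w, mu, sigma}.
gmm = GaussianMixture(n_components=2, random_state=0).fit(y.reshape(-1, 1))
print("weights:", gmm.weights_)                       # w1, w2
print("means:  ", gmm.means_.ravel())                 # mu1, mu2
print("stddevs:", np.sqrt(gmm.covariances_).ravel())  # sigma1, sigma2
```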


Regularisation = MAP ≠ Bayesian Inference

Regularisation or MAP

▶ Find argmax_θ log p(θ|y) = log p(y|θ) + log p(θ) (up to an additive constant), where log p(y|θ) is the model-fit term and log p(θ) is a complexity penalty.

▶ Choose p(θ) such that p(θ) → 0 faster than p(y|θ) → ∞ as σ1 or σ2 → 0.

Bayesian Inference

▶ Predictive Distribution: p(y∗|y) = ∫ p(y∗|θ) p(θ|y) dθ.

▶ Parameter Posterior: p(θ|y) ∝ p(y|θ) p(θ).


Generative Adversarial Networks

Generative Procedure

▶ Sample z(1), . . . , z(n) ∼ p(z) (p(z) is typically uniform noise).

▶ Transform the noise through a generator to produce samples x′(i) = G(z(i); θg).

▶ G can be arbitrary but is typically a de-convolutional neural network parametrized by θg.

▶ If G has sufficient capacity, there is a setting of θg such that G(·; θg) can approximate the CDF inverse-CDF composition required to sample from a data distribution of interest.

Notation summary: G: generator; θg: generator parameters; z: noise; x: data sample.
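To make the notation concrete, here is a minimal, assumed sketch of the generative procedure in PyTorch; the toy MLP generator and the dimensions are placeholders, not the deconvolutional architecture the slides refer to.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, n = 100, 784, 16

# A toy stand-in for G(.; theta_g); the slides use a deconvolutional network.
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

z = 2 * torch.rand(n, latent_dim) - 1   # z^(1), ..., z^(n) ~ Uniform(-1, 1)
x_fake = G(z)                           # x'^(i) = G(z^(i); theta_g)
print(x_fake.shape)                     # torch.Size([16, 784])
```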


Generating Samples


DC-GAN Architecture


Training Procedure

▶ The generator G(·; θg) proposes candidate data samples.

▶ The discriminator has access to a dataset X = {x(i)} from the actual data distribution (e.g., a collection of photographs).

▶ A discriminator D(·; θd) trains itself to classify samples from the generator vs. samples from the actual data distribution by updating its parameters θd.

▶ The generator updates its parameters θg to fool the discriminator; the discriminator updates its parameters θd to get better at calling out the generator.

▶ If G and D have enough capacity, samples from G converge to samples from the actual data distribution.

▶ This procedure works in practice because of the powerful inductive biases of G and D.


GAN Training Illustration

x = data samples
z = noise samples

Black = data distribution
Green = generative distribution
Blue = discriminative distribution


GAN Objective

min_G max_D V(D,G) = Ex∼pdata(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]


The Optimal Discriminator

min_G max_D V(D,G) = Ex∼pdata(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]

Proposition: For G fixed, the optimal discriminator D is

D∗G(x) = pdata(x) / (pdata(x) + pg(x))    (1)

Proof: The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G,D):

V(G,D) = ∫_x pdata(x) log(D(x)) dx + ∫_z pz(z) log(1 − D(G(z))) dz
       = ∫_x [ pdata(x) log(D(x)) + pg(x) log(1 − D(x)) ] dx    (2)

For any (a, b) ∈ R² \ {(0, 0)}, the function y ↦ a log(y) + b log(1 − y) achieves its maximum in [0, 1] at a/(a + b). The discriminator does not need to be defined outside of Supp(pdata) ∪ Supp(pg), concluding the proof.


The Optimal Generator

min_G max_D V(D,G) = Ex∼pdata(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]

Let D∗G(x) = pdata(x) / (pdata(x) + pg(x)). Then

C(G) = V(G,D∗) = Ex∼pdata[ log pdata(x) / (pdata(x) + pg(x)) ] + Ex∼pg[ log pg(x) / (pdata(x) + pg(x)) ]
     = − log(4) + KL( pdata ‖ (pdata + pg)/2 ) + KL( pg ‖ (pdata + pg)/2 )    (3)
     = − log(4) + JSD(pdata ‖ pg)    (4)

which attains its minimum when pg = pdata.


SGD Training Algorithm
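The algorithm figure from the original slide is not reproduced here; in its place, a minimal sketch (assumed hyperparameters and toy MLP networks, not the exact pseudocode of Goodfellow et al.) of the alternating SGD updates on the minimax objective above.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, eps = 100, 784, 1e-8
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)

def train_step(x_real):
    # Discriminator ascent step on V: maximize log D(x) + log(1 - D(G(z))).
    z = 2 * torch.rand(x_real.size(0), latent_dim) - 1
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator descent step on V: minimize log(1 - D(G(z))).
    z = 2 * torch.rand(x_real.size(0), latent_dim) - 1
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with a random batch standing in for real data:
d_loss, g_loss = train_step(torch.randn(64, data_dim))
```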


Original Paper Illustrations (2014)


Improvements using DCGAN (2015, 2016)


Progressive GANs (2017)


Vector space arithmetic


Vector space arithmetic


GANs with covariates


Mode Collapse

min_G max_D V(D,G) = Ex∼pdata(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]

Imagine switching the objective from min_G max_D to max_D min_G. The practical SGD training algorithm is agnostic to the ordering.


GAN stability

▶ Feature matching

▶ Minibatch discrimination

▶ Label smoothing


Bayesian Generative Adversarial Networks

Prior Model

▶ We induce a distribution over generators G and discriminators D through distributions on their parameters:

θg ∼ p(θg|αg)    (5)

θd ∼ p(θd|αd)    (6)

▶ We then have a distribution over distributions of data.


Generative Model for Data

1. Sample θ′g ∼ p(θg|αg)

2. Sample z(1), . . . , z(n) ∼ p(z).

3. x′(j) = G(z(j); θ′g) ∼ pgenerator(x; θ′g)

[Figure: the induced posterior p(θg|D) over generator parameters θg, contrasted with a single maximum-likelihood point estimate (θg)ML.]


Posterior Inference with Adversarial Feedback

How do we update our posterior beliefs?


Propose Conditional Posteriors

p(θg|z, θd) ∝ ( ∏_{i=1}^{ng} D(G(z(i); θg); θd) ) p(θg|αg)    (7)

p(θd|z, X, θg) ∝ ∏_{i=1}^{nd} D(x(i); θd) × ∏_{i=1}^{ng} (1 − D(G(z(i); θg); θd)) × p(θd|αd)    (8)

Sample iteratively from these conditional posteriors
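A minimal sketch of how these unnormalized conditional posteriors might be evaluated in log space, assuming toy generator/discriminator networks as above and simple isotropic Gaussian priors standing in for p(θg|αg) and p(θd|αd); this is illustrative only, not the authors' reference implementation.

```python
import torch

def log_prior(params, alpha=0.01):
    # log N(theta | 0, alpha^{-1} I), up to an additive constant (assumed prior).
    return -0.5 * alpha * sum((p ** 2).sum() for p in params)

def log_cond_post_g(G, D, z, eps=1e-8):
    # Eq. (7): sum_i log D(G(z^(i); theta_g); theta_d) + log p(theta_g | alpha_g)
    return torch.log(D(G(z)) + eps).sum() + log_prior(G.parameters())

def log_cond_post_d(G, D, z, x_real, eps=1e-8):
    # Eq. (8): sum_i log D(x^(i)) + sum_i log(1 - D(G(z^(i)))) + log p(theta_d | alpha_d)
    return (torch.log(D(x_real) + eps).sum()
            + torch.log(1 - D(G(z).detach()) + eps).sum()
            + log_prior(D.parameters()))
```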


Classical GANs as Maximum Likelihood

p(θg|z, θd) ∝ ( ∏_{i=1}^{ng} D(G(z(i); θg); θd) ) p(θg|αg)

p(θd|z, X, θg) ∝ ∏_{i=1}^{nd} D(x(i); θd) × ∏_{i=1}^{ng} (1 − D(G(z(i); θg); θd)) × p(θd|αd)

If we assign a vague uniform prior over θg and θd and perform iterative MAP optimization instead of sampling, then the local optima will be in the same place as in the classical GAN of Goodfellow et al. (2014).


Marginalizing the Noise

p(θg|θd) = ∫ p(θg, z|θd) dz = ∫ p(θg|z, θd) p(z|θd) dz,   where p(z|θd) = p(z)    (9)

         ≈ (1/J) ∑_{j=1}^{J} p(θg|z(j), θd),   z(j) ∼ p(z)

By following a similar derivation, p(θd|θg) ≈ (1/J) ∑_{j=1}^{J} p(θd|z(j), X, θg),   z(j) ∼ p(z).

▶ p(z) is a white noise distribution from which we can take efficient and exact samples.

▶ p(θg|z, θd) and p(θd|z, X, θg), when viewed as functions of z, are broad over z by construction, since z is used to produce candidate data samples in the generative procedure. Therefore each term in the sum contributes to the estimate.


Semi-supervised Learning

▶ Make label predictions using structure from both unlabelled and labelled training data.

▶ Can quantify recent advances in unsupervised learning.

▶ Crucial for reducing the dependency of deep learning on large labelled datasets.


Semi-supervised Learning

▶ Task: predict the class label of test images, based on a training set of labelled and unlabelled images.

▶ n unlabelled observations {x(i)}, and ns labelled observations {(xs(i), ys(i))}, i = 1, . . . , ns, with class labels ys(i) ∈ {1, . . . , K}.

▶ Redefine the discriminator so that D(x(i) = y(i); θd) gives the probability that sample x(i) belongs to class y(i).

p(θg|z, θd) ∝ ( ∏_{i=1}^{ng} ∑_{y=1}^{K} D(G(z(i); θg) = y; θd) ) p(θg|αg)    (10)

p(θd|z, x, ys, θg) ∝ ∏_{i=1}^{nd} ∑_{y=1}^{K} D(x(i) = y; θd) × ∏_{i=1}^{ng} D(G(z(i); θg) = 0; θd) × ∏_{i=1}^{ns} D(xs(i) = ys(i); θd) × p(θd|αd)
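A minimal, assumed sketch of the semi-supervised discriminator term: D now outputs K+1 class probabilities, with class 0 reserved for generated samples; the labels ys are assumed to be coded 1, . . . , K, and the Gaussian prior is again only a stand-in for p(θd|αd).

```python
import torch
import torch.nn.functional as F

def log_cond_post_d_ss(D, G, z, x_unlab, x_lab, y_lab, alpha=0.01, eps=1e-8):
    # D returns logits over K+1 classes; column 0 means "generated/fake".
    p_unlab = F.softmax(D(x_unlab), dim=-1)        # unlabelled: any real class y >= 1
    p_fake  = F.softmax(D(G(z).detach()), dim=-1)  # generated samples: class 0
    p_lab   = F.softmax(D(x_lab), dim=-1)          # labelled: the observed class y_s
    log_lik = (torch.log(p_unlab[:, 1:].sum(dim=-1) + eps).sum()
               + torch.log(p_fake[:, 0] + eps).sum()
               + torch.log(p_lab.gather(1, y_lab.view(-1, 1)).squeeze(1) + eps).sum())
    log_prior = -0.5 * alpha * sum((p ** 2).sum() for p in D.parameters())
    return log_lik + log_prior
```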


Making Predictions of Class Labels

To compute the predictive distribution for a class label y∗ at a test input x∗, we use a model average over all collected samples with respect to the posterior over θd:

p(y∗|x∗, D) = ∫ p(y∗|x∗, θd) p(θd|D) dθd    (11)

            ≈ (1/T) ∑_{k=1}^{T} p(y∗|x∗, θd(k)),   θd(k) ∼ p(θd|D)    (12)
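A minimal sketch of the Monte Carlo model average in Eq. (12), assuming a list of discriminators whose parameters are the collected posterior samples θd(k) and which return class logits.

```python
import torch

def predict_class_probs(x_star, sampled_discriminators):
    # Each D_k carries one posterior sample theta_d^(k) ~ p(theta_d | D)
    # and maps inputs to K-class logits.
    probs = [torch.softmax(D_k(x_star), dim=-1) for D_k in sampled_discriminators]
    return torch.stack(probs).mean(dim=0)   # approx. p(y* | x*, D), Eq. (12)
```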


Stochastic Gradient Hamiltonian Monte Carlo

▶ Hamiltonian Monte Carlo (HMC) is an auxiliary-variable MCMC approach, inspired by physics.

▶ HMC uses gradient information to make better proposals and avoid random-walk behaviour.

▶ Stochastic gradient HMC (Chen et al., 2014) is a new SGD-like algorithm that enables posterior sampling with no more computational complexity than SGD (a minimal update sketch follows below)!

▶ SG-HMC makes it possible to do Bayesian deep learning with insignificant computational overhead!

▶ Likelihood surfaces in deep architectures are very well suited to sampling over optimization!
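A minimal sketch of a single SGHMC update (Chen et al., 2014) in its simplest form: an SGD-with-momentum step plus injected Gaussian noise whose scale matches the friction term. The step size and friction constants are assumptions, and this omits refinements such as the estimated gradient-noise correction.

```python
import torch

def sghmc_step(params, velocities, grads_log_post, lr=1e-4, friction=0.05):
    # params: parameter tensors; velocities: matching momentum buffers;
    # grads_log_post: stochastic gradients of the log posterior w.r.t. params.
    with torch.no_grad():
        for p, v, g in zip(params, velocities, grads_log_post):
            noise = torch.randn_like(p) * (2.0 * friction * lr) ** 0.5
            v.mul_(1.0 - friction).add_(lr * g).add_(noise)  # momentum update
            p.add_(v)                                        # parameter update
```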


Bayesian GAN Learning Algorithm


Exploring a whole distribution over G and D

[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum-likelihood point (θg)ML.]


Avoiding Mode Collapse


Semi-Supervised Results: MNIST


Semi-Supervised Results: CIFAR-10


Semi-Supervised Results


Semi-Supervised Results



More Sample Generation


Discussion

▶ A natural Bayesian generalization of the classical GAN.

▶ Avoids mode collapse and reduces the need for manual intervention.

▶ Has particularly promising results on semi-supervised prediction tasks.

▶ Future directions: deterministic approximate inference, different architectures, different priors, new applications...

▶ Code available: https://github.com/andrewgordonwilson/bayesgan


Scalable Gaussian Processes

▶ Highly accurate kernel approximations that admit fast matrix-vector multiplications (MVMs).

▶ LCG for inference, stochastic Lanczos for log determinants and derivatives (kernel learning).

▶ O(n) training and O(1) testing (instead of O(n³) training and O(n²) testing).

▶ Harmonizes with GPU acceleration.

▶ Very powerful for large-scale spatiotemporal regression.

▶ Implemented in our new library GPyTorch: https://github.com/jrg365/gpytorch
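For concreteness, a minimal sketch of exact GP regression with GPyTorch on toy data; the data, kernel choice, and training settings are assumptions for illustration, while the marginal likelihood and its gradients are computed through the MVM-based machinery described above.

```python
import torch
import gpytorch

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(100)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, x, y, likelihood):
        super().__init__(x, y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

model.train(); likelihood.train()
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)   # marginal log likelihood (CG/Lanczos MVMs inside)
    loss.backward()
    optimizer.step()
```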
