
Bayesian Deep Learning

Andrew Gordon Wilson

Assistant Professor
https://people.orie.cornell.edu/andrew

Cornell University

Toronto Deep Learning Summer School
July 31, 2018

1 / 66


Model Selection

[Figure: airline passengers (thousands) vs. year, 1949–1961.]

Which model should we choose?

(1): f_1(x) = a_0 + a_1 x \qquad (2): f_2(x) = \sum_{j=0}^{3} a_j x^j \qquad (3): f_3(x) = \sum_{j=0}^{10^4} a_j x^j

2 / 66


Bayesian or Frequentist?

3 / 66


Bayesian Deep Learning

Why?

I A powerful framework for model construction and understanding generalization

I Uncertainty representation (crucial for decision making)

I Better point estimates

I It was the most successful approach at the end of the second wave of neural networks (Neal, 1998).

I Neural nets are much less mysterious when viewed through the lens of probability theory.

Why not?

I Can be computationally intractable (but doesn’t have to be).

I Can involve a lot of moving parts (but doesn’t have to).

There has been exciting progress in the last two years addressing these limitations as part of an extremely fruitful research direction.

4 / 66


How do we build models that learn and generalize?

[Figure: airline passengers (thousands) vs. year, 1949–1961.]

Basic Regression Problem

I Training set of N targets (observations) y = (y(x1), . . . , y(xN))T.

I Observations evaluated at inputs X = (x1, . . . , xN)T.

I Want to predict the value of y(x∗) at a test input x∗.

For example: Given CO2 concentrations y measured at times X, what will the CO2 concentration be at a future time x∗, e.g. x∗ = 2024?

Just knowing high school math, what might you try?

5 / 66


How do we build models that learn and generalize?

Guess the parametric form of a function that could fit the data

I f (x,w) = wTx [Linear function of w and x]

I f (x,w) = wTφ(x) [Linear function of w] (Linear Basis Function Model)

I f (x,w) = g(wTφ(x)) [Non-linear in x and w] (E.g., Neural Network)

φ(x) is a vector of basis functions. For example, if φ(x) = (1, x, x²) and x ∈ R¹ then f(x, w) = w_0 + w_1 x + w_2 x² is a quadratic function.

Choose an error measure E(w), minimize with respect to w

I E(w) = \sum_{i=1}^{N} [f(x_i, w) − y(x_i)]²
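As a concrete illustration of this recipe (pick a basis, then minimize the squared error), here is a minimal NumPy sketch with the quadratic basis φ(x) = (1, x, x²) from above; the data are synthetic and purely illustrative.

import numpy as np

# Toy data: a noisy quadratic trend (synthetic, for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(50)

# Polynomial basis phi(x) = (1, x, x^2), as in the quadratic example above
Phi = np.stack([x**0, x**1, x**2], axis=1)      # N x 3 design matrix

# Minimize E(w) = sum_i [f(x_i, w) - y(x_i)]^2  (ordinary least squares)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w                                  # fitted values
print("estimated weights:", w)
print("training error:", np.sum((y_hat - y) ** 2))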

6 / 66


How do we build models that learn and generalize?

A probabilistic approach

We could explicitly account for noise in our model.

I y(x) = f (x,w) + ε(x) , where ε(x) is a noise function.

One commonly takes ε(x) ∼ N(0, σ²) for i.i.d. additive Gaussian noise, in which case

p(y(x) | x, w, σ²) = N(y(x); f(x, w), σ²)   Observation Model   (1)

p(y | X, w, σ²) = \prod_{i=1}^{N} N(y(x_i); f(x_i, w), σ²)   Likelihood   (2)

I Maximize the likelihood of the data p(y|X, w, σ²) with respect to σ² and w.

For a Gaussian noise model, this approach will make the same predictions as using a squared loss error function:

log p(y | X, w, σ²) ∝ −\frac{1}{2σ²} \sum_{i=1}^{N} [f(x_i, w) − y(x_i)]²   (3)

7 / 66


How do we build models that learn and generalize?

I The probabilistic approach helps us interpret the error measure in a deterministic approach, and gives us a sense of the noise level σ².

I Both approaches are prone to over-fitting for flexible f(x, w): low error on the training data, high error on the test set.

Regularization

I Use a penalized log likelihood (or error function), such as

E(w) = \underbrace{−\frac{1}{2σ²} \sum_{i=1}^{n} [f(x_i, w) − y(x_i)]²}_{\text{model fit}} \; \underbrace{−\, λ w^T w}_{\text{complexity penalty}} .   (4)

I But how should we define and penalize complexity?

I Can set λ using cross-validation.

I Same as maximizing a posterior log p(w|y, X) = log p(y|w, X) + log p(w) + const with a Gaussian prior p(w). But this is not Bayesian!
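To make the last point concrete, here is a minimal sketch (synthetic data, an arbitrarily chosen λ): with a Gaussian likelihood and a zero-mean Gaussian prior on w, the MAP weights are the familiar ridge-regression solution.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(40)

# Basis phi(x) = (1, x, x^2, x^3)
Phi = np.stack([x**j for j in range(4)], axis=1)

lam = 1e-2   # complexity penalty; chosen arbitrarily here, cross-validated in practice

# MAP / ridge solution: w = (Phi^T Phi + lam I)^{-1} Phi^T y
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
w_mle = np.linalg.lstsq(Phi, y, rcond=None)[0]    # unregularized maximum likelihood

print("MLE weights:", w_mle)
print("MAP weights:", w_map)   # shrunk toward zero by the Gaussian prior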

8 / 66


Bayesian Inference

Bayes’ Rule

p(a|b) = p(b|a)p(a)/p(b) , p(a|b) ∝ p(b|a)p(a) . (5)

posterior = \frac{\text{likelihood} × \text{prior}}{\text{marginal likelihood}} , \qquad p(w | y, X, σ²) = \frac{p(y | X, w, σ²)\, p(w)}{p(y | X, σ²)} .   (6)

Sum Rule

p(x) = \sum_{y} p(x, y)   (7)

Product Rule

p(x, y) = p(x|y)p(y) = p(y|x)p(x) (8)

9 / 66


Predictive Distribution

Sum rule: p(x) = \sum_{y} p(x, y). Product rule: p(x, y) = p(x|y)p(y) = p(y|x)p(x).

p(y | x∗, y, X) = \int p(y | x∗, w)\, p(w | y, X)\, dw .   (9)

I Think of each setting of w as a different model. Eq. (9) is a Bayesian model average, an average of infinitely many models weighted by their posterior probabilities.

I No over-fitting, automatically calibrated complexity.

I Eq. (9) is not analytic for many likelihoods p(y|X, w, σ²) and priors p(w). (But recent advances such as SG-HMC have made these computations much more tractable in deep learning.)

I Typically we are more interested in the induced distribution over functions than in the parameters w. It can be hard to have intuitions for priors p(w).
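In practice Eq. (9) is usually approximated by Monte Carlo: draw samples w_s from the posterior (e.g., with SG-HMC) and average the resulting predictions. A minimal sketch, in which posterior_samples is a placeholder standing in for real MCMC output:

import numpy as np

def f(x, w):
    # Example model: polynomial in x with weights w (illustrative only)
    return sum(w_j * x**j for j, w_j in enumerate(w))

# Assume these came from a sampler targeting p(w | y, X); here they are just
# placeholder draws around a point estimate, for illustration.
rng = np.random.default_rng(2)
posterior_samples = 0.5 + 0.1 * rng.standard_normal((1000, 4))   # S x dim(w)

x_star = 0.3
preds = np.array([f(x_star, w) for w in posterior_samples])   # Bayesian model average, Eq. (9)
print("predictive mean:", preds.mean())
print("predictive std:", preds.std())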

10 / 66


Example: Bent Coin

Suppose we flip a bent coin with probability λ of landing tails.

1. What is the likelihood of a set of data D = {x1, x2, . . . , xn}?

2. What is the maximum likelihood solution for λ?

3. Suppose we observe 2 tails in the first two flips. What is the probability that the next flip will be tails, using maximum likelihood?

11 / 66


Example: Bent Coin

Likelihood of getting m tails is

p(D | m, λ) = \binom{N}{m} λ^m (1 − λ)^{N−m}   (10)

If we choose a prior p(λ) ∝ λ^α (1 − λ)^β then the posterior will have the same functional form as the prior.

12 / 66


Example: Bent Coin

Likelihood of getting m tails is

p(D | m, λ) = \binom{N}{m} λ^m (1 − λ)^{N−m}   (11)

If we choose a prior p(λ) ∝ λ^α (1 − λ)^β then the posterior will have the same functional form as the prior. We can choose the beta distribution:

Beta(λ | a, b) = \frac{Γ(a + b)}{Γ(a)Γ(b)} λ^{a−1} (1 − λ)^{b−1}   (12)

The Gamma functions ensure that the distribution is normalised:

\int Beta(λ | a, b)\, dλ = 1   (13)

Moments:

E[λ] = \frac{a}{a + b}   (14)

var[λ] = \frac{ab}{(a + b)² (a + b + 1)} .   (15)

13 / 66


Beta Distribution

14 / 66


Example: Bent Coin

Applying Bayes theorem, we find:

p(λ|D) ∝ p(D|λ)p(λ) (16)

= Beta(λ; m + a,N − m + b) (17)

We can view a and b as pseudo-observations!

E[λ | D] = \frac{m + a}{N + a + b}   (18)

1. What is the probability that the next flip is tails?

2. What happens in the limits of a, b?

3. What happens in the limit of infinite data?
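A small numerical sketch of these questions for the two-tails example, taking a = b = 1 (a uniform prior, an illustrative choice): the Bayesian probability of tails on the next flip is the posterior mean of Eq. (18), while maximum likelihood uses m/N.

# Bent coin: m tails observed out of N flips, Beta(a, b) prior on lambda
m, N = 2, 2          # two tails in the first two flips
a, b = 1.0, 1.0      # uniform prior (illustrative choice)

lam_mle = m / N                               # maximum likelihood estimate -> 1.0
p_next_tails_bayes = (m + a) / (N + a + b)    # posterior mean, Eq. (18) -> 0.75

print("ML probability next flip is tails:", lam_mle)            # overconfident
print("Bayesian probability next flip is tails:", p_next_tails_bayes)

# Limits: large a, b -> prediction dominated by the prior mean a / (a + b);
# infinite data (N -> inf, m/N fixed) -> prediction -> m / N, the prior washes out.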

15 / 66


A Function Space View: Gaussian processes

\underbrace{p(f(x) | D)}_{\text{Function posterior}} ∝ \underbrace{p(D | f(x))}_{\text{Likelihood}} \; \underbrace{p(f(x))}_{\text{Function prior}}

[Figure: sample prior functions (left) and sample posterior functions (right); outputs f(x) vs. inputs x.]

Radford Neal showed in 1996 that a Bayesian neural network with an infinite number of hidden units is a Gaussian process!

16 / 66


Gaussian processes

Definition

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

I Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning (f(x_1), . . . , f(x_N)) ∼ N(µ, K), with µ_i = m(x_i) and K_{ij} = cov(f(x_i), f(x_j)) = k(x_i, x_j).

\underbrace{p(f(x) | D)}_{\text{GP posterior}} ∝ \underbrace{p(D | f(x))}_{\text{Likelihood}} \; \underbrace{p(f(x))}_{\text{GP prior}}

[Figure: sample prior functions (left) and sample posterior functions (right); outputs f(x) vs. inputs x.]

17 / 66


Linear Basis Models

Consider the simple linear model,

f (x) = a0 + a1x , (19)

a0, a1 ∼ N (0, 1) . (20)

[Figure: sample functions f(x) = a_0 + a_1 x drawn from the prior; output f(x) vs. input x.]

18 / 66


Linear Basis Models

Consider the simple linear model,

f (x) = a0 + a1x , (21)

a0, a1 ∼ N (0, 1) . (22)

[Figure: sample functions f(x) = a_0 + a_1 x drawn from the prior; output f(x) vs. input x.]

Any collection of values has a joint Gaussian distribution

[f (x1), . . . , f (xN)] ∼ N (0,K) , (23)

K_{ij} = cov(f(x_i), f(x_j)) = k(x_i, x_j) = 1 + x_i x_j .   (24)

By definition, f (x) is a Gaussian process.
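A quick empirical check of Eq. (24), purely for illustration: sample many (a_0, a_1) pairs, evaluate f at two fixed inputs, and compare the sample covariance with 1 + x_i x_j.

import numpy as np

rng = np.random.default_rng(3)
S = 200_000
a0 = rng.standard_normal(S)    # a0 ~ N(0, 1)
a1 = rng.standard_normal(S)    # a1 ~ N(0, 1)

xi, xj = 0.7, -1.3
fi = a0 + a1 * xi              # f(x_i) for each prior sample
fj = a0 + a1 * xj              # f(x_j) for each prior sample

print("empirical cov :", np.cov(fi, fj)[0, 1])
print("1 + x_i * x_j :", 1 + xi * xj)   # Eq. (24)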

19 / 66


Linear Basis Function Models

Model Specification

f (x,w) = wTφ(x) (25)

p(w) = N (0,Σw) (26)

Moments of Induced Distribution over Functions

E[f (x,w)] = m(x) = E[wT]φ(x) = 0 (27)

cov(f (xi), f (xj)) = k(xi, xj) = E[f (xi)f (xj)]− E[f (xi)]E[f (xj)] (28)

= φ(xi)TE[wwT]φ(xj)− 0 (29)

= φ(xi)TΣwφ(xj) (30)

I f(x, w) is a Gaussian process, f(x) ∼ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_i, x_j) = φ(x_i)^T Σ_w φ(x_j).

I The entire basis function model of Eqs. (25) and (26) is encapsulated as a distribution over functions with kernel k(x, x′).

20 / 66


Example: RBF Kernel

k_RBF(x, x′) = cov(f(x), f(x′)) = a² \exp\left(−\frac{||x − x′||²}{2ℓ²}\right)   (31)

I Far and away the most popular kernel.

I Expresses the intuition that function values at nearby inputs are more correlated than function values at far away inputs.

I The kernel hyperparameters a and ℓ control the amplitude and wiggliness of these functions.

I GPs with an RBF kernel have large support and are universal approximators.

21 / 66


Sampling from a GP with an RBF Kernel

k_rbf = @(xi,xj) 1^2 * exp(-(xi - xj).^2 / (2*2^2)); % RBF kernel; a = 1, ell = 2 chosen for illustration

x = [-10:0.2:10]';        % inputs (where we query the GP)
N = numel(x);             % number of inputs
K = zeros(N,N);           % covariance matrix

% very inefficient way of creating K in Matlab
for i = 1:N
    for j = 1:N
        K(i,j) = k_rbf(x(i), x(j));
    end
end

K = K + 1e-6*eye(N);      % add jitter for conditioning of K
CK = chol(K);             % upper-triangular factor, K = CK'*CK
f = CK'*randn(N,1);       % draws from N(0,K)
plot(x,f);

22 / 66


Samples from a GP with an RBF Kernel

23 / 66


RBF Kernel Covariance Matrix

k_RBF(x, x′) = cov(f(x), f(x′)) = a² \exp\left(−\frac{||x − x′||²}{2ℓ²}\right)   (32)

The covariance matrix K for ordered inputs on a 1D grid. Kij = kRBF(xi, xj).

24 / 66


Learning and Predictions with Gaussian Processes
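A minimal NumPy sketch of the standard GP predictive equations (Rasmussen and Williams, 2006), with illustrative data and hyperparameters: the predictive mean is k_*^T (K + σ²I)^{−1} y and the predictive variance is k(x_*, x_*) − k_*^T (K + σ²I)^{−1} k_*.

import numpy as np

def k_rbf(x1, x2, a=1.0, ell=1.0):
    # RBF kernel matrix between 1-D input vectors
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return a**2 * np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(4)
X = rng.uniform(-5, 5, 8)                  # training inputs (illustrative)
y = np.sin(X) + 0.1 * rng.standard_normal(8)
Xs = np.linspace(-10, 10, 200)             # test inputs
sigma2 = 0.1**2                            # noise variance

K = k_rbf(X, X) + sigma2 * np.eye(len(X))
Ks = k_rbf(X, Xs)                          # train-test covariances
Kss = k_rbf(Xs, Xs)

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y

mean = Ks.T @ alpha                                   # posterior predictive mean
v = np.linalg.solve(L, Ks)
var = np.diag(Kss) - np.sum(v**2, axis=0)             # posterior variance of f(x*)
print(mean[:3], var[:3])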

25 / 66


Inference using an RBF kernel

I Specify f (x) ∼ GP(0, k).

I Choose k_RBF(x, x′) = a_0² \exp\left(−\frac{||x − x′||²}{2ℓ_0²}\right). Choose values for a_0 and ℓ_0.

I Observe data, look at the prior and posterior over functions.

[Figure: (a) samples from the GP prior; (b) samples from the GP posterior; output f(x) vs. input x.]

I Does something look strange about these functions?

26 / 66


Inference using an RBF kernel

Increase the length-scale ℓ.

[Figure: (a) samples from the GP prior with the longer length-scale; (b) the corresponding posterior samples.]

27 / 66


Learning and Model Selection

p(M_i | y) = \frac{p(y | M_i)\, p(M_i)}{p(y)}   (33)

We can write the evidence of the model as

p(y | M_i) = \int p(y | f, M_i)\, p(f)\, df ,   (34)

[Figure: (a) marginal likelihood p(y|M) over all possible datasets y for simple, complex, and appropriate models; (b) fits of simple, complex, and appropriate models to data, output f(x) vs. input x.]

28 / 66


Learning and Model Selection

I We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone.

p(y | θ, X) = \int p(y | f, X)\, p(f | θ, X)\, df .   (35)

log p(y | θ, X) = \underbrace{−\frac{1}{2} y^T (K_θ + σ²I)^{−1} y}_{\text{model fit}} \; \underbrace{−\frac{1}{2} \log |K_θ + σ²I|}_{\text{complexity penalty}} − \frac{N}{2} \log(2π) .   (36)

I An extremely powerful mechanism for kernel learning.
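A minimal NumPy sketch of Eq. (36), evaluated (not optimized) for two illustrative length-scales; in practice one maximizes this quantity with respect to the kernel hyperparameters.

import numpy as np

def log_marginal_likelihood(X, y, a, ell, sigma2):
    # Eq. (36): log p(y | theta, X) for a zero-mean GP with an RBF kernel
    d2 = (X[:, None] - X[None, :]) ** 2
    K = a**2 * np.exp(-d2 / (2 * ell**2)) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    model_fit = -0.5 * y @ alpha                  # -1/2 y^T K^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))      # -1/2 log |K|
    return model_fit + complexity - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(-5, 5, 30))
y = np.sin(X) + 0.1 * rng.standard_normal(30)

# Compare two length-scales; the better-suited one receives higher evidence.
print(log_marginal_likelihood(X, y, a=1.0, ell=0.1, sigma2=0.01))
print(log_marginal_likelihood(X, y, a=1.0, ell=1.0, sigma2=0.01))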

[Figure: (c) samples from the GP prior; (d) samples from the GP posterior; output f(x) vs. input x.]

29 / 66


Aside: How Do We Build Models that Generalize?

[Figure: marginal likelihood p(y|M) over all possible datasets y for simple, complex, and appropriate models.]

I Support: which datasets (hypotheses) are a priori possible.

I Inductive Biases: which datasets are a priori likely.

We want to make the support of our model as big as possible, with inductive biases calibrated to particular applications, so as not to rule out potential explanations of the data, while at the same time learning quickly from a finite amount of information on a particular application.

30 / 66


Gaussian Processes and Neural Networks

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” (MacKay, 1998)

31 / 66


Deep Kernel Learning

We combine the inductive biases of deep learning architectures with the non-parametric flexibility of Gaussian processes as part of a scalable deep kernel learning framework.

[Diagram: inputs x_1 … x_D pass through hidden layers h^(1) … h^(L) with weights W^(1) … W^(L), then through an infinite basis layer h_∞(θ), to outputs y_1 … y_P.]

Deep Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*, Ruslan Salakhutdinov, and Eric P. Xing
Artificial Intelligence and Statistics (AISTATS), 2016.

32 / 66


Deep Kernel Learning

Starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, we transform the inputs (predictors) x as

k(x_i, x_j | θ) → k(g(x_i, w), g(x_j, w) | θ, w) ,   (37)

where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w.

We use spectral mixture base kernels (Wilson, 2014):

k_SM(x, x′ | θ) = \sum_{q=1}^{Q} a_q \frac{|Σ_q|^{1/2}}{(2π)^{D/2}} \exp\left(−\frac{1}{2} ||Σ_q^{1/2} (x − x′)||²\right) \cos\langle x − x′, 2π µ_q \rangle .   (38)

Backprop to learn all parameters jointly through the GP marginal likelihood
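A minimal sketch of Eq. (37) with made-up shapes and untrained weights: transform the inputs with a small network g(·, w), then apply an RBF base kernel to the transformed inputs. The joint learning of w and θ through the marginal likelihood is omitted here.

import numpy as np

rng = np.random.default_rng(6)

# A tiny two-layer network g(x, w): R^D -> R^H (weights are random placeholders)
D, H = 5, 3
W1, W2 = rng.standard_normal((16, D)), rng.standard_normal((H, 16))

def g(X):
    return np.tanh(X @ W1.T) @ W2.T          # N x H transformed inputs

def k_rbf(Z1, Z2, a=1.0, ell=1.0):
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return a**2 * np.exp(-d2 / (2 * ell**2))

X = rng.standard_normal((10, D))             # illustrative inputs

K_base = k_rbf(X, X)                         # base kernel on the raw inputs
K_deep = k_rbf(g(X), g(X))                   # deep kernel: k(g(x_i, w), g(x_j, w) | theta, w)
print(K_base.shape, K_deep.shape)            # both (10, 10) covariance matrices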

33 / 66


Scalable Gaussian Processes

I Run Gaussian processes on millions of points in seconds, instead of thousands of points in hours.

I Outperforms stand-alone deep neural networks by learning deep kernels.

I Approach is based on kernel approximations which admit fast matrix-vector multiplies (Wilson and Nickisch, 2015).

I Harmonizes with GPU acceleration.

I O(n) training and O(1) testing (instead of O(n³) training and O(n²) testing).

I Implemented in our new library GPyTorch:

https://github.com/cornellius-gp/gpytorch

34 / 66


Deep Kernel Learning Results

Similar runtime but improved performance over stand-alone DNNs and scalable kernel learning on a wide range of benchmarks.

35 / 66


Face Orientation Extraction

Figure: Left: Randomly sampled examples of the training and test data, with orientation labels. Right: The two-dimensional outputs of the convolutional network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.

36 / 66


Learning Flexible Non-Euclidean Similarity Metrics

Figure: Left: The induced covariance matrix using the DKL-SM kernel on a set of test cases, where the test samples are ordered according to the orientations of the input faces. Middle: The respective covariance matrix using the DKL-RBF kernel. Right: The respective covariance matrix using the regular RBF kernel. The models are trained with n = 12,000. We set Q = 4 for the SM base kernel.

37 / 66


Step Function

Figure: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with an SM base kernel. We set Q = 4 for the SM kernels.

38 / 66


LSTM Kernels

I We derive kernels which have recurrent LSTM inductive biases, and apply them to autonomous vehicles, where predictive uncertainty is critical.

[Figure: autonomous driving data: vehicle position (North vs. East, miles) and speed profiles.]

Learning Scalable Deep Kernels with Recurrent Structure
M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, E. P. Xing
Journal of Machine Learning Research (JMLR), 2017

39 / 66


GP-LSTM Predictive Distributions

[Figure: GP-LSTM predictive distributions of front distance (m) vs. side distance (m) across several test scenarios.]

40 / 66


The Bayesian GAN

[Figure: a multimodal posterior p(θ_g | D) over generator weights θ_g, contrasted with a single maximum likelihood point estimate (θ_g)_ML.]

Wilson and Saatchi (NIPS 2017)

41 / 66


Wide Optima Generalize Better

Keskar et al. (2017)

I Bayesian integration will give very different predictions in deep learning especially!

42 / 66


Loss Surfaces in Deep Learning

(1) Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
(2) Averaging Weights Leads to Wider Optima and Better Generalization
(3) Consistency-Based Semi-Supervised Learning with Weight Averaging

I Local optima are connected along simple curves

I SGD does not converge to broad optima in all directions

I Averaging weights along SGD trajectories with constant or cyclical learning rates leads to much faster convergence and solutions that generalize better.

I Can approximate ensembles and Bayesian model averages with a single model.

43 / 66


Mode Connectivity

[Figure: three loss-surface contour plots (levels 0.065 to > 5) in planes through weight space, illustrating that local optima are connected by low-loss curves.]

44 / 66


Example Parametrizations

Polygonal Chain:

φ_θ(t) = \begin{cases} 2\,(tθ + (0.5 − t)w_1), & 0 ≤ t ≤ 0.5 \\ 2\,((t − 0.5)w_2 + (1 − t)θ), & 0.5 ≤ t ≤ 1. \end{cases}   (39)

Bezier Curve:

φ_θ(t) = (1 − t)² w_1 + 2t(1 − t)θ + t² w_2, \quad 0 ≤ t ≤ 1.   (40)

[Figure: two loss-surface contour plots (levels 0.065 to > 5) for the curve parametrizations.]

45 / 66


Connection Procedure with Tractable Loss

We propose the computationally tractable loss:

ℓ(θ) = \int_0^1 L(φ_θ(t))\, dt = E_{t∼U(0,1)} L(φ_θ(t))   (41)

At each iteration we sample t ∼ U(0, 1) and make a gradient step for θ with respect to L(φ_θ(t)):

E_{t∼U(0,1)} ∇_θ L(φ_θ(t)) = ∇_θ E_{t∼U(0,1)} L(φ_θ(t)) = ∇_θ ℓ(θ).
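A toy sketch of this procedure using the Bezier parametrization of Eq. (40), with a simple quadratic stand-in for the training loss L so that its gradient is available in closed form (a real implementation would backpropagate through a network). For the Bezier curve, ∂φ_θ(t)/∂θ = 2t(1 − t)·I, so ∇_θ L(φ_θ(t)) = 2t(1 − t) ∇_w L(w) evaluated at w = φ_θ(t).

import numpy as np

rng = np.random.default_rng(7)
dim = 10

# Two fixed endpoint weight vectors and a toy quadratic loss (illustrative only)
w1, w2 = rng.standard_normal(dim), rng.standard_normal(dim)
w_star = rng.standard_normal(dim)                # minimizer of the toy loss
L = lambda w: np.sum((w - w_star) ** 2)          # stand-in training loss
grad_L = lambda w: 2 * (w - w_star)              # its gradient

def phi(t, theta):                               # Bezier curve, Eq. (40)
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t**2 * w2

theta = 0.5 * (w1 + w2)                          # initialize the bend point between the endpoints
lr = 0.05
for step in range(2000):
    t = rng.uniform()                            # sample t ~ U(0, 1)
    w_t = phi(t, theta)
    grad_theta = 2 * t * (1 - t) * grad_L(w_t)   # chain rule through Eq. (40)
    theta -= lr * grad_theta                     # unbiased stochastic gradient step on l(theta)

print("expected loss along curve:",
      np.mean([L(phi(t, theta)) for t in np.linspace(0, 1, 101)]))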

46 / 66


Curve Ensembling

47 / 66


Fast Geometric Ensembling

48 / 66


Trajectory of SGD

[Figure: test error (%) surface in a plane containing the weights W_1, W_2, W_3 and their average W_SWA (error levels 19.95 to > 50).]

49 / 66


Trajectory of SGD

[Figure: the same test error (%) surface, again showing W_1, W_2, W_3 and W_SWA.]

50 / 66


Trajectory of SGD

[Figure: the same test error (%) surface, again showing W_1, W_2, W_3 and W_SWA.]

51 / 66


Trajectory of SGD

[Figure: left, the test error (%) surface with W_1, W_2, W_3 and W_SWA; middle and right, test error (%) and train loss at epoch 125 in a plane containing W_SGD and W_SWA.]

52 / 66


Trajectory of SGD

[Figure: train loss and test error (%) surfaces in two planes through the SGD trajectory and the averaged solution.]

53 / 66


Following Random Paths


54 / 66


Path from wSWA to wSGD

[Figure: test error (%) and train loss along the path from W_SWA to W_SGD, as a function of distance; the SWA and SGD solutions are marked.]

55 / 66


Approximating an FGE Ensemble

Because the points sampled from an FGE ensemble take small steps in weight space by design, we can do a linearization analysis to show that

f(w_SWA) ≈ \frac{1}{n} \sum_{i=1}^{n} f(w_i)
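A minimal sketch of the corresponding weight averaging (SWA); get_sgd_weights_at_epoch is a hypothetical stand-in for snapshotting the network weights along an SGD run with a constant or cyclical learning rate.

import numpy as np

def get_sgd_weights_at_epoch(epoch, dim=10, seed=0):
    # Hypothetical stand-in: flattened network weights after `epoch` epochs of SGD
    # with a high constant learning rate (here just noisy draws around a fixed
    # point, for illustration).
    rng = np.random.default_rng(seed + epoch)
    w_center = np.ones(dim)
    return w_center + 0.5 * rng.standard_normal(dim)

# Stochastic Weight Averaging: running average of weights collected along the trajectory
w_swa, n = None, 0
for epoch in range(100, 200):                 # start averaging once SGD has reached the basin
    w = get_sgd_weights_at_epoch(epoch)
    n += 1
    w_swa = w if w_swa is None else w_swa + (w - w_swa) / n

print("W_SWA (first 3 coordinates):", w_swa[:3])   # f(w_swa) approximates the average of f(w_i)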

56 / 66


SWA Results, CIFAR

57 / 66


SWA Results, ImageNet (Top-1 Error Rate)

58 / 66


Sampling from a High Dimensional Gaussian

SGD (with constant LR) proposals are on the surface of a hypersphere. Averaging lets us go inside the sphere to a point of higher density.

59 / 66


High Constant LR

[Figure: test error (%) vs. epochs for conventional SGD, constant-learning-rate SGD, and constant-learning-rate SWA.]

Side observation: Averaging bad models does not give good solutions. Averaging bad weights can give great solutions.

60 / 66


Conclusions

I Bayesian deep learning can provide better predictions and uncertainty estimates.

I Computation is a key challenge. We address this issue through developing numerical linear algebra approaches.

I Developing fast deterministic (e.g., variational) approximations and stochastic MCMC algorithms will be important too.

I We can better understand model construction, and generalization, by taking a function space approach.

I We can be inspired by Bayesian methods to develop optimization procedures that generalize better.

I Code for the approaches discussed here is available at: https://people.orie.cornell.edu/andrew/code

61 / 66


Appendix

62 / 66


Deriving the RBF Kernel

f(x) = \sum_{i=1}^{J} w_i φ_i(x) , \quad w_i ∼ N\left(0, \frac{σ²}{J}\right) , \quad φ_i(x) = \exp\left(−\frac{(x − c_i)²}{2ℓ²}\right)   (42)

∴ k(x, x′) = \frac{σ²}{J} \sum_{i=1}^{J} φ_i(x) φ_i(x′)   (43)

I Let c_J = \log J, c_1 = −\log J, and c_{i+1} − c_i = Δc = \frac{2 \log J}{J}, and J → ∞; the kernel in Eq. (43) becomes a Riemann sum:

k(x, x′) = \lim_{J → ∞} \frac{σ²}{J} \sum_{i=1}^{J} φ_i(x) φ_i(x′) = \int_{c_0}^{c_∞} φ_c(x) φ_c(x′)\, dc   (44)

I By setting c_0 = −∞ and c_∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart:

k(x, x′) = \int_{−∞}^{∞} \exp\left(−\frac{(x − c)²}{2ℓ²}\right) \exp\left(−\frac{(x′ − c)²}{2ℓ²}\right) dc   (45)

= \sqrt{π}\, ℓ\, σ² \exp\left(−\frac{(x − x′)²}{2(\sqrt{2}\, ℓ)²}\right) .   (46)

63 / 66


Deriving the RBF Kernel

I It is remarkable that we can work with infinitely many basis functions with finite amounts of computation using the kernel trick – replacing inner products of basis functions with kernels.

I The RBF kernel, also known as the Gaussian or squared exponential kernel, is by far the most popular kernel: k_RBF(x, x′) = a² \exp\left(−\frac{||x − x′||²}{2ℓ²}\right).

I Functions drawn from a GP with an RBF kernel are infinitely differentiable. For this reason, the RBF kernel is accused of being overly smooth and unrealistic. Nonetheless it has nice theoretical properties...

64 / 66


References

I A.G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML), 2015.

I P. Izmailov*, D. Podoprikhin*, T. Garipov*, D. Vetrov, A.G. Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. Uncertainty in Artificial Intelligence (UAI), 2018.

I T. Garipov*, P. Izmailov*, D. Podoprikhin*, D. Vetrov, A.G. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs, 2018.

I R. Neal. Bayesian Learning for Neural Networks. PhD Thesis, 1996.

I D. MacKay. Introduction to Gaussian Processes. Neural Networks and Machine Learning, 1998.

I D. MacKay. Information Theory, Inference, and Learning Algorithms, 2003.

I D. MacKay. Bayesian Interpolation, 1992.

I M. Welling and Y.W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. International Conference on Machine Learning (ICML), 2011.

65 / 66


References

I T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. International Conference on Machine Learning (ICML), 2014.

I J. Gardner, G. Pleiss, K. Weinberger, A.G. Wilson. GPyTorch: Scalable and Modular Gaussian Processes in PyTorch: https://github.com/cornellius-gp/gpytorch.

I Codebase for group projects: https://people.orie.cornell.edu/andrew/code.

I B. Athiwaratkun, M. Finzi, P. Izmailov, A.G. Wilson. Improving Consistency-Based Semi-Supervised Learning with Weight Averaging.

I A.G. Wilson*, Z. Hu*, R. Salakhutdinov, and E.P. Xing. Deep kernel learning. Artificial Intelligence and Statistics (AISTATS), 2016.

I A.G. Wilson*, Z. Hu*, R. Salakhutdinov, and E.P. Xing. Stochastic Variational Deep Kernel Learning. Neural Information Processing Systems (NIPS), 2016.

I M. Al-Shedivat, A.G. Wilson, Y. Saatchi, Z. Hu, and E.P. Xing. Learning Scalable Deep Kernels with Recurrent Structure. Journal of Machine Learning Research (JMLR), 2017.

I C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

I N.S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2017.

66 / 66