Bayesian Deep Learning
Andrew Gordon Wilson
Assistant Professor, Cornell University
https://people.orie.cornell.edu/andrew
Toronto Deep Learning Summer School, July 31, 2018
Model Selection
[Figure: Airline passengers (thousands) against year, 1949–1961.]
Which model should we choose?
(1): $f_1(x) = a_0 + a_1 x$    (2): $f_2(x) = \sum_{j=0}^{3} a_j x^j$    (3): $f_3(x) = \sum_{j=0}^{10^4} a_j x^j$
Bayesian or Frequentist?
Bayesian Deep Learning
Why?

- A powerful framework for model construction and understanding generalization
- Uncertainty representation (crucial for decision making)
- Better point estimates
- It was the most successful approach at the end of the second wave of neural networks (Neal, 1996).
- Neural nets are much less mysterious when viewed through the lens of probability theory.

Why not?

- Can be computationally intractable (but doesn't have to be).
- Can involve a lot of moving parts (but doesn't have to).

There has been exciting progress in the last two years addressing these limitations as part of an extremely fruitful research direction.
How do we build models that learn and generalize?
[Figure: Airline passengers (thousands) against year, 1949–1961.]
Basic Regression Problem

- Training set of $N$ targets (observations) $\mathbf{y} = (y(x_1), \ldots, y(x_N))^T$.
- Observations evaluated at inputs $X = (x_1, \ldots, x_N)^T$.
- Want to predict the value of $y(x_*)$ at a test input $x_*$.

For example: Given CO2 concentrations $\mathbf{y}$ measured at times $X$, what will the CO2 concentration be for $x_* = 2024$, 10 years from now?

Just knowing high school math, what might you try?
How do we build models that learn and generalize?
Guess the parametric form of a function that could fit the data

- $f(x, \mathbf{w}) = \mathbf{w}^T x$  [Linear function of $\mathbf{w}$ and $x$]
- $f(x, \mathbf{w}) = \mathbf{w}^T \phi(x)$  [Linear function of $\mathbf{w}$]  (Linear Basis Function Model)
- $f(x, \mathbf{w}) = g(\mathbf{w}^T \phi(x))$  [Non-linear in $x$ and $\mathbf{w}$]  (E.g., Neural Network)

$\phi(x)$ is a vector of basis functions. For example, if $\phi(x) = (1, x, x^2)$ and $x \in \mathbb{R}^1$, then $f(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2$ is a quadratic function.

Choose an error measure $E(\mathbf{w})$, minimize with respect to $\mathbf{w}$:

- $E(\mathbf{w}) = \sum_{i=1}^{N} [f(x_i, \mathbf{w}) - y(x_i)]^2$  (see the sketch below)
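As a concrete illustration, here is a minimal MATLAB sketch (my addition, not from the slides; the toy data are assumed) of minimizing this squared error for the quadratic basis above, using the least-squares backslash operator:

% Fit f(x,w) = w0 + w1*x + w2*x^2 by minimizing
% E(w) = sum_i [f(x_i,w) - y(x_i)]^2 (ordinary least squares).
x = linspace(-1, 1, 50)';                      % inputs
y = 2 + 0.5*x - 1.5*x.^2 + 0.1*randn(50, 1);   % noisy toy observations
Phi = [ones(size(x)), x, x.^2];                % basis phi(x) = (1, x, x^2)
w = Phi \ y;                                   % least-squares solution for w
plot(x, y, 'o', x, Phi*w, '-');                % data and fitted function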
How do we build models that learn and generalize?
A probabilistic approach

We could explicitly account for noise in our model:

- $y(x) = f(x, \mathbf{w}) + \epsilon(x)$, where $\epsilon(x)$ is a noise function.

One commonly takes $\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$ for i.i.d. additive Gaussian noise, in which case

$p(y(x) \mid x, \mathbf{w}, \sigma^2) = \mathcal{N}(y(x); f(x, \mathbf{w}), \sigma^2)$    (Observation Model)    (1)

$p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(y(x_i); f(x_i, \mathbf{w}), \sigma^2)$    (Likelihood)    (2)

- Maximize the likelihood of the data $p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)$ with respect to $\sigma^2$ and $\mathbf{w}$.

For a Gaussian noise model, this approach will make the same predictions as using a squared loss error function:

$\log p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2) \propto -\frac{1}{2\sigma^2} \sum_{i=1}^{N} [f(x_i, \mathbf{w}) - y(x_i)]^2$    (3)
How do we build models that learn and generalize?
- The probabilistic approach helps us interpret the error measure in a deterministic approach, and gives us a sense of the noise level $\sigma^2$.
- Both approaches are prone to over-fitting for flexible $f(x, \mathbf{w})$: low error on the training data, high error on the test set.

Regularization

- Use a penalized log likelihood (or error function), such as

$E(\mathbf{w}) = \overbrace{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} [f(x_i, \mathbf{w}) - y(x_i)]^2}^{\text{model fit}} \; \overbrace{-\, \lambda \mathbf{w}^T \mathbf{w}}^{\text{complexity penalty}} \,.$    (4)

- But how should we define and penalize complexity?
- Can set $\lambda$ using cross-validation.
- Same as maximizing a posterior $\log p(\mathbf{w} \mid \mathbf{y}, X) = \log p(\mathbf{y} \mid \mathbf{w}, X) + \log p(\mathbf{w}) + \text{const}$ with a Gaussian prior $p(\mathbf{w})$. But this is not Bayesian!
Bayesian Inference
Bayes' Rule

$p(a \mid b) = p(b \mid a)\, p(a) / p(b)\,, \qquad p(a \mid b) \propto p(b \mid a)\, p(a)\,.$    (5)

$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}\,, \qquad p(\mathbf{w} \mid \mathbf{y}, X, \sigma^2) = \frac{p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)\, p(\mathbf{w})}{p(\mathbf{y} \mid X, \sigma^2)}\,.$    (6)

Sum Rule

$p(x) = \sum_{y} p(x, y)$    (7)

Product Rule

$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$    (8)
Predictive Distribution
Sum rule: $p(x) = \sum_y p(x, y)$. Product rule: $p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$.

$p(y \mid x_*, \mathbf{y}, X) = \int p(y \mid x_*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{y}, X)\, d\mathbf{w}\,.$    (9)

- Think of each setting of $\mathbf{w}$ as a different model. Eq. (9) is a Bayesian model average, an average of infinitely many models weighted by their posterior probabilities (see the sketch below).
- No over-fitting, automatically calibrated complexity.
- Eq. (9) is not analytic for many likelihoods $p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)$ and priors $p(\mathbf{w})$. (But recent advances such as SG-HMC have made these computations much more tractable in deep learning.)
- Typically we are more interested in the induced distribution over functions than in the parameters $\mathbf{w}$. It can be hard to have intuitions for priors on $p(\mathbf{w})$.
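As a minimal illustration of Eq. (9) (my sketch, not from the slides): for Bayesian linear regression with a Gaussian prior, the posterior over $\mathbf{w}$ is Gaussian, so we can draw posterior samples and average their predictions:

% Monte Carlo Bayesian model average for Bayesian linear regression,
% y = w'*phi(x) + noise, with prior w ~ N(0, I) and noise variance s2.
x = linspace(-1, 1, 30)'; y = sin(2*x) + 0.1*randn(30, 1);   % toy data
Phi = [ones(size(x)), x, x.^2];                 % basis phi(x) = (1, x, x^2)
s2 = 0.01;
A = Phi'*Phi/s2 + eye(3);                       % posterior precision
wbar = A \ (Phi'*y/s2);                         % posterior mean
xs = 2; phis = [1, xs, xs^2]';                  % test input
ws = repmat(wbar, 1, 5000) + chol(inv(A))'*randn(3, 5000);  % posterior samples
preds = phis' * ws;                             % f(x*, w) for each sample
mean(preds), var(preds) + s2                    % predictive mean and variance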
Example: Bent Coin
Suppose we flip a bent coin with probability $\lambda$ of landing tails.

1. What is the likelihood of a set of data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$?
2. What is the maximum likelihood solution for $\lambda$?
3. Suppose we observe 2 tails in the first two flips. What is the probability that the next flip will be a tails, using maximum likelihood?
Example: Bent Coin
Likelihood of getting $m$ tails is

$p(\mathcal{D} \mid m, \lambda) = \binom{N}{m} \lambda^m (1 - \lambda)^{N - m}$    (10)

If we choose a prior $p(\lambda) \propto \lambda^\alpha (1 - \lambda)^\beta$, then the posterior will have the same functional form as the prior.
We can choose the beta distribution:

$\text{Beta}(\lambda \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \lambda^{a-1} (1 - \lambda)^{b-1}$    (12)

The Gamma functions ensure that the distribution is normalised:

$\int \text{Beta}(\lambda \mid a, b)\, d\lambda = 1$    (13)

Moments:

$\mathbb{E}[\lambda] = \frac{a}{a + b}$    (14)

$\text{var}[\lambda] = \frac{ab}{(a + b)^2 (a + b + 1)}\,.$    (15)
Beta Distribution

[Figure: Beta(λ | a, b) densities for several settings of a and b.]
Example: Bent Coin
Applying Bayes' theorem, we find:

$p(\lambda \mid \mathcal{D}) \propto p(\mathcal{D} \mid \lambda)\, p(\lambda)$    (16)

$= \text{Beta}(\lambda;\, m + a,\, N - m + b)$    (17)

We can view $a$ and $b$ as pseudo-observations!

$\mathbb{E}[\lambda \mid \mathcal{D}] = \frac{m + a}{N + a + b}$    (18)

1. What is the probability that the next flip is tails? (See the sketch below.)
2. What happens in the limits of $a$ and $b$?
3. What happens in the limit of infinite data?
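A quick numerical sketch of the conjugate update (my addition; the uniform prior a = b = 1 is assumed for illustration), answering question 1:

% Beta-Bernoulli conjugate update: prior Beta(a,b), observe m tails in N flips.
a = 1; b = 1;              % uniform prior (assumed values for illustration)
N = 2; m = 2;              % two tails in two flips
apost = a + m;             % posterior is Beta(a + m, b + N - m)
bpost = b + N - m;
p_next_tails = apost / (apost + bpost)   % E[lambda | D] = 3/4 here
% Compare with maximum likelihood: m/N = 1, which says heads is impossible.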
A Function Space View: Gaussian processes
$\overbrace{p(f(x) \mid \mathcal{D})}^{\text{Function posterior}} \propto \overbrace{p(\mathcal{D} \mid f(x))}^{\text{Likelihood}} \; \overbrace{p(f(x))}^{\text{Function prior}}$

[Figure: Sample prior functions (left) and sample posterior functions (right); outputs f(x) against inputs x.]

Radford Neal showed in 1996 that a Bayesian neural network with an infinite number of hidden units is a Gaussian process!
Gaussian processes
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

- Prior: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, meaning $(f(x_1), \ldots, f(x_N)) \sim \mathcal{N}(\boldsymbol{\mu}, K)$, with $\mu_i = m(x_i)$ and $K_{ij} = \text{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$.

$\overbrace{p(f(x) \mid \mathcal{D})}^{\text{GP posterior}} \propto \overbrace{p(\mathcal{D} \mid f(x))}^{\text{Likelihood}} \; \overbrace{p(f(x))}^{\text{GP prior}}$

[Figure: Sample prior functions (left) and sample posterior functions (right); outputs f(x) against inputs x.]
Linear Basis Models
Consider the simple linear model,

$f(x) = a_0 + a_1 x\,,$    (19)
$a_0, a_1 \sim \mathcal{N}(0, 1)\,.$    (20)

[Figure: Sample functions $f(x) = a_0 + a_1 x$ drawn from this model; output f(x) against input x.]
Any collection of function values from this model has a joint Gaussian distribution,

$[f(x_1), \ldots, f(x_N)] \sim \mathcal{N}(0, K)\,,$    (23)
$K_{ij} = \text{cov}(f(x_i), f(x_j)) = k(x_i, x_j) = 1 + x_i x_j\,.$    (24)

By definition, $f(x)$ is a Gaussian process.
Linear Basis Function Models
Model Specification

$f(x, \mathbf{w}) = \mathbf{w}^T \phi(x)$    (25)
$p(\mathbf{w}) = \mathcal{N}(0, \Sigma_w)$    (26)

Moments of the Induced Distribution over Functions

$\mathbb{E}[f(x, \mathbf{w})] = m(x) = \mathbb{E}[\mathbf{w}^T]\, \phi(x) = 0$    (27)
$\text{cov}(f(x_i), f(x_j)) = k(x_i, x_j) = \mathbb{E}[f(x_i) f(x_j)] - \mathbb{E}[f(x_i)]\,\mathbb{E}[f(x_j)]$    (28)
$= \phi(x_i)^T\, \mathbb{E}[\mathbf{w}\mathbf{w}^T]\, \phi(x_j) - 0$    (29)
$= \phi(x_i)^T \Sigma_w\, \phi(x_j)$    (30)

- $f(x, \mathbf{w})$ is a Gaussian process, $f(x) \sim \mathcal{GP}(m, k)$, with mean function $m(x) = 0$ and covariance kernel $k(x_i, x_j) = \phi(x_i)^T \Sigma_w\, \phi(x_j)$ (a quick numerical check follows).
- The entire basis function model of Eqs. (25) and (26) is encapsulated as a distribution over functions with kernel $k(x, x')$.
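A quick Monte Carlo check of Eq. (30) (my sketch, not from the slides), for the quadratic basis $\phi(x) = (1, x, x^2)^T$ with $\Sigma_w = I$:

% Empirically check cov(f(xi), f(xj)) = phi(xi)' * Sigma_w * phi(xj)
% for f(x,w) = w'*phi(x), w ~ N(0, I), phi(x) = (1, x, x^2)'.
phi = @(x) [1; x; x^2];
xi = 0.3; xj = -0.7;
W = randn(3, 1e6);                        % a million prior samples of w
fi = W' * phi(xi); fj = W' * phi(xj);     % function values at xi and xj
empirical = mean(fi .* fj)                % Monte Carlo estimate of the covariance
analytic  = phi(xi)' * phi(xj)            % Eq. (30) with Sigma_w = I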
Example: RBF Kernel
$k_{\text{RBF}}(x, x') = \text{cov}(f(x), f(x')) = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$    (31)

- Far and above the most popular kernel.
- Expresses the intuition that function values at nearby inputs are more correlated than function values at far away inputs.
- The kernel hyperparameters $a$ and $\ell$ control the amplitudes and wiggliness of these functions.
- GPs with an RBF kernel have large support and are universal approximators.
Sampling from a GP with an RBF Kernel
k_rbf = @(xa,xb) exp(-(xa - xb).^2 / (2*1^2));  % RBF kernel, defined here so
                                                % the snippet runs; a = l = 1

x = (-10:0.2:10)';      % inputs (where we query the GP)
N = numel(x);           % number of inputs
K = zeros(N, N);        % covariance matrix

% very inefficient way of creating K in Matlab
for i = 1:N
    for j = 1:N
        K(i,j) = k_rbf(x(i), x(j));
    end
end

K = K + 1e-6*eye(N);    % add jitter for conditioning of K
CK = chol(K);           % K = CK'*CK
f = CK' * randn(N, 1);  % draws from N(0,K)
plot(x, f);
Samples from a GP with an RBF Kernel

[Figure: Sample functions drawn from a GP with an RBF kernel.]
RBF Kernel Covariance Matrix
$k_{\text{RBF}}(x, x') = \text{cov}(f(x), f(x')) = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$    (32)

[Figure: The covariance matrix $K$ for ordered inputs on a 1D grid, $K_{ij} = k_{\text{RBF}}(x_i, x_j)$.]
Learning and Predictions with Gaussian Processes
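For reference (my addition; these are the standard GP predictive equations, e.g. Rasmussen and Williams (2006)): conditioning the GP prior on data $\mathbf{y}$ at inputs $X$, with noise variance $\sigma^2$, gives a Gaussian predictive distribution at test inputs $X_*$:

$\mathbf{f}_* \mid X_*, X, \mathbf{y} \sim \mathcal{N}(\bar{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*))\,,$

$\bar{\mathbf{f}}_* = K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\,\mathbf{y}\,,$

$\text{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\,K(X, X_*)\,.$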
Inference using an RBF kernel
- Specify $f(x) \sim \mathcal{GP}(0, k)$.
- Choose $k_{\text{RBF}}(x, x') = a_0^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell_0^2}\right)$. Choose values for $a_0$ and $\ell_0$.
- Observe data, look at the prior and posterior over functions (a minimal implementation sketch follows).

[Figure: (a) Samples from the GP prior. (b) Samples from the GP posterior; output f(x) against input x.]

- Does something look strange about these functions?
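In the spirit of the earlier sampling snippet, a minimal MATLAB sketch (mine, with illustrative data and hyperparameter values) of computing this posterior:

% GP regression with an RBF kernel: posterior mean and variance at test inputs.
k_rbf = @(xa,xb) exp(-(xa - xb').^2 / (2*1^2));   % a0 = l0 = 1 (illustrative)
x  = [-4; -2; 0; 1; 3];  y = sin(x);              % toy training data
xs = (-10:0.1:10)';                               % test inputs
s2 = 0.01;                                        % noise variance
Kxx = k_rbf(x, x) + s2*eye(numel(x));
Ksx = k_rbf(xs, x);
Kss = k_rbf(xs, xs);
mu = Ksx * (Kxx \ y);                             % predictive mean
V  = Kss - Ksx * (Kxx \ Ksx');                    % predictive covariance
plot(xs, mu); hold on;
plot(xs, mu + 2*sqrt(diag(V)), '--', xs, mu - 2*sqrt(diag(V)), '--');
plot(x, y, 'o');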
Inference using an RBF kernel
Increase the length-scale $\ell$.

[Figure: (a) Samples from the GP prior with the larger length-scale. (b) Samples from the GP posterior.]
Learning and Model Selection
$p(\mathcal{M}_i \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{p(\mathbf{y})}$    (33)

We can write the evidence of the model as

$p(\mathbf{y} \mid \mathcal{M}_i) = \int p(\mathbf{y} \mid \mathbf{f}, \mathcal{M}_i)\, p(\mathbf{f})\, d\mathbf{f}\,,$    (34)

[Figure: (a) The evidence $p(\mathbf{y} \mid \mathcal{M})$ of simple, complex, and appropriate models across all possible datasets. (b) Simple, complex, and appropriate models fit to data; output f(x) against input x.]
Learning and Model Selection
- We can integrate away the entire Gaussian process $f(x)$ to obtain the marginal likelihood, as a function of kernel hyperparameters $\boldsymbol{\theta}$ alone:

$p(\mathbf{y} \mid \boldsymbol{\theta}, X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid \boldsymbol{\theta}, X)\, d\mathbf{f}\,.$    (35)

$\log p(\mathbf{y} \mid \boldsymbol{\theta}, X) = \overbrace{-\frac{1}{2} \mathbf{y}^T (K_{\boldsymbol{\theta}} + \sigma^2 I)^{-1} \mathbf{y}}^{\text{model fit}} \; \overbrace{-\, \frac{1}{2} \log |K_{\boldsymbol{\theta}} + \sigma^2 I|}^{\text{complexity penalty}} - \frac{N}{2} \log(2\pi)\,.$    (36)

- An extremely powerful mechanism for kernel learning (see the sketch below).

[Figure: (c) Samples from the GP prior. (d) Samples from the GP posterior, after hyperparameter learning.]
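A minimal MATLAB sketch (my addition) of evaluating Eq. (36) for an RBF kernel with hyperparameters $\theta = (a, \ell)$; in practice one would maximize this quantity with a gradient-based optimizer:

% Log marginal likelihood of a GP (Eq. 36) with an RBF kernel.
% Usage (illustrative): gp_lml([1; 2], x, y, 0.01)
function lml = gp_lml(theta, x, y, s2)
    a = theta(1); l = theta(2);
    K = a^2 * exp(-(x - x').^2 / (2*l^2));  % kernel matrix K_theta
    C = K + s2*eye(numel(x));
    L = chol(C, 'lower');                   % C = L*L'
    alpha = L' \ (L \ y);                   % C^{-1} y via triangular solves
    lml = -0.5*y'*alpha ...                 % model fit
          - sum(log(diag(L))) ...           % -(1/2) log|C|
          - 0.5*numel(x)*log(2*pi);
end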
Aside: How Do We Build Models that Generalize?
[Figure: The evidence $p(\mathbf{y} \mid \mathcal{M})$ of simple, complex, and appropriate models across all possible datasets.]

- Support: which datasets (hypotheses) are a priori possible.
- Inductive Biases: which datasets are a priori likely.

We want to make the support of our model as big as possible, with inductive biases calibrated to particular applications, so as to not rule out potential explanations of the data, while at the same time quickly learning from a finite amount of information on a particular application.
Gaussian Processes and Neural Networks
"How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?" (MacKay, 1998)
Deep Kernel Learning
We combine the inductive biases of deep learning architectures with the non-parametric flexibility of Gaussian processes as part of a scalable deep kernel learning framework.

[Figure: A network architecture mapping inputs $x_1, \ldots, x_D$ through hidden layers $h^{(1)}, \ldots, h^{(L)}$ with weights $W^{(1)}, \ldots, W^{(L)}$, into an infinite "$\infty$ layer" $h_\infty(\theta)$, and out to the outputs $y_1, \ldots, y_P$.]

Deep Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*, Ruslan Salakhutdinov, and Eric P. Xing
Artificial Intelligence and Statistics (AISTATS), 2016.
Deep Kernel Learning
Starting from a base kernel $k(x_i, x_j \mid \boldsymbol{\theta})$ with hyperparameters $\boldsymbol{\theta}$, we transform the inputs (predictors) $x$ as

$k(x_i, x_j \mid \boldsymbol{\theta}) \to k(g(x_i, \mathbf{w}), g(x_j, \mathbf{w}) \mid \boldsymbol{\theta}, \mathbf{w})\,,$    (37)

where $g(x, \mathbf{w})$ is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights $\mathbf{w}$.

We use spectral mixture base kernels (Wilson, 2014):

$k_{\text{SM}}(x, x' \mid \boldsymbol{\theta}) = \sum_{q=1}^{Q} a_q \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2} \|\Sigma_q^{1/2} (x - x')\|^2\right) \cos\langle x - x',\, 2\pi \boldsymbol{\mu}_q \rangle\,.$    (38)

We backprop to learn all parameters jointly through the GP marginal likelihood (a toy sketch of the warped kernel follows).
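A toy sketch (mine) of the input-warping idea in Eq. (37), using a random two-layer network for $g$ and an RBF base kernel in place of the spectral mixture for brevity; all weights below are assumed, illustrative values:

% Deep kernel sketch (1D inputs): warp inputs through a toy network g,
% then apply the base kernel to the warped inputs, as in Eq. (37).
W1 = randn(10, 1); b1 = randn(10, 1);      % toy network weights (illustrative)
W2 = randn(1, 10); b2 = randn;
g = @(x) (W2 * tanh(W1*x' + b1) + b2)';    % g: column of inputs -> column of warped inputs
k_base = @(za,zb) exp(-(za - zb').^2 / 2); % RBF base kernel on warped inputs
k_deep = @(xa,xb) k_base(g(xa), g(xb));    % k(g(xi,w), g(xj,w))
x = (-3:0.5:3)';
K = k_deep(x, x);                          % deep kernel covariance matrix
imagesc(K); colorbar;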
Scalable Gaussian Processes
- Run Gaussian processes on millions of points in seconds, instead of thousands of points in hours.
- Outperforms stand-alone deep neural networks by learning deep kernels.
- Approach is based on kernel approximations which admit fast matrix-vector multiplies (Wilson and Nickisch, 2015).
- Harmonizes with GPU acceleration.
- O(n) training and O(1) testing (instead of O(n^3) training and O(n^2) testing).
- Implemented in our new library GPyTorch: https://github.com/cornellius-gp/gpytorch
Deep Kernel Learning Results
Similar runtime but improved performance over stand-alone DNNs and scalable kernel learning on a wide range of benchmarks.
Face Orientation Extraction
Figure: Left: Randomly sampled examples of the training and test data, with orientation labels. Right: The two-dimensional outputs of the convolutional network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.
Learning Flexible Non-Euclidean Similarity Metrics
Figure: Left: The induced covariance matrix using the DKL-SM kernel on a set of test cases, where the test samples are ordered according to the orientations of the input faces. Middle: The respective covariance matrix using the DKL-RBF kernel. Right: The respective covariance matrix using the regular RBF kernel. The models are trained with n = 12,000. We set Q = 4 for the SM base kernel.
Step Function
Figure: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with an SM base kernel. We set Q = 4 for the SM kernels.
LSTM Kernels
- We derive kernels which have recurrent LSTM inductive biases, and apply them to autonomous vehicles, where predictive uncertainty is critical.

[Figure: Predicted vehicle trajectories (north against east, in miles), colored by speed.]

Learning Scalable Deep Kernels with Recurrent Structure
M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, E. P. Xing
Journal of Machine Learning Research (JMLR), 2017
GP-LSTM Predictive Distributions
[Figure: GP-LSTM predictive distributions over front distance (m) against side distance (m) for several test scenarios.]
The Bayesian GAN
[Figure: A multimodal posterior $p(\theta_g \mid \mathcal{D})$ over generator parameters $\theta_g$, contrasted with a single maximum likelihood solution $(\theta_g)_{\text{ML}}$.]

Wilson and Saatchi (NIPS 2017)
Wide Optima Generalize Better
[Figure: Flat versus sharp minima of the training loss; from Keskar et al. (2017).]

- Bayesian integration will give very different predictions in deep learning especially!
Loss Surfaces in Deep Learning
(1) Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
(2) Averaging Weights Leads to Wider Optima and Better Generalization
(3) Consistency-Based Semi-Supervised Learning with Weight Averaging

- Local optima are connected along simple curves.
- SGD does not converge to broad optima in all directions.
- Averaging weights along SGD trajectories with constant or cyclical learning rates leads to much faster convergence and solutions that generalize better.
- Can approximate ensembles and Bayesian model averages with a single model.
Mode Connectivity
[Figure: Cross-sections of the training loss surface, showing low-loss curves connecting independently trained modes.]
Example Parametrizations
Polygonal Chain:

$\phi_\theta(t) = \begin{cases} 2\,\big(t\,\theta + (0.5 - t)\,w_1\big), & 0 \le t \le 0.5 \\ 2\,\big((t - 0.5)\,w_2 + (1 - t)\,\theta\big), & 0.5 \le t \le 1. \end{cases}$    (39)

Bezier Curve:

$\phi_\theta(t) = (1 - t)^2 w_1 + 2t(1 - t)\,\theta + t^2 w_2, \quad 0 \le t \le 1\,.$    (40)
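A small sketch (mine) of the two parametrizations, for endpoint weight vectors w1, w2 and trainable midpoint theta; the scalar "weights" are purely illustrative:

% Evaluate the polygonal chain (Eq. 39) and Bezier curve (Eq. 40)
% connecting weight vectors w1 and w2 through a trainable point theta.
chain  = @(t, w1, w2, th) (t <= 0.5) .* (2*(t.*th + (0.5 - t).*w1)) + ...
                          (t >  0.5) .* (2*((t - 0.5).*w2 + (1 - t).*th));
bezier = @(t, w1, w2, th) (1 - t).^2 .* w1 + 2*t.*(1 - t).*th + t.^2 .* w2;

% Scalar weights for illustration; in practice w1, w2, theta are the full
% flattened parameter vectors of trained networks.
w1 = 0; w2 = 1; th = 2;
t = linspace(0, 1, 101);
plot(t, chain(t, w1, w2, th), t, bezier(t, w1, w2, th));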
[Figure: Loss-surface cross-sections for the polygonal chain and Bezier curve parametrizations.]
Connection Procedure with Tractable Loss
We propose the computationally tractable loss:

$\ell(\theta) = \int_0^1 \mathcal{L}(\phi_\theta(t))\, dt = \mathbb{E}_{t \sim U(0,1)}\, \mathcal{L}(\phi_\theta(t))$    (41)

At each iteration we sample $t \sim U(0, 1)$ and make a gradient step for $\theta$ with respect to $\mathcal{L}(\phi_\theta(t))$; this stochastic gradient is unbiased:

$\mathbb{E}_{t \sim U(0,1)} \nabla_\theta \mathcal{L}(\phi_\theta(t)) = \nabla_\theta\, \mathbb{E}_{t \sim U(0,1)} \mathcal{L}(\phi_\theta(t)) = \nabla_\theta \ell(\theta)\,.$
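A toy sketch (mine) of this procedure on a synthetic 2D loss with two basins, training the Bezier midpoint theta with the stochastic gradient above; the loss, step size, and iteration count are all assumed for illustration:

% Toy curve-finding: train the Bezier midpoint so the whole curve
% phi_theta(t) between w1 and w2 stays in low-loss regions.
L     = @(w) min(sum((w-1).^2), sum((w+1).^2)) + 0.1*sum(w.^2);
gradL = @(w) 2*(w - 1)*(sum((w-1).^2) <  sum((w+1).^2)) + ...
             2*(w + 1)*(sum((w-1).^2) >= sum((w+1).^2)) + 0.2*w;
w1 = [-1; -1]; w2 = [1; 1];                   % two "trained" solutions
theta = 0.5*(w1 + w2);                        % initialize at the midpoint
lr = 0.05;
for iter = 1:2000
    t = rand;                                 % sample t ~ U(0,1)
    phi = (1-t)^2*w1 + 2*t*(1-t)*theta + t^2*w2;    % Bezier point (Eq. 40)
    theta = theta - lr * 2*t*(1-t) * gradL(phi);    % chain rule: d phi/d theta = 2t(1-t)
end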
Curve Ensembling

[Figure: Test performance of ensembles built from points sampled along the connecting curve.]
Fast Geometric Ensembling

[Figure: Fast Geometric Ensembling (FGE): a cyclical learning rate collects diverse networks along the training trajectory for ensembling.]
Trajectory of SGD
[Figure: Test error (%) in a plane of weight space containing SGD iterates W1, W2, W3 and their average W_SWA.]
[Figure: The same test error surface, with additional panels at epoch 125 showing test error (%) and train loss in a plane containing W_SGD and W_SWA.]
[Figure: Train loss and test error (%) surfaces in planes of weight space spanned by SGD trajectories.]
Following Random Paths
[Figure: Test error (%) along random paths in weight space.]
Path from w_SWA to w_SGD

[Figure: Test error (%) and train loss along the path from w_SWA to w_SGD, as a function of distance in weight space.]
Approximating an FGE Ensemble
Because the points sampled from an FGE ensemble take small steps in weight space by design, we can do a linearization analysis to show that

$f(w_{\text{SWA}}) \approx \frac{1}{n} \sum_{i=1}^{n} f(w_i)$
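A toy numerical check of this linearization (mine; the "model" and step size are assumed): for a smooth function of the weights and nearby weight settings, the prediction at the averaged weights approximately equals the average of the predictions, by a first-order Taylor expansion around the mean.

% For smooth f and nearby weights w_i, f(mean(w_i)) is close to mean(f(w_i)).
f = @(w) tanh(3*w(1)) + 0.5*sin(w(2));   % toy "network output" as a function of weights
w0 = [0.2; -0.4];
W = w0 + 0.01*randn(2, 20);              % 20 nearby weight settings (small steps)
wswa = mean(W, 2);                       % averaged weights
avg_pred = mean(arrayfun(@(i) f(W(:,i)), 1:20));
swa_pred = f(wswa);                      % approximately equal to avg_pred
[avg_pred, swa_pred]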
SWA Results, CIFAR

[Table: Test error (%) on CIFAR for SWA against baselines.]
SWA Results, ImageNet (Top-1 Error Rate)

[Table: Top-1 error rate (%) on ImageNet for SWA against baselines.]
Sampling from a High Dimensional Gaussian
SGD (with constant LR) proposals are on the surface of a hypersphere. Averaginglets us go inside the sphere to a point of higher density.
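A one-minute numerical illustration of this geometry (my addition): draws from a high-dimensional Gaussian concentrate on a thin shell of radius about sqrt(d), while their average lies near the centre, where the density is highest.

% Draws from N(0, I_d) concentrate at radius ~sqrt(d); their average
% falls much closer to the origin (the mode of the density).
d = 1000; n = 20;
W = randn(d, n);                  % n high-dimensional Gaussian samples
norms = sqrt(sum(W.^2, 1))        % each is close to sqrt(d) ~ 31.6
avg_norm = norm(mean(W, 2))       % much smaller: ~sqrt(d/n) ~ 7.1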
High Constant LR
[Figure: Test error (%) over 300 training epochs for conventional SGD, constant-LR SGD, and constant-LR SWA.]

Side observation: Averaging bad models does not give good solutions. Averaging bad weights can give great solutions.
Conclusions
- Bayesian deep learning can provide better predictions and uncertainty estimates.
- Computation is a key challenge. We address this issue through developing numerical linear algebra approaches.
- Developing fast deterministic (e.g., variational) approximations and stochastic MCMC algorithms will be important too.
- We can better understand model construction, and generalization, by taking a function space approach.
- We can be inspired by Bayesian methods to develop optimization procedures that generalize better.
- Code for the approaches discussed here is available at: https://people.orie.cornell.edu/andrew/code
Appendix
Deriving the RBF Kernel
$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \quad w_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{J}\right), \quad \phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{2\ell^2}\right)$    (42)

$\therefore \; k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x')$    (43)

- Let $c_J = \log J$, $c_1 = -\log J$, and $c_{i+1} - c_i = \Delta c = \frac{2 \log J}{J}$. As $J \to \infty$, the kernel in Eq. (43) becomes a Riemann sum:

$k(x, x') = \lim_{J \to \infty} \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') = \int_{c_0}^{c_\infty} \phi_c(x)\, \phi_c(x')\, dc$    (44)

- By setting $c_0 = -\infty$ and $c_\infty = \infty$, we spread the infinitely many basis functions across the whole real line, each a distance $\Delta c \to 0$ apart:

$k(x, x') = \int_{-\infty}^{\infty} \exp\left(-\frac{(x - c)^2}{2\ell^2}\right) \exp\left(-\frac{(x' - c)^2}{2\ell^2}\right) dc$    (45)

$= \sqrt{\pi}\, \ell\, \sigma^2 \exp\left(-\frac{(x - x')^2}{2(\sqrt{2}\ell)^2}\right).$    (46)
Deriving the RBF Kernel
- It is remarkable that we can work with infinitely many basis functions with finite amounts of computation using the kernel trick: replacing inner products of basis functions with kernels.
- The RBF kernel, also known as the Gaussian or squared exponential kernel, is by far the most popular kernel: $k_{\text{RBF}}(x, x') = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$.
- Functions drawn from a GP with an RBF kernel are infinitely differentiable. For this reason, the RBF kernel is accused of being overly smooth and unrealistic. Nonetheless it has nice theoretical properties...
References

- A.G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML), 2015.
- P. Izmailov*, D. Podoprikhin*, T. Garipov*, D. Vetrov, A.G. Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. Uncertainty in Artificial Intelligence (UAI), 2018.
- T. Garipov*, P. Izmailov*, D. Podoprikhin*, D. Vetrov, A.G. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs, 2018.
- R. Neal. Bayesian Learning for Neural Networks. PhD Thesis, 1996.
- D. MacKay. Introduction to Gaussian Processes. Neural Networks and Machine Learning, 1998.
- D. MacKay. Information Theory, Inference, and Learning Algorithms, 2003.
- D. MacKay. Bayesian Interpolation, 1992.
- M. Welling and Y.W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. International Conference on Machine Learning (ICML), 2011.
- T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. International Conference on Machine Learning (ICML), 2014.
- J. Gardner, G. Pleiss, K. Weinberger, A.G. Wilson. GPyTorch: Scalable and Modular Gaussian Processes in PyTorch: https://github.com/cornellius-gp/gpytorch.
- Codebase for group projects: https://people.orie.cornell.edu/andrew/code.
- B. Athiwaratkun, M. Finzi, P. Izmailov, A.G. Wilson. Improving Consistency-Based Semi-Supervised Learning with Weight Averaging.
- A.G. Wilson*, Z. Hu*, R. Salakhutdinov, and E.P. Xing. Deep kernel learning. Artificial Intelligence and Statistics (AISTATS), 2016.
- A.G. Wilson*, Z. Hu*, R. Salakhutdinov, and E.P. Xing. Stochastic Variational Deep Kernel Learning. Neural Information Processing Systems (NIPS), 2016.
- M. Al-Shedivat, A.G. Wilson, Y. Saatchi, Z. Hu, and E.P. Xing. Learning Scalable Deep Kernels with Recurrent Structure. Journal of Machine Learning Research (JMLR), 2017.
- C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Cambridge Press, 2006.
- N.S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2017.