Bayesian Deep Learning
Andrew Gordon Wilson
Assistant Professor, Cornell University
https://people.orie.cornell.edu/andrew
Toronto Deep Learning Summer School, July 31, 2018
Model Selection
[Figure: Airline passengers (thousands) against year, 1949–1961.]
Which model should we choose?
(1): $f_1(x) = a_0 + a_1 x$    (2): $f_2(x) = \sum_{j=0}^{3} a_j x^j$    (3): $f_3(x) = \sum_{j=0}^{10^4} a_j x^j$
Bayesian or Frequentist?
Bayesian Deep Learning
Why?

- A powerful framework for model construction and understanding generalization
- Uncertainty representation (crucial for decision making)
- Better point estimates
- It was the most successful approach at the end of the second wave of neural networks (Neal, 1996).
- Neural nets are much less mysterious when viewed through the lens of probability theory.

Why not?

- Can be computationally intractable (but doesn't have to be).
- Can involve a lot of moving parts (but doesn't have to).

There has been exciting progress in the last two years addressing these limitations as part of an extremely fruitful research direction.
How do we build models that learn and generalize?
[Figure: Airline passengers (thousands) against year, 1949–1961.]
Basic Regression Problem

- Training set of $N$ targets (observations) $\mathbf{y} = (y(x_1), \ldots, y(x_N))^T$.
- Observations evaluated at inputs $X = (x_1, \ldots, x_N)^T$.
- Want to predict the value of $y(x_*)$ at a test input $x_*$.

For example: Given CO2 concentrations $\mathbf{y}$ measured at times $X$, what will the CO2 concentration be for $x_* = 2024$, 10 years from now?

Just knowing high school math, what might you try?
How do we build models that learn and generalize?
Guess the parametric form of a function that could fit the data

- $f(x, \mathbf{w}) = \mathbf{w}^T x$  [Linear function of $\mathbf{w}$ and $x$]
- $f(x, \mathbf{w}) = \mathbf{w}^T \phi(x)$  [Linear function of $\mathbf{w}$]  (Linear Basis Function Model)
- $f(x, \mathbf{w}) = g(\mathbf{w}^T \phi(x))$  [Non-linear in $x$ and $\mathbf{w}$]  (E.g., Neural Network)

$\phi(x)$ is a vector of basis functions. For example, if $\phi(x) = (1, x, x^2)$ and $x \in \mathbb{R}^1$, then $f(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2$ is a quadratic function.

Choose an error measure $E(\mathbf{w})$, minimize with respect to $\mathbf{w}$:

- $E(\mathbf{w}) = \sum_{i=1}^{N} [f(x_i, \mathbf{w}) - y(x_i)]^2$  (see the sketch below)
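As a concrete illustration, here is a minimal MATLAB sketch (my addition, not from the slides; the toy data are assumed) of minimizing this squared error for the quadratic basis above, using the least-squares backslash operator:

% Fit f(x,w) = w0 + w1*x + w2*x^2 by minimizing
% E(w) = sum_i [f(x_i,w) - y(x_i)]^2 (ordinary least squares).
x = linspace(-1, 1, 50)';                      % inputs
y = 2 + 0.5*x - 1.5*x.^2 + 0.1*randn(50, 1);   % noisy toy observations
Phi = [ones(size(x)), x, x.^2];                % basis phi(x) = (1, x, x^2)
w = Phi \ y;                                   % least-squares solution for w
plot(x, y, 'o', x, Phi*w, '-');                % data and fitted function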
How do we build models that learn and generalize?
A probabilistic approach

We could explicitly account for noise in our model:

- $y(x) = f(x, \mathbf{w}) + \epsilon(x)$, where $\epsilon(x)$ is a noise function.

One commonly takes $\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$ for i.i.d. additive Gaussian noise, in which case

$p(y(x) \mid x, \mathbf{w}, \sigma^2) = \mathcal{N}(y(x); f(x, \mathbf{w}), \sigma^2)$    (Observation Model)    (1)

$p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(y(x_i); f(x_i, \mathbf{w}), \sigma^2)$    (Likelihood)    (2)

- Maximize the likelihood of the data $p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)$ with respect to $\sigma^2$ and $\mathbf{w}$.

For a Gaussian noise model, this approach will make the same predictions as using a squared loss error function:

$\log p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2) \propto -\frac{1}{2\sigma^2} \sum_{i=1}^{N} [f(x_i, \mathbf{w}) - y(x_i)]^2$    (3)
How do we build models that learn and generalize?
- The probabilistic approach helps us interpret the error measure in a deterministic approach, and gives us a sense of the noise level $\sigma^2$.
- Both approaches are prone to over-fitting for flexible $f(x, \mathbf{w})$: low error on the training data, high error on the test set.

Regularization

- Use a penalized log likelihood (or error function), such as

$E(\mathbf{w}) = \overbrace{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} [f(x_i, \mathbf{w}) - y(x_i)]^2}^{\text{model fit}} \; \overbrace{-\, \lambda \mathbf{w}^T \mathbf{w}}^{\text{complexity penalty}} \,.$    (4)

- But how should we define and penalize complexity?
- Can set $\lambda$ using cross-validation.
- Same as maximizing a posterior $\log p(\mathbf{w} \mid \mathbf{y}, X) = \log p(\mathbf{y} \mid \mathbf{w}, X) + \log p(\mathbf{w}) + \text{const}$ with a Gaussian prior $p(\mathbf{w})$. But this is not Bayesian!
Bayesian Inference
Bayes' Rule

$p(a \mid b) = p(b \mid a)\, p(a) / p(b)\,, \qquad p(a \mid b) \propto p(b \mid a)\, p(a)\,.$    (5)

$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}\,, \qquad p(\mathbf{w} \mid \mathbf{y}, X, \sigma^2) = \frac{p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)\, p(\mathbf{w})}{p(\mathbf{y} \mid X, \sigma^2)}\,.$    (6)

Sum Rule

$p(x) = \sum_{y} p(x, y)$    (7)

Product Rule

$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$    (8)
Predictive Distribution
Sum rule: $p(x) = \sum_y p(x, y)$. Product rule: $p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$.

$p(y \mid x_*, \mathbf{y}, X) = \int p(y \mid x_*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{y}, X)\, d\mathbf{w}\,.$    (9)

- Think of each setting of $\mathbf{w}$ as a different model. Eq. (9) is a Bayesian model average, an average of infinitely many models weighted by their posterior probabilities (see the sketch below).
- No over-fitting, automatically calibrated complexity.
- Eq. (9) is not analytic for many likelihoods $p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)$ and priors $p(\mathbf{w})$. (But recent advances such as SG-HMC have made these computations much more tractable in deep learning.)
- Typically we are more interested in the induced distribution over functions than in the parameters $\mathbf{w}$. It can be hard to have intuitions for priors on $p(\mathbf{w})$.
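As a minimal illustration of Eq. (9) (my sketch, not from the slides): for Bayesian linear regression with a Gaussian prior, the posterior over $\mathbf{w}$ is Gaussian, so we can draw posterior samples and average their predictions:

% Monte Carlo Bayesian model average for Bayesian linear regression,
% y = w'*phi(x) + noise, with prior w ~ N(0, I) and noise variance s2.
x = linspace(-1, 1, 30)'; y = sin(2*x) + 0.1*randn(30, 1);   % toy data
Phi = [ones(size(x)), x, x.^2];                 % basis phi(x) = (1, x, x^2)
s2 = 0.01;
A = Phi'*Phi/s2 + eye(3);                       % posterior precision
wbar = A \ (Phi'*y/s2);                         % posterior mean
xs = 2; phis = [1, xs, xs^2]';                  % test input
ws = repmat(wbar, 1, 5000) + chol(inv(A))'*randn(3, 5000);  % posterior samples
preds = phis' * ws;                             % f(x*, w) for each sample
mean(preds), var(preds) + s2                    % predictive mean and variance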
Example: Bent Coin
Suppose we flip a bent coin with probability $\lambda$ of landing tails.

1. What is the likelihood of a set of data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$?
2. What is the maximum likelihood solution for $\lambda$?
3. Suppose we observe 2 tails in the first two flips. What is the probability that the next flip will be a tails, using maximum likelihood?
Example: Bent Coin
Likelihood of getting $m$ tails is

$p(\mathcal{D} \mid m, \lambda) = \binom{N}{m} \lambda^m (1 - \lambda)^{N - m}$    (10)

If we choose a prior $p(\lambda) \propto \lambda^\alpha (1 - \lambda)^\beta$, then the posterior will have the same functional form as the prior.
We can choose the beta distribution:

$\text{Beta}(\lambda \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \lambda^{a-1} (1 - \lambda)^{b-1}$    (12)

The Gamma functions ensure that the distribution is normalised:

$\int \text{Beta}(\lambda \mid a, b)\, d\lambda = 1$    (13)

Moments:

$\mathbb{E}[\lambda] = \frac{a}{a + b}$    (14)

$\text{var}[\lambda] = \frac{ab}{(a + b)^2 (a + b + 1)}\,.$    (15)
Beta Distribution

[Figure: Beta(λ | a, b) densities for several settings of a and b.]
Example: Bent Coin
Applying Bayes' theorem, we find:

$p(\lambda \mid \mathcal{D}) \propto p(\mathcal{D} \mid \lambda)\, p(\lambda)$    (16)

$= \text{Beta}(\lambda;\, m + a,\, N - m + b)$    (17)

We can view $a$ and $b$ as pseudo-observations!

$\mathbb{E}[\lambda \mid \mathcal{D}] = \frac{m + a}{N + a + b}$    (18)

1. What is the probability that the next flip is tails? (See the sketch below.)
2. What happens in the limits of $a$ and $b$?
3. What happens in the limit of infinite data?
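A quick numerical sketch of the conjugate update (my addition; the uniform prior a = b = 1 is assumed for illustration), answering question 1:

% Beta-Bernoulli conjugate update: prior Beta(a,b), observe m tails in N flips.
a = 1; b = 1;              % uniform prior (assumed values for illustration)
N = 2; m = 2;              % two tails in two flips
apost = a + m;             % posterior is Beta(a + m, b + N - m)
bpost = b + N - m;
p_next_tails = apost / (apost + bpost)   % E[lambda | D] = 3/4 here
% Compare with maximum likelihood: m/N = 1, which says heads is impossible.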
A Function Space View: Gaussian processes
$\overbrace{p(f(x) \mid \mathcal{D})}^{\text{Function posterior}} \propto \overbrace{p(\mathcal{D} \mid f(x))}^{\text{Likelihood}} \; \overbrace{p(f(x))}^{\text{Function prior}}$

[Figure: Sample prior functions (left) and sample posterior functions (right); outputs f(x) against inputs x.]

Radford Neal showed in 1996 that a Bayesian neural network with an infinite number of hidden units is a Gaussian process!
Gaussian processes
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

- Prior: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, meaning $(f(x_1), \ldots, f(x_N)) \sim \mathcal{N}(\boldsymbol{\mu}, K)$, with $\mu_i = m(x_i)$ and $K_{ij} = \text{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$.

$\overbrace{p(f(x) \mid \mathcal{D})}^{\text{GP posterior}} \propto \overbrace{p(\mathcal{D} \mid f(x))}^{\text{Likelihood}} \; \overbrace{p(f(x))}^{\text{GP prior}}$

[Figure: Sample prior functions (left) and sample posterior functions (right); outputs f(x) against inputs x.]
Linear Basis Models
Consider the simple linear model,

$f(x) = a_0 + a_1 x\,,$    (19)
$a_0, a_1 \sim \mathcal{N}(0, 1)\,.$    (20)

[Figure: Sample functions $f(x) = a_0 + a_1 x$ drawn from this model; output f(x) against input x.]
Any collection of function values from this model has a joint Gaussian distribution,

$[f(x_1), \ldots, f(x_N)] \sim \mathcal{N}(0, K)\,,$    (23)
$K_{ij} = \text{cov}(f(x_i), f(x_j)) = k(x_i, x_j) = 1 + x_i x_j\,.$    (24)

By definition, $f(x)$ is a Gaussian process.
Linear Basis Function Models
Model Specification

$f(x, \mathbf{w}) = \mathbf{w}^T \phi(x)$    (25)
$p(\mathbf{w}) = \mathcal{N}(0, \Sigma_w)$    (26)

Moments of the Induced Distribution over Functions

$\mathbb{E}[f(x, \mathbf{w})] = m(x) = \mathbb{E}[\mathbf{w}^T]\, \phi(x) = 0$    (27)
$\text{cov}(f(x_i), f(x_j)) = k(x_i, x_j) = \mathbb{E}[f(x_i) f(x_j)] - \mathbb{E}[f(x_i)]\,\mathbb{E}[f(x_j)]$    (28)
$= \phi(x_i)^T\, \mathbb{E}[\mathbf{w}\mathbf{w}^T]\, \phi(x_j) - 0$    (29)
$= \phi(x_i)^T \Sigma_w\, \phi(x_j)$    (30)

- $f(x, \mathbf{w})$ is a Gaussian process, $f(x) \sim \mathcal{GP}(m, k)$, with mean function $m(x) = 0$ and covariance kernel $k(x_i, x_j) = \phi(x_i)^T \Sigma_w\, \phi(x_j)$ (a quick numerical check follows).
- The entire basis function model of Eqs. (25) and (26) is encapsulated as a distribution over functions with kernel $k(x, x')$.
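A quick Monte Carlo check of Eq. (30) (my sketch, not from the slides), for the quadratic basis $\phi(x) = (1, x, x^2)^T$ with $\Sigma_w = I$:

% Empirically check cov(f(xi), f(xj)) = phi(xi)' * Sigma_w * phi(xj)
% for f(x,w) = w'*phi(x), w ~ N(0, I), phi(x) = (1, x, x^2)'.
phi = @(x) [1; x; x^2];
xi = 0.3; xj = -0.7;
W = randn(3, 1e6);                        % a million prior samples of w
fi = W' * phi(xi); fj = W' * phi(xj);     % function values at xi and xj
empirical = mean(fi .* fj)                % Monte Carlo estimate of the covariance
analytic  = phi(xi)' * phi(xj)            % Eq. (30) with Sigma_w = I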
Example: RBF Kernel
$k_{\text{RBF}}(x, x') = \text{cov}(f(x), f(x')) = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$    (31)

- Far and above the most popular kernel.
- Expresses the intuition that function values at nearby inputs are more correlated than function values at far away inputs.
- The kernel hyperparameters $a$ and $\ell$ control the amplitudes and wiggliness of these functions.
- GPs with an RBF kernel have large support and are universal approximators.
Sampling from a GP with an RBF Kernel
k_rbf = @(xa,xb) exp(-(xa - xb).^2 / (2*1^2));  % RBF kernel, defined here so
                                                % the snippet runs; a = l = 1

x = (-10:0.2:10)';      % inputs (where we query the GP)
N = numel(x);           % number of inputs
K = zeros(N, N);        % covariance matrix

% very inefficient way of creating K in Matlab
for i = 1:N
    for j = 1:N
        K(i,j) = k_rbf(x(i), x(j));
    end
end

K = K + 1e-6*eye(N);    % add jitter for conditioning of K
CK = chol(K);           % K = CK'*CK
f = CK' * randn(N, 1);  % draws from N(0,K)
plot(x, f);
Samples from a GP with an RBF Kernel

[Figure: Sample functions drawn from a GP with an RBF kernel.]
RBF Kernel Covariance Matrix
$k_{\text{RBF}}(x, x') = \text{cov}(f(x), f(x')) = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$    (32)

[Figure: The covariance matrix $K$ for ordered inputs on a 1D grid, $K_{ij} = k_{\text{RBF}}(x_i, x_j)$.]
Learning and Predictions with Gaussian Processes
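For reference (my addition; these are the standard GP predictive equations, e.g. Rasmussen and Williams (2006)): conditioning the GP prior on data $\mathbf{y}$ at inputs $X$, with noise variance $\sigma^2$, gives a Gaussian predictive distribution at test inputs $X_*$:

$\mathbf{f}_* \mid X_*, X, \mathbf{y} \sim \mathcal{N}(\bar{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*))\,,$

$\bar{\mathbf{f}}_* = K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\,\mathbf{y}\,,$

$\text{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\,K(X, X_*)\,.$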
Inference using an RBF kernel
- Specify $f(x) \sim \mathcal{GP}(0, k)$.
- Choose $k_{\text{RBF}}(x, x') = a_0^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell_0^2}\right)$. Choose values for $a_0$ and $\ell_0$.
- Observe data, look at the prior and posterior over functions (a minimal implementation sketch follows).

[Figure: (a) Samples from the GP prior. (b) Samples from the GP posterior; output f(x) against input x.]

- Does something look strange about these functions?
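In the spirit of the earlier sampling snippet, a minimal MATLAB sketch (mine, with illustrative data and hyperparameter values) of computing this posterior:

% GP regression with an RBF kernel: posterior mean and variance at test inputs.
k_rbf = @(xa,xb) exp(-(xa - xb').^2 / (2*1^2));   % a0 = l0 = 1 (illustrative)
x  = [-4; -2; 0; 1; 3];  y = sin(x);              % toy training data
xs = (-10:0.1:10)';                               % test inputs
s2 = 0.01;                                        % noise variance
Kxx = k_rbf(x, x) + s2*eye(numel(x));
Ksx = k_rbf(xs, x);
Kss = k_rbf(xs, xs);
mu = Ksx * (Kxx \ y);                             % predictive mean
V  = Kss - Ksx * (Kxx \ Ksx');                    % predictive covariance
plot(xs, mu); hold on;
plot(xs, mu + 2*sqrt(diag(V)), '--', xs, mu - 2*sqrt(diag(V)), '--');
plot(x, y, 'o');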
Inference using an RBF kernel
Increase the length-scale $\ell$.

[Figure: (a) Samples from the GP prior with the larger length-scale. (b) Samples from the GP posterior.]
Learning and Model Selection
$p(\mathcal{M}_i \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{p(\mathbf{y})}$    (33)

We can write the evidence of the model as

$p(\mathbf{y} \mid \mathcal{M}_i) = \int p(\mathbf{y} \mid \mathbf{f}, \mathcal{M}_i)\, p(\mathbf{f})\, d\mathbf{f}\,,$    (34)

[Figure: (a) The evidence $p(\mathbf{y} \mid \mathcal{M})$ of simple, complex, and appropriate models across all possible datasets. (b) Simple, complex, and appropriate models fit to data; output f(x) against input x.]
Learning and Model Selection
- We can integrate away the entire Gaussian process $f(x)$ to obtain the marginal likelihood, as a function of kernel hyperparameters $\boldsymbol{\theta}$ alone:

$p(\mathbf{y} \mid \boldsymbol{\theta}, X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid \boldsymbol{\theta}, X)\, d\mathbf{f}\,.$    (35)

$\log p(\mathbf{y} \mid \boldsymbol{\theta}, X) = \overbrace{-\frac{1}{2} \mathbf{y}^T (K_{\boldsymbol{\theta}} + \sigma^2 I)^{-1} \mathbf{y}}^{\text{model fit}} \; \overbrace{-\, \frac{1}{2} \log |K_{\boldsymbol{\theta}} + \sigma^2 I|}^{\text{complexity penalty}} - \frac{N}{2} \log(2\pi)\,.$    (36)

- An extremely powerful mechanism for kernel learning (see the sketch below).

[Figure: (c) Samples from the GP prior. (d) Samples from the GP posterior, after hyperparameter learning.]
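A minimal MATLAB sketch (my addition) of evaluating Eq. (36) for an RBF kernel with hyperparameters $\theta = (a, \ell)$; in practice one would maximize this quantity with a gradient-based optimizer:

% Log marginal likelihood of a GP (Eq. 36) with an RBF kernel.
% Usage (illustrative): gp_lml([1; 2], x, y, 0.01)
function lml = gp_lml(theta, x, y, s2)
    a = theta(1); l = theta(2);
    K = a^2 * exp(-(x - x').^2 / (2*l^2));  % kernel matrix K_theta
    C = K + s2*eye(numel(x));
    L = chol(C, 'lower');                   % C = L*L'
    alpha = L' \ (L \ y);                   % C^{-1} y via triangular solves
    lml = -0.5*y'*alpha ...                 % model fit
          - sum(log(diag(L))) ...           % -(1/2) log|C|
          - 0.5*numel(x)*log(2*pi);
end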
Aside: How Do We Build Models that Generalize?
[Figure: The evidence $p(\mathbf{y} \mid \mathcal{M})$ of simple, complex, and appropriate models across all possible datasets.]

- Support: which datasets (hypotheses) are a priori possible.
- Inductive Biases: which datasets are a priori likely.

We want to make the support of our model as big as possible, with inductive biases calibrated to particular applications, so as to not rule out potential explanations of the data, while at the same time quickly learning from a finite amount of information on a particular application.
Gaussian Processes and Neural Networks
"How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?" (MacKay, 1998)
Deep Kernel Learning
We combine the inductive biases of deep learning architectures with the non-parametric flexibility of Gaussian processes as part of a scalable deep kernel learning framework.

[Figure: A network architecture mapping inputs $x_1, \ldots, x_D$ through hidden layers $h^{(1)}, \ldots, h^{(L)}$ with weights $W^{(1)}, \ldots, W^{(L)}$, into an infinite "$\infty$ layer" $h_\infty(\theta)$, and out to the outputs $y_1, \ldots, y_P$.]

Deep Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*, Ruslan Salakhutdinov, and Eric P. Xing
Artificial Intelligence and Statistics (AISTATS), 2016.
Deep Kernel Learning
Starting from a base kernel $k(x_i, x_j \mid \boldsymbol{\theta})$ with hyperparameters $\boldsymbol{\theta}$, we transform the inputs (predictors) $x$ as

$k(x_i, x_j \mid \boldsymbol{\theta}) \to k(g(x_i, \mathbf{w}), g(x_j, \mathbf{w}) \mid \boldsymbol{\theta}, \mathbf{w})\,,$    (37)

where $g(x, \mathbf{w})$ is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights $\mathbf{w}$.

We use spectral mixture base kernels (Wilson, 2014):

$k_{\text{SM}}(x, x' \mid \boldsymbol{\theta}) = \sum_{q=1}^{Q} a_q \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2} \|\Sigma_q^{1/2} (x - x')\|^2\right) \cos\langle x - x',\, 2\pi \boldsymbol{\mu}_q \rangle\,.$    (38)

We backprop to learn all parameters jointly through the GP marginal likelihood (a toy sketch of the warped kernel follows).
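A toy sketch (mine) of the input-warping idea in Eq. (37), using a random two-layer network for $g$ and an RBF base kernel in place of the spectral mixture for brevity; all weights below are assumed, illustrative values:

% Deep kernel sketch (1D inputs): warp inputs through a toy network g,
% then apply the base kernel to the warped inputs, as in Eq. (37).
W1 = randn(10, 1); b1 = randn(10, 1);      % toy network weights (illustrative)
W2 = randn(1, 10); b2 = randn;
g = @(x) (W2 * tanh(W1*x' + b1) + b2)';    % g: column of inputs -> column of warped inputs
k_base = @(za,zb) exp(-(za - zb').^2 / 2); % RBF base kernel on warped inputs
k_deep = @(xa,xb) k_base(g(xa), g(xb));    % k(g(xi,w), g(xj,w))
x = (-3:0.5:3)';
K = k_deep(x, x);                          % deep kernel covariance matrix
imagesc(K); colorbar;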
Scalable Gaussian Processes
- Run Gaussian processes on millions of points in seconds, instead of thousands of points in hours.
- Outperforms stand-alone deep neural networks by learning deep kernels.
- Approach is based on kernel approximations which admit fast matrix-vector multiplies (Wilson and Nickisch, 2015).
- Harmonizes with GPU acceleration.
- O(n) training and O(1) testing (instead of O(n^3) training and O(n^2) testing).
- Implemented in our new library GPyTorch: https://github.com/cornellius-gp/gpytorch
Deep Kernel Learning Results
Similar runtime but improved performance over stand-alone DNNs and scalable kernel learning on a wide range of benchmarks.
Face Orientation Extraction
Figure: Left: Randomly sampled examples of the training and test data, with orientation labels. Right: The two-dimensional outputs of the convolutional network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.
Learning Flexible Non-Euclidean Similarity Metrics
Figure: Left: The induced covariance matrix using the DKL-SM kernel on a set of test cases, where the test samples are ordered according to the orientations of the input faces. Middle: The respective covariance matrix using the DKL-RBF kernel. Right: The respective covariance matrix using the regular RBF kernel. The models are trained with n = 12,000. We set Q = 4 for the SM base kernel.
Step Function
Figure: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with an SM base kernel. We set Q = 4 for the SM kernels.
LSTM Kernels
- We derive kernels which have recurrent LSTM inductive biases, and apply them to autonomous vehicles, where predictive uncertainty is critical.

[Figure: Predicted vehicle trajectories (north against east, in miles), colored by speed.]

Learning Scalable Deep Kernels with Recurrent Structure
M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, E. P. Xing
Journal of Machine Learning Research (JMLR), 2017
GP-LSTM Predictive Distributions
[Figure: GP-LSTM predictive distributions over front distance (m) against side distance (m) for several test scenarios.]
The Bayesian GAN
[Figure: A multimodal posterior $p(\theta_g \mid \mathcal{D})$ over generator parameters $\theta_g$, contrasted with a single maximum likelihood solution $(\theta_g)_{\text{ML}}$.]

Wilson and Saatchi (NIPS 2017)
Wide Optima Generalize Better
[Figure: Flat versus sharp minima of the training loss; from Keskar et al. (2017).]

- Bayesian integration will give very different predictions in deep learning especially!
Loss Surfaces in Deep Learning
(1) Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
(2) Averaging Weights Leads to Wider Optima and Better Generalization
(3) Consistency-Based Semi-Supervised Learning with Weight Averaging

- Local optima are connected along simple curves.
- SGD does not converge to broad optima in all directions.
- Averaging weights along SGD trajectories with constant or cyclical learning rates leads to much faster convergence and solutions that generalize better.
- Can approximate ensembles and Bayesian model averages with a single model.
Mode Connectivity
[Figure: Cross-sections of the training loss surface, showing low-loss curves connecting independently trained modes.]
Example Parametrizations
Polygonal Chain:

$\phi_\theta(t) = \begin{cases} 2\,\big(t\,\theta + (0.5 - t)\,w_1\big), & 0 \le t \le 0.5 \\ 2\,\big((t - 0.5)\,w_2 + (1 - t)\,\theta\big), & 0.5 \le t \le 1. \end{cases}$    (39)

Bezier Curve:

$\phi_\theta(t) = (1 - t)^2 w_1 + 2t(1 - t)\,\theta + t^2 w_2, \quad 0 \le t \le 1\,.$    (40)
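A small sketch (mine) of the two parametrizations, for endpoint weight vectors w1, w2 and trainable midpoint theta; the scalar "weights" are purely illustrative:

% Evaluate the polygonal chain (Eq. 39) and Bezier curve (Eq. 40)
% connecting weight vectors w1 and w2 through a trainable point theta.
chain  = @(t, w1, w2, th) (t <= 0.5) .* (2*(t.*th + (0.5 - t).*w1)) + ...
                          (t >  0.5) .* (2*((t - 0.5).*w2 + (1 - t).*th));
bezier = @(t, w1, w2, th) (1 - t).^2 .* w1 + 2*t.*(1 - t).*th + t.^2 .* w2;

% Scalar weights for illustration; in practice w1, w2, theta are the full
% flattened parameter vectors of trained networks.
w1 = 0; w2 = 1; th = 2;
t = linspace(0, 1, 101);
plot(t, chain(t, w1, w2, th), t, bezier(t, w1, w2, th));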
[Figure: Loss-surface cross-sections for the polygonal chain and Bezier curve parametrizations.]
Connection Procedure with Tractable Loss
We propose the computationally tractable loss:

$\ell(\theta) = \int_0^1 \mathcal{L}(\phi_\theta(t))\, dt = \mathbb{E}_{t \sim U(0,1)}\, \mathcal{L}(\phi_\theta(t))$    (41)

At each iteration we sample $t \sim U(0, 1)$ and make a gradient step for $\theta$ with respect to $\mathcal{L}(\phi_\theta(t))$; this stochastic gradient is unbiased:

$\mathbb{E}_{t \sim U(0,1)} \nabla_\theta \mathcal{L}(\phi_\theta(t)) = \nabla_\theta\, \mathbb{E}_{t \sim U(0,1)} \mathcal{L}(\phi_\theta(t)) = \nabla_\theta \ell(\theta)\,.$
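A toy sketch (mine) of this procedure on a synthetic 2D loss with two basins, training the Bezier midpoint theta with the stochastic gradient above; the loss, step size, and iteration count are all assumed for illustration:

% Toy curve-finding: train the Bezier midpoint so the whole curve
% phi_theta(t) between w1 and w2 stays in low-loss regions.
L     = @(w) min(sum((w-1).^2), sum((w+1).^2)) + 0.1*sum(w.^2);
gradL = @(w) 2*(w - 1)*(sum((w-1).^2) <  sum((w+1).^2)) + ...
             2*(w + 1)*(sum((w-1).^2) >= sum((w+1).^2)) + 0.2*w;
w1 = [-1; -1]; w2 = [1; 1];                   % two "trained" solutions
theta = 0.5*(w1 + w2);                        % initialize at the midpoint
lr = 0.05;
for iter = 1:2000
    t = rand;                                 % sample t ~ U(0,1)
    phi = (1-t)^2*w1 + 2*t*(1-t)*theta + t^2*w2;    % Bezier point (Eq. 40)
    theta = theta - lr * 2*t*(1-t) * gradL(phi);    % chain rule: d phi/d theta = 2t(1-t)
end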
Curve Ensembling

[Figure: Test performance of ensembles built from points sampled along the connecting curve.]
Fast Geometric Ensembling

[Figure: Fast Geometric Ensembling (FGE): a cyclical learning rate collects diverse networks along the training trajectory for ensembling.]
Trajectory of SGD
[Figure: Test error (%) in a plane of weight space containing SGD iterates W1, W2, W3 and their average W_SWA.]
[Figure: The same test error surface, with additional panels at epoch 125 showing test error (%) and train loss in a plane containing W_SGD and W_SWA.]
[Figure: Train loss and test error (%) surfaces in planes of weight space spanned by SGD trajectories.]
Following Random Paths
[Figure: Test error (%) along random paths in weight space.]
Path from w_SWA to w_SGD

[Figure: Test error (%) and train loss along the path from w_SWA to w_SGD, as a function of distance in weight space.]
Approximating an FGE Ensemble
Because the points sampled from an FGE ensemble take small steps in weight space by design, we can do a linearization analysis to show that

$f(w_{\text{SWA}}) \approx \frac{1}{n} \sum_{i=1}^{n} f(w_i)$
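A toy numerical check of this linearization (mine; the "model" and step size are assumed): for a smooth function of the weights and nearby weight settings, the prediction at the averaged weights approximately equals the average of the predictions, by a first-order Taylor expansion around the mean.

% For smooth f and nearby weights w_i, f(mean(w_i)) is close to mean(f(w_i)).
f = @(w) tanh(3*w(1)) + 0.5*sin(w(2));   % toy "network output" as a function of weights
w0 = [0.2; -0.4];
W = w0 + 0.01*randn(2, 20);              % 20 nearby weight settings (small steps)
wswa = mean(W, 2);                       % averaged weights
avg_pred = mean(arrayfun(@(i) f(W(:,i)), 1:20));
swa_pred = f(wswa);                      % approximately equal to avg_pred
[avg_pred, swa_pred]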
SWA Results, CIFAR

[Table: Test error (%) on CIFAR for SWA against baselines.]
SWA Results, ImageNet (Top-1 Error Rate)

[Table: Top-1 error rate (%) on ImageNet for SWA against baselines.]
Sampling from a High Dimensional Gaussian
SGD (with constant LR) proposals are on the surface of a hypersphere. Averaginglets us go inside the sphere to a point of higher density.
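A one-minute numerical illustration of this geometry (my addition): draws from a high-dimensional Gaussian concentrate on a thin shell of radius about sqrt(d), while their average lies near the centre, where the density is highest.

% Draws from N(0, I_d) concentrate at radius ~sqrt(d); their average
% falls much closer to the origin (the mode of the density).
d = 1000; n = 20;
W = randn(d, n);                  % n high-dimensional Gaussian samples
norms = sqrt(sum(W.^2, 1))        % each is close to sqrt(d) ~ 31.6
avg_norm = norm(mean(W, 2))       % much smaller: ~sqrt(d/n) ~ 7.1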
High Constant LR
[Figure: Test error (%) over 300 training epochs for conventional SGD, constant-LR SGD, and constant-LR SWA.]

Side observation: Averaging bad models does not give good solutions. Averaging bad weights can give great solutions.
Conclusions
- Bayesian deep learning can provide better predictions and uncertainty estimates.
- Computation is a key challenge. We address this issue through developing numerical linear algebra approaches.
- Developing fast deterministic (e.g., variational) approximations and stochastic MCMC algorithms will be important too.
- We can better understand model construction, and generalization, by taking a function space approach.
- We can be inspired by Bayesian methods to develop optimization procedures that generalize better.
- Code for the approaches discussed here is available at: https://people.orie.cornell.edu/andrew/code
Appendix
Deriving the RBF Kernel
$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \quad w_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{J}\right), \quad \phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{2\ell^2}\right)$    (42)

$\therefore \; k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x')$    (43)

- Let $c_J = \log J$, $c_1 = -\log J$, and $c_{i+1} - c_i = \Delta c = \frac{2 \log J}{J}$. As $J \to \infty$, the kernel in Eq. (43) becomes a Riemann sum:

$k(x, x') = \lim_{J \to \infty} \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') = \int_{c_0}^{c_\infty} \phi_c(x)\, \phi_c(x')\, dc$    (44)

- By setting $c_0 = -\infty$ and $c_\infty = \infty$, we spread the infinitely many basis functions across the whole real line, each a distance $\Delta c \to 0$ apart:

$k(x, x') = \int_{-\infty}^{\infty} \exp\left(-\frac{(x - c)^2}{2\ell^2}\right) \exp\left(-\frac{(x' - c)^2}{2\ell^2}\right) dc$    (45)

$= \sqrt{\pi}\, \ell\, \sigma^2 \exp\left(-\frac{(x - x')^2}{2(\sqrt{2}\ell)^2}\right).$    (46)
Deriving the RBF Kernel
- It is remarkable that we can work with infinitely many basis functions with finite amounts of computation using the kernel trick: replacing inner products of basis functions with kernels.
- The RBF kernel, also known as the Gaussian or squared exponential kernel, is by far the most popular kernel: $k_{\text{RBF}}(x, x') = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$.
- Functions drawn from a GP with an RBF kernel are infinitely differentiable. For this reason, the RBF kernel is accused of being overly smooth and unrealistic. Nonetheless it has nice theoretical properties...
References

- A.G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML), 2015.
- P. Izmailov*, D. Podoprikhin*, T. Garipov*, D. Vetrov, A.G. Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. Uncertainty in Artificial Intelligence (UAI), 2018.
- T. Garipov*, P. Izmailov*, D. Podoprikhin*, D. Vetrov, A.G. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs, 2018.
- R. Neal. Bayesian Learning for Neural Networks. PhD Thesis, 1996.
- D. MacKay. Introduction to Gaussian Processes. Neural Networks and Machine Learning, 1998.
- D. MacKay. Information Theory, Inference, and Learning Algorithms, 2003.
- D. MacKay. Bayesian Interpolation, 1992.
- M. Welling and Y.W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. International Conference on Machine Learning (ICML), 2011.
- T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. International Conference on Machine Learning (ICML), 2014.
- J. Gardner, G. Pleiss, K. Weinberger, A.G. Wilson. GPyTorch: Scalable and Modular Gaussian Processes in PyTorch: https://github.com/cornellius-gp/gpytorch.
- Codebase for group projects: https://people.orie.cornell.edu/andrew/code.
- B. Athiwaratkun, M. Finzi, P. Izmailov, A.G. Wilson. Improving Consistency-Based Semi-Supervised Learning with Weight Averaging.
- A.G. Wilson*, Z. Hu*, R. Salakhutdinov, and E.P. Xing. Deep kernel learning. Artificial Intelligence and Statistics (AISTATS), 2016.
- A.G. Wilson*, Z. Hu*, R. Salakhutdinov, and E.P. Xing. Stochastic Variational Deep Kernel Learning. Neural Information Processing Systems (NIPS), 2016.
- M. Al-Shedivat, A.G. Wilson, Y. Saatchi, Z. Hu, and E.P. Xing. Learning Scalable Deep Kernels with Recurrent Structure. Journal of Machine Learning Research (JMLR), 2017.
- C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Cambridge Press, 2006.
- N.S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2017.