Gaussian process covariance functions

Carl Edward Rasmussen

October 20th, 2016

mlg.eng.cam.ac.uk/teaching/4f13/1718/covariance functions.pdf



Key concepts

• choose covariance functions and use the marginal likelihood to
  • set hyperparameters
  • choose between different covariance functions

• covariance functions and hyperparameters can help interpret the data
• we illustrate a number of different covariance function families

• stationary covariance functions: squared exponential, rational quadratic and Matérn forms

• many existing models are special cases of Gaussian processes
  • radial basis function networks (RBF)
  • splines
  • large neural networks

• combining existing simple covariance functions into more interesting ones


Model Selection, Hyperparameters, and ARD

We need to determine both the form and parameters of the covariance function. We typically use a hierarchical model, where the parameters of the covariance are called hyperparameters. A very useful idea is to use automatic relevance determination (ARD) covariance functions for feature/variable selection, e.g.:

k(x, x′) = v₀² exp( − ∑_{d=1}^{D} (x_d − x′_d)² / (2v_d²) ),  hyperparameters θ = (v₀, v₁, …, v_D, σ_n²).

[Figure: sample functions of (x₁, x₂) drawn with different ARD length scales: v₁ = v₂ = 1; v₁ = v₂ = 0.32; v₁ = 0.32 and v₂ = 1.]
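As an illustration (not part of the slides), the ARD covariance above can be sketched in a few lines of NumPy; the names `k_ard`, `v0` and `v` are chosen here, and σ_n² is omitted since it only enters through the noise model:

```python
import numpy as np

def k_ard(x, xp, v0, v):
    # k(x, x') = v0^2 exp( - sum_d (x_d - x'_d)^2 / (2 v_d^2) )
    x, xp, v = np.asarray(x), np.asarray(xp), np.asarray(v)
    return v0**2 * np.exp(-np.sum((x - xp)**2 / (2.0 * v**2)))

# A short length scale v_d makes dimension d highly relevant;
# a very long one makes it effectively irrelevant.
k_rel = k_ard([0.0, 0.0], [1.0, 0.0], v0=1.0, v=[0.3, 0.3])
k_irr = k_ard([0.0, 0.0], [1.0, 0.0], v0=1.0, v=[100.0, 0.3])
```

With v₁ = 100 a unit step in x₁ barely changes the covariance, which is exactly how marginal-likelihood optimisation can switch irrelevant inputs off.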


Rational quadratic covariance function

The rational quadratic (RQ) covariance function, where r = x − x′:

kRQ(r) = (1 + r²/(2α`²))^(−α)

with α, ` > 0 can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using τ = `⁻² and p(τ|α, β) ∝ τ^(α−1) exp(−ατ/β):

kRQ(r) = ∫ p(τ|α, β) kSE(r|τ) dτ ∝ ∫ τ^(α−1) exp(−ατ/β) exp(−τr²/2) dτ ∝ (1 + r²/(2α`²))^(−α).


Rational quadratic covariance function II

[Figure: RQ covariance as a function of input distance for α = 1/2, α = 2 and α → ∞ (left); corresponding sample functions f(x) (right).]

The limit α→∞ of the RQ covariance function is the SE.
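This limit is easy to check numerically; the sketch below (with made-up names `k_rq`, `k_se`) assumes NumPy:

```python
import numpy as np

def k_rq(r, alpha, ell=1.0):
    # rational quadratic: (1 + r^2 / (2 alpha ell^2))^(-alpha)
    return (1.0 + r**2 / (2.0 * alpha * ell**2))**(-alpha)

def k_se(r, ell=1.0):
    # squared exponential: exp(-r^2 / (2 ell^2))
    return np.exp(-r**2 / (2.0 * ell**2))

r = 1.5
diff = abs(k_rq(r, alpha=1e6) - k_se(r))   # tiny: RQ approaches SE as alpha grows
heavy = k_rq(r, alpha=0.5) > k_se(r)       # small alpha gives heavier tails
```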


Matérn covariance functions

Stationary covariance functions can be based on the Matérn form:

k(x, x′) = (1/(Γ(ν)2^(ν−1))) (√(2ν)|x − x′|/`)^ν K_ν(√(2ν)|x − x′|/`),

where K_ν is the modified Bessel function of the second kind of order ν, and ` is the characteristic length scale.
Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the hyperparameter ν can control the degree of smoothness.
Special cases:

• k_(ν=1/2)(r) = exp(−r/`): Laplacian covariance function, Brownian motion (Ornstein–Uhlenbeck)
• k_(ν=3/2)(r) = (1 + √3r/`) exp(−√3r/`) (once differentiable)
• k_(ν=5/2)(r) = (1 + √5r/` + 5r²/(3`²)) exp(−√5r/`) (twice differentiable)
• k_(ν→∞)(r) = exp(−r²/(2`²)): smooth (infinitely differentiable)
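A minimal sketch of these special cases, assuming NumPy; `k_matern` is a name chosen here, and only the closed-form cases from the slide are covered (general ν would need the modified Bessel function, e.g. scipy.special.kv):

```python
import numpy as np

def k_matern(r, nu, ell=1.0):
    # closed-form Matern covariances from the slide (unit variance)
    s = r / ell
    if nu == 0.5:
        return np.exp(-s)
    if nu == 1.5:
        return (1.0 + np.sqrt(3.0) * s) * np.exp(-np.sqrt(3.0) * s)
    if nu == 2.5:
        return (1.0 + np.sqrt(5.0) * s + 5.0 * s**2 / 3.0) * np.exp(-np.sqrt(5.0) * s)
    if nu == np.inf:
        return np.exp(-s**2 / 2.0)
    raise ValueError("general nu needs the modified Bessel function K_nu")

# Larger nu gives a smoother process, hence higher covariance at the same distance.
vals = [k_matern(1.0, nu) for nu in (0.5, 1.5, 2.5, np.inf)]
```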


Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and unit variance:

[Figure: Matérn covariance function against input distance (left) and sample functions f(x) (right), for ν = 1/2, ν = 1, ν = 2 and ν → ∞.]


Periodic, smooth functions

To create a distribution over periodic functions of x, we can first map the inputs to u = (sin(x), cos(x))ᵀ, and then measure distances in the u space. Combined with the SE covariance function with characteristic length scale `, we get:

kperiodic(x, x′) = exp(−2 sin²(π(x − x′))/`²)

[Figure: three functions drawn at random; left ` > 1, and right ` < 1.]
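Drawing such functions amounts to sampling a multivariate Gaussian with this covariance; the sketch below assumes NumPy, and the small jitter added to the diagonal is a standard numerical device, not part of the slides:

```python
import numpy as np

def k_periodic(x, xp, ell):
    # k(x, x') = exp(-2 sin^2(pi (x - x')) / ell^2); period 1 in x
    return np.exp(-2.0 * np.sin(np.pi * (x - xp))**2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 81)
K = k_periodic(x[:, None], x[None, :], ell=1.0)
# small jitter on the diagonal for numerical stability when sampling
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```

Every draw f repeats exactly with period 1, since k(x, x + 1) = k(x, x).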


Spline models

One-dimensional minimization problem: find the function f(x) which minimizes:

∑_{i=1}^{n} (f(x^(i)) − y^(i))² + λ ∫₀¹ (f″(x))² dx,

where 0 < x^(i) < x^(i+1) < 1, ∀i = 1, …, n − 1, has as solution the natural smoothing cubic spline: first order polynomials when x ∈ [0, x^(1)] and when x ∈ [x^(n), 1], and a cubic polynomial in each x ∈ [x^(i), x^(i+1)], ∀i = 1, …, n − 1, joined to have continuous second derivatives at the knots.
The identical function is also the mean of a Gaussian process: consider the class of functions given by:

f(x) = α + βx + lim_{n→∞} (1/√n) ∑_{i=0}^{n−1} γ_i (x − i/n)₊,  where (x)₊ = x if x > 0, and 0 otherwise,

with Gaussian priors:

α ∼ N(0, ξ), β ∼ N(0, ξ), γ_i ∼ N(0, Γ), ∀i = 0, …, n − 1.


The covariance function becomes:

k(x, x′) = ξ + xx′ξ + Γ lim_{n→∞} (1/n) ∑_{i=0}^{n−1} (x − i/n)₊ (x′ − i/n)₊
         = ξ + xx′ξ + Γ ∫₀¹ (x − u)₊ (x′ − u)₊ du
         = ξ + xx′ξ + Γ( ½|x − x′| min(x, x′)² + ⅓ min(x, x′)³ ).

In the limit ξ → ∞ and λ = σ_n²/Γ the posterior mean becomes the natural cubic spline.
We can thus find the hyperparameters σ_n² and Γ (and thereby λ) by maximising the marginal likelihood in the usual way.
Defining h(x) = (1, x)ᵀ, the posterior predictions have mean and variance:

μ̃(X∗) = H(X∗)ᵀβ + K(X, X∗)ᵀ[K(X, X) + σ_n²I]⁻¹(y − H(X)ᵀβ)

Σ̃(X∗) = Σ(X∗) + R(X, X∗)ᵀ A(X)⁻¹ R(X, X∗)

β = A(X)⁻¹ H(X)[K + σ_n²I]⁻¹ y,  A(X) = H(X)[K(X, X) + σ_n²I]⁻¹ H(X)ᵀ

R(X, X∗) = H(X∗) − H(X)[K + σ_n²I]⁻¹ K(X, X∗)
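A hedged sketch of GP regression with the spline covariance above, assuming NumPy (the explicit-basis terms H and β are dropped for brevity, i.e. finite ξ absorbs the linear part; the names `k_spline` and `sn2` are chosen here):

```python
import numpy as np

def k_spline(x, xp, xi=100.0, Gamma=1.0):
    # k(x, x') = xi + x x' xi + Gamma (|x - x'| min(x, x')^2 / 2 + min(x, x')^3 / 3)
    m = np.minimum(x, xp)
    return xi + x * xp * xi + Gamma * (np.abs(x - xp) * m**2 / 2.0 + m**3 / 3.0)

# GP regression on a few points in (0, 1); sn2 plays the role of sigma_n^2
X = np.array([0.1, 0.4, 0.7, 0.9])
y = np.sin(2.0 * np.pi * X)
sn2 = 0.01
K = k_spline(X[:, None], X[None, :])
ks = k_spline(X[:, None], np.array([[0.5]]))
mu = ks.T @ np.linalg.solve(K + sn2 * np.eye(len(X)), y)  # posterior mean at x* = 0.5
```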


Cubic Splines, Example

Although this is not the fastest way to compute splines, it offers a principled way of finding hyperparameters, and uncertainties on predictions.
Note also, that although the posterior mean is smooth (piecewise cubic), posterior sample functions are not.

[Figure: spline GP posterior on the unit interval, showing observations, the posterior mean, two random sample functions, and the 95% confidence region.]


Feed Forward Neural Networks

[Figure: feed-forward network diagram with input units, bias, tanh hidden units and a linear output unit; weight groups: input–hidden, bias–hidden, output weights.]

A feed forward neural network implements the function:

f(x) = ∑_{i=1}^{H} v_i tanh( ∑_j u_ij x_j + b_i )
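Sampling such a network with random Gaussian weights can be sketched as below; the 1/√H scaling of the output weights, which keeps the prior variance of f fixed as H grows, anticipates the GP limit on the next slide (the names and weight scales here are illustrative, not from the slides):

```python
import numpy as np

def nn_draw(x, H, rng):
    # One random network f(x) = sum_i v_i tanh(u_i x + b_i),
    # with Gaussian weights; v_i scaled by 1/sqrt(H) so the
    # prior variance of f does not blow up as H grows.
    u = rng.normal(0.0, 5.0, size=H)               # input-to-hidden weights
    b = rng.normal(0.0, 5.0, size=H)               # hidden-unit biases
    v = rng.normal(0.0, 1.0 / np.sqrt(H), size=H)  # output weights
    return np.tanh(np.outer(x, u) + b) @ v

rng = np.random.default_rng(1)
x = np.linspace(-10.0, 10.0, 200)
ys = [nn_draw(x, H, rng) for H in (2, 5, 1000)]    # 2, 5 and 1000 hidden units
```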


Limits of Large Neural Networks

Sample random neural network weights from the (Gaussian) prior.

[Figure: network output y versus input x for random draws with 2, 5 and 1000 hidden units.]

Note: The prior on the neural network weights induces a prior over functions.


[Figure: two-dimensional function drawn at random from a neural network covariance function.]

k(x, x′) = (2/π) arcsin( 2xᵀΣx′ / √((1 + 2xᵀΣx)(1 + 2x′ᵀΣx′)) ).


Composite covariance functions

We’ve seen many examples of covariance functions.

Covariance functions have to be positive definite.

One way of building covariance functions is by composing simpler ones in various ways:

• sums of covariance functions k(x, x′) = k₁(x, x′) + k₂(x, x′)
• products k(x, x′) = k₁(x, x′) × k₂(x, x′)
• other combinations: g(x)k(x, x′)g(x′)
• etc.

The gpml toolbox supports such constructions.
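For instance, a sum and a product of an SE and a periodic covariance can be built and checked for positive (semi-)definiteness on a grid of inputs; this sketch assumes NumPy and uses made-up function names:

```python
import numpy as np

def k_se(x, xp, ell=1.0):
    # squared exponential
    return np.exp(-(x - xp)**2 / (2.0 * ell**2))

def k_per(x, xp, ell=1.0):
    # periodic (period 1)
    return np.exp(-2.0 * np.sin(np.pi * (x - xp))**2 / ell**2)

def k_sum(x, xp):
    return k_se(x, xp, ell=5.0) + k_per(x, xp)

def k_prod(x, xp):
    # "locally periodic": periodic structure that decays with distance
    return k_se(x, xp, ell=5.0) * k_per(x, xp)

x = np.linspace(0.0, 4.0, 40)
K = k_prod(x[:, None], x[None, :])  # Gram matrix; should be positive semi-definite
```

Sums and products of positive definite functions are again positive definite, which is why these compositions are safe.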
