
Gaussian process covariance functions

Carl Edward Rasmussen

October 20th, 2016


Key concepts

• choose covariance functions, and use the marginal likelihood to
  • set hyperparameters
  • choose between different covariance functions
• covariance functions and hyperparameters can help interpret the data
• we illustrate a number of different covariance function families
• stationary covariance functions: squared exponential, rational quadratic and Matérn forms
• many existing models are special cases of Gaussian processes
  • radial basis function networks (RBF)
  • splines
  • large neural networks
• combining existing simple covariance functions into more interesting ones


Model Selection, Hyperparameters, and ARD

We need to determine both the form and parameters of the covariance function. We typically use a hierarchical model, where the parameters of the covariance are called hyperparameters.

A very useful idea is to use automatic relevance determination (ARD) covariance functions for feature/variable selection, e.g.:

$$k(x, x') = v_0^2 \exp\Big(-\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2 v_d^2}\Big), \qquad \text{hyperparameters } \theta = (v_0, v_1, \ldots, v_D, \sigma_n^2).$$

[Figure: three surface plots of sample functions over inputs (x1, x2), for ARD hyperparameters v1 = v2 = 1; v1 = v2 = 0.32; and v1 = 0.32, v2 = 1.]
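As a concrete illustration, here is a minimal NumPy sketch of the ARD covariance above; the function name `ard_se` and the toy inputs are our own, not from the slides:

```python
import numpy as np

def ard_se(X, X2, v0, v):
    """ARD squared exponential: k(x, x') = v0^2 exp(-sum_d (x_d - x'_d)^2 / (2 v_d^2)).

    X: (n, D) and X2: (m, D) inputs; v0: signal amplitude; v: (D,) length scales.
    """
    Xs, X2s = X / v, X2 / v                                 # scale each input dimension
    sq = ((Xs[:, None, :] - X2s[None, :, :]) ** 2).sum(-1)  # (n, m) scaled sq. distances
    return v0**2 * np.exp(-0.5 * sq)

# A large v_d flattens the covariance along dimension d, switching that input off:
X = np.random.randn(5, 2)
K_both = ard_se(X, X, v0=1.0, v=np.array([1.0, 1.0]))       # both inputs relevant
K_one  = ard_se(X, X, v0=1.0, v=np.array([1.0, 1e6]))       # x2 effectively ignored
```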

Rational quadratic covariance function

The rational quadratic (RQ) covariance function, where r = x − x':

$$k_{\mathrm{RQ}}(r) = \Big(1 + \frac{r^2}{2\alpha\ell^2}\Big)^{-\alpha}$$

with α, ℓ > 0 can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using τ = ℓ⁻² and p(τ|α,β) ∝ τ^{α−1} exp(−ατ/β):

$$k_{\mathrm{RQ}}(r) = \int p(\tau|\alpha,\beta)\, k_{\mathrm{SE}}(r|\tau)\, d\tau \;\propto\; \int \tau^{\alpha-1} \exp\Big(-\frac{\alpha\tau}{\beta}\Big) \exp\Big(-\frac{\tau r^2}{2}\Big)\, d\tau \;\propto\; \Big(1 + \frac{r^2}{2\alpha\ell^2}\Big)^{-\alpha}.$$


Rational quadratic covariance function II

[Figure: left, the RQ covariance against input distance for α = 1/2, α = 2 and α → ∞; right, corresponding sample functions f(x) against input x.]

The limit α→∞ of the RQ covariance function is the SE.

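A small sketch (our own naming, not from the slides) of the RQ covariance, numerically confirming that the α → ∞ limit recovers the SE:

```python
import numpy as np

def k_rq(r, alpha, ell):
    """Rational quadratic: k_RQ(r) = (1 + r^2 / (2 alpha ell^2))^(-alpha)."""
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def k_se(r, ell):
    """Squared exponential: k_SE(r) = exp(-r^2 / (2 ell^2))."""
    return np.exp(-r**2 / (2.0 * ell**2))

# As alpha grows, the RQ approaches the SE with the same length scale:
r = np.linspace(0.0, 3.0, 7)
print(np.max(np.abs(k_rq(r, alpha=1e6, ell=1.0) - k_se(r, ell=1.0))))  # tiny, ~1e-5
```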

Matérn covariance functions

Stationary covariance functions can be based on the Matérn form:

$$k(x, x') = \frac{1}{\Gamma(\nu)\,2^{\nu-1}} \Big[\frac{\sqrt{2\nu}}{\ell}\,|x - x'|\Big]^{\nu} K_\nu\Big(\frac{\sqrt{2\nu}}{\ell}\,|x - x'|\Big),$$

where K_ν is the modified Bessel function of the second kind of order ν, and ℓ is the characteristic length scale.

Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the hyperparameter ν can control the degree of smoothness.

Special cases:

• k_{ν=1/2}(r) = exp(−r/ℓ): Laplacian covariance function, Brownian motion (Ornstein–Uhlenbeck)
• k_{ν=3/2}(r) = (1 + √3 r/ℓ) exp(−√3 r/ℓ) (once differentiable)
• k_{ν=5/2}(r) = (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ) (twice differentiable)
• k_{ν→∞}(r) = exp(−r²/(2ℓ²)): smooth (infinitely differentiable)

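The half-integer special cases above have simple closed forms; the sketch below (our own helper, not from the slides) implements them directly:

```python
import numpy as np

def matern(r, nu, ell=1.0):
    """Matérn covariance (unit variance) for the common half-integer orders."""
    r = np.abs(np.asarray(r, dtype=float))
    if nu == 0.5:                          # Laplacian / Ornstein-Uhlenbeck
        return np.exp(-r / ell)
    if nu == 1.5:                          # once differentiable samples
        a = np.sqrt(3.0) * r / ell
        return (1.0 + a) * np.exp(-a)
    if nu == 2.5:                          # twice differentiable samples
        a = np.sqrt(5.0) * r / ell
        return (1.0 + a + 5.0 * r**2 / (3.0 * ell**2)) * np.exp(-a)
    if nu == np.inf:                       # SE limit, infinitely differentiable
        return np.exp(-r**2 / (2.0 * ell**2))
    raise ValueError("only nu in {1/2, 3/2, 5/2, inf} implemented in this sketch")
```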

Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and unit variance:

[Figure: left, the covariance function against input distance; right, sample functions f(x) against input x; for ν = 1/2, ν = 1, ν = 2 and ν → ∞.]


Periodic, smooth functions

To create a distribution over periodic functions of x, we can first map the inputs to u = (sin(x), cos(x))ᵀ, and then measure distances in the u space. Combined with the SE covariance function with characteristic length scale ℓ, we get:

$$k_{\mathrm{periodic}}(x, x') = \exp\big(-2\sin^2\big(\pi(x - x')\big)/\ell^2\big)$$


Three functions drawn at random; left ` > 1, and right ` < 1.

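A minimal sketch of this covariance and of drawing random periodic functions from it (the jitter term and the names are our own choices):

```python
import numpy as np

def k_periodic(x, x2, ell):
    """Periodic covariance: k(x, x') = exp(-2 sin^2(pi (x - x')) / ell^2)."""
    d = np.subtract.outer(x, x2)
    return np.exp(-2.0 * np.sin(np.pi * d) ** 2 / ell**2)

# Draw three random periodic functions on a grid (jitter keeps the Cholesky stable):
xs = np.linspace(-2.0, 2.0, 400)
K = k_periodic(xs, xs, ell=1.0) + 1e-8 * np.eye(xs.size)
f = np.linalg.cholesky(K) @ np.random.randn(xs.size, 3)   # three sample paths
```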

Spline models

One dimensional minimization problem: find the function f(x) which minimizes:

$$\sum_{i=1}^{n} \big(f(x^{(i)}) - y^{(i)}\big)^2 + \lambda \int_0^1 \big(f''(x)\big)^2\, dx,$$

where 0 < x^{(i)} < x^{(i+1)} < 1, ∀i = 1, ..., n − 1. The solution is the Natural Smoothing Cubic Spline: first order polynomials when x ∈ [0, x^{(1)}] and when x ∈ [x^{(n)}, 1], and a cubic polynomial in each x ∈ [x^{(i)}, x^{(i+1)}], ∀i = 1, ..., n − 1, joined to have continuous second derivatives at the knots.

The identical function is also the mean of a Gaussian process: consider the class of functions given by:

$$f(x) = \alpha + \beta x + \lim_{n\to\infty} \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} \gamma_i \Big(x - \frac{i}{n}\Big)_+, \quad \text{where } (x)_+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

with Gaussian priors:

$$\alpha \sim \mathcal{N}(0, \xi), \quad \beta \sim \mathcal{N}(0, \xi), \quad \gamma_i \sim \mathcal{N}(0, \Gamma), \; \forall i = 0, \ldots, n-1.$$


The covariance function becomes:

$$k(x, x') = \xi + x x' \xi + \Gamma \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \Big(x - \frac{i}{n}\Big)_+ \Big(x' - \frac{i}{n}\Big)_+$$
$$= \xi + x x' \xi + \Gamma \int_0^1 (x - u)_+ (x' - u)_+\, du$$
$$= \xi + x x' \xi + \Gamma \Big(\tfrac{1}{2}\, |x - x'| \min(x, x')^2 + \tfrac{1}{3} \min(x, x')^3\Big).$$

In the limit ξ → ∞ and λ = σ_n²/Γ the posterior mean becomes the natural cubic spline.

We can thus find the hyperparameters σ_n² and Γ (and thereby λ) by maximising the marginal likelihood in the usual way.

Defining h(x) = (1, x)ᵀ, the posterior predictions have mean and variance:

$$\tilde{\mu}(X_*) = H(X_*)^\top \beta + K(X, X_*)\big[K(X, X) + \sigma_n^2 I\big]^{-1}\big(\mathbf{y} - H(X)^\top \beta\big)$$
$$\tilde{\Sigma}(X_*) = \Sigma(X_*) + R(X, X_*)^\top A(X)^{-1} R(X, X_*)$$
$$\beta = A(X)^{-1} H(X)\big[K + \sigma_n^2 I\big]^{-1} \mathbf{y}, \qquad A(X) = H(X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} H(X)^\top$$
$$R(X, X_*) = H(X_*) - H(X)\big[K + \sigma_n^2 I\big]^{-1} K(X, X_*)$$

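For illustration, a small sketch of the spline covariance above, with a large but finite ξ standing in for the ξ → ∞ limit (names and default values are our own):

```python
import numpy as np

def k_spline(x, x2, xi=100.0, Gamma=1.0):
    """Spline covariance: xi + xi*x*x' + Gamma*(|x - x'| m^2 / 2 + m^3 / 3), m = min(x, x').

    A large finite xi approximates the xi -> infinity limit from the slides.
    """
    A, B = x[:, None], x2[None, :]          # broadcast to an (n, m) grid of input pairs
    m = np.minimum(A, B)
    return xi + xi * A * B + Gamma * (0.5 * np.abs(A - B) * m**2 + m**3 / 3.0)

x = np.linspace(0.0, 1.0, 5)
K = k_spline(x, x)                          # prior covariance at the training inputs
```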

Cubic Splines, Example

Although this is not the fastest way to compute splines, it offers a principled way of finding hyperparameters, and uncertainties on predictions.

Note also that although the posterior mean is smooth (piecewise cubic), posterior sample functions are not.

[Figure: cubic spline GP predictions on the unit interval, showing observations, the posterior mean, two random posterior samples and a 95% confidence region.]


Feed Forward Neural Networks

[Figure: a feed forward network diagram with input units, a bias, tanh hidden units and a linear output unit; weight groups: input-hidden, bias-hidden and output weights.]

A feed forward neural network implements the function:

$$f(x) = \sum_{i=1}^{H} v_i \tanh\Big(\sum_j u_{ij} x_j + b_i\Big)$$

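To see the prior over functions this induces, one can sample the weights from Gaussian priors and evaluate the network. The sketch below uses our own weight scalings (not from the slides), with the output weights shrunk by 1/√H so the prior variance of f(x) stays finite as H grows; it produces plots like those on the next slide:

```python
import numpy as np

def nn_sample(x, H, rng, s_v=1.0, s_u=5.0, s_b=5.0):
    """One random network f(x) = sum_i v_i tanh(u_i x + b_i), weights ~ Gaussian."""
    v = rng.normal(0.0, s_v / np.sqrt(H), H)   # output weights, scaled by 1/sqrt(H)
    u = rng.normal(0.0, s_u, H)                # input-to-hidden weights
    b = rng.normal(0.0, s_b, H)                # hidden unit biases
    return np.tanh(np.outer(x, u) + b) @ v

rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 200)
samples = [nn_sample(x, H=1000, rng=rng) for _ in range(3)]  # three random functions
```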

Limits of Large Neural Networks

Sample random neural network weights from the (Gaussian) prior.

[Figure: network output y against input x for random weight draws, with 2, 5 and 1000 hidden units.]

Note: The prior on the neural network weights induces a prior over functions.



Function drawn at random from a Neural Network covariance function

$$k(x, x') = \frac{2}{\pi} \arcsin\Bigg(\frac{2\, x^\top \Sigma\, x'}{\sqrt{\big(1 + 2\, x^\top \Sigma\, x\big)\big(1 + 2\, x'^\top \Sigma\, x'\big)}}\Bigg).$$

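A sketch of this covariance in NumPy (the function name is ours; if Σ is meant to include a bias contribution, augment the inputs with a constant 1 first):

```python
import numpy as np

def k_nn(X, X2, Sigma):
    """Neural network covariance: (2/pi) arcsin(2 x'Sx' / sqrt((1+2xSx)(1+2x'Sx'))).

    X: (n, D), X2: (m, D) inputs; Sigma: (D, D) weight prior covariance.
    """
    cross = X @ Sigma @ X2.T                                  # x^T Sigma x' terms
    d1 = 1.0 + 2.0 * np.einsum("nd,de,ne->n", X, Sigma, X)    # 1 + 2 x^T Sigma x
    d2 = 1.0 + 2.0 * np.einsum("md,de,me->m", X2, Sigma, X2)  # 1 + 2 x'^T Sigma x'
    return (2.0 / np.pi) * np.arcsin(2.0 * cross / np.sqrt(np.outer(d1, d2)))
```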

Composite covariance functions

We’ve seen many examples of covariance functions.

Covariance functions have to be positive definite.

One way of building covariance functions is by composing simpler ones in various ways

• sums of covariance functions: k(x, x') = k1(x, x') + k2(x, x')
• products: k(x, x') = k1(x, x') × k2(x, x')
• other combinations: g(x) k(x, x') g(x')
• etc.

The gpml toolbox supports such constructions.
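In the same spirit (though gpml itself is a MATLAB/Octave toolbox), here is a brief Python sketch of these constructions; any sum or product of valid covariance functions, or a scaling of the form g(x) k(x, x') g(x'), remains positive definite:

```python
import numpy as np

def k_se(x, x2, ell=1.0):
    """Squared exponential base kernel."""
    return np.exp(-np.subtract.outer(x, x2) ** 2 / (2.0 * ell**2))

def k_per(x, x2, ell=1.0):
    """Periodic base kernel (period 1)."""
    return np.exp(-2.0 * np.sin(np.pi * np.subtract.outer(x, x2)) ** 2 / ell**2)

x = np.linspace(0.0, 4.0, 50)
K_sum  = k_se(x, x, ell=2.0) + k_per(x, x)       # sum of covariances
K_prod = k_se(x, x, ell=2.0) * k_per(x, x)       # product: locally periodic
g = 1.0 + x                                      # scaling function g(x)
K_gkg  = g[:, None] * k_se(x, x) * g[None, :]    # g(x) k(x, x') g(x')
```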

