112
An Introduction to Fitting Gaussian Processes to Data Michael Osborne Pattern Analysis and Machine Learning Research Group Department of Engineering University of Oxford

An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

An Introduction to Fitting Gaussian Processes to Data

Michael Osborne Pattern Analysis and Machine Learning Research Group

Department of Engineering University of Oxford

Page 2: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

You will learn how to fit a Gaussian process to data.

Page 3: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Probability Theory

C

R

Deductive Logic

C

R

Probability theory represents an extension of traditional logic, allowing us to reason in the face of uncertainty.

Page 4: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

P( R | C, I )

A probability is a degree of belief. This might be held by any agent – a human, a robot, a sensor, etc.

Page 5: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We define our agents so that they can perform difficult inference for us.

‘I’ is the totality of an agent’s prior information. An agent is defined by I.

Page 6: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A dot-to-dot is an inference problem.

Page 7: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A dot-to-dot is a problem with many possible solutions.

Page 8: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Our prior information allows us to discriminate between solutions.

Page 9: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y

x

A dot-to-dot requires us to do inference about functions, as can be seen more clearly in one dimension.

Page 10: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y1

y2

mean

The multivariate Gaussian distribution is wonderful; it is defined by a mean vector (which simply gives the centrepoint) and covariance matrix.

Page 11: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

3 1 0 0 1 3 2 0 0 2 4 -1 0 0 -1 3

K must be symmetric. The diagonal of K must be positive. K = RT R for some upper triangular R. The eigenvalues of K are all ≥ 0.

The covariance K must be a positive semi-definite matrix; so for any vector x, xT K x ≥ 0. This implies that:

Page 12: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

=

=

3

2

1

32

321

21

T

000000

)sin()sin(00)cos()sin()sin(0

)cos()cos(1

hh

h

R

RRK

θθ

θθθ

θθ

We can represent any covariance K using the spherical parameterisation.

Page 13: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

0.32 0.56

0.56 1.09

K =

The (i,j)th element of the covariance expresses how variable i is dependent upon variable j.

Page 14: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

1.67 -2 1.33 -2 3 -2 1.33 -2 1.67

3 2 0 2 3 2 0 2 3

K-1 =

K =

The (i,j)th element of the inverse covariance (precision) expresses how variable i is dependent upon variable j, conditioned on all other variables.

Page 15: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y1

y2

p(y1 | I)

p(y1 | y2=-5, I)

The multivariate Gaussian distribution allows us to produce distributions for variables conditioned on any other observed variables.

Page 16: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y1

y2

p(y1 | y2=-5, I)

yx

x

The Gaussian distribution allows us to produce distributions for variables conditioned on any other observed variables.

y1

y2

Page 17: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A Gaussian process is the generalisation of a multivariate Gaussian distribution to a potentially infinite number of variables.

yx

x

Page 18: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

=

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

=

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

);,();,();,();,();,();,();,();,();,(

);();();(

332313

322212

312111

333231

232221

132111

3

2

1

3

2

1

φφφ

φφφ

φφφ

φµ

φµ

φµ

µ

µ

µ

xxKxxKxxKxxKxxKxxKxxKxxKxxK

KKKKKKKKK

xxx

For a Gaussian process, we need to define mean and covariance functions, specified by hyperparameters φ.

Page 19: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y

x

A Gaussian process represents a powerful way to perform Bayesian inference about functions.

Page 20: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A Gaussian process produces a mean estimate.

y

x

Page 21: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y

x

A Gaussian process produces a mean estimate along with an indication of the uncertainty in it.

y

x

Page 22: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( )),(),(),(),()|(

)(),(),()()|(

),(),(),(),(

,)()(

;),(

)()(

*1

****

1***

*

******

**

xxKxxKxxKxxKyyCxyxxKxxKxyym

xxKxxKxxKxxK

xx

yy

Nyyp

xyyxyy

ddddd

dddddd

ddd

d

ddd

dd

+=

−+=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

=

=

µµ

µ

µ

Predictants Data

Mean Cov.

All functions here are dependent upon hyperparameters.

The posterior mean and covariance equations follow simply from Gaussian identities.

Page 23: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

y

x

A Gaussian process can accommodate noise.

y

x

Page 24: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( )

( )

⎟⎟⎟

⎜⎜⎜

+

+

=+=

+=

−+=

=

2

2212

212

112

*1

****

1***

2

),(),(),(),(

),(),(

),(),(),(),()|(

)(),(),()()|(

,;)|(

σ

σ

σ

µµ

σ

xxKxxKxxKxxK

IxxKxxV

xxKxxVxxKxxKzyCxzxxVxxKxzym

IyzNyzp

ddddd

ddddd

dddddd

ddddd Identity matrix

We usually consider making independent and identically distributed (IID) Gaussian noisy measurements z, of y; giving

Page 25: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

More generally, we could consider correlated noise, in which the noise contribution could itself be drawn from a GP.

Page 26: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 27: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 28: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 29: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 30: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 31: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 32: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 33: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We often want to address functions of time, using Gaussian processes for tracking.

Page 34: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

The prior mean function µ(x;φ) should be our best guess (of any form) for the function y(x) before any observations are made.

Page 35: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

The prior mean function is the function our inference will default to far from observations.

Page 36: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Predictions required Mean function Interpolation µ(x;φ)=mean(yd ). Extrapolation Bespoke model built

using domain knowledge.

It’s rarely worth using a complicated mean function (with many hyperparameters), unless we’re concerned with prediction far from our observations.

Page 37: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Function type Covariance function Improbably smooth Squared exponential. Less smooth Matérn. Polynomial Polynomial.

There are a huge number of covariance functions (in spite of the requirement that they be positive semi-definite) appropriate for modelling functions of different types.

Page 38: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( ));,(;, 2 wxxdhwxxK jiji κ=

squared output scale

distance function

input scale

decreases with increasing d

( )wxx

wxxd jiji

−=;,e.g.

Many covariance functions (including the squared exponential and Matérn) are of the metric form

Page 39: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

stationary functions non-stationary function

We often want distances that are stationary (a function of x1-x2), implying that the function looks similar throughout its domain.

Page 40: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

w

h2

w

h

The hyperparameters h and w specify our expected length scales of the function in output (‘height’) and input (‘width’) spaces respectively.

Page 41: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

The squared exponential and Matérn covariances allow us to model functions of various degrees of smoothness.

Page 42: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( )( ) ( ) ( )( ) ( ));,(exp,,;,

);,(3exp);,(31,,;,

);,(exp,;,

221

Mtn

223

Mtn

2212

SE

wxxdhwhxxK

wxxdwxxdhwhxxK

wxxdhwhxxK

jiji

jijiji

jiji

−==

−+==

−=

ν

ν

The squared exponential and Matérn covariances allow us to model functions of various degrees of smoothness.

Page 43: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( )djTiji xWx+c=Wx,;x,xK 2

P

Polynomial covariances exist to model functions that are known to be polynomial.

Page 44: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜⎜⎜

−−

−−

=

=

2100121001210012

)ugly(

1K

Kfunction, one that gives a sparse precision matrix. This allows efficient computation.

The Kalman filter is a Gaussian process with a special covariance

Page 45: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

× ( × + ) e.g.

We can create new covariance functions by adding or multiplying other covariance functions.

Page 46: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

When a function is the sum of two independent functions, use a covariance that is the sum of the covariances for those two functions.

Page 47: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( ) ( )( )( ) )(x)µ(xµx,xK

+)(x)µ(xµx,xK+x,xKx,xK=x,xK

a(x)b(x)=y(x)

aab

bba

bay

2121

2121

212121

When a function is the product of two independent functions, use a covariance that is (almost) the product of the covariances for those two functions.

Page 48: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

1-dim

n-dim derivative

periodic

Squared exponential

Matern Polynomial Inputs

We can also modify covariance functions.

Page 49: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( ) ( )jijiji xxWxxWxxd −−= −1T;,

⎟⎟⎠

⎞⎜⎜⎝

⎛=

10001

We.g.

We can modify covariance functions to accommodate multiple input dimensions, using

Page 50: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

x

Hence we do not need simultaneous observations of all outputs.

If there are multiple outputs, reframe the problem as having a single output, and an additional label input specifying the output.

Page 51: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( )( ) ( ) ( )jijijjii llKxxKlxlxK ,,,,, =

separable for convenience

If L is not too large, we could use the spherical parameterisation.

If the inputs were previously x, and outputs were labelled by l = 1, ..., L, we now need to specify a covariance over both x and l, e.g.

Page 52: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Many other modifications are possible, to build covariances allowing for e.g. changepoints, faults and sets.

Page 53: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛ −

Txx

πw

=,xxd jiji sin1

We can modify covariance functions for functions known to be periodic, by using the distance

Page 54: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Gaussian distributed variables are joint Gaussian with any affine transform of them.

Page 55: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A function over which we have a Gaussian process is joint Gaussian with any integral or derivative of it, as integration and differentiation are affine.

Page 56: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( )

( ) ( )j

i

i

xxxxjiDD

xxjjiD

xxKxx

xxK

xxKx

xxK

==

=

∂=

∂=

'

, ','

,

,,derivative observation at xi and function observation at xj.

derivative observation at xi and derivative observation at xj.

We can modify covariance functions to manage derivative or integral observations.

Page 57: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can modify the squared exponential covariance to manage derivative observations.

Page 58: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can improve our inference by including observations of the gradient of a function.

Page 59: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can improve our inference by including observations of the gradient of a function.

Page 60: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can use observations of an integrand ℓ in order to perform inference for its integral, Z: this is known as Bayesian Quadrature.

Page 61: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Consider the integral . Bayesian quadrature achieves more accurate results than Monte Carlo, and provides an estimate of our uncertainty.

Page 62: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Unfortunately, these integrals are non-analytic.

The enormous flexibility afforded by covariance functions comes at a price: hyperparameters, which must be marginalised.

Page 63: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

likelihood prior

Given that we don’t to fix y*, the two important terms in our integrands are the likelihood and the prior (specifically, their product, proportional to the posterior for φ).

Page 64: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

The hyperparameter priors can have a significant influence on our inference. Prior A favours small input scale, prior C favours large input scale and prior B is uninformative.

Page 65: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Prior

Use what you know.

Selecting priors is easy.

Page 66: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Model ( I )

Predictions

Probability theory Our expectations

≈ ?

If probability theory makes ‘wrong’ predictions, then we have learned something!

Page 67: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Model ( I )

Predictions

Probability theory Our expectations

One of these two must be wrong

If probability theory makes ‘wrong’ predictions, then we have learned something!

Page 68: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

In this way, we are led to construct successively better models.

Page 69: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Marginalisation requires quadrature, which presents two challenges: integrand exploration, and integral estimation.

log-likelihood

hyperparameters

Page 70: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

log-likelihood

There are many different approaches to quadrature for probabilistic integrals; integrand estimation is usually undervalued.

Page 71: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Optimisation (as in maximum likelihood), particularly using global optimisers, gives a reasonable heuristic for exploring the integrand.

Page 72: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

However, maximum likelihood is an unreasonable way of estimating a multi-modal likelihood integrand: why throw away all those other samples?

Page 73: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Monte Carlo schemes give a another reasonable method of exploration.

Page 74: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Monte Carlo schemes give a fairly reasonable method of exploration; but a less reasonable means of integrand estimation.

Page 75: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Bayesian Monte Carlo uses samples obtained via Monte Carlo within a Bayesian quadrature framework to give an estimate for the integral.

Page 76: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

With Bayesian quadrature, we can also estimate the posterior distributions for any hyperparameters.

Page 77: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Likelihood Marginalisation

Unimodal or high dimensional

Maximum likelihood.

Multimodal or computationally expensive

Bayesian Monte Carlo.

There are many approaches to hyperparameter marginalisation, but only two are recommended.

Page 78: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can put Gaussian processes to work not just for regression, but also for classification.

Page 79: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

To do so, use a Gaussian process to model a latent variable, mapped through a sigmoid to a discrete class label.

Page 80: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Unfortunately, using this sigmoid makes inference intractable. Approximate inference can be achieved using a number of algorithms.

Algorithm Speed Accuracy

Laplace approximation

Very fast Low.

Expectation Propagation

Fast High.

Markov Chain Monte Carlo

Very slow Very high.

Page 81: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

x Objective function

y

We treat global optimisation as a Bayesian decision problem.

Page 82: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 83: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 84: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 85: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 86: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 87: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 88: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 89: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 90: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can also use Gaussian processes for optimisation.

Page 91: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛−

3

2

11

333231

232221

132111

vvv

KKKKKKKKK

The key computational bottleneck associated with Gaussian processes is resolving inv(K) v, or, equivalently, solving v = K x for x.

Page 92: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Covariance matrix Solving method

Poorly conditioned Improve conditioning, then see below.

(Just) positive semi-definite

Cholesky factorisation.

Toeplitz Toeplitz solver.

Kronecker product Kronecker solver.

Too big and dense Sparsification.

Updated version of previous matrix

Update, dependent on above.

Our choice of a method to solve v = K x for x depends on the structure of covariance K.

Page 93: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Inversion is slow, O(n3) in matrix size n.

Inversion is also unstable; conditioning errors are significant.

1

333231

232221

131211−

⎟⎟⎟

⎜⎜⎜

KKKKKKKKK

You should never actually invert a matrix.

Page 94: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

The condition number (cond) of a covariance matrix is the ratio of the largest to the smallest eigenvalue; in Matlab, things break down at about cond(K) = 1016.

1 0.9999 0 0

0.9999 1 0 0

0 0 1 0.1

0 0 0.1 1

Too similar

Conditioning becomes an issue when we have multiple close observations, giving rows in the covariance matrix that are very similar.

Page 95: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

1.01 0.9999 0 0

0.9999 1.01 0 0

0 0 1.01 0.1

0 0 0.1 1.01

Sufficiently dissimilar

The solution to conditioning problems is to add a small positive quantity (jitter) to the diagonal of the covariance matrix.

Page 96: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

As jitter is effectively imposed noise, adding jitter to all diagonal elements (unnecessarily) dilutes the informativeness of our data.

Page 97: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

==

=

nn

n

n

R

RRRRR

KR

RRK

00

0)chol( 222

11211

T

The Cholesky factorisation of a positive semi-definite matrix K is relatively fast (1/3 O(n3) in matrix size n) and more numerically stable.

Page 98: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

=

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

=

=

=

nnn

n

n

n x

xx

R

RRRRR

v

vvRxxxRv

Kxv

'

''

00

0

''

2

1

222

11211

2

1

T

The upper triangular Cholesky factor can then be stored and used to solve v = K x for x very quickly (O(n2) in matrix size n) by back substitution.

Page 99: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜⎜⎜⎜

=

1

1234

2123

3212

4321

kk

kkkkkkkkkkkk

kkkkk

K

n

n

If K is Toeplitz, there exists a very efficient method to solve v = K x for x (O(4n2) in matrix size n).

A symmetric matrix K is Toeplitz if it can be written as

Page 100: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜⎜⎜⎜⎜

1007.605.1311.103.0007.601007.605.1311.103.005.137.601007.605.1311.103.011.15.137.601007.605.1311.103.011.15.137.601007.605.13003.011.15.137.601007.600003.011.15.137.60100

=K

A Gaussian process has a Toeplitz covariance matrix if we have linearly spaced observations and a stationary covariance function.

Page 101: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

e.g.

Some special large matrices can be represented in a compact way using the Kronecker product.

Page 102: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

( ) ( )( ) ( )bbaa

baba

vKvK

vvKKx11

1

−−

⊗=

⊗⊗=

size na ₓ nb

size na size nb

Recall that solving operations are typically O

(n3)!

If K is a Kronecker product, there exists a very efficient method to solve v = K x for x (particularly when v is itself a Kronecker product):

Page 103: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A Gaussian process will have a Kronecker product for a covariance matrix if we use a product covariance function and a grid of samples.

Page 104: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

⎟⎟⎟⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜⎜⎜⎜

=

nnK

KKKKK

KKKKK

K

0

000

0000

4443

343332

232221

1211

If a very large covariance matrix is not decomposable as a Kronecker product (or otherwise), we may wish to attempt sparsification.

Page 105: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

There are many ways to sparsify our data; the simplest involve selecting a subset. Windowing represents a reasonable way to do this.

Page 106: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

A Gaussian process assumes all variables f are correlated.

[image courtesy C. Rasmussen]

Page 107: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Imagine introducing additional, unobserved inducing variables u.

[image courtesy C. Rasmussen]

Page 108: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We can sparsify data by using inducing variables to mediate the interactions between test and training variables.

[image courtesy C. Rasmussen]

Page 109: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

There are many such schemes for sparsification, that differ in the choice of inducing inputs.

[image courtesy C. Rasmussen]

Page 110: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

Finally, if we already have the Cholesky factor

),chol( 1111 KR =

we can efficiently determine the updated factor

,chol0 2212

1211

22

1211⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=⎟⎟⎠

⎞⎜⎜⎝

KKKK

RRR

and similar for other types of Cholesky updates and downdates, and for solutions based upon them. A Toeplitz update is probably also possible.

Page 111: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

We want to evaluate a large number of hyperparameter samples to explore hyperparameter space. Fortunately, each sample can be evaluated in parallel (possibly on a graphics card).

Page 112: An Introduction to Fitting Gaussian Processes to DataA dot-to-dot is a problem with many possible ... best guess (of any form) for the function y(x) ... The Kalman filter is a Gaussian

I hope you have learned how to fit Gaussian processes to data.