Gaussian Processes
Daniel McDuff
MIT Media Lab
December 2, 2010
Quote of the Day
"Only in Britain could it be thought a defect to be too clever by half. The probability is that too many people are too stupid by three-quarters."
Former Prime Minister John Major, assessing the priors on intelligence in the UK.
Outline
1 Introduction
2 Theory
3 Regression
4 Classification
5 Applications
6 Summary
Introduction
Books and Resources
Gaussian Processes for Machine Learning - C. Rasmussen and C. Williams. MATLAB code available to accompany the book.
Information Theory, Inference, and Learning Algorithms - D. MacKay.
Video tutorials, slides, software: www.gaussianprocess.org
Notation
D = training dataset, {(x_i, y_i) | i = 1, ..., n}
x = training feature vector
X = training features
y = training labels
x* = test feature vector
y* = test label
GP(m(x), k(x, x')) = Gaussian process with mean function m(x) and covariance function k(x, x').
Parametric Models vs. Non-parametric Models
Non-parametric models differ from parametric models: the model structure is not specified a priori but is determined from the data.
Parametric models:
Support Vector Machines
Neural Networks
Non-parametric models:
KNN
Gaussian processes
Supervised Learning
Our problem: to map from a finite set of training examples to a function that makes predictions for all possible inputs.
Restrict the class of functions that we consider. Example: linear regression.
-ve: If the wrong form of function is chosen, predictions will be poor.
Apply prior probabilities to the functions that we consider more likely.
-ve: There are infinitely many possible functions to consider.
Theory
Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A Gaussian process is a generalization of the Gaussian probability distribution.
It is a distribution over functions rather than a distribution over vectors.
It is a non-parametric method of modeling data.
A Gaussian process is infinite-dimensional; however, we only ever work with a finite subset of function values.
Fortunately, this yields the same inference as if we were to consider all the infinite possibilities.
Theory
Gaussian distributions
N(µ, Σ)
Distribution over vectors.
Fully specified by a mean and covariance.
The position of the random variable in the vector plays the role of the index.
Gaussian processes
GP(m(x), k(x, x'))
Distribution over functions.
Fully specified by a mean function and a covariance function.
The argument of the random function plays the role of the index.
Theory
The covariance function defines the properties of the functions in function space.
Data points "anchor" the function at specific locations.
Figure: Left: samples drawn from the prior distribution over functions, with no observed datapoints. Right: samples drawn from the posterior distribution over functions, with two observed datapoints. The solid black line shows the mean and the shaded region twice the standard deviation. Credit: Rasmussen & Williams.
Theory
What do we need to define?
Mean function, m(x) - usually defined to be zero; justified by preprocessing (e.g. de-meaning) the data.
Covariance function, k(x, x') - this defines the prior properties of the functions considered for inference:
Stationarity
Smoothness
Length-scales
Gaussian processes provide a well-defined approach for learning the model and hyperparameters from the data.
Regression
Prediction of a continuous quantity, y*, from an input, x*.
We will assume the prediction to have two parts:
A systematic variation: accurately predicted by an underlying process f(x).
A random variation: the unpredictability of the system.
The result:

y_i = f(x_i) + ε_i    (1)
Bayesian Linear Regression
A simple example of Gaussian Processes:
y = x^T w + ε    (2)

Estimate a distribution over the weights.
Marginalise over the weights to derive the prediction.
Predictive distribution:

f*|x*, X, y ~ N( σ_n^{-2} x*^T A^{-1} X y,  x*^T A^{-1} x* )    (3)

where A = σ_n^{-2} X X^T + Σ_p^{-1} and Σ_p is the prior covariance of the weights.
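A minimal NumPy sketch of the predictive distribution (3), assuming the weight-space form from Rasmussen & Williams with A = σ_n^{-2} X X^T + Σ_p^{-1}; the noise level sigma_n and the prior weight covariance Sigma_p used below are placeholder assumptions, not values taken from the slides.

import numpy as np

def blr_predict(X, y, x_star, sigma_n=0.1, Sigma_p=None):
    """Bayesian linear regression predictive mean and variance, eq. (3).
    X: (d, n) training inputs (one column per example), y: (n,) targets,
    x_star: (d,) test input."""
    d, n = X.shape
    if Sigma_p is None:
        Sigma_p = np.eye(d)                          # assumed prior covariance of the weights
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma_n**2       # sigma_n^{-2} x*^T A^{-1} X y
    var = x_star @ A_inv @ x_star                    # x*^T A^{-1} x*
    return mean, var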
Projections Into Feature Space
In linear regression we constrain the relationship between the observations and the output to be linear. An alternative is to use a kernel:

Φ = [ φ(||x_1 − x_1||)  φ(||x_1 − x_2||)  ...  φ(||x_1 − x_n||)
      φ(||x_2 − x_1||)  φ(||x_2 − x_2||)  ...  φ(||x_2 − x_n||)
      ...
      φ(||x_n − x_1||)  φ(||x_n − x_2||)  ...  φ(||x_n − x_n||) ]

Each entry measures the similarity between a pair of feature points.
Projections Into Feature Space
Change the model to:
y = φ(x)^T w + ε    (4)
Where φ() projects x into a new space.
The predictive distribution becomes:
f*|x*, X, y ~ N( σ_n^{-2} φ(x*)^T A^{-1} Φ y,  φ(x*)^T A^{-1} φ(x*) )    (5)

where Φ is the aggregation of the φ(x) vectors for all points in the training set and A = σ_n^{-2} Φ Φ^T + Σ_p^{-1}.
Function Space View
To generate functions we draw random Gaussian vectors with a covariance matrix defined by the input points:

f* ~ N(0, K(X*, X*))    (6)
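A sketch of equation (6) in NumPy: draw random Gaussian vectors whose covariance matrix is K(X*, X*) evaluated on a grid of test inputs. The squared-exponential kernel, the test grid, and the small diagonal "jitter" added before the Cholesky factorisation are all assumptions made for illustration.

import numpy as np

def se_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input vectors a and b."""
    sqdist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sqdist / lengthscale**2)

x_star = np.linspace(-5, 5, 100)                          # test inputs X*
K = se_kernel(x_star, x_star)                             # K(X*, X*)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x_star)))    # jitter for numerical stability
samples = L @ np.random.randn(len(x_star), 3)             # three draws of f* ~ N(0, K(X*, X*))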
Function Space View
Since we are interested in making predictions rather than just generating functions, we want to incorporate the knowledge contained in the training data. For a GP with zero mean, covariance function K(X, X), and observation noise variance σ_n^2, the joint distribution of the training outputs and the test outputs is:

[y, f*]^T ~ N( 0, [ K(X, X) + σ_n^2 I,  K(X, X*);  K(X*, X),  K(X*, X*) ] )

The predictive equations for GP regression:

f*|X, y, X* ~ N( m(x*), cov(f*) )    (7)

where:
m(x*) = K(X*, X) [K(X, X) + σ_n^2 I]^{-1} y
cov(f*) = K(X*, X*) − K(X*, X) [K(X, X) + σ_n^2 I]^{-1} K(X, X*)
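The predictive equations (7) as a NumPy sketch. The squared-exponential kernel, the noise level, and the toy data are assumptions, and the matrix inverse is formed explicitly for clarity rather than via a Cholesky factorisation (see the Complexity slide later).

import numpy as np

def se_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    return signal_var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def gp_predict(X, y, X_star, sigma_n=0.1):
    """GP regression predictive mean and covariance, eq. (7), for 1-D inputs."""
    K = se_kernel(X, X) + sigma_n**2 * np.eye(len(X))    # K(X, X) + sigma_n^2 I
    K_s = se_kernel(X_star, X)                           # K(X*, X)
    K_ss = se_kernel(X_star, X_star)                     # K(X*, X*)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mean, cov

# Toy usage (assumed data):
X = np.array([-1.5, -0.5, 0.3, 1.2])
y = np.sin(X)
mean, cov = gp_predict(X, y, np.linspace(-2, 2, 50))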
MATLAB Example
MATLAB Example. Code adapted from GPML demo - gaussianprocess.org
Covariance Functions
The covariance function represents the covariance between pairs of random variables.

K_ij = k(x_i, x_j)    (8)

Kxx = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
        k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
        ...
        k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
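A generic sketch of equation (8): build Kxx by evaluating a kernel k on every pair of training points. The function name gram_matrix and the linear kernel in the example are assumptions for illustration.

import numpy as np

def gram_matrix(X, k):
    """Kxx with entries K[i, j] = k(x_i, x_j), for rows x_i of X."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
    return K

# Example with a linear (dot-product) kernel on 2-D inputs:
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K = gram_matrix(X, lambda a, b: a @ b)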
Examples:
Squared exponential
Matern
γ -exponential
Rational quadratic
Noise (i.e. White noise - flat spectrum).
Covariance Functions
The covariance function must be:
Positive semi-definite
Symmetric
Covariance functions can be split broadly into two groups:
Stationary: a function of x_i − x_j. Invariant to translations in the input space.
Non-stationary: the covariance varies under translation.
An important subset of the non-stationary covariance functions are the dot-product covariance functions:
Linear - k(x_i, x_j) = ⟨x_i, x_j⟩
Polynomial - k(x_i, x_j) = (⟨x_i, x_j⟩ + c)^d, for an offset c and degree d
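The two dot-product covariance functions above written directly as code; the default offset c and degree d are arbitrary assumptions.

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d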
Squared Exponential
Commonly used covariance function:
K_ij = k(x_i, x_j) = α exp{ −||x_i − x_j||^2 / (2σ_1^2) }    (9)

Hyperparameters:
σ_1 = characteristic lengthscale
α = signal variance

The covariance is large for inputs that are close to one another and decreases as the distance between them in the input space increases.
Squared Exponential
We normally assume some prediction noise:
K_ij = k(x_i, x_j) = α exp{ −||x_i − x_j||^2 / (2σ_1^2) } + σ_n^2 δ_ij    (10)

Hyperparameters:
σ_1 = characteristic lengthscale
α = signal variance
σ_n^2 = noise variance
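Equation (10) as a small NumPy function; the Kronecker delta δ_ij becomes an identity matrix, and the default hyperparameter values are placeholders rather than values from the slides.

import numpy as np

def se_kernel_noisy(X, alpha=1.0, sigma_1=1.0, sigma_n=0.1):
    """K_ij = alpha exp(-||x_i - x_j||^2 / (2 sigma_1^2)) + sigma_n^2 delta_ij, for rows x_i of X."""
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-sqdist / (2 * sigma_1**2)) + sigma_n**2 * np.eye(len(X))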
Noise
The impact of different values of noise variance:
Figure: Comparison of Gaussian processes with different values of noise variance.Left: Too great?, Middle: Just right?, Right: Too small?
Lengthscales
Figure: Comparison of Gaussian processes with different lengthscales.
Learning in Gaussian Processes
Bayesian evidence is the probability of the data given the model.
Figure: Complex models that account for many datasets only achieve modest evidence. If the model is too simple, evidence may be high but only for a few datasets (Credit: Carl Rasmussen).
Remember Occam's Razor.
Learning in Gaussian Processes
One approach to this is to maximise the marginal likelihood:

p(y|X, θ) = N(y; 0, K_θ(X, X) + σ_n^2 I)    (12)

In log form:

log p(y|X, θ) = −(1/2) y^T Σ_y^{-1} y − (1/2) log|Σ_y| − (n/2) log(2π)    (13)

where:
Σ_y = K_θ(X, X) + σ_n^2 I
Data fit: −(1/2) y^T Σ_y^{-1} y
Complexity penalty: −(1/2) log|Σ_y|
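A sketch of equation (13), split into the data-fit and complexity terms listed above. Using a Cholesky factor to obtain Σ_y^{-1} y and log|Σ_y| is my choice for numerical stability, not something prescribed by the slides.

import numpy as np

def log_marginal_likelihood(K_theta, y, sigma_n):
    """log p(y | X, theta) with Sigma_y = K_theta(X, X) + sigma_n^2 I, eq. (13)."""
    n = len(y)
    L = np.linalg.cholesky(K_theta + sigma_n**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Sigma_y^{-1} y
    data_fit = -0.5 * y @ alpha                            # -1/2 y^T Sigma_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))               # -1/2 log |Sigma_y|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)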
Learning in Gaussian Processes
Optimising the hyperparameters:
Partial derivatives with respect to each θ_j (see Appendix for matrix identities):

∂/∂θ_j log p(y|X, θ) = (1/2) y^T K^{-1} (∂K/∂θ_j) K^{-1} y − (1/2) tr(K^{-1} ∂K/∂θ_j)    (14)
                     = (1/2) tr( (K^{-1} y y^T K^{-1} − K^{-1}) ∂K/∂θ_j )    (15)

where K denotes K_θ(X, X) + σ_n^2 I. We cannot necessarily guarantee that we will find a global optimum here, and different solutions may lead to different interpretations of the data.
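Equation (15) as a sketch: the gradient of the log marginal likelihood with respect to a single hyperparameter θ_j, given the matrix of partial derivatives ∂K/∂θ_j. How dK_dtheta is computed depends on the chosen covariance function, so it is assumed to be supplied by the caller.

import numpy as np

def lml_gradient(K, y, dK_dtheta):
    """(1/2) tr((K^{-1} y y^T K^{-1} - K^{-1}) dK/dtheta_j), where K includes the noise term."""
    K_inv = np.linalg.inv(K)
    a = K_inv @ y                                # K^{-1} y
    inner = np.outer(a, a) - K_inv               # K^{-1} y y^T K^{-1} - K^{-1}
    return 0.5 * np.trace(inner @ dK_dtheta)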
Impact of Covariance Functions
Figure: Data generated from a GP with an affine mean function and a Matern covariance function, with Gaussian noise (plot of y against x).
Impact of Covariance Functions
Figure: GP fit with zero mean and RQ covariance. Negative log marginal likelihood: 12.54.
Impact of Covariance Functions
Figure: GP fit with affine mean and RQ covariance. Negative log marginal likelihood: 6.90.
Impact of Covariance Functions
Figure: GP fit with zero mean and SE covariance. Negative log marginal likelihood: 14.13.
Impact of Covariance Functions
Figure: GP fit with affine mean and SE covariance. Negative log marginal likelihood: 7.07.
Classification
Prediction of discrete classes, y ∈ {−1, +1}, from input, x*.
Classification with a regression model:
Map the output of GP regression from the range (−∞, +∞) to probabilities in the range [0, 1].
One possible method - logistic regression:

P(y = +1|x, w) = 1 / (1 + exp(−w^T x))    (16)
P(y = −1|x, w) = 1 − P(y = +1|x, w)
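A sketch of the squashing step in equation (16): map a real-valued score through the logistic function to obtain class probabilities in [0, 1]. The helper names here are hypothetical.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_probabilities(w, x):
    """P(y = +1 | x, w) and P(y = -1 | x, w) from eq. (16)."""
    p_pos = logistic(w @ x)
    return p_pos, 1.0 - p_pos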
Classification
There is a more general form of classification. We will not go into the details here; however, the concept is similar to the regression case. It once again involves marginalising over the model parameter distribution.
Unfortunately, unlike the regression case, there is not a simple analytical solution.
Classification
In classification, the posterior P(f|D, Θ) is intractable because it involves an integral over the product of a Gaussian and a product of sigmoids.
Approximation methods:
Laplace approximation [Barber & Williams]
Variational methods [Gibbs & MacKay]
Expectation-Propagation [Minka & Ghahramani]
MCMC sampling [Neal]
Complexity
The complexity of Gaussian processes is of order N^3, where N is the number of data points, due to the inversion of the N × N matrix.

σ_n^{-2} (σ_n^{-2} X X^T + Σ_p^{-1})^{-1}    (17)
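The O(N^3) cost comes from factorising (or inverting) an N × N matrix. A common way to organise the computation, as in Algorithm 2.1 of Rasmussen & Williams, is to form a Cholesky factor once and reuse it for every subsequent solve; a sketch, assuming the function-space matrix K(X, X) + σ_n^2 I:

import numpy as np

def gp_solve(K, y, sigma_n):
    """Solve (K + sigma_n^2 I) alpha = y via a single O(N^3) Cholesky factorisation."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))    # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # two O(N^2) triangular solves
    return alpha, L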
Applications
Predicting financial markets:
Figure: S&P 500 closing prices over the month of November 2010 (price in 1000s, mean subtracted).
Applications
Predicting financial markets:
Figure: GP fit with zero mean and Matern covariance to the S&P 500 closing prices from November 2010 (price in 1000s, mean subtracted).
Applications
Predicting financial markets:
Figure: Google closing prices over the month of November 2010 (price in 1000s, mean subtracted).
Applications
Predicting financial markets:
Figure: GP fit with zero mean and Matern covariance to the Google closing prices from November 2010 (price in 1000s, mean subtracted).
Key Points
Gaussian processes are non-parametric.
They provide a structured method of model and parameter selection.
A Gaussian process is defined by a mean and covariance function.
Learning takes the form of setting the hyper-parameters. Occam's Razor is implicit.
GPs can be used for regression or classification.