
Gaussian Processes for Big Data Problems

Marc Deisenroth

Department of Computing, Imperial College London

http://wp.doc.ic.ac.uk/sml/marc-deisenroth

Machine Learning Summer School, Chalmers University, 14 April 2015


http://www.gaussianprocess.org/


Problem Setting

[Figure: noisy observations of f(x) plotted against x]

Objective

For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, find a distribution over functions $p(f)$ that explains the data

Probabilistic regression problem


(Some) Relevant Application Areas

• Global black-box optimization and experimental design

• Autonomous learning in robotics

• Probabilistic dimensionality reduction and data visualization


Table of Contents

• Introduction
  - Gaussian Distribution

• Gaussian Processes
  - Definition and Derivation
  - Inference
  - Covariance Functions
  - Model Selection
  - Limitations

• Scaling Gaussian Processes to Large Data Sets
  - Sparse Gaussian Processes
  - Distributed Gaussian Processes


Key Concepts in Probability Theory

Two fundamental rules:

$p(x) = \int p(x, y)\, \mathrm{d}y$   (sum rule / marginalization property)

$p(x, y) = p(y \mid x)\, p(x)$   (product rule)

Bayes’ Theorem (Probabilistic Inverse)

$p(x \mid y) = \dfrac{p(y \mid x)\, p(x)}{p(y)}, \qquad x:\ \text{hypothesis}, \quad y:\ \text{measurement}$

• Posterior belief: $p(x \mid y)$

• Prior belief: $p(x)$

• Likelihood (measurement model): $p(y \mid x)$

• Marginal likelihood (normalization constant): $p(y)$


The Gaussian Distribution

$p(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\bigl(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\bigr)$

• Mean vector $\mu$: average of the data

• Covariance matrix $\Sigma$: spread of the data

[Figure: univariate Gaussian density p(x) with its mean and 95% confidence bound]
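As a numerical companion, the log of this density can be evaluated stably via a Cholesky factorization. A minimal MATLAB sketch; the particular numbers are made up for illustration:

% Evaluate log N(x | mu, Sigma) via the Cholesky factor of Sigma
D = 2;
mu = [0; 1];
Sigma = [2 0.5; 0.5 1];
x = [0.3; 0.7];

L = chol(Sigma);                    % upper triangular, L'*L = Sigma
logdet = 2 * sum(log(diag(L)));     % log |Sigma|
z = L' \ (x - mu);                  % solves L'*z = x - mu
quad = z' * z;                      % (x - mu)' * inv(Sigma) * (x - mu)
logp = -0.5*D*log(2*pi) - 0.5*logdet - 0.5*quad;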


Sampling from a Multivariate Gaussian

Objective

Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.

However, we only have access to a random number generator that can sample $x$ from $\mathcal{N}(0, I)$...

Exploit that affine transformations $y = Ax + b$ of Gaussians remain Gaussian:

• Mean: $\mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b$

• Covariance: $\mathbb{V}_x[Ax + b] = A\,\mathbb{V}_x[x]\,A^\top$

1. Find conditions for $A, b$ to match the mean of $y$

2. Find conditions for $A, b$ to match the covariance of $y$


Sampling from a Multivariate Gaussian (2)

Objective

Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.

x = randn(D,1);            % sample x ~ N(0, I)
y = chol(Sigma)' * x + mu; % scale and shift x

Here chol(Sigma) is the Cholesky factor $L$, such that $L^\top L = \Sigma$.

Therefore, the mean and covariance of $y$ are

$\mathbb{E}[y] = \bar{y} = \mathbb{E}[L^\top x + \mu] = L^\top \mathbb{E}[x] + \mu = \mu$

$\mathrm{Cov}[y] = \mathbb{E}[(y - \bar{y})(y - \bar{y})^\top] = \mathbb{E}[L^\top x x^\top L] = L^\top \mathbb{E}[x x^\top]\, L = L^\top L = \Sigma$
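Put together, a runnable sketch (the particular mu and Sigma are made-up toy values); with many samples, the empirical mean and covariance should recover mu and Sigma:

D = 3;
mu = [1; 0; -2];
Sigma = [2.0 0.5 0.1;
         0.5 1.0 0.3;
         0.1 0.3 1.5];

N = 1e5;
X = randn(D, N);              % N i.i.d. samples from N(0, I)
Y = chol(Sigma)' * X + mu;    % each column is a sample from N(mu, Sigma)

mean(Y, 2)                    % approximately mu
cov(Y')                       % approximately Sigma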


Conditional

[Figure: joint Gaussian p(x, y), an observed value of y, and the resulting conditional p(x|y)]

$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$

$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$

$\mu_{x|y} = \mu_x + \Sigma_{xy}\,\Sigma_{yy}^{-1}(y - \mu_y)$

$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\,\Sigma_{yy}^{-1}\,\Sigma_{yx}$

The conditional $p(x \mid y)$ is also Gaussian. Computationally convenient.
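In code, conditioning is a couple of lines of linear algebra. A sketch for a 2-D joint over scalar x and y, with made-up numbers:

% Joint: [x; y] ~ N([mu_x; mu_y], [S_xx S_xy; S_yx S_yy])
mu_x = 0;   mu_y = 1;
S_xx = 2.0; S_xy = 0.8; S_yy = 1.0;
y_obs = -0.5;                                   % observed value of y

mu_cond = mu_x + S_xy / S_yy * (y_obs - mu_y);  % conditional mean of x | y
S_cond  = S_xx - S_xy / S_yy * S_xy;            % conditional variance of x | y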


Marginal

[Figure: joint Gaussian p(x, y) with its marginal p(x)]

$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$

$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}(\mu_x, \Sigma_{xx})$

• The marginal of a joint Gaussian distribution is Gaussian

• Intuitively: ignore (integrate out) everything you are not interested in
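In code, marginalization is just indexing: keep the blocks that belong to x and drop the rest (a sketch, continuing the toy joint from the conditioning example):

mu  = [0; 1];               % [mu_x; mu_y]
Sig = [2.0 0.8;
       0.8 1.0];            % [S_xx S_xy; S_yx S_yy]

mu_marg  = mu(1);           % mu_x
Sig_marg = Sig(1, 1);       % S_xx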


Gaussian Process

Definition: A Gaussian process (GP) is a collection of random variables $x_1, x_2, \dots$, any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system

$x_{t+1} = A x_t + w, \quad w \sim \mathcal{N}(0, Q)$

$x_1, x_2, \dots$ is a collection of random variables, any finite number of which is Gaussian distributed. This is a GP.

But we will use the Gaussian process for a different purpose, not for time series.


The Gaussian in the Limit

Consider the joint Gaussian distribution $p(x, \bar{x})$, where $x \in \mathbb{R}^D$ and $\bar{x} \in \mathbb{R}^k$, $k \to \infty$. Then

$p(x, \bar{x}) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x\\ \mu_{\bar{x}}\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{x\bar{x}}\\ \Sigma_{\bar{x}x} & \Sigma_{\bar{x}\bar{x}}\end{bmatrix}\right)$

where $\Sigma_{\bar{x}\bar{x}} \in \mathbb{R}^{k \times k}$ and $\Sigma_{x\bar{x}} \in \mathbb{R}^{D \times k}$, $k \to \infty$. However, the marginal remains finite:

$p(x) = \int p(x, \bar{x})\, \mathrm{d}\bar{x} = \mathcal{N}(\mu_x, \Sigma_{xx})$

where we integrate out an infinite number of variables $\bar{x}_i$.


Marginal and Conditional in the Limit

• In practice, we consider finite training and test data $x_{\text{train}}, x_{\text{test}}$

• Then, $x = \{x_{\text{train}}, x_{\text{test}}, x_{\text{other}}\}$ ($x_{\text{other}}$ plays the role of $\bar{x}$ from the previous slide)

$p(x) = \mathcal{N}\!\left(\begin{bmatrix}\mu_{\text{train}}\\ \mu_{\text{test}}\\ \mu_{\text{other}}\end{bmatrix}, \begin{bmatrix}\Sigma_{\text{train}} & \Sigma_{\text{train,test}} & \Sigma_{\text{train,other}}\\ \Sigma_{\text{test,train}} & \Sigma_{\text{test}} & \Sigma_{\text{test,other}}\\ \Sigma_{\text{other,train}} & \Sigma_{\text{other,test}} & \Sigma_{\text{other}}\end{bmatrix}\right)$

$p(x_{\text{train}}, x_{\text{test}}) = \int p(x_{\text{train}}, x_{\text{test}}, x_{\text{other}})\, \mathrm{d}x_{\text{other}}$

$p(x_{\text{test}} \mid x_{\text{train}}) = \mathcal{N}(\mu^*, \Sigma^*)$

$\mu^* = \mu_{\text{test}} + \Sigma_{\text{test,train}}\,\Sigma_{\text{train}}^{-1}(x_{\text{train}} - \mu_{\text{train}})$

$\Sigma^* = \Sigma_{\text{test}} - \Sigma_{\text{test,train}}\,\Sigma_{\text{train}}^{-1}\,\Sigma_{\text{train,test}}$


Back to Regression: Distribution over Functions

• We are not really interested in a distribution on $x$, but rather in a distribution over function values $f(x)$

• Let's replace $x$ with function values $f = f(x)$ (treat a function as a long vector of function values)

$p(f, \bar{f}) = \mathcal{N}\!\left(\begin{bmatrix}\mu_f\\ \mu_{\bar{f}}\end{bmatrix}, \begin{bmatrix}\Sigma_{ff} & \Sigma_{f\bar{f}}\\ \Sigma_{\bar{f}f} & \Sigma_{\bar{f}\bar{f}}\end{bmatrix}\right)$

where $\Sigma_{\bar{f}\bar{f}} \in \mathbb{R}^{k \times k}$ and $\Sigma_{f\bar{f}} \in \mathbb{R}^{N \times k}$, $k \to \infty$.

• Again, the marginal remains finite:

$p(f) = \int p(f, \bar{f})\, \mathrm{d}\bar{f} = \mathcal{N}(\mu_f, \Sigma_{ff})$


Marginal and Conditional over Functions

Define $f_* := f_{\text{test}}$, $f := f_{\text{train}}$.

• Marginal:

$p(f_*, f) = \mathcal{N}\!\left(\begin{bmatrix}\mu_f\\ \mu_*\end{bmatrix}, \begin{bmatrix}\Sigma_{ff} & \Sigma_{f*}\\ \Sigma_{*f} & \Sigma_{**}\end{bmatrix}\right)$

• Conditional (predictive distribution):

$p(f_* \mid f) = \mathcal{N}(m, S)$

$m = \mu_* + \Sigma_{*f}\,\Sigma_{ff}^{-1}(f - \mu_f)$

$S = \Sigma_{**} - \Sigma_{*f}\,\Sigma_{ff}^{-1}\,\Sigma_{f*}$

We need to compute (cross-)covariances between unknown function values. The kernel trick helps us out.


Kernelization

$p(f_*, f) = \mathcal{N}\!\left(\begin{bmatrix}\mu_f\\ \mu_*\end{bmatrix}, \begin{bmatrix}\Sigma_{ff} & \Sigma_{f*}\\ \Sigma_{*f} & \Sigma_{**}\end{bmatrix}\right)$

$p(f_* \mid f) = \mathcal{N}(m, S), \quad m = \mu_* + \Sigma_{*f}\,\Sigma_{ff}^{-1}(f - \mu_f), \quad S = \Sigma_{**} - \Sigma_{*f}\,\Sigma_{ff}^{-1}\,\Sigma_{f*}$

• A kernel function $k$ is symmetric and positive definite.

• The kernel computes covariances between unknown function values $f(x_i)$ and $f(x_j)$ by just looking at the corresponding inputs $x_i, x_j$:

$\Sigma_{ff}^{(i,j)} = \mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$

• This yields the predictive distribution (a code sketch follows below):

$p(f_* \mid f, X, X_*) = \mathcal{N}(m, S)$

$m = \mu_* + k(X_*, X)\,k(X, X)^{-1}(f - \mu_f)$

$S = K_{**} - k(X_*, X)\,k(X, X)^{-1}\,k(X, X_*)$
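A minimal sketch of these predictive equations, assuming 1-D inputs, zero means ($\mu_f = \mu_* = 0$), noise-free observed function values f, and the Gaussian kernel introduced later in the tutorial; a small jitter term keeps k(X, X) well conditioned:

theta1 = 1.0;  theta2 = 1.0;                 % assumed kernel parameters
kern = @(A, B) theta1^2 * exp(-(A - B').^2 / theta2^2);

X  = [-3; -1; 0; 2];                         % training inputs
f  = sin(X);                                 % observed function values (noise-free)
Xs = linspace(-4, 4, 100)';                  % test inputs X*

Kff = kern(X, X) + 1e-10 * eye(numel(X));    % k(X, X) plus jitter
Ksf = kern(Xs, X);                           % k(X*, X)
m   = Ksf * (Kff \ f);                       % predictive mean
S   = kern(Xs, Xs) - Ksf * (Kff \ Ksf');     % predictive covariance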


GP Regression as a Bayesian Inference Problem

Objective

For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, find a distribution over functions $p(f)$ that explains the data

Training data: $X, y$. Bayes' theorem yields

$p(f \mid X, y) = \dfrac{p(y \mid f, X)\, p(f)}{p(y \mid X)}$

Prior: $p(f) = \mathcal{GP}(m, k)$. Specify a mean function $m$ and a kernel $k$.

Likelihood (noise model): $p(y \mid f, X) = \mathcal{N}\bigl(f(X), \sigma_\varepsilon^2 I\bigr)$

Marginal likelihood (evidence): $p(y \mid X) = \int p(y \mid f, X)\, p(f)\, \mathrm{d}f$

Posterior: $p(f \mid y, X) = \mathcal{GP}(m_{\text{post}}, k_{\text{post}})$

$m_{\text{post}}(x_i) = m(x_i) + k(X, x_i)^\top (K + \sigma_\varepsilon^2 I)^{-1}\bigl(y - m(X)\bigr)$

$k_{\text{post}}(x_i, x_j) = k(x_i, x_j) - k(X, x_i)^\top (K + \sigma_\varepsilon^2 I)^{-1}\, k(X, x_j)$


GP Regression as a Bayesian Inference Problem (2)

Posterior Gaussian process:

$m_{\text{post}}(x_i) = m(x_i) + k(X, x_i)^\top (K + \sigma_\varepsilon^2 I)^{-1}\bigl(y - m(X)\bigr)$

$k_{\text{post}}(x_i, x_j) = k(x_i, x_j) - k(X, x_i)^\top (K + \sigma_\varepsilon^2 I)^{-1}\, k(X, x_j)$

Predictive distribution $p(f_* \mid X, y, x_*)$ at test inputs $x_*$:

$p(f_* \mid X, y, x_*) = \mathcal{N}(\mathbb{E}[f_*], \mathbb{V}[f_*])$

$\mathbb{E}[f_* \mid X, y, x_*] = m_{\text{post}}(x_*) = m(x_*) + k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1}\bigl(y - m(X)\bigr)$

$\mathbb{V}[f_* \mid X, y, x_*] = k_{\text{post}}(x_*, x_*) = k(x_*, x_*) - k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1}\, k(X, x_*)$

From now on: set the prior mean function $m \equiv 0$
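With $m \equiv 0$, prediction condenses to a few lines. A runnable sketch with the same assumed Gaussian kernel and made-up data; sigma_eps is the noise standard deviation:

theta1 = 1.0;  theta2 = 1.0;  sigma_eps = 0.1;
kern = @(A, B) theta1^2 * exp(-(A - B').^2 / theta2^2);

X  = linspace(-4, 4, 20)';                   % training inputs
y  = sin(X) + sigma_eps * randn(size(X));    % noisy observations
Xs = linspace(-5, 5, 200)';                  % test inputs x*

Ky    = kern(X, X) + sigma_eps^2 * eye(numel(X));   % K + sigma_eps^2 * I
Ksf   = kern(Xs, X);                                % cross-covariances k(X, x*)
mu_s  = Ksf * (Ky \ y);                             % E[f* | X, y, x*]
var_s = theta1^2 - sum(Ksf .* (Ky \ Ksf')', 2);     % V[f* | X, y, x*]; k(x*,x*) = theta1^2 here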


Illustration

[Figure: GP prior over f(x): zero mean with uncertainty band]

Prior belief about the function. Predictive (marginal) mean and variance:

$\mathbb{E}[f(x_*) \mid x_*, \varnothing] = m(x_*) = 0$

$\mathbb{V}[f(x_*) \mid x_*, \varnothing] = \sigma^2(x_*) = k(x_*, x_*)$
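Functions from this prior can be visualized by evaluating the kernel on a dense grid and applying the Cholesky sampling trick from earlier. A sketch; the diagonal jitter is a numerical safeguard, not part of the model:

theta1 = 1.0;  theta2 = 1.0;
kern = @(A, B) theta1^2 * exp(-(A - B').^2 / theta2^2);

xs = linspace(-5, 8, 200)';                  % dense grid of inputs
K  = kern(xs, xs) + 1e-10 * eye(numel(xs));  % prior covariance plus jitter
L  = chol(K);                                % L'*L = K

fs = L' * randn(numel(xs), 3);               % three function samples, prior mean 0
plot(xs, fs);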


Illustration

[Figure: GP posterior over f(x) after observing data]

Posterior belief about the function. Predictive (marginal) mean and variance:

$\mathbb{E}[f(x_*) \mid x_*, X, y] = m(x_*) = k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} y$

$\mathbb{V}[f(x_*) \mid x_*, X, y] = \sigma^2(x_*) = k(x_*, x_*) - k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1}\, k(X, x_*)$


Covariance Function

• A Gaussian process is fully specified by a mean function $m$ and a kernel/covariance function $k$

• The covariance function encodes high-level structural assumptions about the latent function $f$ (e.g., smoothness, differentiability, periodicity)


Gaussian Covariance Function

$k_{\text{Gauss}}(x_i, x_j) = \theta_1^2 \exp\bigl(-(x_i - x_j)^\top (x_i - x_j)/\theta_2^2\bigr)$

• $\theta_1$: amplitude of the latent function

• $\theta_2$: length scale. How far do we have to move in input space before the function value changes significantly? A smoothness parameter.


Matérn Covariance Function

$k_{\text{Mat},3/2}(x_i, x_j) = \theta_1^2 \left(1 + \frac{\sqrt{3}\,\|x_i - x_j\|}{\theta_2}\right) \exp\left(-\frac{\sqrt{3}\,\|x_i - x_j\|}{\theta_2}\right)$


Periodic Covariance Function

$k_{\text{per}}(x_i, x_j) = \theta_1^2 \exp\left(-\frac{2\sin^2\bigl(\theta_3 (x_i - x_j)\bigr)}{\theta_2^2}\right) = k_{\text{Gauss}}\bigl(u(x_i), u(x_j)\bigr), \quad u(x) = \begin{bmatrix}\cos(x)\\ \sin(x)\end{bmatrix}$

$\theta_3$: periodicity parameter
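As MATLAB function handles for scalar inputs (a sketch of the three formulas above; the Matérn uses the absolute difference as the distance, and all parameter values are placeholders):

theta1 = 1.0;  theta2 = 1.0;  theta3 = 1.0;

k_gauss = @(a, b) theta1^2 * exp(-(a - b').^2 / theta2^2);
k_mat32 = @(a, b) theta1^2 * (1 + sqrt(3)*abs(a - b')/theta2) ...
                  .* exp(-sqrt(3)*abs(a - b')/theta2);
k_per   = @(a, b) theta1^2 * exp(-2*sin(theta3*(a - b')).^2 / theta2^2);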


Model Selection

A GP is fully specified by a mean function $m$ and a covariance function (kernel) $k$. Both functions possess parameters $\{\theta_m, \theta_k\} =: \theta$.

• How do we find good parameters $\theta$?

• How do we choose $m$ and $k$?

Model selection


Model Selection I: Length-Scales

Length scales determine how wiggly the function is

[Figure: GP fits with different length scales to the same data]


Model Selection: Hyper-Parameters

GP Training: Find good GP hyper-parameters $\theta$ (kernel and mean function parameters)

• Assign a prior $p(\theta)$ over hyper-parameters

• Posterior over hyper-parameters:

$p(\theta \mid X, y) = \dfrac{p(\theta)\, p(y \mid X, \theta)}{p(y \mid X)}, \qquad p(y \mid X, \theta) = \int p(y \mid f(X))\, p(f \mid \theta)\, \mathrm{d}f$

• Choose MAP hyper-parameters $\theta^*$, such that

$\theta^* \in \arg\max_\theta \; \log p(\theta) + \log p(y \mid X, \theta)$

Maximize the marginal likelihood if $p(\theta) = \mathcal{U}$ (uniform prior)


Training via Marginal Likelihood Maximization

GP Training: Maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters)

$\theta^* \in \arg\max_\theta \; \log p(y \mid X, \theta)$

$\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K_\theta^{-1} y - \tfrac{1}{2} \log |K_\theta| + \text{const}$

• Automatic trade-off between data fit and model complexity

• Gradient-based optimization is possible (a code sketch follows below):

$\frac{\partial \log p(y \mid X, \theta)}{\partial \theta} = \tfrac{1}{2}\, y^\top K^{-1} \frac{\partial K}{\partial \theta} K^{-1} y - \tfrac{1}{2}\, \mathrm{tr}\!\left(K^{-1} \frac{\partial K}{\partial \theta}\right)$
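A sketch of the negative log-marginal likelihood for a 1-D Gaussian-kernel GP, with hyper-parameters handled in log space so they stay positive. fminsearch is derivative-free and used here only for brevity; a gradient-based optimizer would use the derivative above:

% neg_lml.m -- negative log-marginal likelihood
% lt = [log theta1; log theta2; log sigma_eps]
function val = neg_lml(lt, X, y)
  t1 = exp(lt(1));  t2 = exp(lt(2));  se = exp(lt(3));
  N = numel(X);
  K = t1^2 * exp(-(X - X').^2 / t2^2) + se^2 * eye(N);   % K_theta
  L = chol(K);                                           % L'*L = K
  a = L \ (L' \ y);                                      % K \ y via triangular solves
  val = 0.5*y'*a + sum(log(diag(L))) + 0.5*N*log(2*pi);
end

Usage, with training data X, y as column vectors:

lt0 = log([1; 1; 0.1]);                        % initial hyper-parameters
lt_opt = fminsearch(@(lt) neg_lml(lt, X, y), lt0);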


Example

[Figure: three GP fits to the same data, with log-marginal likelihood values LML = 20.4914, LML = 13.6109, and LML = 33.4869]

• Compromise between fitting the data and the simplicity of the model

• Not fitting the data is explained by a larger noise variance (treated as an additional hyper-parameter and learned jointly)


Model Selection II—Mean Function and Kernel

• Assume we have a finite set of models $M_i$, each one specifying a mean function $m_i$ and a kernel $k_i$. How do we find the best one?

• Assign a prior $p(M_i)$ for each model

• Posterior model probability:

$p(M_i \mid X, y) = \dfrac{p(M_i)\, p(y \mid X, M_i)}{p(y \mid X)}$

• Choose the MAP model $M^*$, such that

$M^* \in \arg\max_M \; \log p(M) + \log p(y \mid X, M)$

Compare marginal likelihoods if $p(M_i) = 1/|M|$

The marginal likelihood is central for model selection (a code sketch follows below)
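Under a uniform model prior this reduces to comparing log-marginal likelihoods. A sketch that loops over candidate kernels; the handles k_gauss, k_mat32, k_per are the ones sketched in the covariance-function section, and the hyper-parameters are held fixed here for brevity, where the slides optimize them per model:

sigma_eps = 0.1;
X = linspace(-4, 4, 20)';
y = sin(X) + sigma_eps * randn(size(X));

kernels = {k_gauss, k_mat32, k_per};          % candidate models M_i (assumed handles)
lml = zeros(numel(kernels), 1);
for i = 1:numel(kernels)
  K = kernels{i}(X, X) + sigma_eps^2 * eye(numel(X));
  L = chol(K);
  a = L \ (L' \ y);
  lml(i) = -0.5*y'*a - sum(log(diag(L))) - 0.5*numel(X)*log(2*pi);
end
[~, best] = max(lml);                         % MAP model under a uniform prior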

Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 32

Page 88: Gaussian Processes for Big Data Problemsmpd37/teaching/tutorials/2015-04-14-mlss.pdf · 4/14/2015  · Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 9. Sampling from a Multivariate

Model Selection II—Mean Function and Kernel

§ Assume we have a finite set of models Mi, each one specifying amean function mi and a kernel ki. How do we find the best one?

§ Assign a prior ppMiq for each model

§ Posterior model probability:

ppMi|X, yq “ppMiqppy|X, Miq

ppy|Xq

§ Choose MAP model M˚, such that

M˚ P arg maxM

log ppMq ` log ppy|X, Mq

Compare marginal likelihoods if ppMiq “ 1{|M|

Marginal likelihood is central for model selection

Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 32

Page 89: Gaussian Processes for Big Data Problemsmpd37/teaching/tutorials/2015-04-14-mlss.pdf · 4/14/2015  · Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 9. Sampling from a Multivariate

Model Selection II—Mean Function and Kernel

§ Assume we have a finite set of models Mi, each one specifying amean function mi and a kernel ki. How do we find the best one?

§ Assign a prior ppMiq for each model

§ Posterior model probability:

ppMi|X, yq “ppMiqppy|X, Miq

ppy|Xq

§ Choose MAP model M˚, such that

M˚ P arg maxM

log ppMq ` log ppy|X, Mq

Compare marginal likelihoods if ppMiq “ 1{|M|

Marginal likelihood is central for model selection

Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 32

Page 90: Gaussian Processes for Big Data Problemsmpd37/teaching/tutorials/2015-04-14-mlss.pdf · 4/14/2015  · Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 9. Sampling from a Multivariate

Model Selection II—Mean Function and Kernel

§ Assume we have a finite set of models Mi, each one specifying amean function mi and a kernel ki. How do we find the best one?

§ Assign a prior ppMiq for each model

§ Posterior model probability:

ppMi|X, yq “ppMiqppy|X, Miq

ppy|Xq

§ Choose MAP model M˚, such that

M˚ P arg maxM

log ppMq ` log ppy|X, Mq

Compare marginal likelihoods if ppMiq “ 1{|M|

Marginal likelihood is central for model selection

Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 32

Page 91: Gaussian Processes for Big Data Problemsmpd37/teaching/tutorials/2015-04-14-mlss.pdf · 4/14/2015  · Gaussian Processes Marc Deisenroth @MLSS, 14 April 2015 9. Sampling from a Multivariate

Model Selection II—Mean Function and Kernel

§ Assume we have a finite set of models Mi, each one specifying amean function mi and a kernel ki. How do we find the best one?

§ Assign a prior ppMiq for each model

§ Posterior model probability:

ppMi|X, yq “ppMiqppy|X, Miq

ppy|Xq

§ Choose MAP model M˚, such that

M˚ P arg maxM

log ppMq ` log ppy|X, Mq

Compare marginal likelihoods if ppMiq “ 1{|M|

Marginal likelihood is central for model selectionGaussian Processes Marc Deisenroth @MLSS, 14 April 2015 32

Example

[Figure: GP fits to the same data with four different kernels; constant kernel LML = −1.1073, linear kernel LML = −1.0065, Matérn kernel LML = −0.8625, Gaussian kernel LML = −0.69308]

§ Four different kernels (mean function fixed to m ≡ 0)
§ MAP hyper-parameters for each kernel
§ Log-marginal likelihood values for each (optimized) model (see the sketch below)
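A comparison of this kind is easy to reproduce with scikit-learn's GP implementation: each kernel's hyper-parameters are optimized during fitting, and the resulting log-marginal likelihoods can be compared directly. The toy data below is illustrative, not the slide's.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (ConstantKernel, DotProduct,
                                              Matern, RBF, WhiteKernel)

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)   # toy data

kernels = {
    "constant": ConstantKernel() + WhiteKernel(),
    "linear":   DotProduct() + WhiteKernel(),
    "matern":   Matern(nu=1.5) + WhiteKernel(),
    "gaussian": RBF() + WhiteKernel(),
}
for name, k in kernels.items():
    gpr = GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
    # LML at the optimized hyper-parameters; higher is better
    print(f"{name:9s} LML = {gpr.log_marginal_likelihood_value_:.4f}")
```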

Quick Summary

[Figure: GP posterior with mean and confidence bounds on the running regression example]

§ Gaussian processes are extremely flexible models
§ Consistent confidence bounds!
§ Based on simple manipulations of plain Gaussian distributions
§ Few hyper-parameters → generally easy to train
§ First choice for black-box regression

Application Areas

§ Reinforcement Learning and Robotics: Model value functions and/or dynamics with GPs
§ Bayesian Optimization (Experimental Design): Model unknown utility functions with GPs
§ Geostatistics: Spatial modeling (e.g., landscapes, resources)
§ Sensor networks

Limitations of Gaussian Processes

Computational and memory complexity:

§ Training scales in O(N³)
§ Prediction (variances) scales in O(N²)
§ Memory requirement: O(ND + N²)

→ Practical limit N ≈ 10,000

Table of Contents

Introduction
  Gaussian Distribution

Gaussian Processes
  Definition and Derivation
  Inference
  Covariance Functions
  Model Selection
  Limitations

Scaling Gaussian Processes to Large Data Sets
  Sparse Gaussian Processes
  Distributed Gaussian Processes

GP Factor Graph

[Figure: factor graph with training function values f(x_1), ..., f(x_N), test function values f_*1, ..., f_*L, and inducing function values f(u_1), ..., f(u_M)]

§ Probabilistic graphical model (factor graph) of a GP
§ All function values are jointly Gaussian distributed (e.g., training and test function values)
§ GP prior

p(f, f_*) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} K_{ff} & K_{f*} \\ K_{*f} & K_{**} \end{bmatrix} \right)

Inducing Variables

[Figure: factor graph augmented with hypothetical inducing function values f(u_1), ..., f(u_M) alongside the training and test function values]

§ Introduce inducing function values f_u → "hypothetical" function values
§ All function values are still jointly Gaussian distributed (e.g., training, test and inducing function values)

Central Approximation Scheme

§ Approximation: Training and test set are conditionally independent given the inducing function values: f ⊥⊥ f_* | f_u
§ Then, the effective GP prior is

q(f, f_*) = \int p(f | f_u)\, p(f_* | f_u)\, p(f_u)\, df_u
= \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} K_{ff} & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix} \right),
\qquad Q_{ab} := K_{a f_u} K_{f_u f_u}^{-1} K_{f_u b}
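As a small sketch (assuming a Gaussian kernel; not from the slides), the Q matrices are plain Nyström-style low-rank terms and can be computed as follows; all names are illustrative.

```python
import numpy as np

def rbf(A, B, ell=1.0, sf=1.0):
    """Gaussian kernel matrix between row-sets A (n x d) and B (m x d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def nystrom_Q(Xa, Xb, U, jitter=1e-8):
    """Q_ab = K_{a,fu} K_{fu,fu}^{-1} K_{fu,b}: the covariance that remains
    when a and b communicate only through the inducing inputs U."""
    Kuu = rbf(U, U) + jitter * np.eye(len(U))   # jitter for numerical stability
    return rbf(Xa, U) @ np.linalg.solve(Kuu, rbf(U, Xb))
```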

FI(T)C Sparse Approximation

§ Assume that training (and test sets) are fully independent given the inducing variables (Snelson & Ghahramani, 2006)
§ Effective GP prior with this approximation:

q(f, f_*) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} Q_{ff} - \operatorname{diag}(Q_{ff} - K_{ff}) & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix} \right)

§ Q_{**} − diag(Q_{**} − K_{**}) can be used instead of K_{**} → FIC
§ Training: O(NM²), Prediction: O(M²) (see the sketch below)
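Reusing rbf and nystrom_Q from the sketch above, the FITC covariance over the training outputs (my reading of the slide's formula, written naively for clarity) looks like this:

```python
import numpy as np

def fitc_train_cov(X, U, sn=0.1):
    """FITC effective prior over training outputs: Q_ff with its diagonal
    corrected back to the exact K_ff, plus observation noise."""
    Qff = nystrom_Q(X, X, U)
    Kff_diag = rbf(X, X).diagonal()                 # exact prior variances
    G = Qff - np.diag(Qff.diagonal() - Kff_diag)    # Q_ff - diag(Q_ff - K_ff)
    return G + sn**2 * np.eye(len(X))

# Inverting G naively is still O(N^3); the O(NM^2) training cost quoted above
# comes from exploiting the low-rank-plus-diagonal structure (Woodbury identity).
```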

Inducing Inputs

§ FI(T)C sparse approximation relies on inducing function values f(u_i), where u_i are the corresponding inputs
§ These inputs are unknown a priori → find "optimal" ones
§ Find them by maximizing the FI(T)C marginal likelihood with respect to the inducing inputs (and the standard hyper-parameters):

u_{1:M}^* \in \arg\max_{u_{1:M}} q_{\text{FITC}}(y | X, u_{1:M}, \theta)

§ Intuitively: The marginal likelihood is not only parameterized by the hyper-parameters θ, but also by the inducing inputs u_{1:M}
§ End up with a high-dimensional non-convex optimization problem with MD additional parameters (a joint-optimization sketch follows below)
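A toy joint optimization over the noise and the inducing inputs (gradient-free for brevity; kernel hyper-parameters held fixed) might look as follows. This reuses the FITC sketches above, and all sizes and names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_fitc_lml(flat, X, y, M, D):
    """Negative FITC log-marginal likelihood as a function of one flat
    vector holding the log-noise and the M*D inducing inputs."""
    sn, U = np.exp(flat[0]), flat[1:].reshape(M, D)
    G = fitc_train_cov(X, U, sn=sn)                 # naive O(N^3) version
    L = np.linalg.cholesky(G)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ a + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
M, D = 10, 1
U0 = X[rng.choice(len(X), M, replace=False)]        # initialize from a data subset
x0 = np.concatenate([[np.log(0.1)], U0.ravel()])
res = minimize(neg_fitc_lml, x0, args=(X, y, M, D), method="Nelder-Mead")
U_opt = res.x[1:].reshape(M, D)                     # "optimal" inducing inputs
```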

FITC Example

[Figure from Ed Snelson]

§ Pink: Original data
§ Red crosses: Initialization of inducing inputs
§ Blue crosses: Location of inducing inputs after optimization

→ Efficient compression of the original data set

Summary: Sparse Gaussian Processes

§ Sparse approximations typically approximate a GP with N data points by a model with M ≪ N data points
§ Selection of these M data points can be tricky and may involve non-trivial computations (e.g., optimizing inducing inputs)
§ Simple (random) subset selection is fast and generally robust (Chalupka et al., 2013)
§ Computational complexity: O(M³) or O(NM²) for training
§ Practical limit M ≤ 10⁴. Often: M ∈ O(10²) in the case of inducing variables
§ If we set M = N/100, i.e., each inducing function value summarizes 100 real function values, our practical limit is N ∈ O(10⁶)

→ Let's try something different

An Orthogonal Approximation: Distributed GPs

[Figure: inverse of the full kernel matrix approximated by the inverses of independent diagonal blocks]

§ Randomly split the full data set into M chunks
§ Place M independent GP models (experts) on these small chunks
§ Independent computations can be distributed
§ Block-diagonal approximation of kernel matrix K
§ Combine independent computations to an overall result

Training the Distributed GP

§ Split data set of size N into M chunks of size P
§ Independence of experts → factorization of marginal likelihood:

\log p(y | X, \theta) \approx \sum_{k=1}^{M} \log p_k(y^{(k)} | X^{(k)}, \theta)

§ Distributed optimization and training straightforward (see the sketch below)
§ Computational complexity: O(MP³) [instead of O(N³)], but distributed over many machines
§ Memory footprint: O(MP² + ND) [instead of O(N² + ND)]
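A sketch of the factorized objective, reusing nlml_and_grad from the earlier training sketch (which assumes 1-D inputs); the chunking and process-pool setup are illustrative, not the authors' implementation.

```python
import numpy as np
from multiprocessing import Pool

def expert_nlml(args):
    """NLML contribution of one expert's chunk under the shared theta."""
    theta, Xk, yk = args
    return nlml_and_grad(theta, Xk, yk)[0]          # O(P^3) per expert

def distributed_nlml(theta, chunks, pool):
    # Sum of independent per-expert NLMLs approximates the full NLML;
    # the M experts can be evaluated on separate cores or machines.
    return sum(pool.map(expert_nlml, [(theta, Xk, yk) for Xk, yk in chunks]))

# chunks = [(X[k::M], y[k::M]) for k in range(M)]   # split data into M parts
# with Pool() as pool:
#     val = distributed_nlml(theta, chunks, pool)
```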

Empirical Training Time

[Figure: computation time of the NLML and its gradients (seconds) vs. training data set size, for the distributed GP (DGP), the full GP, and FITC; secondary axis shows the number of GP experts (DGP)]

§ NLML is proportional to training time
§ Full GP (16K training points) ≈ sparse GP (50K training points) ≈ distributed GP (16M training points)

→ Push practical limit by order(s) of magnitude

Practical Training Times

§ Training* with N = 10⁶, D = 1 on a laptop: ≈ 30 min
§ Training* with N = 5 × 10⁶, D = 8 on a workstation: ≈ 4 hours

*: Maximize the marginal likelihood, stop when converged**
**: Convergence often after 30–80 line searches***
***: Line search ≈ 2–3 evaluations of the marginal likelihood and its gradient (usually O(N³))

Predictions with the Distributed GP

[Figure: predictions (μ_1, σ_1), (μ_2, σ_2), (μ_3, σ_3) of individual experts combined into one overall prediction (μ, σ)]

§ Prediction of each GP expert is Gaussian N(μ_i, σ_i²)
§ How to combine them to an overall prediction N(μ, σ²)?

Product-of-GP-experts:
§ PoE (product of experts) (Ng & Deisenroth, 2014)
§ gPoE (generalized product of experts) (Cao & Fleet, 2014)
§ BCM (Bayesian Committee Machine) (Tresp, 2000)
§ rBCM (robust BCM) (Deisenroth & Ng, 2015)

Objectives

[Figure: two computational graphs — a single-layer combination of three experts and a two-layer hierarchical combination of nine experts]

§ Scale to large data sets ✓
§ Good approximation of full GP ("ground truth")
§ Predictions independent of computational graph → runs on heterogeneous computing infrastructures (laptop, cluster, ...)
§ Reasonable predictive variances

Running Example

Investigate various product-of-experts models: same training procedure, but different mechanisms for predictions.

Product of GP Experts

§ Prediction model (independent predictors):

p(f_* | x_*, \mathcal{D}) = \prod_{k=1}^{M} p_k(f_* | x_*, \mathcal{D}^{(k)}), \qquad
p_k(f_* | x_*, \mathcal{D}^{(k)}) = \mathcal{N}\big(f_* \,|\, \mu_k(x_*), \sigma_k^2(x_*)\big)

§ Predictive precision (inverse variance) and mean:

(\sigma_*^{\text{poe}})^{-2} = \sum_k \sigma_k^{-2}(x_*), \qquad
\mu_*^{\text{poe}} = (\sigma_*^{\text{poe}})^2 \sum_k \sigma_k^{-2}(x_*)\, \mu_k(x_*)

§ Independent of the computational graph ✓ (see the sketch below)
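A direct transcription of the two combination equations (a sketch; arrays are assumed shaped (M, n_test), names illustrative):

```python
import numpy as np

def poe_combine(mus, s2s):
    """PoE: precisions of the M Gaussian experts simply add up."""
    mus, s2s = np.asarray(mus), np.asarray(s2s)   # shape (M, n_test)
    prec = (1.0 / s2s).sum(axis=0)                # (sigma_poe)^-2
    mu = (mus / s2s).sum(axis=0) / prec           # precision-weighted mean
    return mu, 1.0 / prec
```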

Product of GP Experts

§ Unreasonable variances for M > 1:

(\sigma_*^{\text{poe}})^{-2} = \sum_k \sigma_k^{-2}(x_*)

§ The more experts, the more certain the prediction, even if every expert itself is very uncertain ✗ → cannot fall back to the prior

Generalized Product of GP Experts

§ Weight the responsibility of each expert in PoE with β_k
§ Prediction model (independent predictors):

p(f_* | x_*, \mathcal{D}) = \prod_{k=1}^{M} p_k^{\beta_k}(f_* | x_*, \mathcal{D}^{(k)}), \qquad
p_k(f_* | x_*, \mathcal{D}^{(k)}) = \mathcal{N}\big(f_* \,|\, \mu_k(x_*), \sigma_k^2(x_*)\big)

§ Predictive precision and mean:

(\sigma_*^{\text{gpoe}})^{-2} = \sum_k \beta_k \sigma_k^{-2}(x_*), \qquad
\mu_*^{\text{gpoe}} = (\sigma_*^{\text{gpoe}})^2 \sum_k \beta_k \sigma_k^{-2}(x_*)\, \mu_k(x_*)

§ With Σ_k β_k = 1, the model can fall back to the prior ✓ → "log-opinion pool" model (Heskes, 1998)
§ Independent of computational graph for β_k = 1/M ✓ (see the sketch below)
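The gPoE combination differs from the PoE sketch above only in the β_k weights (arrays again assumed shaped (M, n_test)):

```python
import numpy as np

def gpoe_combine(mus, s2s, betas=None):
    """gPoE: expert k enters with weight beta_k; beta_k = 1/M keeps the
    prediction independent of the computational graph."""
    mus, s2s = np.asarray(mus), np.asarray(s2s)   # shape (M, n_test)
    M = len(mus)
    betas = np.full(M, 1.0 / M) if betas is None else np.asarray(betas)
    prec = (betas[:, None] / s2s).sum(axis=0)
    mu = (betas[:, None] * mus / s2s).sum(axis=0) / prec
    return mu, 1.0 / prec
```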

Generalized Product of GP Experts

§ Same mean as PoE
§ Model no longer overconfident and falls back to prior ✓
§ Very conservative variances ✗

Bayesian Committee Machine

§ Apply Bayes' theorem when combining predictions (and not only for computing predictions)
§ Prediction model (conditional independence D^{(j)} ⊥⊥ D^{(k)} | f_*):

p(f_* | x_*, \mathcal{D}) = \frac{\prod_{k=1}^{M} p_k(f_* | x_*, \mathcal{D}^{(k)})}{p^{M-1}(f_*)}

§ Predictive precision and mean:

(\sigma_*^{\text{bcm}})^{-2} = \sum_{k=1}^{M} \sigma_k^{-2}(x_*) - (M-1)\, \sigma_{**}^{-2}, \qquad
\mu_*^{\text{bcm}} = (\sigma_*^{\text{bcm}})^2 \sum_{k=1}^{M} \sigma_k^{-2}(x_*)\, \mu_k(x_*)

§ Product of GP experts, divided by M−1 times the prior
§ Guaranteed to fall back to the prior outside the data regime ✓
§ Independent of computational graph ✓ (see the sketch below)
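In code, the BCM is the PoE combination plus the prior-precision correction (a sketch; prior_var stands for the GP prior variance σ_**² at the test inputs):

```python
import numpy as np

def bcm_combine(mus, s2s, prior_var):
    """BCM: remove the (M-1) extra copies of the prior precision that the
    plain product of M experts implicitly multiplies in."""
    mus, s2s = np.asarray(mus), np.asarray(s2s)   # shape (M, n_test)
    M = len(mus)
    prec = (1.0 / s2s).sum(axis=0) - (M - 1) / prior_var
    mu = (mus / s2s).sum(axis=0) / prec
    return mu, 1.0 / prec
```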

Bayesian Committee Machine

§ Variance estimates are about right ✓
§ When leaving the data regime, the BCM can produce junk ✗

→ Robustify

Robust Bayesian Committee Machine

§ Merge gPoE (weighting of experts) with the BCM (Bayes' theorem when combining predictions)
§ Prediction model (conditional independence D^{(j)} ⊥⊥ D^{(k)} | f_*):

p(f_* | x_*, \mathcal{D}) = \frac{\prod_{k=1}^{M} p_k^{\beta_k}(f_* | x_*, \mathcal{D}^{(k)})}{p^{\sum_k \beta_k - 1}(f_*)}

§ Predictive precision and mean:

(\sigma_*^{\text{rbcm}})^{-2} = \sum_{k=1}^{M} \beta_k \sigma_k^{-2}(x_*) + \Big(1 - \sum_{k=1}^{M} \beta_k\Big)\, \sigma_{**}^{-2}, \qquad
\mu_*^{\text{rbcm}} = (\sigma_*^{\text{rbcm}})^2 \sum_k \beta_k \sigma_k^{-2}(x_*)\, \mu_k(x_*)
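Combining the two ideas in code (a sketch with one scalar β_k per expert; Deisenroth & Ng (2015) choose β_k per test point from the difference in differential entropy between prior and expert posterior, but any weights fit the formula):

```python
import numpy as np

def rbcm_combine(mus, s2s, prior_var, betas):
    """rBCM: gPoE-style expert weights plus a BCM-style prior correction
    whose strength is (1 - sum_k beta_k)."""
    mus, s2s, betas = np.asarray(mus), np.asarray(s2s), np.asarray(betas)
    prec = (betas[:, None] / s2s).sum(axis=0) + (1.0 - betas.sum()) / prior_var
    mu = (betas[:, None] * mus / s2s).sum(axis=0) / prec
    return mu, 1.0 / prec
```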

Robust Bayesian Committee Machine

§ Does not break down in case of weak experts → robustified ✓
§ Robust version of BCM → reasonable predictions ✓
§ Independent of computational graph (for all choices of β_k) ✓

Empirical Approximation Error

[Figure: RMSE vs. gradient computation time (seconds) for rBCM, BCM, gPoE, PoE, SOR, and the full GP; secondary axis shows the number of points per expert (39–10000)]

§ Simulated robot-arm data (10K training points, 10K test points)
§ Hyper-parameters of ground-truth full GP
§ RMSE as a function of the training time
§ Sparse GP (SOR) performs worse than any distributed GP
§ rBCM performs best with "weak" GP experts

Empirical Approximation Error (2)

[Figure: NLPD vs. gradient computation time (seconds) for rBCM, BCM, gPoE, PoE, SOR, and the full GP; secondary axis shows the number of points per expert (39–10000)]

§ NLPD as a function of the training time → assesses mean and variance
§ BCM and PoE are not robust for weak experts
§ gPoE suffers from too conservative variances
§ rBCM consistently outperforms other methods

Summary: Distributed Gaussian Processes

[Figure: hierarchical computational graph combining expert predictions in two layers]

§ Scale Gaussian processes to large data (beyond 10⁶)
§ Model conceptually straightforward and easy to train
§ Key: Distributed computation
§ Currently tested with N ∈ O(10⁷)
§ Scales to arbitrarily large data sets (with enough computing power)

Thank you for your attention

Expressiveness of the Standard Model

Consider the universal function approximator

f(x) = \sum_{i \in \mathbb{Z}} \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \gamma_n \exp\left( -\frac{\big(x - (i + \tfrac{n}{N})\big)^2}{\lambda^2} \right), \qquad x \in \mathbb{R}, \ \lambda \in \mathbb{R}_+

with γ_n ~ N(0, 1) (random weights) → Gaussian-shaped basis functions (with variance λ²/2) everywhere on the real axis:

f(x) = \sum_{i \in \mathbb{Z}} \int_i^{i+1} \gamma(s) \exp\left( -\frac{(x-s)^2}{\lambda^2} \right) ds
= \int_{-\infty}^{\infty} \gamma(s) \exp\left( -\frac{(x-s)^2}{\lambda^2} \right) ds

§ Mean: E[f(x)] = 0
§ Covariance: Cov[f(x), f(x')] = θ₁² exp(−(x − x')²/(2λ²)) for a suitable θ₁²

→ GP with mean 0 and Gaussian covariance function (a numerical check follows below)
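The covariance claim is easy to check numerically: with white-noise γ(s), the covariance is the overlap integral of the two basis functions, and a short calculation suggests θ₁² = λ√(π/2). A sketch (grid and test points arbitrary):

```python
import numpy as np

lam = 1.0
s = np.linspace(-10.0, 10.0, 20001)    # dense grid covering the basis centers
ds = s[1] - s[0]

def cov_check(x1, x2):
    """Compare the white-noise overlap integral with the Gaussian kernel."""
    numeric = (np.exp(-(x1 - s)**2 / lam**2)
               * np.exp(-(x2 - s)**2 / lam**2)).sum() * ds
    analytic = lam * np.sqrt(np.pi / 2) * np.exp(-(x1 - x2)**2 / (2 * lam**2))
    return numeric, analytic

print(cov_check(0.3, 1.1))             # the two numbers agree
```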

Two-Level Inference

§ Level-1 inference:

p(f | X, y, \theta) = \frac{p(y | X, f)\, p(f | \theta)}{p(y | X, \theta)}, \qquad
p(y | X, \theta) = \int p(y | X, f)\, p(f | \theta)\, df

§ Level-2 inference:

p(\theta | X, y) = \frac{p(y | X, \theta)\, p(\theta)}{p(y | X)}

References

[1] E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. Technical Report TR-2009-023, Department of Computer Science, University of British Columbia, 2009.
[2] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. http://arxiv.org/abs/1410.7827, October 2014.
[3] K. Chalupka, C. K. I. Williams, and I. Murray. A Framework for Evaluating Approximate Methods for Gaussian Process Regression. Journal of Machine Learning Research, 14:333–350, February 2013.
[4] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[5] M. P. Deisenroth. Efficient Reinforcement Learning using Gaussian Processes, volume 9 of Karlsruhe Series on Intelligent Sensor-Actuator-Systems. KIT Scientific Publishing, November 2010. ISBN 978-3-86644-569-7.
[6] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, February 2015.
[7] M. P. Deisenroth and J. Ng. Distributed Gaussian Processes. http://arxiv.org/abs/1502.02843, February 2015.
[8] Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models. In Advances in Neural Information Processing Systems, 2014.
[9] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In A. Nicholson and P. Smyth, editors, Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2013.
[10] T. Heskes. Selecting Weighting Factors in Logarithmic Opinion Pools. In Advances in Neural Information Processing Systems, pages 266–272. Morgan Kaufmann, 1998.
[11] J. Kern. Bayesian Process-Convolution Approaches to Specifying Spatial Dependence Structure. PhD thesis, Institute of Statistics and Decision Sciences, Duke University, 2000.
[12] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235–284, February 2008.
[13] N. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 6:1783–1816, November 2005.
[14] D. J. C. MacKay. Introduction to Gaussian Processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168, pages 133–165. Springer, Berlin, Germany, 1998.
[15] J. Ng and M. P. Deisenroth. Hierarchical Mixture-of-Experts Model for Large-Scale Gaussian Process Regression. http://arxiv.org/abs/1412.3078, December 2014.
[16] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939–1960, 2005.
[17] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[18] B. Schölkopf and A. J. Smola. Learning with Kernels—Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2002.
[19] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge, MA, USA, 2006.
[20] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
[21] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741, 2000.