
Introduction to Gaussian Processes

Raquel Urtasun

TTI Chicago

August 2, 2013


Which problems will we be looking at?

In this first lecture:

Understand GPs from two perspectives:

the weight view and the function view

Applications in Computer Vision


Let’s look at the regression problem


Regression Problem: The Prediction Problem

Figure: atmospheric CO2 concentration (ppm) as a function of the year (from C. Rasmussen)


What’s the difference between these two results?


Questions

Model Selection

how to find out which model to use?

Model Fitting

how do I fit the parameters? what about overfitting?

Can I trust the predictions, even if I am not sure of the parameters and the model structure?


Supervised Regression

Assume an underlying process which generates “clean” data.

Task: Recover the underlying process from noisy observed data {x^(i), y_i}

Figure: underlying function and noisy training data (from H. Wallach)


The weight view on GPs


Bayesian Linear Regression

The linear regression model is

f(x|w) = w^T x,   y = f + η

with i.i.d. noise η ∼ N(0, σ²)

Figure: training data for the regression problem (from H. Wallach)


Bayesian Linear Regression

The linear regression model is

f(x|w) = w^T x,   y = f + η

with η ∼ N(0, σ²)

The likelihood of the parameters is

P(y|X, w) = N(w^T X, σ² I)

Applying Bayes’ theorem we have

P(w|y,X) ∝ P(y|X,w) · P(w)   (posterior ∝ likelihood × prior)

Typically we assume a Gaussian prior over the parameters

P(w|y,X) ∝ N(w^T X, σ² I) · N(0, Σ)

What’s the form of the posterior then? Why did we use a Gaussian prior?


Bayesian Linear Regression

The posterior is Gaussian!

P(w|y,X) = N( (1/σ²) A^{-1} X y,  A^{-1} )

with A = Σ^{-1} + (1/σ²) X X^T

How can we make predictions?

The predictive distribution is also Gaussian!

P(f*|x*, X, y) = ∫ P(f*, w|x*, X, y) dw = N( (1/σ²) x*^T A^{-1} X y,  x*^T A^{-1} x* )
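These two Gaussians translate directly into code. A minimal numpy sketch (the function name and the d × n layout of X are assumptions for illustration, not from the slides):

```python
import numpy as np

def blr_posterior_predictive(X, y, x_star, Sigma, sigma2):
    """Bayesian linear regression with prior w ~ N(0, Sigma) and noise variance sigma2.
    X: (d, n) training inputs stored as columns, y: (n,) targets, x_star: (d,) test input."""
    A = np.linalg.inv(Sigma) + (X @ X.T) / sigma2     # A = Sigma^{-1} + (1/sigma^2) X X^T
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ X @ y / sigma2                   # posterior mean of w
    f_mean = x_star @ w_mean                          # predictive mean (1/sigma^2) x*^T A^{-1} X y
    f_var = x_star @ A_inv @ x_star                   # predictive variance x*^T A^{-1} x*
    return w_mean, A_inv, f_mean, f_var

# toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 20))                          # 20 two-dimensional inputs as columns
y = 0.5 * X[0] - 1.0 * X[1] + 0.1 * rng.normal(size=20)
print(blr_posterior_predictive(X, y, np.array([1.0, -1.0]), np.eye(2), 0.01)[2])
```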


Non-linear Bayesian Regression

The linear regression model is

f(x|w) = w^T φ(x),   y = f + η

with η ∼ N(0, σ²) and basis functions φ(x)

How to model more complex functions?

P(f ∗|x∗,X, y) can be expressed in terms of inner products of φ(x)

Why is this interesting?

Well we can apply the kernel trick: we do not need to define φ(x) explicitly

How many basis functions should we use?

GPs come to our rescue!


The function view of GPs


What’s a GP?

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

A Gaussian is fully specified by a mean vector µ and a covariance matrix Σ

f = (f_1, · · · , f_n)^T ∼ N(µ, Σ)

(index i)

A Gaussian process is fully specified by a mean function m(x) and a PSD covariance function k(x, x′)

f(x) ∼ GP(m(x), k(x, x′))

(index x)

Typically m(x) = 0


Marginalization property

Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite-by-infinite covariance matrix may seem impractical

Marginalization property

p(x) = ∫ p(x, y) dy

For Gaussians: if (y^(1), y^(2))^T ∼ N( (µ_1, µ_2)^T, [Σ_11 Σ_12; Σ_21 Σ_22] ), then y^(1) ∼ N(µ_1, Σ_11)

Therefore I only need to consider the points that I observe!

This is the consistency property


The Gaussian Distribution

The Gaussian distribution is given by

p(x|µ, Σ) = N(µ, Σ) = (2π)^{-D/2} |Σ|^{-1/2} exp( -(1/2) (x − µ)^T Σ^{-1} (x − µ) )

where µ is the mean vector and Σ the covariance matrix.


Figure: from C. Rasmussen
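For reference, the density is usually evaluated in log space via a Cholesky factor; a small numpy sketch (an illustration, not from the slides):

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x | mu, Sigma) for a D-dimensional Gaussian."""
    D = mu.shape[0]
    diff = x - mu
    L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T
    alpha = np.linalg.solve(L, diff)               # so alpha^T alpha = diff^T Sigma^{-1} diff
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log |Sigma|
    return -0.5 * (D * np.log(2 * np.pi) + log_det + alpha @ alpha)

print(gaussian_log_density(np.zeros(2), np.zeros(2), np.eye(2)))  # equals -log(2*pi)
```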


Conditionals and Marginals of a Gaussian

Both the conditionals and the marginals of a joint Gaussian are again Gaussian.

Figure: a joint Gaussian with its conditional and its marginal (from C. Rasmussen)


GP from Bayesian linear model

The Bayesian linear model is

f(x) = w^T x with w ∼ N(0, Σ)

The mean function is E[f(x)] = E[w^T] x = 0

The covariance is E[f(x) f(x′)] = x^T E[w w^T] x′ = x^T Σ x′

For any set of m basis functions, φ(x), the corresponding covariance function is

K(x^(p), x^(q)) = φ(x^(p))^T Σ φ(x^(q))

Conversely, for every covariance function K, there is a possibly infinite expansion in terms of basis functions

K(x^(p), x^(q)) = ∑_{i=1}^∞ λ_i φ_i(x^(p))^T φ_i(x^(q))
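A small numpy check of this correspondence, with a made-up three-dimensional feature map: the Monte Carlo average of f(x^(p)) f(x^(q)) under w ∼ N(0, Σ) matches φ(x^(p))^T Σ φ(x^(q)):

```python
import numpy as np

def phi(x):
    """A hypothetical 3-dimensional feature map for scalar inputs."""
    return np.array([1.0, x, x**2])

def k_from_basis(xp, xq, Sigma):
    """Covariance induced by f(x) = w^T phi(x) with w ~ N(0, Sigma)."""
    return phi(xp) @ Sigma @ phi(xq)

rng = np.random.default_rng(0)
Sigma = np.diag([1.0, 0.5, 0.1])
W = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)   # samples of w
xp, xq = 0.3, -1.2
mc = np.mean((W @ phi(xp)) * (W @ phi(xq)))                     # Monte Carlo E[f(xp) f(xq)]
print(mc, k_from_basis(xp, xq, Sigma))                          # the two numbers should be close
```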


Covariance

For any set of inputs x^(1), · · · , x^(n) we can compute K, which defines a joint distribution over function values

f(x^(1)), · · · , f(x^(n)) ∼ N(0, K)

Therefore, a GP specifies a distribution over functions

Encode the prior knowledge by defining the kernel, which specifies the covariance between pairs of random variables, e.g.,

K(x^(p), x^(q)) = exp( −(1/2) ||x^(p) − x^(q)||₂² )

Figure: K(x^(p) = 5, x^(q)) as a function of x^(q), for the squared exponential covariance (from H. Wallach)


Gaussian Process Prior

Given a set of inputs x^(1), · · · , x^(n), we can draw samples f(x^(1)), · · · , f(x^(n))

Example when using an RBF kernel

f(x^(1)), . . . , f(x^(n)) ∼ N(0, K)

Figure: four samples from the GP prior (from H. Wallach)

How can we generate these samples?
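One standard way to generate them (an alternative to the sequential construction on the next slide) is to factor K and transform independent normal draws; a numpy sketch assuming the squared-exponential kernel above:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel K[i, j] = exp(-||x1_i - x2_j||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)[:, None]                # inputs x^(1), ..., x^(n)
K = rbf_kernel(x, x, lengthscale=0.2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # K = L L^T, with jitter for stability
samples = L @ rng.normal(size=(len(x), 4))         # four draws f ~ N(0, K)
print(samples.shape)                               # (100, 4)
```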


Sampling

Let’s do sequential generation

p(f_1, · · · , f_n | x_1, · · · , x_n) = ∏_{i=1}^n p(f_i | f_{i−1}, · · · , f_1, x_i, · · · , x_1)

Each term is again Gaussian since

p(x, y) = N( (a, b)^T, [A B; B^T C] )  ⇒  p(x|y) = N( a + B C^{-1} (y − b),  A − B C^{-1} B^T )

Figure: samples from the GP prior and posterior, with predictive distribution
p(y*|x*, x, y) = N( k(x*, x)^T [K + σ²_noise I]^{-1} y,  k(x*, x*) + σ²_noise − k(x*, x)^T [K + σ²_noise I]^{-1} k(x*, x) )
(from C. Rasmussen)
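The conditioning rule above in numpy (generic block names a, b, A, B, C as in the formula; a sketch, not code from the slides):

```python
import numpy as np

def condition_gaussian(a, b, A, B, C, y):
    """Given (x, y) ~ N([a, b], [[A, B], [B^T, C]]), return mean and covariance of p(x | y)."""
    mean = a + B @ np.linalg.solve(C, y - b)       # a + B C^{-1} (y - b)
    cov = A - B @ np.linalg.solve(C, B.T)          # A - B C^{-1} B^T
    return mean, cov

# tiny usage with a 1+1 dimensional joint Gaussian
mean, cov = condition_gaussian(np.array([0.0]), np.array([0.0]),
                               np.array([[1.0]]), np.array([[0.8]]), np.array([[1.0]]),
                               y=np.array([2.0]))
print(mean, cov)   # mean 1.6, variance 0.36
```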


Noise-free Observations

Given noise-free training data D = {(x^(i), f^(i))} we want to make predictions f* about new points X*

The GP prior says

[f; f*] ∼ N( 0, [K(X,X)  K(X,X*); K(X*,X)  K(X*,X*)] )

Condition {X∗, f∗} on the training data {X, f} to obtain the posterior

This restricts the posterior to contain functions which agree with the training data

The posterior is Gaussian, p(f*|X*, X, f) = N(µ, Σ), with

µ = K(X*, X) K(X,X)^{-1} f

Σ = K(X*, X*) − K(X*, X) K(X,X)^{-1} K(X, X*)


Example of Posterior

Figure: samples from the GP prior and from the GP posterior conditioned on the observations (from C. Rasmussen)

All samples agree with observations

Highest variance in regions with few training points


How to deal with noise?

We have noisy observations {X, y} with

y = f + η with η ∼ N(0, σ² I)

Conditioning on the training data {X, y} gives a Gaussian predictive distribution p(f*|X*, X, y)

µ = K(X*, X) [K(X,X) + σ² I]^{-1} y

Σ = K(X*, X*) − K(X*, X) [K(X,X) + σ² I]^{-1} K(X, X*)
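A minimal numpy sketch of these predictive equations, assuming a squared-exponential kernel; setting sigma2 = 0 recovers the noise-free case of the previous slides:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, lengthscale=1.0, sigma2=0.1):
    """Predictive mean and covariance of f* given noisy observations y = f + eta."""
    K = rbf_kernel(X, X, lengthscale) + sigma2 * np.eye(len(X))   # K(X,X) + sigma^2 I
    K_star = rbf_kernel(X_star, X, lengthscale)                   # K(X*, X)
    K_ss = rbf_kernel(X_star, X_star, lengthscale)                # K(X*, X*)
    mu = K_star @ np.linalg.solve(K, y)
    Sigma = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mu, Sigma

X = np.array([[-1.0], [0.0], [1.0]])
y = np.array([0.5, 1.0, 0.2])
mu, Sigma = gp_predict(X, y, np.linspace(-2, 2, 5)[:, None], lengthscale=0.5)
print(mu, np.diag(Sigma))
```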


Model Selection: Hyperparameters

Let’s talk about the most commonly used kernel

K(x^(p), x^(q)) = exp( −(1/(2θ²)) ||x^(p) − x^(q)||₂² )

How can we choose θ?

Figure: samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5 (from H. Wallach)


Maximum Likelihood

If we don’t have a prior P(θ), the posterior for the hyperparameter θ is

P(θ|X, y) ∝ P(y|X, θ)

In the log domain

log P(y|X, θ) = −(1/2) log |K(X,X) + σ² I| − (1/2) y^T (K(X,X) + σ² I)^{-1} y − (n/2) log 2π

where the first term measures capacity and the second term measures model fitting

Obtain hyperparameters

argmin_θ − log P(y|X, θ)

What if we have P(θ)?

This is not the “right” thing to do, a Bayesian would say, as θ should be integrated out

A pragmatist is very happy with this
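A sketch of this objective in numpy, choosing θ by a simple grid search (the kernel, data, and grid are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X1, X2, theta):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / theta**2)

def log_marginal_likelihood(X, y, theta, sigma2=0.01):
    """log P(y|X, theta) = -1/2 log|K + s2 I| - 1/2 y^T (K + s2 I)^{-1} y - n/2 log(2 pi)."""
    n = len(y)
    Ky = rbf_kernel(X, X, theta) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s2 I)^{-1} y
    return (-np.sum(np.log(np.diag(L)))                    # capacity term: -1/2 log|Ky|
            - 0.5 * y @ alpha                              # model-fitting term
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.normal(size=30)
grid = np.linspace(0.05, 1.0, 40)
theta_ml = grid[np.argmax([log_marginal_likelihood(X, y, t) for t in grid])]
print(theta_ml)
```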


Coming back to the example

θML = 0.3255

Figure: samples from the posterior with θ = 0.3255 (from H. Wallach)

Using θ_ML is an approximation to the true Bayesian method of integrating over all θ values weighted by their posterior.


Interpretations

Recall that the predictive distribution p(f∗|x∗,X, y) is Gaussian with

µ(x*) = k(X, x*)^T [K(X,X) + σ² I]^{-1} y

Σ(x*) = k(x*, x*) − k(X, x*)^T [K(X,X) + σ² I]^{-1} k(X, x*)

Notice that the mean is linear in two forms

µ = ∑_{i=1}^n β_i y^(i) = ∑_{i=1}^n α_i k(x*, x^(i))

Note that you may have seen the last form before, e.g., in SVMs (representer theorem)

The cool thing is that α has a closed-form solution, no need to optimize!

The variance is the difference between the prior variance and a term that says how much the data X has explained.

The variance is independent of the observed outputs y
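A quick numpy check of the two linear forms of the mean: α = [K(X,X) + σ²I]^{-1} y weights the kernel values, while β = [K(X,X) + σ²I]^{-1} k(X, x*) weights the outputs, and both give the same prediction (toy data, a sketch):

```python
import numpy as np

def rbf(a, b, l=0.5):
    return np.exp(-0.5 * np.sum((a - b) ** 2) / l**2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 1))
y = np.sin(3 * X[:, 0])
sigma2 = 0.01
Ky = np.array([[rbf(xi, xj) for xj in X] for xi in X]) + sigma2 * np.eye(len(X))

x_star = np.array([0.3])
k_star = np.array([rbf(x_star, xi) for xi in X])

alpha = np.linalg.solve(Ky, y)        # closed-form weights on the kernel values
beta = np.linalg.solve(Ky, k_star)    # weights on the outputs y^(i)
print(k_star @ alpha, beta @ y)       # sum_i alpha_i k(x*, x^(i)) == sum_i beta_i y^(i)
```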


Other Covariances: Periodic smooth

First map the inputs to u = (cos(x), sin(x))^T, and then measure distances in the u space.

Combining with the squared exponential we have

k_periodic(x, x′) = exp( −2 sin²(π(x − x′)) / ℓ² )

Figure: three functions drawn at random; left ℓ > 1, right ℓ < 1 (from C. Rasmussen)
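The periodic covariance in numpy, together with a check that it equals a squared exponential applied to u = (cos 2πx, sin 2πx), i.e., assuming period 1 as in the formula above (a sketch, not the slides’ code):

```python
import numpy as np

def k_periodic(x, x2, lengthscale=1.0):
    """Periodic covariance from the slide: exp(-2 sin^2(pi (x - x2)) / l^2)."""
    return np.exp(-2.0 * np.sin(np.pi * (x - x2)) ** 2 / lengthscale**2)

def k_via_embedding(x, x2, lengthscale=1.0):
    """Same kernel obtained by mapping x -> u = (cos 2 pi x, sin 2 pi x) and applying
    a squared exponential exp(-||u - u'||^2 / (2 l^2)) in u-space."""
    u = np.array([np.cos(2 * np.pi * x), np.sin(2 * np.pi * x)])
    u2 = np.array([np.cos(2 * np.pi * x2), np.sin(2 * np.pi * x2)])
    return np.exp(-0.5 * np.sum((u - u2) ** 2) / lengthscale**2)

print(k_periodic(0.2, 0.7, 0.5), k_via_embedding(0.2, 0.7, 0.5))  # the two values agree
```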


Other Covariances: Matérn

The Matérn form is a stationary covariance, but not necessarily differentiable

Complicated function, lazy to write it ;)

Figure: univariate Matérn covariance function with unit characteristic length scale and unit variance, and sample functions, for ν = 1/2, ν = 1, ν = 2, and ν → ∞ (from C. Rasmussen)

More complex covariances can be created by summing and multiplying covariances


What if you want to do classification?


Binary Gaussian Process Classification

The simplest thing is to use regression for classification

More principled is to relate the class probability to the latent function f via an additional function

p(y = 1 | f(x)) = π(x) = ψ(f(x))

with ψ a sigmoid function such as the logistic or cumulative Gaussian

The likelihood is

p(y|f) = ∏_{i=1}^n p(y_i|f_i) = ∏_{i=1}^n ψ(y_i f_i)

Figure: latent function f(x) and class probability π(x) (from C. Rasmussen)
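A small sketch of the link and likelihood, using the logistic sigmoid for ψ (the cumulative Gaussian Φ works the same way); the latent values f below are made up rather than drawn from a GP:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(y, f):
    """log p(y|f) = sum_i log psi(y_i f_i) for labels y_i in {-1, +1} and latent f_i."""
    return np.sum(np.log(sigmoid(y * f)))

y = np.array([1, -1, 1, 1])
f = np.array([1.5, -0.3, 0.2, 2.0])   # hypothetical latent function values
print(sigmoid(f))                     # class probabilities pi(x) = psi(f(x))
print(log_likelihood(y, f))
```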


Houston, we have a problem!

We have a GP prior on the latent function

p(f|X) = N (0,K)

The posterior becomes

p(f|X, y) = p(y|f) p(f|X) / p(y|X) = [ N(f|0, K) / p(y|X) ] ∏_{i=1}^n ψ(y_i f_i)

This is non-Gaussian!

The prediction of the latent function at a new test point is intractable

p(f*|X, y, x*) = ∫ p(f*|f, X, x*) p(f|X, y) df

Same problem for the predictive class probability

p(y*|X, y, x*) = ∫ p(y*|f*) p(f*|X, y, x*) df*

Resort to approximations: Laplace, EP, Variational Bounds

In practice for the mean prediction, doing GP regression works as well!


Are GPs useful in computer vision?


Applications in Computer Vision

Many applications, we will concentrate on a few if time permits

Multiple kernel learning: object recognition

GPs as an optimization tool: weakly supervised segmentation

Human pose estimation from single images

Flow estimation

Fashion show


1) Object Recognition

Task: Given an image x, predict the class of the object present in the image, y ∈ Y

y ∈ {car, bus, bicycle}

Although this is a classification task, one can treat the categories as real values and formulate the problem as regression.


How do we do Object Recognition?

Given these two images, we would like to say if they are of the same class.

Choose a representation for the images

Global descriptor of the full image

Local features: SIFT, SURF, etc.

We need to choose a way to compute similarities

Histograms of local features (i.e., bags of words), pyramids, etc.

Kernels on global descriptors, e.g., RBF, · · ·


Multiple Kernel Learning (MKL)

Why do we need to choose a single representation and a single similarity function?

Which one is the best among all possible ones?

Multiple kernel learning comes to our rescue, by learning which cues and similarities are more important for the prediction task.

K = ∑_i α_i K_i

This is just hyperparameter learning in GPs! No need for specialized SW!
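As a sketch, the convex combination is just a weighted sum of precomputed Gram matrices, and the weights α_i are then treated as kernel hyperparameters of the GP (the base kernels below are illustrative assumptions):

```python
import numpy as np

def combine_kernels(kernel_matrices, alphas):
    """K = sum_i alpha_i K_i for precomputed base Gram matrices."""
    return sum(a * K for a, K in zip(alphas, kernel_matrices))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K_linear = X @ X.T                                             # e.g., a linear kernel on one cue
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_rbf = np.exp(-0.5 * d2)                                      # e.g., an RBF kernel on another cue
K = combine_kernels([K_linear, K_rbf], alphas=[0.3, 0.7])
print(K.shape)
```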


Efficient Learning Using GPs for Multiclass Problems

Suppose we want to emulate a 1-vs-all strategy as |Y| > 2

We define y ∈ {−1, 1}^|Y|

We can employ maximum likelihood and learn all the parameters for all classifiers at once

min_{θ, α>0}  −∑_i log p(y^(i)|X, θ, α) + γ_1 ||α||_1 + γ_2 ||α||_2

with y^(i) ∈ {−1, 1} the labels of each of the individual problems.

Efficient to do joint learning as we can share the covariance across all classes

What’s the difference between θ and α?


Caltech 101 dataset


Results: Caltech 101

[A. Kapoor, K. Grauman, R. Urtasun and T. Darrell, IJCV 2009]

Comparison with SVM kernel combination: kernels based on Geometric Blur (with and without distortion), dense PMK and spatial PMK on SIFT, etc.

Figure: Average precision. Recognition accuracy vs. number of labeled images per class (statistical efficiency, Caltech-101, 15 examples per class), GP (6 kernels) vs. Varma & Ray (6 kernels).

Figure: Time of computation. Time in seconds vs. number of labeled images per class (computational efficiency, Caltech-101), GP (6 kernels) vs. Varma & Ray (6 kernels).


Results: Caltech 101

[A. Kapoor, K. Grauman, R. Urtasun and T. Darrell, IJCV 2009]

Figure: Comparison with the state of the art as in late 2008: mean recognition rate per class vs. number of training examples per class on the Caltech 101 Categories Data Set, comparing GP-Multi with Boiman et al. (CVPR08), Jain, Kulis & Grauman (CVPR08), Frome, Singer, Sha & Malik (ICCV07), Zhang, Berg, Maire & Malik (CVPR06), Lazebnik, Schmid & Ponce (CVPR06), and other methods.


Other forms of MKL

Convex combination of kernels is too simple (no big boost reported); we need more complex (non-linear) combinations

Localized combination (the weighting varies locally) (Christoudias et al. 09)

K^(v) = K^(v)_np ⊙ K^(v)_p

use structure to define K^(v)_np, e.g., low-rank

Bayesian co-training (Yu et al. 07)

K_c = ( ∑_j (K_j + σ²_j I)^{-1} )^{-1}

Heteroscedastic Bayesian Co-training: model noise with full covariance (Christoudias et al. 09)

Check out Mario Christoudias’ PhD thesis for more details


2) Optimization of non-differentiable functions

Suppose you have a function that you want to optimize, but it is non-differentiable and also computationally expensive to evaluate. You can:

Discretize your space and evaluate the discretized values on a grid (combinatorial)

Randomly sample your parameters

Utilize GPs to query where to look


GPs as an optimization tool
[N. Srinivas, A. Krause, S. Kakade and M. Seeger, ICML 2010]

Suppose we want to compute max f(x); we can simply:

repeat
    Choose $x_t = \arg\max_{x \in D}\; \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)$
    Evaluate $y_t = f(x_t) + \epsilon_t$
    Update $\mu_t$ and $\sigma_t$
until budget reached
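Below is a minimal GP-UCB sketch of this loop on a discretized 1-D domain, assuming a fixed squared-exponential kernel and a toy noisy objective; the β_t schedule, hyper-parameters, and names are illustrative rather than the exact setup of Srinivas et al.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.3):
    """Squared-exponential kernel between 1-D input arrays A and B."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_query, noise_var=1e-2):
    """Posterior mean and stddev of a zero-mean GP at the query points."""
    K = rbf_kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    K_s = rbf_kernel(X_query, X_obs)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_obs
    var = np.diag(rbf_kernel(X_query, X_query) - K_s @ K_inv @ K_s.T)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def f(x):                              # expensive, non-differentiable toy objective
    return np.sin(3 * x) + 0.5 * np.sign(x)

rng = np.random.default_rng(0)
D = np.linspace(-2, 2, 401)            # discretized search domain
X_obs = np.array([0.0])                # one arbitrary initial evaluation
y_obs = f(X_obs) + 0.1 * rng.standard_normal(1)

for t in range(1, 16):                 # until budget reached
    mu, sigma = gp_posterior(X_obs, y_obs, D)
    beta_t = 2.0 * np.log(len(D) * t**2)              # simple UCB-style schedule
    x_t = D[np.argmax(mu + np.sqrt(beta_t) * sigma)]  # acquisition maximizer
    y_t = f(np.array([x_t])) + 0.1 * rng.standard_normal(1)
    X_obs, y_obs = np.append(X_obs, x_t), np.append(y_obs, y_t)

print("best x found:", X_obs[np.argmax(y_obs)])
```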



GPs as an optimization tool in vision

[A. Vezhnevets, V. Ferrari and J. Buhmann, CVPR 2012]

Image segmentation in the weakly supervised setting, where the only labels are which classes are present in the scene, e.g.

y ∈ {sky, building, tree}

Train based on expected agreement: if I partition the dataset into two sets and train on the first, it should predict the same as if I train on the second.

This function is a sum of indicator functions and thus non-differentiable.


Examples of Good Segmentations and Results

[A. Vezhnevets, V. Ferrari and J. Buhmann, CVPR 2012]

                     [Tighe 10]   [Vezhnevets 11]   GMIM
supervision          full         weak              weak
average accuracy     29           14                21


3) Discriminative Approaches to Human Pose Estimation

Task: given an image x, estimate the 3D location and orientation of the body parts y.

We can treat this problem as a multi-output regression problem, where the inputs are image features, e.g., BOW, HOG, etc.

The main challenges are:

Poor imaging: motion blur, occlusions, etc.
Need for a large number of examples to represent all possible poses: we must represent variations in both appearance and pose.


Challenges for GPs

GPs have complexity O(n³), with n the number of examples, and cannot deal with large datasets in their standard form.

This problem can't be solved directly as a regression task, since the mapping is multimodal: an image observation can represent more than one pose.

Solutions to the first problem exist in the literature; they are called sparsification techniques.
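To give a flavour of such sparsification (without tying it to any specific paper above), the sketch below uses a Subset-of-Regressors / Nystrom-style approximation, replacing the O(n³) solve by one involving only m inducing points; the function name, kernel, and defaults are our own.

```python
import numpy as np

def sor_posterior_mean(X, y, X_star, inducing_idx, lengthscale=1.0, noise_var=1e-2):
    """Subset-of-Regressors approximate GP mean:
    m(x*) = K_{*m} (K_{mn} K_{nm} + noise_var * K_{mm})^{-1} K_{mn} y,
    costing O(n m^2) instead of O(n^3)."""
    def rbf(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d2 / lengthscale**2)
    Xm = X[inducing_idx]                   # m inducing (active) points
    K_mn = rbf(Xm, X)                      # (m, n)
    K_mm = rbf(Xm, Xm)                     # (m, m)
    K_sm = rbf(X_star, Xm)                 # (q, m)
    A = K_mn @ K_mn.T + noise_var * K_mm   # (m, m) system instead of (n, n)
    return K_sm @ np.linalg.solve(A, K_mn @ y)
```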


Dealing with multimodal mappings

We can represent the regression problem as a mixture of experts, where each expert is a local GP.

The experts should be selected online to avoid the possible boundary problems of clustering.

This gives a fast solution with up to millions of examples if combined with fast NN retrieval, e.g., LSH.

ONLINE: inference for a test point $x_*$ ($T$: number of experts, $S$: size of each expert)

    Find the NNs of $x_*$ in $\mathbf{x}$
    Find the modes in $\mathbf{y}$ of the retrieved NNs
    for $i = 1 \dots T$ do
        Create a local GP for mode $i$
        Retrieve its hyper-parameters
        Compute its mean $\mu_i$ and variance $\sigma_i^2$
    end for
    $p(f_* \mid \mathbf{y}) \approx \sum_{i=1}^{T} \pi_i \, \mathcal{N}(\mu_i, \sigma_i^2)$
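A compact numpy sketch of this online procedure, where nearest-neighbour retrieval is brute force and mode finding is reduced to splitting the neighbours' outputs into groups; hyper-parameter retrieval is omitted and all names are ours, so read it as an illustration of the mixture rather than the paper's implementation.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, x_star, noise_var=1e-2):
    """Zero-mean GP posterior mean/variance at a single test point x_star."""
    K = rbf(X, X) + noise_var * np.eye(len(X))
    k_s = rbf(x_star[None, :], X)[0]
    K_inv = np.linalg.inv(K)
    mu = k_s @ K_inv @ y
    var = 1.0 + noise_var - k_s @ K_inv @ k_s
    return mu, max(var, 1e-12)

def local_online_gp_mixture(X, y, x_star, n_neighbors=50, n_modes=2):
    """Retrieve the NNs of x_star, split them into modes by their outputs,
    fit one local GP per mode, and return the components (pi_i, mu_i, sigma_i^2)."""
    nn_idx = np.argsort(np.sum((X - x_star) ** 2, axis=1))[:n_neighbors]
    Xn, yn = X[nn_idx], y[nn_idx]
    groups = np.array_split(np.argsort(yn), n_modes)   # crude stand-in for mode detection
    components = []
    for members in groups:
        mu_i, var_i = gp_predict(Xn[members], yn[members], x_star)
        components.append((len(members) / n_neighbors, mu_i, var_i))
    return components   # p(f_* | y) ≈ sum_i pi_i N(mu_i, sigma_i^2)
```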


Online vs Clustering

Figure: Full GP vs. Local Offline GP vs. Local Online GP on a 1-D regression example, shown over the full input range and over a zoomed-in region.


Single GP vs Mixture of Online GPs

Figure: Global GP vs. Local Online GP on a 1-D regression example.


Results: Humaneva

[R. Urtasun and T. Darrell, CVPR 2008]

                      walk    jog     box     mono.   discrim.   dyn.
Lee et al. I          3.4     –       –       yes     no         no
Lee et al. II         3.1     –       –       yes     no         yes
Pope                  4.53    4.38    9.43    yes     yes        no
Muendermann et al.    5.31    –       4.54    no      no         yes
Li et al.             –       –       20.0    yes     no         yes
Brubaker et al.       10.4    –       –       yes     no         yes
Our approach          3.27    3.12    3.85    yes     yes        no

Table: Comparison with state of the art (error in cm).

Caveat: an oracle has to select the optimal mixture component


4) Flow Estimation with Gaussian Process

Model a trajectory as a continuous dense flow field from a sparse set of vector sequences using Gaussian Process Regression

Each velocity component modeled with an independent GP

The flow can be expressed as

$$\phi(\mathbf{x}) = y^{(u)}(\mathbf{x})\,\mathbf{i} + y^{(v)}(\mathbf{x})\,\mathbf{j} + y^{(t)}(\mathbf{x})\,\mathbf{k} \in \mathbb{R}^3$$

where $\mathbf{x} = (u, v, t)$

Difficulties:

How to model a GPRF from different trajectories, which may have different lengths
How to handle multiple GPRF models trained from different numbers of trajectories with heterogeneous scales and frame rates

Solution: normalize the length of the tracks before modeling with a GP, as well as the number of samples

Classification is based on the likelihood for each class
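A rough sketch of this per-component GP flow model, assuming scikit-learn's GaussianProcessRegressor, trajectories given as (u, v) point sequences, and finite differences standing in for the velocity targets; the normalization step is reduced to resampling each track to a fixed number of points, and the per-class likelihood scoring is omitted. All names and defaults are ours.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def resample_track(track, n_samples=20):
    """Normalize a trajectory (k x 2 array of (u, v) points) to a fixed length
    by linear interpolation over a common time parameter t in [0, 1]."""
    t_old = np.linspace(0.0, 1.0, len(track))
    t_new = np.linspace(0.0, 1.0, n_samples)
    u = np.interp(t_new, t_old, track[:, 0])
    v = np.interp(t_new, t_old, track[:, 1])
    return np.column_stack([u, v, t_new])            # rows are x = (u, v, t)

def fit_flow_field(tracks):
    """Fit one independent GP per flow component (u, v, t velocities)."""
    X_list, Y_list = [], []
    for tr in tracks:
        pts = resample_track(tr)
        X_list.append(pts[:-1])
        Y_list.append(np.diff(pts, axis=0))          # finite-difference velocities
    X, Y = np.vstack(X_list), np.vstack(Y_list)
    kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=1e-3)
    return [GaussianProcessRegressor(kernel=kernel).fit(X, Y[:, d]) for d in range(3)]

def flow_at(gps, x):
    """phi(x) = (y_u(x), y_v(x), y_t(x)) with per-component predictive stddev."""
    preds = [gp.predict(x[None, :], return_std=True) for gp in gps]
    return np.array([m[0] for m, s in preds]), np.array([s[0] for m, s in preds])
```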


Flow Classification and Anomaly Detection

[K. Kim, D. Lee and I. Essa, ICCV 2011]


5) 3D Shape Recovery for Online Shopping

Interactive system for quickly modelling 3D body shapes from a single image

Users obtain their 3D body shapes so as to try on virtual garments online

Interface for users to conveniently extract anthropometric measurements from a single photo, while using readily available scene cues for automatic image rectification

GPs are used to predict the body parameters from the input measurements while correcting the aspect-ratio ambiguity resulting from photo rectification


Creating the 3D Shape from Single Images

Manually annotate a set of five 2D measurements

Well-defined by the anthropometric positions, easy to discern, and unambiguous to users.

Good correlations with the corresponding tape measurements, and they convey enough information for estimating the 3D body shape

User’s effort for annotation should be minimised


The role of the GPs

A body shape estimator is learned to predict the 3D body shape from the user's input, including both image measurements and actual measurements.

The training set is the CAESAR dataset (Robinette et al. 99), with 2000 bodies.

Register each 3D instance in the dataset with a 3D morphable human body

A 3D body is decomposed into a linear combination of body morphs

The shape-from-measurements estimator can be formulated as a regression problem, where y is the morph parameters and x the user-specified measurements.

Multi-output regression is done with independent predictors, each a GP
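A minimal sketch of this independent-GP, multi-output regressor, assuming scikit-learn GPs; the array shapes, kernel choice, and morph-basis reconstruction are illustrative assumptions rather than the actual pipeline of Chen et al.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_shape_estimator(X_measurements, Y_morphs):
    """One independent GP per morph coefficient: x = user measurements,
    y = coefficients of the linear body-morph decomposition."""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    return [GaussianProcessRegressor(kernel=kernel, normalize_y=True)
            .fit(X_measurements, Y_morphs[:, d]) for d in range(Y_morphs.shape[1])]

def predict_body(gps, x_new, morph_basis, mean_body):
    """Reconstruct a 3D body as mean shape + linear combination of body morphs."""
    coeffs = np.array([gp.predict(x_new[None, :])[0] for gp in gps])
    return mean_body + morph_basis @ coeffs          # morph_basis: (3 * n_vertices, n_morphs)
```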


Online Shopping

[Y. Chen and D. Robertson and R. Cipolla, BMVC 2011]

              Chest          Waist          Hips           Inner leg length
Error (cm)    1.52 ± 1.36    1.88 ± 1.06    3.10 ± 1.86    0.79 ± 0.90
