MACHINE LEARNING

Kernel Methods for Regression
lasa.epfl.ch/.../ML_Phd/Slides/Non-linearRegression.pdf



Support Vector Regression

Gaussian Mixture Regression

Gaussian Process Regression


Problem Statement

Predict output $y$ given input $x$ through a non-linear function $f$:

$y = f(x)$

Estimate the $f$ that best predicts the set of training points $(x^i, y^i)$, $i = 1, \ldots, M$.

[Figure: training datapoints $(x^1, y^1), \ldots, (x^4, y^4)$ in the $(x, y)$ plane]


Non-linear regression and the Kernel Trick

Non-linear regression: fit the data with a function that is not linear in the parameters.

Non-parametric regression: use the data to determine the parameters of the function, so that the problem can again be phrased as a linear regression problem.

Kernel trick: send the data to feature space with a non-linear function and perform linear regression in feature space.

$y = f(x; \theta)$,   $\theta$: parameters of the function

$y = \sum_{i=1}^{M} \alpha_i\, k(x, x^i)$,   $x^i$: datapoints, $k$: kernel function
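To make the form of this kernel predictor concrete, here is a minimal sketch, assuming an RBF kernel and already-computed coefficients $\alpha_i$ (how the $\alpha_i$ are obtained is exactly what distinguishes the methods below):

```python
import numpy as np

def rbf_kernel(x, xi, width=0.5):
    # k(x, x^i) = exp(-||x - x^i||^2 / (2 * width^2)), a common kernel choice
    return np.exp(-np.sum((x - xi) ** 2) / (2 * width ** 2))

def kernel_predictor(x, X_train, alphas, width=0.5):
    # y = sum_i alpha_i * k(x, x^i): non-linear in x, linear in the alphas
    return sum(a * rbf_kernel(x, xi, width) for a, xi in zip(alphas, X_train))
```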


Data-driven Regression

Good prediction depends on the choice of datapoints.

[Figure: blue — true function; red — estimated function]

The more datapoints, the better the fit. But the computational costs increase dramatically with the number of datapoints.

Kernel methods for regression

Several methods in ML perform non-linear regression. They differ in their objective function and in their number of parameters.

• Gaussian Process Regression (GPR) uses all datapoints.

• Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).

• Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of Gaussian functions).

[Figure: blue — true function; red — estimated function]


Kernel methods for regression

Deterministic regressive model: $y = f(x)$, $x \in \mathbb{R}^N$, $y \in \mathbb{R}$

Probabilistic regressive model: $y = f(x) + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$

Build an estimate of the noise model and then compute $f$ directly (Support Vector Regression).


Support Vector Regression


Support Vector Regression (SVR)

Assume a nonlinear mapping $f$, s.t. $y = f(x)$. How to estimate $f$ to best predict the pairs of training points $(x^i, y^i)$, $i = 1, \ldots, M$?

How can the support vector machine framework for classification be generalized to estimate continuous functions?

1. Assume a non-linear mapping through feature space and then perform linear regression in feature space.

2. Supervised learning minimizes an error function. First determine a way to measure the error on the testing set in the linear case!


Support Vector Regression

Assume a linear mapping $f$, s.t. $y = f(x) = w^T x + b$. How to estimate $w$ and $b$ to best predict the pairs of training points $(x^i, y^i)$, $i = 1, \ldots, M$?

Measure the error on the prediction, $|y - f(x)|$.

($b$ is estimated in SVR through least-squares regression on the support vectors; hence we omit it from the rest of the development.)


Support Vector Regression

Set an upper bound $\epsilon$ on the error, and consider as correctly predicted all points such that $|f(x) - y| \le \epsilon$. Penalize only the datapoints that are not contained in the $\epsilon$-tube.

[Figure: regression line with an $\epsilon$-tube of width $\pm\epsilon$ around $f(x)$]


Support Vector Regression

The $\epsilon$-margin is a measure of the width of the $\epsilon$-insensitive tube, and hence of the precision of the regression.

A small $\|w\|$ corresponds to a small slope for $f$. In the linear case $y = wx + b$, $f$ is more horizontal.

[Figure: $\epsilon$-margin around the line $y = wx + b$]


Support Vector Regression

A large $\|w\|$ corresponds to a large slope for $f$. In the linear case, $f$ is more vertical.

The flatter the slope of $f$, the larger the margin. To maximize the margin, we must minimize the norm of $w$.


Support Vector Regression

This can be rephrased as a constraint-based optimization problem of the form:

$\min_{w} \; \frac{1}{2}\|w\|^2$

subject to $(\langle w, x^i \rangle + b) - y^i \le \epsilon$ and $y^i - (\langle w, x^i \rangle + b) \le \epsilon$, for $i = 1, \ldots, M$.

We still need to penalize the points outside the $\epsilon$-insensitive tube.


Support Vector Regression

Introduce slack variables $\xi_i, \xi_i^* \ge 0$ to penalize the points outside the $\epsilon$-insensitive tube:

$\min_{w, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{i=1}^{M} (\xi_i + \xi_i^*)$

subject to $(\langle w, x^i \rangle + b) - y^i \le \epsilon + \xi_i$, $\; y^i - (\langle w, x^i \rangle + b) \le \epsilon + \xi_i^*$, $\; \xi_i \ge 0$, $\xi_i^* \ge 0$.
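As a sanity check, this primal can be handed to a generic convex solver. A minimal sketch, assuming the cvxpy library and synthetic 1-D data (production SVR solvers work on the dual instead, which is derived below):

```python
import numpy as np
import cvxpy as cp

M, d = 50, 1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (M, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(M)

C, eps = 1000.0, 0.1
w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(M, nonneg=True)       # slack above the tube
xi_star = cp.Variable(M, nonneg=True)  # slack below the tube

residual = X @ w + b - y
objective = cp.Minimize(0.5 * cp.sum_squares(w)
                        + (C / M) * cp.sum(xi + xi_star))
constraints = [residual <= eps + xi,        # <w, x^i> + b - y^i <= eps + xi_i
               -residual <= eps + xi_star]  # y^i - <w, x^i> - b <= eps + xi_i*
cp.Problem(objective, constraints).solve()
```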


Support Vector Regression

All points outside the $\epsilon$-tube (i.e. with $\xi_i > 0$ or $\xi_i^* > 0$) become support vectors.

We now have the solution to the linear regression problem. How do we generalize this to the nonlinear case?


Support Vector Regression

Lift $x$ into feature space and then perform linear regression in feature space.

Linear case: $y = f(x) = \langle w, x \rangle + b$

Non-linear case: $x \to \phi(x)$, $\; y = f(x) = \langle w, \phi(x) \rangle + b$

$w$ now lives in feature space!


Support Vector Regression

In feature space, we obtain the same constrained optimization problem:

$\min_{w, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{i=1}^{M} (\xi_i + \xi_i^*)$

subject to $(\langle w, \phi(x^i) \rangle + b) - y^i \le \epsilon + \xi_i$, $\; y^i - (\langle w, \phi(x^i) \rangle + b) \le \epsilon + \xi_i^*$, $\; \xi_i \ge 0$, $\xi_i^* \ge 0$.


Support Vector Regression

Again, we can solve this quadratic problem by introducing sets of Lagrange multipliers and writing the Lagrangian (Lagrangian = objective function + multipliers × constraints):

$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\quad - \sum_{i=1}^{M} \alpha_i \left( \epsilon + \xi_i - y^i + \langle w, \phi(x^i) \rangle + b \right)$
$\quad - \sum_{i=1}^{M} \alpha_i^* \left( \epsilon + \xi_i^* + y^i - \langle w, \phi(x^i) \rangle - b \right)$
$\quad - \sum_{i=1}^{M} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right)$


Support Vector Regression

Requiring that the partial derivatives are all zero:

$\frac{\partial L}{\partial b} = \sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{M} (\alpha_i - \alpha_i^*)\, \phi(x^i) = 0$

$\frac{\partial L}{\partial \xi_i^{(*)}} = \frac{C}{M} - \alpha_i^{(*)} - \eta_i^{(*)} = 0$

and substituting back into the primal Lagrangian, we get the dual optimization problem:

$\max_{\alpha, \alpha^*} \; -\frac{1}{2} \sum_{i,j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x^i, x^j) - \epsilon \sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y^i (\alpha_i - \alpha_i^*)$

subject to $\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$ and $\alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right]$.


Support Vector Regression

The solution is given by:

$y = f(x) = \sum_{i=1}^{M} (\alpha_i - \alpha_i^*)\, k(x, x^i) + b$

The $(\alpha_i - \alpha_i^*)$ are the linear coefficients (the Lagrange multipliers for each constraint). With a Gaussian kernel, the solution is a sum of $M$ Gaussians centered on each training datapoint.
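This kernel-expansion form can be checked directly against a fitted model. A minimal sketch assuming scikit-learn, whose SVR exposes the $(\alpha_i - \alpha_i^*)$ as `dual_coef_` (the data and hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (100, 1))
y = np.sinc(3 * X[:, 0]) + 0.05 * rng.standard_normal(100)

svr = SVR(kernel='rbf', C=10.0, epsilon=0.05, gamma=5.0).fit(X, y)

# Rebuild f(x) = sum_i (alpha_i - alpha_i^*) k(x, x^i) + b from the fitted model:
x_new = np.array([[0.3]])
k = rbf_kernel(x_new, svr.support_vectors_, gamma=5.0)   # k(x, x^i) for each SV
f_manual = k @ svr.dual_coef_.ravel() + svr.intercept_   # kernel expansion + b
assert np.allclose(f_manual, svr.predict(x_new))
```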


Support Vector Regression

The kernel places a Gauss function on each support vector; the prediction is the sum of these functions plus $b$.

[Figure: Gaussian bumps centered on the support vectors along the x-axis]


Support Vector Regression

The Lagrange multipliers define the importance of each Gaussian function.

[Figure: six support vectors $x^1, \ldots, x^6$ with multipliers $(\alpha_i - \alpha_i^*)$ of different magnitudes]

Away from the support vectors, their effect vanishes and the prediction converges to $b$.


SVR: Hyperparameters

The solution to SVR we just saw is referred to as $\epsilon$-SVR. It has two hyperparameters (see the optimization problem above):

• $C$ controls the penalty term on a poor fit.

• $\epsilon$ determines the minimal required precision.
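The three settings shown on the next slides might be reproduced with scikit-learn's SVR as sketched here; note that sklearn parameterizes the RBF kernel as $e^{-\gamma\|x - x'\|^2}$, so the conversion from a "kernel width" $l$ to $\gamma = 1/(2l^2)$ is an assumption about the slides' Gaussian-kernel convention:

```python
from sklearn.svm import SVR

def make_svr(C, epsilon, width):
    # sklearn's RBF kernel is exp(-gamma * ||x - x'||^2);
    # a Gaussian of width l maps to gamma = 1 / (2 * l**2)  (assumed convention)
    return SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=1.0 / (2 * width ** 2))

svr_good = make_svr(C=1000, epsilon=0.01, width=0.1)   # first setting below
svr_over = make_svr(C=1000, epsilon=0.01, width=0.01)  # overfits
svr_reg  = make_svr(C=100,  epsilon=0.03, width=0.1)   # reduced width sensitivity
```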


SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. [Figure: fit using $C = 1000$, $\epsilon = 0.01$, kernel width $= 0.1$]


SVR: Effect of Hyperparameters

[Figure: fit using $C = 1000$, $\epsilon = 0.01$, kernel width $= 0.01$ → overfitting]


SVR: Effect of Hyperparameters

[Figure: fit using $C = 100$, $\epsilon = 0.03$, kernel width $= 0.1$]

The effect of the kernel width on the fit can be reduced by choosing the other hyperparameters appropriately.


Support Vector Regression: ν-SVR

As the number of datapoints grows, so does the number of support vectors. Introduce a new parameter $\nu \in (0, 1]$, as in $\nu$-SVM:

• $\nu$-SVR bounds the fraction of support vectors ($\nu$ upper-bounds the fraction of points outside the tube and lower-bounds the fraction of support vectors).

• It allows the $\epsilon$-tube to be fitted automatically!

$\min_{w, \epsilon, \xi} \; \frac{1}{2}\|w\|^2 + C \left( \nu \epsilon + \frac{1}{M} \sum_{j=1}^{M} (\xi_j + \xi_j^*) \right)$
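A minimal sketch with scikit-learn's NuSVR (synthetic data; the relation between $\nu$ and the support-vector fraction is only approximate on finite samples):

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(200)

# nu in (0, 1] replaces the hand-tuned epsilon: the tube width adapts to the data.
model = NuSVR(kernel='rbf', C=100.0, nu=0.3, gamma=5.0).fit(X, y)
print(len(model.support_) / len(X))  # fraction of support vectors, governed by nu
```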


Support Vector Regression: ν-SVR — Example

[Figure: effect of the automatic adaptation of $\epsilon$ using $\nu$-SVR]


[Figure: same example with added noise on the data — the $\epsilon$-tube adapts automatically]


Kernel methods for regression

Deterministic regressive model: $y = f(x)$, $x \in \mathbb{R}^N$, $y \in \mathbb{R}$

Probabilistic regressive model: $y = f(x) + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$

Build a probabilistic estimate of the nonlinear relationship between $y$ and $x$ through the conditional density $p(y|x)$ (this estimates both the noise model and $f$), and then compute the estimate by taking the expectation over the conditional density:

$\hat{y} = E\{f(x)\} = \hat{f}(x)$

$\hat{y} = E\{p(y|x)\}$


Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y)$ across pairs of datapoints using a GMM:

$p(x, y) = \sum_{i=1}^{K} \pi_i\, p(x, y; \mu^i, \Sigma^i)$, with $p(x, y; \mu^i, \Sigma^i) = N(\mu^i, \Sigma^i)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$

[Figure: 2D projection of a Gauss function in the $(x, y)$ plane; the ellipse contour is ≈ 2 standard deviations]


Gaussian Mixture Regression (GMR)

The parameters $\{\pi_i, \mu^i, \Sigma^i\}_{i=1}^{K}$ are learned through Expectation-Maximization (EM), an iterative procedure that starts from a random initialization.
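A minimal EM fit of the joint density with scikit-learn (synthetic data; $K = 4$ and the random initialization are illustrative choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(300)
D = np.column_stack([x, y])                 # joint samples (x, y)

# EM fit of p(x, y); init_params='random' mirrors the slides' random initialization
gmm = GaussianMixture(n_components=4, covariance_type='full',
                      init_params='random', n_init=5, random_state=0).fit(D)
# gmm.weights_ = pi_i, gmm.means_ = mu^i, gmm.covariances_ = Sigma^i
```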


Gaussian Mixture Regression (GMR)

The mixing coefficients satisfy $\sum_{i=1}^{K} \pi_i = 1$.

$\pi_i$ is the probability that the $M$ datapoints were generated by Gaussian $i$:

$\pi_i = \frac{1}{M} \sum_{j=1}^{M} p(i \mid x^j)$

[Figure: two Gaussians with mixing coefficients $\pi_1, \pi_2$]


Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y)$ across pairs of datapoints using a GMM.

2) Compute the regressive signal by taking $p(y|x)$:

$p(y|x) = \sum_{i=1}^{K} \beta_i(x)\, p(y \mid x; \mu^i, \Sigma^i)$

with $\beta_i(x) = \dfrac{\pi_i\, p(x; \mu_X^i, \Sigma_{XX}^i)}{\sum_{j=1}^{K} \pi_j\, p(x; \mu_X^j, \Sigma_{XX}^j)}$

Each $p(y \mid x; \mu^i, \Sigma^i)$ is a Gauss function whose variance changes depending on the query point.


Gaussian Mixture Regression (GMR)

The factors $\beta_i(x)$ give a measure of the relative importance of each of the $K$ regressive models. They are computed anew at each query point → weighted regression: the influence of each marginal is modulated by $\beta_i(x)$.

[Figure: $\beta_1(x)$ and $\beta_2(x)$ as functions of the query point]


Gaussian Mixture Regression (GMR)

3) The regressive signal is then obtained by computing $E\{p(y|x)\}$:

$\hat{y} = E\{p(y|x)\} = \sum_{i=1}^{K} \beta_i(x)\, \tilde{y}^i(x)$

with $\tilde{y}^i(x) = \mu_Y^i + \Sigma_{YX}^i \left( \Sigma_{XX}^i \right)^{-1} (x - \mu_X^i)$

→ a linear combination of $K$ local regressive models.
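Putting steps 1–3 together, here is a minimal GMR predictor for 1-D $x$ and $y$, a sketch built on the GMM fitted in the earlier snippet (scipy is assumed for the Gaussian pdf):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_predict(x, weights, means, covs):
    """E{p(y|x)} for a GMM over (x, y) with 1-D x and 1-D y.

    means[i] = (mu_X^i, mu_Y^i); covs[i] = [[S_XX, S_XY], [S_YX, S_YY]].
    """
    K = len(weights)
    beta = np.array([weights[i] * multivariate_normal.pdf(x, means[i][0], covs[i][0, 0])
                     for i in range(K)])
    beta /= beta.sum()                       # beta_i(x): weight of Gaussian i at x
    # local linear models: y_i(x) = mu_Y^i + S_YX (S_XX)^-1 (x - mu_X^i)
    y_loc = np.array([means[i][1] + covs[i][1, 0] / covs[i][0, 0] * (x - means[i][0])
                      for i in range(K)])
    return beta @ y_loc                      # weighted combination of local models
```

For example, `gmr_predict(0.25, gmm.weights_, gmm.means_, gmm.covariances_)` evaluates the regressive signal at the query point $x = 0.25$.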


Gaussian Mixture Regression (GMR)

Computing the variance $\mathrm{var}\{p(y|x)\}$ provides information on the uncertainty of the prediction computed from the conditional distribution:

$\mathrm{var}\{p(y|x)\} = \sum_{i=1}^{K} \beta_i(x) \left( \left(\tilde{y}^i(x)\right)^2 + \tilde{\Sigma}^i \right) - \left( \sum_{i=1}^{K} \beta_i(x)\, \tilde{y}^i(x) \right)^2$, with $\tilde{\Sigma}^i = \Sigma_{YY}^i - \Sigma_{YX}^i \left( \Sigma_{XX}^i \right)^{-1} \Sigma_{XY}^i$

Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the predictor!

Gaussian Mixture Regression (GMR)

[Figure: $E\{p(y|x)\}$ plotted with a band of $\mathrm{var}\{p(y|x)\}$, showing the uncertainty of the prediction]

GMR: Sensitivity to the Choice of K and to the Initialization

[Figure: fit with 4 Gaussians, uniform initialization]

[Figure: fit with 4 Gaussians, random initialization]

[Figure: fit with 10 Gaussians, random initialization]

Kernel methods for regression

Both GMR and GPR build the probabilistic estimate $\hat{y} = E\{p(y|x)\}$, but they obtain $p(y|x)$ differently:

• Gaussian Mixture Regression (GMR) first computes $p(x, y)$ and then derives $p(y|x)$.

• Gaussian Process Regression (GPR) computes $p(y|x)$ directly.


Probability Density Function and Regression

A signal $y$ can be estimated through regression $y = f(x)$ by taking the expectation over the conditional probability of $y$ on $x$, for a choice of parameters of $p$:

$\hat{y} = E\{p(y|x)\}$

The simplest way to estimate $p(y|x)$ is through Probabilistic Regression (PR), which estimates a linear regressive model.


Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model of the form:

$y = f(x, w) = w^T x$, $\quad w, x \in \mathbb{R}^N$

If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\epsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$y = w^T x + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$


Probabilistic Regression

Training set of $M$ pairs of datapoints: $X = \{x^i\}_{i=1}^{M}$, $\mathbf{y} = \{y^i\}_{i=1}^{M}$.

The datapoints are independently and identically distributed (i.i.d.), so the likelihood of the regressive model is

$p(\mathbf{y} \mid X, w) = \prod_{i=1}^{M} p(y^i \mid x^i, w) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^i - w^T x^i)^2}{2\sigma^2} \right)$

i.e. $\mathbf{y} - w^T X \sim N(0, \sigma^2 I)$ and $\mathbf{y} \sim p(\mathbf{y} \mid X, w)$, where $w$ are the parameters of the model.


Probabilistic Regression

Prior model on the distribution of the parameters $w$:

$p(w) = N(0, \Sigma_w) \propto \exp\left( -\frac{1}{2} w^T \Sigma_w^{-1} w \right)$

The hyperparameters $\Sigma_w$ are given by the user.


Probabilistic Regression

Posterior predictive distribution for a testing point $x^*$, given the training datapoints $X, \mathbf{y}$:

$p(y^* \mid x^*, X, \mathbf{y}) = N\left( \frac{1}{\sigma^2} {x^*}^T A^{-1} X \mathbf{y},\; {x^*}^T A^{-1} x^* \right)$, with $A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$
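A minimal numpy sketch of this posterior predictive (the demo data and hyperparameter values are illustrative, not from the slides):

```python
import numpy as np

def plr_predict(x_star, X, y, sigma2, Sigma_w):
    """Posterior predictive N(mean, var) for probabilistic linear regression.

    X: (d, M) training inputs as columns; y: (M,) zero-mean targets.
    """
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)  # A = sigma^-2 X X^T + Sigma_w^-1
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma2         # (1/sigma^2) x*^T A^-1 X y
    var = x_star @ A_inv @ x_star                  # x*^T A^-1 x*
    return mean, var

# tiny demo on noisy 1-D data generated from y = 0.8 x + noise
rng = np.random.default_rng(0)
X = rng.standard_normal((1, 100))
y = 0.8 * X[0] + 0.1 * rng.standard_normal(100)
print(plr_predict(np.array([1.0]), X, y, sigma2=0.01, Sigma_w=np.eye(1)))
```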


From Probabilistic Regression to Gaussian Process Regression

How to extend the simple linear Bayesian regressive model to nonlinear regression, such that the non-linear problem becomes linear again?

$y = w^T x + \epsilon, \;\; \epsilon \sim N(0, \sigma^2) \quad \to \quad y = w^T \phi(x) + \epsilon, \;\; \epsilon \sim N(0, \sigma^2)$

[Figure: a linear fit vs. a non-linear fit of the same data]


Gaussian Process Regression

Apply a non-linear transformation $\phi$ to $x$ and perform Bayesian linear regression in feature space. The predictive distribution keeps the same form, with $x$ replaced by $\phi(x)$ (write $\Phi = \phi(X)$):

Linear: $p(y^* \mid x^*, X, \mathbf{y}) = N\left( \frac{1}{\sigma^2} {x^*}^T A^{-1} X \mathbf{y},\; {x^*}^T A^{-1} x^* \right)$, $\; A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$

Non-linear: $p(y^* \mid x^*, X, \mathbf{y}) = N\left( \frac{1}{\sigma^2} \phi(x^*)^T A^{-1} \Phi \mathbf{y},\; \phi(x^*)^T A^{-1} \phi(x^*) \right)$, with $A = \frac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1}$


Gaussian Process Regression

Take as kernel the inner product in feature space: $k(x, x') = \phi(x)^T \Sigma_w\, \phi(x')$.

The predictive mean can then be written entirely in terms of the kernel:

$\hat{y} = E\{y^* \mid x^*, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k(x^*, x^i)$, with $\alpha = \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$


Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i)$, $\quad \alpha = \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$, $\quad \sigma^2 > 0$

All the datapoints are used in the computation!
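A minimal from-scratch sketch of this predictor (numpy only; the squared-exponential kernel and its width are assumed choices, matching the Gaussian kernels discussed on the next slides):

```python
import numpy as np

def rbf(A, B, l=0.5):
    # squared-exponential kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l ** 2))

def gpr_predict(x_star, X, y, sigma2, l=0.5):
    K = rbf(X, X, l) + sigma2 * np.eye(len(X))   # K(X, X) + sigma^2 I
    alpha = np.linalg.solve(K, y)                # alpha = (K + sigma^2 I)^-1 y
    k_star = rbf(x_star[None, :], X, l).ravel()  # k(x*, x^i) for every datapoint
    mean = k_star @ alpha                        # sum_i alpha_i k(x*, x^i)
    var = rbf(x_star[None, :], x_star[None, :], l).item() \
          - k_star @ np.linalg.solve(K, k_star)  # predictive covariance (see below)
    return mean, var
```

Note that the training cost is dominated by solving against the $M \times M$ matrix $K + \sigma^2 I$, which is why GPR scales poorly with the number of datapoints.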


Gaussian Process Regression

The kernel and its hyperparameters are given by the user. They can be optimized through maximum likelihood over the marginal likelihood, i.e. $p(\mathbf{y} \mid X; \text{parameters})$.


Gaussian Process Regression

Sensitivity to the choice of the kernel width (called the lengthscale $l$ in most books) when using Gaussian kernels (also called RBF or squared exponential):

$k(x, x') = e^{-\frac{\|x - x'\|^2}{l}}$

[Figure: fit with kernel width $= 0.1$]
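A minimal scikit-learn sketch of this experiment (synthetic data; `optimizer=None` keeps the lengthscale fixed at the slide's value instead of optimizing it):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(30)

for l in (0.1, 0.5):   # the two kernel widths compared on these slides
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=l),
                                   alpha=0.05 ** 2,   # noise variance sigma^2
                                   optimizer=None).fit(X, y)
    mean, std = gpr.predict(np.array([[0.25]]), return_std=True)
    print(l, mean, std)
```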

[Figure: same data, kernel width $= 0.5$]


Gaussian Process Regression

The value of the noise $\sigma^2$ needs to be pre-set by hand. The larger the noise, the larger the uncertainty (here the noise is $\le 1$).

The predictive covariance is:

$\mathrm{cov}\{p(y^* \mid x^*, X, \mathbf{y})\} = k(x^*, x^*) - k(x^*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} k(X, x^*)$

[Figure: GPR fit with low noise: 0.05]

[Figure: GPR fit with high noise: 0.2]


Comparison Across Methods

Generalization — prediction away from the datapoints:

GPR predicts $y = 0$ away from the datapoints:

$\hat{y} = E\{y \mid x, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i)$

SVR predicts $y = b$ away from the datapoints:

$\hat{y} = \sum_{i=1}^{M} (\alpha_i - \alpha_i^*)\, k(x, x^i) + b$


Comparison Across Methods

Generalization — prediction away from the datapoints: GMR predicts the trend away from the data.


Comparison Across Methods

But the prediction away from the datapoints depends on the initialization and on the solution found during training.

The variance of $p(y|x)$ in GMR represents the uncertainty of the predictive model; the variance of $p(y|x)$ in GPR likewise represents the uncertainty of the predictive model.


Comparison Across Methods

The variance in SVR represents the $\epsilon$-tube; it does not represent the uncertainty of the model either. There is no measure of uncertainty in SVR!


SVR, GPR, GMR: Similarities

• SVR, GMR and GPR are based on the same regressive model: $y = f(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$.

• GMR and GPR are Gaussian conditional distributions.

• SVR, GMR and GPR compute a weighted combination of local predictors.

• SVR, GMR and GPR separate the input space into regions modeled by Gaussian distributions (true only when using Gaussian kernels for GPR and SVR).

• GMR can predict multi-dimensional outputs, while SVR and GPR can predict only a uni-dimensional output $y$.


SVR, GPR, GMR: Differences

SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not optimize the same objective function → they find different solutions.

• SVR:
  • minimizes the reconstruction error through convex optimization → it is ensured to find the optimal estimate, but the solution is not unique
  • usually finds a number of models ≤ the number of datapoints (the support vectors)

• GMR:
  • learns $p(x, y)$ through maximum likelihood → finds a local optimum
  • computes a generative model $p(x, y)$ from which it derives $p(y|x)$
  • starts with a number of models ≪ the number of datapoints

• GPR:
  • no optimization; analytical (optimal) solution
  • expresses $p(y|x)$ as a full density model
  • number of models = number of datapoints!


Hyperparameters of SVR, GPR, GMR

SVR, GMR and GPR all depend on hyperparameters that need to be determined beforehand. These are:

• SVR:
  • choice of the error margin $\epsilon$, which can be replaced by the choice of $\nu$ as in $\nu$-SVM
  • choice of kernel and associated kernel parameters

• GMR:
  • choice of the number of Gaussians
  • choice of the initialization (affects convergence to a local optimum)

• GPR:
  • choice of the noise parameter
  • choice of the kernel width (lengthscale)

The hyperparameters can be optimized separately; e.g. the number of Gaussians in GMR can be estimated using BIC, the lengthscale and noise of GPR can be estimated through maximum likelihood, and the kernel parameters of SVR can be optimized through grid search, as sketched below.
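For instance, the grid search mentioned for SVR might look like this with scikit-learn (the synthetic data and grid values are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, (100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)

param_grid = {'C': [1, 10, 100, 1000],
              'epsilon': [0.01, 0.03, 0.1],
              'gamma': [0.1, 1, 10]}           # RBF kernel parameter
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5,
                      scoring='neg_mean_squared_error').fit(X, y)
print(search.best_params_)                     # selected hyperparameters
```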


Conclusion

There is no easy way to determine which regression technique best fits a given problem.

Method | Training                                                 | Testing
SVR    | Convex optimization                                      | Few support vectors or Gaussian functions (a small fraction of the original data)
GMR    | EM, an iterative technique that needs several runs       | Few Gaussian functions (a small fraction of the original data)
GPR    | Analytical solution (one shot, but uses all datapoints)  | Keeps all the datapoints