MACHINE LEARNING
Kernel Methods for Regression
Support Vector Regression
Gaussian Mixture Regression
Gaussian Process Regression
Problem Statement
Predict $y$ given input $x$ through a non-linear function $f$: $y = f(x)$.
Estimate $f$ that best predicts the set of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$.
[Figure: training pairs $(x^1, y^1), \dots, (x^4, y^4)$ in the $(x, y)$ plane.]
Non-linear regression and the Kernel Trick
Non-Linear regression: Fit data with a function that is not linear in the
parameters
Non-parametric regression: use the data to determine the parameters
of the function so that the problem can be again phrased as a linear
regression problem.
Kernel Trick: map the data into feature space through a non-linear function and
perform linear regression in feature space.

$y = f(x; \theta)$,   $\theta$: parameters of the function

$y = \sum_i \alpha_i \, k(x^i, x)$,   $x^i$: datapoints, $k$: kernel function
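To make the kernel expansion above concrete, here is a minimal sketch (not from the slides) of predicting with $y = \sum_i \alpha_i\, k(x^i, x)$ using a Gaussian (RBF) kernel. The coefficients alpha are assumed to come from some training procedure; plain kernel ridge regression is used here purely for illustration.

```python
import numpy as np

def rbf_kernel(A, B, width=0.1):
    """Gaussian (RBF) kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 * width^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

# Toy training set {(x^i, y^i)}
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(20)

# Illustrative training step: kernel ridge regression, alpha = (K + lam*I)^-1 y
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

# Prediction at query points: y(x) = sum_i alpha_i k(x^i, x)
Xq = np.linspace(0, 1, 200)[:, None]
y_pred = rbf_kernel(Xq, X) @ alpha
```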
Data-driven Regression
Good prediction depends on the choice of datapoints.
The more datapoints, the better the fit.
Computational costs increase dramatically with the number of datapoints.
[Figure: blue: true function; red: estimated function.]
Kernel methods for regression
Several methods in ML perform non-linear regression. They differ in their objective
function and in the number of parameters:
Gaussian Process Regression (GPR) uses all datapoints.
Support Vector Regression (SVR) picks a subset of datapoints (the support vectors).
Gaussian Mixture Regression (GMR) generates a new set of datapoints
(the centers of the Gaussian functions).
[Figure: blue: true function; red: estimated function.]
Kernel methods for regression
Deterministic regressive model:   $y = f(x)$,   $x \in \mathbb{R}^N$, $y \in \mathbb{R}$
Probabilistic regressive model:   $y = f(x) + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
Build an estimate of the noise model and then compute f directly
(Support Vector Regression).
Support Vector Regression
Support Vector Regression (SVR)
Assume a nonlinear mapping $f$, s.t. $y = f(x)$.
How to estimate $f$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?
How to generalize the support vector machine framework for classification
to the estimation of continuous functions?
1. Assume a non-linear mapping through feature space and then perform
linear regression in feature space.
2. Supervised learning minimizes an error function.
First, determine a way to measure the error on the training set in the linear case!
Support Vector Regression
Assume a linear mapping $f$, s.t. $y = f(x) = w^T x + b$.
How to estimate $w$ and $b$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?
Measure the error on the prediction $y - f(x)$.
(As in SVM, b is estimated through least-squares regression on the support vectors;
hence we omit it from the rest of the development.)
Support Vector Regression
Set an upper bound $\epsilon$ on the error and consider as correctly classified all points
such that $|f(x) - y| \le \epsilon$.
Penalize only datapoints that are not contained in the $\epsilon$-tube.
[Figure: the $\epsilon$-tube $f(x) \pm \epsilon$ around the regression function.]
Support Vector Regression
The $\epsilon$-margin is a measure of the width of the $\epsilon$-insensitive tube, and hence
of the precision of the regression.
A small $\|w\|$ corresponds to a small slope for $f$: in the linear case $y = wx + b$,
$f$ is more horizontal.
[Figure: $\epsilon$-margin around the line $y = wx + b$.]
Support Vector Regression
A large $\|w\|$ corresponds to a large slope for $f$: in the linear case $y = wx + b$,
$f$ is more vertical.
The flatter the slope of the function $f$, the larger the margin.
To maximize the margin, we must minimize the norm of $w$.
[Figure: $\epsilon$-margin around the line $y = wx + b$.]
Support Vector Regression
This can be rephrased as a constraint-based optimization problem of the form:

$\text{minimize } \frac{1}{2}\|w\|^2$
$\text{subject to } \begin{cases} \langle w, x^i \rangle + b - y^i \le \epsilon \\ y^i - \langle w, x^i \rangle - b \le \epsilon \end{cases} \quad i = 1, \dots, M$

Need to penalize points outside the $\epsilon$-insensitive tube.
Support Vector Regression
To penalize points outside the $\epsilon$-insensitive tube, introduce slack variables $\xi_i, \xi_i^* \ge 0$:

$\text{minimize } \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\text{subject to } \begin{cases} \langle w, x^i \rangle + b - y^i \le \epsilon + \xi_i \\ y^i - \langle w, x^i \rangle - b \le \epsilon + \xi_i^* \\ \xi_i \ge 0, \ \xi_i^* \ge 0 \end{cases}$

All points outside the $\epsilon$-tube become support vectors.
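As an aside (not part of the slides), the linear ε-SVR primal above can be written directly as a convex program. A minimal sketch with cvxpy, on toy data and with illustrative hyperparameter values:

```python
import numpy as np
import cvxpy as cp

# Toy 1-D data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 1))
y = 2.0 * X[:, 0] + 0.5 + 0.05 * rng.standard_normal(30)

M, d = X.shape
C, eps = 10.0, 0.1                     # penalty weight and tube half-width

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(M, nonneg=True)       # slack above the tube
xi_star = cp.Variable(M, nonneg=True)  # slack below the tube

objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / M) * cp.sum(xi + xi_star))
constraints = [X @ w + b - y <= eps + xi,
               y - (X @ w + b) <= eps + xi_star]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```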
We now have the solution to the linear regression problem.
How to generalize this to the nonlinear case?
Support Vector Regression
Lift $x$ into feature space and then perform linear regression in feature space.
Linear case:   $y = f(x) = \langle w, x \rangle + b$
Non-linear case:   $x \rightarrow \phi(x)$,   $y = f(x) = \langle w, \phi(x) \rangle + b$
$w$ lives in feature space!
Support Vector Regression
In feature space, we obtain the same constrained optimization problem:

$\text{minimize } \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\text{subject to } \begin{cases} \langle w, \phi(x^i) \rangle + b - y^i \le \epsilon + \xi_i \\ y^i - \langle w, \phi(x^i) \rangle - b \le \epsilon + \xi_i^* \\ \xi_i \ge 0, \ \xi_i^* \ge 0 \end{cases}$
Support Vector Regression
Again, we can solve this quadratic problem by introducing sets of
Lagrange multipliers and writing the Lagrangian:
Lagrangian = objective function + $\lambda\,\cdot$ constraints

$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\xi_i + \frac{C}{M}\sum_{i=1}^{M}\xi_i^* - \sum_{i=1}^{M}\left(\eta_i\xi_i + \eta_i^*\xi_i^*\right)$
$\qquad - \sum_{i=1}^{M}\alpha_i\left(\epsilon + \xi_i - y^i + \langle w, \phi(x^i) \rangle + b\right) - \sum_{i=1}^{M}\alpha_i^*\left(\epsilon + \xi_i^* + y^i - \langle w, \phi(x^i) \rangle - b\right)$
Support Vector Regression
Requiring that the partial derivatives are all zero,

$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}(\alpha_i^* - \alpha_i) = 0; \quad \frac{\partial L}{\partial w} = w - \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,\phi(x^i) = 0; \quad \frac{\partial L}{\partial \xi_i^{(*)}} = \frac{C}{M} - \alpha_i^{(*)} - \eta_i^{(*)} = 0$

and replacing in the primal Lagrangian, we get the dual optimization problem:

$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i,j=1}^{M}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,k(x^i, x^j) - \epsilon\sum_{i=1}^{M}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y^i(\alpha_i - \alpha_i^*)$

$\text{subject to } \sum_{i=1}^{M}(\alpha_i - \alpha_i^*) = 0 \text{ and } \alpha_i, \alpha_i^* \in \left[0, \tfrac{C}{M}\right]$
Support Vector Regression
The solution is given by:

$y = f(x) = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\, k(x^i, x) + b$

The $(\alpha_i - \alpha_i^*)$ are the linear coefficients (Lagrange multipliers for each constraint).
If a Gaussian kernel is used, the solution is a sum of M Gaussians centered on the training
datapoints: the kernel places a Gauss function on each support vector, and the Lagrange
multipliers define the importance of each Gaussian function.
Away from the support vectors their effect vanishes and the prediction converges to b.
[Figure: weighted Gauss functions centered on the support vectors $x^1, \dots, x^6$, with the
Lagrange multipliers as weights; the prediction converges to b where the support vectors' effect
has vanished.]
SVR: Hyperparameters

$\text{minimize } \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\text{subject to } \begin{cases} \langle w, \phi(x^i) \rangle + b - y^i \le \epsilon + \xi_i \\ y^i - \langle w, \phi(x^i) \rangle - b \le \epsilon + \xi_i^* \\ \xi_i \ge 0, \ \xi_i^* \ge 0 \end{cases}$

The solution to SVR we just saw is referred to as $\epsilon$-SVR.
Two hyperparameters:
C controls the penalty term on a poor fit.
$\epsilon$ determines the minimal required precision.
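As an illustration (not from the slides), a minimal sketch of ε-SVR with scikit-learn, using hyperparameter values of the same order as those on the following slides; the gamma conversion assumes an RBF kernel of the form $k(x, x') = \exp(-\|x - x'\|^2 / (2\,\mathrm{width}^2))$:

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: noisy sine
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, (50, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(50)

# epsilon-SVR with an RBF kernel: C penalizes points outside the tube,
# epsilon sets the tube half-width, gamma = 1 / (2 * kernel_width^2).
svr = SVR(kernel="rbf", C=1000.0, epsilon=0.01, gamma=1.0 / (2 * 0.1 ** 2))
svr.fit(X, y)

print("number of support vectors:", len(svr.support_))
y_pred = svr.predict(X)
```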
SVR: Effect of Hyperparameters
Effect of the RBF kernel width on the fit. Here fit using C=1000, $\epsilon$=0.01,
kernel width=0.1.
SVR: Effect of Hyperparameters
Effect of the RBF kernel width on the fit. Here fit using C=1000, $\epsilon$=0.01,
kernel width=0.01 → overfitting.
SVR: Effect of Hyperparameters
Effect of the RBF kernel width on the fit. Here fit using C=100, $\epsilon$=0.03,
kernel width=0.1. Choosing appropriate hyperparameters reduces the effect of the
kernel width on the fit.
Support Vector Regression: ν-SVR
As the number of data grows, so does the number of support vectors.
Introduce a new parameter $\nu \in [0, 1]$, as in ν-SVM:
ν-SVR puts a bound on the fraction of support vectors.
It allows the $\epsilon$-tube to be fitted automatically!

$\min_{w, \epsilon, \xi} \; \frac{1}{2}\|w\|^2 + C\left(\nu\epsilon + \frac{1}{M}\sum_{j=1}^{M}(\xi_j + \xi_j^*)\right)$
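For illustration only (not from the slides), a minimal ν-SVR sketch using scikit-learn's NuSVR, where nu replaces the explicit ε and the tube width is adapted automatically; all values below are illustrative:

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, (50, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

# nu in (0, 1] bounds the fraction of support vectors; epsilon is not given explicitly.
nu_svr = NuSVR(kernel="rbf", C=100.0, nu=0.2, gamma=50.0)
nu_svr.fit(X, y)
print("fraction of support vectors:", len(nu_svr.support_) / len(X))
```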
Support Vector Regression: ν-SVR (Example)
Effect of the automatic adaptation of $\epsilon$ when using ν-SVR.
Support Vector Regression: ν-SVR (Example)
Effect of the automatic adaptation of $\epsilon$ when using ν-SVR, with added noise on the data.
Kernel methods for regression
Deterministic regressive model:   $y = f(x)$,   $x \in \mathbb{R}^N$, $y \in \mathbb{R}$
Probabilistic regressive model:   $y = f(x) + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
One option is to build an estimate of the noise model and then compute f directly
(Support Vector Regression).
Alternatively, build a probabilistic estimate of the nonlinear relationship between y and x
through the conditional density $p(y|x)$ (this estimates the noise model and f), and then
compute the estimate by taking the expectation over the conditional density:
$\hat{y} = E\{f(x)\} = f(x)$,   $\hat{y} = E\{p(y \mid x)\}$
Gaussian Mixture Regression (GMR)
1) Estimate the joint density, p(x,y), across pairs of datapoints using a GMM:

$p(x, y) = \sum_{i=1}^{K} \pi_i\, p^i(x, y)$,   with   $p^i(x, y) = N\!\left(x, y;\ \mu^i, \Sigma^i\right)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.
[Figure: 2D projection of each Gauss function; the ellipse contour corresponds to ~2 standard deviations.]
The parameters are learned through Expectation-Maximization, an iterative procedure
that starts from a random initialization.
Mixing coefficients $\pi_i$, with $\sum_{i=1}^{K}\pi_i = 1$: the probability that the M datapoints
were generated by Gaussian $i$, $\pi_i \propto \sum_{j=1}^{M} p(i \mid x^j)$.
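As an illustration (not from the slides), the joint density p(x, y) in step 1 can be fitted with scikit-learn's GaussianMixture on the stacked [x, y] data; K and the initialization settings below are illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

# Fit the joint density p(x, y) with K Gaussians (EM, several random restarts)
gmm = GaussianMixture(n_components=4, covariance_type="full", n_init=5)
gmm.fit(np.column_stack([x, y]))

# Mixing coefficients pi_i, means mu^i and covariances Sigma^i of the joint model
print(gmm.weights_, gmm.means_.shape, gmm.covariances_.shape)
```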
Gaussian Mixture Regression (GMR)
1) Estimate the joint density p(x, y) across pairs of datapoints using a GMM (as above).
2) Compute the regressive signal by taking p(y|x):

$p(y \mid x) = \sum_{i=1}^{K} \beta_i(x)\, p^i(y \mid x)$,   with   $\beta_i(x) = \frac{\pi_i\, p^i(x;\ \mu_X^i, \Sigma_{XX}^i)}{\sum_{j=1}^{K} \pi_j\, p^j(x;\ \mu_X^j, \Sigma_{XX}^j)}$

Each $p^i(y \mid x)$ is a Gauss function; its variance changes depending on the query point.
The factors $\beta_i(x)$ give a measure of the relative importance of each of the K regressive
models. They are computed at each query point → weighted regression.
[Figure: the influence of each marginal, $\beta_1(x)$ and $\beta_2(x)$, is modulated by the query point.]
Gaussian Mixture Regression (GMR)
3) The regressive signal is then obtained by computing E{p(y|x)}:

$E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta_i(x)\, \tilde{\mu}^i(x)$,   with   $\tilde{\mu}^i(x) = \mu_Y^i + \Sigma_{YX}^i \left(\Sigma_{XX}^i\right)^{-1}\left(x - \mu_X^i\right)$

This is a linear combination of K local regressive models, weighted by $\beta_1(x), \beta_2(x), \dots$
[Figure: the local regressive models and their weights $\beta_1(x)$, $\beta_2(x)$.]
Gaussian Mixture Regression (GMR)
Computing the variance var{p(y|x)} provides information on the uncertainty of the
prediction computed from the conditional distribution:

$\text{var}\{p(y \mid x)\} = \sum_{i=1}^{K}\beta_i(x)\left(\tilde{\Sigma}^i + \tilde{\mu}^i(x)^2\right) - \left(\sum_{i=1}^{K}\beta_i(x)\,\tilde{\mu}^i(x)\right)^2$,   with   $\tilde{\Sigma}^i = \Sigma_{YY}^i - \Sigma_{YX}^i\left(\Sigma_{XX}^i\right)^{-1}\Sigma_{XY}^i$

Careful: this is not the uncertainty of the model. Use the likelihood to compute the
uncertainty of the predictor!
[Figure: local models weighted by $\beta_1(x)$, $\beta_2(x)$.]
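A minimal numpy sketch (not from the slides) of the GMR equations above for 1-D x and y, assuming the joint GMM parameters (weights, means, covariances) have already been fitted, e.g. with the GaussianMixture fit shown earlier:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_predict(x, weights, means, covs):
    """Conditional mean and variance of p(y | x) for a GMM fitted on joint (x, y) data.

    weights: (K,) mixing coefficients pi_i
    means:   (K, 2) joint means [mu_x, mu_y]
    covs:    (K, 2, 2) joint covariances
    """
    K = len(weights)
    beta = np.array([weights[i] * multivariate_normal.pdf(x, means[i, 0], covs[i, 0, 0])
                     for i in range(K)])
    beta /= beta.sum()                                     # beta_i(x)

    # Local conditional means and variances of y given x
    mu_loc = means[:, 1] + covs[:, 1, 0] / covs[:, 0, 0] * (x - means[:, 0])
    var_loc = covs[:, 1, 1] - covs[:, 1, 0] ** 2 / covs[:, 0, 0]

    mean = beta @ mu_loc                                   # E{p(y|x)}
    var = beta @ (var_loc + mu_loc ** 2) - mean ** 2       # var{p(y|x)}
    return mean, var

# Usage with a fitted sklearn GaussianMixture `gmm`:
#   gmr_predict(0.5, gmm.weights_, gmm.means_, gmm.covariances_)
```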
Gaussian Mixture Regression (GMR)
[Figure: $E\{p(y \mid x)\}$ plotted with an envelope given by $\text{var}\{p(y \mid x)\}$, showing the
uncertainty of the prediction computed from the conditional distribution.]
GMR: Sensitivity to Choice of K and Initialization
[Figures: fit with 4 Gaussians, uniform initialization; fit with 4 Gaussians, random
initialization; fit with 10 Gaussians, random initialization.]
Kernel methods for regression
Deterministic regressive model:   $y = f(x)$,   $x \in \mathbb{R}^N$, $y \in \mathbb{R}$
Probabilistic regressive model:   $y = f(x) + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
Probabilistic estimate of the nonlinear relationship between y and x through the
conditional density $p(y|x)$ (this estimates the noise model and f), and then compute the
estimate by taking the expectation over the conditional density: $\hat{y} = E\{p(y \mid x)\}$.
Gaussian Mixture Regression (GMR) first computes p(x, y) and then derives p(y|x).
Gaussian Process Regression (GPR) computes p(y|x) directly.
Probability Density Function and Regression
A signal y can be estimated through regression y = f(x) by taking the expectation over
the conditional probability $p(y \mid x)$, for a choice of parameters of p:
$\hat{y} = E\{p(y \mid x)\}$
The simplest way to estimate p(y|x) is through Probabilistic Regression, which
estimates a linear regressive model.
Probabilistic Regression (PR)
PR is a statistical approach to classical linear regression that estimates the relationship
between zero-mean variables y and x by building a linear model of the form:
$y = f(x, w) + \epsilon = w^T x + \epsilon$,   $w, x \in \mathbb{R}^N$
If one assumes that the observed values of y differ from f(x) by an additive noise that
follows a zero-mean Gaussian distribution (such an assumption consists of putting a
prior distribution over the noise), then:
$y = w^T x + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
Probabilistic Regression
Training set of M pairs of datapoints: $X, y = \{x^i, y^i\}_{i=1}^{M}$
Regressive model: $y = w^T X + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$
Likelihood of the regressive model: $y \sim p(y \mid X, w, \sigma)$
Parameters of the model: $w, \sigma$
Datapoints are independently and identically distributed (i.i.d.):

$p(y \mid X, w, \sigma) = \prod_{i=1}^{M} p(y^i \mid x^i, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^i - w^T x^i)^2}{2\sigma^2}\right)$
Probabilistic Regression
Prior model on the distribution of the parameter w:

$p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

$\Sigma_w$: hyperparameters, given by the user.
Probabilistic Regression
Posterior predictive distribution:

$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, {x^*}^T A^{-1} X\, y,\ \ {x^*}^T A^{-1} x^*\right)$,   with   $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$

$x^*$: testing point;   $X, y$: training datapoints.
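A minimal numpy sketch (not in the slides) of the posterior predictive above, using the convention that the columns of X are the training inputs; the toy data and prior are illustrative:

```python
import numpy as np

def bayes_linreg_predict(x_star, X, y, sigma2, Sigma_w):
    """Posterior predictive mean and variance of Bayesian linear regression.

    X: (N, M) training inputs as columns, y: (M,) outputs, x_star: (N,) test input,
    sigma2: noise variance, Sigma_w: (N, N) prior covariance of w.
    """
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma2
    var = x_star @ A_inv @ x_star
    return mean, var

# Toy usage: one feature plus a bias term, so N = 2
X = np.vstack([np.linspace(0, 1, 20), np.ones(20)])
y = np.array([2.0, -0.5]) @ X + 0.05 * np.random.randn(20)
m, v = bayes_linreg_predict(np.array([0.5, 1.0]), X, y, sigma2=0.05**2, Sigma_w=np.eye(2))
```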
From Probabilistic Regression to Gaussian Process Regression
How to extend the simple linear Bayesian regressive model to nonlinear regression,
such that the non-linear problem becomes linear again?
$y = w^T x + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$
[Figure: a linear model fits the data poorly when the true relationship between x and y is non-linear.]
Gaussian Process Regression
How to extend the simple linear Bayesian regressive model to nonlinear regression,
such that the non-linear problem becomes linear again?
Apply a non-linear transformation $x \rightarrow \phi(x)$, i.e. $y = w^T \phi(x) + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$.
Linear case:
$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, {x^*}^T A^{-1} X\, y,\ \ {x^*}^T A^{-1} x^*\right)$,   with   $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$
In feature space:
$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x^*)^T A^{-1} \Phi(X)\, y,\ \ \phi(x^*)^T A^{-1} \phi(x^*)\right)$,   with   $A = \tfrac{1}{\sigma^2} \Phi(X)\Phi(X)^T + \Sigma_w^{-1}$
Gaussian Process Regression
$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x^*)^T A^{-1} \Phi(X)\, y,\ \ \phi(x^*)^T A^{-1} \phi(x^*)\right)$,   with   $A = \tfrac{1}{\sigma^2} \Phi(X)\Phi(X)^T + \Sigma_w^{-1}$
Take as kernel the inner product in feature space: $k(x, x') = \phi(x)^T \Sigma_w\, \phi(x')$
The predictive mean can then be written entirely in terms of the kernel:

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$
Gaussian Process Regression
$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$,   $\sigma^2 > 0$
All datapoints are used in the computation!
Gaussian Process Regression
$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$
The kernel and its hyperparameters are given by the user.
These can be optimized by maximizing the marginal likelihood, i.e. p(y|X; parameters).
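For illustration (not from the slides), a minimal GPR sketch with scikit-learn, where fit() maximizes the log marginal likelihood p(y|X; parameters) over the kernel length-scale and noise level; all values below are illustrative starting points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)

# RBF length_scale plays the role of the kernel width; WhiteKernel models the noise.
kernel = RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)

print(gpr.kernel_)                                   # optimized hyperparameters
mean, std = gpr.predict(np.array([[0.5]]), return_std=True)
```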
Gaussian Process Regression
Sensitivity to the choice of kernel width (called length-scale in most books) when using
Gaussian kernels (also called RBF or squared exponential):
$k(x, x') = e^{-\frac{\|x - x'\|^2}{l}}$,   kernel width = 0.1.
Gaussian Process Regression
Sensitivity to the choice of kernel width (length-scale) when using Gaussian kernels:
$k(x, x') = e^{-\frac{\|x - x'\|^2}{l}}$,   kernel width = 0.5.
Gaussian Process Regression
$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$

$\text{cov}\{p(y \mid x, X, y)\} = K(x, x) - K(x, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, x)$

The value of the noise $\sigma^2$ needs to be pre-set by hand.
The larger the noise, the more uncertainty. The noise is $\le 1$.
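A minimal numpy sketch (not from the slides) of the predictive mean and covariance formulas above for 1-D inputs; the kernel form and the width and noise values are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, width=0.1):
    """Gaussian kernel matrix k(a, b) = exp(-(a - b)^2 / width) for 1-D points."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / width)

def gp_predict(x_query, X, y, width=0.1, noise=0.05):
    """GPR predictive mean and covariance at the query points."""
    K = rbf(X, X, width) + noise * np.eye(len(X))      # K(X,X) + sigma^2 I
    alpha = np.linalg.solve(K, y)                      # alpha = (K + sigma^2 I)^-1 y
    k_q = rbf(x_query, X, width)                       # K(x, X)
    mean = k_q @ alpha                                 # sum_i alpha_i k(x^i, x)
    cov = rbf(x_query, x_query, width) - k_q @ np.linalg.solve(K, k_q.T)
    return mean, cov

X = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * X) + 0.05 * np.random.randn(30)
mean, cov = gp_predict(np.array([0.25, 0.5]), X, y)
```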
Gaussian Process Regression
[Figures: fit with low noise (0.05) vs. fit with high noise (0.2).]
Comparison Across Methods
Generalization: prediction away from the datapoints.
GPR predicts y = 0 away from the datapoints:   $\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$
SVR predicts y = b away from the datapoints:   $y = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\, k(x^i, x) + b$
Comparison Across Methods
Generalization – prediction away from datapoints
GMR predicts the trend away from data
Comparison Across Methods
Generalization: prediction away from the datapoints.
But the prediction depends on the initialization and on the solution found during training.
The variance of p(y|x) in GMR represents the uncertainty of the predictive model.
The variance of p(y|x) in GPR represents the uncertainty of the predictive model.
Comparison Across Methods
The variance in SVR represents the $\epsilon$-tube and does not represent the uncertainty of
the model either: there is no measure of uncertainty in SVR!
SVR, GPR, GMR: Similarities
• SVR, GMR and GPR are based on the same regressive model: $y = f(x) + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$
• GMR and GPR are Gaussian conditional distributions.
• SVR, GMR and GPR compute a weighted combination of local predictors.
• SVR, GMR and GPR separate the input space into regions modeled by Gaussian
distributions (true only when using Gaussian kernels for GPR and SVR).
• GMR allows multi-dimensional outputs to be predicted, while SVR and GPR can
predict only a uni-dimensional output y.
SVR, GPR, GMR: Differences
SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not
optimize the same objective function → they find different solutions.
• SVR:
• minimizes the reconstruction error through convex optimization → ensured to find
the optimal estimate, but the solution is not unique
• usually finds a number of models <= the number of datapoints (the support vectors)
• GMR:
• learns p(x,y) through maximum likelihood → finds a local optimum
• computes a generative model p(x,y) from which it derives p(y|x)
• starts with a low number of models << the number of datapoints
• GPR:
• no optimization; analytical (optimal) solution
• expresses p(y|x) as a full density model
• number of models = number of datapoints!
Hyperparameters of SVR, GPR, GMR
SVR, GMR and GPR all depend on hyperparameters that need to be determined
beforehand. These are:
• SVR:
• choice of the error margin $\epsilon$, which can be replaced by the choice of $\nu$ as in ν-SVR
• choice of kernel and associated kernel parameters
• GMR:
• choice of the number of Gaussians
• choice of the initialization (affects convergence to a local optimum)
• GPR:
• choice of the noise parameter
• choice of the kernel width (length-scale)
The hyperparameters can also be optimized separately; e.g. the number of Gaussians in GMR
can be estimated using BIC, the length-scale and noise of GPR can be estimated through maximum
likelihood, and the kernel parameters of SVR can be optimized through grid search (see the
sketch below).
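As an illustration of the last point (not from the slides), a minimal grid search over the SVR hyperparameters with scikit-learn; the grid values and the cross-validation score are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)

# Cross-validated grid search over C, epsilon and the RBF kernel parameter gamma
grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [1, 10, 100, 1000],
                                "epsilon": [0.01, 0.03, 0.1],
                                "gamma": [1, 10, 50, 100]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```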
Conclusion
No easy way to determine which regression technique best fits your problem.

        Training                                            Testing
SVR     Convex optimization                                 Few support vectors (small fraction of the original data)
GMR     EM, iterative technique, needs several runs         Few Gaussian functions (small fraction of the original data)
GPR     Analytical solution; one shot, but uses all data    Keeps all the datapoints