APPLIED MACHINE LEARNING
Gaussian Mixture Regression
Brief summary of last week’s lecture
Locally Weighted Regression
Estimate is determined through the local influence of each group of datapoints; $x$ is the query point. The method generates a smooth function $\hat{y}(x)$:

$\hat{y}(x) = \sum_{i=1}^{M} \beta^i(x)\, y^i \Big/ \sum_{j=1}^{M} \beta^j(x)$, with $\beta^i(x)$: weights, function of $x$

$\beta^i(x) = K(d(x, x^i))$, with $K(d(x, x^i)) = e^{-d(x, x^i)}$ and $d(x, x^i) = \left\| x - x^i \right\|^2$.
Model-free regression!
There is no longer an explicit model of the form $y = w^T x$: the regression is computed at each query point and depends on the training points, through the weights $\beta^i(x) = K(d(x, x^i))$.
Which training points?
Which kernel?
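The estimate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code; the `width` bandwidth parameter is an assumption (the slides leave the kernel scale open).

```python
import numpy as np

def lwr_predict(x_query, X, y, width=1.0):
    """Locally weighted regression estimate at one query point.

    Weighted average of the training targets, with Gaussian kernel
    weights beta^i(x) = exp(-d(x, x^i) / width), where d is the squared
    Euclidean distance.  'width' is an assumed bandwidth parameter.
    """
    d = np.sum((X - x_query) ** 2, axis=1)   # d(x, x^i) = ||x - x^i||^2
    beta = np.exp(-d / width)                # beta^i(x) = K(d(x, x^i))
    return np.sum(beta * y) / np.sum(beta)   # weighted mean over all M points

# Noisy samples of sin(x): the estimate is a smooth function of the query
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=100)
print(lwr_predict(np.array([np.pi / 2]), X, y, width=0.5))  # close to 1.0
```

Note how every training point enters every prediction: this is the cost issue raised on the next slides.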
MACHINE LEARNING – 2012
Data-driven Regression

[Figure: green, true function; blue, estimated function.]

Good prediction depends on the choice of datapoints.
Good prediction depends on the choice of datapoints. The more datapoints, the better the fit, but the computational cost increases dramatically with the number of datapoints.
Several methods in ML perform non-linear regression; they differ in their objective function and in the number of parameters.
Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).
[Figure: $y = f(x)$. For illustrative purposes, the negative Gauss functions are plotted next to the support vectors, though they are distributed on the negative y axis.]
SVR solution: $y = f(x) = \sum_{i=1}^{M} \alpha_i^{*}\, k(x, x^i) + b$

The analytical solution is found by solving a convex optimization problem. The Lagrange multipliers $\alpha_i^{*}$ define the importance of each Gaussian function; the prediction converges to $b$ where the effect of the support vectors vanishes.
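As a sketch of this behavior (using scikit-learn's `SVR` rather than the lecture's own solver; the toy sine data is an assumption), an RBF-kernel SVR keeps only a subset of the data as support vectors, and far from the data the kernel terms vanish so the prediction falls back to the bias $b$:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 80)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=80)

# RBF-kernel SVR: y = sum_i alpha_i* k(x, x^i) + b over the support vectors
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print(len(model.support_vectors_), "support vectors out of", len(X))

# Far from the data every kernel term vanishes and the prediction tends to b
far = model.predict(np.array([[100.0]]))[0]
print(far, "vs b =", model.intercept_[0])
```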
Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of Gaussian functions).
Gaussian Mixture Regression
Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y)$ across pairs of datapoints using a GMM:

$p(x, y) = \sum_{i=1}^{K} \pi^i\, p^i(x, y)$, with $p^i(x, y) = \mathcal{N}(x, y;\, \mu^i, \Sigma^i)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.

[Figure: 2D projection of a Gauss function; each ellipse contour ≈ 2 standard deviations.]
The parameters are learned through Expectation-Maximization (EM), an iterative procedure that starts from a random initialization.
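Step 1 can be sketched with scikit-learn's `GaussianMixture`, which runs EM internally (the sine-shaped joint data is an assumed toy example; `n_init` restarts EM from several random initializations, which matters given the local optima discussed later):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy joint samples (x, y) along a noisy sine curve (an assumed dataset)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.05 * rng.normal(size=200)
data = np.column_stack([x, y])

# EM fit of p(x, y); n_init restarts EM from several random initializations
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(data)
print(gmm.weights_)        # mixing coefficients pi^i (they sum to 1)
print(gmm.means_.shape)    # K means mu^i over the joint space: (3, 2)
```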
Mixing coefficients $\pi^i$, with $\sum_{i=1}^{K} \pi^i = 1$, give the relative importance of each Gaussian $i$:

$\pi^i = p(i) = \frac{1}{M} \sum_{j=1}^{M} p(i \mid x^j)$
2) Compute the regressive signal by taking the conditional $p(y \mid x)$:

$p(y \mid x) = \sum_{i=1}^{K} \beta^i(x)\, p^i(y \mid x)$, with $\beta^i(x) = \dfrac{\pi^i\, \mathcal{N}(x;\, \mu_x^i, \Sigma_{xx}^i)}{\sum_{j=1}^{K} \pi^j\, \mathcal{N}(x;\, \mu_x^j, \Sigma_{xx}^j)}$

Each $p^i(y \mid x)$ is a Gauss function whose variance changes depending on the query point.
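The weights $\beta^i(x)$ can be sketched directly from GMM parameters. This is an illustrative snippet, not reference code: the 1-D input (first joint coordinate) and the two toy components are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def beta_weights(x_query, weights, means, covs):
    """beta^i(x): relative responsibility of Gaussian i at query point x.

    weights, means, covs are GMM parameters over the joint (x, y);
    the x-marginal of component i is N(x; mu_x^i, Sigma_xx^i).
    For brevity, x is assumed 1-D (the first joint coordinate).
    """
    lik = np.array([
        w * multivariate_normal.pdf(x_query, mean=m[0], cov=c[0, 0])
        for w, m, c in zip(weights, means, covs)
    ])
    return lik / lik.sum()       # normalize so the betas sum to 1

# Two hypothetical components, centered at x = 0 and x = 4
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [4.0, 1.0]])
cov = np.array([np.eye(2), np.eye(2)])
b = beta_weights(0.0, w, mu, cov)
print(b)  # the first component dominates near x = 0
```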
The influence of each marginal $p^i(y \mid x)$ is modulated by $\beta^i(x)$ at the query point $x$.
3) The regressive signal is then obtained by computing $E\{p(y \mid x)\}$:

$\hat{y} = E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$, with $\tilde{y}^i(x) = \mu_y^i + \Sigma_{yx}^i \left(\Sigma_{xx}^i\right)^{-1} \left(x - \mu_x^i\right)$

a linear combination of $K$ local regressive models.

The covariance matrix of each Gauss function can be decomposed into blocks of matrices,

$\Sigma^i = \begin{pmatrix} \Sigma_{xx}^i & \Sigma_{xy}^i \\ \Sigma_{yx}^i & \Sigma_{yy}^i \end{pmatrix}$

with $\Sigma_{xx}^i$ and $\Sigma_{yy}^i$ the covariance matrices on $x$ and $y$, and $\Sigma_{xy}^i$ the cross-covariance matrix.
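The three steps can be collected into one function. This is a sketch under the formulas above (block-sliced covariances, weighted local linear models), with the variance combined as on the next slide; it is not a reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr(x_query, weights, means, covs, in_idx, out_idx):
    """Gaussian Mixture Regression: returns E{p(y|x)} and var{p(y|x)}.

    Each joint covariance is sliced into the blocks Sigma_xx, Sigma_xy,
    Sigma_yx, Sigma_yy; the K local linear models
    y~^i(x) = mu_y^i + Sigma_yx^i (Sigma_xx^i)^-1 (x - mu_x^i)
    are combined with the weights beta^i(x).
    """
    K = len(weights)
    ixx = np.ix_(in_idx, in_idx)
    beta = np.array([
        weights[i] * multivariate_normal.pdf(
            x_query, mean=means[i][in_idx], cov=covs[i][ixx])
        for i in range(K)
    ])
    beta /= beta.sum()

    mean = np.zeros(len(out_idx))
    var = np.zeros((len(out_idx), len(out_idx)))
    for i in range(K):
        Sxx = covs[i][ixx]
        Sxy = covs[i][np.ix_(in_idx, out_idx)]
        Syx = covs[i][np.ix_(out_idx, in_idx)]
        Syy = covs[i][np.ix_(out_idx, out_idx)]
        local = means[i][out_idx] + Syx @ np.linalg.solve(
            Sxx, x_query - means[i][in_idx])
        mean += beta[i] * local
        var += beta[i] ** 2 * (Syy - Syx @ np.linalg.solve(Sxx, Sxy))
    return mean, var

# Single Gaussian over (x, y): conditional mean is mu_y + 0.5 (x - mu_x)
m, v = gmr(np.array([1.0]), np.array([1.0]),
           [np.array([0.0, 0.0])],
           [np.array([[1.0, 0.5], [0.5, 1.0]])],
           in_idx=[0], out_idx=[1])
print(m, v)  # [0.5] [[0.75]]
```

The index lists make the same function usable for the inverse query $p(x \mid y)$ discussed later, simply by swapping `in_idx` and `out_idx`.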
Computing the variance $\mathrm{var}\{p(y \mid x)\}$ provides information on the uncertainty of the prediction computed from the conditional distribution.
Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the model!

$E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$

$\mathrm{var}\{p(y \mid x)\} = \sum_{i=1}^{K} \beta^i(x)^2\, \tilde{\Sigma}^i$, with $\tilde{\Sigma}^i = \Sigma_{yy}^i - \Sigma_{yx}^i \left(\Sigma_{xx}^i\right)^{-1} \Sigma_{xy}^i$

The variance of the prediction is a weighted combination of the variances of the local models around the weighted mean.
[Figure: $E\{p(y \mid x)\}$ with shaded $\mathrm{var}\{p(y \mid x)\}$; the color shading gives the likelihood of the model (uncertainty).]
Observe the modulation of $\mathrm{var}\{p(y \mid x)\}$ from a small variance around the first Gauss function to a large variance around the second.
GMR: Sensitivity to Choice of K and Initialization
[Figures: fit with 4 Gaussians, uniform initialization; fit with 4 Gaussians, random initialization; fit with 10 Gaussians, random initialization.]
Gaussian Mixture Regression: Summary

Such a generative model provides more information than models that directly compute p(y|x):
• It allows learning to predict a multi-dimensional output y.
• It allows querying x given y, i.e. computing p(x|y).

Parametrize the density p(x, y) and then estimate solely the parameters. The density is constructed from a mixture of K Gaussians:

$p(x, y) = \sum_{i=1}^{K} \pi^i\, p^i(x, y)$, with $p^i(x, y) = \mathcal{N}(x, y;\, \mu^i, \Sigma^i)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.
Comparison Across Methods

Generalization: prediction away from the datapoints.

SVR: $y = \sum_{i=1}^{M} \alpha_i^{*}\, k(x, x^i) + b$ predicts $y = b$ away from the datapoints.
GMR predicts the trend away from the data:

$\hat{y}(x) = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$, with $\beta^i(x) = \dfrac{\pi^i\, \mathcal{N}(x;\, \mu_x^i, \Sigma_{xx}^i)}{\sum_{j=1}^{K} \pi^j\, \mathcal{N}(x;\, \mu_x^j, \Sigma_{xx}^j)}$
But the prediction depends on the model choice and initialization (which influence the solution found during the GMM training phase).
The prediction away from the datapoints is affected by all regressive models and may become meaningless! Use the likelihood of the model to determine whether or not it is safe to use the prediction.

The variance in p(y|x) in GMR represents the modeled uncertainty of the value of y; it is not a measure of the uncertainty of the model. The variance in SVR represents the epsilon-tube, the uncertainty around the predicted value of y; it does not represent the uncertainty of the model either!
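One way to apply this advice, sketched with scikit-learn (the toy data and the 3-component model are assumptions): `score_samples` returns the log-likelihood under the fitted joint density, which drops sharply for queries far from the training data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.05 * rng.normal(size=200)
gmm = GaussianMixture(n_components=3, random_state=0).fit(np.column_stack([x, y]))

# score_samples gives the log-likelihood under the fitted joint density:
# a query far from the training data scores far lower, flagging the
# regression output there as unsafe to use.
inside = gmm.score_samples([[np.pi, 0.0]])[0]   # near the data
outside = gmm.score_samples([[50.0, 0.0]])[0]   # far from the data
print(inside, outside)
```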
SVR, GMR: Similarities

• SVR and GMR are based on the same regressive model: $y = E\{p(y \mid x)\}$.

SVR solution: $y = f(x)$. Assuming white noise $\epsilon \sim \mathcal{N}(0, \sigma)$ on the data, $E\{p(y \mid x)\} = f(x)$, so the SVR estimate can also be written $y = E\{p(y \mid x)\}$.
GMR solution: $y = E\{p(y \mid x)\}$ directly.
• SVR and GMR compute a weighted combination of local predictors.
• Both separate the input space into regions modeled by Gaussian distributions (true only when using Gaussian/RBF kernels for SVR).
• The model is computed locally (locally weighted regression)!

SVR solution: $y = \sum_{i=1}^{M} \alpha_i^{*}\, k(x, x^i) + b$
GMR solution: $y = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$
SVR, GMR: Differences

GMR allows predicting a multi-dimensional output, while SVR can predict only a uni-dimensional output y.

SVR solution: $y = f(x)$, where $y$ is unidimensional.
GMR solution: $y = E\{p(y \mid x)\}$, which starts by computing $p(x, y)$ and can therefore also compute $p(x \mid y)$; $x$ and $y$ can have arbitrary dimensions.
SVR, GMR: Differences

SVR, GMR and GPR are based on the same regressive model, but they do not optimize the same objective function → they find different solutions.
• SVR:
  • minimizes the reconstruction error through convex optimization → ensured to find the optimal estimate, but the solution is not unique
  • usually finds a number of models <= the number of datapoints (the support vectors)
• GMR:
  • learns p(x, y) through maximum likelihood → finds a local optimum
  • computes a generative model p(x, y) from which it derives p(y|x)
  • starts with a low number of models << the number of datapoints
Hyperparameters of SVR, GMR

SVR and GMR depend on hyperparameters that need to be determined beforehand. These are:
• SVR:
  • choice of the error margin ε and penalty factor C
  • choice of kernel and associated kernel parameters
• GMR:
  • choice of the number of Gaussians K
  • choice of initialization (affects convergence to a local optimum)

The hyperparameters can be optimized separately; e.g. the number of Gaussians in GMR can be estimated using BIC, and the kernel parameters of SVR can be optimized through grid search.
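BIC-based selection of K can be sketched with scikit-learn's built-in `bic` score (lower is better; the noisy-sine dataset and the candidate range 1..7 are assumptions for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 300)
data = np.column_stack([x, np.sin(x) + 0.05 * rng.normal(size=300)])

# Lower BIC is better: it trades goodness of fit against parameter count
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0)
             .fit(data).bic(data)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print("BIC-selected number of Gaussians:", best_k)
```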
Conclusion

There is no easy way to determine which regression technique best fits your problem.

Training:
• SVR: convex optimization (SMO solver); parameters grow O(M·N)
• GMR: EM, an iterative technique that needs several runs; parameters grow O(K·N²)

Testing:
• SVR: cost grows O(number of SVs); few SVs, a small fraction of the original data
• GMR: cost grows O(K)

M: number of datapoints; N: dimension of the data; K: number of Gauss functions in the GMM model