APPLIED MACHINE LEARNING
Gaussian Mixture Regression
Brief summary of last week’s lecture
Locally Weighted Regression
Estimate is determined through the local influence of each group of datapoints; $x$ is the query point. The method generates a smooth function $\hat{y}(x)$:

$\hat{y}(x) = \sum_{i=1}^{M} \beta^i(x)\, y^i \Big/ \sum_{j=1}^{M} \beta^j(x)$, with $\beta^i(x)$: weights, function of $x$

$\beta^i(x) = K(d(x, x^i))$, with $K(d(x, x^i)) = e^{-d(x, x^i)}$ and $d(x, x^i) = \left\| x - x^i \right\|^2$.
Model-free regression!
There is no longer an explicit model of the form $y = w^T x$: the regression is computed at each query point and depends on the training points, through the weights $\beta^i(x) = K(d(x, x^i))$.
Which training points?
Which kernel?
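The estimate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code; the `width` bandwidth parameter is an assumption (the slides leave the kernel scale open).

```python
import numpy as np

def lwr_predict(x_query, X, y, width=1.0):
    """Locally weighted regression estimate at one query point.

    Weighted average of the training targets, with Gaussian kernel
    weights beta^i(x) = exp(-d(x, x^i) / width), where d is the squared
    Euclidean distance.  'width' is an assumed bandwidth parameter.
    """
    d = np.sum((X - x_query) ** 2, axis=1)   # d(x, x^i) = ||x - x^i||^2
    beta = np.exp(-d / width)                # beta^i(x) = K(d(x, x^i))
    return np.sum(beta * y) / np.sum(beta)   # weighted mean over all M points

# Noisy samples of sin(x): the estimate is a smooth function of the query
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=100)
print(lwr_predict(np.array([np.pi / 2]), X, y, width=0.5))  # close to 1.0
```

Note how every training point enters every prediction: this is the cost issue raised on the next slides.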
MACHINE LEARNING – 2012
Data-driven Regression

[Figure: green, true function; blue, estimated function.]

Good prediction depends on the choice of datapoints.
Good prediction depends on the choice of datapoints. The more datapoints, the better the fit, but the computational cost increases dramatically with the number of datapoints.
Several methods in ML perform non-linear regression; they differ in their objective function and in the number of parameters.
Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).
[Figure: $y = f(x)$. For illustrative purposes, the negative Gauss functions are plotted next to the support vectors, though they are distributed on the negative y axis.]
SVR solution: $y = f(x) = \sum_{i=1}^{M} \alpha_i^{*}\, k(x, x^i) + b$

The analytical solution is found by solving a convex optimization problem. The Lagrange multipliers $\alpha_i^{*}$ define the importance of each Gaussian function; the prediction converges to $b$ where the effect of the support vectors vanishes.
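As a sketch of this behavior (using scikit-learn's `SVR` rather than the lecture's own solver; the toy sine data is an assumption), an RBF-kernel SVR keeps only a subset of the data as support vectors, and far from the data the kernel terms vanish so the prediction falls back to the bias $b$:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 80)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=80)

# RBF-kernel SVR: y = sum_i alpha_i* k(x, x^i) + b over the support vectors
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print(len(model.support_vectors_), "support vectors out of", len(X))

# Far from the data every kernel term vanishes and the prediction tends to b
far = model.predict(np.array([[100.0]]))[0]
print(far, "vs b =", model.intercept_[0])
```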
Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of Gaussian functions).
Gaussian Mixture Regression
Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y)$ across pairs of datapoints using a GMM:

$p(x, y) = \sum_{i=1}^{K} \pi^i\, p^i(x, y)$, with $p^i(x, y) = \mathcal{N}(x, y;\, \mu^i, \Sigma^i)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.

[Figure: 2D projection of a Gauss function; each ellipse contour ≈ 2 standard deviations.]
The parameters are learned through Expectation-Maximization (EM), an iterative procedure that starts from a random initialization.
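Step 1 can be sketched with scikit-learn's `GaussianMixture`, which runs EM internally (the sine-shaped joint data is an assumed toy example; `n_init` restarts EM from several random initializations, which matters given the local optima discussed later):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy joint samples (x, y) along a noisy sine curve (an assumed dataset)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.05 * rng.normal(size=200)
data = np.column_stack([x, y])

# EM fit of p(x, y); n_init restarts EM from several random initializations
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(data)
print(gmm.weights_)        # mixing coefficients pi^i (they sum to 1)
print(gmm.means_.shape)    # K means mu^i over the joint space: (3, 2)
```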
Mixing coefficients $\pi^i$, with $\sum_{i=1}^{K} \pi^i = 1$, give the relative importance of each Gaussian $i$:

$\pi^i = p(i) = \frac{1}{M} \sum_{j=1}^{M} p(i \mid x^j)$
2) Compute the regressive signal by taking the conditional $p(y \mid x)$:

$p(y \mid x) = \sum_{i=1}^{K} \beta^i(x)\, p^i(y \mid x)$, with $\beta^i(x) = \dfrac{\pi^i\, \mathcal{N}(x;\, \mu_x^i, \Sigma_{xx}^i)}{\sum_{j=1}^{K} \pi^j\, \mathcal{N}(x;\, \mu_x^j, \Sigma_{xx}^j)}$

Each $p^i(y \mid x)$ is a Gauss function whose variance changes depending on the query point.
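The weights $\beta^i(x)$ can be sketched directly from GMM parameters. This is an illustrative snippet, not reference code: the 1-D input (first joint coordinate) and the two toy components are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def beta_weights(x_query, weights, means, covs):
    """beta^i(x): relative responsibility of Gaussian i at query point x.

    weights, means, covs are GMM parameters over the joint (x, y);
    the x-marginal of component i is N(x; mu_x^i, Sigma_xx^i).
    For brevity, x is assumed 1-D (the first joint coordinate).
    """
    lik = np.array([
        w * multivariate_normal.pdf(x_query, mean=m[0], cov=c[0, 0])
        for w, m, c in zip(weights, means, covs)
    ])
    return lik / lik.sum()       # normalize so the betas sum to 1

# Two hypothetical components, centered at x = 0 and x = 4
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [4.0, 1.0]])
cov = np.array([np.eye(2), np.eye(2)])
b = beta_weights(0.0, w, mu, cov)
print(b)  # the first component dominates near x = 0
```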
The influence of each marginal $p^i(y \mid x)$ is modulated by $\beta^i(x)$ at the query point $x$.
3) The regressive signal is then obtained by computing $E\{p(y \mid x)\}$:

$\hat{y} = E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$, with $\tilde{y}^i(x) = \mu_y^i + \Sigma_{yx}^i \left(\Sigma_{xx}^i\right)^{-1} \left(x - \mu_x^i\right)$

a linear combination of $K$ local regressive models.

The covariance matrix of each Gauss function can be decomposed into blocks of matrices,

$\Sigma^i = \begin{pmatrix} \Sigma_{xx}^i & \Sigma_{xy}^i \\ \Sigma_{yx}^i & \Sigma_{yy}^i \end{pmatrix}$

with $\Sigma_{xx}^i$ and $\Sigma_{yy}^i$ the covariance matrices on $x$ and $y$, and $\Sigma_{xy}^i$ the cross-covariance matrix.
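The three steps can be collected into one function. This is a sketch under the formulas above (block-sliced covariances, weighted local linear models), with the variance combined as on the next slide; it is not a reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr(x_query, weights, means, covs, in_idx, out_idx):
    """Gaussian Mixture Regression: returns E{p(y|x)} and var{p(y|x)}.

    Each joint covariance is sliced into the blocks Sigma_xx, Sigma_xy,
    Sigma_yx, Sigma_yy; the K local linear models
    y~^i(x) = mu_y^i + Sigma_yx^i (Sigma_xx^i)^-1 (x - mu_x^i)
    are combined with the weights beta^i(x).
    """
    K = len(weights)
    ixx = np.ix_(in_idx, in_idx)
    beta = np.array([
        weights[i] * multivariate_normal.pdf(
            x_query, mean=means[i][in_idx], cov=covs[i][ixx])
        for i in range(K)
    ])
    beta /= beta.sum()

    mean = np.zeros(len(out_idx))
    var = np.zeros((len(out_idx), len(out_idx)))
    for i in range(K):
        Sxx = covs[i][ixx]
        Sxy = covs[i][np.ix_(in_idx, out_idx)]
        Syx = covs[i][np.ix_(out_idx, in_idx)]
        Syy = covs[i][np.ix_(out_idx, out_idx)]
        local = means[i][out_idx] + Syx @ np.linalg.solve(
            Sxx, x_query - means[i][in_idx])
        mean += beta[i] * local
        var += beta[i] ** 2 * (Syy - Syx @ np.linalg.solve(Sxx, Sxy))
    return mean, var

# Single Gaussian over (x, y): conditional mean is mu_y + 0.5 (x - mu_x)
m, v = gmr(np.array([1.0]), np.array([1.0]),
           [np.array([0.0, 0.0])],
           [np.array([[1.0, 0.5], [0.5, 1.0]])],
           in_idx=[0], out_idx=[1])
print(m, v)  # [0.5] [[0.75]]
```

The index lists make the same function usable for the inverse query $p(x \mid y)$ discussed later, simply by swapping `in_idx` and `out_idx`.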
Computing the variance $\mathrm{var}\{p(y \mid x)\}$ provides information on the uncertainty of the prediction computed from the conditional distribution.
Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the model!

$E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$

$\mathrm{var}\{p(y \mid x)\} = \sum_{i=1}^{K} \beta^i(x)^2\, \tilde{\Sigma}^i$, with $\tilde{\Sigma}^i = \Sigma_{yy}^i - \Sigma_{yx}^i \left(\Sigma_{xx}^i\right)^{-1} \Sigma_{xy}^i$

The variance of the prediction is a weighted combination of the variances of the local models around the weighted mean.
[Figure: $E\{p(y \mid x)\}$ with shaded $\mathrm{var}\{p(y \mid x)\}$; the color shading gives the likelihood of the model (uncertainty).]
Observe the modulation of $\mathrm{var}\{p(y \mid x)\}$ from a small variance around the first Gauss function to a large variance around the second.
GMR: Sensitivity to Choice of K and Initialization
[Figures: fit with 4 Gaussians, uniform initialization; fit with 4 Gaussians, random initialization; fit with 10 Gaussians, random initialization.]
Gaussian Mixture Regression: Summary

Such a generative model provides more information than models that directly compute p(y|x):
• It allows learning to predict a multi-dimensional output y.
• It allows querying x given y, i.e. computing p(x|y).

Parametrize the density p(x, y) and then estimate solely the parameters. The density is constructed from a mixture of K Gaussians:

$p(x, y) = \sum_{i=1}^{K} \pi^i\, p^i(x, y)$, with $p^i(x, y) = \mathcal{N}(x, y;\, \mu^i, \Sigma^i)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.
Comparison Across Methods

Generalization: prediction away from the datapoints.

SVR: $y = \sum_{i=1}^{M} \alpha_i^{*}\, k(x, x^i) + b$ predicts $y = b$ away from the datapoints.
GMR predicts the trend away from the data:

$\hat{y}(x) = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$, with $\beta^i(x) = \dfrac{\pi^i\, \mathcal{N}(x;\, \mu_x^i, \Sigma_{xx}^i)}{\sum_{j=1}^{K} \pi^j\, \mathcal{N}(x;\, \mu_x^j, \Sigma_{xx}^j)}$
But the prediction depends on the model choice and initialization (which influence the solution found during the GMM training phase).
The prediction away from the datapoints is affected by all regressive models and may become meaningless! Use the likelihood of the model to determine whether or not it is safe to use the prediction.

The variance in p(y|x) in GMR represents the modeled uncertainty of the value of y; it is not a measure of the uncertainty of the model. The variance in SVR represents the epsilon-tube, the uncertainty around the predicted value of y; it does not represent the uncertainty of the model either!
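One way to apply this advice, sketched with scikit-learn (the toy data and the 3-component model are assumptions): `score_samples` returns the log-likelihood under the fitted joint density, which drops sharply for queries far from the training data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.05 * rng.normal(size=200)
gmm = GaussianMixture(n_components=3, random_state=0).fit(np.column_stack([x, y]))

# score_samples gives the log-likelihood under the fitted joint density:
# a query far from the training data scores far lower, flagging the
# regression output there as unsafe to use.
inside = gmm.score_samples([[np.pi, 0.0]])[0]   # near the data
outside = gmm.score_samples([[50.0, 0.0]])[0]   # far from the data
print(inside, outside)
```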
SVR, GMR: Similarities

• SVR and GMR are based on the same regressive model: $y = E\{p(y \mid x)\}$.

SVR solution: $y = f(x)$. Assuming white noise $\epsilon \sim \mathcal{N}(0, \sigma)$ on the data, $E\{p(y \mid x)\} = f(x)$, so the SVR estimate can also be written $y = E\{p(y \mid x)\}$.
GMR solution: $y = E\{p(y \mid x)\}$ directly.
• SVR and GMR compute a weighted combination of local predictors.
• Both separate the input space into regions modeled by Gaussian distributions (true only when using Gaussian/RBF kernels for SVR).
• The model is computed locally (locally weighted regression)!

SVR solution: $y = \sum_{i=1}^{M} \alpha_i^{*}\, k(x, x^i) + b$
GMR solution: $y = \sum_{i=1}^{K} \beta^i(x)\, \tilde{y}^i(x)$
SVR, GMR: Differences

GMR allows predicting a multi-dimensional output, while SVR can predict only a uni-dimensional output y.

SVR solution: $y = f(x)$, where $y$ is unidimensional.
GMR solution: $y = E\{p(y \mid x)\}$, which starts by computing $p(x, y)$ and can therefore also compute $p(x \mid y)$; $x$ and $y$ can have arbitrary dimensions.
SVR, GMR: Differences

SVR, GMR and GPR are based on the same regressive model, but they do not optimize the same objective function → they find different solutions.
• SVR:
  • minimizes the reconstruction error through convex optimization → ensured to find the optimal estimate, but the solution is not unique
  • usually finds a number of models <= the number of datapoints (the support vectors)
• GMR:
  • learns p(x, y) through maximum likelihood → finds a local optimum
  • computes a generative model p(x, y) from which it derives p(y|x)
  • starts with a low number of models << the number of datapoints
Hyperparameters of SVR, GMR

SVR and GMR depend on hyperparameters that need to be determined beforehand. These are:
• SVR:
  • choice of the error margin ε and penalty factor C
  • choice of kernel and associated kernel parameters
• GMR:
  • choice of the number of Gaussians K
  • choice of initialization (affects convergence to a local optimum)

The hyperparameters can be optimized separately; e.g. the number of Gaussians in GMR can be estimated using BIC, and the kernel parameters of SVR can be optimized through grid search.
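BIC-based selection of K can be sketched with scikit-learn's built-in `bic` score (lower is better; the noisy-sine dataset and the candidate range 1..7 are assumptions for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 300)
data = np.column_stack([x, np.sin(x) + 0.05 * rng.normal(size=300)])

# Lower BIC is better: it trades goodness of fit against parameter count
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0)
             .fit(data).bic(data)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print("BIC-selected number of Gaussians:", best_k)
```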
Conclusion

There is no easy way to determine which regression technique best fits your problem.

Training:
• SVR: convex optimization (SMO solver); parameters grow O(M·N)
• GMR: EM, an iterative technique that needs several runs; parameters grow O(K·N²)

Testing:
• SVR: cost grows O(number of SVs); few SVs, a small fraction of the original data
• GMR: cost grows O(K)

M: number of datapoints; N: dimension of the data; K: number of Gauss functions in the GMM model