MACHINE LEARNING
Kernel Methods for Regression
Support Vector Regression
Gaussian Mixture Regression
Gaussian Process Regression
Problem Statement
Predict $y$ given input $x$ through a non-linear function $f$: $y = f(x)$.
Estimate $f$ that best predicts the set of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$.
[Figure: training pairs $(x^1, y^1), \dots, (x^4, y^4)$ in the $(x, y)$ plane.]
Non-linear regression and the Kernel Trick
Non-Linear regression: Fit data with a function that is not linear in the
parameters
Non-parametric regression: use the data to determine the parameters
of the function so that the problem can be again phrased as a linear
regression problem.
Kernel Trick: map the data into feature space through a non-linear function and
perform linear regression in feature space.

$y = f(x; \theta)$,   $\theta$: parameters of the function

$y = \sum_i \alpha_i \, k(x^i, x)$,   $x^i$: datapoints, $k$: kernel function
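To make the kernel expansion above concrete, here is a minimal sketch (not from the slides) of predicting with $y = \sum_i \alpha_i\, k(x^i, x)$ using a Gaussian (RBF) kernel. The coefficients alpha are assumed to come from some training procedure; plain kernel ridge regression is used here purely for illustration.

```python
import numpy as np

def rbf_kernel(A, B, width=0.1):
    """Gaussian (RBF) kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 * width^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

# Toy training set {(x^i, y^i)}
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(20)

# Illustrative training step: kernel ridge regression, alpha = (K + lam*I)^-1 y
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

# Prediction at query points: y(x) = sum_i alpha_i k(x^i, x)
Xq = np.linspace(0, 1, 200)[:, None]
y_pred = rbf_kernel(Xq, X) @ alpha
```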
Data-driven Regression
Good prediction depends on the choice of datapoints.
The more datapoints, the better the fit.
Computational costs increase dramatically with the number of datapoints.
[Figure: blue: true function; red: estimated function.]
Kernel methods for regression
Several methods in ML perform non-linear regression. They differ in their objective
function and in the number of parameters:
Gaussian Process Regression (GPR) uses all datapoints.
Support Vector Regression (SVR) picks a subset of datapoints (the support vectors).
Gaussian Mixture Regression (GMR) generates a new set of datapoints
(the centers of the Gaussian functions).
[Figure: blue: true function; red: estimated function.]
Kernel methods for regression
Deterministic regressive model:   $y = f(x)$,   $x \in \mathbb{R}^N$, $y \in \mathbb{R}$
Probabilistic regressive model:   $y = f(x) + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
Build an estimate of the noise model and then compute f directly
(Support Vector Regression).
Support Vector Regression
Support Vector Regression (SVR)
Assume a nonlinear mapping $f$, s.t. $y = f(x)$.
How to estimate $f$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?
How to generalize the support vector machine framework for classification
to the estimation of continuous functions?
1. Assume a non-linear mapping through feature space and then perform
linear regression in feature space.
2. Supervised learning minimizes an error function.
First, determine a way to measure the error on the training set in the linear case!
Support Vector Regression
Assume a linear mapping $f$, s.t. $y = f(x) = w^T x + b$.
How to estimate $w$ and $b$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?
Measure the error on the prediction $y - f(x)$.
(As in SVM, b is estimated through least-squares regression on the support vectors;
hence we omit it from the rest of the development.)
Support Vector Regression
Set an upper bound $\epsilon$ on the error and consider as correctly classified all points
such that $|f(x) - y| \le \epsilon$.
Penalize only datapoints that are not contained in the $\epsilon$-tube.
[Figure: the $\epsilon$-tube $f(x) \pm \epsilon$ around the regression function.]
Support Vector Regression
The $\epsilon$-margin is a measure of the width of the $\epsilon$-insensitive tube, and hence
of the precision of the regression.
A small $\|w\|$ corresponds to a small slope for $f$: in the linear case $y = wx + b$,
$f$ is more horizontal.
[Figure: $\epsilon$-margin around the line $y = wx + b$.]
Support Vector Regression
A large $\|w\|$ corresponds to a large slope for $f$: in the linear case $y = wx + b$,
$f$ is more vertical.
The flatter the slope of the function $f$, the larger the margin.
To maximize the margin, we must minimize the norm of $w$.
[Figure: $\epsilon$-margin around the line $y = wx + b$.]
Support Vector Regression
This can be rephrased as a constraint-based optimization problem of the form:

$\text{minimize } \frac{1}{2}\|w\|^2$
$\text{subject to } \begin{cases} \langle w, x^i \rangle + b - y^i \le \epsilon \\ y^i - \langle w, x^i \rangle - b \le \epsilon \end{cases} \quad i = 1, \dots, M$

Need to penalize points outside the $\epsilon$-insensitive tube.
Support Vector Regression
To penalize points outside the $\epsilon$-insensitive tube, introduce slack variables $\xi_i, \xi_i^* \ge 0$:

$\text{minimize } \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\text{subject to } \begin{cases} \langle w, x^i \rangle + b - y^i \le \epsilon + \xi_i \\ y^i - \langle w, x^i \rangle - b \le \epsilon + \xi_i^* \\ \xi_i \ge 0, \ \xi_i^* \ge 0 \end{cases}$

All points outside the $\epsilon$-tube become support vectors.
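As an aside (not part of the slides), the linear ε-SVR primal above can be written directly as a convex program. A minimal sketch with cvxpy, on toy data and with illustrative hyperparameter values:

```python
import numpy as np
import cvxpy as cp

# Toy 1-D data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 1))
y = 2.0 * X[:, 0] + 0.5 + 0.05 * rng.standard_normal(30)

M, d = X.shape
C, eps = 10.0, 0.1                     # penalty weight and tube half-width

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(M, nonneg=True)       # slack above the tube
xi_star = cp.Variable(M, nonneg=True)  # slack below the tube

objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / M) * cp.sum(xi + xi_star))
constraints = [X @ w + b - y <= eps + xi,
               y - (X @ w + b) <= eps + xi_star]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```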
We now have the solution to the linear regression problem.
How to generalize this to the nonlinear case?
Support Vector Regression
Lift $x$ into feature space and then perform linear regression in feature space.
Linear case:   $y = f(x) = \langle w, x \rangle + b$
Non-linear case:   $x \rightarrow \phi(x)$,   $y = f(x) = \langle w, \phi(x) \rangle + b$
$w$ lives in feature space!
Support Vector Regression
In feature space, we obtain the same constrained optimization problem:

$\text{minimize } \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\text{subject to } \begin{cases} \langle w, \phi(x^i) \rangle + b - y^i \le \epsilon + \xi_i \\ y^i - \langle w, \phi(x^i) \rangle - b \le \epsilon + \xi_i^* \\ \xi_i \ge 0, \ \xi_i^* \ge 0 \end{cases}$
Support Vector Regression
Again, we can solve this quadratic problem by introducing sets of
Lagrange multipliers and writing the Lagrangian:
Lagrangian = objective function + $\lambda\,\cdot$ constraints

$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\xi_i + \frac{C}{M}\sum_{i=1}^{M}\xi_i^* - \sum_{i=1}^{M}\left(\eta_i\xi_i + \eta_i^*\xi_i^*\right)$
$\qquad - \sum_{i=1}^{M}\alpha_i\left(\epsilon + \xi_i - y^i + \langle w, \phi(x^i) \rangle + b\right) - \sum_{i=1}^{M}\alpha_i^*\left(\epsilon + \xi_i^* + y^i - \langle w, \phi(x^i) \rangle - b\right)$
Support Vector Regression
Requiring that the partial derivatives are all zero,

$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}(\alpha_i^* - \alpha_i) = 0; \quad \frac{\partial L}{\partial w} = w - \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,\phi(x^i) = 0; \quad \frac{\partial L}{\partial \xi_i^{(*)}} = \frac{C}{M} - \alpha_i^{(*)} - \eta_i^{(*)} = 0$

and replacing in the primal Lagrangian, we get the dual optimization problem:

$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i,j=1}^{M}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,k(x^i, x^j) - \epsilon\sum_{i=1}^{M}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y^i(\alpha_i - \alpha_i^*)$

$\text{subject to } \sum_{i=1}^{M}(\alpha_i - \alpha_i^*) = 0 \text{ and } \alpha_i, \alpha_i^* \in \left[0, \tfrac{C}{M}\right]$
Support Vector Regression
The solution is given by:

$y = f(x) = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\, k(x^i, x) + b$

The $(\alpha_i - \alpha_i^*)$ are the linear coefficients (Lagrange multipliers for each constraint).
If a Gaussian kernel is used, the solution is a sum of M Gaussians centered on the training
datapoints: the kernel places a Gauss function on each support vector, and the Lagrange
multipliers define the importance of each Gaussian function.
Away from the support vectors their effect vanishes and the prediction converges to b.
[Figure: weighted Gauss functions centered on the support vectors $x^1, \dots, x^6$, with the
Lagrange multipliers as weights; the prediction converges to b where the support vectors' effect
has vanished.]
SVR: Hyperparameters

$\text{minimize } \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$
$\text{subject to } \begin{cases} \langle w, \phi(x^i) \rangle + b - y^i \le \epsilon + \xi_i \\ y^i - \langle w, \phi(x^i) \rangle - b \le \epsilon + \xi_i^* \\ \xi_i \ge 0, \ \xi_i^* \ge 0 \end{cases}$

The solution to SVR we just saw is referred to as $\epsilon$-SVR.
Two hyperparameters:
C controls the penalty term on a poor fit.
$\epsilon$ determines the minimal required precision.
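As an illustration (not from the slides), a minimal sketch of ε-SVR with scikit-learn, using hyperparameter values of the same order as those on the following slides; the gamma conversion assumes an RBF kernel of the form $k(x, x') = \exp(-\|x - x'\|^2 / (2\,\mathrm{width}^2))$:

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: noisy sine
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, (50, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(50)

# epsilon-SVR with an RBF kernel: C penalizes points outside the tube,
# epsilon sets the tube half-width, gamma = 1 / (2 * kernel_width^2).
svr = SVR(kernel="rbf", C=1000.0, epsilon=0.01, gamma=1.0 / (2 * 0.1 ** 2))
svr.fit(X, y)

print("number of support vectors:", len(svr.support_))
y_pred = svr.predict(X)
```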
SVR: Effect of Hyperparameters
Effect of the RBF kernel width on the fit. Here fit using C=1000, $\epsilon$=0.01,
kernel width=0.1.
SVR: Effect of Hyperparameters
Effect of the RBF kernel width on the fit. Here fit using C=1000, $\epsilon$=0.01,
kernel width=0.01 → overfitting.
SVR: Effect of Hyperparameters
Effect of the RBF kernel width on the fit. Here fit using C=100, $\epsilon$=0.03,
kernel width=0.1. Choosing appropriate hyperparameters reduces the effect of the
kernel width on the fit.
Support Vector Regression: ν-SVR
As the number of data grows, so does the number of support vectors.
Introduce a new parameter $\nu \in [0, 1]$, as in ν-SVM:
ν-SVR puts a bound on the fraction of support vectors.
It allows the $\epsilon$-tube to be fitted automatically!

$\min_{w, \epsilon, \xi} \; \frac{1}{2}\|w\|^2 + C\left(\nu\epsilon + \frac{1}{M}\sum_{j=1}^{M}(\xi_j + \xi_j^*)\right)$
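For illustration only (not from the slides), a minimal ν-SVR sketch using scikit-learn's NuSVR, where nu replaces the explicit ε and the tube width is adapted automatically; all values below are illustrative:

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, (50, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

# nu in (0, 1] bounds the fraction of support vectors; epsilon is not given explicitly.
nu_svr = NuSVR(kernel="rbf", C=100.0, nu=0.2, gamma=50.0)
nu_svr.fit(X, y)
print("fraction of support vectors:", len(nu_svr.support_) / len(X))
```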
Support Vector Regression: ν-SVR (Example)
Effect of the automatic adaptation of $\epsilon$ when using ν-SVR.
Support Vector Regression: ν-SVR (Example)
Effect of the automatic adaptation of $\epsilon$ when using ν-SVR, with added noise on the data.
Kernel methods for regression
Deterministic regressive model:   $y = f(x)$,   $x \in \mathbb{R}^N$, $y \in \mathbb{R}$
Probabilistic regressive model:   $y = f(x) + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
One option is to build an estimate of the noise model and then compute f directly
(Support Vector Regression).
Alternatively, build a probabilistic estimate of the nonlinear relationship between y and x
through the conditional density $p(y|x)$ (this estimates the noise model and f), and then
compute the estimate by taking the expectation over the conditional density:
$\hat{y} = E\{f(x)\} = f(x)$,   $\hat{y} = E\{p(y \mid x)\}$
Gaussian Mixture Regression (GMR)
1) Estimate the joint density, p(x,y), across pairs of datapoints using a GMM:

$p(x, y) = \sum_{i=1}^{K} \pi_i\, p^i(x, y)$,   with   $p^i(x, y) = N\!\left(x, y;\ \mu^i, \Sigma^i\right)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.
[Figure: 2D projection of each Gauss function; the ellipse contour corresponds to ~2 standard deviations.]
The parameters are learned through Expectation-Maximization, an iterative procedure
that starts from a random initialization.
Mixing coefficients $\pi_i$, with $\sum_{i=1}^{K}\pi_i = 1$: the probability that the M datapoints
were generated by Gaussian $i$, $\pi_i \propto \sum_{j=1}^{M} p(i \mid x^j)$.
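As an illustration (not from the slides), the joint density p(x, y) in step 1 can be fitted with scikit-learn's GaussianMixture on the stacked [x, y] data; K and the initialization settings below are illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

# Fit the joint density p(x, y) with K Gaussians (EM, several random restarts)
gmm = GaussianMixture(n_components=4, covariance_type="full", n_init=5)
gmm.fit(np.column_stack([x, y]))

# Mixing coefficients pi_i, means mu^i and covariances Sigma^i of the joint model
print(gmm.weights_, gmm.means_.shape, gmm.covariances_.shape)
```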
Gaussian Mixture Regression (GMR)
1) Estimate the joint density p(x, y) across pairs of datapoints using a GMM (as above).
2) Compute the regressive signal by taking p(y|x):

$p(y \mid x) = \sum_{i=1}^{K} \beta_i(x)\, p^i(y \mid x)$,   with   $\beta_i(x) = \frac{\pi_i\, p^i(x;\ \mu_X^i, \Sigma_{XX}^i)}{\sum_{j=1}^{K} \pi_j\, p^j(x;\ \mu_X^j, \Sigma_{XX}^j)}$

Each $p^i(y \mid x)$ is a Gauss function; its variance changes depending on the query point.
The factors $\beta_i(x)$ give a measure of the relative importance of each of the K regressive
models. They are computed at each query point → weighted regression.
[Figure: the influence of each marginal, $\beta_1(x)$ and $\beta_2(x)$, is modulated by the query point.]
Gaussian Mixture Regression (GMR)
3) The regressive signal is then obtained by computing E{p(y|x)}:

$E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta_i(x)\, \tilde{\mu}^i(x)$,   with   $\tilde{\mu}^i(x) = \mu_Y^i + \Sigma_{YX}^i \left(\Sigma_{XX}^i\right)^{-1}\left(x - \mu_X^i\right)$

This is a linear combination of K local regressive models, weighted by $\beta_1(x), \beta_2(x), \dots$
[Figure: the local regressive models and their weights $\beta_1(x)$, $\beta_2(x)$.]
Gaussian Mixture Regression (GMR)
Computing the variance var{p(y|x)} provides information on the uncertainty of the
prediction computed from the conditional distribution:

$\text{var}\{p(y \mid x)\} = \sum_{i=1}^{K}\beta_i(x)\left(\tilde{\Sigma}^i + \tilde{\mu}^i(x)^2\right) - \left(\sum_{i=1}^{K}\beta_i(x)\,\tilde{\mu}^i(x)\right)^2$,   with   $\tilde{\Sigma}^i = \Sigma_{YY}^i - \Sigma_{YX}^i\left(\Sigma_{XX}^i\right)^{-1}\Sigma_{XY}^i$

Careful: this is not the uncertainty of the model. Use the likelihood to compute the
uncertainty of the predictor!
[Figure: local models weighted by $\beta_1(x)$, $\beta_2(x)$.]
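A minimal numpy sketch (not from the slides) of the GMR equations above for 1-D x and y, assuming the joint GMM parameters (weights, means, covariances) have already been fitted, e.g. with the GaussianMixture fit shown earlier:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_predict(x, weights, means, covs):
    """Conditional mean and variance of p(y | x) for a GMM fitted on joint (x, y) data.

    weights: (K,) mixing coefficients pi_i
    means:   (K, 2) joint means [mu_x, mu_y]
    covs:    (K, 2, 2) joint covariances
    """
    K = len(weights)
    beta = np.array([weights[i] * multivariate_normal.pdf(x, means[i, 0], covs[i, 0, 0])
                     for i in range(K)])
    beta /= beta.sum()                                     # beta_i(x)

    # Local conditional means and variances of y given x
    mu_loc = means[:, 1] + covs[:, 1, 0] / covs[:, 0, 0] * (x - means[:, 0])
    var_loc = covs[:, 1, 1] - covs[:, 1, 0] ** 2 / covs[:, 0, 0]

    mean = beta @ mu_loc                                   # E{p(y|x)}
    var = beta @ (var_loc + mu_loc ** 2) - mean ** 2       # var{p(y|x)}
    return mean, var

# Usage with a fitted sklearn GaussianMixture `gmm`:
#   gmr_predict(0.5, gmm.weights_, gmm.means_, gmm.covariances_)
```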
Gaussian Mixture Regression (GMR)
[Figure: $E\{p(y \mid x)\}$ plotted with an envelope given by $\text{var}\{p(y \mid x)\}$, showing the
uncertainty of the prediction computed from the conditional distribution.]
GMR: Sensitivity to Choice of K and Initialization
[Figures: fit with 4 Gaussians, uniform initialization; fit with 4 Gaussians, random
initialization; fit with 10 Gaussians, random initialization.]
Kernel methods for regression
Deterministic regressive model:   $y = f(x)$,   $x \in \mathbb{R}^N$, $y \in \mathbb{R}$
Probabilistic regressive model:   $y = f(x) + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
Probabilistic estimate of the nonlinear relationship between y and x through the
conditional density $p(y|x)$ (this estimates the noise model and f), and then compute the
estimate by taking the expectation over the conditional density: $\hat{y} = E\{p(y \mid x)\}$.
Gaussian Mixture Regression (GMR) first computes p(x, y) and then derives p(y|x).
Gaussian Process Regression (GPR) computes p(y|x) directly.
Probability Density Function and Regression
A signal y can be estimated through regression y = f(x) by taking the expectation over
the conditional probability $p(y \mid x)$, for a choice of parameters of p:
$\hat{y} = E\{p(y \mid x)\}$
The simplest way to estimate p(y|x) is through Probabilistic Regression, which
estimates a linear regressive model.
Probabilistic Regression (PR)
PR is a statistical approach to classical linear regression that estimates the relationship
between zero-mean variables y and x by building a linear model of the form:
$y = f(x, w) + \epsilon = w^T x + \epsilon$,   $w, x \in \mathbb{R}^N$
If one assumes that the observed values of y differ from f(x) by an additive noise that
follows a zero-mean Gaussian distribution (such an assumption consists of putting a
prior distribution over the noise), then:
$y = w^T x + \epsilon$,   with   $\epsilon \sim N(0, \sigma^2)$
Probabilistic Regression
Training set of M pairs of datapoints: $X, y = \{x^i, y^i\}_{i=1}^{M}$
Regressive model: $y = w^T X + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$
Likelihood of the regressive model: $y \sim p(y \mid X, w, \sigma)$
Parameters of the model: $w, \sigma$
Datapoints are independently and identically distributed (i.i.d.):

$p(y \mid X, w, \sigma) = \prod_{i=1}^{M} p(y^i \mid x^i, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^i - w^T x^i)^2}{2\sigma^2}\right)$
Probabilistic Regression
Prior model on the distribution of the parameter w:

$p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

$\Sigma_w$: hyperparameters, given by the user.
Probabilistic Regression
Posterior predictive distribution:

$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, {x^*}^T A^{-1} X\, y,\ \ {x^*}^T A^{-1} x^*\right)$,   with   $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$

$x^*$: testing point;   $X, y$: training datapoints.
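A minimal numpy sketch (not in the slides) of the posterior predictive above, using the convention that the columns of X are the training inputs; the toy data and prior are illustrative:

```python
import numpy as np

def bayes_linreg_predict(x_star, X, y, sigma2, Sigma_w):
    """Posterior predictive mean and variance of Bayesian linear regression.

    X: (N, M) training inputs as columns, y: (M,) outputs, x_star: (N,) test input,
    sigma2: noise variance, Sigma_w: (N, N) prior covariance of w.
    """
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma2
    var = x_star @ A_inv @ x_star
    return mean, var

# Toy usage: one feature plus a bias term, so N = 2
X = np.vstack([np.linspace(0, 1, 20), np.ones(20)])
y = np.array([2.0, -0.5]) @ X + 0.05 * np.random.randn(20)
m, v = bayes_linreg_predict(np.array([0.5, 1.0]), X, y, sigma2=0.05**2, Sigma_w=np.eye(2))
```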
From Probabilistic Regression to Gaussian Process Regression
How to extend the simple linear Bayesian regressive model to nonlinear regression,
such that the non-linear problem becomes linear again?
$y = w^T x + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$
[Figure: a linear model fits the data poorly when the true relationship between x and y is non-linear.]
Gaussian Process Regression
How to extend the simple linear Bayesian regressive model to nonlinear regression,
such that the non-linear problem becomes linear again?
Apply a non-linear transformation $x \rightarrow \phi(x)$, i.e. $y = w^T \phi(x) + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$.
Linear case:
$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, {x^*}^T A^{-1} X\, y,\ \ {x^*}^T A^{-1} x^*\right)$,   with   $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$
In feature space:
$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x^*)^T A^{-1} \Phi(X)\, y,\ \ \phi(x^*)^T A^{-1} \phi(x^*)\right)$,   with   $A = \tfrac{1}{\sigma^2} \Phi(X)\Phi(X)^T + \Sigma_w^{-1}$
Gaussian Process Regression
$p(y^* \mid x^*, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x^*)^T A^{-1} \Phi(X)\, y,\ \ \phi(x^*)^T A^{-1} \phi(x^*)\right)$,   with   $A = \tfrac{1}{\sigma^2} \Phi(X)\Phi(X)^T + \Sigma_w^{-1}$
Take as kernel the inner product in feature space: $k(x, x') = \phi(x)^T \Sigma_w\, \phi(x')$
The predictive mean can then be written entirely in terms of the kernel:

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$
Gaussian Process Regression
$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$,   $\sigma^2 > 0$
All datapoints are used in the computation!
Gaussian Process Regression
$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$
The kernel and its hyperparameters are given by the user.
These can be optimized by maximizing the marginal likelihood, i.e. p(y|X; parameters).
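For illustration (not from the slides), a minimal GPR sketch with scikit-learn, where fit() maximizes the log marginal likelihood p(y|X; parameters) over the kernel length-scale and noise level; all values below are illustrative starting points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)

# RBF length_scale plays the role of the kernel width; WhiteKernel models the noise.
kernel = RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)

print(gpr.kernel_)                                   # optimized hyperparameters
mean, std = gpr.predict(np.array([[0.5]]), return_std=True)
```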
Gaussian Process Regression
Sensitivity to the choice of kernel width (called length-scale in most books) when using
Gaussian kernels (also called RBF or squared exponential):
$k(x, x') = e^{-\frac{\|x - x'\|^2}{l}}$,   kernel width = 0.1.
Gaussian Process Regression
Sensitivity to the choice of kernel width (length-scale) when using Gaussian kernels:
$k(x, x') = e^{-\frac{\|x - x'\|^2}{l}}$,   kernel width = 0.5.
Gaussian Process Regression
$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$,   with   $\alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} y$

$\text{cov}\{p(y \mid x, X, y)\} = K(x, x) - K(x, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, x)$

The value of the noise $\sigma^2$ needs to be pre-set by hand.
The larger the noise, the more uncertainty. The noise is $\le 1$.
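A minimal numpy sketch (not from the slides) of the predictive mean and covariance formulas above for 1-D inputs; the kernel form and the width and noise values are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, width=0.1):
    """Gaussian kernel matrix k(a, b) = exp(-(a - b)^2 / width) for 1-D points."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / width)

def gp_predict(x_query, X, y, width=0.1, noise=0.05):
    """GPR predictive mean and covariance at the query points."""
    K = rbf(X, X, width) + noise * np.eye(len(X))      # K(X,X) + sigma^2 I
    alpha = np.linalg.solve(K, y)                      # alpha = (K + sigma^2 I)^-1 y
    k_q = rbf(x_query, X, width)                       # K(x, X)
    mean = k_q @ alpha                                 # sum_i alpha_i k(x^i, x)
    cov = rbf(x_query, x_query, width) - k_q @ np.linalg.solve(K, k_q.T)
    return mean, cov

X = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * X) + 0.05 * np.random.randn(30)
mean, cov = gp_predict(np.array([0.25, 0.5]), X, y)
```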
Gaussian Process Regression
[Figures: fit with low noise (0.05) vs. fit with high noise (0.2).]
Comparison Across Methods
Generalization: prediction away from the datapoints.
GPR predicts y = 0 away from the datapoints:   $\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x)$
SVR predicts y = b away from the datapoints:   $y = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\, k(x^i, x) + b$
Comparison Across Methods
Generalization – prediction away from datapoints
GMR predicts the trend away from data
Comparison Across Methods
Generalization: prediction away from the datapoints.
But the prediction depends on the initialization and on the solution found during training.
The variance of p(y|x) in GMR represents the uncertainty of the predictive model.
The variance of p(y|x) in GPR represents the uncertainty of the predictive model.
Comparison Across Methods
The variance in SVR represents the $\epsilon$-tube and does not represent the uncertainty of
the model either: there is no measure of uncertainty in SVR!
SVR, GPR, GMR: Similarities
• SVR, GMR and GPR are based on the same regressive model: $y = f(x) + \epsilon$,   $\epsilon \sim N(0, \sigma^2)$
• GMR and GPR are Gaussian conditional distributions.
• SVR, GMR and GPR compute a weighted combination of local predictors.
• SVR, GMR and GPR separate the input space into regions modeled by Gaussian
distributions (true only when using Gaussian kernels for GPR and SVR).
• GMR allows multi-dimensional outputs to be predicted, while SVR and GPR can
predict only a uni-dimensional output y.
SVR, GPR, GMR: Differences
SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not
optimize the same objective function → they find different solutions.
• SVR:
• minimizes the reconstruction error through convex optimization → ensured to find
the optimal estimate, but the solution is not unique
• usually finds a number of models <= the number of datapoints (the support vectors)
• GMR:
• learns p(x,y) through maximum likelihood → finds a local optimum
• computes a generative model p(x,y) from which it derives p(y|x)
• starts with a low number of models << the number of datapoints
• GPR:
• no optimization; analytical (optimal) solution
• expresses p(y|x) as a full density model
• number of models = number of datapoints!
Hyperparameters of SVR, GPR, GMR
SVR, GMR and GPR all depend on hyperparameters that need to be determined
beforehand. These are:
• SVR:
• choice of the error margin $\epsilon$, which can be replaced by the choice of $\nu$ as in ν-SVR
• choice of kernel and associated kernel parameters
• GMR:
• choice of the number of Gaussians
• choice of the initialization (affects convergence to a local optimum)
• GPR:
• choice of the noise parameter
• choice of the kernel width (length-scale)
The hyperparameters can also be optimized separately; e.g. the number of Gaussians in GMR
can be estimated using BIC, the length-scale and noise of GPR can be estimated through maximum
likelihood, and the kernel parameters of SVR can be optimized through grid search (see the
sketch below).
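As an illustration of the last point (not from the slides), a minimal grid search over the SVR hyperparameters with scikit-learn; the grid values and the cross-validation score are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)

# Cross-validated grid search over C, epsilon and the RBF kernel parameter gamma
grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [1, 10, 100, 1000],
                                "epsilon": [0.01, 0.03, 0.1],
                                "gamma": [1, 10, 50, 100]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```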
Conclusion
No easy way to determine which regression technique best fits your problem.

        Training                                            Testing
SVR     Convex optimization                                 Few support vectors (small fraction of the original data)
GMR     EM, iterative technique, needs several runs         Few Gaussian functions (small fraction of the original data)
GPR     Analytical solution; one shot, but uses all data    Keeps all the datapoints