Ch 3. Linear Models for Regression (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Previously summarized by Yung-Kyun Noh
Modified and presented by Rhee, Je-Keun
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents

3.1 Linear Basis Function Models
  3.1.1 Maximum likelihood and least squares
  3.1.2 Geometry of least squares
  3.1.3 Sequential learning
  3.1.4 Regularized least squares
  3.1.5 Multiple outputs
3.2 The Bias-Variance Decomposition
3.3 Bayesian Linear Regression
  3.3.1 Parameter distribution
  3.3.2 Predictive distribution
  3.3.3 Equivalent kernel
Linear Basis Function Models

Linear regression:
- The model is linear in the parameters.
- Using basis functions allows the model to be a nonlinear function of the input vector x.
- Linearity in the parameters simplifies the analysis of this class of models, but also leads to some significant limitations.
The simplest linear model is a linear function of the input variables:

  y(x, w) = w_0 + w_1 x_1 + \dots + w_D x_D

More generally, with M the total number of parameters and basis functions \phi_j(x) (where \phi_0(x) = 1 is a dummy basis function):

  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)

where w = (w_0, \dots, w_{M-1})^T and \phi = (\phi_0, \dots, \phi_{M-1})^T.
Basis Functions

- Polynomial: \phi_j(x) = x^j. These are global functions of the input variable; spline functions (piecewise polynomials) are a local alternative.
- Gaussian: \phi_j(x) = \exp\{ -(x - \mu_j)^2 / (2s^2) \}
- Sigmoidal: \phi_j(x) = \sigma( (x - \mu_j) / s ), where \sigma(a) = 1 / (1 + \exp(-a)) is the logistic sigmoid function.
- Other choices: Fourier basis, wavelets.
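To make the three families concrete, here is a minimal NumPy sketch; the centres mu_j and scale s are illustrative choices, not values from the text:

```python
import numpy as np

def polynomial_basis(x, j):
    """Global polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu, s):
    """Local Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x - mu) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """Sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s)."""
    a = (x - mu) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)        # a few sample inputs
mus = np.linspace(-1, 1, 3)      # illustrative centres mu_j
s = 0.4                          # illustrative scale parameter
print(polynomial_basis(x, 2))                        # shape (5,)
print(gaussian_basis(x[:, None], mus[None, :], s))   # shape (5, 3)
print(sigmoidal_basis(x[:, None], mus[None, :], s))  # shape (5, 3)
```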
Maximum Likelihood and Least Squares (1/3)

Assumption: Gaussian noise model

  t = y(x, w) + \epsilon

where \epsilon is a zero-mean Gaussian random variable with precision (inverse variance) \beta:

  p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

Result: the conditional mean is (unimodal)

  \mathbb{E}[t \mid x] = \int t \, p(t \mid x) \, dt = y(x, w)

For a dataset X = \{x_1, \dots, x_N\} with targets t = (t_1, \dots, t_N)^T, the likelihood is (dropping the explicit x):

  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
Maximum Likelihood and Least Squares (2/3)

Log-likelihood:

  \ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
                         = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

where the sum-of-squares error function is

  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2

Maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is therefore equivalent to minimizing a sum-of-squares error function.
Maximum Likelihood and Least Squares (3/3)
The gradient of the log-likelihood function is

  \nabla \ln p(t \mid w, \beta) = \beta \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T

Setting this gradient to zero and solving for w gives

  w_{ML} = ( \Phi^T \Phi )^{-1} \Phi^T t

where \Phi is the N x M design matrix with elements \Phi_{nj} = \phi_j(x_n):

  \Phi = \begin{pmatrix}
    \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\
    \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\
    \vdots      & \vdots      & \ddots & \vdots          \\
    \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N)
  \end{pmatrix}
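A minimal NumPy sketch of the maximum-likelihood fit: build the design matrix and solve the normal equations. The synthetic sinusoidal data and the Gaussian basis parameters are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 50, 9, 0.15                    # illustrative sizes and scale
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)  # synthetic targets

mus = np.linspace(0, 1, M - 1)           # Gaussian basis centres

def design_matrix(x):
    """Phi[n, j] = phi_j(x_n), with phi_0(x) = 1 as the dummy basis."""
    gauss = np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), gauss])

Phi = design_matrix(x)                   # N x M design matrix
# w_ML = (Phi^T Phi)^{-1} Phi^T t; lstsq solves this more stably
# than forming the inverse explicitly.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```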
Bias and Precision Parameter by ML

Setting the corresponding derivatives of the log-likelihood to zero yields further solutions.

Maximizing the log-likelihood with respect to the bias w_0 gives

  w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j

where

  \bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n,  \quad  \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(x_n)

The bias compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.

Maximizing the log-likelihood with respect to the noise precision parameter gives

  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - w_{ML}^T \phi(x_n) \}^2
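Continuing the least-squares sketch above (reusing the Phi, w_ml, and t defined there), the bias and noise-precision estimates are one line each:

```python
# Bias: w_0 = t_bar - sum_j w_j * phi_bar_j
t_bar = t.mean()
phi_bar = Phi[:, 1:].mean(axis=0)        # averages of the basis values
w0 = t_bar - w_ml[1:] @ phi_bar          # equals w_ml[0] at the ML solution

# Noise precision: 1/beta_ML = (1/N) sum_n (t_n - w_ML^T phi(x_n))^2
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
```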
Geometry of Least Squares

If the number M of basis functions is smaller than the number N of data points, then the M vectors \varphi_j (where \varphi_j is the jth column of \Phi, with elements \phi_j(x_n)) span a linear subspace S of dimensionality M.

y is a linear combination of the \varphi_j. The least-squares solution for w corresponds to the choice of y that lies in the subspace S and is closest to t.
Sequential Learning

On-line learning: the technique of stochastic gradient descent (or sequential gradient descent) updates the parameters one data point at a time:

  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

For the case of the sum-of-squares error function this gives the least-mean-squares (LMS) algorithm:

  w^{(\tau+1)} = w^{(\tau)} + \eta ( t_n - w^{(\tau)T} \phi_n ) \phi_n
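A minimal sketch of the LMS update; the learning rate eta and number of passes are illustrative, and convergence depends on choosing eta appropriately:

```python
import numpy as np

def lms_fit(Phi, t, eta=0.05, passes=20):
    """Least-mean-squares: one stochastic gradient step per pattern,
    w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    w = np.zeros(Phi.shape[1])
    for _ in range(passes):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n
    return w
```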
Regularized Least Squares

Regularized least squares controls over-fitting. The total error function is

  E_D(w) + \lambda E_W(w)

with

  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2,  \quad  E_W(w) = \frac{1}{2} w^T w

Setting the gradient to zero gives the closed-form solution

  w = ( \lambda I + \Phi^T \Phi )^{-1} \Phi^T t

This represents a simple extension of the least-squares solution.

A more general regularizer gives the total error

  \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q
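The closed-form solution translates directly into code; a minimal sketch, where lam stands for the regularization coefficient \lambda:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (lam I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```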
General Regularizer

The case q = 1 in the general regularizer is known as the 'lasso' in the statistical literature.

If \lambda is sufficiently large, some of the coefficients w_j are driven exactly to zero, giving a sparse model in which the corresponding basis functions play no role.

This is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint

  \sum_{j=1}^{M} |w_j|^q \le \eta

[Figure: contours of the regularization term; the lasso constraint gives the sparse solution.]
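The q = 1 case has no closed form, but a simple iterative scheme makes the sparsity visible. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is one standard solver rather than anything prescribed by the text; the step size and iteration count are illustrative:

```python
import numpy as np

def lasso_ista(Phi, t, lam, n_iter=500):
    """Minimize (1/2)||t - Phi w||^2 + lam ||w||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = w + step * Phi.T @ (t - Phi @ w)   # gradient step on the quadratic
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w   # for large lam, many entries are driven exactly to zero
```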
Regularization & complexity

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity.

However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient \lambda.
Multiple Outputs

For K > 1 target variables, there are two options:
1. Introduce a different set of basis functions for each component of t.
2. Use the same set of basis functions to model all of the components of the target vector (W: M x K matrix of parameters):

  y(x, W) = W^T \phi(x)
  p(t \mid x, W, \beta) = \mathcal{N}( t \mid W^T \phi(x), \beta^{-1} I )

For each target variable t_k,

  w_k = ( \Phi^T \Phi )^{-1} \Phi^T t_k = \Phi^{\dagger} t_k

where \Phi^{\dagger} is the pseudo-inverse of \Phi.
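Because all K outputs share one design matrix, a single pseudo-inverse solves every column at once; a minimal sketch (the function name is mine):

```python
import numpy as np

def fit_multiple_outputs(Phi, T):
    """W_ML = pinv(Phi) @ T; column k is the solution for target t_k.

    Phi: N x M design matrix shared by all outputs.
    T:   N x K target matrix whose k-th column is t_k.
    """
    return np.linalg.pinv(Phi) @ T   # M x K parameter matrix
```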
The Bias-Variance Decomposition (1/4)
Frequentist viewpoint of the model complexity issue: bias-variance trade-off.
Expected squared loss:

  \mathbb{E}[L] = \int \{ y(x) - h(x) \}^2 p(x) \, dx + \int\!\!\int \{ h(x) - t \}^2 p(x, t) \, dx \, dt

The second term arises from the intrinsic noise on the data, and

  h(x) = \mathbb{E}[t \mid x] = \int t \, p(t \mid x) \, dt

- Bayesian: the uncertainty in our model is expressed through a posterior distribution over w.
- Frequentist: make a point estimate of w based on the data set D, and consider the quantity (dependent on the particular data set D)

  \mathbb{E}_D[ \{ y(x; D) - h(x) \}^2 ]
The Bias-Variance Decomposition (2/4)

- Bias: the extent to which the average prediction over all data sets differs from the desired regression function.
- Variance: the extent to which the solutions for individual data sets vary around their average, i.e. the extent to which the function y(x; D) is sensitive to the particular choice of data set.

Expected loss = (bias)^2 + variance + noise

  \mathbb{E}_D[ \{ y(x; D) - h(x) \}^2 ]
    = \underbrace{\{ \mathbb{E}_D[ y(x; D) ] - h(x) \}^2}_{(\text{bias})^2}
    + \underbrace{\mathbb{E}_D[ \{ y(x; D) - \mathbb{E}_D[ y(x; D) ] \}^2 ]}_{\text{variance}}
The Bias-Variance Decomposition (3/4)

Bias-variance trade-off: averaging many solutions for the complex model (M = 25) is a beneficial procedure, as the example with h(x) = \sin(2\pi x) shows.

A weighted averaging (although with respect to the posterior distribution of parameters, not with respect to multiple data sets) of multiple solutions lies at the heart of the Bayesian approach.
The Bias-Variance Decomposition (4/4)

The average prediction over L data sets:

  \bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x)

Bias and variance:

  (\text{bias})^2 = \frac{1}{N} \sum_{n=1}^{N} \{ \bar{y}(x_n) - h(x_n) \}^2

  \text{variance} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \{ y^{(l)}(x_n) - \bar{y}(x_n) \}^2

The bias-variance decomposition is based on averages with respect to ensembles of data sets (a frequentist perspective). Given multiple data sets, we would be better off combining them into a single large training set.
Bayesian Linear Regression

Model complexity cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of valuable data.

A Bayesian treatment of linear regression avoids the over-fitting problem of maximum likelihood, and also leads to automatic methods of determining model complexity using the training data alone.
Parameter distribution (1/3)

Conjugate prior of the likelihood:

  p(w) = \mathcal{N}( w \mid m_0, S_0 )

Posterior:

  p(w \mid t) = \mathcal{N}( w \mid m_N, S_N )

  m_N = S_N ( S_0^{-1} m_0 + \beta \Phi^T t )
  S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

Since the posterior is Gaussian, its mode coincides with its mean, so the maximum posterior weight vector is simply w_{MAP} = m_N.

If S_0 = \alpha^{-1} I with \alpha \to 0, the mean m_N reduces to w_{ML} = ( \Phi^T \Phi )^{-1} \Phi^T t given by (3.15).
Parameter distribution (2/3)

Consider the zero-mean isotropic Gaussian prior

  p(w \mid \alpha) = \mathcal{N}( w \mid 0, \alpha^{-1} I )

The corresponding posterior has

  m_N = \beta S_N \Phi^T t
  S_N^{-1} = \alpha I + \beta \Phi^T \Phi

Log of the posterior:

  \ln p(w \mid t) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 - \frac{\alpha}{2} w^T w + \text{const.}

Maximization of this posterior distribution with respect to w is therefore equivalent to minimization of the sum-of-squares error function with the addition of a quadratic regularization term, with \lambda = \alpha / \beta.
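A minimal sketch of the posterior update for this prior; alpha and beta are hyperparameters the caller supplies, and the function name is mine:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """m_N and S_N for the prior p(w) = N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha I + beta Phi^T Phi,  m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```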
Parameter distribution (3/3)

Other forms of prior over the parameters:

  p(w \mid \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^M \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)

[Figure: example with the straight-line model y(x, w) = w_0 + w_1 x.]
Predictive Distribution (1/2)

Our real interest is the predictive distribution

  p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta) \, p(w \mid \mathbf{t}, \alpha, \beta) \, dw

with

  p(t \mid x, w, \beta) = \mathcal{N}( t \mid y(x, w), \beta^{-1} )
  p(w \mid \mathbf{t}) = \mathcal{N}( w \mid m_N, S_N )

This gives

  p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}( t \mid m_N^T \phi(x), \sigma_N^2(x) )

  \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)

The first term represents the noise on the data; the second term reflects the uncertainty associated with the parameters w, and goes to 0 as N \to \infty.

[Figure: mean of the Gaussian predictive distribution (red line) and predictive uncertainty (shaded region) as the number of data points increases.]
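The predictive mean and variance follow directly from m_N and S_N; a minimal sketch reusing the posterior function above (the helper name is mine):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Gaussian predictive distribution at one input:
    mean = m_N^T phi(x),  sigma_N^2 = 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```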
Predictive Distribution (2/2)
Draw samples from the posterior distribution over w.
Equivalent Kernel (1/2)

If we substitute (3.53) into the expression (3.3), we see that the predictive mean at a point x can be written in the form

  y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} \beta \phi(x)^T S_N \phi(x_n) \, t_n

that is,

  y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) \, t_n,  \quad  k(x, x') = \beta \phi(x)^T S_N \phi(x')

which is known as the smoother matrix or equivalent kernel.

[Figure: equivalent kernels for polynomial and sigmoidal basis functions.]
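A minimal sketch of the equivalent kernel, computed from the same posterior covariance S_N as above (function names are mine):

```python
import numpy as np

def equivalent_kernel(phi_x, phi_xp, S_N, beta):
    """k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * phi_x @ S_N @ phi_xp

def predictive_mean_via_kernel(phi_x, Phi, t, S_N, beta):
    """y(x, m_N) = sum_n k(x, x_n) t_n."""
    weights = beta * (Phi @ S_N @ phi_x)   # k(x, x_n) for every n
    return weights @ t
```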
Equivalent Kernel (2/2)

Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localized kernel directly and use it to make predictions for a new input vector x, given the observed training set. This leads to a practical framework for regression (and classification) called Gaussian processes.

The equivalent kernel satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector \psi(x) of nonlinear functions:

  k(x, z) = \psi(x)^T \psi(z),  \quad  \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)