Ch 3. Linear Models for Regression (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Previously summarized by Yung-Kyun Noh
Modified and presented by Rhee, Je-Keun
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents

3.1 Linear Basis Function Models
  3.1.1 Maximum likelihood and least squares
  3.1.2 Geometry of least squares
  3.1.3 Sequential learning
  3.1.4 Regularized least squares
  3.1.5 Multiple outputs
3.2 The Bias-Variance Decomposition
3.3 Bayesian Linear Regression
  3.3.1 Parameter distribution
  3.3.2 Predictive distribution
  3.3.3 Equivalent kernel
Linear Basis Function Models

Linear regression:
- The model is linear in the parameters.
- Using basis functions allows the model to be a nonlinear function of the input vector x.
- Linearity in the parameters simplifies the analysis of this class of models, but also leads to some significant limitations.
The simplest linear model is a linear function of the input variables:

  y(x, w) = w_0 + w_1 x_1 + \dots + w_D x_D

More generally, with M the total number of parameters and basis functions \phi_j(x) (where \phi_0(x) = 1 is a dummy basis function):

  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)

where w = (w_0, \dots, w_{M-1})^T and \phi = (\phi_0, \dots, \phi_{M-1})^T.
Basis Functions

- Polynomial: \phi_j(x) = x^j. These are global functions of the input variable; spline functions (piecewise polynomials) are a local alternative.
- Gaussian: \phi_j(x) = \exp\{ -(x - \mu_j)^2 / (2s^2) \}
- Sigmoidal: \phi_j(x) = \sigma( (x - \mu_j) / s ), where \sigma(a) = 1 / (1 + \exp(-a)) is the logistic sigmoid function.
- Other choices: Fourier basis, wavelets.
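To make the three families concrete, here is a minimal NumPy sketch; the centres mu_j and scale s are illustrative choices, not values from the text:

```python
import numpy as np

def polynomial_basis(x, j):
    """Global polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu, s):
    """Local Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x - mu) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """Sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s)."""
    a = (x - mu) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)        # a few sample inputs
mus = np.linspace(-1, 1, 3)      # illustrative centres mu_j
s = 0.4                          # illustrative scale parameter
print(polynomial_basis(x, 2))                        # shape (5,)
print(gaussian_basis(x[:, None], mus[None, :], s))   # shape (5, 3)
print(sigmoidal_basis(x[:, None], mus[None, :], s))  # shape (5, 3)
```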
Maximum Likelihood and Least Squares (1/3)

Assumption: Gaussian noise model

  t = y(x, w) + \epsilon

where \epsilon is a zero-mean Gaussian random variable with precision (inverse variance) \beta:

  p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

Result: the conditional mean is (unimodal)

  \mathbb{E}[t \mid x] = \int t \, p(t \mid x) \, dt = y(x, w)

For a dataset X = \{x_1, \dots, x_N\} with targets t = (t_1, \dots, t_N)^T, the likelihood is (dropping the explicit x):

  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
Maximum Likelihood and Least Squares (2/3)

Log-likelihood:

  \ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
                         = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

where the sum-of-squares error function is

  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2

Maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is therefore equivalent to minimizing a sum-of-squares error function.
Maximum Likelihood and Least Squares (3/3)
The gradient of the log-likelihood function is

  \nabla \ln p(t \mid w, \beta) = \beta \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T

Setting this gradient to zero and solving for w gives

  w_{ML} = ( \Phi^T \Phi )^{-1} \Phi^T t

where \Phi is the N x M design matrix with elements \Phi_{nj} = \phi_j(x_n):

  \Phi = \begin{pmatrix}
    \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\
    \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\
    \vdots      & \vdots      & \ddots & \vdots          \\
    \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N)
  \end{pmatrix}
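A minimal NumPy sketch of the maximum-likelihood fit: build the design matrix and solve the normal equations. The synthetic sinusoidal data and the Gaussian basis parameters are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 50, 9, 0.15                    # illustrative sizes and scale
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)  # synthetic targets

mus = np.linspace(0, 1, M - 1)           # Gaussian basis centres

def design_matrix(x):
    """Phi[n, j] = phi_j(x_n), with phi_0(x) = 1 as the dummy basis."""
    gauss = np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), gauss])

Phi = design_matrix(x)                   # N x M design matrix
# w_ML = (Phi^T Phi)^{-1} Phi^T t; lstsq solves this more stably
# than forming the inverse explicitly.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```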
Bias and Precision Parameter by ML

Setting the corresponding derivatives of the log-likelihood to zero yields further solutions.

Maximizing the log-likelihood with respect to the bias w_0 gives

  w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j

where

  \bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n,  \quad  \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(x_n)

The bias compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.

Maximizing the log-likelihood with respect to the noise precision parameter gives

  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - w_{ML}^T \phi(x_n) \}^2
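Continuing the least-squares sketch above (reusing the Phi, w_ml, and t defined there), the bias and noise-precision estimates are one line each:

```python
# Bias: w_0 = t_bar - sum_j w_j * phi_bar_j
t_bar = t.mean()
phi_bar = Phi[:, 1:].mean(axis=0)        # averages of the basis values
w0 = t_bar - w_ml[1:] @ phi_bar          # equals w_ml[0] at the ML solution

# Noise precision: 1/beta_ML = (1/N) sum_n (t_n - w_ML^T phi(x_n))^2
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
```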
Geometry of Least Squares

If the number M of basis functions is smaller than the number N of data points, then the M vectors \varphi_j (where \varphi_j is the jth column of \Phi, with elements \phi_j(x_n)) span a linear subspace S of dimensionality M.

y is a linear combination of the \varphi_j. The least-squares solution for w corresponds to the choice of y that lies in the subspace S and is closest to t.
Sequential Learning

On-line learning: the technique of stochastic gradient descent (or sequential gradient descent) updates the parameters one data point at a time:

  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

For the case of the sum-of-squares error function this gives the least-mean-squares (LMS) algorithm:

  w^{(\tau+1)} = w^{(\tau)} + \eta ( t_n - w^{(\tau)T} \phi_n ) \phi_n
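A minimal sketch of the LMS update; the learning rate eta and number of passes are illustrative, and convergence depends on choosing eta appropriately:

```python
import numpy as np

def lms_fit(Phi, t, eta=0.05, passes=20):
    """Least-mean-squares: one stochastic gradient step per pattern,
    w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    w = np.zeros(Phi.shape[1])
    for _ in range(passes):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n
    return w
```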
Regularized Least Squares

Regularized least squares controls over-fitting. The total error function is

  E_D(w) + \lambda E_W(w)

with

  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2,  \quad  E_W(w) = \frac{1}{2} w^T w

Setting the gradient to zero gives the closed-form solution

  w = ( \lambda I + \Phi^T \Phi )^{-1} \Phi^T t

This represents a simple extension of the least-squares solution.

A more general regularizer gives the total error

  \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q
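The closed-form solution translates directly into code; a minimal sketch, where lam stands for the regularization coefficient \lambda:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (lam I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```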
General Regularizer

The case q = 1 in the general regularizer is known as the 'lasso' in the statistical literature.

If \lambda is sufficiently large, some of the coefficients w_j are driven exactly to zero, giving a sparse model in which the corresponding basis functions play no role.

This is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint

  \sum_{j=1}^{M} |w_j|^q \le \eta

[Figure: contours of the regularization term; the lasso constraint gives the sparse solution.]
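The q = 1 case has no closed form, but a simple iterative scheme makes the sparsity visible. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is one standard solver rather than anything prescribed by the text; the step size and iteration count are illustrative:

```python
import numpy as np

def lasso_ista(Phi, t, lam, n_iter=500):
    """Minimize (1/2)||t - Phi w||^2 + lam ||w||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = w + step * Phi.T @ (t - Phi @ w)   # gradient step on the quadratic
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w   # for large lam, many entries are driven exactly to zero
```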
Regularization & complexity

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity.

However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient \lambda.
Multiple Outputs

For K > 1 target variables, there are two options:
1. Introduce a different set of basis functions for each component of t.
2. Use the same set of basis functions to model all of the components of the target vector (W: M x K matrix of parameters):

  y(x, W) = W^T \phi(x)
  p(t \mid x, W, \beta) = \mathcal{N}( t \mid W^T \phi(x), \beta^{-1} I )

For each target variable t_k,

  w_k = ( \Phi^T \Phi )^{-1} \Phi^T t_k = \Phi^{\dagger} t_k

where \Phi^{\dagger} is the pseudo-inverse of \Phi.
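Because all K outputs share one design matrix, a single pseudo-inverse solves every column at once; a minimal sketch (the function name is mine):

```python
import numpy as np

def fit_multiple_outputs(Phi, T):
    """W_ML = pinv(Phi) @ T; column k is the solution for target t_k.

    Phi: N x M design matrix shared by all outputs.
    T:   N x K target matrix whose k-th column is t_k.
    """
    return np.linalg.pinv(Phi) @ T   # M x K parameter matrix
```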
The Bias-Variance Decomposition (1/4)
Frequentist viewpoint of the model complexity issue: bias-variance trade-off.
Expected squared loss:

  \mathbb{E}[L] = \int \{ y(x) - h(x) \}^2 p(x) \, dx + \int\!\!\int \{ h(x) - t \}^2 p(x, t) \, dx \, dt

The second term arises from the intrinsic noise on the data, and

  h(x) = \mathbb{E}[t \mid x] = \int t \, p(t \mid x) \, dt

- Bayesian: the uncertainty in our model is expressed through a posterior distribution over w.
- Frequentist: make a point estimate of w based on the data set D, and consider the quantity (dependent on the particular data set D)

  \mathbb{E}_D[ \{ y(x; D) - h(x) \}^2 ]
The Bias-Variance Decomposition (2/4)

- Bias: the extent to which the average prediction over all data sets differs from the desired regression function.
- Variance: the extent to which the solutions for individual data sets vary around their average, i.e. the extent to which the function y(x; D) is sensitive to the particular choice of data set.

Expected loss = (bias)^2 + variance + noise

  \mathbb{E}_D[ \{ y(x; D) - h(x) \}^2 ]
    = \underbrace{\{ \mathbb{E}_D[ y(x; D) ] - h(x) \}^2}_{(\text{bias})^2}
    + \underbrace{\mathbb{E}_D[ \{ y(x; D) - \mathbb{E}_D[ y(x; D) ] \}^2 ]}_{\text{variance}}
The Bias-Variance Decomposition (3/4)

Bias-variance trade-off: averaging many solutions for the complex model (M = 25) is a beneficial procedure, as the example with h(x) = \sin(2\pi x) shows.

A weighted averaging (although with respect to the posterior distribution of parameters, not with respect to multiple data sets) of multiple solutions lies at the heart of the Bayesian approach.
The Bias-Variance Decomposition (4/4)

The average prediction over L data sets:

  \bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x)

Bias and variance:

  (\text{bias})^2 = \frac{1}{N} \sum_{n=1}^{N} \{ \bar{y}(x_n) - h(x_n) \}^2

  \text{variance} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \{ y^{(l)}(x_n) - \bar{y}(x_n) \}^2

The bias-variance decomposition is based on averages with respect to ensembles of data sets (a frequentist perspective). Given multiple data sets, we would be better off combining them into a single large training set.
Bayesian Linear Regression

Model complexity cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of valuable data.

A Bayesian treatment of linear regression avoids the over-fitting problem of maximum likelihood, and also leads to automatic methods of determining model complexity using the training data alone.
Parameter distribution (1/3)

Conjugate prior of the likelihood:

  p(w) = \mathcal{N}( w \mid m_0, S_0 )

Posterior:

  p(w \mid t) = \mathcal{N}( w \mid m_N, S_N )

  m_N = S_N ( S_0^{-1} m_0 + \beta \Phi^T t )
  S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

Since the posterior is Gaussian, its mode coincides with its mean, so the maximum posterior weight vector is simply w_{MAP} = m_N.

If S_0 = \alpha^{-1} I with \alpha \to 0, the mean m_N reduces to w_{ML} = ( \Phi^T \Phi )^{-1} \Phi^T t given by (3.15).
Parameter distribution (2/3)

Consider the zero-mean isotropic Gaussian prior

  p(w \mid \alpha) = \mathcal{N}( w \mid 0, \alpha^{-1} I )

The corresponding posterior has

  m_N = \beta S_N \Phi^T t
  S_N^{-1} = \alpha I + \beta \Phi^T \Phi

Log of the posterior:

  \ln p(w \mid t) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 - \frac{\alpha}{2} w^T w + \text{const.}

Maximization of this posterior distribution with respect to w is therefore equivalent to minimization of the sum-of-squares error function with the addition of a quadratic regularization term, with \lambda = \alpha / \beta.
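A minimal sketch of the posterior update for this prior; alpha and beta are hyperparameters the caller supplies, and the function name is mine:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """m_N and S_N for the prior p(w) = N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha I + beta Phi^T Phi,  m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```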
Parameter distribution (3/3)

Other forms of prior over the parameters:

  p(w \mid \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^M \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)

[Figure: example with the straight-line model y(x, w) = w_0 + w_1 x.]
Predictive Distribution (1/2)

Our real interest is the predictive distribution

  p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta) \, p(w \mid \mathbf{t}, \alpha, \beta) \, dw

with

  p(t \mid x, w, \beta) = \mathcal{N}( t \mid y(x, w), \beta^{-1} )
  p(w \mid \mathbf{t}) = \mathcal{N}( w \mid m_N, S_N )

This gives

  p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}( t \mid m_N^T \phi(x), \sigma_N^2(x) )

  \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)

The first term represents the noise on the data; the second term reflects the uncertainty associated with the parameters w, and goes to 0 as N \to \infty.

[Figure: mean of the Gaussian predictive distribution (red line) and predictive uncertainty (shaded region) as the number of data points increases.]
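The predictive mean and variance follow directly from m_N and S_N; a minimal sketch reusing the posterior function above (the helper name is mine):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Gaussian predictive distribution at one input:
    mean = m_N^T phi(x),  sigma_N^2 = 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```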
Predictive Distribution (2/2)
Draw samples from the posterior distribution over w.
Equivalent Kernel (1/2)

If we substitute (3.53) into the expression (3.3), we see that the predictive mean at a point x can be written in the form

  y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} \beta \phi(x)^T S_N \phi(x_n) \, t_n

that is,

  y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) \, t_n,  \quad  k(x, x') = \beta \phi(x)^T S_N \phi(x')

which is known as the smoother matrix or equivalent kernel.

[Figure: equivalent kernels for polynomial and sigmoidal basis functions.]
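A minimal sketch of the equivalent kernel, computed from the same posterior covariance S_N as above (function names are mine):

```python
import numpy as np

def equivalent_kernel(phi_x, phi_xp, S_N, beta):
    """k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * phi_x @ S_N @ phi_xp

def predictive_mean_via_kernel(phi_x, Phi, t, S_N, beta):
    """y(x, m_N) = sum_n k(x, x_n) t_n."""
    weights = beta * (Phi @ S_N @ phi_x)   # k(x, x_n) for every n
    return weights @ t
```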
Equivalent Kernel (2/2)

Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localized kernel directly and use it to make predictions for a new input vector x, given the observed training set. This leads to a practical framework for regression (and classification) called Gaussian processes.

The equivalent kernel satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector \psi(x) of nonlinear functions:

  k(x, z) = \psi(x)^T \psi(z),  \quad  \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)