1
Linear Methods for Regression
Lecture Notes for CMPUT 466/551
Nilanjan Ray
2
Assumption: Linear Regression Function
Model assumption: Output $Y$ is linear in the inputs $X = (X_1, X_2, X_3, \ldots, X_p)$.

Predict the output by:
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j = X^T \hat{\beta}$$
(vector notation, with the constant 1 included in $X$).

Where, over the $N$ training cases,
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{bmatrix}$$

Also known as multiple regression when $p > 1$.
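As a concrete illustration, here is a minimal NumPy sketch of building the design matrix above and forming predictions (the function names are ours, not from the slides):

```python
import numpy as np

def design_matrix(inputs):
    """Build the N x (p+1) design matrix: each row is (1, x_i1, ..., x_ip)."""
    N = inputs.shape[0]
    return np.hstack([np.ones((N, 1)), inputs])

def predict(X, beta):
    """Row-wise prediction y_hat_i = x_i^T beta, with the 1 included in x_i."""
    return X @ beta
```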
3
Least Squares Solution
Residual sum of squares:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$

In matrix-vector notation:
$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$$
where $\mathbf{y} - \mathbf{X}\beta$ is the residual.

Vector differentiation:
$$\frac{\partial\,\mathrm{RSS}(\beta)}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$$

Solution (known as the least squares solution):
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

For a new input $x_0 = (1, x_{01}, x_{02}, \ldots, x_{0p})^T$, the regression output is
$$\hat{Y}(x_0) = x_0^T\hat{\beta} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
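A minimal NumPy sketch of this computation (names are illustrative; solving the normal equations with np.linalg.solve is preferred to forming the inverse explicitly):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X^T X) beta = X^T y.

    X: N x (p+1) design matrix whose first column is all ones.
    y: length-N output vector.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta_hat = least_squares(X, y)
x0 = np.array([1.0, 0.2, -0.1, 0.5])   # new input, 1 included
y0_hat = x0 @ beta_hat                 # regression output for x0
```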
4
Bias-Variance Decomposition
Model: $Y = f(X) + \varepsilon$, where $f(X) = X^T\beta$ and the noise $\varepsilon$ has zero expectation, the same variance $\sigma^2$, and is uncorrelated across training cases.

Estimator:
$$\hat{y}(x_0) = \hat{f}(x_0) = x_0^T\hat{\beta} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Bias:
$$\begin{aligned}
f(x_0) - E[\hat{f}(x_0)] &= x_0^T\beta - E[x_0^T\hat{\beta}] \\
&= x_0^T\beta - E[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \boldsymbol{\varepsilon})] \\
&= x_0^T\beta - x_0^T\beta - E[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}] \\
&= -E[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}] = 0
\end{aligned}$$

Unbiased estimator! (Ex.: show the last step.)
Variance:
$$\begin{aligned}
E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] &= E[(\hat{f}(x_0) - f(x_0))^2] \\
&= E[(x_0^T\hat{\beta} - x_0^T\beta)^2] \\
&= E[(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon})^2] \\
&= \sigma^2\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0 \\
&\approx \sigma^2 (p/N)
\end{aligned}$$

(The last step holds on average, when test inputs $x_0$ are drawn like the training inputs.)
Decomposition of EPE:
$$\begin{aligned}
EPE(x_0) &= E[(y_0 - \hat{y}(x_0))^2] \\
&= \sigma^2 + (f(x_0) - E[\hat{f}(x_0)])^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]
\end{aligned}$$

Irreducible error $= \sigma^2$; squared bias $= 0$; variance $= \sigma^2(p/N)$.
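A hedged Monte Carlo sketch of this decomposition on synthetic data (all names and constants are ours): refitting on fresh noise realizations should give an empirical bias near 0 and a variance near $\sigma^2\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 200, 5, 0.5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)
x0 = np.concatenate([[1.0], rng.normal(size=p)])
f_x0 = x0 @ beta                                  # true f(x0)

# Refit on many fresh noise realizations, X held fixed
preds = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(x0 @ beta_hat)
preds = np.array(preds)

bias = f_x0 - preds.mean()                        # should be near 0
variance = preds.var()                            # near the theory value below
theory = sigma**2 * (x0 @ np.linalg.solve(X.T @ X, x0))
```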
5
Gauss-Markov Theorem
Gauss-Markov Theorem: the least squares estimate has the minimum variance among all linear unbiased estimators.

Interpretation:

The estimator found by least squares is linear in $\mathbf{y}$:
$$\hat{f}(x_0) = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = c_0^T\mathbf{y}$$

We have noticed that this estimator is unbiased, i.e., $f(x_0) = E[\hat{f}(x_0)]$.

If we find any other estimator $g(x_0)$ of $f(x_0)$ that is also linear in $\mathbf{y}$, i.e., $g(x_0) = c^T\mathbf{y}$, and unbiased, i.e., $f(x_0) = E[g(x_0)]$, then
$$\mathrm{Var}[\hat{f}(x_0)] \le \mathrm{Var}[g(x_0)].$$
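A numerical sketch of the theorem on synthetic data (the construction of the competing estimator is our illustration): any linear unbiased estimator $c^T\mathbf{y}$ must satisfy $\mathbf{X}^Tc = x_0$, so a competitor can be built by adding a component orthogonal to the column space of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 50, 3, 1.0
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
x0 = np.concatenate([[1.0], rng.normal(size=p)])

# Least squares weights: f_hat(x0) = c0^T y
c0 = X @ np.linalg.solve(X.T @ X, x0)

# Any linear unbiased estimator c^T y needs X^T c = x0, so build a
# competitor by adding a vector orthogonal to the columns of X
d = rng.normal(size=N)
d -= X @ np.linalg.solve(X.T @ X, X.T @ d)   # project out col space of X
c = c0 + d

# With uncorrelated noise of variance sigma^2, Var(c^T y) = sigma^2 ||c||^2
var_ls, var_other = sigma**2 * (c0 @ c0), sigma**2 * (c @ c)
assert var_other >= var_ls                   # Gauss-Markov in action
```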
Question: Is the LS the best estimator for the given linear additive model?
6
Subset Selection
• The LS solution often has large variance (recall that the variance is proportional to the number of inputs p, i.e., the model complexity)
• If we decrease the number of input variables p, we can decrease the variance; however, we then sacrifice the zero bias
• If this trade-off decreases the test error, the solution can be accepted
• This reasoning leads to subset selection, i.e., selecting a subset of the p inputs for the regression computation
• Subset selection has another advantage: an easy and focused interpretation of the influence of the input variables on the output
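As a sketch only (exhaustive search is exponential in p; the validation-set criterion and all names are our choices, not from the slides):

```python
import numpy as np
from itertools import combinations

def best_subset(X_train, y_train, X_val, y_val):
    """Exhaustive subset search: always keep the intercept (column 0) and
    pick the input subset with the lowest validation error."""
    p = X_train.shape[1] - 1
    best_cols, best_err = [0], np.inf
    for k in range(p + 1):
        for subset in combinations(range(1, p + 1), k):
            cols = [0, *subset]
            beta, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            err = np.mean((y_val - X_val[:, cols] @ beta) ** 2)
            if err < best_err:
                best_cols, best_err = cols, err
    return best_cols, best_err
```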
7
Subset Selection…
We fit the model
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j\hat{\beta}_j$$

Can we determine which $\beta_j$'s are insignificant?

Yes, we can, by statistical hypothesis testing! However, we need a model assumption:
$$Y = \beta_0 + \sum_{j=1}^{p} X_j\beta_j + \varepsilon$$
where $\varepsilon$ is zero-mean Gaussian with standard deviation $\sigma$.
8
Subset Selection: Statistical Significance Test
The linear model with additive Gaussian noise has the following property (Ex.: show this):
$$\hat{\beta} \sim N\big(\beta,\, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\big)$$

So we can form a standardized coefficient, or Z-score, for each coefficient:
$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}}$$

where
$$\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
and $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$.

The hypothesis testing principle says that a coefficient with a large Z-score should be retained, while one with a small Z-score should be discarded. How large or small depends on the significance level.
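A small NumPy sketch of this test (illustrative; X is the design matrix with the intercept column included):

```python
import numpy as np

def z_scores(X, y):
    """Z-score for each coefficient of a least squares fit.

    X: N x (p+1) design matrix with the intercept column included.
    """
    N, p_plus_1 = X.shape
    p = p_plus_1 - 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat2 = resid @ resid / (N - p - 1)   # unbiased estimate of sigma^2
    v = np.diag(XtX_inv)                       # v_j: jth diagonal element
    return beta_hat / np.sqrt(sigma_hat2 * v)
```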
9
Case Study: Prostate Cancer
Output = log prostate-specific antigen
Input = (log cancer volume, log prostate weight, age, log of benign prostatic hyperplasia, seminal vesicle invasion, log of capsular penetration, Gleason score, % of Gleason scores 4 or 5)
Goal: (1) predict the output given a novel input; (2) interpret the influence of the inputs on the output
10
Case Study…
[Scatter plot of all pairs of variables in the prostate cancer data]

From the scatter plot it is hard to interpret which inputs are the most influential. Also, we want to find out how the inputs jointly influence the output.
11
Subset Selection on Prostate Cancer Data
Term Coefficient Std. Error Z-score
Intercept 2.48 0.09 27.66
Lcavol 0.68 0.13 5.37
Lweight 0.30 0.11 2.75
Age -0.14 0.10 -1.40
Lbph 0.21 0.10 2.06
Svi 0.31 0.12 2.47
Lcp -0.29 0.15 -1.87
Gleason -0.02 0.15 -0.15
Pgg45 0.27 0.15 1.74
Scores with magnitude greater than 2 indicate variables significant at approximately the 5% significance level
12
Coefficient Shrinkage: Ridge Regression Method
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\Big\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \Big\}$$

where $\lambda \ge 0$ is a non-negative penalty parameter. The solution is
$$\hat{\beta}^{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

One computational advantage is that the matrix $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible (for $\lambda > 0$).

If the L2 norm penalty is replaced by an L1 norm penalty, the corresponding regression is called LASSO (see [HTF]).
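A minimal sketch of the ridge solution (assuming, as is conventional, that the intercept is left unpenalized by centering the data first; names are illustrative):

```python
import numpy as np

def ridge(Xc, yc, lam):
    """Ridge fit: beta = (X^T X + lam I)^{-1} X^T y.

    Following the usual convention that the intercept is not penalized,
    Xc holds the p centered inputs (no column of ones) and yc is the
    centered output; the intercept is then the mean of y.
    """
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Illustrative usage
rng = np.random.default_rng(2)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=N)
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_ridge = ridge(Xc, yc, lam=1.0)
intercept = y.mean()
```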
13
Ridge Regression…
[Figure: ridge coefficient profiles as the penalty λ decreases]

One way to determine λ is cross-validation; we'll learn about it later.
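As a preview, a rough sketch of K-fold cross-validation for choosing λ (assumes the centered-data convention from the previous sketch; all names are ours):

```python
import numpy as np

def ridge_fit(Xc, yc, lam):
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

def cv_error(X, y, lam, K=5, seed=0):
    """Mean squared prediction error of ridge, estimated by K-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errs))

# Pick the lambda with the smallest CV error over a grid, e.g.:
#   lambdas = np.logspace(-3, 3, 13)
#   best_lam = min(lambdas, key=lambda lam: cv_error(Xc, yc, lam))
```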