
Linear Methods for Regression


Page 1: Linear Methods for Regression


Linear Methods for Regression

Lecture Notes for CMPUT 466/551

Nilanjan Ray

Page 2: Linear Methods for Regression


Assumption: Linear Regression Function

Model assumption: Output $Y$ is linear in the inputs $X = (X_1, X_2, X_3, \ldots, X_p)$.

Predict the output by:
$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j = X^T \hat\beta$$
(vector notation, with the constant 1 included in $X$), where, over the $N$ training cases,
$$
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix},
\qquad
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots \\
1 & x_{N1} & \cdots & x_{Np}
\end{pmatrix}
$$

Also known as multiple regression when $p > 1$.
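A minimal NumPy sketch of this convention, assuming the inputs are stored as an $N \times p$ array (the function names are illustrative, not from the slides):

```python
import numpy as np

# Build the N x (p+1) design matrix with a leading column of ones,
# so that beta[0] plays the role of the intercept.
def design_matrix(X_raw):
    """X_raw: (N, p) array of inputs; returns the (N, p+1) matrix X."""
    N = X_raw.shape[0]
    return np.hstack([np.ones((N, 1)), X_raw])

def predict(X_raw, beta):
    """Prediction Y_hat = X beta, with the intercept folded into beta."""
    return design_matrix(X_raw) @ beta
```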

Page 3: Linear Methods for Regression


Least Squares Solution

Residual sum of squares:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$
(the term inside the outer parentheses is the residual).

In matrix-vector notation:
$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$$

Vector differentiation:
$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$$

Solution:
$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
known as the least squares solution.

For a new input $x_0 = (x_{01}, x_{02}, \ldots, x_{0p})$ (with the leading 1 included, as before), the regression output is
$$\hat{Y}(x_0) = x_0^T\hat\beta = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
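A minimal NumPy sketch of the least squares solution and the prediction at a new input; in practice one solves the normal equations (or uses a QR factorization) rather than forming an explicit inverse. Names are illustrative:

```python
import numpy as np

# Least squares solution beta_hat = (X^T X)^{-1} X^T y,
# computed by solving the normal equations X^T X beta = X^T y.
def least_squares(X, y):
    """X: (N, p+1) design matrix with a leading 1s column; y: (N,) outputs."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict_new(x0, beta_hat):
    """Regression output x0^T beta_hat for a new input x0 = (1, x01, ..., x0p)."""
    return x0 @ beta_hat
```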

Page 4: Linear Methods for Regression


Bias-Variance Decomposition

Model: $Y = f(X) + \varepsilon$, where $f(X) = X^T\beta$ and the noise $\varepsilon$ has zero expectation, the same variance $\sigma^2$ for every observation, and is uncorrelated across observations.

Estimator:
$$\hat{y}(x_0) = \hat f(x_0) = x_0^T\hat\beta = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Bias:
$$
\begin{aligned}
f(x_0) - E[\hat f(x_0)]
&= f(x_0) - E\big[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\big] \\
&= f(x_0) - E\big[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \boldsymbol\varepsilon)\big] \\
&= f(x_0) - x_0^T\beta - E\big[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol\varepsilon\big] \\
&= 0
\end{aligned}
$$
Unbiased estimator! Ex. Show the last step.

Variance:
$$
\begin{aligned}
E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big]
&= E\big[(\hat f(x_0) - f(x_0))^2\big] \\
&= E\big[(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} - x_0^T\beta)^2\big] \\
&= E\big[(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol\varepsilon)^2\big] \\
&= \sigma^2\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0 \;\approx\; \sigma^2\,(p/N)
\end{aligned}
$$
(the last approximation holds on average over the training inputs).

Decomposition of EPE (expected prediction error at $x_0$):
$$
\begin{aligned}
\mathrm{EPE}(x_0) &= E\big[(y_0 - \hat f(x_0))^2\big] \\
&= E\big[(y_0 - f(x_0))^2\big] + \big(f(x_0) - E[\hat f(x_0)]\big)^2 + E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big]
\end{aligned}
$$
Irreducible error $= \sigma^2$; squared bias $= 0$ (linear model); variance $= \sigma^2(p/N)$.
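A small Monte Carlo sketch illustrating this decomposition, assuming a fixed design matrix, standard-normal inputs, and Gaussian noise (all of these choices are illustrative):

```python
import numpy as np

# Check that the least squares prediction at a fixed x0 is (essentially) unbiased
# and that its variance is close to sigma^2 * x0^T (X^T X)^{-1} x0,
# which is roughly sigma^2 * p / N on average over the training inputs.
rng = np.random.default_rng(0)
N, p, sigma = 200, 5, 1.0

X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, p))])  # fixed design
beta = rng.standard_normal(p + 1)                              # true coefficients
x0 = np.concatenate([[1.0], rng.standard_normal(p)])           # query point

preds = []
for _ in range(5000):
    y = X @ beta + sigma * rng.standard_normal(N)              # fresh noise each run
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(x0 @ beta_hat)

preds = np.array(preds)
print("bias     :", preds.mean() - x0 @ beta)                  # ~ 0
print("variance :", preds.var())
print("theory   :", sigma**2 * x0 @ np.linalg.solve(X.T @ X, x0))
```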

Page 5: Linear Methods for Regression


Gauss-Markov Theorem

Gauss-Markov Theorem: the least squares estimate has the minimum variance among all linear unbiased estimators.

Interpretation:

The estimator found by least squares is linear in $\mathbf{y}$:
$$\hat f(x_0) = x_0^T\hat\beta = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = c_0^T\mathbf{y}$$

We have noticed that this estimator is unbiased, i.e.,
$$f(x_0) = E[\hat f(x_0)]$$

If we find any other unbiased estimator $g(x_0)$ of $f(x_0)$ that is linear in $\mathbf{y}$ too, i.e.,
$$g(x_0) = c^T\mathbf{y} \quad\text{and}\quad f(x_0) = E[g(x_0)],$$
then
$$\mathrm{Var}[\hat f(x_0)] \le \mathrm{Var}[g(x_0)].$$

Question: Is the LS the best estimator for the given linear additive model?
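A numeric sketch of this interpretation, assuming a Gaussian design matrix (all names are illustrative): any linear unbiased estimator $c^T\mathbf{y}$ of $f(x_0)$ must satisfy $\mathbf{X}^T c = x_0$, so it differs from the least squares weights $c_0$ by a vector in the null space of $\mathbf{X}^T$, and its variance $\sigma^2\lVert c\rVert^2$ can only be larger.

```python
import numpy as np

# Compare the variance of the least squares weights c0 with that of another
# linear unbiased estimator c = c0 + d, where X^T d = 0.
rng = np.random.default_rng(1)
N, p, sigma = 50, 3, 1.0
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, p))])
x0 = np.concatenate([[1.0], rng.standard_normal(p)])

c0 = X @ np.linalg.solve(X.T @ X, x0)        # least squares weights: f_hat(x0) = c0^T y
d = rng.standard_normal(N)
d -= X @ np.linalg.solve(X.T @ X, X.T @ d)   # project d onto the null space of X^T
c = c0 + d                                   # another linear unbiased estimator

print("Var[LS]    :", sigma**2 * c0 @ c0)
print("Var[other] :", sigma**2 * c @ c)      # never smaller than Var[LS]
```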

Page 6: Linear Methods for Regression


Subset Selection

• LS solution often has large variance (remember that variance is proportional to the number of inputs p, i.e., model complexity)

• If we decrease the number of input variables p, we can decrease the variance; however, we then sacrifice the zero bias (the estimate becomes biased)

• If this trade-off decreases test error, the solution can be accepted

• This reasoning leads to subset selection, i.e., select a subset from the p inputs for the regression computation

• Subset selection has another advantage: easy and focused interpretation of the influence of the input variables on the output (a small sketch follows below)
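A rough sketch of exhaustive subset selection with a held-out validation set, assuming a least squares fit on each candidate subset (function and variable names are illustrative):

```python
import numpy as np
from itertools import combinations

def fit(X, y):
    """Least squares coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def best_subset(X_tr, y_tr, X_va, y_va):
    """X_* are raw (N, p) input matrices; an intercept column is added here.
    Returns (validation MSE, chosen input indices) for the best subset."""
    p = X_tr.shape[1]
    best = (np.inf, ())
    for k in range(p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            A_tr = np.hstack([np.ones((len(y_tr), 1)), X_tr[:, cols]])
            A_va = np.hstack([np.ones((len(y_va), 1)), X_va[:, cols]])
            beta = fit(A_tr, y_tr)
            err = np.mean((y_va - A_va @ beta) ** 2)
            best = min(best, (err, subset))
    return best
```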

Page 7: Linear Methods for Regression


Subset Selection…

$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$$

Can we determine which $\beta_j$'s are insignificant?

Yes, we can, by statistical hypothesis testing!

However, we need a model assumption:
$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon$$
where $\varepsilon$ is zero-mean Gaussian with standard deviation $\sigma$.

Page 8: Linear Methods for Regression


Subset Selection: Statistical Significance Test

The linear model with additive Gaussian noise has the following property:
$$\hat\beta \sim N\big(\beta,\ (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\big)$$
Ex. Show this.

So we can form a standardized coefficient, or Z-score, test for each coefficient:
$$z_j = \frac{\hat\beta_j}{\hat\sigma\sqrt{v_j}}$$
where
$$\hat\sigma^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$$
and $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$.

The hypothesis-testing principle says that a coefficient with a large Z-score magnitude should be retained, while a coefficient with a small Z-score should be discarded.

How large or small depends on the significance level.
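A short NumPy sketch of the Z-score computation, assuming the design matrix already contains the leading column of ones (names are illustrative):

```python
import numpy as np

# Z-scores z_j = beta_j / (sigma_hat * sqrt(v_j)), with
# sigma_hat^2 = RSS / (N - p - 1) and v_j = [(X^T X)^{-1}]_{jj}.
def z_scores(X, y):
    """X: (N, p+1) design matrix with a leading 1s column; y: (N,) outputs."""
    N, q = X.shape                              # q = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - q)        # N - p - 1 degrees of freedom
    v = np.diag(XtX_inv)
    return beta_hat / np.sqrt(sigma2_hat * v)
```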

Page 9: Linear Methods for Regression


Case Study: Prostate Cancer

Output = log prostate-specific antigen

Input = (log cancer volume, log prostate weight, age, log of benign prostatic hyperplasia, seminal vesicle invasion, log of capsular penetration, Gleason score, % of Gleason scores 4 or 5)

Goal: (1) predict the output given a novel input; (2) interpret the influence of the inputs on the output

Page 10: Linear Methods for Regression


Case Study…

Scatter plot

It is hard to interpret which inputs are the most influential.

Also, we want to find out how the inputs jointly influence the output.

Page 11: Linear Methods for Regression


Subset Selection on Prostate Cancer Data

Term        Coefficient   Std. Error   Z-score
Intercept         2.48         0.09      27.66
Lcavol            0.68         0.13       5.37
Lweight           0.30         0.11       2.75
Age              -0.14         0.10      -1.40
Lbph              0.21         0.10       2.06
Svi               0.31         0.12       2.47
Lcp              -0.29         0.15      -1.87
Gleason          -0.02         0.15      -0.15
Pgg45             0.27         0.15       1.74

Z-scores with magnitude greater than 2 indicate variables that are significant at approximately the 5% significance level.

Page 12: Linear Methods for Regression


Coefficient Shrinkage: Ridge Regression Method

$$\hat\beta^{\mathrm{ridge}} = \operatorname*{argmin}_{\beta}\Big\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}$$
where $\lambda \ge 0$ is a non-negative penalty parameter.

The solution is
$$\hat\beta^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

One computational advantage is that the matrix $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible (for $\lambda > 0$).

If the L2-norm penalty is replaced by an L1-norm penalty, the corresponding regression is called the LASSO (see [HTF]).
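A minimal NumPy sketch of the ridge solution, assuming centered inputs so that no intercept column is included and nothing unpenalized needs special handling (a common convention, and an assumption here):

```python
import numpy as np

# Ridge solution beta_hat = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    """X: (N, p) centered inputs; y: (N,) centered outputs; lam: penalty lambda."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```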

Page 13: Linear Methods for Regression


Ridge Regression…

[Figure: ridge coefficient profiles plotted against decreasing λ]

One way to determine λ is cross-validation; we will learn about it later.
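A rough K-fold cross-validation sketch for choosing λ, assuming ridge fits as in the sketch above (names and the fold scheme are illustrative):

```python
import numpy as np

# Pick the lambda with the smallest average held-out squared error.
def cv_ridge(X, y, lambdas, K=5, seed=0):
    """X: (N, p) centered inputs; y: (N,) outputs; lambdas: candidate penalties."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    scores = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            va = folds[k]
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                                   X[tr].T @ y[tr])
            errs.append(np.mean((y[va] - X[va] @ beta) ** 2))
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]
```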