1
Linear Methods for Regression
Lecture Notes for CMPUT 466/551
Nilanjan Ray
2
Assumption: Linear Regression Function
Model assumption: Output $Y$ is linear in the inputs $X = (X_1, X_2, X_3, \ldots, X_p)$.

Predict the output by:
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j = X^T \hat{\beta}$$
(vector notation, with the constant 1 included in $X$).

Where, over the $N$ training cases,
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{bmatrix}$$

Also known as multiple regression when $p > 1$.
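As a concrete illustration, here is a minimal NumPy sketch of building the design matrix above and forming predictions (the function names are ours, not from the slides):

```python
import numpy as np

def design_matrix(inputs):
    """Build the N x (p+1) design matrix: each row is (1, x_i1, ..., x_ip)."""
    N = inputs.shape[0]
    return np.hstack([np.ones((N, 1)), inputs])

def predict(X, beta):
    """Row-wise prediction y_hat_i = x_i^T beta, with the 1 included in x_i."""
    return X @ beta
```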
3
Least Squares Solution
Residual sum of squares:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$

In matrix-vector notation:
$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$$
where $\mathbf{y} - \mathbf{X}\beta$ is the residual.

Vector differentiation:
$$\frac{\partial\,\mathrm{RSS}(\beta)}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$$

Solution (known as the least squares solution):
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

For a new input $x_0 = (1, x_{01}, x_{02}, \ldots, x_{0p})^T$, the regression output is
$$\hat{Y}(x_0) = x_0^T\hat{\beta} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
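A minimal NumPy sketch of this computation (names are illustrative; solving the normal equations with np.linalg.solve is preferred to forming the inverse explicitly):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X^T X) beta = X^T y.

    X: N x (p+1) design matrix whose first column is all ones.
    y: length-N output vector.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta_hat = least_squares(X, y)
x0 = np.array([1.0, 0.2, -0.1, 0.5])   # new input, 1 included
y0_hat = x0 @ beta_hat                 # regression output for x0
```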
4
Bias-Variance Decomposition
Model: $Y = f(X) + \varepsilon$, where $f(X) = X^T\beta$ and the noise $\varepsilon$ has zero expectation, the same variance $\sigma^2$, and is uncorrelated across training cases.

Estimator:
$$\hat{y}(x_0) = \hat{f}(x_0) = x_0^T\hat{\beta} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Bias:
$$\begin{aligned}
f(x_0) - E[\hat{f}(x_0)] &= x_0^T\beta - E[x_0^T\hat{\beta}] \\
&= x_0^T\beta - E[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \boldsymbol{\varepsilon})] \\
&= x_0^T\beta - x_0^T\beta - E[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}] \\
&= -E[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}] = 0
\end{aligned}$$

Unbiased estimator! (Ex.: show the last step.)
Variance:
$$\begin{aligned}
E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] &= E[(\hat{f}(x_0) - f(x_0))^2] \\
&= E[(x_0^T\hat{\beta} - x_0^T\beta)^2] \\
&= E[(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon})^2] \\
&= \sigma^2\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0 \\
&\approx \sigma^2 (p/N)
\end{aligned}$$

(The last step holds on average, when test inputs $x_0$ are drawn like the training inputs.)
Decomposition of EPE:
$$\begin{aligned}
EPE(x_0) &= E[(y_0 - \hat{y}(x_0))^2] \\
&= \sigma^2 + (f(x_0) - E[\hat{f}(x_0)])^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]
\end{aligned}$$

Irreducible error $= \sigma^2$; squared bias $= 0$; variance $= \sigma^2(p/N)$.
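A hedged Monte Carlo sketch of this decomposition on synthetic data (all names and constants are ours): refitting on fresh noise realizations should give an empirical bias near 0 and a variance near $\sigma^2\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 200, 5, 0.5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)
x0 = np.concatenate([[1.0], rng.normal(size=p)])
f_x0 = x0 @ beta                                  # true f(x0)

# Refit on many fresh noise realizations, X held fixed
preds = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(x0 @ beta_hat)
preds = np.array(preds)

bias = f_x0 - preds.mean()                        # should be near 0
variance = preds.var()                            # near the theory value below
theory = sigma**2 * (x0 @ np.linalg.solve(X.T @ X, x0))
```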
5
Gauss-Markov Theorem
Gauss-Markov Theorem: the least squares estimate has the minimum variance among all linear unbiased estimators.

Interpretation:

The estimator found by least squares is linear in $\mathbf{y}$:
$$\hat{f}(x_0) = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = c_0^T\mathbf{y}$$

We have noticed that this estimator is unbiased, i.e., $f(x_0) = E[\hat{f}(x_0)]$.

If we find any other estimator $g(x_0)$ of $f(x_0)$ that is also linear in $\mathbf{y}$, i.e., $g(x_0) = c^T\mathbf{y}$, and unbiased, i.e., $f(x_0) = E[g(x_0)]$, then
$$\mathrm{Var}[\hat{f}(x_0)] \le \mathrm{Var}[g(x_0)].$$
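A numerical sketch of the theorem on synthetic data (the construction of the competing estimator is our illustration): any linear unbiased estimator $c^T\mathbf{y}$ must satisfy $\mathbf{X}^Tc = x_0$, so a competitor can be built by adding a component orthogonal to the column space of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 50, 3, 1.0
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
x0 = np.concatenate([[1.0], rng.normal(size=p)])

# Least squares weights: f_hat(x0) = c0^T y
c0 = X @ np.linalg.solve(X.T @ X, x0)

# Any linear unbiased estimator c^T y needs X^T c = x0, so build a
# competitor by adding a vector orthogonal to the columns of X
d = rng.normal(size=N)
d -= X @ np.linalg.solve(X.T @ X, X.T @ d)   # project out col space of X
c = c0 + d

# With uncorrelated noise of variance sigma^2, Var(c^T y) = sigma^2 ||c||^2
var_ls, var_other = sigma**2 * (c0 @ c0), sigma**2 * (c @ c)
assert var_other >= var_ls                   # Gauss-Markov in action
```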
Question: Is the LS the best estimator for the given linear additive model?
6
Subset Selection
• The LS solution often has large variance (recall that the variance is proportional to the number of inputs p, i.e., the model complexity)
• If we decrease the number of input variables p, we can decrease the variance; however, we then sacrifice the zero bias
• If this trade-off decreases the test error, the solution can be accepted
• This reasoning leads to subset selection, i.e., selecting a subset of the p inputs for the regression computation
• Subset selection has another advantage: an easy and focused interpretation of the influence of the input variables on the output
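As a sketch only (exhaustive search is exponential in p; the validation-set criterion and all names are our choices, not from the slides):

```python
import numpy as np
from itertools import combinations

def best_subset(X_train, y_train, X_val, y_val):
    """Exhaustive subset search: always keep the intercept (column 0) and
    pick the input subset with the lowest validation error."""
    p = X_train.shape[1] - 1
    best_cols, best_err = [0], np.inf
    for k in range(p + 1):
        for subset in combinations(range(1, p + 1), k):
            cols = [0, *subset]
            beta, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            err = np.mean((y_val - X_val[:, cols] @ beta) ** 2)
            if err < best_err:
                best_cols, best_err = cols, err
    return best_cols, best_err
```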
7
Subset Selection…
We fit the model
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j\hat{\beta}_j$$

Can we determine which $\beta_j$'s are insignificant?

Yes, we can, by statistical hypothesis testing! However, we need a model assumption:
$$Y = \beta_0 + \sum_{j=1}^{p} X_j\beta_j + \varepsilon$$
where $\varepsilon$ is zero-mean Gaussian with standard deviation $\sigma$.
8
Subset Selection: Statistical Significance Test
The linear model with additive Gaussian noise has the following property (Ex.: show this):
$$\hat{\beta} \sim N\big(\beta,\, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\big)$$

So we can form a standardized coefficient, or Z-score, for each coefficient:
$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}}$$

where
$$\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
and $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$.

The hypothesis testing principle says that a coefficient with a large Z-score should be retained, while one with a small Z-score should be discarded. How large or small depends on the significance level.
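A small NumPy sketch of this test (illustrative; X is the design matrix with the intercept column included):

```python
import numpy as np

def z_scores(X, y):
    """Z-score for each coefficient of a least squares fit.

    X: N x (p+1) design matrix with the intercept column included.
    """
    N, p_plus_1 = X.shape
    p = p_plus_1 - 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat2 = resid @ resid / (N - p - 1)   # unbiased estimate of sigma^2
    v = np.diag(XtX_inv)                       # v_j: jth diagonal element
    return beta_hat / np.sqrt(sigma_hat2 * v)
```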
9
Case Study: Prostate Cancer
Output = log prostate-specific antigen
Input = (log cancer volume, log prostate weight, age, log of benign prostatic hyperplasia, seminal vesicle invasion, log of capsular penetration, Gleason score, % of Gleason scores 4 or 5)
Goal: (1) predict the output given a novel input; (2) interpret the influence of the inputs on the output
10
Case Study…
[Scatter plot of all pairs of variables in the prostate cancer data]

From the scatter plot it is hard to interpret which inputs are the most influential. Also, we want to find out how the inputs jointly influence the output.
11
Subset Selection on Prostate Cancer Data
Term Coefficient Std. Error Z-score
Intercept 2.48 0.09 27.66
Lcavol 0.68 0.13 5.37
Lweight 0.30 0.11 2.75
Age -0.14 0.10 -1.40
Lbph 0.21 0.10 2.06
Svi 0.31 0.12 2.47
Lcp -0.29 0.15 -1.87
Gleason -0.02 0.15 -0.15
Pgg45 0.27 0.15 1.74
Scores with magnitude greater than 2 indicate variables significant at approximately the 5% significance level
12
Coefficient Shrinkage: Ridge Regression Method
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\Big\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \Big\}$$

where $\lambda \ge 0$ is a non-negative penalty parameter. The solution is
$$\hat{\beta}^{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

One computational advantage is that the matrix $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible (for $\lambda > 0$).

If the L2 norm penalty is replaced by an L1 norm penalty, the corresponding regression is called LASSO (see [HTF]).
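A minimal sketch of the ridge solution (assuming, as is conventional, that the intercept is left unpenalized by centering the data first; names are illustrative):

```python
import numpy as np

def ridge(Xc, yc, lam):
    """Ridge fit: beta = (X^T X + lam I)^{-1} X^T y.

    Following the usual convention that the intercept is not penalized,
    Xc holds the p centered inputs (no column of ones) and yc is the
    centered output; the intercept is then the mean of y.
    """
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Illustrative usage
rng = np.random.default_rng(2)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=N)
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_ridge = ridge(Xc, yc, lam=1.0)
intercept = y.mean()
```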
13
Ridge Regression…
[Figure: ridge coefficient profiles as the penalty λ decreases]

One way to determine λ is cross-validation; we'll learn about it later.
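As a preview, a rough sketch of K-fold cross-validation for choosing λ (assumes the centered-data convention from the previous sketch; all names are ours):

```python
import numpy as np

def ridge_fit(Xc, yc, lam):
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

def cv_error(X, y, lam, K=5, seed=0):
    """Mean squared prediction error of ridge, estimated by K-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errs))

# Pick the lambda with the smallest CV error over a grid, e.g.:
#   lambdas = np.logspace(-3, 3, 13)
#   best_lam = min(lambdas, key=lambda lam: cv_error(Xc, yc, lam))
```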