Classical regression methods in R



Page 1

Classical regression methods in R

Page 2

Goals

• Learn basic regression techniques in R
• Less flexible than self-built models fit by Maximum Likelihood or Bayes, but:
  – Very quick and easy (may as well try 'em)
  – Widely used (good to know about)
  – Adequate in many applications (don't overdo it)

Page 3

References

• The basics: Practical Regression and Anova using R
  – Freely available: cran.r-project.org/doc/contrib/Faraway-PRA.pdf
  – Accompanied by the R library faraway, containing functions and example data
  – Covers all the standard methods, including everything in these slides
• Going further: Extending the Linear Model with R
  – I think you have to buy this one
  – Covers GLMs, mixed models, etc.
• Both are by Julian Faraway
  – More info at: http://www.maths.bath.ac.uk/~jjf23/
• Examples shown here are in 'Regression in R.r' on the IB509 website

Page 4

First example dataset
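The dataset itself isn't preserved in this transcript; later slides fit models to the teengamb data from the faraway package, so presumably it was loaded along these lines:

library(faraway)   # install.packages("faraway") if needed
data(teengamb)     # teen gambling survey: sex, status, income, verbal, gamble
head(teengamb)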

Page 5

Basic syntax: lm()
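The slide's example isn't preserved; a minimal sketch of the call, using the teengamb data:

fit = lm(gamble ~ income, data=teengamb)   # response ~ predictor(s)
summary(fit)                               # coefficient table, R-squared, F-test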

Page 6

Specifying model formula

Algebraically                               Coded in R
y = f(x) = mx + b                           y ~ x   OR   y ~ x + 1
y = f(x) = b                                y ~ 1
y = f(x) = mx                               y ~ x - 1
y = f(x) = β0 + β1x1 + β2x2 + β3x2²         y ~ x1 + x2 + I(x2^2)

• A "1" represents the intercept; it will be assumed if you omit it
• But you can specify no intercept [i.e. f(0) = 0]
• Include as many terms as you want, including derived variables [using I()]

Page 7

Specifying model formula
Example 1: Mean-only

[Plot: gamble vs. Index]
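A sketch of the mean-only fit:

fit = lm(gamble ~ 1, data=teengamb)   # intercept only
coef(fit)                             # equals mean(teengamb$gamble)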

Page 8

Specifying model formula
Example 2: Slope + intercept

[Plot: gamble vs. income, with the fitted regression line]
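A sketch reproducing a plot like this one:

fit = lm(gamble ~ income, data=teengamb)
plot(gamble ~ income, data=teengamb)
abline(fit)   # overlay the fitted line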

Page 9

Specifying model formula
Example 3: No intercept

[Plot: gamble vs. income, comparing the fits of gamble ~ income and gamble ~ income - 1 (the latter forced through the origin)]
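A sketch of the no-intercept fit:

fit0 = lm(gamble ~ income - 1, data=teengamb)   # forces f(0) = 0
coef(fit0)                                      # slope only, no intercept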

Page 10

Specifying model formula
Example 4: Multivariate regression
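The slide's example isn't preserved; presumably a call of this form (the full formula appears on a later slide):

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
summary(fit)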

Page 11

Specify a data frame

If your data are in a data frame, you can save yourself some typing:

x1 = dat$predict1
x2 = dat$predict2
y = dat$response
fit1 = lm(y ~ x1 + x2)

# That's the same as:
fit2 = lm(response ~ predict1 + predict2, data=dat)

Page 12

Specifying model formula
Example 5: The "data" argument

Page 13

Omit data

Use "subset" to exclude certain rows or to model relevant subsets of the data:

# Exclude based on some numeric criterion:
fit = lm(y ~ x1 + x2, subset=z>0)
# (here, only observations with z>0 are included)

# Or filter by some categorical variable:
fit = lm(y ~ x1 + x2, subset=color=="red")

Page 14

Using the outputs of lm()

If fit is the output of an lm() call, then:

fit$coefficients (OR fit$coef)
  is the list of best-fit parameters

fit$residuals (OR fit$resid, OR residuals(fit))
  is a vector of residuals for each observation

fit$fitted.values (OR fit$fitted)
  is a vector of the predicted y values at each observation x
  (y = fit$resid + fit$fitted)

fit$df.residual (OR fit$df)
  is the residual degrees of freedom

fit$rank
  is the model rank (parameter degrees of freedom)
  (n = # of observations = fit$df + fit$rank)
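A quick sketch exercising these components:

fit = lm(gamble ~ income, data=teengamb)
fit$coef                     # best-fit parameters
head(fit$resid)              # residuals
head(fit$fitted)             # fitted values
fit$df.residual + fit$rank   # = n, the number of observations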

Page 15

Using the outputs of lm()

Page 16

Using the outputs of lm()

[The four standard lm diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours at 0.5 and 1); observations such as 24, 36, and 39 are flagged]
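These four panels are exactly what R draws when you plot a fitted lm object:

par(mfrow=c(2,2))   # 2x2 grid for the four diagnostic panels
plot(fit)           # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage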

Page 17

Model Inference and Interpretation

Page 18

Inference: Significance and Hypothesis testing

• Inference based on Sum-of-Squares:
  – For M0 and MA with p0 and pA parameters (corresponding to df0 and dfA residual degrees of freedom), respectively:

    F = [(RSS0 - RSSA) / (df0 - dfA)] / [RSSA / dfA]
      = [(RSS0 - RSSA) / (pA - p0)] / [RSSA / (n - pA)]

    is F-distributed with (df0 - dfA) and dfA degrees of freedom

• To test whether the full model is significant: MA = full model, M0 = mean only
• To test an individual parameter β*: MA = full model, M0 = model with β* fixed at 0

[This F-test is equivalent to the t-test reported by summary() in R]
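In R, this nested-model F-test is anova(); a sketch with illustrative formulas:

m0 = lm(gamble ~ 1, data=teengamb)              # M0: mean only
mA = lm(gamble ~ sex + income, data=teengamb)   # MA: a fuller model
anova(m0, mA)   # F with (df0 - dfA) and dfA degrees of freedom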

Page 19

Some pseudodata…
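The generating code isn't preserved; a hypothetical sketch that would produce pseudodata of the shape used on the next slides (two predictors, ten observations; the "true" coefficients here are invented):

set.seed(1)
x1 = runif(10, 0, 10)                            # hypothetical predictor
x2 = runif(10, 0, 5)                             # hypothetical predictor
y  = 1 - 0.25*x1 + 1.25*x2 + rnorm(10, sd=0.8)   # invented true model + noise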

Page 20

Inference
Example 1: Testing the full model
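A sketch: the overall test appears on the last line of summary():

fit = lm(y ~ x1 + x2)   # using the pseudodata above
summary(fit)            # "F-statistic: ... DF, p-value: ..." tests against mean-only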

Page 21

Inference
Example 2: Testing individual predictors

Page 22

Inference
Example 2: Testing individual predictors
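Two equivalent routes to the individual-parameter test (a sketch):

fit  = lm(y ~ x1 + x2)
summary(fit)                     # t-test for each coefficient
fit0 = update(fit, . ~ . - x2)   # M0: the model with x2's coefficient fixed at 0
anova(fit0, fit)                 # F-test; its p-value matches the t-test above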

Page 23

Explanatory power

• Given a significant model, R² describes how well it explains the data
  – R² = "proportion of variance explained" by the model
  – Almost ubiquitous in classical modeling
  – Generally not the same as [correlation coefficient]²
    • (although they are equivalent in simple linear regression)
• "Adjusted R²" penalizes for the number of parameters
  – Sort of like AIC
• Mallows' Cp is an estimate of prediction error
  – Also reflects the tradeoff of good fit vs. overfit

Page 24

Explanatory power
Example

Residuals:
     Min       1Q   Median       3Q      Max 
-1.27356 -0.21581 -0.07422  0.19709  1.33962 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.0378     0.5753   1.804  0.11425   
x1           -0.2663     0.1042  -2.555  0.03782 * 
x2            1.2676     0.2611   4.854  0.00185 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7817 on 7 degrees of freedom
Multiple R-squared: 0.9023,  Adjusted R-squared: 0.8744
F-statistic: 32.33 on 2 and 7 DF,  p-value: 0.0002913

• t-tests for individual parameter significance (the Coefficients table)
• F-test for overall model significance (the last line)
• R² and adjusted R² indicating the explanatory power of the model

Page 25

Model Diagnosis and Assumption-checking

Page 26

New example data

Page 27

Leverage

• Leverage measures how influential a residual is
• Leverage hi is based on the distance of observation i from the mean x-value:

    hi = 1/n + (xi - x̄)² / Σj (xj - x̄)²

• High leverage is not necessarily a problem, but it indicates observations to "keep an eye on"
• Leverages sum to p (# parameters in the model)
  – So, average leverage = p/n
• As a rule of thumb, look out for leverage > 2p/n
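In R, the hi come from hatvalues(); a sketch of the rule of thumb (assuming some fitted lm object called fit):

h = hatvalues(fit)   # leverages
p = fit$rank
n = length(h)
which(h > 2*p/n)     # observations to keep an eye on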

Page 28

Leverage: Example

[Plot: Leverages vs. Index]

Page 29

Outliers

• Outliers are observations that are unlikely to fit the same model as the majority of the data
• One test is based on the "studentized" residual of each observation, given a model fit to all other observations:

    ti = ε̂i / (σ̂(i) √(1 - hi))

  where ε̂i is the residual of data point i, σ̂(i) is the estimated s.d. of the residuals with the ith observation excluded, and hi is the leverage of the ith point

• These ti are t-distributed with n - p - 1 d.f.
• So, you can test for outliers with a "simple" t-test
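R computes these as rstudent(); a sketch of the test (fit is some lm object):

t_i  = rstudent(fit)                                 # studentized (jackknife) residuals
pval = 2 * pt(-abs(t_i), df = fit$df.residual - 1)   # two-sided, n - p - 1 d.f.
which(pval < 0.05)   # candidate outliers (consider a Bonferroni correction)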

Page 30

Outliers: Example

[Plot: Jackknife residuals vs. Index]

Page 31

Influence and Cook's Distance

• An observation is influential if it has a large effect on the regression results
  – This comes from the combination of large residual and high leverage
• Cook's D: a statistic to measure influence:

    Di = (ri² / p) · hi / (1 - hi)

  where ri is the ith residual and hi is the ith leverage

• Criteria for testing D vary
• Definitely check model fit with and without the observations with highest D
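A sketch using the built-in cooks.distance() (fit is some lm object fitted with a data= argument):

D = cooks.distance(fit)
head(sort(D, decreasing=TRUE))              # observations with the largest D
fit2 = update(fit, subset = -which.max(D))  # refit without the most influential one
coef(fit); coef(fit2)                       # compare the two fits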

Page 32

Influence and Cook's Distance: Example

[Plot: Cook's Distance vs. Index; Utah and West Virginia stand out]

Page 33

Homoskedasticity

• Check visually with a plot of residuals vs. fitted values
• Diagnostic checks include regression of (scaled) residuals on the original covariates
• Could fix by data transformation or additional covariates

From section 7.5 (Residual Plots) of Faraway's Practical Regression and Anova using R:

…you can make. If all is well, you should see constant variance in the vertical (ε̂) direction and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). In Figure 7.5, these three cases are illustrated.

[Figure 7.5: three Residual vs. Fitted panels, titled "No problem", "Heteroscedasticity", and "Nonlinear"]

Figure 7.5: Residuals vs Fitted plots. The first suggests no change to the current model, while the second shows non-constant variance and the third indicates some nonlinearity, which should prompt some change in the structural form of the model.

You should also plot ε̂ against xi (for predictors that are both in and out of the model). Look for the same things, except in the case of plots against predictors not in the model: look for any relationship which might indicate that this predictor should be included.

We illustrate this using the savings dataset as an example again:

> g <- lm(sr ~ pop15+pop75+dpi+ddpi, savings)

First the residuals vs. fitted plot and the abs(residuals) vs. fitted plot.

> plot(g$fit, g$res, xlab="Fitted", ylab="Residuals")
> abline(h=0)
> plot(g$fit, abs(g$res), xlab="Fitted", ylab="|Residuals|")

The plots may be seen in the first two panels of Figure 7.5. What do you see? The latter plot is designed to check for non-constant variance only. It folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance. The first plot is still needed because non-linearity must be checked.

A quick way to check non-constant variance is this regression:

> summary(lm(abs(g$res) ~ g$fit))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.840      1.186    4.08  0.00017
g$fit         -0.203      0.119   -1.72  0.09250

Page 34

Homoskedasticity: Example

[Plot: Residuals vs. Fitted]

Page 35

Homoskedasticity: Example

[The same Residuals vs. Fitted plot, annotated with a "?"]

Page 36

Normal error

• Usually assessed visually, by Q-Q plots, boxplots, histograms
• This takes practice
• There are plenty of "tests for normality," but the p-values they give don't translate directly into action
• When residuals are non-normal:
  – Parameter estimates are usually still OK
  – Parameter CIs are more suspect, but may still be OK
  – Higher residual skew and lower sample size both increase concern
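A sketch of the visual check:

qqnorm(residuals(fit))   # sample vs. theoretical normal quantiles
qqline(residuals(fit))   # reference line through the quartiles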

Page 37

Normal error: Example

[Nine Normal Q-Q plots (Sample Quantiles vs. Theoretical Quantiles): one of our data's residuals, shown alongside random standard normal samples for comparison by eye]

Page 38

Model Building (aka variable selection)

Page 39

Basic Forward selection / Backward elimination

• Idea: Use p-values of individual parameters to include or exclude them from the model
  – Forward Selection: Sequentially add the most significant parameters
  – Backward Elimination: Start with all parameters, sequentially remove the least significant
• Parameter significance judged by t-test
  – (against the null hypothesis of parameter = 0)
  – Lowest p-value = "most significant"
• Goal: Include as many significant parameters as possible

Page 40

Backward elimination: Example
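The slide's output isn't preserved; a sketch of one elimination step on the teengamb model used in later slides:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
summary(fit)                        # find the parameter with the largest p-value
fit = update(fit, . ~ . - status)   # drop it (status, if it is least significant) and repeat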

Page 41

Forward Selection: Example
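Again the output isn't preserved; a sketch of forward steps via add1():

fit = lm(gamble ~ 1, data=teengamb)                     # start from the mean-only model
add1(fit, ~ sex + status + income + verbal, test="F")   # which single term helps most?
fit = update(fit, . ~ . + income)                       # add it (income, say) and repeat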

Page 42

Forward Selection: Example

Page 43

Forward Selection: Example

Page 44

Model fit metrics

• Various statistics have been proposed to describe the fit of classical regression models:
  – Adjusted R²: Based on R², but with a penalty for model size
  – Mallows' Cp: An estimate of prediction error
    • Approximates the tradeoff of good fit vs. overfitting the data
  – AIC: Can be used for classical problems too!
• Finding the best model:
  – Fitting all possible models may be feasible
    • (Classical regression is fast)
  – Otherwise, use a search algorithm
    • Generalized concept of forward selection / backward elimination

Page 45

Model fit metrics
Example 1: Fit by Adjusted R²

Our original formula was:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)

So we know that variables 1, 3, and 4 are sex, income, and verbal.
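A sketch of one way to get adjusted R² across all subsets, using the leaps package (the slides' exact call isn't shown):

library(leaps)   # install.packages("leaps") if needed
subs = regsubsets(gamble ~ sex + status + income + verbal, data=teengamb)
summary(subs)$which    # which variables are in the best model of each size
summary(subs)$adjr2    # their adjusted R^2 values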

Page 46

Model fit metrics
Example 2: Fit by Mallows' Cp

The expected value of Cp is ~p, so only consider models that fall below the 1:1 line in a Cp vs. p plot. Among those, you could favor fewest parameters and/or lowest Cp.

[Plot: Cp vs. p, with candidate models labeled by the variables they include (e.g. "13", "134", "123", "1234")]

Page 47

Model fit metrics
Example 3: Fit by AIC
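R's step() performs this search by AIC (backward from the supplied fit by default); a sketch:

fit  = lm(gamble ~ sex + status + income + verbal, data=teengamb)
best = step(fit)    # prints the AIC at each candidate step
summary(best)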

Page 48

Beyond lm()

• Most other "standard" regression models are covered by glm(), which operates similarly
• There are also a lot of other regression-esque modeling methods, e.g.:
  – Regression trees, neural networks, splines and local regressions, etc.
• R has libraries for these and many other advanced methods
• But remember:
  – It's easy to write down (and code) a likelihood function for almost any model
  – Then you can:
    • Solve by maximum likelihood (simpler cases), or
    • Solve by Bayesian MCMC (complex, hierarchical, prior info., etc.)
  – This is often easier and simpler than learning the nuances of some new canned function
  – It also puts you in complete control of your model (for better or worse)
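A sketch of the glm() analogue mentioned above: logistic regression with a hypothetical binary response z (z, x1, and x2 are placeholders, not from the slides):

fit = glm(z ~ x1 + x2, family=binomial)   # same formula interface as lm()
summary(fit)                              # analogous coefficient table and tests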