
Classical regression methods in R

Goals

• Learn basic regression techniques in R
• Less flexible than self-built models fit by Maximum Likelihood or Bayes, but:
  – Very quick and easy (may as well try 'em)
  – Widely used (good to know about)
  – Adequate in many applications (don't overdo it)

References

• The basics: Practical Regression and Anova using R
  – Freely available: cran.r-project.org/doc/contrib/Faraway-PRA.pdf
  – Accompanied by the R library faraway, containing functions and example data
  – Covers all the standard methods, including everything in these slides
• Going further: Extending the Linear Model With R
  – I think you have to buy this one
  – Covers GLM, mixed models, etc.
• Both are by Julian Faraway
  – More info at: http://www.maths.bath.ac.uk/~jjf23/
• Examples shown here are in 'Regression in R.r' on the IB509 website

First example dataset (the teengamb data from the faraway package, used in the examples that follow)

Basic syntax: lm()

Specifying model formula

Algebraically                                  Coded in R
y = f(x) = mx + b                              y ~ x   OR   y ~ x + 1
y = f(x) = b                                   y ~ 1
y = f(x) = mx                                  y ~ x - 1
y = f(x) = β0 + β1x1 + β2x2 + β3x2^2           y ~ x1 + x2 + I(x2^2)

• A "1" represents the intercept; it will be assumed if you omit it
• But you can specify no intercept [i.e. f(0) = 0]
• Include as many terms as you want, including derived variables [using I()]
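A minimal sketch of these four variants in code, assuming a data frame dat with columns y, x, x1, and x2:

# Slope and intercept (the "+ 1" is implicit)
fit1 = lm(y ~ x, data=dat)

# Intercept only
fit2 = lm(y ~ 1, data=dat)

# Slope only, forced through the origin
fit3 = lm(y ~ x - 1, data=dat)

# Two predictors plus a derived quadratic term
fit4 = lm(y ~ x1 + x2 + I(x2^2), data=dat)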

Specifying model formula, Example 1: Mean-only

[Plot: gamble vs. observation index]
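The fit itself lives in 'Regression in R.r'; a minimal equivalent, assuming the teengamb data from the faraway package (the dataset used in the later examples):

library(faraway)
data(teengamb)
# Mean-only model: a single intercept and no predictors
fit0 = lm(gamble ~ 1, data=teengamb)
coef(fit0)   # the intercept is just mean(teengamb$gamble)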

Specifying model formula, Example 2: Slope + intercept

[Plot: gamble vs. income]

Specifying model formula, Example 3: No intercept

[Plot: gamble vs. income, comparing gamble ~ income with gamble ~ income - 1 (forced through the origin)]
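A sketch of the two fits being contrasted here (again assuming teengamb):

# With an intercept
fitA = lm(gamble ~ income, data=teengamb)
# Without an intercept: the line is forced through the origin
fitB = lm(gamble ~ income - 1, data=teengamb)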

Specifying model formula, Example 4: Multivariate regression

Specify a data frame (Example 5: the "data" argument)

If your data are in a data frame, you can save yourself some typing:

x1 = dat$predict1
x2 = dat$predict2
y = dat$response
fit1 = lm(y ~ x1 + x2)

# That's the same as:
fit2 = lm(response ~ predict1 + predict2, data=dat)

Omitting data

Use "subset" to exclude certain rows or to model relevant subsets of the data:

# Exclude based on some numeric criterion:
fit = lm(y ~ x1 + x2, subset=z>0)
# (here, only observations with z>0 are included)

# Or filter by some categorical variable:
fit = lm(y ~ x1 + x2, subset=color=="red")

Using the outputs of lm()

If fit is the output of an lm() call, then:

fit$coefficients (or fit$coef) is the list of best-fit parameters

fit$residuals (or fit$resid, or residuals(fit)) is a vector of residuals for each observation

fit$fitted.values (or fit$fitted) is a vector of the predicted y values at each observation x
  (y = fit$resid + fit$fitted)

fit$df.residual (or fit$df) is the residual degrees of freedom

fit$rank is the model rank (parameter degrees of freedom)
  (n = # of observations = fit$df + fit$rank)
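For example, a quick sketch of pulling these pieces out of a (hypothetical) fit of gamble on income:

fit = lm(gamble ~ income, data=teengamb)
coef(fit)                    # same as fit$coefficients
head(fit$resid)              # residuals for each observation
head(fit$fitted)             # predicted values; fit$fitted + fit$resid reproduces gamble
fit$df.residual + fit$rank   # equals the number of observations, nrow(teengamb)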

Using the outputs of lm()

[Diagnostic plots for the fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 24, 36, and 39 are flagged]
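These four panels are exactly what R's built-in plot method for lm objects draws; a minimal sketch, assuming a fitted model fit:

par(mfrow=c(2,2))   # 2 x 2 grid of panels
plot(fit)           # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage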

Model Inference and Interpretation

Inference: Significance and Hypothesis testing

• Inference based on Sums of Squares:
  – For M0 and MA with p0 and pA parameters (corresponding to df0 and dfA degrees of freedom), respectively:

    F = [(RSS0 - RSSA) / (df0 - dfA)] / [RSSA / dfA]
      = [(RSS0 - RSSA) / (pA - p0)] / [RSSA / (n - pA)]

    is F-distributed with (df0 - dfA) and dfA degrees of freedom.

• To test whether the full model is significant: MA = full model, M0 = mean only
• To test an individual parameter β*: MA = full model, M0 = model with β* fixed at 0
  [This F-test is equivalent to the t-test reported by summary() in R]

Some pseudodata…

Inference Example 1: Testing the full model
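The worked example is in 'Regression in R.r'; a sketch of the same comparison using the pseudodata's y, x1, and x2:

fitA = lm(y ~ x1 + x2)   # full model
fit0 = lm(y ~ 1)         # mean-only model
anova(fit0, fitA)        # F-test; the same F-statistic that summary(fitA) reports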

Inference Example 2: Testing individual predictors
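And for an individual parameter, compare the full model against the model with that term removed:

fit_no_x1 = update(fitA, . ~ . - x1)
anova(fit_no_x1, fitA)   # equivalent to the t-test on x1 reported by summary(fitA)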

Explanatory power

• Given a significant model, R2 describes how well it explains the data
  – R2 = "proportion of variance explained" by the model
  – Almost ubiquitous in classical modeling
  – Generally not the same as [correlation coefficient]^2
    • (Although they are equivalent in simple linear regression)
• "Adjusted R2" penalizes for the number of parameters
  – Sort of like AIC
• Mallows' Cp is an estimate of prediction error
  – Also reflects the tradeoff of good fit vs. overfit

Explanatory power: Example

Residuals:
     Min       1Q   Median       3Q      Max
-1.27356 -0.21581 -0.07422  0.19709  1.33962

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0378     0.5753   1.804  0.11425
x1           -0.2663     0.1042  -2.555  0.03782 *
x2            1.2676     0.2611   4.854  0.00185 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7817 on 7 degrees of freedom
Multiple R-squared: 0.9023,  Adjusted R-squared: 0.8744
F-statistic: 32.33 on 2 and 7 DF,  p-value: 0.0002913

t-tests for individual parameter significance (the Coefficients table)

F-test for overall model significance (the last line)

R2 and adjusted R2 indicating the explanatory power of the model

Model Diagnosis and Assumption-checking

New example data

Leverage

• Leverage measures how influential a residual is
• Leverage hi is based on the distance of observation i from the mean x-value:

    hi = 1/n + (xi - x̄)^2 / Σj (xj - x̄)^2

• High leverage is not necessarily a problem, but indicates observations to "keep an eye on"
• Leverages sum to p (# of parameters in the model)
  – So, average leverage = p/n
• As a rule of thumb, look out for leverage > 2p/n

Leverage: Example

[Plot: leverage for each observation vs. index]
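Leverages can be pulled from a fit with hatvalues(); a minimal sketch, assuming a fitted model fit:

h = hatvalues(fit)
p = fit$rank                # number of parameters
n = length(h)
plot(h, ylab="Leverages")
abline(h = 2*p/n)           # rule-of-thumb threshold: leverage > 2p/n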

Outliers

• Outliers are observations that are unlikely to fit the same model as the majority of the data
• One test is based on the "studentized" residual of each observation, given a model fit to all other observations:

    ti = ε̂i / (σ̂(i) √(1 - hi))

  where ε̂i is the residual of data point i, σ̂(i) is the estimated s.d. of the residuals with the ith observation excluded, and hi is the leverage of the ith point.

• These ti are t-distributed with n - p - 1 d.f.
• So, you can test for outliers with a "simple" t-test

Outliers: Example

[Plot: jackknife (studentized) residuals vs. index]
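These jackknife (studentized) residuals are what rstudent() returns; a sketch of the corresponding t-test, with a Bonferroni correction for testing all n observations (assuming a fitted model fit):

jack = rstudent(fit)
n = length(jack)
p = fit$rank
max(abs(jack))                       # most extreme studentized residual
qt(1 - 0.05/(2*n), df = n - p - 1)   # Bonferroni-corrected critical value to compare against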

Influence and Cook's Distance

• An observation is influential if it has a large effect on the regression results
  – This comes from the combination of a large residual and high leverage
• Cook's D: a statistic to measure influence:

    Di = (ri^2 / p) · hi / (1 - hi)

  where ri is the ith residual, hi is the ith leverage, and p is the number of parameters.

• Criteria for testing D vary
• Definitely check model fit with and without the observations with the highest D

Influence and Cook's Distance: Example

[Plot: Cook's distance for each observation vs. index; Utah and West Virginia stand out]
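Cook's distances come straight from cooks.distance(); a minimal sketch, assuming a fitted model fit:

d = cooks.distance(fit)
plot(d, ylab="Cook's distance")
which.max(d)                                 # the most influential observation
fit2 = update(fit, subset = -which.max(d))   # refit without it and compare
coef(fit)
coef(fit2)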

Homoskedasticity

• Check visually with a plot of residuals vs. fitted values
• Diagnostic checks include regression of (scaled) residuals on the original covariates
• Could fix by data transformation or additional covariates

(Excerpt from Faraway, Practical Regression and Anova using R, section 7.5, "Residual Plots":)

…you can make. If all is well, you should see constant variance in the vertical (ε̂) direction and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). In Figure 7.5, these three cases are illustrated.

[Figure 7.5: three Residuals vs Fitted panels, titled "No problem", "Heteroscedasticity", and "Nonlinear". The first suggests no change to the current model, while the second shows non-constant variance and the third indicates some nonlinearity which should prompt some change in the structural form of the model.]

You should also plot ε̂ against xi (for predictors that are both in and out of the model). Look for the same things, except in the case of plots against predictors not in the model, look for any relationship which might indicate that this predictor should be included.

We illustrate this using the savings dataset as an example again:

> g <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)

First the residuals vs. fitted plot and the abs(residuals) vs. fitted plot.

> plot(g$fit,g$res,xlab="Fitted",ylab="Residuals")
> abline(h=0)
> plot(g$fit,abs(g$res),xlab="Fitted",ylab="|Residuals|")

The plots may be seen in the first two panels of Figure 7.5. What do you see? The latter plot is designed to check for non-constant variance only. It folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance. The first plot is still needed because non-linearity must be checked.

A quick way to check non-constant variance is this regression:

> summary(lm(abs(g$res) ~ g$fit))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.840      1.186     4.08  0.00017
g$fit         -0.203      0.119    -1.72  0.09250

Homoskedasticity: Example

[Plots: residuals vs. fitted values for the example fit]

?

Normal error

• Usually assessed visually, by Q-Q plots, boxplots, histograms
• This takes practice
• There are plenty of "tests for normality," but the p-values they give don't translate directly into action
• When residuals are non-normal:
  – Parameter estimates are usually still OK
  – Parameter CIs are more suspect, but may still be OK
  – Higher residual skew and lower sample size both increase concern

Normal error: Example

[Gallery of Normal Q-Q plots: one panel shows the residuals from our data, the rest are random standard normal samples for comparison]
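A sketch of how such a gallery can be drawn, assuming a fitted model fit:

r = residuals(fit)
par(mfrow=c(2,5))
qqnorm(r); qqline(r)   # our residuals
for (i in 1:9) {       # nine panels of random standard normal samples for comparison
  z = rnorm(length(r))
  qqnorm(z); qqline(z)
}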

Model Building (aka variable selection)

Basic Forward selection / Backward elimination

• Idea: Use p-values of individual parameters to include or exclude them from the model
  – Forward Selection: Sequentially add the most significant parameters
  – Backward Elimination: Start with all parameters, sequentially remove the least significant
• Parameter significance judged by t-test
  – (against the null hypothesis of parameter = 0)
  – Lowest p-value = "most significant"
• Goal: Include as many significant parameters as possible.

Backward elimination: Example
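The worked example is in 'Regression in R.r'; a minimal sketch of backward elimination by hand, assuming the full teengamb model used later in these slides:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
summary(fit)                        # find the term with the largest p-value
fit = update(fit, . ~ . - status)   # drop it (here, say, status) and refit
summary(fit)                        # repeat until every remaining term is significant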

Forward Selection: Example
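Similarly, a sketch of forward selection by hand, starting from the mean-only model (the order of entry here is illustrative, not a result from the slides):

fit0 = lm(gamble ~ 1, data=teengamb)
add1(fit0, scope = ~ sex + status + income + verbal, test="F")   # try each predictor alone
fit1 = update(fit0, . ~ . + income)                              # add the most significant
add1(fit1, scope = ~ sex + status + income + verbal, test="F")   # and repeat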

Model fit metrics

• Various statistics have been proposed to describe the fit of classical regression models
  – Adjusted R2: Based on R2, but with a penalty for model size
  – Mallows' Cp: An estimate of prediction error
    • Approximates the tradeoff of good fit vs. overfitting the data
  – AIC: Can be used for classical problems too!
• Finding the best model:
  – Fitting all possible models may be feasible
    • (Classical regression is fast)
  – Otherwise, use a search algorithm
    • A generalized concept of forward selection / backward elimination

Model fit metrics, Example 1: Fit by Adjusted R2

Our original formula was:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)

So we know that variables 1, 3, and 4 are sex, income, and verbal.
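A sketch of how such an adjusted-R2 search can be run with the leaps package (which labels candidate models by the indices of the included variables, as in the Cp plot below); fit here is assumed to be the full teengamb model above:

library(leaps)
X = model.matrix(fit)[ , -1]                # predictor matrix without the intercept column
a = leaps(X, teengamb$gamble, method="adjr2")
a$which[which.max(a$adjr2), ]               # predictor combination with the highest adjusted R2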

Model fit metrics, Example 2: Fit by Mallows' Cp

The expected value of Cp is ~p, so only consider models that fall below the 1:1 line in a Cp vs. p plot.

Among those, you could favor the fewest parameters and/or the lowest Cp.

[Plot: Cp vs. p for candidate models labeled 13, 134, 123, and 1234, with a 1:1 reference line]

Model fit metrics, Example 3: Fit by AIC
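AIC-based selection is built into R's step() function; a minimal sketch, again starting from the full teengamb model:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
step(fit)   # stepwise search that minimizes AIC; prints each step and returns the chosen model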

Beyond lm()

• Most other "standard" regression models are covered by glm(), which operates similarly
• There are also a lot of other regression-esque modeling methods, e.g.:
  – Regression trees, neural networks, splines and local regressions, etc.
• R has libraries for these and many other advanced methods
• But remember:
  – It's easy to write down (and code) a likelihood function for almost any model.
  – Then you can:
    • Solve by maximum likelihood (simpler cases), or
    • Solve by Bayesian MCMC (complex, hierarchical, prior info., etc.)
  – This is often easier and simpler than learning the nuances of some new canned function.
  – It also puts you in complete control of your model (for better or worse)
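For instance, a hedged sketch of the glm() analogue of lm(), here a hypothetical logistic regression (the response "won" and data frame dat are invented for illustration):

# Binary response modeled with a logit link
fit = glm(won ~ income + verbal, family=binomial, data=dat)
summary(fit)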