Classical regression methods in R
Goals
• Learn basic regression techniques in R
• Less flexible than self-built models fit by Maximum Likelihood or Bayes, but:
  – Very quick and easy (may as well try 'em)
  – Widely used (good to know about)
  – Adequate in many applications (don't overdo it)
References
• The basics: Practical Regression and Anova using R
  – Freely available: cran.r-project.org/doc/contrib/Faraway-PRA.pdf
  – Accompanied by the R library faraway, containing functions and example data
  – Covers all the standard methods, including everything in these slides
• Going further: Extending the Linear Model with R
  – I think you have to buy this one
  – Covers GLM, mixed models, etc.
• Both are by Julian Faraway
  – More info at: http://www.maths.bath.ac.uk/~jjf23/
• Examples shown here are in 'Regression in R.r' on the IB509 website
First example dataset
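The examples below use the teengamb dataset (teen gambling survey data) from the faraway package, which appears again later in these slides; a minimal sketch of loading it:

# install.packages("faraway")   # once, if not already installed
library(faraway)   # companion package to Faraway's PRA
data(teengamb)     # variables: sex, status, income, verbal, gamble
head(teengamb)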
Basic syntax: lm()
y ~ x   OR   y ~ x + 1   both code   y = f(x) = mx + b
Specifying model formula
Algebraically                              Coded in R
y = f(x) = b                               y ~ 1
y = f(x) = mx                              y ~ x - 1
y = f(x) = β0 + β1x1 + β2x2 + β3x2²        y ~ x1 + x2 + I(x2^2)

• A "1" represents the intercept
• It will be assumed if you omit it
• But you can specify no intercept [i.e. f(0) = 0]
• Include as many terms as you want, including derived variables [using I()] (see the sketch below)
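As a quick demonstration of these formula variants, here is a sketch on simulated data (the names x and y are invented for illustration):

x = runif(50)        # invented predictor
y = 2*x + rnorm(50)  # invented response
lm(y ~ 1)            # mean-only: f(x) = b
lm(y ~ x - 1)        # no intercept: f(x) = mx
lm(y ~ x)            # same as y ~ x + 1: f(x) = mx + b
lm(y ~ x + I(x^2))   # derived (quadratic) term via I()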
Specifying model formula Example 1: Mean-only
[Plot: gamble vs. Index]
Specifying model formula Example 2: Slope + intercept
[Plot: gamble vs. income]
Specifying model formula Example 3: No intercept
[Plot: gamble vs. income, comparing the fits gamble ~ income and gamble ~ income - 1 (the latter forced through the origin)]
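In code, the three example fits above look like this (a sketch using the teengamb variables):

fit0 = lm(gamble ~ 1, data=teengamb)            # Example 1: mean-only
fit1 = lm(gamble ~ income, data=teengamb)       # Example 2: slope + intercept
fit2 = lm(gamble ~ income - 1, data=teengamb)   # Example 3: through the origin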
Specifying model formula Example 4: Multivariate regression
Specify a data frame
If your data are in a data frame, you can save yourself some typing:
x1 = dat$predict1
x2 = dat$predict2
y = dat$response
fit1 = lm(y ~ x1 + x2)

# That's the same as:
fit2 = lm(response ~ predict1 + predict2, data=dat)
Specifying model formula Example 5: The “data” argument
Omit data
Use "subset" to exclude certain rows or to model relevant subsets of the data:
# Exclude based on some numeric criterion:
fit = lm(y ~ x1 + x2, subset=z>0)
# (here, only observations with z>0 are included)

# Or filter by some categorical variable:
fit = lm(y ~ x1 + x2, subset=color=="red")
Using the outputs of lm()
If fit is the output of an lm() call, then (a sketch follows this list):
• fit$coefficients (OR fit$coef) is the list of best-fit parameters
• fit$residuals (OR fit$resid, OR residuals(fit)) is a vector of residuals for each observation
• fit$fitted.values (OR fit$fitted) is a vector of the predicted y values at each observation x
  (y = fit$resid + fit$fitted)
• fit$df.residual (OR fit$df) is the residual degrees of freedom
• fit$rank is the model rank (parameter degrees of freedom)
  (n = # of observations = fit$df + fit$rank)
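A sketch of pulling these pieces out of a fit (gamble ~ income is an assumed example model):

fit = lm(gamble ~ income, data=teengamb)
fit$coef                     # best-fit intercept and slope
head(fit$resid)              # residuals, one per observation
head(fit$fitted)             # predicted y at each observation
fit$df.residual + fit$rank   # = n, the number of observations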
Using the outputs of lm()
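The diagnostic figure below comes from calling plot() on the fitted model; a minimal sketch:

par(mfrow=c(2,2))   # arrange the four diagnostic panels in a 2x2 grid
plot(fit)           # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage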
[Figure: the four plot(fit) diagnostic panels — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 24, 36, and 39 are flagged]
Model Inference and Interpretation
Inference: Significance and Hypothesis testing
• Inference is based on Sums of Squares:
  – For M0 and MA with p0 and pA parameters (corresponding to df0 and dfA residual degrees of freedom), respectively:

    F = [(RSS0 − RSSA) / (df0 − dfA)] / [RSSA / dfA]
      = [(RSS0 − RSSA) / (pA − p0)] / [RSSA / (n − pA)]

    is F-distributed with (df0 − dfA) and dfA degrees of freedom
• To test whether the full model is significant: MA = full model, M0 = mean only (see the anova() sketch below)
• To test an individual parameter β*: MA = full model, M0 = model with β* fixed at 0
  [This F-test is equivalent to the t-test reported by summary() in R]
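In R, this nested-model F-test is carried out by anova() on two fits; a sketch using the teengamb data:

m0 = lm(gamble ~ 1, data=teengamb)        # null model: mean only
mA = lm(gamble ~ income, data=teengamb)   # alternative: full model
anova(m0, mA)                             # F-test of the improvement in RSS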
Some pseudodata…
Inference Example 1: Testing the full model
Inference Example 2: Testing individual predictors
Explanatory power
• Given a significant model, R2 describes how well it explains the data
  – R2 = "proportion of variance explained" by the model
  – Almost ubiquitous in classical modeling
  – Generally, not the same as [correlation coefficient]2
    • (Although they are equivalent in simple linear regression)
• "Adjusted R2" penalizes for the number of parameters
  – Sort of like AIC
• Mallows' Cp is an estimate of prediction error
  – Also reflects the tradeoff of good fit vs. overfit
Explanatory power Example
Residuals:
     Min       1Q   Median       3Q      Max
-1.27356 -0.21581 -0.07422  0.19709  1.33962

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0378     0.5753   1.804  0.11425
x1           -0.2663     0.1042  -2.555  0.03782 *
x2            1.2676     0.2611   4.854  0.00185 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7817 on 7 degrees of freedom
Multiple R-squared: 0.9023,  Adjusted R-squared: 0.8744
F-statistic: 32.33 on 2 and 7 DF,  p-value: 0.0002913
t-tests for individual parameter significance
F-test for overall model significance
R2 and adjusted R2 indicating explanatory power of the model
Model Diagnosis and Assumption-checking
New example data
Leverage
• Leverage measures how influential a residual is
• Leverage hi is based on the distance of observation i from the mean x-value:

  hi = 1/n + (xi − x̄)² / Σj(xj − x̄)²

• High leverage is not necessarily a problem, but indicates observations to "keep an eye on"
• Leverages sum to p (the # of parameters in the model)
  – So, average leverage = p/n
• As a rule of thumb, look out for leverage > 2p/n (see the sketch below)
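Leverages can be extracted from a fit with hatvalues(); the 2p/n rule of thumb is sketched here:

h = hatvalues(fit)   # leverage h_i for each observation
p = fit$rank         # number of parameters
n = length(h)
which(h > 2*p/n)     # observations to "keep an eye on"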
Leverage: Example
[Plot: Leverages vs. Index]
Outliers
• Outliers are observations that are unlikely to fit the same model as the majority of the data
• One test is based on the "studentized" residual of each observation, given a model fit to all other obs:

  ti = ε̂i / (σ̂(i) √(1 − hi))

  where ε̂i is the residual of data point i, σ̂(i) is the estimated s.d. of the residuals with the ith obs. excluded, and hi is the leverage of the ith point
• These ti are t-distributed with n − p − 1 d.f.
• So, you can test for outliers with a "simple" t-test (see the sketch below)
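rstudent() returns these studentized (jackknife) residuals directly; the test below is a sketch (the Bonferroni correction for testing every observation is an addition not shown in the slides):

t = rstudent(fit)             # studentized residuals
n = length(t); p = fit$rank
pvals = 2 * pt(abs(t), df = n - p - 1, lower.tail = FALSE)   # two-sided t-test
which(p.adjust(pvals, "bonferroni") < 0.05)                  # flagged outliers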
Outliers: Example
[Plot: Jackknife Residuals vs. Index]
Influence and Cook’s Distance
• An observation is influential if it has a large effect on the regression results
  – This comes from the combination of a large residual and high leverage
• Cook's D: a statistic to measure influence:

  Di = (ri² / p) · (hi / (1 − hi))

  where ri is the ith (standardized) residual and hi is the ith leverage
• Criteria for testing D vary
• Definitely check the model fit with and without the observations with highest D (see the sketch below)
Influence and Cook’s Distance Example
[Plot: Cook's Distance vs. Index; Utah and West Virginia stand out]
Homoskedasticity
• Check visually with a plot of residuals vs. fitted values
• Diagnostic checks include regression of (scaled) residuals on the original covariates
• Could fix by data transformation or additional covariates
From Faraway's Practical Regression and Anova using R, §7.5 "Residual Plots":
…you can make. If all is well, you should see constant variance in the vertical (ε̂) direction and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). In Figure 7.5, these three cases are illustrated.
[Figure 7.5: Residuals vs Fitted plots — the first ("No problem") suggests no change to the current model, while the second ("Heteroscedasticity") shows non-constant variance and the third ("Nonlinear") indicates some nonlinearity which should prompt some change in the structural form of the model]
You should also plot ε̂ against xi (for predictors that are both in and out of the model). Look for the same things, except in the case of plots against predictors not in the model, look for any relationship which might indicate that this predictor should be included.
We illustrate this using the savings dataset as an example again:
> g <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
First the residuals vs. fitted plot and the abs(residuals) vs. fitted plot.
> plot(g$fit,g$res,xlab="Fitted",ylab="Residuals")
> abline(h=0)
> plot(g$fit,abs(g$res),xlab="Fitted",ylab="|Residuals|")
The plots may be seen in the first two panels of Figure 7.5. What do you see? The latter plot is designed to check for non-constant variance only. It folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance. The first plot is still needed because non-linearity must be checked.
A quick way to check non-constant variance is this regression:
> summary(lm(abs(g$res) ~ g$fit))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.840      1.186    4.08  0.00017
g$fit         -0.203      0.119   -1.72  0.09250
Homoskedasticity Example
[Plots: Residuals vs. Fitted values, two panels]
Homoskedasticity Example
?
Normal error
• Usually assessed visually, by Q-Q plots, boxplots, histograms
• This takes practice
• There are plenty of "tests for normality," but the p-values they give don't translate directly into action
• When residuals are non-normal:
  – Parameter estimates are usually still OK
  – Parameter CIs are more suspect, but may still be OK
  – Higher residual skew and lower sample size both increase concern
Normal error Example
[Figure: Normal Q-Q plots — (Our data) shown alongside (random standard normal samples)]
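One way to build this kind of comparison panel is to plot the residuals' Q-Q plot next to Q-Q plots of rnorm() draws of the same size (a sketch):

par(mfrow=c(2,5))   # one panel for our data, nine for random draws
qqnorm(residuals(fit), main="Our data")
for (i in 1:9) qqnorm(rnorm(nobs(fit)), main="Random normal")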
Model Building (aka variable selection)
Basic Forward selection / Backward elimination
• Idea: Use p-values of individual parameters to include or exclude them from the model
  – Forward Selection: Sequentially add the most significant parameters
  – Backward Elimination: Start with all parameters, sequentially remove the least significant (see the sketch after this list)
• Parameter significance is judged by t-test
  – (against the null hypothesis of parameter = 0)
  – Lowest p-value = "most significant"
• Goal: Include as many significant parameters as possible.
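A sketch of backward elimination using drop1(), which reports the F-test for dropping each term (removing status first is assumed here for illustration):

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
drop1(fit, test="F")                # p-value for dropping each term
fit = update(fit, . ~ . - status)   # remove the least significant term
drop1(fit, test="F")                # repeat until all remaining terms are significant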
Backward elimination Example
Forward Selection Example
Model fit metrics
• Various statistics have been proposed to describe the fit of classical regression models
  – Adjusted R2: Based on R2, but with a penalty for model size
  – Mallows' Cp: An estimate of prediction error
    • Approximates the tradeoff for overfitting the data
  – AIC: Can be used for classical problems too!
• Finding the best model:
  – Fitting all possible models may be feasible (see the sketch below)
    • (Classical regression is fast)
  – Otherwise, use a search algorithm
    • Generalized concept of forward selection / backward elimination
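When an exhaustive search is feasible, the leaps package fits the best model of each size (a sketch, assuming leaps is installed):

library(leaps)
all = regsubsets(gamble ~ sex + status + income + verbal, data=teengamb)
s = summary(all)
s$adjr2   # adjusted R2 of the best model at each size
s$cp      # Mallows' Cp for the same models
s$which   # which variables each of those models includes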
Model fit metrics Example 1: Fit by Adjusted R2
Our original formula was:
fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
So we know that variables 1, 3, and 4 are sex, income, and verbal
Model fit metrics Example 2: Fit by Mallows’ Cp
The expected value of Cp is ~p, so only consider models that fall below the 1:1 line in a Cp vs. p plot. Among those, you could favor the fewest parameters and/or the lowest Cp.

[Plot: Cp vs. p, with candidate models labeled by the variables they include (13, 134, 123, 1234) and the 1:1 line]
Model fit metrics Example 3: Fit by AIC
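AIC-based selection is built into step(), which generalizes forward selection / backward elimination; a minimal sketch:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
step(fit)   # backward AIC search by default; prints each candidate drop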
Beyond lm()
• Most other "standard" regression models are covered by glm(), which operates similarly (see the sketch after this list)
• There are also a lot of other regression-esque modeling methods, e.g.:
  – Regression trees, neural networks, splines and local regressions, etc.
• R has libraries for these and many other advanced methods
• But remember:
  – It's easy to write down (and code) a likelihood function for almost any model.
  – Then you can:
    • Solve by maximum likelihood (simpler cases), or
    • Solve by Bayesian MCMC (complex, hierarchical, prior info., etc.)
  – This is often easier and simpler than learning the nuances of some new canned function.
  – It also puts you in complete control of your model (for better or worse)
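For instance, glm() mirrors the lm() syntax with an added family argument (a sketch; y01, x1, x2, and dat are invented names):

fit = glm(y01 ~ x1 + x2, family=binomial, data=dat)   # logistic regression
summary(fit)   # same formula interface and accessors as lm()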
Recommended