Classical regression methods in R
Goals
• Learn basic regression techniques in R
• Less flexible than self-built models fit by Maximum Likelihood or Bayes, but:
  – Very quick and easy (may as well try 'em)
  – Widely used (good to know about)
  – Adequate in many applications (don't overdo it)
References
• The basics: Practical Regression and Anova using R
  – Freely available: cran.r-project.org/doc/contrib/Faraway-PRA.pdf
  – Accompanied by the R library faraway, containing functions and example data
  – Covers all the standard methods, including everything in these slides
• Going further: Extending the Linear Model with R
  – I think you have to buy this one
  – Covers GLM, mixed models, etc.
• Both are by Julian Faraway
  – More info at: http://www.maths.bath.ac.uk/~jjf23/
• Examples shown here are in 'Regression in R.r' on the IB509 website
First example dataset
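The examples below use the teengamb dataset (teen gambling survey data) from the faraway package, which appears again later in these slides; a minimal sketch of loading it:

# install.packages("faraway")   # once, if not already installed
library(faraway)   # companion package to Faraway's PRA
data(teengamb)     # variables: sex, status, income, verbal, gamble
head(teengamb)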
Basic syntax: lm()
y ~ x   OR   y ~ x + 1   both code   y = f(x) = mx + b
Specifying model formula
Algebraically                              Coded in R
y = f(x) = b                               y ~ 1
y = f(x) = mx                              y ~ x - 1
y = f(x) = β0 + β1x1 + β2x2 + β3x2²        y ~ x1 + x2 + I(x2^2)

• A "1" represents the intercept
• It will be assumed if you omit it
• But you can specify no intercept [i.e. f(0) = 0]
• Include as many terms as you want, including derived variables [using I()] (see the sketch below)
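As a quick demonstration of these formula variants, here is a sketch on simulated data (the names x and y are invented for illustration):

x = runif(50)        # invented predictor
y = 2*x + rnorm(50)  # invented response
lm(y ~ 1)            # mean-only: f(x) = b
lm(y ~ x - 1)        # no intercept: f(x) = mx
lm(y ~ x)            # same as y ~ x + 1: f(x) = mx + b
lm(y ~ x + I(x^2))   # derived (quadratic) term via I()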
Specifying model formula Example 1: Mean-only
[Plot: gamble vs. Index]
Specifying model formula Example 2: Slope + intercept
[Plot: gamble vs. income]
Specifying model formula Example 3: No intercept
[Plot: gamble vs. income, comparing the fits gamble ~ income and gamble ~ income - 1 (the latter forced through the origin)]
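In code, the three example fits above look like this (a sketch using the teengamb variables):

fit0 = lm(gamble ~ 1, data=teengamb)            # Example 1: mean-only
fit1 = lm(gamble ~ income, data=teengamb)       # Example 2: slope + intercept
fit2 = lm(gamble ~ income - 1, data=teengamb)   # Example 3: through the origin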
Specifying model formula Example 4: Multivariate regression
Specify a data frame
If your data are in a data frame, you can save yourself some typing:
x1 = dat$predict1
x2 = dat$predict2
y = dat$response
fit1 = lm(y ~ x1 + x2)

# That's the same as:
fit2 = lm(response ~ predict1 + predict2, data=dat)
Specifying model formula Example 5: The “data” argument
Omit data
Use "subset" to exclude certain rows or to model relevant subsets of the data:
# Exclude based on some numeric criterion:
fit = lm(y ~ x1 + x2, subset=z>0)
# (here, only observations with z>0 are included)

# Or filter by some categorical variable:
fit = lm(y ~ x1 + x2, subset=color=="red")
Using the outputs of lm()
If fit is the output of an lm() call, then (a sketch follows this list):
• fit$coefficients (OR fit$coef) is the list of best-fit parameters
• fit$residuals (OR fit$resid, OR residuals(fit)) is a vector of residuals for each observation
• fit$fitted.values (OR fit$fitted) is a vector of the predicted y values at each observation x
  (y = fit$resid + fit$fitted)
• fit$df.residual (OR fit$df) is the residual degrees of freedom
• fit$rank is the model rank (parameter degrees of freedom)
  (n = # of observations = fit$df + fit$rank)
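A sketch of pulling these pieces out of a fit (gamble ~ income is an assumed example model):

fit = lm(gamble ~ income, data=teengamb)
fit$coef                     # best-fit intercept and slope
head(fit$resid)              # residuals, one per observation
head(fit$fitted)             # predicted y at each observation
fit$df.residual + fit$rank   # = n, the number of observations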
Using the outputs of lm()
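The diagnostic figure below comes from calling plot() on the fitted model; a minimal sketch:

par(mfrow=c(2,2))   # arrange the four diagnostic panels in a 2x2 grid
plot(fit)           # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage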
[Figure: the four plot(fit) diagnostic panels — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 24, 36, and 39 are flagged]
Model Inference and Interpretation
Inference: Significance and Hypothesis testing
• Inference is based on Sums of Squares:
  – For M0 and MA with p0 and pA parameters (corresponding to df0 and dfA residual degrees of freedom), respectively:

    F = [(RSS0 − RSSA) / (df0 − dfA)] / [RSSA / dfA]
      = [(RSS0 − RSSA) / (pA − p0)] / [RSSA / (n − pA)]

    is F-distributed with (df0 − dfA) and dfA degrees of freedom
• To test whether the full model is significant: MA = full model, M0 = mean only (see the anova() sketch below)
• To test an individual parameter β*: MA = full model, M0 = model with β* fixed at 0
  [This F-test is equivalent to the t-test reported by summary() in R]
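In R, this nested-model F-test is carried out by anova() on two fits; a sketch using the teengamb data:

m0 = lm(gamble ~ 1, data=teengamb)        # null model: mean only
mA = lm(gamble ~ income, data=teengamb)   # alternative: full model
anova(m0, mA)                             # F-test of the improvement in RSS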
Some pseudodata…
Inference Example 1: Testing the full model
Inference Example 2: Testing individual predictors
Explanatory power
• Given a significant model, R2 describes how well it explains the data
  – R2 = "proportion of variance explained" by the model
  – Almost ubiquitous in classical modeling
  – Generally, not the same as [correlation coefficient]2
    • (Although they are equivalent in simple linear regression)
• "Adjusted R2" penalizes for the number of parameters
  – Sort of like AIC
• Mallows' Cp is an estimate of prediction error
  – Also reflects the tradeoff of good fit vs. overfit
Explanatory power Example
Residuals:
     Min       1Q   Median       3Q      Max
-1.27356 -0.21581 -0.07422  0.19709  1.33962

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0378     0.5753   1.804  0.11425
x1           -0.2663     0.1042  -2.555  0.03782 *
x2            1.2676     0.2611   4.854  0.00185 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7817 on 7 degrees of freedom
Multiple R-squared: 0.9023,  Adjusted R-squared: 0.8744
F-statistic: 32.33 on 2 and 7 DF,  p-value: 0.0002913
t-tests for individual parameter significance
F-test for overall model significance
R2 and adjusted R2 indicating explanatory power of the model
Model Diagnosis and Assumption-checking
New example data
Leverage
• Leverage measures how influential a residual is
• Leverage hi is based on the distance of observation i from the mean x-value:

  hi = 1/n + (xi − x̄)² / Σj(xj − x̄)²

• High leverage is not necessarily a problem, but indicates observations to "keep an eye on"
• Leverages sum to p (the # of parameters in the model)
  – So, average leverage = p/n
• As a rule of thumb, look out for leverage > 2p/n (see the sketch below)
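Leverages can be extracted from a fit with hatvalues(); the 2p/n rule of thumb is sketched here:

h = hatvalues(fit)   # leverage h_i for each observation
p = fit$rank         # number of parameters
n = length(h)
which(h > 2*p/n)     # observations to "keep an eye on"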
Leverage: Example
[Plot: Leverages vs. Index]
Outliers
• Outliers are observations that are unlikely to fit the same model as the majority of the data
• One test is based on the "studentized" residual of each observation, given a model fit to all other obs:

  ti = ε̂i / (σ̂(i) √(1 − hi))

  where ε̂i is the residual of data point i, σ̂(i) is the estimated s.d. of the residuals with the ith obs. excluded, and hi is the leverage of the ith point
• These ti are t-distributed with n − p − 1 d.f.
• So, you can test for outliers with a "simple" t-test (see the sketch below)
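rstudent() returns these studentized (jackknife) residuals directly; the test below is a sketch (the Bonferroni correction for testing every observation is an addition not shown in the slides):

t = rstudent(fit)             # studentized residuals
n = length(t); p = fit$rank
pvals = 2 * pt(abs(t), df = n - p - 1, lower.tail = FALSE)   # two-sided t-test
which(p.adjust(pvals, "bonferroni") < 0.05)                  # flagged outliers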
Outliers: Example
[Plot: Jackknife Residuals vs. Index]
Influence and Cook’s Distance
• An observation is influential if it has a large effect on the regression results
  – This comes from the combination of a large residual and high leverage
• Cook's D: a statistic to measure influence:

  Di = (ri² / p) · (hi / (1 − hi))

  where ri is the ith (standardized) residual and hi is the ith leverage
• Criteria for testing D vary
• Definitely check the model fit with and without the observations with highest D (see the sketch below)
Influence and Cook’s Distance Example
[Plot: Cook's Distance vs. Index; Utah and West Virginia stand out]
Homoskedasticity
• Check visually with a plot of residuals vs. fitted values
• Diagnostic checks include regression of (scaled) residuals on the original covariates
• Could fix by data transformation or additional covariates
From Faraway's Practical Regression and Anova using R, §7.5 "Residual Plots":
…you can make. If all is well, you should see constant variance in the vertical (ε̂) direction and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). In Figure 7.5, these three cases are illustrated.
[Figure 7.5: Residuals vs Fitted plots — the first ("No problem") suggests no change to the current model, while the second ("Heteroscedasticity") shows non-constant variance and the third ("Nonlinear") indicates some nonlinearity which should prompt some change in the structural form of the model]
You should also plot ε̂ against xi (for predictors that are both in and out of the model). Look for the same things, except in the case of plots against predictors not in the model, look for any relationship which might indicate that this predictor should be included.
We illustrate this using the savings dataset as an example again:
> g <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
First the residuals vs. fitted plot and the abs(residuals) vs. fitted plot.
> plot(g$fit,g$res,xlab="Fitted",ylab="Residuals")
> abline(h=0)
> plot(g$fit,abs(g$res),xlab="Fitted",ylab="|Residuals|")
The plots may be seen in the first two panels of Figure 7.5. What do you see? The latter plot is designed to check for non-constant variance only. It folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance. The first plot is still needed because non-linearity must be checked.
A quick way to check non-constant variance is this regression:
> summary(lm(abs(g$res) ~ g$fit))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.840      1.186    4.08  0.00017
g$fit         -0.203      0.119   -1.72  0.09250
Homoskedasticity Example
[Plots: Residuals vs. Fitted values, two panels]
Homoskedasticity Example
?
Normal error
• Usually assessed visually, by Q-Q plots, boxplots, histograms
• This takes practice
• There are plenty of "tests for normality," but the p-values they give don't translate directly into action
• When residuals are non-normal:
  – Parameter estimates are usually still OK
  – Parameter CIs are more suspect, but may still be OK
  – Higher residual skew and lower sample size both increase concern
Normal error Example
[Figure: Normal Q-Q plots — (Our data) shown alongside (random standard normal samples)]
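One way to build this kind of comparison panel is to plot the residuals' Q-Q plot next to Q-Q plots of rnorm() draws of the same size (a sketch):

par(mfrow=c(2,5))   # one panel for our data, nine for random draws
qqnorm(residuals(fit), main="Our data")
for (i in 1:9) qqnorm(rnorm(nobs(fit)), main="Random normal")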
Model Building (aka variable selection)
Basic Forward selection / Backward elimination
• Idea: Use p-values of individual parameters to include or exclude them from the model
  – Forward Selection: Sequentially add the most significant parameters
  – Backward Elimination: Start with all parameters, sequentially remove the least significant (see the sketch after this list)
• Parameter significance is judged by t-test
  – (against the null hypothesis of parameter = 0)
  – Lowest p-value = "most significant"
• Goal: Include as many significant parameters as possible.
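A sketch of backward elimination using drop1(), which reports the F-test for dropping each term (removing status first is assumed here for illustration):

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
drop1(fit, test="F")                # p-value for dropping each term
fit = update(fit, . ~ . - status)   # remove the least significant term
drop1(fit, test="F")                # repeat until all remaining terms are significant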
Backward elimination Example
Forward Selection Example
Model fit metrics
• Various statistics have been proposed to describe the fit of classical regression models
  – Adjusted R2: Based on R2, but with a penalty for model size
  – Mallows' Cp: An estimate of prediction error
    • Approximates the tradeoff for overfitting the data
  – AIC: Can be used for classical problems too!
• Finding the best model:
  – Fitting all possible models may be feasible (see the sketch below)
    • (Classical regression is fast)
  – Otherwise, use a search algorithm
    • Generalized concept of forward selection / backward elimination
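When an exhaustive search is feasible, the leaps package fits the best model of each size (a sketch, assuming leaps is installed):

library(leaps)
all = regsubsets(gamble ~ sex + status + income + verbal, data=teengamb)
s = summary(all)
s$adjr2   # adjusted R2 of the best model at each size
s$cp      # Mallows' Cp for the same models
s$which   # which variables each of those models includes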
Model fit metrics Example 1: Fit by Adjusted R2
Our original formula was:
fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
So we know that variables 1, 3, and 4 are sex, income, and verbal
Model fit metrics Example 2: Fit by Mallows’ Cp
The expected value of Cp is ~p, so only consider models that fall below the 1:1 line in a Cp vs. p plot. Among those, you could favor the fewest parameters and/or the lowest Cp.

[Plot: Cp vs. p, with candidate models labeled by the variables they include (13, 134, 123, 1234) and the 1:1 line]
Model fit metrics Example 3: Fit by AIC
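AIC-based selection is built into step(), which generalizes forward selection / backward elimination; a minimal sketch:

fit = lm(gamble ~ sex + status + income + verbal, data=teengamb)
step(fit)   # backward AIC search by default; prints each candidate drop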
Beyond lm()
• Most other "standard" regression models are covered by glm(), which operates similarly (see the sketch after this list)
• There are also a lot of other regression-esque modeling methods, e.g.:
  – Regression trees, neural networks, splines and local regressions, etc.
• R has libraries for these and many other advanced methods
• But remember:
  – It's easy to write down (and code) a likelihood function for almost any model.
  – Then you can:
    • Solve by maximum likelihood (simpler cases), or
    • Solve by Bayesian MCMC (complex, hierarchical, prior info., etc.)
  – This is often easier and simpler than learning the nuances of some new canned function.
  – It also puts you in complete control of your model (for better or worse)
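For instance, glm() mirrors the lm() syntax with an added family argument (a sketch; y01, x1, x2, and dat are invented names):

fit = glm(y01 ~ x1 + x2, family=binomial, data=dat)   # logistic regression
summary(fit)   # same formula interface and accessors as lm()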
Recommended