
Linear models in R

Richard Sherley

University of Exeter, Penryn Campus, UK

February 2020


Thanks and preamble

Thanks to J. J. Valletta (now an Associate Lecturer in Statistics, University of St Andrews) and T. J. McKinley (Lecturer in Mathematical Biology, now at the Streatham Campus)

Extensive notes, handouts of these slides, and data files for the practicals are available at: https://exeter-data-analytics.github.io/StatModelling/

The Team

• Dr Beth Clark

• Dr Dan Padfield

• Dr Matt Silk

• Dr Richard Inger


What is a model and why do we need one?

A model is a human construct/abstraction that tries to approximate the data-generating process in some useful manner

Models are built for

• enhancing our understanding of a complex phenomenon

• executing “what if” scenarios

• predicting/forecasting an outcome

• controlling a system


Illustrative example

[Figure: scatterplot of minutes spent on Monday talking about GoT at work (y-axis) against hours spent watching Game of Thrones (GoT) at the weekend (x-axis)]


[Figure: minutes spent on Monday talking about GoT at work against hours spent watching GoT at the weekend, annotated with c = intercept and m = gradient]


[Figure: minutes spent on Monday talking about GoT at work against hours spent watching GoT at the weekend, annotated with β0 = intercept and β1 = gradient]


Formal definition

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

Observed data

• y (outcome/response): minutes spent talking about GoT

• x (explanatory): hours spent watching Game of Thrones (GoT)

Parameters to infer

• β0: intercept

• β1: gradient (change in minutes spent talking about GoT per extra hour spent watching)
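One way to make the definition concrete is to simulate data from it and check that a fitted model recovers the parameters. The sketch below is only an illustration: the sample size and the values of beta0, beta1 and sigma are assumptions chosen for demonstration, not the data behind the slides.

```r
# Simulate data from y_i = beta0 + beta1 * x_i + eps_i, eps_i ~ N(0, sigma^2).
# All numeric values here are illustrative assumptions, not the slide data.
set.seed(1)
n     <- 100
beta0 <- 2      # intercept (assumed)
beta1 <- 2.5    # gradient (assumed)
sigma <- 3      # residual standard deviation (assumed)

watch <- runif(n, min = 0, max = 10)      # hours spent watching GoT
eps   <- rnorm(n, mean = 0, sd = sigma)   # random noise
talk  <- beta0 + beta1 * watch + eps      # minutes spent talking about GoT

sim_fit <- lm(talk ~ watch)
coef(sim_fit)   # estimates should lie close to beta0 and beta1
```

Because the noise is random, the estimates will not match the true values exactly; rerunning with a larger n should bring them closer.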


Linear models in R

• Use the lm() function

• Requires a formula object: outcome ~ explanatory variable

# talk: minutes spent talking about GoT (outcome/response variable)
# watch: hours spent watching GoT (explanatory variable)

fit <- lm(talk ~ watch)

# If the data are in a data frame called "df"
fit <- lm(talk ~ watch, data = df)
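Once a model has been fitted, the estimates can be pulled out of the fit object directly. The following is a minimal sketch that assumes a data frame df with columns talk and watch; the values are simulated here as a stand-in so that the example runs on its own.

```r
# Hypothetical data frame standing in for the GoT data (values are simulated).
set.seed(42)
df <- data.frame(watch = runif(50, 0, 10))
df$talk <- 2 + 2.5 * df$watch + rnorm(50, sd = 3)

fit <- lm(talk ~ watch, data = df)

coef(fit)      # point estimates of the intercept and gradient
confint(fit)   # 95% confidence intervals for both parameters
```

Passing the data frame through the data argument keeps the formula short and avoids relying on variables in the global workspace.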


Summary of fitted model

summary(fit)

##
## Call:
## lm(formula = height ~ weight, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -31.089  -6.926  -0.689   6.057  24.967
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.35229    7.11668   0.331    0.742
## weight       2.17446    0.08782  24.762   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.31 on 98 degrees of freedom
## Multiple R-squared:  0.8622, Adjusted R-squared:  0.8608
## F-statistic: 613.1 on 1 and 98 DF,  p-value: < 2.2e-16
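Several of the quantities in this output can be reproduced by hand from the others, which helps in reading the table. The sketch below uses only numbers copied from the output above together with the standard textbook formulas; the sample size of 100 is inferred from the 98 residual degrees of freedom plus the two estimated parameters.

```r
# Values copied from the summary output above.
estimate <- 2.17446   # slope estimate for weight
std_err  <- 0.08782   # its standard error
r2       <- 0.8622    # multiple R-squared
n        <- 100       # 98 residual degrees of freedom + 2 estimated parameters
p        <- 1         # number of explanatory variables

t_value <- estimate / std_err                     # t statistic for the slope
adj_r2  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-squared
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))    # overall F statistic

round(c(t = t_value, adj_r2 = adj_r2, F = f_stat), 4)
```

The results agree with the printed t value, adjusted R-squared and F-statistic up to the rounding of the inputs. With a single explanatory variable, the F statistic is also the square of the slope's t value.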


Model checking

In order to make robust inference, we must check the model fit

plot(fit)

[Figure: four diagnostic panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage (with Cook's distance). Labelled observations: 79, 56 and 64 in the first three panels; 79, 56 and 88 in the last.]
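By default, plot() on a fitted lm object draws the diagnostic plots one after another. A common way to see all four at once is to set up a 2 x 2 plotting grid first. The sketch below uses simulated stand-in data (the values are assumptions) so that it runs on its own.

```r
# Simulated stand-in data (assumed values), so the example is self-contained.
set.seed(7)
df <- data.frame(weight = runif(100, 60, 100))
df$height <- 2 + 2.2 * df$weight + rnorm(100, sd = 10)
fit <- lm(height ~ weight, data = df)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(fit)                    # draws the four default diagnostic plots
par(op)                      # restore the previous graphics settings
```

Restoring the graphics settings afterwards stops later plots from being squeezed into the same grid.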


Model checking

A couple of examples where the homogeneity of variance assumption is violated
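A violation of this kind can be produced deliberately to see what it looks like in a residual plot. The sketch below is an illustration with simulated data, not one of the examples from the slide: the noise standard deviation is assumed to grow in proportion to x.

```r
# Simulated data (assumed values) in which the noise grows with x.
set.seed(3)
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)   # standard deviation proportional to x
het_fit <- lm(y ~ x)

# A funnel shape in this plot suggests unequal variance.
plot(fitted(het_fit), residuals(het_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude numeric check: residual spread in the lower vs upper half of fitted values.
upper <- fitted(het_fit) > median(fitted(het_fit))
c(sd_lower = sd(residuals(het_fit)[!upper]),
  sd_upper = sd(residuals(het_fit)[upper]))
```

The residuals fan out as the fitted values increase, and the spread in the upper half is clearly larger than in the lower half.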


Model checking: Leverage and Influence

• Obs. 1 – large leverage (outlier in x and y), but not influential

• Obs. 2 – large residual, but not an outlier in x or y. Not influential

• Obs. 3 – not an outlier for y, but has large leverage and large residual: very influential (high Cook’s distance).
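Leverage and Cook's distance can be computed directly from a fitted model with hatvalues() and cooks.distance(). The sketch below uses simulated data (assumed values) with one extra point added that resembles Obs. 3: its y value is unremarkable, but its x value is extreme and it sits far from the trend.

```r
# Simulated data (assumed values) with one deliberately influential point added.
set.seed(11)
x <- c(runif(30, 0, 10), 25)              # last x value lies far from the rest
y <- c(1 + 2 * x[1:30] + rnorm(30), 10)   # its y is typical, but far below the trend
inf_fit <- lm(y ~ x)

lev  <- hatvalues(inf_fit)        # leverage of each observation
cook <- cooks.distance(inf_fit)   # influence of each observation

which.max(lev)    # observation with the largest leverage
which.max(cook)   # observation with the largest Cook's distance
```

Here the added point (observation 31) has both the largest leverage and the largest Cook's distance.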


Model checking: Plot your data!

• Always visualise your data before fitting any model

• Assumption of a linear relationship...

• Scatterplots can indicate unequal variance, nonlinearity and outliers


Anscombe (1973) American Statistician 27: 17–21

• $r^2 = 0.67$ in all four cases

• Test of H0: β1 = 0 is identical in all four cases (t = 4.24, P = 0.002)
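Anscombe's four data sets ship with base R as the anscombe data frame (columns x1 to x4 and y1 to y4), so the comparison can be reproduced directly. The sketch below fits the same simple regression to each data set and then plots them; the helper structure (a list of fits built with lapply) is just one convenient way to organise it.

```r
# The anscombe data frame ships with base R (datasets package):
# columns x1..x4 and y1..y4 hold the four data sets.
fits <- lapply(1:4, function(k) {
  lm(reformulate(paste0("x", k), response = paste0("y", k)), data = anscombe)
})

# R-squared and the t statistic for the slope in each of the four fits.
r2    <- sapply(fits, function(f) summary(f)$r.squared)
t_val <- sapply(fits, function(f) coef(summary(f))[2, "t value"])
round(rbind(r2 = r2, t = t_val), 2)

# The summaries agree, but the scatterplots look very different.
op <- par(mfrow = c(2, 2))
for (k in 1:4) {
  plot(anscombe[[paste0("x", k)]], anscombe[[paste0("y", k)]],
       xlab = paste0("x", k), ylab = paste0("y", k))
  abline(fits[[k]])
}
par(op)
```

The numerical summaries are practically identical across the four fits, which is why the plots are needed to tell the data sets apart.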


Confidence vs prediction intervals

[Figure: scatterplot of longevity (y-axis) against thorax (x-axis)]

[Figure: longevity (y-axis) against thorax (x-axis)]

Fitted regression line with 97% prediction interval
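In R, both kinds of interval come from predict() on a fitted lm object, via the interval argument. The sketch below uses simulated stand-in data: the variable names mirror the axes of the figure, but the values are assumptions, and the 97% level is taken from the caption.

```r
# Simulated stand-in data (assumed values); the variable names mirror the axes.
set.seed(5)
flies <- data.frame(thorax = runif(60, 0.65, 0.95))
flies$longevity <- -60 + 140 * flies$thorax + rnorm(60, sd = 12)
fit <- lm(longevity ~ thorax, data = flies)

new_x <- data.frame(thorax = c(0.7, 0.8, 0.9))

# Confidence interval: uncertainty about the mean regression line.
conf_int <- predict(fit, newdata = new_x, interval = "confidence", level = 0.97)
# Prediction interval: uncertainty about a single new observation.
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.97)

cbind(new_x,
      conf_width = conf_int[, "upr"] - conf_int[, "lwr"],
      pred_width = pred_int[, "upr"] - pred_int[, "lwr"])
```

The prediction interval is always the wider of the two, because it includes the residual variation around the line as well as the uncertainty in the line itself.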


Multiple linear regression

$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \varepsilon_i$

$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

[Figure: plot with axes Weight (kg), Height of Parents (cm) and Height (cm)]
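In lm(), extra explanatory variables are added to the right-hand side of the formula with +. The sketch below is a minimal illustration with simulated data; the column name parent_height and all numeric values are assumptions made for the example.

```r
# Simulated data (assumed values); parent_height is a hypothetical column name.
set.seed(9)
n  <- 100
df <- data.frame(weight        = runif(n, 60, 100),
                 parent_height = runif(n, 120, 220))
df$height <- 20 + 1.0 * df$weight + 0.5 * df$parent_height + rnorm(n, sd = 8)

# One coefficient per explanatory variable, plus the intercept.
mfit <- lm(height ~ weight + parent_height, data = df)
coef(mfit)
```

Each gradient is interpreted as the expected change in the response for a one-unit change in that variable, holding the other variable fixed.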


Categorical variables

[Figure: scatterplot of height (y-axis) against weight (x-axis), with a legend for sex (Female, Male)]


We need dummy variables

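$S_i = \begin{cases} 1 & \text{if } i \text{ is male,} \\ 0 & \text{otherwise} \end{cases}$

Here, female is known as the baseline/reference level

The regression is:

$y_i = \beta_0 + \beta_1 S_i + \beta_2 x_i + \varepsilon_i$

Or in English:

$\text{height}_i = \beta_0 + \beta_1 \text{sex}_i + \beta_2 \text{weight}_i + \varepsilon_i$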
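R builds the dummy variable automatically when a factor appears in a formula, and model.matrix() shows the result. The sketch below uses a tiny made-up data frame (the four rows are assumptions for illustration). Factor levels are ordered alphabetically by default, so Female becomes the baseline.

```r
# Tiny made-up data frame, purely for illustration.
df <- data.frame(
  sex    = factor(c("Female", "Male", "Male", "Female")),
  weight = c(62, 85, 78, 70)
)

# Design matrix that lm(height ~ sex + weight) would use:
# the sexMale column is the dummy variable S_i (1 = male, 0 = female).
X <- model.matrix(~ sex + weight, data = df)
X
```

To use a different baseline, the factor can be releveled before fitting, for example with relevel(df$sex, ref = "Male").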


The mean regression lines for male and female are:

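• Female (sex=0)

$\text{height}_i = \beta_0 + (\beta_1 \times 0) + \beta_2 \text{weight}_i$

$\text{height}_i = \beta_0 + \beta_2 \text{weight}_i$

• Male (sex=1)

$\text{height}_i = \beta_0 + (\beta_1 \times 1) + \beta_2 \text{weight}_i$

$\text{height}_i = (\beta_0 + \beta_1) + \beta_2 \text{weight}_i$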
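The two lines above can be recovered from the fitted coefficients: both share the weight gradient, and the male intercept is the female intercept plus the sex coefficient. The sketch below illustrates this with simulated data (all values are assumptions), where the coefficient for the dummy variable is named sexMale by R.

```r
# Simulated data (assumed values) with a shift in height for males.
set.seed(21)
n  <- 100
df <- data.frame(sex    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
                 weight = runif(n, 60, 100))
df$height <- 100 + 15 * (df$sex == "Male") + 1.0 * df$weight + rnorm(n, sd = 8)

cfit <- lm(height ~ sex + weight, data = df)
b <- coef(cfit)

# Female: beta0 + beta2 * weight;  Male: (beta0 + beta1) + beta2 * weight
female_line <- c(intercept = b[["(Intercept)"]],                  slope = b[["weight"]])
male_line   <- c(intercept = b[["(Intercept)"]] + b[["sexMale"]], slope = b[["weight"]])
rbind(Female = female_line, Male = male_line)
```

This model forces the two lines to be parallel; allowing different gradients for each sex would need an interaction term, which is beyond what the slides cover here.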


Summary

Linear regression is a powerful tool:

• It splits the data into signal (trend/mean) and noise (residual error)

• It can cope with multiple variables

• It can incorporate different types of variable

• It can be used to produce point and interval estimates for the parameters

• It can be used to assess the importance of variables

But always check that the model fit is sensible, that the assumptions are met and the results make sense!
