
Linear Modelling: Simple Regression
10th of May 2018
R. Nicholls / D.-L. Couturier / M. Fernandes

Introduction:

ANOVA
•  Used for testing hypotheses regarding differences between groups
•  Considers the variation within and between groups

Regression
•  Used for revealing and investigating relationships between input and output variables
•  Model data, and extrapolate as much information as possible

Correlation:

How to measure the strength of a linear relationship between variables?

[Scatter plots of y vs x: positively correlated, negatively correlated, and uncorrelated examples]

Pearson's product-moment correlation coefficient:

$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$

Coefficient of Determination (R² value): $R^2 = r^2$

Correlation:

[Example scatter plots with their correlation coefficients:]
•  r = 0.931, R² = 0.866
•  r = −0.949, R² = 0.901
•  r = −0.060, R² = 0.004
•  r = 0.106, R² = 0.011
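These quantities are easy to verify in R. A minimal sketch, assuming simulated data rather than the slides' actual values:

set.seed(1)
x = runif(50, 0, 50)
y = x + rnorm(50, sd = 5)                # positively correlated example

# Pearson's r computed from the definition...
r = sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r
cor(x, y)                                # ...matches the built-in
cor(x, y)^2                              # coefficient of determination, R^2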

Correlation:

Can I say whether my data are correlated? Is an observed correlation significant? Use cor.test(). For strongly correlated data:

data:  x and y
t = 17.613, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8802556 0.9602168
sample estimates:
      cor
0.9305923

For weakly correlated data, the test fails to reject zero correlation:

data:  x and y
t = 1.5609, df = 48, p-value = 0.1251
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.06238066 0.46941403
sample estimates:
      cor
0.2197833
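A minimal sketch of running the test, reusing the simulated x and y from the earlier sketch:

cor.test(x, y)   # reports t, df, p-value, a 95% CI for the true correlation, and r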

Simple Regression:

Aims:
•  To investigate linear correlation between two variables in more detail
•  Be able to predict the response given a knowledge of the independent variable

Predictor variable = Independent variable (x)
Response variable = Dependent variable (y)

[Scatter plot: y vs x with fitted regression line]

For the i-th observation:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

where the $\varepsilon_i$ are the errors (residuals).
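To make the model concrete, here is a minimal sketch that simulates data from it (the parameter values 5 and 0.8 are arbitrary choices for illustration):

set.seed(2)
n   = 50
x   = runif(n, 0, 50)
eps = rnorm(n, sd = 5)          # Gaussian errors
y   = 5 + 0.8 * x + eps         # y_i = beta0 + beta1 * x_i + eps_i
plot(x, y)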

Simple Regression:

So how do we fit the regression line? Suppose we know parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.

Observations: $y_i$
Fitted values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Residuals: $\hat{\varepsilon}_i = y_i - \hat{y}_i$

Assuming a Gaussian error model, $\varepsilon_i \sim N(0, \sigma^2)$, each observation has density

$f(y_i \mid x_i;\, \beta_0, \beta_1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)$

[Scatter plot: observation $y_i$, fitted value $\hat{y}_i$, and residual $\varepsilon_i$ at $x_i$]

Simple Regression:

So how do we fit the regression line in practice? Obtain estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ by maximising the likelihood of the parameters given the data:

$L(\beta_0, \beta_1 \mid x, y) = \prod_i f(y_i \mid x_i;\, \beta_0, \beta_1) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)$

Taking logs:

$\ln L(\beta_0, \beta_1 \mid x, y) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \hat{y}_i)^2$

The estimates are $(\hat{\beta}_0, \hat{\beta}_1) = \arg\max_{\beta_0, \beta_1} L(\beta_0, \beta_1 \mid x, y)$. Since only the residual sum of squares depends on the parameters, maximising the likelihood is the same as minimising $\sum_i (y_i - \hat{y}_i)^2$: Maximum Likelihood and Least Squares estimates are equivalent (for the Gaussian error model).
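This equivalence can be checked numerically. A minimal sketch, assuming simulated data; negll is an illustrative helper, not part of the slides:

set.seed(3)
x = runif(50, 0, 50)
y = 2 + 0.9 * x + rnorm(50, sd = 5)

# negative log-likelihood under the Gaussian error model
negll = function(par) {
  beta0 = par[1]; beta1 = par[2]; sigma = exp(par[3])   # exp() keeps sigma > 0
  -sum(dnorm(y, mean = beta0 + beta1 * x, sd = sigma, log = TRUE))
}

ml = optim(c(0, 0, 0), negll)   # maximum likelihood
ml$par[1:2]
coef(lm(y ~ x))                 # least squares: same beta0 and beta1 (up to tolerance)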

Simple Regression:

Minimising the sum of squared residuals gives the final answer:

$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

[Scatter plot: data with the fitted least squares line]
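The closed-form estimates agree with lm(). A sketch, reusing the simulated x and y from the previous sketch:

b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))   # identical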

Simple Regression:

Example: Predicting timber volume of felled black cherry trees

Response: y = Volume
Predictor: x = Girth

[Scatter plot: Volume vs Girth]

> cor(trees$Volume,trees$Girth)
[1] 0.9671194

> m1 = lm(Volume~Girth,data=trees)
> summary(m1)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max
-8.065 -3.107  0.152  3.495  9.587

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

The residuals look approximately Normal:

[Histogram of the residuals: roughly symmetric and centred on zero]

$\hat{\sigma} = 4.252, \qquad \hat{\sigma}^2 = 18.1$

Approximately 95% of observations lie within ±8.5 (about $2\hat{\sigma}$) of the fitted line.
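The fitted model can then be used for prediction. A minimal sketch using R's built-in trees data; the 95% prediction interval is the formal analogue of the ±8.5 rule of thumb above:

m1 = lm(Volume ~ Girth, data = trees)
# predicted volume for a tree of girth 15, with a 95% prediction interval
predict(m1, newdata = data.frame(Girth = 15), interval = "prediction")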

Linear Regression:

Assumptions:
1.  Model is linear in parameters.
2.  Gaussian error model.
3.  Additive error model.
4.  Independence of errors: no autocorrelation, which occurs when one observation depends on the last.
5.  Homoscedasticity: homogeneity/stability of the variance of the residuals.

Testing Assumptions: diagnostic plots

1.  Residuals vs Fitted Values
[Plot: residuals vs fitted values for lm(Volume ~ Girth)]
•  Residuals and fitted values should not be related
•  No visible pattern
•  Mean residual = zero
•  Constant variance

2.  Normal Quantile-Quantile Plot
[Plot: standardised residuals vs theoretical quantiles]
•  Visual test for Normality
•  No strong trends/departures

3.  Scale-Location Plot
[Plot: standardised residuals vs fitted values]
•  Test for homoscedasticity
•  Should be constant, ≈ 1
•  No trend

4.  Index Plot of Cook's Distance (Residuals vs Leverage)
[Plot: standardised residuals vs leverage, with Cook's distance contours]
•  Measures the influence of a particular observation
•  Extreme x-values: high leverage
•  May inform outlier rejection
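All four plots come from R's plot method for fitted lm objects. A minimal sketch, with m1 the trees fit from earlier:

m1 = lm(Volume ~ Girth, data = trees)
par(mfrow = c(2, 2))    # 2x2 grid
plot(m1)                # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
plot(m1, which = 4)     # index plot of Cook's distance
cooks.distance(m1)      # the underlying values, if needed individually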

Modelling Non-Linear Relationships

Linear models can be used to describe non-linear relationships...

Applying transformations to the response and/or predictor variables can be useful to:
•  Linearise the data, i.e. make the relationship between variables more linear.
•  Stabilise the variance of the residuals, so that σ² doesn't depend on the independent variable.
•  Normalise the distribution of the residuals.

Modelling Non-Linear Relationships

Example: Stopping distance of cars versus speed (mph)

Response: y = distance
Predictor: x = speed

[Scatter plots and residuals-vs-fitted plots for three candidate models:]
•  lm(dist ~ speed): R² = 0.651
•  lm(sqrt(dist) ~ speed): R² = 0.709
•  lm(log(dist) ~ log(speed)): R² = 0.733
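The three candidate models can be fitted and compared directly. A minimal sketch using R's built-in cars data:

m_raw  = lm(dist ~ speed, data = cars)
m_sqrt = lm(sqrt(dist) ~ speed, data = cars)
m_log  = lm(log(dist) ~ log(speed), data = cars)
sapply(list(m_raw, m_sqrt, m_log), function(m) summary(m)$r.squared)
# 0.651, 0.709, 0.733 (the responses are on different scales,
# so these R^2 values are only a rough guide)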

Call:
lm(formula = log(dist) ~ log(speed), data = cars)

Residuals:
     Min       1Q   Median       3Q      Max
-1.00215 -0.24578 -0.02898  0.20717  0.88289

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.7297     0.3758  -1.941   0.0581 .
log(speed)    1.6024     0.1395  11.484 2.26e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4053 on 48 degrees of freedom
Multiple R-squared: 0.7331, Adjusted R-squared: 0.7276
F-statistic: 131.9 on 1 and 48 DF,  p-value: 2.259e-15

Modelling Non-Linear Relationships

Can you use simple regression to fit this model?

$y = e^{\beta_0} x^{\beta_1} \varepsilon$ (non-linear, with a multiplicative error model)

Yes, so long as the error model is log-Normal: taking logs gives $\log y = \beta_0 + \beta_1 \log x + \log\varepsilon$, which is linear in the parameters with additive Gaussian errors.

[Scatter plot: dist vs speed with the fitted curve]

Modelling Non-Linear Relationships

Back-transforming the fitted log-log model gives the curve on the original scale:

$\hat{y} = e^{\hat{\beta}_0} x^{\hat{\beta}_1}$

[Scatter plot: dist vs speed with the back-transformed fitted curve]
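Predictions on the original scale are obtained by exponentiating. A minimal sketch (note that under log-Normal errors, exp() of the log-scale prediction estimates the median rather than the mean):

m_log = lm(log(dist) ~ log(speed), data = cars)
# predicted stopping distance at 20 mph, back-transformed
exp(predict(m_log, newdata = data.frame(speed = 20)))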

Simple Regression in R:

Correlation coefficients. R functions:

plot(x,y)
cor(x,y)
cor.test(x,y)

[Scatter plot: y vs x]

data:  x and y
t = 17.613, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8802556 0.9602168
sample estimates:
      cor
0.9305923

Simple Regression in R:

R functions:

plot(x,y)
m1 = lm(y~x)
abline(m1)
summary(m1)

[summary(m1) output for lm(Volume ~ Girth, data = trees), as shown earlier]

Simple Regression in R:

R functions:

plot(x,y)
m1 = lm(y~x)
abline(m1)
summary(m1)
r1 = residuals(m1)
hist(r1)

Simple Regression in R:

R functions:

plot(x,y)
m1 = lm(y~x)
abline(m1)
summary(m1)
r1 = residuals(m1)
hist(r1)
plot(m1)

[The four diagnostic plots produced by plot(m1): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance]
