Stat 115 Lecture Notes 12

Xiongzhi Chen

Washington State University

Source: math.wsu.edu/faculty/xchen/stat115/LectureNotes6/LectureNotes12_notes.pdf

Contents

Multiple linear regression
• Motivation
• Simple linear regression
• Multiple linear regression
• Interpretation of model
• Estimate regression coefficients

Fit multiple linear regression
• Data matrix
• Correlation plot
• Fit multiple regression model
• Fitted model
• Questions

Inference on regression coefficients
• General principle
• Inference on global null
• Inference on one coefficient
• Inference on subset of coef
• Inference on coefficients

Practice via software
• Load data
• Pairwise correlations
• Model
• Fit linear model
• Test global null
• Test one coefficient
• Test a subset of coef
• Quick diagnostics
• License and session Information

Multiple linear regression

Motivation

• Using blood sugar, blood pressure, weight, triglycerides to predict cholesterol

• Using HW scores, participation scores, midterm scores to predict final exam scores

• Using precipitation, temperature, relative humidity to predict fruit growth

Simple linear regression

• Model: yi = β0 + β1xi + εi

• Model: y = β0 + β1x + ε when the εi are i.i.d.

• Model: E(y) = β0 + β1x

• β1: the change in E(y) for a one-unit change in x

In a model, the random error term ε represents all other variables not explicitly included in the model

Simple linear regression

• Not sufficient to capture the relationship between more than two variables

• Not sufficient to capture a nonlinear relationship between two variables

A finer decomposition of the random error leads to a better but more complicated model


Multiple linear regression with one response

• one response variable y; observations y1, y2, . . . , yn
• p covariates x1, . . . , xp
• random error ε with zero mean; realizations ε1, ε2, . . . , εn

For example, with p = 3 covariates x1, x2 and x3, the model is:

yi = β0 + β1x1i + β2x2i + β3x3i + εi



In the model

y = β0 + β1x1 + β2x2 + β3x3 + ε

if you absorb the term β2x2 + β3x3 + ε into a new random error δ, i.e., if you set

δ = β2x2 + β3x3 + ε,

you get the Simple Linear Regression Model

y = β0 + β1x1 + δ

Interpretation of model

Take p = 3 for example. Consider the model:

y = β0 + β1x1 + β2x2 + β3x3 + ε

Is the model for E(y), i.e., for the expected value of y?

What does β1 mean?
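One way to approach the second question, sketched under the assumptions that ε has zero mean and that x2 and x3 are held fixed while x1 increases by one unit:

```latex
\begin{aligned}
E(y \mid x_1, x_2, x_3) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, \\
E(y \mid x_1 + 1, x_2, x_3) - E(y \mid x_1, x_2, x_3) &= \beta_1 .
\end{aligned}
```

Read this way, β1 is the change in E(y) for a one-unit increase in x1 when the other covariates do not change.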

Estimate regression coefficients

The Least-Squares Method is still used, i.e., the fitted model minimizes the sum of squares

$$\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}) \right]^2$$

with βj, j = 0, 1, . . . , p as parameters
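The minimization can be illustrated with a small sketch. It uses only the six observations printed in the Data matrix section of these notes, so its estimates will not match the full-data fit; the names `sat6`, `ss` and `beta_hat` are made up for this illustration. It solves the normal equations directly and compares the result with `lm()`.

```r
# Six observations copied from the data printout in these notes.
sat6 <- data.frame(
  satisfaction = c(48, 57, 66, 70, 89, 36),
  age          = c(50, 36, 40, 41, 28, 49),
  severity     = c(51, 46, 48, 44, 43, 54),
  anxiety      = c(2.3, 2.3, 2.2, 1.8, 1.8, 2.9)
)

# Design matrix: a column of ones for the intercept plus the covariates.
X <- cbind(1, sat6$age, sat6$severity, sat6$anxiety)
y <- sat6$satisfaction

# Sum of squares as a function of the coefficient vector beta.
ss <- function(beta) sum((y - X %*% beta)^2)

# Least-squares solution from the normal equations (X'X) beta = X'y.
beta_hat <- as.numeric(solve(crossprod(X), crossprod(X, y)))

# lm() minimizes the same criterion.
fit <- lm(satisfaction ~ age + severity + anxiety, data = sat6)

ss_min       <- ss(beta_hat)
ss_perturbed <- ss(beta_hat + c(0, 0.1, 0, 0))  # move one coefficient away
```

Comparing `beta_hat` with `coef(fit)` should show the same values, and `ss_perturbed` should exceed `ss_min`, since any move away from the least-squares solution increases the sum of squares.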

Fit multiple linear regression

Data matrix

Importing data:

  satisfaction age severity anxiety
1           48  50       51     2.3
2           57  36       46     2.3
3           66  40       48     2.2
4           70  41       44     1.8
5           89  28       43     1.8
6           36  49       54     2.9

y = satisfaction, x1 = age, x2 = severity, x3 = anxiety

Correlation plot

A correlation plot helps visually check

• if a covariate is correlated with the response
• if some covariates are correlated with each other
• if a linear term for a covariate in the model is appropriate
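A numeric companion to the plot is the matrix of pairwise correlations. The sketch below is an illustration only: it uses just the six observations printed in the Data matrix section, and the name `sat6` is made up here.

```r
# Six observations copied from the data printout in these notes.
sat6 <- data.frame(
  satisfaction = c(48, 57, 66, 70, 89, 36),
  age          = c(50, 36, 40, 41, 28, 49),
  severity     = c(51, 46, 48, 44, 43, 54),
  anxiety      = c(2.3, 2.3, 2.2, 1.8, 1.8, 2.9)
)

# Pairwise Pearson correlations; entry [i, j] is cor(column i, column j).
r <- cor(sat6)
round(r, 2)
```

The first row of `r` relates each covariate to the response, and the remaining off-diagonal entries show how strongly the covariates are correlated with each other.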

Correlation plot

R command:

> pairs(x, ...)

Correlation plot

> pairs(sat)

[Figure: scatterplot matrix produced by pairs(sat), with panels for satisfaction, age, severity and anxiety]

Fit multiple regression model

R command: lm(formula, data)

• “formula” is the expression for the model
• it can be: y ~ x1 + x2 + x3
• it can also include an interaction and a squared term: y ~ x1 + x2 + x3 + x1:x2 + I(x3^2)
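How R expands a formula can be inspected with `model.matrix()`. The sketch below uses made-up toy data (the data frame `toy` and its values are hypothetical) and shows that `x1:x2` stands for the product of x1 and x2, while a squared term has to be wrapped in `I()`.

```r
# Hypothetical toy data; the names x1, x2, x3 and y mirror the notation in the notes.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7),
  x3 = c(1, 3, 2, 5, 4, 7, 6, 9),
  y  = c(3, 5, 4, 8, 7, 12, 10, 15)
)

# Additive model: one linear term per covariate.
f_add <- y ~ x1 + x2 + x3

# x1:x2 adds the product of x1 and x2; I(x3^2) adds the square of x3.
# I() is needed because ^ has a special meaning inside a formula.
f_ext <- y ~ x1 + x2 + x3 + x1:x2 + I(x3^2)

X_add <- model.matrix(f_add, data = toy)
X_ext <- model.matrix(f_ext, data = toy)
colnames(X_ext)
```

Each column of the model matrix corresponds to one regression coefficient, so the extended formula estimates two more coefficients than the additive one.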

Fit multiple regression model


Model:

y = β0 + β1x1 + β2x2 + β3x3 + ε,

where the coefficients are estimated by

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> regML$coefficients
(Intercept)         age    severity     anxiety
158.4912517  -1.1416118  -0.4420043 -13.4701632

Correlation plot

Does the fit agree with the scatterplot matrix?

> pairs(sat)

[Figure: scatterplot matrix produced by pairs(sat), with panels for satisfaction, age, severity and anxiety]

Fitted model


Fitted model:

ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3

• How to interpret the estimated regression coefficients?
• Do you think the model captures the relationship between these variables well?

Questions

Recall the following:

• y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
• Fitted model:

ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3

Suppose you just fit the model

y = β0 + β1x1 + ε

• Will the estimated β1 obtained from this model be different from that obtained from the previous model?

Questions

Fit model: y = β0 + β1x1 + ε

> regMG = lm(satisfaction ~ age, sat)
> regMG$coefficients
(Intercept)         age
 119.943170   -1.520604

Fitted model: ŷ = 119.94 − 1.52x1

Questions

Recall: y = satisfaction, x1 = age, x2 = severity, x3 = anxiety

Compare the two fitted models:

• ŷ = 119.94 − 1.52x1

• ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3

Do you know what caused this difference?
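One possible explanation, offered as a sketch rather than as the notes' own answer: when an omitted covariate is correlated with x1, its effect is absorbed into the slope of x1. The simulation below uses made-up data and coefficients (two covariates only, arbitrary seed) and checks the least-squares identity "slope with x2 omitted = b1 + b2 × (slope of x2 on x1)".

```r
set.seed(115)  # arbitrary seed, for reproducibility only
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)         # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # made-up coefficients

b_full   <- coef(lm(y ~ x1 + x2))     # multiple regression
b_simple <- coef(lm(y ~ x1))          # x2 omitted
d        <- coef(lm(x2 ~ x1))["x1"]   # slope of x2 on x1

# Compare the two slopes of x1 and the value predicted by the identity.
slopes <- c(simple = unname(b_simple["x1"]),
            full   = unname(b_full["x1"]),
            check  = unname(b_full["x1"] + b_full["x2"] * d))
slopes
```

In this sketch `slopes["simple"]` and `slopes["check"]` coincide, while `slopes["full"]` is noticeably different, which mirrors the change from −1.52 to −1.14 in the two fitted models above.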

Inference on regression coefficients

General principle

Assumptions:

• random errors are independent
• random errors are Normally distributed with zero mean and the same variance

Namely, random errors are independent and identically distributed (i.i.d.) Normal. Then inference on regression coefficients can be based on Student's t distributions or F distributions

Inference on global null

Recall:

• y = satisfaction, x1 = age, x2 = severity, x3 = anxiety

• Model: y = β0 + β1x1 + β2x2 + β3x3 + ε

• Fitted model: ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3

Are all coefficients zero, i.e., H0 : β1 = β2 = β3 = 0?


If β1 = β2 = β3 = 0, then the performance of the Reduced Model

y = β0 + ε

should be very close to that of the Full Model

y = β0 + β1x1 + β2x2 + β3x3 + ε

In other words, SS(Reduced) = $\sum_{i=1}^{n} (y_i - \bar{y})^2$ (based on the Reduced Model) should be close to SS(Full) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (based on the Full Model)


To assess H0 : β1 = β2 = β3 = 0 versus Ha : at least one of β1, β2, β3 is nonzero, the test statistic is

$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/3}{\mathrm{SS(Full)}/(n-4)}$$

Under H0, it has an F distribution with df1 = 3 and df2 = n − 4. Reject H0 if F > Fα,df1,df2


Test H0 : β1 = β2 = β3 = 0 using the following:

• SS(Reduced) = 13369.3; SS(Full) = 4248.841
• p = 3 and n = 46
• F0.05,p,n−p−1 = 2.827
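The arithmetic for this test can be sketched in a few lines of R, using only the numbers quoted above. The variable names are made up for this illustration.

```r
# Numbers quoted in the bullets above.
ss_reduced <- 13369.3
ss_full    <- 4248.841
p <- 3
n <- 46

df1 <- p
df2 <- n - (p + 1)

# F statistic for H0: beta1 = beta2 = beta3 = 0.
F_stat <- ((ss_reduced - ss_full) / df1) / (ss_full / df2)

# Upper 5% point of the F distribution and the p-value.
F_crit <- qf(0.95, df1, df2)
p_val  <- pf(F_stat, df1, df2, lower.tail = FALSE)

reject <- F_stat > F_crit
```

Here `F_stat` comes out at roughly 30.05, far above the critical value of about 2.827, so `reject` is TRUE and H0 is rejected at level 0.05.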


Illustration:

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> summary(regML)$fstatistic
   value    numdf    dendf
30.05208  3.00000 42.00000


In general, to assess H0 : β1 = β2 = . . . = βp = 0 versus Ha : at least one βj ≠ 0, the test statistic is

$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/p}{\mathrm{SS(Full)}/(n-(p+1))}$$

• SS(Reduced) is the residual sum of squares obtained by fitting the Reduced Model y = β0 + ε
• SS(Full) is the residual sum of squares obtained by fitting the Full Model y = β0 + β1x1 + . . . + βpxp + ε

Under H0, the test statistic has an F distribution with df1 = p and df2 = n − (p + 1). Reject H0 if F > Fα,df1,df2

Inference on one coefficient

Recall the model

y = β0 + β1x1 + β2x2 + β3x3 + ε

Consider testing H0 : β1 = 0 vs Ha : β1 ≠ 0. Which of the t test and the F test should be used?


Test β1 = 0 or not using the following:

• $\hat{\beta}_1 = -1.14$ and $s_{\hat{\beta}_1} = 0.215$
• n = 46 and p = 3
• $s_{\hat{\beta}_1}$ has n − (p + 1) degrees of freedom
• t0.025,42 = 2.02
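A minimal sketch of the t test built from the rounded numbers above (the variable names are made up for this illustration; because the inputs are rounded, the statistic differs slightly from the one R reports on the full-precision fit):

```r
b1    <- -1.14    # estimate of beta1 quoted above
se_b1 <- 0.215    # its standard error
n <- 46
p <- 3

df     <- n - (p + 1)
t_stat <- b1 / se_b1

# Two-sided test at level 0.05: compare |t| with the upper 2.5% point.
t_crit <- qt(0.975, df)
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

reject <- abs(t_stat) > t_crit
```

Since |t| is about 5.3 and the critical value is about 2.02, `reject` is TRUE: H0 : β1 = 0 is rejected at level 0.05.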


Illustration:

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> library(broom)
> tidy(summary(regML))
# A tibble: 4 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  158.       18.1       8.74  5.26e-11
2 age           -1.14      0.215    -5.31  3.81e- 6
3 severity      -0.442     0.492    -0.898 3.74e- 1
4 anxiety      -13.5       7.10     -1.90  6.47e- 2

The value of the test statistic for H0 : β1 = 0 is −5.314796


Test β1 = 0 or not using the following:

• Full Model: y = β0 + β1x1 + β2x2 + β3x3 + ε
• Residual sum of squares for Full Model: 4248.841
• Reduced Model when H0 is true: y = β0 + β2x2 + β3x3 + ε
• Residual sum of squares for Reduced Model: 7106.394
• Full Model has 1 more parameter than Reduced Model

$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/1}{\mathrm{SS(Full)}/(n-(p+1))}$$


Illustration:

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> RSSF1 = sum(regML$residuals^2)
>
> regMLR = lm(satisfaction ~ severity + anxiety, sat)
> RSSR1 = sum(regMLR$residuals^2)
>
> FFR = ((RSSR1 - RSSF1)/1)/(RSSF1/42)
> FFR
[1] 28.24706

The value of the test statistic for H0 : β1 = 0 is 28.24706

Note that (−5.314796)² = 28.24706, i.e., the F statistic for a single coefficient is the square of its t statistic
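The agreement between the squared t statistic and the F statistic is not specific to this data set. The sketch below checks it on simulated data; the seed, sample size and coefficients are arbitrary choices made for this illustration.

```r
set.seed(12)  # arbitrary seed
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 - 1 * x1 + 0.5 * x2 + rnorm(n)   # made-up coefficients
d  <- data.frame(y, x1, x2, x3)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- lm(y ~ x2 + x3, data = d)     # model under H0: beta1 = 0

# t statistic for x1 from the coefficient table of the Full Model.
t_stat <- coef(summary(full))["x1", "t value"]

# F statistic comparing the Reduced and Full Models (1 coefficient tested).
rss_full    <- sum(residuals(full)^2)
rss_reduced <- sum(residuals(reduced)^2)
F_stat <- ((rss_reduced - rss_full) / 1) / (rss_full / df.residual(full))

c(t_squared = t_stat^2, F = F_stat)
```

The two printed values agree up to floating-point error, so for a single coefficient the two-sided t test and the F test lead to the same decision.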

Inference on subset of coef

Suppose we want to assess H0 : β1 = β2 = 0 in the model

y = β0 + β1x1 + β2x2 + β3x3 + ε

• Full Model: y = β0 + β1x1 + β2x2 + β3x3 + ε

• Reduced Model when H0 is true: y = β0 + β3x3 + ε

• Full Model has 2 more parameters than Reduced Model

$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/2}{\mathrm{SS(Full)}/(n-(p+1))}$$


Test H0 : β1 = β2 = 0 or not using the following:

• Full Model: y = β0 + β1x1 + β2x2 + β3x3 + ε

• Residual sum of squares for Full Model: 4248.841

• Reduced Model when H0 is true: y = β0 + β3x3 + ε


• Residual sum of squares for Reduced Model: 7814.391

• Full Model has 2 more parameters than Reduced Model


Illustration:

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> RSSF1 = sum(regML$residuals^2)
> RSSF1
[1] 4248.841
>
> regMLR2 = lm(satisfaction ~ anxiety, sat)
> RSSR2 = sum(regMLR2$residuals^2)
> RSSR2
[1] 7814.391
>
> FFR2 = ((RSSR2 - RSSF1)/2)/(RSSF1/42)
> FFR2
[1] 17.62282
>
> qf(0.05, df1 = 2, df2 = 42, ncp = 0, lower.tail = FALSE)
[1] 3.219942

Since 17.62282 > 3.219942, H0 is rejected at level 0.05

Inference on coefficients

• Full Model has p+ 1 parameters (1 intercept and p covariates)

• Reduced Model is obtained by assuming H0 is true, where H0 says that q coefficients in the Full Model are 0

• Full model has q more parameters than Reduced Model

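Under Model Assumptions, the test statistic is

$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/q}{\mathrm{SS(Full)}/(n-(p+1))}$$

which, under H0, has an F distribution with df1 = q and df2 = n − (p + 1)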

Inference on coefficients

R commands syntax:

# Full Model
regF = lm(y ~ all p covariates, yourdata)

# Reduced Model
regR = lm(y ~ (p-q) covariates, yourdata)

# testing (put Reduced Model ahead of Full Model)
anova(regR, regF)
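A runnable sketch of this recipe on simulated data, under made-up assumptions (arbitrary seed, sample size and coefficients; `nested_f` is a hypothetical helper written for this illustration, not a base R function). It computes the F statistic by hand and compares it with the output of `anova()`.

```r
# Hypothetical helper: F test of a Reduced Model nested in a Full Model.
nested_f <- function(reduced, full) {
  rss_r <- sum(residuals(reduced)^2)
  rss_f <- sum(residuals(full)^2)
  q     <- df.residual(reduced) - df.residual(full)  # coefficients tested
  df2   <- df.residual(full)                         # n - (p + 1)
  f     <- ((rss_r - rss_f) / q) / (rss_f / df2)
  c(F = f, df1 = q, df2 = df2,
    p.value = pf(f, q, df2, lower.tail = FALSE))
}

set.seed(2018)  # arbitrary seed; simulated data, not the course data
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + rnorm(n)

regF <- lm(y ~ x1 + x2 + x3, data = d)  # Full Model, p = 3
regR <- lm(y ~ x1, data = d)            # H0: beta2 = beta3 = 0, so q = 2

by_hand <- nested_f(regR, regF)
tab     <- anova(regR, regF)            # Reduced Model first
```

The entries `by_hand["F"]` and `by_hand["p.value"]` should match the `F` and `Pr(>F)` columns in the second row of `tab`.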


Practice via software

Load data

Importing data:

> sat = read.table("http://math.wsu.edu/faculty/xchen/stat412/data/CH06PR15.txt", header = FALSE)
> colnames(sat) = c("satisfaction", "age", "severity", "anxiety")
> head(sat)
  satisfaction age severity anxiety
1           48  50       51     2.3
2           57  36       46     2.3
3           66  40       48     2.2
4           70  41       44     1.8
5           89  28       43     1.8
6           36  49       54     2.9

Pairwise correlations

> pairs(sat)

[Figure: scatterplot matrix produced by pairs(sat), with panels for satisfaction, age, severity and anxiety]

Model

• y = satisfaction, x1 = age, x2 = severity, x3 = anxiety

• Model: y = β0 + β1x1 + β2x2 + β3x3 + ε

Fit linear model

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> regML$coefficients
(Intercept)         age    severity     anxiety
158.4912517  -1.1416118  -0.4420043 -13.4701632

Test global null

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> library(broom)
>
> # get value of F statistic and its p-value
> glance(regML)$statistic
[1] 30.05208
> glance(regML)$p.value
[1] 1.541973e-10

Test one coefficient

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> library(broom)
> tidy(regML)
# A tibble: 4 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  158.       18.1       8.74  5.26e-11
2 age           -1.14      0.215    -5.31  3.81e- 6
3 severity      -0.442     0.492    -0.898 3.74e- 1
4 anxiety      -13.5       7.10     -1.90  6.47e- 2

Test a subset of coef

H0 : β1 = β2 = 0

> # Full Model
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
>
> # Reduced Model
> regMLR2 = lm(satisfaction ~ anxiety, sat)
>
> # testing
> anova(regMLR2, regML)
Analysis of Variance Table

Model 1: satisfaction ~ anxiety
Model 2: satisfaction ~ age + severity + anxiety
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     44 7814.4
2     42 4248.8  2    3565.6 17.623 2.773e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Quick diagnostics

> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> par(mfrow = c(2, 2))
> plot(regML)

[Figure: 2 × 2 panel of diagnostic plots from plot(regML): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours. Observations 11, 17 and 27 are labeled in the first three panels; observations 17, 27 and 31 are labeled in the last.]
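The quantities drawn in these plots can also be extracted numerically. The sketch below runs on simulated data with arbitrary seed and coefficients (not the course data) and uses the base R functions `hatvalues()`, `rstandard()` and `cooks.distance()`.

```r
set.seed(7)  # arbitrary seed; simulated data, not the course data
n <- 46
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 5 - 1.0 * d$x1 - 0.5 * d$x2 + rnorm(n)   # made-up coefficients

fit <- lm(y ~ x1 + x2 + x3, data = d)

h  <- hatvalues(fit)        # leverage of each observation
rs <- rstandard(fit)        # standardized residuals
cd <- cooks.distance(fit)   # Cook's distance

# Observations with the three largest Cook's distances.
head(sort(cd, decreasing = TRUE), 3)

# Leverages sum to the number of fitted coefficients (here 4).
sum(h)
```

Observations that combine a large standardized residual with high leverage have large Cook's distance, and they are the ones that the Residuals vs Leverage panel labels.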

License and session Information

License: http://math.wsu.edu/faculty/xchen/stat412/LICENSE.html

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] broom_0.5.1 knitr_1.21

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       rstudioapi_0.8   bindr_0.1.1
 [4] magrittr_1.5     tidyselect_0.2.5 lattice_0.20-35
 [7] R6_2.3.0         rlang_0.3.0.1    fansi_0.4.0
[10] stringr_1.3.1    highr_0.7        dplyr_0.7.8
[13] tools_3.5.0      grid_3.5.0       nlme_3.1-137
[16] xfun_0.4         utf8_1.1.4       cli_1.0.1
[19] htmltools_0.3.6  yaml_2.2.0       digest_0.6.18
[22] assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4
[25] bindrcpp_0.2.2   tidyr_0.8.2      purrr_0.2.5
[28] glue_1.3.0       evaluate_0.12    rmarkdown_1.11
[31] stringi_1.2.4    compiler_3.5.0   pillar_1.3.1
[34] backports_1.1.3  generics_0.0.2   pkgconfig_2.0.2
