Stat 115 Lecture Notes 12
Xiongzhi Chen
Washington State University
Contents
Multiple linear regression
  Motivation
  Simple linear regression
  Multiple linear regression
  Interpretation of model
  Estimate regression coefficients
Fit multiple linear regression
  Data matrix
  Correlation plot
  Fit multiple regression model
  Fitted model
  Questions
Inference on regression coefficients
  General principle
  Inference on global null
  Inference on one coefficient
  Inference on subset of coef
  Inference on coefficients
Practice via software
  Load data
  Pairwise correlations
  Model
  Fit linear model
  Test global null
  Test one coefficient
  Test a subset of coef
  Quick diagnostics
  License and session Information
Multiple linear regression
Motivation
• Using blood sugar, blood pressure, weight, triglycerides to predict cholesterol
• Using HW scores, participation scores, midterm scores to predict final exam scores
• Using precipitation, temperature, relative humidity to predict fruit growth
Simple linear regression
• Model: yi = β0 + β1xi + εi
• Model: y = β0 + β1x + ε when the εi are i.i.d.
• Model: E[y] = β0 + β1x
• β1: change in E(y), in units of y, for a unit change in x
In a model, the random error term ε represents all other variables not explicitly included in the model.
Simple linear regression
• Not sufficient to capture the relationship between more than two variables
• Not sufficient to capture a nonlinear relationship between two variables
A finer decomposition of the random error leads to a better but more complicated model
Multiple linear regression
Multiple linear regression with one response
• one response variable y; observations y1, y2, . . . , yn
• p covariates x1, · · · , xp
• random error ε with zero mean; realizations ε1, ε2, . . . , εn
Say, p = 3 covariates x1, x2 and x3, then the model is:
yi = β0 + β1x1i + β2x2i + β3x3i + εi
Multiple linear regression
In the model
y = β0 + β1x1 + β2x2 + β3x3 + ε
if you absorb the term
β2x2 + β3x3 + ε
into a new random error δ, i.e., if you set
δ = β2x2 + β3x3 + ε,
you get the Simple Linear Regression Model
y = β0 + β1x1 + δ
Interpretation of model
Take p = 3 for example. Consider the model:
y = β0 + β1x1 + β2x2 + β3x3 + ε
Is the model for E(y), i.e., for the expected value of y?
What does β1 mean?
Estimate regression coefficients
The Least-Squares Method is still used, i.e., the fitted model minimizes the sum of squares
$$\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}) \right]^2$$
with βj, j = 0, 1, · · · , p as parameters
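As a minimal sketch of what "minimizes the sum of squares" means, the following R code uses simulated data (an assumption; it is not the data set from these notes) with p = 2 covariates. The names `ss`, `b_hat` and `b_norm` are made up for illustration. It compares the estimates from `lm()` with the solution of the normal equations and checks that moving away from the estimates increases the sum of squares.

```r
# Simulated data (hypothetical; not the lecture's data set)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

# Sum of squares as a function of the coefficient vector b = (b0, b1, b2)
ss <- function(b) sum((y - (b[1] + b[2] * x1 + b[3] * x2))^2)

# Least-squares estimates from lm()
fit   <- lm(y ~ x1 + x2)
b_hat <- coef(fit)

# The same estimates from the normal equations (X'X) b = X'y
X      <- cbind(1, x1, x2)
b_norm <- solve(t(X) %*% X, t(X) %*% y)

# Both routes should agree up to rounding error
max(abs(b_hat - as.vector(b_norm)))

# Perturbing the estimates increases the sum of squares
ss(b_hat) < ss(b_hat + c(0.1, 0, 0))
```

In practice `lm()` is the tool to use; the normal equations appear here only to make the minimization explicit.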
Fit multiple linear regression
Data matrix
Importing data:
  satisfaction age severity anxiety
1           48  50       51     2.3
2           57  36       46     2.3
3           66  40       48     2.2
4           70  41       44     1.8
5           89  28       43     1.8
6           36  49       54     2.9
y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
Correlation plot
A correlation plot helps visually check
• if a covariate is correlated with the response
• if some covariates are correlated with each other
• if a linear term for a covariate in the model is appropriate
Correlation plot
R command:
> pairs(x, ...)
Correlation plot
> pairs(sat)
[Figure: scatterplot matrix of satisfaction, age, severity and anxiety]
Fit multiple regression model
R command: lm(formula, data)
• “formula” is the expression for the model
• it can be: y ~ x1 + x2 + x3
• it can be: y ~ x1 + x2 + x3 + x1:x2 + I(x3^2), i.e., with an interaction term x1x2 and a quadratic term x3x3
Fit multiple regression model
y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
Model:
y = β0 + β1x1 + β2x2 + β3x3 + ε,
where the coefficients are estimated by
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> regML$coefficients
(Intercept)         age    severity     anxiety
158.4912517  -1.1416118  -0.4420043 -13.4701632
Correlation plot
Does the fit agree with the scatter plots?
> pairs(sat)
[Figure: scatterplot matrix of satisfaction, age, severity and anxiety]
Fitted model
y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
Fitted model:
ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3
• How to interpret the estimated regression coefficients?
• Do you think the model captures well the relationship between these variables?
Questions
Recall the following:
• y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
• Fitted model: ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3
Suppose you just fit the model
y = β0 + β1x1 + ε
• Will the estimated β1 obtained from this model be different from that obtained from the previous model?
Questions
Fit model: y = β0 + β1x1 + ε
> regMG = lm(satisfaction ~ age, sat)
> regMG$coefficients
(Intercept)         age
 119.943170   -1.520604
Fitted model: ŷ = 119.94 − 1.52x1
Questions
Recall: y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
Compare the two fitted models:
• ŷ = 119.94 − 1.52x1
• ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3
Do you know what caused this difference?
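One possible explanation can be sketched on simulated data (an assumption; this is not the satisfaction data, and the coefficient values below are invented). When an omitted covariate is correlated with x1, its effect is absorbed into the error term of the smaller model, and the estimated coefficient of x1 shifts. For least-squares fits the shift equals the omitted covariate's coefficient times the slope from regressing that covariate on x1.

```r
# Simulated data (hypothetical): x2 is correlated with x1
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)   # model with both covariates
reduced <- lm(y ~ x1)        # x2 is absorbed into the error term

b1_full    <- unname(coef(full)["x1"])
b2_full    <- unname(coef(full)["x2"])
b1_reduced <- unname(coef(reduced)["x1"])

# Slope of the omitted covariate x2 regressed on x1
slope_x2_on_x1 <- unname(coef(lm(x2 ~ x1))["x1"])

# The two estimates of the x1 coefficient differ ...
c(full = b1_full, reduced = b1_reduced)

# ... and the difference is b2_full * slope_x2_on_x1
all.equal(b1_reduced, b1_full + b2_full * slope_x2_on_x1)
```

If x2 were uncorrelated with x1 in the sample, the slope of x2 on x1 would be zero and the two estimates would coincide.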
Inference on regression coefficients
General principle
Assumptions:
• random errors are independent
• random errors are Normally distributed with zero mean and the same variance
Namely, random errors are independent and identically distributed (i.i.d.) Normal. Then inference on regression coefficients can be based on Student’s t distributions or F distributions
Inference on global null
Recall:
• y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
• Model: y = β0 + β1x1 + β2x2 + β3x3 + ε
• Fitted model: ŷ = 158.49 − 1.14x1 − 0.44x2 − 13.47x3
Are all coefficients zero, i.e., H0 : β1 = β2 = β3 = 0?
Inference on global null
If β1 = β2 = β3 = 0, then the performance of the Reduced Model
y = β0 + ε
should be very close to that of the Full Model
y = β0 + β1x1 + β2x2 + β3x3 + ε
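In other words,
$$\mathrm{SS(Reduced)} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
(based on the Reduced Model, where ȳ is the sample mean of the responses) should be close to
$$\mathrm{SS(Full)} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
(based on the Full Model, where ŷi is the i-th fitted value)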
Inference on global null
To assess H0 : β1 = β2 = β3 = 0 versus Ha : at least one of β1, β2, β3 is nonzero, the test statistic is
$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/3}{\mathrm{SS(Full)}/(n-4)}$$
Under H0 it has an F distribution with df1 = 3 and df2 = n − 4. Reject H0 if F > Fα,df1,df2
Inference on global null
Test H0 : β1 = β2 = β3 = 0 using the following:
• SS(Reduced) = 13369.3; SS(Full) = 4248.841
• p = 3 and n = 46
• F0.05,p,n−p−1 = 2.827
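The arithmetic for this test can be sketched in R using only the numbers listed above. The variable names are chosen for illustration; `qf()` and `pf()` are the base R quantile and distribution functions of the F distribution.

```r
ss_reduced <- 13369.3    # SS(Reduced), from the notes
ss_full    <- 4248.841   # SS(Full), from the notes
p <- 3                   # number of covariates
n <- 46                  # number of observations

# F = {[SS(Reduced) - SS(Full)] / p} / {SS(Full) / (n - p - 1)}
f_stat <- ((ss_reduced - ss_full) / p) / (ss_full / (n - p - 1))

# Upper 5% critical value and p-value of the F distribution with (3, 42) df
f_crit <- qf(0.95, df1 = p, df2 = n - p - 1)
p_val  <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

f_stat            # about 30.05
f_stat > f_crit   # TRUE, so H0 is rejected at the 0.05 level
```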
Inference on global null
Illustration:
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> summary(regML)$fstatistic
   value    numdf    dendf
30.05208  3.00000 42.00000
Inference on global null
In general, to assess H0 : β1 = β2 = . . . = βp = 0 versus Ha : at least one of β1, . . . , βp is nonzero, the test statistic is
$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/p}{\mathrm{SS(Full)}/(n-(p+1))}$$
• SS(Reduced) is the residual sum of squares obtained by fitting the Reduced Model y = β0 + ε
• SS(Full) is the residual sum of squares obtained by fitting the Full Model y = β0 + β1x1 + . . . + βpxp + ε
Under H0 the test statistic has an F distribution with df1 = p and df2 = n − (p + 1). Reject H0 if F > Fα,df1,df2
Inference on one coefficient
Recall the model
y = β0 + β1x1 + β2x2 + β3x3 + ε
Consider testing H0 : β1 = 0 vs Ha : β1 ≠ 0. Which should be used, the t test or the F test?
Inference on one coefficient
Test whether β1 = 0 using the following:
• β̂1 = −1.14 and its standard error s(β̂1) = 0.215
• n = 46 and p = 3
• s(β̂1) has n − (p + 1) degrees of freedom
• t0.025,42 = 2.02
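A minimal sketch of this t test in R, using only the rounded values listed above (variable names are illustrative). Because the inputs are rounded, the resulting statistic may differ slightly from what software reports on the raw data.

```r
b1_hat <- -1.14     # estimate of beta1, from the notes
se_b1  <- 0.215     # its standard error, from the notes
n <- 46
p <- 3
df <- n - (p + 1)   # 42 degrees of freedom

# t statistic for H0: beta1 = 0
t_stat <- b1_hat / se_b1

# Two-sided critical value at level 0.05 and two-sided p-value
t_crit <- qt(0.975, df)
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

t_stat                 # about -5.30 from the rounded inputs
abs(t_stat) > t_crit   # TRUE, so H0 is rejected at the 0.05 level
```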
Inference on one coefficient
Illustration:
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> library(broom)
> tidy(summary(regML))
# A tibble: 4 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  158.       18.1       8.74  5.26e-11
2 age           -1.14      0.215    -5.31  3.81e- 6
3 severity      -0.442     0.492    -0.898 3.74e- 1
4 anxiety      -13.5       7.10     -1.90  6.47e- 2
The value of the test statistic for testing β1 = 0 is −5.314796
Inference on one coefficient
Test whether β1 = 0 using the following:
• Full Model: y = β0 + β1x1 + β2x2 + β3x3 + ε
• Residual sum of squares for Full Model: 4248.841
• Reduced Model when H0 is true: y = β0 + β2x2 + β3x3 + ε
• Residual sum of squares for Reduced Model: 7106.394
• Full Model has 1 more parameter than Reduced Model
$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/1}{\mathrm{SS(Full)}/(n-(p+1))}$$
Inference on one coefficient
Illustration:
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> RSSF1 = sum(regML$residuals^2)
>
> regMLR = lm(satisfaction ~ severity + anxiety, sat)
> RSSR1 = sum(regMLR$residuals^2)
>
> FFR = ((RSSR1 - RSSF1)/1)/(RSSF1/42)
> FFR
[1] 28.24706
The value of the test statistic for testing β1 = 0 is 28.24706
Note that (−5.314796)² = 28.24706, i.e., the F statistic equals the square of the t statistic
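The agreement between the squared t statistic and the F statistic for a single coefficient is not special to this data set. A minimal check on simulated data (an assumption; all names and coefficient values below are invented for illustration):

```r
# Simulated data (hypothetical) with three covariates
set.seed(3)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 - x2 + 0.3 * x3 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x2 + x3)      # model under H0: beta1 = 0

# t statistic for the coefficient of x1 in the full model
t_stat <- summary(full)$coefficients["x1", "t value"]

# F statistic comparing the reduced and full models (q = 1, p = 3)
rss_full    <- sum(residuals(full)^2)
rss_reduced <- sum(residuals(reduced)^2)
f_stat <- ((rss_reduced - rss_full) / 1) / (rss_full / (n - 4))

# The squared t statistic equals the F statistic up to rounding error
all.equal(t_stat^2, f_stat)
```

So for one coefficient the two-sided t test and the F test lead to the same decision.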
Inference on subset of coef
Say we want to assess H0 : β1 = β2 = 0 in the model
y = β0 + β1x1 + β2x2 + β3x3 + ε
• Full Model: y = β0 + β1x1 + β2x2 + β3x3 + ε
• Reduced Model when H0 is true: y = β0 + β3x3 + ε
• Full Model has 2 more parameters than Reduced Model
$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/2}{\mathrm{SS(Full)}/(n-(p+1))}$$
Inference on subset of coef
Test whether H0 : β1 = β2 = 0 holds using the following:
• Full Model: y = β0 + β1x1 + β2x2 + β3x3 + ε
• Residual sum of squares for Full Model: 4248.841
• Reduced Model when H0 is true: y = β0 + β3x3 + ε
• Residual sum of squares for Reduced Model: 7814.391
• Full Model has 2 more parameters than Reduced Model
Inference on subset of coef
Illustration:
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> RSSF1 = sum(regML$residuals^2)
> RSSF1
[1] 4248.841
>
> regMLR2 = lm(satisfaction ~ anxiety, sat)
> RSSR2 = sum(regMLR2$residuals^2)
> RSSR2
[1] 7814.391
>
> FFR2 = ((RSSR2 - RSSF1)/2)/(RSSF1/42)
> FFR2
[1] 17.62282
>
> qf(0.05, df1 = 2, df2 = 42, ncp = 0, lower.tail = FALSE)
[1] 3.219942
Since 17.62282 > 3.219942, H0 is rejected at the 0.05 level
Inference on coefficients
• Full Model has p + 1 parameters (1 intercept and p coefficients, one per covariate)
• Reduced Model is obtained by assuming H0 is true, where H0 says that q coefficients in the Full Model are 0
• Full Model has q more parameters than Reduced Model
Under the Model Assumptions, the test statistic is
$$F = \frac{[\mathrm{SS(Reduced)} - \mathrm{SS(Full)}]/q}{\mathrm{SS(Full)}/(n-(p+1))}$$
Under H0 it has an F distribution with df1 = q and df2 = n − (p + 1)
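The general statistic can be wrapped in a small helper. The function `partial_f()` below is hypothetical (it is not part of base R or of these notes); it only evaluates the formula above and looks up the F distribution with `qf()` and `pf()`. It is applied to the residual sums of squares quoted earlier for the one-coefficient (q = 1) and two-coefficient (q = 2) tests.

```r
# Hypothetical helper: F test from the residual sums of squares of
# a reduced and a full model, where H0 sets q coefficients to zero
partial_f <- function(ss_reduced, ss_full, q, n, p, alpha = 0.05) {
  df2    <- n - (p + 1)
  f_stat <- ((ss_reduced - ss_full) / q) / (ss_full / df2)
  list(
    f_stat  = f_stat,
    f_crit  = qf(1 - alpha, df1 = q, df2 = df2),
    p_value = pf(f_stat, df1 = q, df2 = df2, lower.tail = FALSE)
  )
}

# H0: beta1 = 0 (q = 1), numbers from the notes
one <- partial_f(7106.394, 4248.841, q = 1, n = 46, p = 3)

# H0: beta1 = beta2 = 0 (q = 2), numbers from the notes
two <- partial_f(7814.391, 4248.841, q = 2, n = 46, p = 3)

c(one$f_stat, two$f_stat)   # about 28.25 and 17.62
```

Taking sums of squares as inputs keeps the helper independent of how the two models were fitted; with fitted `lm` objects at hand, `anova()` does the same job directly.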
Inference on coefficients
R command syntax:
# Full Model
regF = lm(y ~ all p covariates, yourdata)
# Reduced Model
regR = lm(y ~ (p-q) covariates, yourdata)
# testing (put Reduced Model ahead of Full Model)
anova(regR, regF)
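The template above can be made concrete with a runnable sketch on simulated data (an assumption; the data frame `dat` and its columns are invented for illustration). Here p = 3 and H0 : β1 = β2 = 0, so q = 2. The last lines check that the F value reported by `anova()` matches the formula for the test statistic.

```r
# Simulated data (hypothetical): x3 has no effect on y
set.seed(4)
n   <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 + 0.4 * dat$x2 + rnorm(n)

# Full Model (p = 3 covariates)
regF <- lm(y ~ x1 + x2 + x3, data = dat)

# Reduced Model under H0: beta1 = beta2 = 0 (q = 2)
regR <- lm(y ~ x3, data = dat)

# Testing (Reduced Model ahead of Full Model)
tab <- anova(regR, regF)
tab

# The F value in the table agrees with the formula
rss_r    <- sum(residuals(regR)^2)
rss_f    <- sum(residuals(regF)^2)
f_manual <- ((rss_r - rss_f) / 2) / (rss_f / (n - 4))
all.equal(tab$F[2], f_manual)
```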
Practice via software
Load data
Importing data:
> sat = read.table("http://math.wsu.edu/faculty/xchen/stat412/data/CH06PR15.txt", header = F)
> colnames(sat) = c("satisfaction","age","severity","anxiety")
> head(sat)
  satisfaction age severity anxiety
1           48  50       51     2.3
2           57  36       46     2.3
3           66  40       48     2.2
4           70  41       44     1.8
5           89  28       43     1.8
6           36  49       54     2.9
Pairwise correlations
> pairs(sat)
[Figure: scatterplot matrix of satisfaction, age, severity and anxiety]
Model
• y = satisfaction, x1 = age, x2 = severity, x3 = anxiety
• Model: y = β0 + β1x1 + β2x2 + β3x3 + ε
Fit linear model
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> regML$coefficients
(Intercept)         age    severity     anxiety
158.4912517  -1.1416118  -0.4420043 -13.4701632
Test global null
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> library(broom)
>
> # get value of F statistic and its p-value
> glance(regML)$statistic
[1] 30.05208
> glance(regML)$p.value
[1] 1.541973e-10
Test one coefficient
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> library(broom)
> tidy(regML)
# A tibble: 4 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  158.       18.1       8.74  5.26e-11
2 age           -1.14      0.215    -5.31  3.81e- 6
3 severity      -0.442     0.492    -0.898 3.74e- 1
4 anxiety      -13.5       7.10     -1.90  6.47e- 2
Test a subset of coef
H0 : β1 = β2 = 0
> # Full Model
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
>
> # Reduced Model
> regMLR2 = lm(satisfaction ~ anxiety, sat)
>
> # testing
> anova(regMLR2,regML)
Analysis of Variance Table

Model 1: satisfaction ~ anxiety
Model 2: satisfaction ~ age + severity + anxiety
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     44 7814.4
2     42 4248.8  2    3565.6 17.623 2.773e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Quick diagnostics
> regML = lm(satisfaction ~ age + severity + anxiety, sat)
> par(mfrow = c(2, 2))
> plot(regML)
[Figure: 2 x 2 panel of diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 11, 17 and 27 are labelled in the first three panels, and 17, 27 and 31 in the last]
License and session Information
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] broom_0.5.1 knitr_1.21

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       rstudioapi_0.8   bindr_0.1.1
 [4] magrittr_1.5     tidyselect_0.2.5 lattice_0.20-35
 [7] R6_2.3.0         rlang_0.3.0.1    fansi_0.4.0
[10] stringr_1.3.1    highr_0.7        dplyr_0.7.8
[13] tools_3.5.0      grid_3.5.0       nlme_3.1-137
[16] xfun_0.4         utf8_1.1.4       cli_1.0.1
[19] htmltools_0.3.6  yaml_2.2.0       digest_0.6.18
[22] assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4
[25] bindrcpp_0.2.2   tidyr_0.8.2      purrr_0.2.5
[28] glue_1.3.0       evaluate_0.12    rmarkdown_1.11
[31] stringi_1.2.4    compiler_3.5.0   pillar_1.3.1
[34] backports_1.1.3  generics_0.0.2   pkgconfig_2.0.2