Experimental Statistics - week 12
Chapter 11:
Linear Regression and Correlation
Chapter 12:
Multiple Regression
2
April 5 -- Lab
3
Analysis of Variance Approach
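Mathematical Fact (p. 649)
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SS(Total) = SS(Regression) + SS(Residuals)
- SS(Total) is $S_{yy}$
- SS(Regression) is the SS “explained” by the model
- SS(Residuals) is the SS “unexplained” by the model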
4
Plot of Production vs Cost
5
SS(???)
6
SS(???)
7
SS(???)
8
$R^2$
$$R^2 = \frac{SS(\text{Regression})}{SS(\text{Total})}$$
$R^2$ measures the proportion of the variability in Y that is explained by the regression on X
9
 Y   X (data set 1)   X (data set 2)
10        12                7
15         8                8
12         8                8
20         7               12
 8        12                4
17         4               12
14        15               11
24        11               15

$\bar{y} = 15$, $s_y = 5.3$
10
The GLM Procedure
Dependent Variable: y

                      Sum of
Source           DF  Squares
Model             1   19.575
Error             6  174.425
Corrected Total   7  194.000

$$R^2 = \frac{19.575}{194.0} = .101$$

The GLM Procedure
Dependent Variable: y

                      Sum of
Source           DF  Squares
Model             1  170.492   = SS(Reg)
Error             6   23.508   = SS(Res)
Corrected Total   7  194.000   = SS(Total)

$$R^2 = \frac{170.492}{194.0} = .879$$
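The two GLM tables can be checked by hand. The sketch below assumes the Y values are paired with the two X columns in the order listed on the data slide; the function name `ss_decomposition` is made up for illustration. It fits the least-squares line and splits SS(Total) into SS(Regression) and SS(Residuals).

```python
# Data from the slide: one response y paired with two different orderings of x.
y = [10, 15, 12, 20, 8, 17, 14, 24]
x_first = [12, 8, 8, 7, 12, 4, 15, 11]
x_second = [7, 8, 8, 12, 4, 12, 11, 15]


def ss_decomposition(x, y):
    """Return (SS_total, SS_regression, SS_residual) for the least-squares line."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx                 # estimated slope
    b0 = y_bar - b1 * x_bar          # estimated intercept
    fitted = [b0 + b1 * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_total, ss_reg, ss_res


for label, x in [("first x", x_first), ("second x", x_second)]:
    ss_total, ss_reg, ss_res = ss_decomposition(x, y)
    r_squared = ss_reg / ss_total
    print(f"{label}: SS(Total)={ss_total:.3f}  SS(Reg)={ss_reg:.3f}  "
          f"SS(Res)={ss_res:.3f}  R^2={r_squared:.3f}")
```

Both fits share the same SS(Total) because Y is unchanged. Only the split between the explained and unexplained parts differs.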
11
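RECALL
Theoretical Model
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where the $\varepsilon$'s (errors) are distributed $N(0, \sigma^2)$ - i.e. all the errors have the same variance
Regression line
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
$\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to minimize
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$$
i.e. $e_i = y_i - \hat{y}_i$ (residuals)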
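For reference, this minimization has a standard closed-form solution. The symbols $S_{xx}$ and $S_{xy}$ below are notation assumed here, patterned after the $S_{yy}$ used earlier in the slides.

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$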
12
Residual Analysis
Examination of residuals to help determine if:
- assumptions are met
- regression model is appropriate
Residual Plot: Plot of x vs residuals
13
14
15
Study Time Data
PROC GLM;
  MODEL score=time;
  OUTPUT out=new r=resid;
RUN;
PROC GPLOT;
  TITLE 'Plot of Residuals';
  PLOT resid*time;
RUN;
16
Average Height of Girls by Age
17
Average Height of Girls by Age
18
Residual Plot
19
Residual Analysis
Examination of residuals to help determine if:
- assumptions are met
- regression model is appropriate
Residual Plot:
- plot of x vs residuals
Normality of Residuals:
- probability plot
- histogram
20
Residuals from Car Dataset fit using √hp
21
Residuals from Car Dataset fit using log(hp)
22
 Y    X
4.3   4
5.5   5
6.8   6
8.0   7
4.0   4
5.2   5
6.6   6
7.5   7
2.0   4
4.0   5
5.7   6
6.5   7
Data – Page 572
Y = weight loss (wtloss)
X = exposure time (exptime)
Weight loss in a chemical compound as a function of how long it is exposed to air
23
PROC REG;
  MODEL wtloss=exptime / r cli clm;
  OUTPUT out=new r=resid;
RUN;
The REG Procedure
Dependent Variable: wtloss

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square   F Value   Pr > F
Model              1   26.00417   26.00417     40.22   <.0001
Error             10    6.46500    0.64650
Corrected Total   11   32.46917

Root MSE          0.80405    R-Square   0.8009
Dependent Mean    5.50833    Adj R-Sq   0.7810
Coeff Var        14.59701

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1     -1.73333    1.16518     -1.49     0.1677
exptime      1      1.31667    0.20761      6.34     <.0001
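The main numbers in this output can be reproduced from the twelve data points. The following is a minimal sketch using the usual simple-linear-regression formulas; the variable names are illustrative and are not taken from SAS.

```python
import math

# Weight-loss data from the slide (Y = wtloss, X = exptime).
wtloss = [4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5]
exptime = [4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7]

n = len(wtloss)
x_bar = sum(exptime) / n
y_bar = sum(wtloss) / n
s_xx = sum((x - x_bar) ** 2 for x in exptime)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(exptime, wtloss))

# Least-squares estimates of the slope and intercept.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Analysis-of-variance quantities.
ss_total = sum((y - y_bar) ** 2 for y in wtloss)
ss_model = slope * s_xy            # SS(Regression)
ss_error = ss_total - ss_model     # SS(Residual)
mse = ss_error / (n - 2)
root_mse = math.sqrt(mse)
f_value = ss_model / mse
r_square = ss_model / ss_total

# Standard error and t statistic for the slope.
se_slope = root_mse / math.sqrt(s_xx)
t_slope = slope / se_slope

print(f"Intercept={intercept:.5f}  exptime={slope:.5f}")
print(f"Model SS={ss_model:.5f}  Error SS={ss_error:.5f}  F={f_value:.2f}")
print(f"Root MSE={root_mse:.5f}  R-Square={r_square:.4f}  t={t_slope:.2f}")
```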
24
Plot of Residuals - MLR Model
The REG Procedure
Dependent Variable: wtloss
Output Statistics

      Dependent  Predicted     Std Error
Obs    Variable      Value  Mean Predict     95% CL Mean       95% CL Predict    Residual
  1      4.3000     3.5333        0.3884   2.6679   4.3987    1.5437   5.5229      0.7667
  2      5.5000     4.8500        0.2543   4.2835   5.4165    2.9710   6.7290      0.6500
  3      6.8000     6.1667        0.2543   5.6001   6.7332    4.2877   8.0456      0.6333
  4      8.0000     7.4833        0.3884   6.6179   8.3487    5.4937   9.4729      0.5167
  5      4.0000     3.5333        0.3884   2.6679   4.3987    1.5437   5.5229      0.4667
  6      5.2000     4.8500        0.2543   4.2835   5.4165    2.9710   6.7290      0.3500
  7      6.6000     6.1667        0.2543   5.6001   6.7332    4.2877   8.0456      0.4333
  8      7.5000     7.4833        0.3884   6.6179   8.3487    5.4937   9.4729      0.0167
  9      2.0000     3.5333        0.3884   2.6679   4.3987    1.5437   5.5229     -1.5333
 10      4.0000     4.8500        0.2543   4.2835   5.4165    2.9710   6.7290     -0.8500
 11      5.7000     6.1667        0.2543   5.6001   6.7332    4.2877   8.0456     -0.4667
 12      6.5000     7.4833        0.3884   6.6179   8.3487    5.4937   9.4729     -0.9833
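The interval columns follow from the fitted line and Root MSE. The sketch below assumes the tabled critical value t = 2.228 (upper 0.025 point, 10 df) rather than computing it, so the limits agree with the SAS output only to about three decimals. The helper name `intervals` is hypothetical.

```python
import math

wtloss = [4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5]
exptime = [4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7]
T_CRIT = 2.228  # assumption: tabled t value for a 95% interval with n - 2 = 10 df

n = len(wtloss)
x_bar = sum(exptime) / n
y_bar = sum(wtloss) / n
s_xx = sum((x - x_bar) ** 2 for x in exptime)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(exptime, wtloss))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(exptime, wtloss))
root_mse = math.sqrt(sse / (n - 2))


def intervals(x0):
    """Predicted value, std error of the mean, 95% CL for mean and for prediction."""
    y_hat = intercept + slope * x0
    h = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_mean = root_mse * math.sqrt(h)        # "Std Error Mean Predict"
    se_pred = root_mse * math.sqrt(1 + h)    # adds the variance of a new error
    cl_mean = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
    cl_pred = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)
    return y_hat, se_mean, cl_mean, cl_pred


for x0 in sorted(set(exptime)):
    y_hat, se_mean, cl_mean, cl_pred = intervals(x0)
    print(f"x={x0}: pred={y_hat:.4f} se={se_mean:.4f} "
          f"mean=({cl_mean[0]:.4f}, {cl_mean[1]:.4f}) "
          f"predict=({cl_pred[0]:.4f}, {cl_pred[1]:.4f})")
```

The prediction interval is wider than the interval for the mean at every x because it also accounts for the variability of a single new observation.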
25
26
The REG Procedure
Dependent Variable: wtloss

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square   F Value   Pr > F
Model              1   26.00417   26.00417     40.22   <.0001
Error             10    6.46500    0.64650
Corrected Total   11   32.46917

Root MSE          0.80405    R-Square   0.8009
Dependent Mean    5.50833    Adj R-Sq   0.7810
Coeff Var        14.59701

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1     -1.73333    1.16518     -1.49     0.1677
exptime      1      1.31667    0.20761      6.34     <.0001
???
For testing H0:
For testing H0:
27
Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square   F Value   Pr > F
Model              1   26.00417   26.00417     40.22   <.0001
Error             10    6.46500    0.64650
Corrected Total   11   32.46917
Recall: SS(Regression) = “Model SS”
SS(Residual) = “Error SS”
28
H0: there is no linear relationship between X and Y
H1: there is a linear relationship between X and Y
Reject H0 if $F > F_{\alpha}(1, n - 2)$
where
$$F = \frac{MS(\text{Regression})}{MS(\text{Residual})} = \frac{MS(\text{Regression})}{MSE}$$
29
H0: there is no linear relationship between weight loss and exposure time
H1: there is a linear relationship between weight loss and exposure time
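As a worked check for this example, the F statistic can be formed from the mean squares in the ANOVA table for the weight loss data, with n = 12 observations:

$$F = \frac{MS(\text{Regression})}{MSE} = \frac{26.00417}{0.64650} = 40.22, \qquad \text{df} = (1,\ n - 2) = (1,\ 10)$$

The reported Pr > F is below .0001, so H0 is rejected at any conventional significance level. For comparison, the tabled 5% critical value $F_{.05}(1, 10)$ is about 4.96.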
30
Note: In simple linear regression
H0: there is no linear relationship between X and Y
H1: there is a linear relationship between X and Y
and
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$
are equivalent and $F = t^2$
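This relationship can be verified with the weight loss output, using the slope estimate, its standard error, and the F value reported by PROC REG:

$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{1.31667}{0.20761} = 6.34, \qquad t^2 = (6.342)^2 \approx 40.22 = F$$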
31
Multiple Regression
Use of more than one independent variable to predict Y -- call these $x_1, x_2, \ldots, x_k$
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$
Assumptions:
- the $\varepsilon$'s (errors) are distributed $N(0, \sigma^2)$ - i.e. all the errors have the same variance
- errors are independent
32
Data
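$$\begin{array}{ccccc}
y_1 & x_{11} & x_{12} & \cdots & x_{1k} \\
y_2 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
y_n & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{array}$$
$x_{ij}$ = ith observation, jth independent variable
and so we have
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i$$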
33
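Goal:
Find “best” prediction equation of the form
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$$
As before:
$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are chosen to minimize
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik})]^2$$
where $e_i = y_i - \hat{y}_i$ is called the ith residual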
34
Again: the solution involves calculus
-- solving the Normal Equations on page 627
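In matrix form the normal equations can be written compactly. The matrix notation here is an assumption for illustration and may differ from the textbook's presentation: $X$ is the $n \times (k+1)$ matrix whose first column is all ones and whose remaining columns hold the $x_{ij}$, and $y$ is the vector of responses.

$$X^{\top} X \, \hat{\beta} = X^{\top} y, \qquad \hat{\beta} = (X^{\top} X)^{-1} X^{\top} y \quad \text{when } X^{\top} X \text{ is invertible}$$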
35
Analysis of Variance
                       Sum of      Mean
Source        DF       Squares     Square                 F Value
Model         k        SS(Reg.)    MS(Reg.)=SS(Reg.)/k    MS(Reg.)/MSE
Error         n-k-1    SSE         MSE=SSE/(n-k-1)
Corr. Total   n-1      SS(Total)
36
Multiple Regression Setting
H0: there is no linear relationship between Y and the independent variables
H1: there is a linear relationship between Y and the independent variables
Reject H0 if $F > F_{\alpha}(k, n - k - 1)$
where
$$F = \frac{MS(\text{Regression})}{MSE}$$
37
$R^2$
$$R^2 = \frac{SS(\text{Regression})}{SS(\text{Total})}$$
$R^2$ measures the proportion of the variability in Y that is explained by the regression
- in MLR Setting has the same interpretation as before
38
 Y    X1   X2
4.3   4    .2
5.5   5    .2
6.8   6    .2
8.0   7    .2
4.0   4    .3
5.2   5    .3
6.6   6    .3
7.5   7    .3
2.0   4    .4
4.0   5    .4
5.7   6    .4
6.5   7    .4
Data – Page 628
Y = weight loss (wtloss)
X1 = exposure time (exptime)
X2 = relative humidity (humidity)
Weight loss in a chemical compound as a function of exposure time and humidity
39
Chemical Weight Loss – MLR Output

The REG Procedure
Dependent Variable: wtloss
Number of Observations Read   12
Number of Observations Used   12

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square   F Value   Pr > F
Model              2   31.12417   15.56208    104.13   <.0001
Error              9    1.34500    0.14944
Corrected Total   11   32.46917

Root MSE          0.38658    R-Square   0.9586
Dependent Mean    5.50833    Adj R-Sq   0.9494
Coeff Var         7.01810

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      0.66667    0.69423      0.96     0.3620
exptime      1      1.31667    0.09981     13.19     <.0001
humidity     1     -8.00000    1.36677     -5.85     0.0002
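The multiple regression estimates can be reproduced by solving the normal equations directly. This is a minimal sketch in plain Python; the helper `solve_linear_system` is a hypothetical name and uses simple Gauss-Jordan elimination, which is not how SAS computes the fit.

```python
wtloss = [4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5]
exptime = [4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7]
humidity = [0.2] * 4 + [0.3] * 4 + [0.4] * 4


def solve_linear_system(a, b):
    """Solve a @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


# Design matrix with a leading column of ones for the intercept.
X = [[1.0, t, h] for t, h in zip(exptime, humidity)]
n, k = len(wtloss), 2

# Normal equations: (X'X) beta = X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k + 1)]
       for i in range(k + 1)]
xty = [sum(row[i] * y for row, y in zip(X, wtloss)) for i in range(k + 1)]
beta = solve_linear_system(xtx, xty)

# Analysis-of-variance quantities for the fitted model.
fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
y_bar = sum(wtloss) / n
ss_total = sum((y - y_bar) ** 2 for y in wtloss)
ss_error = sum((y - f) ** 2 for y, f in zip(wtloss, fitted))
ss_model = ss_total - ss_error
mse = ss_error / (n - k - 1)
f_value = (ss_model / k) / mse
r_square = ss_model / ss_total

print("Intercept, exptime, humidity:", [round(b, 5) for b in beta])
print(f"Model SS={ss_model:.5f}  Error SS={ss_error:.5f}  "
      f"F={f_value:.2f}  R-Square={r_square:.4f}")
```

The exptime coefficient is the same as in the one-variable fit because the design is balanced: every humidity level is observed at the same four exposure times.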
40
H0: there is no linear relationship between weight loss and the variables exposure time and humidity
H1: there is a linear relationship between weight loss and the variables exposure time and humidity
41
Examining Contributions of Individual X variables
Use t-test for the X variable in question.
- this tests the effect of that particular independent variable while all other independent variables stay constant.
Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      0.66667    0.69423      0.96     0.3620
exptime      1      1.31667    0.09981     13.19     <.0001
humidity     1     -8.00000    1.36677     -5.85     0.0002
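Each t value in this table is the estimate divided by its standard error, and is compared with a t distribution on the error degrees of freedom, here $n - k - 1 = 12 - 2 - 1 = 9$. A worked check using the reported values:

$$t_{\text{exptime}} = \frac{1.31667}{0.09981} = 13.19, \qquad t_{\text{humidity}} = \frac{-8.00000}{1.36677} = -5.85, \qquad t_{\text{Intercept}} = \frac{0.66667}{0.69423} = 0.96$$

Both exptime and humidity have small p-values, so each contributes to predicting weight loss when the other is held constant.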