
Chap. 7, page 1

Chapter 7: The Simple Linear Regression Model

A common model for the relationship between two quantitative variables is the linear regression model. Don’t be fooled by the “linear” part: as we’ll see, linear regression models can often be used to model relationships which aren’t linear. Although we looked at the linear regression model last semester, we only looked at one part of it – the part that models the mean response Y as a linear function of X. We’ll extend the model to describe the scatter of the individual data points around the line. The extension makes the linear regression model exactly like the ANOVA model, except that the explanatory variable is quantitative instead of categorical.

We assume that at each X, the distribution of Y values is normal with mean β0 + β1X and standard deviation σ:

    μ{Y|X} = β0 + β1X,    σ²{Y|X} = σ²

Data: (X1, Y1), (X2, Y2), …, (Xn, Yn). The Yi’s are assumed to be independent. Least squares estimates of β0 and β1 are denoted by β̂0 and β̂1. The predicted or fitted value of Y for a particular X is

    μ̂{Y|X} = β̂0 + β̂1X.

This is also denoted Ŷ in many books. The fitted values for the data points are

    fit_i = Ŷi = β̂0 + β̂1Xi

and the residuals are

    res_i = Yi − fit_i = Yi − Ŷi.

The residuals are sometimes denoted e_i in other texts. By modeling the distribution of data points around the line, we can make inferences from the sample data about the regression parameters.


Chap. 7, page 2

Case Study 7.2: Meat Processing and pH

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   3.00647           1   3.00647       444.306   .000(a)
Residual      .05413           8    .00677
Total        3.06060           9

a. Predictors: (Constant), Log(hours)
b. Dependent Variable: pH

Coefficients(a)

             Unstandardized Coefficients   Standardized
Model 1      B        Std. Error           Beta     t         Sig.
(Constant)   6.9836   .0485                         143.897   .000
Log(hours)   -.7257   .0344                -.991    -21.079   .000

a. Dependent Variable: pH

Hours   pH     Log(hours)   fit      res
1       7.02   0            6.9836    0.0364
1       6.93   0            6.9836   -0.0536
2       6.42   0.69         6.4806   -0.0606
2       6.51   0.69         6.4806    0.0294
4       6.07   1.39         5.9777    0.0923
4       5.99   1.39         5.9777    0.0123
6       5.59   1.79         5.6834   -0.0934
6       5.80   1.79         5.6834    0.1166
8       5.51   2.08         5.4747    0.0353
8       5.36   2.08         5.4747   -0.1147
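As a check, the coefficients and the fit/res columns above can be reproduced with a minimal least-squares sketch (plain Python; the data are transcribed from the table):

```python
import math

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
pH    = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]

x = [math.log(h) for h in hours]   # natural log, as in the case study
n = len(x)
xbar = sum(x) / n
ybar = sum(pH) / n

# Least squares: slope = Sxy / Sxx, intercept = ybar - slope * xbar
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, pH))
b1 = Sxy / Sxx                     # slope, approximately -0.7257
b0 = ybar - b1 * xbar              # intercept, approximately 6.9836

fits = [b0 + b1 * xi for xi in x]                 # fitted values
res  = [yi - fi for yi, fi in zip(pH, fits)]      # residuals
```

The slope, intercept, and residuals agree with the SPSS output to the printed precision.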


Chap. 7, page 3

Another (equivalent) way to write the linear regression model is

    Yi = β0 + β1Xi + εi

where the εi’s are independent N(0, σ) random variables.

Formulas for least squares estimators:

    β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²    (sums over i = 1, …, n)
    β̂0 = Ȳ − β̂1X̄

Mean of residuals is 0 (always true for least squares).

Estimate of σ:

    σ̂ = √( sum of squared residuals / degrees of freedom ) = √( Σ res_i² / (n − 2) )

Degrees of freedom = n − (# parameters in the model for the means) = n − 2 for simple linear regression. The ANOVA table gives the sum of squared residuals and the mean square residual, which is σ̂² = 0.00677, so σ̂ = 0.0823.

The standard errors of β̂0 and β̂1 represent the estimated standard deviations of the sampling distributions of β̂0 and β̂1. The sampling distributions refer to how the least squares estimates would vary from sample to sample. We view the Xi’s as fixed; they are assumed to remain the same from sample to sample while the Yi’s are random.

    SE(β̂1) = σ̂ √( 1 / ((n−1)s_X²) )
    SE(β̂0) = σ̂ √( 1/n + X̄² / ((n−1)s_X²) )

Confidence intervals for slope and intercept are: Estimate ± t_df(1 − α/2) × SE(Estimate).
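These quantities can be verified numerically. The sketch below refits the steer data and computes σ̂ and the two standard errors from the formulas above (plain Python):

```python
import math

# Steer data from the case study; x = natural log of hours
hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
pH    = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)
xbar = sum(x) / n
ybar = sum(pH) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)          # (n-1) * s_X^2
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, pH)) / Sxx
b0 = ybar - b1 * xbar

# sigma-hat: sqrt(sum of squared residuals / (n - 2))
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, pH))
sigma_hat = math.sqrt(rss / (n - 2))             # approx 0.0823

se_b1 = sigma_hat * math.sqrt(1 / Sxx)                   # approx 0.0344
se_b0 = sigma_hat * math.sqrt(1 / n + xbar ** 2 / Sxx)   # approx 0.0485
```

The values match σ̂ = 0.0823 and the standard errors in the SPSS coefficients table.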


Chap. 7, page 4

Example: Steer carcass data. Predicted pH = 6.9836 − .7257 Log(Hours), where Log is the natural logarithm.

Inferences for slope: Mean pH is estimated to decrease by .7257 for every one-unit increase in Log(Hours). A one-unit increase in Log(Hours) is an increase in Hours by a factor of e ≈ 2.72. If we had used Log10(Hours) instead, the interpretation would be easier: the slope would represent the change in mean pH for every 10-fold increase in time since slaughter. A 95% confidence interval for β1 is

    −.7257 ± t8(.975)(.0344) = −.7257 ± 2.306(.0344) = −.7257 ± .0793, i.e., −.805 to −.646.

So we are 95% confident that the decrease in mean pH is between .646 and .805 for every 2.72-fold increase in time since slaughter. The confidence interval can also be obtained from SPSS by choosing Statistics in the Analyze…Regression…Linear window.
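The same interval arithmetic as a sketch, using the values from the SPSS output and the t quantile quoted in the text:

```python
# 95% CI for the slope: estimate and SE from the SPSS table,
# t_8(.975) = 2.306 taken from the text (df = n - 2 = 8)
b1, se_b1 = -0.7257, 0.0344
t = 2.306

half = t * se_b1               # approx 0.0793
lo, hi = b1 - half, b1 + half  # approx (-0.805, -0.646)
```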

Coefficients(a)

             B       Std. Error   Beta    t         Sig.   95% Confidence Interval for B
(Constant)   6.984   .049                 143.897   .000   6.872 to 7.096
Log(hours)   -.726   .034         -.991   -21.079   .000   -.805 to -.646

a. Dependent Variable: pH

Inferences for intercept: The intercept β0 represents the mean value of Y when X = 0. Usually, this is not particularly meaningful. It is usually more meaningful to estimate the mean value of Y at particular values of X which are meaningful and interesting, which is covered next.

Inferences for the mean response at a particular value of X: Inferences about the slope of the regression line tell us how big the change is in the mean response (Y) for a 1-unit increase in X. Sometimes we are interested in a confidence interval for the mean response at a particular X, say X0. According to the model, the true mean of Y at X0 is μ{Y|X0} = β0 + β1X0. The estimate of this is μ̂{Y|X0} = β̂0 + β̂1X0. The standard error of μ̂{Y|X0} is

    SE[μ̂{Y|X0}] = σ̂ √( 1/n + (X0 − X̄)² / ((n−1)s_X²) )

Note that the standard error is bigger for values of X0 farther from X̄ and is smallest at X̄.


Chap. 7, page 5

Steer data: What is the estimated mean pH for carcasses 3 hours old? Give a 95% confidence interval for the mean pH after 3 hours.

First, remember that the X variable in the regression model is log(Hours), so X0 = log(3) = 1.0986 (natural logarithm). Therefore, μ̂{Y|X0 = 1.0986} = 6.9836 − .7257(1.0986) = 6.186. To calculate the standard error, we need to compute X̄, the mean of log(Hours) for the 10 data points, and s_X², the sample variance of log(Hours). From SPSS:

Descriptive Statistics

                     N    Mean      Std. Deviation   Variance
LogTime              10   1.19013   .796480          .63438
Valid N (listwise)   10

Hence X̄ = 1.1901 and (n−1)s_X² = 9(.63438) = 5.709. Therefore

    SE[μ̂{Y|1.0986}] = 0.0823 √( 1/10 + (1.0986 − 1.1901)² / 5.709 ) = 0.0262

and a 95% confidence interval for the mean pH among all steers after 3 hours is

    6.186 ± t8(.975)(.0262) = 6.186 ± 2.306(.0262) = 6.186 ± .0604 ≈ 6.13 to 6.25

Simultaneous confidence intervals for the mean response at several values of X: If we want simultaneous confidence intervals at several different values of X, we can use Bonferroni if the number of values is small. We can compute simultaneous confidence intervals at every possible value of X using a Scheffé procedure. The result is a set of confidence bands for the regression line. We are 95% confident (or whatever the chosen confidence level) that the regression line lies entirely within the bands. Thus, we are 95% confident that the true means at all possible values of X are all within the confidence band limits. The formula for the simultaneous confidence bands is

    β̂0 + β̂1X ± √( 2 F_{2,n−2}(1−α) ) × SE[μ̂{Y|X}]

This is referred to as the Working-Hotelling procedure. In practice, you compute these limits at a large number of X values, then join the limits to make a smooth curve on the scatterplot. Some programs will do this automatically, but SPSS will not. It will, however, plot the individual confidence intervals for all X’s using the t coefficient rather than the Scheffé coefficient.

Steer data: For simultaneous 95% confidence intervals, F_{2,n−2}(1−α) = F_{2,8}(.95) = 4.46. The confidence interval for the mean pH after 3 hours is therefore (see above):

    6.186 ± √(2(4.46))(.0262) = 6.186 ± 2.987(.0262) = 6.186 ± .0782 = 6.11 to 6.26

We could compute confidence intervals for any number of values of X.


Chap. 7, page 6

Prediction interval for a future response: The confidence interval above is for the mean pH of all steers 3 hours after slaughter. A 95% prediction interval for the pH of an individual steer 3 hours after slaughter is an interval in which you are 95% confident that the pH of a particular steer will lie 3 hours after slaughter. A confidence interval is for a mean; a prediction interval is for an individual. The predicted value for a future response at X = X0 is

    Pred{Y|X0} = μ̂{Y|X0} = β̂0 + β̂1X0

The standard error of prediction is

    SE[Pred{Y|X0}] = √( σ̂² + SE[μ̂{Y|X0}]² ) = σ̂ √( 1 + 1/n + (X0 − X̄)² / ((n−1)s_X²) )

The standard error of prediction has two parts: the uncertainty due to estimating the mean response at X0 and the uncertainty due to the fact that individual observations vary around that mean with standard deviation σ. Note that while the standard error of the mean response at X0 goes to 0 as n increases, the standard error of prediction never goes to 0. An individual 100(1−α)% prediction interval for the response of an individual at X0 is

    β̂0 + β̂1X0 ± t_{n−2}(1 − α/2) × SE[Pred{Y|X0}]

For the steer data, a 95% prediction interval for the pH of a particular steer 3 hours after slaughter is:

    6.186 ± 2.306(.0823)√( 1 + 1/10 + (1.0986 − 1.1901)² / 5.709 ) = 6.186 ± 2.306(.08637) = 6.186 ± .1992 = 5.99 to 6.39

Simultaneous prediction intervals can be computed for several different X values using Bonferroni, but there is no analog to the Working-Hotelling Scheffé-based procedure for simultaneous prediction intervals at all possible values of X.
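The prediction-interval calculation as a sketch, using the same summary values from the text:

```python
import math

# Summary values from the text
b0, b1 = 6.9836, -0.7257
sigma_hat, n = 0.0823, 10
xbar, Sxx = 1.1901, 5.709      # Sxx = (n-1) * s_X^2
t = 2.306                      # t_8(.975), from the text

x0 = math.log(3)
fit = b0 + b1 * x0                                          # approx 6.186
# SE of prediction adds sigma^2 for the individual's scatter around the mean
se_pred = sigma_hat * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)  # approx 0.0864
pi = (fit - t * se_pred, fit + t * se_pred)                 # approx (5.99, 6.39)
```

Note that se_pred is much larger than the 0.0262 standard error of the mean response at the same X0, which is why the prediction interval is so much wider than the confidence interval.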


Chap. 7, page 7

SPSS commands

Analyze…Regression…Linear. Under the Statistics button, you can choose to get confidence intervals for β0 and β1. Under the Save button:

• Unstandardized Predicted Values
• Unstandardized Residuals
• Prediction Intervals: Mean — this isn’t a prediction interval; it’s an individual confidence interval for the mean response at each X. SPSS does not compute the Working-Hotelling simultaneous confidence intervals.
• Prediction Intervals: Individual — this is a prediction interval for an individual response at each X.

To obtain predicted values, confidence intervals, and prediction intervals for a value of X not in the data set, add a case to the data with the desired X value but leave the value of Y blank (it should display a period, which indicates a missing value). SPSS can plot the individual confidence intervals for the mean response and the prediction intervals for an individual response. Create a scatterplot and double-click the plot to get into Chart Editor. Select one of the data points and click the “Add fit line” icon. Under the “Fit line” tab you can select “Mean” or “Individual” confidence intervals. The first gives individual (not simultaneous) confidence intervals for the mean response at each X and the second gives prediction intervals.


Chap. 7, page 8

95% individual confidence intervals for the mean, 95% Working-Hotelling simultaneous confidence bands for the mean, and 95% individual prediction intervals for a single response (this graph is from S-Plus; SPSS will only do the first and last of the three).

[Figure: scatterplot of the steer data with the fitted line and 0.95 bands; x (log hours) runs from 0.0 to 2.0, y (pH) from 5.5 to 7.0.]