Chapter 11. Linear Regression and Correlation

Page 1: Chapter 11. Linear Regression and Correlation

Chapter 11. Linear Regression and Correlation

Page 2: Chapter 11. Linear Regression and Correlation

Prediction vs. Explanation

• Prediction
– Reference to future values

• Explanation
– Reference to current or past values

• For prediction (or explanation) to make much sense, there must be some connection between the variable we’re predicting (the dependent variable) and the variable we’re using to make the prediction (the independent variable).

Page 3: Chapter 11. Linear Regression and Correlation

Simple Regression (1)

• There is a single independent variable and the equation for predicting a dependent variable y is a linear function of a given independent variable x.

• For example, the prediction equation shown below is a linear equation. The constant term, 2.0, is the intercept term and is interpreted as the predicted value of y when x = 0. The coefficient of x, 3.0, is the slope of the line: the predicted change in y when there is a one-unit change in x.

$$\hat y = 2.0 + 3.0x$$
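As a quick check of the slope interpretation with this illustrative equation, two x values one unit apart give $\hat y(4) = 2.0 + 3.0(4) = 14.0$ and $\hat y(5) = 2.0 + 3.0(5) = 17.0$; the predicted values differ by exactly the slope, 3.0.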

Page 4: Chapter 11. Linear Regression and Correlation

Simple Regression (2)

• The formal assumptions of regression analysis include:

– Linearity: the relation is, in fact, linear, so that the errors all have expected value zero: $E(\varepsilon_i) = 0$ for all i.

– Equal variance: the errors all have the same variance: $\mathrm{Var}(\varepsilon_i) = \sigma_\varepsilon^2$ for all i.

– Independence: the errors are independent of each other.

– Normality: the errors are all normally distributed; $\varepsilon_i$ is normally distributed for all i.

Page 5: Chapter 11. Linear Regression and Correlation

Scatterplot

• To check the assumptions of regression analysis, it is important to look at a scatterplot of the data. This is simply a plot of each (x, y) point, with the independent variable value on the horizontal axis, and the dependent variable value measured on the vertical axis.
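A minimal sketch of such a scatterplot in Python, assuming matplotlib is available; the x and y arrays are illustrative, not data from the chapter:

```python
# Sketch: scatterplot of (x, y) pairs used to eyeball the regression assumptions.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # independent variable (horizontal axis)
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])  # dependent variable (vertical axis)

plt.scatter(x, y)
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.show()
```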

Page 6: Chapter 11. Linear Regression and Correlation

Smoothers (1)

• Smoothers have been developed to sketch a curve through data without necessarily assuming any particular model.

• If such a smoother yields something close to a straight line, then linear regression is reasonable.

• One such method is called LOWESS (locally weighted scatterplot smoother). Roughly, a smoother takes a relatively narrow “slice” of data along the x axis, calculates a line that fits the data in that slice, moves the slice slightly along the x axis, recalculates the line, and so on. Then all the little lines are connected in a smooth curve. The width of the slice is called the bandwidth; this may often be controlled in the computer program that does the smoothing.
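A minimal sketch of LOWESS smoothing, assuming the statsmodels package is available; the data, random seed, and the frac bandwidth value are illustrative choices, not from the chapter:

```python
# Sketch: LOWESS smoothing of a scatterplot to judge whether a straight line is reasonable.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=x.size)   # roughly linear data

# frac is the bandwidth: the fraction of the data used in each local fit.
smoothed = lowess(y, x, frac=0.3)          # returns sorted (x, smoothed y) pairs

# If the smoothed curve looks close to a straight line, linear regression is reasonable.
print(smoothed[:5])
```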

Page 7: Chapter 11. Linear Regression and Correlation

Smoothers (2)

Page 8: Chapter 11. Linear Regression and Correlation

Smoothers (3)

• Another type of scatterplot smoother is the spline fit.

– It can be understood as taking a narrow slice of data, fitting a curve (often a cubic equation) to the slice, moving to the next slice, fitting another curve, and so on.

– The curves are calculated in such a way as to form a connected, continuous curve.

• If the scatterplot does not appear linear, by itself or when fitted with a LOWESS curve, it can often be “straightened out” by a transformation of either the independent variable or the dependent variable.

• Several transformations of the independent variable can be tried to find a more linear scatterplot.

– Three common transformations are square root, natural logarithm, and inverse (one divided by the variable).

– Finding a good transformation often requires trial and error.
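A small sketch of screening the three transformations named above, with illustrative data; using the correlation with y as a rough linearity check is a convenience of this sketch, not a method prescribed by the chapter:

```python
# Sketch: trying the square root, natural log, and inverse transformations of x.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([1.1, 2.0, 2.9, 4.1, 5.0, 6.2])     # y grows roughly with log(x)

candidates = {
    "sqrt(x)": np.sqrt(x),
    "ln(x)":   np.log(x),
    "1/x":     1.0 / x,
}

# The transformation whose correlation with y is closest to +/-1 tends to give
# the most nearly linear scatterplot; trial and error is still needed.
for name, xt in candidates.items():
    r = np.corrcoef(xt, y)[0, 1]
    print(f"{name:8s} correlation with y: {r:.3f}")
```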

Page 9: Chapter 11. Linear Regression and Correlation

Smoothers (4)

Page 10: Chapter 11. Linear Regression and Correlation

Case Study: Comparison of Two Methods for Detecting E. coli

• The researcher wanted to evaluate the agreement of the results obtained using the HEC test with results obtained from an elaborate laboratory-based procedure, hydrophobic grid membrane filtration (HGMF). The HEC test is easier to inoculate, more compact to incubate, and safer to handle than conventional procedures.

Page 11: Chapter 11. Linear Regression and Correlation

Linear Regression --- Estimating Model Parameters

• The intercept β0 and slope β1 in the regression model $y = \beta_0 + \beta_1 x + \varepsilon$ are population quantities.

• We must estimate these values from sample data. The error variance $\sigma_\varepsilon^2$ is another population parameter that must be estimated.

• The first regression problem is to obtain estimates of the slope, intercept, and variance.

• The first step in examining the relation between y and x is to plot the data as a scatterplot.

Page 12: Chapter 11. Linear Regression and Correlation

Least-squares Method (1)

• The regression analysis problem is to find the best straight-line prediction.

• The most common criterion for “best” is based on squared prediction error.

• We find the equation of the prediction line---that is, the slope and intercept that minimize the total squared prediction error.

• The method that accomplishes this goal is called the least-squares method, because it chooses $\hat\beta_0$ and $\hat\beta_1$ to minimize the quantity

$$\sum_i (y_i - \hat y_i)^2 = \sum_i \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$$

Page 13: Chapter 11. Linear Regression and Correlation

Least-squares Method (2)

Page 14: Chapter 11. Linear Regression and Correlation

Least-squares Method (3)

• The least-squares estimates of slope and intercept are obtained as follows:

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

where

$$S_{xy} = \sum_i (x_i - \bar x)(y_i - \bar y), \qquad S_{xx} = \sum_i (x_i - \bar x)^2$$
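A minimal sketch of these formulas in Python with illustrative data (numpy assumed available):

```python
# Sketch: least-squares slope and intercept computed from Sxy and Sxx.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

beta1_hat = Sxy / Sxx                         # slope
beta0_hat = y.mean() - beta1_hat * x.mean()   # intercept

print(beta0_hat, beta1_hat)   # cross-check: np.polyfit(x, y, 1) returns [slope, intercept]
```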

Page 15: Chapter 11. Linear Regression and Correlation

Least-squares Method (4)

• The estimate of the regression slope can potentially be greatly affected by high leverage points. These are points that have very high or very low values of the independent variable --- outliers in the x direction. They carry great weight in the estimate of the slope.

• A high leverage point that also happens to correspond to a y outlier is a high influence point. It will alter the slope and twist the line badly.

• A point has high influence if omitting it from the data will cause the regression line to change substantially.

• A high leverage point indicates only a potential distortion of the equation. Whether or not including the point will “twist” the equation depends on its influence (whether or not the point falls near the line through the remaining points). A point must have both high leverage and an outlying y value to qualify as a high influence point.
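As an illustration of leverage, one standard diagnostic for simple regression (not given explicitly in these slides) is the leverage value $h_i = 1/n + (x_i - \bar x)^2 / S_{xx}$; the data below are made up so that one point is an x-direction outlier:

```python
# Sketch: leverage values; points with unusually large h_i are the high leverage points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])      # the last x is an x-direction outlier
n = x.size
Sxx = np.sum((x - x.mean()) ** 2)
leverage = 1.0 / n + (x - x.mean()) ** 2 / Sxx

print(leverage)   # the last point has by far the largest leverage
```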

Page 16: Chapter 11. Linear Regression and Correlation

Least-squares Method (5)

Page 17: Chapter 11. Linear Regression and Correlation

Least-squares Method (6)

• Most computer programs that perform regression analyses will calculate one or another of several diagnostic measures of leverage and influence.

• Very large values of any of these measures correspond to very high leverage or influence points.

• The distinction between high leverage (x outlier) and high influence (x outlier and y outlier) points is not universally agreed upon yet.

Page 18: Chapter 11. Linear Regression and Correlation

Least-squares Method (7)

• The standard error of the slope is calculated by all statistical packages. Typically, it is shown in output in a column to the right of the coefficient column.

• Like any standard error, it indicates how accurately one can estimate the correct population or process value.

• The quality of estimation of $\hat\beta_1$ is influenced by two quantities: the error variance $\sigma_\varepsilon^2$ and the amount of variation in the independent variable, $S_{xx}$:

$$\sigma_{\hat\beta_1} = \frac{\sigma_\varepsilon}{\sqrt{S_{xx}}}$$

Page 19: Chapter 11. Linear Regression and Correlation

Least-squares Method (8)

• The greater the variability $\sigma_\varepsilon$ of the y values for a given value of x, the larger $\sigma_{\hat\beta_1}$ is. Sensibly, if there is high variability around the regression line, it is difficult to estimate that line.

• Also, the smaller the variation in x values (as measured by $S_{xx}$), the larger $\sigma_{\hat\beta_1}$ is. The slope is the predicted change in y per unit change in x; if x changes very little in the data, so that $S_{xx}$ is small, it is difficult to estimate the rate of change in y accurately.

• The standard error of the estimated intercept $\hat\beta_0$ is influenced by n, naturally, and also by the size of the square of the sample mean, $\bar x^2$, relative to $S_{xx}$:

$$\sigma_{\hat\beta_0} = \sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}$$

Page 20: Chapter 11. Linear Regression and Correlation

Least-squares Method (9)

• The estimate of $\sigma_\varepsilon^2$ based on the sample data is the sum of squared residuals divided by n − 2, the degrees of freedom. The estimated variance is often shown in computer output as MS(Error) or MS(Residual).

• Recall that MS stands for “mean square” and is always a sum of squares divided by the appropriate degrees of freedom:

$$S_\varepsilon^2 = \frac{\sum_i (y_i - \hat y_i)^2}{n - 2} = \frac{SS(\text{residual})}{n - 2}$$
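A short sketch computing MS(Residual) and the residual standard deviation from the formula above (illustrative data):

```python
# Sketch: residual mean square S_eps^2 = SS(residual) / (n - 2).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

residuals = y - (beta0_hat + beta1_hat * x)
ms_error = np.sum(residuals ** 2) / (n - 2)    # MS(Error) / MS(Residual)
s_eps = np.sqrt(ms_error)                      # residual standard deviation

print(ms_error, s_eps)
```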

Page 21: Chapter 11. Linear Regression and Correlation

Least-squares Method (10)

• The square root $S_\varepsilon$ of the sample variance is called the sample standard deviation around the regression line, the standard error of estimate, or the residual standard deviation.

• Because $S_\varepsilon$ estimates $\sigma_\varepsilon$, the standard deviation of the $y_i$ values, it estimates the standard deviation of the population of y values associated with a given value of the independent variable x.

Page 22: Chapter 11. Linear Regression and Correlation

Example 11.4

Page 23: Chapter 11. Linear Regression and Correlation

Answers to Example 11.4 (1)

Page 24: Chapter 11. Linear Regression and Correlation

Answers to Example 11.4 (2)

Page 25: Chapter 11. Linear Regression and Correlation

Answers to Example 11.4 (3)

Page 26: Chapter 11. Linear Regression and Correlation

Inferences about Regression Parameters (1)

• The t distribution can be used to make significance tests and confidence intervals for the true slope and intercept.

• One natural null hypothesis is that the true slope β1 equals 0. If this H0 is true, a change in x yields no predicted change in y, and it follows that x has no value in predicting y.

• The sample slope $\hat\beta_1$ has expected value β1 and standard error

$$\sigma_{\hat\beta_1} = \frac{\sigma_\varepsilon}{\sqrt{S_{xx}}}$$

Page 27: Chapter 11. Linear Regression and Correlation

Inferences about Regression Parameters (2)

• In practice, σε is not known and must be estimated by $S_\varepsilon$, the residual standard deviation. In almost all regression analysis computer outputs, the estimated standard error is shown next to the coefficient. A test of this null hypothesis is given by the t statistic

$$t = \frac{\hat\beta_1 - \beta_1}{\text{estimated standard error}(\hat\beta_1)} = \frac{\hat\beta_1 - \beta_1}{S_\varepsilon / \sqrt{S_{xx}}}$$

• The most common use of this statistic is shown in the following summary:
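A numerical sketch of the t test of H0: β1 = 0 using the formula above; the data are illustrative and scipy is assumed to be available for the t distribution:

```python
# Sketch: two-sided t test of H0: beta1 = 0 in simple linear regression.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
s_eps = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))

se_beta1 = s_eps / np.sqrt(Sxx)                   # estimated standard error of the slope
t_stat = (beta1_hat - 0.0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value, n - 2 df

print(t_stat, p_value)
```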

Page 28: Chapter 11. Linear Regression and Correlation

Inferences about Regression Parameters (3)

• It is also possible to calculate a confidence interval for the true slope. This is an excellent way to communicate the likely degree of inaccuracy in the estimate of that slope. The confidence interval once again is simply the estimate plus or minus a t table value times the standard error.

• The required degrees of freedom for the table value tα/2 is n-2, the error df.

$$\hat\beta_1 - t_{\alpha/2}\,\frac{S_\varepsilon}{\sqrt{S_{xx}}} \;\le\; \beta_1 \;\le\; \hat\beta_1 + t_{\alpha/2}\,\frac{S_\varepsilon}{\sqrt{S_{xx}}}$$
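A matching sketch of the 95% confidence interval for the slope (same illustrative data; scipy assumed):

```python
# Sketch: 95% CI for the slope, beta1_hat +/- t_{alpha/2} * S_eps / sqrt(Sxx).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

t_crit = stats.t.ppf(0.975, df=n - 2)      # alpha = 0.05, error df = n - 2
half_width = t_crit * s_eps / np.sqrt(Sxx)
print(b1 - half_width, b1 + half_width)
```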

Page 29: Chapter 11. Linear Regression and Correlation

Inferences about Regression Parameters (4)

• There is an alternative test, an F test, for the null hypothesis of no predictive value. It was designed to test the null hypothesis that all predictors have no value in predicting y. This test gives the same result as a two-sided t test of H0: β1 = 0 in simple linear regression; to say that all predictors have no value is to say that the (only) slope is 0. The F test is summarized below.

Page 30: Chapter 11. Linear Regression and Correlation

Inferences about Regression Parameters (5)

• The comparable hypothesis testing and confidence interval formulas for the intercept β0 use the estimated standard error of $\hat\beta_0$, given by:

$$\hat\sigma_{\hat\beta_0} = S_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}$$

• In practice, this parameter is of less interest than the slope. In particular, there is often no reason to hypothesize that the true intercept is zero (or any other particular value). Computer packages almost always test the null hypothesis of zero slope, but some don’t bother with a test on the intercept term.

Page 31: Chapter 11. Linear Regression and Correlation

Predicting New y Values Using Regression (1)

• There are two possible interpretations of a y prediction based on a given x. Suppose that the highway director substitutes x = 6 miles in the regression equation $\hat y = 2.0 + 3.0x$ and gets $\hat y = 20$. This can be interpreted as either:

– The average cost E(y) of all resurfacing contracts for 6 miles of road will be $20,000; or

– The cost y of this specific resurfacing contract for 6 miles of road will be $20,000.

• The best-guess prediction in either case is 20, but the plus or minus factor differs. It is easier to predict an average value E(y) than an individual y value, so the plus or minus factor should be less for predicting an average.

Page 32: Chapter 11. Linear Regression and Correlation

Predicting New y Values Using Regression (2)

• In the mean-value forecasting problem, the standard error of $\hat y_{n+1}$ can be shown to be:

$$\text{standard error}(\hat y_{n+1}) = \sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar x)^2}{S_{xx}}}$$

• Here $S_{xx}$ is the sum of squared deviations of the original n values of $x_i$; it can be calculated from most computer outputs as

$$S_{xx} = \left( \frac{S_\varepsilon}{\text{standard error}(\hat\beta_1)} \right)^2$$
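A sketch of the confidence interval for the mean response at a new x value, substituting $S_\varepsilon$ for $\sigma_\varepsilon$ as usual (illustrative data; scipy assumed):

```python
# Sketch: 95% CI for E(y_{n+1}) using SE = S_eps * sqrt(1/n + (x_new - xbar)^2 / Sxx).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x_new = 3.5
y_hat = b0 + b1 * x_new
se_mean = s_eps * np.sqrt(1.0 / n + (x_new - x.mean()) ** 2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)

print(y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)   # 95% CI for the mean response
```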

Page 33: Chapter 11. Linear Regression and Correlation

Predicting New y Values Using Regression (3)

• The forecasting plus or minus term in the confidence interval for $E(y_{n+1})$ depends on the sample size n and the standard deviation around the regression line, as one might expect.

• It also depends on the squared distance of $x_{n+1}$ from $\bar x$ (the mean of the previous $x_i$ values) relative to $S_{xx}$.

• As $x_{n+1}$ gets farther from $\bar x$, the term $(x_{n+1} - \bar x)^2 / S_{xx}$ gets larger.

• When $x_{n+1}$ is far away from the other x values, so that this term is large, the prediction is a considerable extrapolation from the data.

• Small errors in estimating the regression line are magnified by the extrapolation. The term $(x_{n+1} - \bar x)^2 / S_{xx}$ could be called an extrapolation penalty because it increases with the degree of extrapolation.

Page 34: Chapter 11. Linear Regression and Correlation

Predicting New y Values Using Regression (4)

• Extrapolation---predicting the results at independent variable values far from the data---is often tempting and always dangerous. Using it requires an assumption that the relation will continue to be linear, far beyond the data.

• The extrapolation penalty term actually understates the risk of extrapolation. It is based on the assumption of a linear relation, and that assumption gets very shaky for large extrapolations.

• The confidence and prediction intervals also depend heavily on the assumption of constant variance.

Page 35: Chapter 11. Linear Regression and Correlation

Predicting New y Values Using Regression (5)

• Usually, the more relevant forecasting problem is that of predicting an individual $y_{n+1}$ value rather than $E(y_{n+1})$. In most computer packages, the interval for predicting an individual value is called a prediction interval. The same best guess $\hat y_{n+1}$ is used, but the forecasting plus or minus term is larger when predicting $y_{n+1}$ than $E(y_{n+1})$.

• In fact, it can be shown that the plus or minus forecasting error using $\hat y_{n+1}$ to predict $y_{n+1}$ is as follows.

• Prediction interval for $y_{n+1}$:

$$\hat y_{n+1} - t_{\alpha/2}\, S_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar x)^2}{S_{xx}}} \;\le\; y_{n+1} \;\le\; \hat y_{n+1} + t_{\alpha/2}\, S_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar x)^2}{S_{xx}}}$$

where $t_{\alpha/2}$ cuts off area α/2 in the right tail of the t distribution with n − 2 df.
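A sketch of the prediction interval for an individual $y_{n+1}$; note the only change from the mean-response interval is the extra 1 under the square root (illustrative data; scipy assumed):

```python
# Sketch: 95% prediction interval for an individual y at a new x value.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x_new = 3.5
y_hat = b0 + b1 * x_new
se_pred = s_eps * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / Sxx)  # extra "+1"
t_crit = stats.t.ppf(0.975, df=n - 2)

print(y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
```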

Page 36: Chapter 11. Linear Regression and Correlation

Predicting New y Values Using Regression (6)

• The only difference between prediction of a mean $E(y_{n+1})$ and prediction of an individual $y_{n+1}$ is the term +1 in the standard error formula. The presence of this extra term indicates that predictions of individual values are less accurate than predictions of means.

• If n is large and the extrapolation term is small, the +1 term dominates the square root factor in the prediction interval.

Page 37: Chapter 11. Linear Regression and Correlation

Examining Lack of Fit in Linear Regression (1)

• We have been concerned with how well a linear regression model fits, but only from an intuitive perspective.

• We could examine a scatterplot of the data to see whether it looked linear, and we could test whether the slope differed from 0; however, we had no way of testing whether a higher-order model would be a more appropriate model for the relationship between y and x.

• Pictures (or graphs) are always a good starting point for examining lack of fit.

– First, use a scatterplot of y versus x.

– Second, a plot of the residuals $y_i - \hat y_i$ versus the predicted values $\hat y_i$ may give an indication of the following problems:

• Outliers or erroneous observations. In examining the residual plot, your eye will naturally be drawn to data points with unusually high residuals.

• Violation of the assumptions. For the model $y = \beta_0 + \beta_1 x + \varepsilon$, we have assumed a linear relation between y and the independent variable x, and independent, normally distributed errors with a constant variance.
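A small sketch of the residual-versus-predicted plot described above, with made-up data in which the last observation is deliberately suspicious (matplotlib assumed available):

```python
# Sketch: residual-versus-predicted plot for spotting outliers and assumption violations.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8, 25.0])   # the last point is suspicious

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0.0)
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```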

Page 38: Chapter 11. Linear Regression and Correlation

Examining Lack of Fit in Linear Regression (2)

• The residual plot for a model and data set that has none of these apparent problems would look much like the plot below.

• When a higher-order model is more appropriate, the residual plot will look more like the following plot.

Page 39: Chapter 11. Linear Regression and Correlation

Examining Lack of Fit in Linear Regression (3)

• A check of the constant variance assumption can be addressed in the y versus x scatterplot or with a plot of the residuals $(y_i - \hat y_i)$ versus $x_i$.

• Homogeneous error variance across values of x

• The error variances increase with increasing values of x

Page 40: Chapter 11. Linear Regression and Correlation

Examining Lack of Fit in Linear Regression (4)

• When there is more than one observation per level of the independent variable, we can conduct a test for lack of fit of the fitted model by partitioning SS (residuals) into two parts, one pure experimental error and the other lack of fit.

• Let $y_{ij}$ denote the response for the jth observation at the ith level of the independent variable. Then, if there are $n_i$ observations at the ith level of the independent variable, the quantity

$$\sum_j (y_{ij} - \bar y_i)^2$$

provides a measure of what we will call pure experimental error. This sum of squares has $n_i - 1$ degrees of freedom.

• Similarly, for each of the other levels of x, we can compute a sum of squares due to pure experimental error. The pooled sum of squares

$$SSP_{\exp} = \sum_{ij} (y_{ij} - \bar y_i)^2$$

Page 41: Chapter 11. Linear Regression and Correlation

Examining Lack of Fit in Linear Regression (5)

called the sum of squares for pure experimental error, has $\sum_i (n_i - 1)$ degrees of freedom. With $SS_{\text{lack}}$ representing the remaining portion of SS(residuals), we have

$$SS(\text{residuals}) = SSP_{\exp} + SS_{\text{lack}}$$

where $SSP_{\exp}$ is due to pure experimental error and $SS_{\text{lack}}$ is due to lack of fit.

• If SS(residuals) is based on n − 2 degrees of freedom in the linear regression model, then $SS_{\text{lack}}$ will have df = $n - 2 - \sum_i (n_i - 1)$.

• Under the null hypothesis that our model is correct, we can form independent estimates of $\sigma_\varepsilon^2$, the model error variance, by dividing $SSP_{\exp}$ and $SS_{\text{lack}}$ by their respective degrees of freedom; these estimates are called mean squares and are denoted by $MSP_{\exp}$ and $MS_{\text{lack}}$, respectively.

Page 42: Chapter 11. Linear Regression and Correlation

Examining Lack of Fit in Linear Regression (6)

• The test for lack of fit is summarized here.

• Conclusion

– If the F test is significant, this indicates that the linear regression model is inadequate.

– A nonsignificant result indicates that there is insufficient evidence to suggest that the linear regression model is inappropriate.
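A numerical sketch of the lack-of-fit F test with replicate observations at each x level; the data are illustrative, and scipy is assumed for the F distribution:

```python
# Sketch: partition SS(residuals) into pure experimental error and lack of fit, then F test.
import numpy as np
from scipy import stats

x = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([2.1, 2.3, 4.0, 4.4, 5.9, 6.3, 8.2, 7.8])
n = x.size

# Fit the straight line and get SS(residuals) with n - 2 df.
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
ss_resid = np.sum((y - b0 - b1 * x) ** 2)

# Pure experimental error: squared deviations from the mean at each x level.
levels = np.unique(x)
ss_pexp = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
df_pexp = sum((np.sum(x == lv) - 1) for lv in levels)

ss_lack = ss_resid - ss_pexp
df_lack = (n - 2) - df_pexp

F = (ss_lack / df_lack) / (ss_pexp / df_pexp)   # MS(lack) / MSP(exp)
p_value = stats.f.sf(F, df_lack, df_pexp)
print(F, p_value)   # a significant F indicates the straight-line model is inadequate
```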

Page 43: Chapter 11. Linear Regression and Correlation

The Inverse Regression Problem (Calibration) (1)

• In experimental situations, we are often interested in estimating the value of the independent variable corresponding to a measured value of the dependent variable.

• The most commonly used estimate is found by replacing $\hat y$ by y in the least-squares equation $\hat y = \hat\beta_0 + \hat\beta_1 x$ and solving for x:

$$\hat x = \frac{y - \hat\beta_0}{\hat\beta_1}$$

• Two different inverse prediction problems will be discussed here.

– The first is for predicting x corresponding to an observed value of y; the second is for predicting x corresponding to the mean of m > 1 values of y that were obtained independently of the regression data.
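A one-line sketch of the point estimate for the calibration problem; the fitted coefficients and observed y are made-up values:

```python
# Sketch: inverse (calibration) point estimate x_hat = (y - b0) / b1.
b0, b1 = 2.0, 3.0          # assumed fitted intercept and slope
y_observed = 20.0
x_hat = (y_observed - b0) / b1
print(x_hat)               # 6.0
```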

Page 44: Chapter 11. Linear Regression and Correlation

The Inverse Regression Problem (Calibration) (2)

• Predicting x based on an observed y-value

• The greater the strength of the linear relationship between x and y, the larger the quantity (1 − c²), making the width of the prediction interval narrower.

Page 45: Chapter 11. Linear Regression and Correlation

The Inverse Regression Problem (Calibration) (3)

• Predicting x based on m y-values

Page 46: Chapter 11. Linear Regression and Correlation

Analyzing Data from the E. coli Concentrations Case Study (1)

Page 47: Chapter 11. Linear Regression and Correlation

Analyzing Data from the E. coli Concentrations Case Study (2)

Page 48: Chapter 11. Linear Regression and Correlation

Analyzing Data from the E. coli Concentrations Case Study (3)

Page 49: Chapter 11. Linear Regression and Correlation

Analyzing Data from the E. coli Concentrations Case Study (4)

• The width of the 95% prediction intervals was slightly less than one unit for most values of HEC. Thus, an HEC determination in the field of an E. coli concentration in the -1 to 2 range would yield a 95% prediction interval of roughly that width for the corresponding HGMF determination.

• This degree of accuracy would not be acceptable. One way to reduce the width of the intervals would be to conduct an expanded study involving considerably more observations than the 17 obtained in this study.

Page 50: Chapter 11. Linear Regression and Correlation

Correlation (1)

• Correlation coefficient

– This proportionate reduction in error is closely related to the correlation coefficient of x and y. A correlation measures the strength of the linear relation between x and y.

– The stronger the correlation, the better x predicts y.

– Given n pairs of observations $(x_i, y_i)$, we compute the sample correlation r as:

$$r_{yx} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{S_{xx} S_{yy}}}$$

where

$$S_{yy} = \sum_i (y_i - \bar y)^2 = SS(\text{total})$$
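A sketch of the sample correlation computed from the formula above and checked against numpy's built-in routine (illustrative data):

```python
# Sketch: sample correlation r = Sxy / sqrt(Sxx * Syy).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)

r = Sxy / np.sqrt(Sxx * Syy)
print(r, np.corrcoef(x, y)[0, 1])   # the two values should agree
```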

Page 51: Chapter 11. Linear Regression and Correlation

Correlation (2)

• Coefficient of determination

– Correlation and regression predictability are closely related. The proportionate reduction in error for regression that we defined earlier is called the coefficient of determination.

– The coefficient of determination is simply the square of the correlation coefficient:

$$r_{yx}^2 = \frac{SS(\text{total}) - SS(\text{residual})}{SS(\text{total})}$$

which is the proportionate reduction in error.
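A sketch of the coefficient of determination computed directly as the proportionate reduction in error (same illustrative data):

```python
# Sketch: r^2 = (SS(total) - SS(residual)) / SS(total).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - (b0 + b1 * x)) ** 2)

r_squared = (ss_total - ss_resid) / ss_total
print(r_squared)   # equals the square of the sample correlation
```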

Page 52: Chapter 11. Linear Regression and Correlation

Correlation (3)

• Assumptions for correlation inference

– The assumptions of regression analysis --- a linear relation between x and y and constant variance around the regression line, in particular --- are also assumed in correlation inference.

– In regression analysis, we regard the x values as predetermined constants.

– In correlation analysis, we regard the x values as randomly selected (and the regression inferences are conditional on the sampled x values).

– If the xs are not drawn randomly, it is possible that the correlation estimates are biased. In some texts, the additional assumption is made that the x values are drawn from a normal population. The inferences we make do not depend crucially on this normality assumption.

– The most basic inference problem is potential bias in estimation of $\rho_{yx}$.

– The choice of x values can systematically increase or decrease the sample correlation.

– In general, a wide range of x values tends to increase the magnitude of the correlation coefficient and a small range to decrease it.

Page 53: Chapter 11. Linear Regression and Correlation

Correlation (4)

• Correlation coefficients can be affected by systematic choices of x values; the residual standard deviation $S_\varepsilon$ is not affected systematically, although it may change randomly if part of the x range changes.

• Thus, it is a good idea to consider the residual standard deviation and the magnitude of the slope when you decide how well a linear regression line predicts y.

Page 54: Chapter 11. Linear Regression and Correlation

Summary of a Statistical Test for $\rho_{yx}$

Page 55: Chapter 11. Linear Regression and Correlation

Example 11.16 (1)

Page 56: Chapter 11. Linear Regression and Correlation

Example 11.16 (2)

Page 57: Chapter 11. Linear Regression and Correlation

Example 11.16 (3)

Page 58: Chapter 11. Linear Regression and Correlation

Example 11.16 (4)

Page 59: Chapter 11. Linear Regression and Correlation

Example 11.16 (5)

• We would reject the null hypothesis at any reasonable α level, so the correlation is “statistically significant.” However, the regression accounts for only 0.035 of the squared error in the dependent variable, so it is almost worthless as a predictor.

• Remember, the rejection of the null hypothesis in a statistical test is the conclusion that the sample results cannot plausibly have occurred by chance if the null hypothesis is true.

• The test itself does not address the practical significance of the result. Clearly, for a sample size of 40,000, even a trivial sample correlation like 0.035 is not likely to occur by mere luck of the draw. There is no practically meaningful relationship between the dependent and independent variables in this example.