Lectures 9: Simple Linear Regression
Junshu Bao
University of Pittsburgh
Table of contents
Introduction
Data Exploration
Simple Linear Regression
Utility Test
Assess Model Adequacy
Inference about Linear Regression
Model Diagnostics
Models without an Intercept
Introduction
Simple Linear Regression
The simple linear regression model is
yi = β0 + β1xi + εi
where
- yi is the ith observation of the response variable y
- xi is the ith observation of the explanatory variable x
- εi is an error term, with εi ∼ N(0, σ²)
- β0 is the intercept
- β1 is the slope of the linear relationship between y and x

The "simple" here means the model has only a single explanatory variable.
Least-Squares Method
Choose β0 and β1 to minimize the sum of the squares of the vertical deviations:

Q(β0, β1) = Σ_{i=1}^n [yi − (β0 + β1xi)]²

Taking partial derivatives of Q(β0, β1) and solving the resulting system of equations yields the least-squares estimators:

β̂0 = ȳ − β̂1x̄

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Sxy/Sxx
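The closed-form estimators above are easy to check numerically. The sketch below is a Python illustration with hypothetical data (not the lecture's data set), computing β̂0 and β̂1 directly from the formulas.

```python
# Least-squares estimates from the closed-form formulas.
# The data here are hypothetical, chosen only to illustrate the computation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)                       # sum of squares of x
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # cross-product sum

b1 = Sxy / Sxx          # slope estimate:     beta1-hat = Sxy / Sxx
b0 = ybar - b1 * xbar   # intercept estimate: beta0-hat = ybar - beta1-hat * xbar
print(b0, b1)
```

For these points the fitted line is approximately ŷ = 0.11 + 1.97x; any other method of minimizing Q(β0, β1) gives the same estimates.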
Predicted Values and Residuals
- Predicted (estimated) values of y from the fitted model:

  ŷi = β̂0 + β̂1xi

- Residuals:

  ei = yi − ŷi

  - Play an important role in diagnosing a fitted model.
  - Sometimes standardized before use (different statisticians use different options).
Alcohol and Death Rate
A study in Osborne (1979):
- Independent variable: average alcohol consumption in liters per person per year
- Dependent variable: death rate per 100,000 people from cirrhosis or alcoholism
- Data on 15 countries
- Question: Can we predict the cirrhosis death rate from alcohol consumption?
Analysis Plan
- Obtain basic descriptive statistics (mean and standard deviation)
- Draw a scatterplot to visualize the relationship
- Run a simple linear regression
- Perform diagnostic checks
- Interpret the model
Data Structure
Country        Alcohol Consumption   Cirrhosis and Alcoholism
               (l/Person/Year)       (Death Rate/100,000)
France         24.7                  46.1
Italy          15.2                  23.6
West Germany   12.3                  23.7
Australia      10.9                  7.0
Belgium        10.8                  12.3
USA            9.9                   14.2
Canada         8.3                   7.4
...            ...                   ...
Israel         3.1                   5.4
Reading Data
data drinking;
input country $ 1-12 alcohol cirrhosis;
cards;
France 24.7 46.1
Italy 15.2 23.6
W.Germany 12.3 23.7
... ...
Israel 3.1 5.4
;
run;
- Some of the country names are longer than the default length of eight characters for character variables, so column input is used to read them in.
- The values of the two numeric variables can then be read in with list input.
Data Exploration
Descriptive Statistics, Correlation and Scatterplot
proc means data = drinking;
var alcohol cirrhosis;
run;
proc sgplot data=drinking;
scatter x=alcohol y=cirrhosis / datalabel=country;
run;
proc corr data=drinking;
run;
- There is a very strong, positive, linear relationship between death rate and a country's average alcohol consumption.
- Notice that France is an outlier in both the x and y directions. We will address this problem later.
- The correlation between the two variables is 0.9388, with p-value less than 0.0001.
- The datalabel option specifies a variable whose values are to be used as labels on the plot.
Simple Linear Regression
Fit Linear Regression Model
The linear regression model can be fitted using the following SAS code.
proc reg data=drinking;
model cirrhosis=alcohol;
run;
The fitted linear regression equation is

Death rate = −6.00 + 1.98 × Average Alcohol Consumption
Utility Test
Partition of Sum of Squares and ANOVA Table
The analysis-of-variance identity of a linear regression model is

SST = SSR + SSE

where

SST = Σ_{i=1}^n (yi − ȳ)²   (total sum of squares)
SSR = Σ_{i=1}^n (ŷi − ȳ)²   (regression sum of squares)
SSE = Σ_{i=1}^n (yi − ŷi)²  (error sum of squares)

The corresponding ANOVA table is:

Source   DF      SS    MS                F Value        Pr > F
Model    1       SSR   MSR = SSR/1       F0 = MSR/MSE   p-value
Error    n − 2   SSE   MSE = SSE/(n−2)
Total    n − 1   SST

where SS is sum of squares, MS is mean square, and

p-value = P(F(1, n−2) > F0)
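The decomposition and the F statistic can be checked numerically. The Python sketch below uses hypothetical data (not the lecture's data set) and builds each sum of squares from its definition.

```python
# Verify SST = SSR + SSE and form the utility-test F statistic.
# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]                     # fitted values

SST = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares

MSR = SSR / 1        # model degrees of freedom = 1
MSE = SSE / (n - 2)  # error degrees of freedom = n - 2
F0 = MSR / MSE       # utility-test statistic
print(SST, SSR + SSE, F0)
```

The identity holds exactly (up to rounding) because the fitted values come from the least-squares fit.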
Utility Test
For a simple linear regression (SLR) model, if the slope (β1) is zero, there is no linear relationship between x and y (the model is useless).

A utility test of SLR is as follows:

1. H0: β1 = 0 versus Ha: β1 ≠ 0
2. Test statistic: F0 = MSR/MSE
3. p-value = P(F(1, n−2) > F0)
4. Decision and conclusion
The test results for the alcohol consumption example:

- Test statistic: F0 = 96.61
- p-value = P(F(1, 15−2) > 96.61) < 0.0001
- Reject the null.
Assess Model Adequacy
Coefficient of Determination
- The analysis-of-variance identity:

  SST = SSR + SSE

  with SST = Σ_{i=1}^n (yi − ȳ)², SSR = Σ_{i=1}^n (ŷi − ȳ)², and SSE = Σ_{i=1}^n (yi − ŷi)².

- Coefficient of determination, denoted by R²:

  R² = SSR/SST = 1 − SSE/SST

- R² is often used to judge the adequacy of a regression model.
- We often refer loosely to R² as the amount of variability in the data explained or accounted for by the regression model.
- 0 ≤ R² ≤ 1. It can be shown that R² = r², the square of the sample correlation coefficient.
In the alcohol consumption example, R² = 0.8814.
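The identity R² = r² can be verified directly. A Python sketch with hypothetical data (not the lecture's data set):

```python
import math

# Check that R^2 = 1 - SSE/SST equals the squared sample correlation r^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)          # this is SST
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)                   # sample correlation coefficient
b1 = Sxy / Sxx
yhat = [ybar + b1 * (xi - xbar) for xi in x]     # fitted values
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
R2 = 1 - SSE / Syy                               # coefficient of determination
print(R2, r ** 2)                                # the two agree
```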
Inference about Linear Regression
Estimation of Variances
- The variance σ² is estimated by s², given by

  s² = Σ_{i=1}^n (yi − ŷi)² / (n − 2) = SSE/(n − 2) ≡ MSE

- The estimated variance of β̂1 is

  Var(β̂1) = s² / Σ_{i=1}^n (xi − x̄)²

- The estimated variance of a predicted value ŷ* at a given value of x, say x*, is

  Var(ŷ*) = s² (1 + 1/n + (x* − x̄)² / Σ_{i=1}^n (xi − x̄)²)
Estimation of Variances (example)
- The estimator of the variance, s², is the MSE:

  s² = 17.39076  ⟹  s = √17.39076 = 4.17022

  In the SAS output, s is called "Root MSE".

- The standard error of β̂1 is given in the table of parameter estimates:

  se(β̂1) = 0.20123
Inference about the Slope (β1)
Under our model assumptions,

T = (β̂1 − β1) / se(β̂1) ∼ t(n−2)

where se(β̂1) = √(s²/Σ_{i=1}^n (xi − x̄)²) ≡ √(s²/Sxx).

- Suppose we want to test whether β1 equals a certain value β1,0:

  H0: β1 = β1,0 versus Ha: β1 ≠ (< or >) β1,0

- The test statistic under the null is

  t0 = (β̂1 − β1,0) / √(s²/Sxx)

- Note that if the alternative is the "not equal" case and β1,0 = 0, then this is the same as the utility test.
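The t test and the ANOVA F test are linked: for the utility test, t0² = F0. A Python sketch with hypothetical data (not the lecture's data set):

```python
import math

# Slope t statistic, and the relation t0^2 = F0 for the utility test.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = SSE / (n - 2)                       # MSE, the estimate of sigma^2

se_b1 = math.sqrt(s2 / Sxx)              # standard error of the slope
t0 = b1 / se_b1                          # test statistic for H0: beta1 = 0

SSR = b1 ** 2 * Sxx                      # regression sum of squares (SLR identity)
F0 = (SSR / 1) / s2                      # ANOVA F statistic
print(t0, F0)                            # t0**2 equals F0
```

This matches the lecture's numbers in spirit: there, t0 = 9.83 and 9.83² ≈ 96.6 = F0.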
Hypothesis Test about β1
In the standard SAS output, the result of the test that is equivalent to the utility test is automatically included. For example, in the alcohol example:

- Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0
- Test statistic:

  t0 = β̂1 / √(s²/Sxx) = 1.97792/0.20123 = 9.83

- p-value < 0.0001.
- Reject the null.

With this standard output, it is straightforward to test other cases. For example, if the alternative is right-sided, Ha: β1 > 0, then the p-value is P(t(n−2) > t0), which is half of the p-value of the two-sided test.
Model Diagnostics
Check model assumptions:

- Constant variance
- Normality of error terms

The residuals ei = yi − ŷi play an essential role in diagnosing a fitted model.

- Diagnostic plots:
  - Residuals versus fitted values
  - Residuals versus explanatory variables
  - Normal probability plot of the residuals
- Residuals and Studentized residuals
- Cook's distance
- Leverage
Residuals and Studentized Residual
Residuals are used to identify outlying or extreme Y observations.

- Residuals:

  ei = yi − ŷi

- Studentized residuals:

  ri = ei / s(ei) = ei / (s√(1 − hii))

  where hii is the ith element in the diagonal of the hat matrix.

While the residuals ei will have substantially different sampling variances if their standard deviations differ markedly, the studentized residuals ri have constant variance.
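For simple linear regression the hat-matrix diagonal has the closed form hii = 1/n + (xi − x̄)²/Sxx, so studentized residuals need no matrix algebra. A Python sketch with hypothetical data (not the lecture's data set):

```python
import math

# Internally studentized residuals r_i = e_i / (s * sqrt(1 - h_ii)) for SLR,
# using the closed form h_ii = 1/n + (x_i - xbar)^2 / Sxx.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]    # raw residuals
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]     # leverages (hat diagonal)
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))    # root MSE

r = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]
print(r)
```

A useful sanity check: the leverages sum to p + 1 = 2 for a model with one predictor and an intercept.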
Deleted Residuals and Studentized Deleted Residual
If yi is far outlying, the regression line may be influenced to come close to yi, yielding a fitted value ŷi near yi, so that the residual ei will be small and will not disclose that yi is outlying.

A refinement that makes residuals more effective for detecting y outliers is to measure the ith residual when the fitted regression is based on all the cases except the ith one.

- Deleted residuals:

  di = yi − ŷi(i) = ei / (1 − hii)

  where ŷi(i) is the fitted value of yi with the ith case omitted.

- Studentized deleted residuals:

  ti = di / s(di) = ei / (s(i)√(1 − hii))

  where s(i) is s computed with the ith case omitted.
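The shortcut di = ei/(1 − hii) can be confirmed by actually refitting the model without case i. A Python sketch with hypothetical data (not the lecture's data set); the brute-force refit and the formula agree.

```python
# Verify d_i = e_i / (1 - h_ii) by refitting with case i left out.
def fit_slr(xs, ys):
    """Return (intercept, slope) least-squares estimates."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(xs, ys)) \
         / sum((a - xb) ** 2 for a in xs)
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
n = len(x)
b0, b1 = fit_slr(x, y)
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # raw residuals
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]    # leverages

for i in range(n):
    # Fit with the ith case omitted, then predict the held-out point.
    b0i, b1i = fit_slr(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    d_refit = y[i] - (b0i + b1i * x[i])    # deleted residual, by refitting
    d_formula = e[i] / (1 - h[i])          # deleted residual, by the formula
    assert abs(d_refit - d_formula) < 1e-9
print("deleted-residual identity verified")
```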
Cook's Distance
Cook's D statistic is a measure of the influence of an observation on the regression line. The statistic is defined as follows:

Dk = (1 / ((p + 1)s²)) Σ_{i=1}^n [ŷi(k) − ŷi]²

where ŷi(k) is the fitted value of the ith observation when the kth observation is omitted from the model, and p is the number of explanatory variables.

- The values of Dk assess the impact of the kth observation on the estimated regression coefficients.
- A value of Dk greater than 1 implies undue influence on the model coefficients.
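Cook's Dk can be computed by brute force from the definition and compared with the standard algebraic shortcut Dk = ek²hkk / ((p+1)s²(1 − hkk)²). A Python sketch with hypothetical data (not the lecture's data set):

```python
# Cook's distance from the definition (leave-one-out refits) vs. the shortcut.
def fit_slr(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(xs, ys)) \
         / sum((a - xb) ** 2 for a in xs)
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
n, p = len(x), 1                     # one explanatory variable
b0, b1 = fit_slr(x, y)
yhat = [b0 + b1 * xi for xi in x]
s2 = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) / (n - 2)
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]
e = [yi - yh for yi, yh in zip(y, yhat)]

D = []
for k in range(n):
    b0k, b1k = fit_slr(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
    # Sum over ALL n cases of the squared change in fitted value.
    D.append(sum(((b0k + b1k * xi) - yh) ** 2 for xi, yh in zip(x, yhat))
             / ((p + 1) * s2))

shortcut = [ek ** 2 * hk / ((p + 1) * s2 * (1 - hk) ** 2)
            for ek, hk in zip(e, h)]
print(D)
```

The two computations agree exactly, which is why software reports Cook's D without refitting n times.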
Leverage
Leverage is a measure of how unusual the x value of a point is, relative to the x observations as a whole.

- The leverage of observation i is hii, the ith diagonal element of the hat matrix.
- A leverage value hii is usually considered large if it is more than twice the mean leverage value, (p + 1)/n, i.e.,

  hii > 2(p + 1)/n

  where p is the number of predictors in the regression model.
- Another suggested guideline:
  - hii > 0.5 ⟹ high leverage
  - 0.2 ≤ hii ≤ 0.5 ⟹ moderate leverage
  - hii < 0.2 ⟹ low leverage
Diagnostic Plots

ODS graphics are used to generate diagnostic plots:
ods graphics on;
proc reg data=drinking;
model cirrhosis=alcohol;
run;
ods graphics off;
- "Raw" residuals against predicted values
- Standardized/Studentized residuals against predicted values
- Studentized residuals against leverage
- Cook's D against observation number
- Histogram of residuals

In both residual plots, there is a large negative residual. In the residuals-vs.-leverage plot, there is a point with large leverage. The first observation has a very large Cook's D.
Identifying Outlying Observations and Influential Points

To identify the observations with large residuals, high leverage, or large Cook's distance, we need to save these quantities.
proc reg data=drinking;
model cirrhosis=alcohol;
output out = drinkingfit
predicted = yhat
student = student /*Studentized residuals*/
rstudent = rstudent /*Studentized deleted residuals*/
h = leverage
cookd = cooksd;
run;
proc print data = drinkingfit;
where abs(rstudent) > 2 or leverage >.3 or cooksd > 1;
run;
- Influential point: France (Cook's D = 1.54)
- X-outlier and influential point: France (leverage = 0.64)
- Y-outlier: Australia (Studentized deleted residual = −2.55)
Problem of Outlying Observations
We have found that France is an X-outlier and an influential point. Outlying observations may distort the value of an estimated regression coefficient. Let us simply drop France and refit the model:
ods graphics on;
proc reg data=drinking;
model cirrhosis=alcohol;
where country ne 'France';
run;
ods graphics off;
- The residuals look more normal.
- No influential points.
- Two marginally large residuals.
Models without an Intercept
In some applications of simple linear regression, a model without an intercept is required. The model is of the following form:

yi = βxi + εi

In this case, application of the least-squares method gives the following estimator of β:

β̂ = Σ_{i=1}^n xiyi / Σ_{i=1}^n xi²
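The no-intercept estimator is easy to compute by hand. A Python sketch with hypothetical data lying near a line through the origin:

```python
# Least-squares slope for the no-intercept model y_i = beta * x_i + eps_i.
# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.1, 8.0]

beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
print(beta_hat)
```

Note that this fit is forced through the origin: unlike the model with an intercept, the residuals need not sum to zero.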
Example: Estimating the Age of the Universe
The dominant motion in the universe is the smooth expansion known as Hubble's Law:

V = H0 D

where

- V is recessional velocity, the observed velocity of the galaxy away from us, usually in km/sec
- H0 is Hubble's "constant", in km/sec/Mpc
- D is the distance to the galaxy in Mpc

Based on this law, we can easily estimate the age of the universe:

age = 1/H0
Example: Estimating the Age of the Universe (cont.)
Wood (2006) gives the relative velocity and the distances of 24 galaxies, according to measurements made using the Hubble Space Telescope.

Observation   Galaxy     Velocity (km/sec)   Distance (mega-parsec)
1             NGC0300    133                 2.00
2             NGC0925    664                 9.16
3             NGC1326A   1794                16.14
...           ...        ...                 ...
24            NGC7331    999                 14.72

where "mega-parsec" is abbreviated Mpc and

1 Mpc = 3.09 × 10^19 km
Scatterplot
After reading in the data file, we can create a scatterplot to visualize the relationship between velocity and distance.
proc sgplot data=universe;
scatter y=velocity x=Distance;
yaxis label='velocity (kms)';
xaxis label='Distance (mega-parsec)';
run;
- The yaxis and xaxis statements are used to add the units of measurement to the axis labels.
- The diagram shows a clear, strong relationship between velocity and distance.
- The scatterplot also shows that the intercept is probably zero.
Model Diagnostics
There are two large Studentized deleted residuals (absolute value > 2) and two large leverages (> 2/24 ≈ 0.08). We can identify these observations by saving these values.
proc reg data=universe;
model velocity= distance / noint;
output out=regout
predicted=pred rstudent=rstudent h=leverage;
run; quit;
proc print data=regout;
where abs(rstudent)>2;
run;
proc print data=regout;
where leverage>.08;
run;
Age of the Universe
Now we can use the estimated value of β to find an approximate value for the age of the universe:

Age = (1 / (β̂ km/sec/Mpc)) × (3.09 × 10^19 km / 1 Mpc) × (1 year / 3.15 × 10^7 sec)

It is approximately 12.8 billion years.
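The unit conversion can be sketched in Python. The slope value below (≈ 76.6 km/sec/Mpc) is an assumed figure consistent with the quoted 12.8-billion-year answer, not a value taken from the SAS output.

```python
# Convert 1/beta-hat from (km/sec/Mpc)^-1 into years.
beta_hat = 76.6          # assumed slope in km/sec/Mpc (hypothetical value)

KM_PER_MPC = 3.09e19     # kilometers in one mega-parsec
SEC_PER_YEAR = 3.15e7    # seconds in one year

age_sec = KM_PER_MPC / beta_hat      # age in seconds: (1/H0) * km per Mpc
age_years = age_sec / SEC_PER_YEAR
print(age_years / 1e9)               # roughly 12.8 (billion years)
```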