Page 1: Lecture 9: Simple Linear Regression

Junshu Bao

University of Pittsburgh

Page 2: Lecture 9: Simple Linear Regression

Table of contents

Introduction

Data Exploration

Simple Linear Regression

Utility Test

Assess Model Adequacy

Inference about Linear Regression

Model Diagnostics

Models without an Intercept

Page 3: Lecture 9: Simple Linear Regression

Introduction

Simple Linear Regression

The simple linear regression model is

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where

• $y_i$ is the $i$th observation of the response variable $y$

• $x_i$ is the $i$th observation of the explanatory variable $x$

• $\varepsilon_i$ is an error term, with $\varepsilon_i \sim N(0, \sigma^2)$

• $\beta_0$ is the intercept

• $\beta_1$ is the slope of the linear relationship between $y$ and $x$

The "simple" here means the model has only a single explanatory variable.
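To make the model concrete, the sketch below simulates one such data set in SAS; the parameter values, seed, and data set name sim are illustrative assumptions only:

data sim;
    call streaminit(2023);                   /* fix the random seed             */
    do i = 1 to 15;
        x = 25 * rand('uniform');            /* explanatory variable            */
        y = -6 + 2*x
            + rand('normal', 0, 4);          /* beta0 + beta1*x + N(0, 4^2) error */
        output;
    end;
    keep x y;
run;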

Page 4: Lecture 9: Simple Linear Regression

Introduction

Least-Squares Method

Choose $\beta_0$ and $\beta_1$ to minimize the sum of the squares of the vertical deviations:

$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2.$$

Setting the partial derivatives of $Q(\beta_0, \beta_1)$ to zero and solving the resulting system of equations yields the least-squares estimators:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$
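As a numerical check on these formulas, the slope can be computed directly from $S_{xy}$ and $S_{xx}$. A minimal SAS sketch, assuming the drinking data set read in later in this lecture:

proc sql;
    /* one-row table holding the sample means */
    create table xybar as
    select mean(alcohol) as xbar, mean(cirrhosis) as ybar
    from drinking;

    /* slope = Sxy/Sxx; the intercept is then ybar - slope*xbar */
    select sum((d.alcohol - m.xbar) * (d.cirrhosis - m.ybar))
           / sum((d.alcohol - m.xbar)**2) as slope
    from drinking as d, xybar as m;
quit;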

Page 5: Lecture 9: Simple Linear Regression

Introduction

Predicted Values and Residuals

• Predicted (estimated) values of $y$ from a fitted model:

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$$

• Residuals:

$$e_i = y_i - \hat{y}_i$$

  • Residuals play an important role in diagnosing a fitted model.
  • They are sometimes standardized before use (different statisticians favor different options).

Page 6: Lecture 9: Simple Linear Regression

Introduction

Alcohol and Death Rate

A study in Osborne (1979):

• Independent variable: average alcohol consumption in liters per person per year

• Dependent variable: death rate per 100,000 people from cirrhosis or alcoholism

• Data on 15 countries ($n = 15$ observations)

• Question: can we predict the cirrhosis death rate from alcohol consumption?

Page 7: Lecture 9: Simple Linear Regression

Introduction

Analysis Plan

• Obtain basic descriptive statistics (means and standard deviations)

• Draw a scatterplot to visualize the relationship

• Run the simple linear regression

• Perform diagnostic checks

• Interpret the model

Page 8: Lecture 9: Simple Linear Regression

Introduction

Data Structure

Country        Alcohol Consumption   Cirrhosis and Alcoholism
               (l/Person/Year)       (Death Rate/100,000)

France         24.7                  46.1
Italy          15.2                  23.6
West Germany   12.3                  23.7
Australia      10.9                   7.0
Belgium        10.8                  12.3
USA             9.9                  14.2
Canada          8.3                   7.4
...             ...                   ...
Israel          3.1                   5.4

Page 9: Lecture 9: Simple Linear Regression

Introduction

Reading Data

data drinking;

input country $ 1-12 alcohol cirrhosis;

cards;

France 24.7 46.1

Italy 15.2 23.6

W.Germany 12.3 23.7

... ...

Israel 3.1 5.4

;

run;

• Some of the country names are longer than the default length of eight for character variables, so column input is used to read them in.

• The values of the two numeric variables can then be read in with list input.
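An equivalent alternative, sketched below, is modified list input with a colon informat; the informat width :$12. is an assumption sized to the longest country name:

data drinking;
    /* the colon modifier reads up to the next blank,    */
    /* allowing character values longer than eight bytes */
    input country :$12. alcohol cirrhosis;
    cards;
France 24.7 46.1
Italy 15.2 23.6
... ...
Israel 3.1 5.4
;
run;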

Page 10: Lecture 9: Simple Linear Regression

Data Exploration

Descriptive Statistics, Correlation and Scatterplot

proc means data = drinking;

var alcohol cirrhosis;

run;

proc sgplot data=drinking;

scatter x=alcohol y=cirrhosis / datalabel=country;

run;

proc corr data=drinking;

run;

• There is a very strong, positive, linear relationship between death rate and a country's average alcohol consumption.

• Notice that France is an outlier in both the x and y directions. We will address this problem later.

• The correlation between the two variables is 0.9388, with a p-value less than 0.0001.

• The datalabel option specifies a variable whose values are used as labels on the plot.
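When ODS graphics are enabled, PROC CORR can produce the scatterplot itself through its plots= option; a sketch (the exact output layout depends on the SAS version):

ods graphics on;
proc corr data=drinking plots=scatter;
    var alcohol cirrhosis;
run;
ods graphics off;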

Page 11: Lecture 9: Simple Linear Regression

Simple Linear Regression

Fit Linear Regression Model

The linear regression model can be fitted using the following SAS code.

proc reg data=drinking;

model cirrhosis=alcohol;

run;

The fitted linear regression equation is

$$\widehat{\text{Death Rate}} = -6.00 + 1.98 \times \text{Average Alcohol Consumption}$$
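For example, under the fitted model, a country with an average alcohol consumption of 10 liters per person per year (a hypothetical value chosen for illustration) has a predicted death rate of $-6.00 + 1.98 \times 10 = 13.8$ per 100,000.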

Page 12: Lecture 9: Simple Linear Regression

Utility Test

Partition of Sum of Squares and ANOVA Table

The analysis-of-variance identity for a linear regression model is

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{SSE}$$

The corresponding ANOVA table is:

Source   DF    SS    MS                F Value        Pr > F
Model    1     SSR   MSR = SSR/1       F0 = MSR/MSE   p-value
Error    n-2   SSE   MSE = SSE/(n-2)
Total    n-1   SST

where SS is the sum of squares, MS is the mean square, and

$$p\text{-value} = P(F_{1,n-2} > F_0)$$

Page 13: Lecture 9: Simple Linear Regression

Utility Test

Utility Test

For a simple linear regression (SLR) model, if the slope ($\beta_1$) is zero, there is no linear relationship between $x$ and $y$ (the model is useless).

A utility test of SLR is as follows:

1. $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$

2. Test statistic: $F_0 = MSR/MSE$

3. $p$-value $= P(F_{1,n-2} > F_0)$

4. Decision and conclusion

The test results for the alcohol consumption example:

• Test statistic: $F_0 = 96.61$

• $p$-value $= P(F_{1,15-2} > 96.61) < 0.0001$

• Reject the null.
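The tail probability can be verified directly from the $F_{1,13}$ distribution. A minimal SAS sketch using the CDF function (the data set name is arbitrary):

data utility_p;
    f0 = 96.61;                    /* observed F statistic          */
    p  = 1 - cdf('F', f0, 1, 13);  /* P(F(1, n-2) > F0) with n = 15 */
    put p=;                        /* prints the p-value to the log */
run;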

Page 14: Lecture 9: Simple Linear Regression

Assess Model Adequacy

Coefficient of Determination

• The analysis-of-variance identity:

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{SSE}$$

• Coefficient of determination, denoted by $R^2$:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• $R^2$ is often used to judge the adequacy of a regression model.

• We often refer loosely to $R^2$ as the amount of variability in the data explained or accounted for by the regression model.

• $0 \le R^2 \le 1$. It can be shown that $R^2 = r^2$, the square of the sample correlation coefficient.

In the alcohol consumption example, $R^2 = 0.8814$, which indeed equals $0.9388^2$, the square of the correlation reported earlier.

Page 15: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Estimation of Variances

• The variance $\sigma^2$ is estimated by $s^2$, given by

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2} \equiv MSE$$

• The estimated variance of $\hat\beta_1$ is

$$\widehat{\mathrm{Var}}(\hat\beta_1) = \frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

• The estimated variance of a predicted value $\hat{y}^*$ at a given value of $x$, say $x^*$, is

$$\widehat{\mathrm{Var}}(\hat{y}^*) = s^2 \left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

Page 16: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Estimation of Variances (example)

• The estimate of the variance, $s^2$, is MSE:

$$s^2 = 17.39076 \implies s = \sqrt{17.39076} = 4.17022$$

In the SAS output, $s$ is called "Root MSE".

• The standard error of $\hat\beta_1$ is given in the table of parameter estimates:

$$se(\hat\beta_1) = 0.20123$$

Page 17: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Inference about the Slope (β1)

Under our model assumptions,

$$T = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim t_{n-2}$$

where $se(\hat\beta_1) = \sqrt{s^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2} \equiv \sqrt{s^2 / S_{xx}}$.

• Suppose we want to test whether $\beta_1$ equals a certain value, $\beta_{1,0}$:

$$H_0: \beta_1 = \beta_{1,0} \quad \text{versus} \quad H_a: \beta_1 \neq (\text{or } <, \text{ or } >)\ \beta_{1,0}$$

• The test statistic under the null is

$$t_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{s^2 / S_{xx}}}$$

• Note that if the alternative is the "not equal" case and $\beta_{1,0} = 0$, then this is the same as the utility test.
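PROC REG can carry out such a test directly. The sketch below requests confidence limits for the coefficients with the clb model option and tests a nonzero null value with a TEST statement; the null value 1.5 is an arbitrary illustration, not from the lecture:

proc reg data=drinking;
    model cirrhosis = alcohol / clb;  /* 95% confidence limits for the betas */
    Slope_Test: test alcohol = 1.5;   /* F test of H0: beta1 = 1.5           */
run;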

Page 18: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Hypothesis Test about β1

The standard SAS output automatically includes the result of the t test that is equivalent to the utility test. For example, in the alcohol example:

• Hypotheses: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$

• Test statistic:

$$t_0 = \frac{\hat\beta_1}{\sqrt{s^2 / S_{xx}}} = \frac{1.97792}{0.20123} = 9.83$$

• $p$-value $< 0.0001$.

• Reject the null.

With this standard output, it is straightforward to test other cases. For example, if the alternative is right-sided, $H_a: \beta_1 > 0$, then the $p$-value is $P(T_{n-2} > t_0)$, which is half of the $p$-value of the two-sided test (when $t_0 > 0$).
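A similar sketch gives the right-sided $p$-value from the $t_{13}$ distribution, using the test statistic above:

data slope_p;
    t0    = 9.83;                  /* observed t statistic        */
    p_one = 1 - cdf('T', t0, 13);  /* P(T(n-2) > t0), right-sided */
    p_two = 2 * p_one;             /* two-sided p-value           */
    put p_one= p_two=;
run;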

Page 19: Lecture 9: Simple Linear Regression

Model Diagnostics

Model Diagnostics

Check model assumptions:

• Constant variance

• Normality of error terms

The residuals $e_i = y_i - \hat{y}_i$ play an essential role in diagnosing a fitted model.

• Diagnostic plots:
  • Residuals versus fitted values
  • Residuals versus explanatory variables
  • Normal probability plot of the residuals

• Residuals and Studentized residuals

• Cook's distance

• Leverage

Page 20: Lecture 9: Simple Linear Regression

Model Diagnostics

Residuals and Studentized Residual

Residuals are used to identify outlying or extreme $Y$ observations.

• Residuals:

$$e_i = y_i - \hat{y}_i$$

• Studentized residuals:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where $h_{ii}$ is the $i$th element on the diagonal of the hat matrix.

While the residuals $e_i$ can have substantially different sampling variances if their standard deviations differ markedly, the studentized residuals $r_i$ have constant variance.
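For simple linear regression, the leverage $h_{ii}$ has a closed form that makes this standardization explicit:

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$

Observations with $x_i$ far from $\bar{x}$ therefore have larger leverage and a smaller residual standard deviation $s\sqrt{1 - h_{ii}}$.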

Page 21: Lecture 9: Simple Linear Regression

Model Diagnostics

Deleted Residuals and Studentized Deleted Residual

If $y_i$ is far outlying, the regression line may be influenced to come close to $y_i$, yielding a fitted value $\hat{y}_i$ near $y_i$, so that the residual $e_i$ will be small and will not disclose that $y_i$ is outlying.

A refinement that makes residuals more effective for detecting $y$ outliers is to measure the $i$th residual when the fitted regression is based on all the cases except the $i$th one.

• Deleted residuals:

$$d_i = y_i - \hat{y}_{i(i)} = \frac{e_i}{1 - h_{ii}}$$

where $\hat{y}_{i(i)}$ is the fitted value of $y_i$ with the $i$th case omitted.

• Studentized deleted residuals:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$

where $s_{(i)}$ is $s$ computed with the $i$th case omitted.

Page 22: Lecture 9: Simple Linear Regression

Model Diagnostics

Cook's Distance

Cook's $D$ statistic is a measure of the influence of an observation on the regression line. The statistic is defined as follows:

$$D_k = \frac{1}{(p+1)s^2} \sum_{i=1}^{n} \left[\hat{y}_{i(k)} - \hat{y}_i\right]^2$$

where $\hat{y}_{i(k)}$ is the fitted value of the $i$th observation when the $k$th observation is omitted from the model, and $p$ is the number of explanatory variables.

• The values of $D_k$ assess the impact of the $k$th observation on the estimated regression coefficients.

• Values of $D_k$ greater than 1 imply undue influence on the model coefficients.

Page 23: Lecture 9: Simple Linear Regression

Model Diagnostics

Leverage

Leverage is a measure of how unusual the $x$ value of a point is, relative to the $x$ observations as a whole.

• The leverage of observation $i$ is $h_{ii}$, the $i$th diagonal element of the hat matrix.

• A leverage value $h_{ii}$ is usually considered to be large if it is more than twice the mean leverage value, $(p+1)/n$, i.e.

$$h_{ii} > \frac{2(p+1)}{n}$$

where $p$ is the number of predictors in the regression model. For the alcohol data this cutoff is $2(1+1)/15 \approx 0.27$.

• Another suggested guideline is:
  • $h_{ii} > 0.5 \implies$ high leverage
  • $0.2 \le h_{ii} \le 0.5 \implies$ moderate leverage
  • $h_{ii} < 0.2 \implies$ low leverage

Page 24: Lecture 9: Simple Linear Regression

Model Diagnostics

Diagnostic Plots

ODS graphics are used to generate diagnostic plots:

ods graphics on;

proc reg data=drinking;

model cirrhosis=alcohol;

run;

ods graphics off;

• "Raw" residuals against predicted values

• Standardized/Studentized residuals against predicted values

• Studentized residuals against leverage

• Cook's D against observation number

• Histogram of residuals

In both residual plots there is a large negative residual. In the residuals vs. leverage plot, one observation has a large leverage. The first observation has a very large Cook's D.

Page 25: Lecture 9: Simple Linear Regression

Model Diagnostics

Identifying Outlying Observations and Influential Points

To identify the observations with large residuals, high leverage, or large Cook's distance, we need to save these quantities.

proc reg data=drinking;

model cirrhosis=alcohol;

output out = drinkingfit

predicted = yhat

student = student /*Studentized residuals*/

rstudent = rstudent /*Studentized deleted residuals*/

h = leverage

cookd = cooksd;

run;

proc print data = drinkingfit;

where abs(rstudent) > 2 or leverage >.3 or cooksd > 1;

run;

• Influential point: France (Cook's D = 1.54)

• X-outlier and influential point: France (leverage = 0.64)

• Y-outlier: Australia (Studentized deleted residual = −2.55)

Page 26: Lecture 9: Simple Linear Regression

Model Diagnostics

Problem of Outlying Observations

We've found that France is an X-outlier and an influential point. Outlying observations may distort the values of the estimated regression coefficients. Let us simply drop it and refit the model:

ods graphics on;

proc reg data=drinking;

model cirrhosis=alcohol;

where country ne 'France';

run;

ods graphics off;

• The residuals look more "normal".

• No influential points.

• Two marginally large residuals.

Page 27: Lecture 9: Simple Linear Regression

Models without an Intercept

Models without an Intercept

In some applications of simple linear regression, a model without an intercept is required. The model is of the following form:

$$y_i = \beta x_i + \varepsilon_i$$

In this case, application of the least-squares method gives the following estimator of $\beta$:

$$\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
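In SAS, the noint option of the model statement (used on a later slide) fits exactly this model. As a sketch, the closed-form estimator can be checked against PROC REG, assuming the universe data set with variables velocity and distance introduced next:

/* closed-form estimate: sum(x*y) / sum(x*x) */
proc sql;
    select sum(distance * velocity) / sum(distance**2) as beta_hat
    from universe;
quit;

/* should match the coefficient reported by PROC REG */
proc reg data=universe;
    model velocity = distance / noint;
run;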

Page 28: Lecture 9: Simple Linear Regression

Models without an Intercept

Example: Estimating the Age of the Universe

The dominant motion in the universe is the smooth expansion known as Hubble's Law:

$$V = H_0 D$$

where

• $V$ is the recessional velocity, the observed velocity of the galaxy away from us, usually in km/sec

• $H_0$ is Hubble's "constant", in km/sec/Mpc

• $D$ is the distance to the galaxy in Mpc

Based on this law, we can easily estimate the age of the universe:

$$\text{age} = \frac{1}{H_0}$$

Page 29: Lecture 9: Simple Linear Regression

Models without an Intercept

Example: Estimating the Age of the Universe (cont.)

Wood (2006) gives the relative velocities and distances of 24 galaxies, according to measurements made using the Hubble Space Telescope.

Observation   Galaxy     Velocity (km/sec)   Distance (mega-parsec)
1             NGC0300      133                 2.00
2             NGC0925      664                 9.16
3             NGC1326A    1794                16.14
...           ...          ...                  ...
24            NGC7331      999                14.72

where "mega-parsec" is abbreviated Mpc and

$$1 \text{ Mpc} = 3.09 \times 10^{19} \text{ km}$$

Page 30: Lecture 9: Simple Linear Regression

Models without an Intercept

Scatterplot

After reading in the data file, we can create a scatterplot to visualize the relationship between velocity and distance.

proc sgplot data=universe;

scatter y=velocity x=Distance;

yaxis label='Velocity (km/sec)';

xaxis label='Distance (mega-parsec)';

run;

• The yaxis and xaxis statements are used to add the units of measurement to the axis labels.

• The diagram shows a clear, strong relationship between velocity and distance.

• The scatterplot also suggests that the intercept is plausibly zero.

Page 31: Lecture 9: Simple Linear Regression

Models without an Intercept

Model Diagnostics

There are two large Studentized deleted residuals (absolute value $> 2$) and two large leverages ($> 2/24 \approx 0.08$). We can identify these observations by saving these values.

proc reg data=universe;

model velocity= distance / noint;

output out=regout

predicted=pred rstudent=rstudent h=leverage;

run; quit;

proc print data=regout;

where abs(rstudent)>2;

run;

proc print data=regout;

where leverage>.08;

run;

Page 32: Lecture 9: Simple Linear Regression

Models without an Intercept

Age of the Universe

Now we can use the estimated value of $\beta$ to find an approximate value for the age of the universe:

$$\text{Age} = \frac{1}{\hat\beta \ \text{km/sec/Mpc}} \times \frac{3.09 \times 10^{19} \ \text{km}}{1 \ \text{Mpc}} \times \frac{1 \ \text{year}}{3.15 \times 10^{7} \ \text{sec}}$$

It is approximately 12.8 billion years.
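As a sketch of this arithmetic in SAS: the slides do not quote the fitted slope directly, so the value below ($\hat\beta \approx 76.6$ km/sec/Mpc) is an assumed placeholder consistent with the 12.8-billion-year answer.

data age;
    beta_hat  = 76.6;                /* assumed slope estimate (km/sec/Mpc) */
    age_sec   = 3.09e19 / beta_hat;  /* 1/H0 converted to seconds           */
    age_years = age_sec / 3.15e7;    /* seconds -> years                    */
    put age_years=;                  /* about 1.28e10, i.e. 12.8 billion    */
run;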
