Page 1: Lecture 9: Simple Linear Regression

Junshu Bao

University of Pittsburgh

Page 2: Lecture 9: Simple Linear Regression

Table of contents

Introduction

Data Exploration

Simple Linear Regression

Utility Test

Assess Model Adequacy

Inference about Linear Regression

Model Diagnostics

Models without an Intercept

Page 3: Lecture 9: Simple Linear Regression

Introduction

Simple Linear Regression

The simple linear regression model is

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where

• $y_i$ is the $i$th observation of the response variable $y$

• $x_i$ is the $i$th observation of the explanatory variable $x$

• $\varepsilon_i$ is an error term, with $\varepsilon_i \sim N(0, \sigma^2)$

• $\beta_0$ is the intercept

• $\beta_1$ is the slope of the linear relationship between $y$ and $x$

The "simple" here means the model has only a single explanatory variable.
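To make the model concrete, the sketch below simulates one such data set in SAS; the parameter values, seed, and data set name sim are illustrative assumptions only:

data sim;
    call streaminit(2023);                   /* fix the random seed             */
    do i = 1 to 15;
        x = 25 * rand('uniform');            /* explanatory variable            */
        y = -6 + 2*x
            + rand('normal', 0, 4);          /* beta0 + beta1*x + N(0, 4^2) error */
        output;
    end;
    keep x y;
run;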

Page 4: Lecture 9: Simple Linear Regression

Introduction

Least-Squares Method

Choose $\beta_0$ and $\beta_1$ to minimize the sum of the squares of the vertical deviations:

$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2.$$

Setting the partial derivatives of $Q(\beta_0, \beta_1)$ to zero and solving the resulting system of equations yields the least-squares estimators:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$
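As a numerical check on these formulas, the slope can be computed directly from $S_{xy}$ and $S_{xx}$. A minimal SAS sketch, assuming the drinking data set read in later in this lecture:

proc sql;
    /* one-row table holding the sample means */
    create table xybar as
    select mean(alcohol) as xbar, mean(cirrhosis) as ybar
    from drinking;

    /* slope = Sxy/Sxx; the intercept is then ybar - slope*xbar */
    select sum((d.alcohol - m.xbar) * (d.cirrhosis - m.ybar))
           / sum((d.alcohol - m.xbar)**2) as slope
    from drinking as d, xybar as m;
quit;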

Page 5: Lecture 9: Simple Linear Regression

Introduction

Predicted Values and Residuals

• Predicted (estimated) values of $y$ from a fitted model:

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$$

• Residuals:

$$e_i = y_i - \hat{y}_i$$

  • Residuals play an important role in diagnosing a fitted model.
  • They are sometimes standardized before use (different statisticians favor different options).

Page 6: Lecture 9: Simple Linear Regression

Introduction

Alcohol and Death Rate

A study in Osborne (1979):

• Independent variable: average alcohol consumption in liters per person per year

• Dependent variable: death rate per 100,000 people from cirrhosis or alcoholism

• Data on 15 countries ($n = 15$ observations)

• Question: can we predict the cirrhosis death rate from alcohol consumption?

Page 7: Lecture 9: Simple Linear Regression

Introduction

Analysis Plan

• Obtain basic descriptive statistics (means and standard deviations)

• Draw a scatterplot to visualize the relationship

• Run the simple linear regression

• Perform diagnostic checks

• Interpret the model

Page 8: Lecture 9: Simple Linear Regression

Introduction

Data Structure

Country        Alcohol Consumption   Cirrhosis and Alcoholism
               (l/Person/Year)       (Death Rate/100,000)

France         24.7                  46.1
Italy          15.2                  23.6
West Germany   12.3                  23.7
Australia      10.9                   7.0
Belgium        10.8                  12.3
USA             9.9                  14.2
Canada          8.3                   7.4
...             ...                   ...
Israel          3.1                   5.4

Page 9: Lecture 9: Simple Linear Regression

Introduction

Reading Data

data drinking;

input country $ 1-12 alcohol cirrhosis;

cards;

France 24.7 46.1

Italy 15.2 23.6

W.Germany 12.3 23.7

... ...

Israel 3.1 5.4

;

run;

• Some of the country names are longer than the default length of eight for character variables, so column input is used to read them in.

• The values of the two numeric variables can then be read in with list input.
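An equivalent alternative, sketched below, is modified list input with a colon informat; the informat width :$12. is an assumption sized to the longest country name:

data drinking;
    /* the colon modifier reads up to the next blank,    */
    /* allowing character values longer than eight bytes */
    input country :$12. alcohol cirrhosis;
    cards;
France 24.7 46.1
Italy 15.2 23.6
... ...
Israel 3.1 5.4
;
run;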

Page 10: Lecture 9: Simple Linear Regression

Data Exploration

Descriptive Statistics, Correlation and Scatterplot

proc means data = drinking;

var alcohol cirrhosis;

run;

proc sgplot data=drinking;

scatter x=alcohol y=cirrhosis / datalabel=country;

run;

proc corr data=drinking;

run;

• There is a very strong, positive, linear relationship between death rate and a country's average alcohol consumption.

• Notice that France is an outlier in both the x and y directions. We will address this problem later.

• The correlation between the two variables is 0.9388, with a p-value less than 0.0001.

• The datalabel option specifies a variable whose values are used as labels on the plot.
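When ODS graphics are enabled, PROC CORR can produce the scatterplot itself through its plots= option; a sketch (the exact output layout depends on the SAS version):

ods graphics on;
proc corr data=drinking plots=scatter;
    var alcohol cirrhosis;
run;
ods graphics off;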

Page 11: Lecture 9: Simple Linear Regression

Simple Linear Regression

Fit Linear Regression Model

The linear regression model can be fitted using the following SAS code.

proc reg data=drinking;

model cirrhosis=alcohol;

run;

The fitted linear regression equation is

$$\widehat{\text{Death Rate}} = -6.00 + 1.98 \times \text{Average Alcohol Consumption}$$
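For example, under the fitted model, a country with an average alcohol consumption of 10 liters per person per year (a hypothetical value chosen for illustration) has a predicted death rate of $-6.00 + 1.98 \times 10 = 13.8$ per 100,000.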

Page 12: Lecture 9: Simple Linear Regression

Utility Test

Partition of Sum of Squares and ANOVA Table

The analysis-of-variance identity for a linear regression model is

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{SSE}$$

The corresponding ANOVA table is:

Source   DF    SS    MS                F Value        Pr > F
Model    1     SSR   MSR = SSR/1       F0 = MSR/MSE   p-value
Error    n-2   SSE   MSE = SSE/(n-2)
Total    n-1   SST

where SS is the sum of squares, MS is the mean square, and

$$p\text{-value} = P(F_{1,n-2} > F_0)$$

Page 13: Lecture 9: Simple Linear Regression

Utility Test

Utility Test

For a simple linear regression (SLR) model, if the slope ($\beta_1$) is zero, there is no linear relationship between $x$ and $y$ (the model is useless).

A utility test of SLR is as follows:

1. $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$

2. Test statistic: $F_0 = MSR/MSE$

3. $p$-value $= P(F_{1,n-2} > F_0)$

4. Decision and conclusion

The test results for the alcohol consumption example:

• Test statistic: $F_0 = 96.61$

• $p$-value $= P(F_{1,15-2} > 96.61) < 0.0001$

• Reject the null.
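The tail probability can be verified directly from the $F_{1,13}$ distribution. A minimal SAS sketch using the CDF function (the data set name is arbitrary):

data utility_p;
    f0 = 96.61;                    /* observed F statistic          */
    p  = 1 - cdf('F', f0, 1, 13);  /* P(F(1, n-2) > F0) with n = 15 */
    put p=;                        /* prints the p-value to the log */
run;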

Page 14: Lecture 9: Simple Linear Regression

Assess Model Adequacy

Coefficient of Determination

• The analysis-of-variance identity:

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{SSE}$$

• Coefficient of determination, denoted by $R^2$:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• $R^2$ is often used to judge the adequacy of a regression model.

• We often refer loosely to $R^2$ as the amount of variability in the data explained or accounted for by the regression model.

• $0 \le R^2 \le 1$. It can be shown that $R^2 = r^2$, the square of the sample correlation coefficient.

In the alcohol consumption example, $R^2 = 0.8814$, which indeed equals $0.9388^2$, the square of the correlation reported earlier.

Page 15: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Estimation of Variances

• The variance $\sigma^2$ is estimated by $s^2$, given by

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2} \equiv MSE$$

• The estimated variance of $\hat\beta_1$ is

$$\widehat{\mathrm{Var}}(\hat\beta_1) = \frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

• The estimated variance of a predicted value $\hat{y}^*$ at a given value of $x$, say $x^*$, is

$$\widehat{\mathrm{Var}}(\hat{y}^*) = s^2 \left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

Page 16: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Estimation of Variances (example)

• The estimate of the variance, $s^2$, is MSE:

$$s^2 = 17.39076 \implies s = \sqrt{17.39076} = 4.17022$$

In the SAS output, $s$ is called "Root MSE".

• The standard error of $\hat\beta_1$ is given in the table of parameter estimates:

$$se(\hat\beta_1) = 0.20123$$

Page 17: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Inference about the Slope (β1)

Under our model assumptions,

$$T = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim t_{n-2}$$

where $se(\hat\beta_1) = \sqrt{s^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2} \equiv \sqrt{s^2 / S_{xx}}$.

• Suppose we want to test whether $\beta_1$ equals a certain value, $\beta_{1,0}$:

$$H_0: \beta_1 = \beta_{1,0} \quad \text{versus} \quad H_a: \beta_1 \neq (\text{or } <, \text{ or } >)\ \beta_{1,0}$$

• The test statistic under the null is

$$t_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{s^2 / S_{xx}}}$$

• Note that if the alternative is the "not equal" case and $\beta_{1,0} = 0$, then this is the same as the utility test.
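PROC REG can carry out such a test directly. The sketch below requests confidence limits for the coefficients with the clb model option and tests a nonzero null value with a TEST statement; the null value 1.5 is an arbitrary illustration, not from the lecture:

proc reg data=drinking;
    model cirrhosis = alcohol / clb;  /* 95% confidence limits for the betas */
    Slope_Test: test alcohol = 1.5;   /* F test of H0: beta1 = 1.5           */
run;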

Page 18: Lecture 9: Simple Linear Regression

Inference about Linear Regression

Hypothesis Test about β1

The standard SAS output automatically includes the result of the t test that is equivalent to the utility test. For example, in the alcohol example:

• Hypotheses: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$

• Test statistic:

$$t_0 = \frac{\hat\beta_1}{\sqrt{s^2 / S_{xx}}} = \frac{1.97792}{0.20123} = 9.83$$

• $p$-value $< 0.0001$.

• Reject the null.

With this standard output, it is straightforward to test other cases. For example, if the alternative is right-sided, $H_a: \beta_1 > 0$, then the $p$-value is $P(T_{n-2} > t_0)$, which is half of the $p$-value of the two-sided test (when $t_0 > 0$).
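A similar sketch gives the right-sided $p$-value from the $t_{13}$ distribution, using the test statistic above:

data slope_p;
    t0    = 9.83;                  /* observed t statistic        */
    p_one = 1 - cdf('T', t0, 13);  /* P(T(n-2) > t0), right-sided */
    p_two = 2 * p_one;             /* two-sided p-value           */
    put p_one= p_two=;
run;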

Page 19: Lecture 9: Simple Linear Regression

Model Diagnostics

Model Diagnostics

Check model assumptions:

• Constant variance

• Normality of error terms

The residuals $e_i = y_i - \hat{y}_i$ play an essential role in diagnosing a fitted model.

• Diagnostic plots:
  • Residuals versus fitted values
  • Residuals versus explanatory variables
  • Normal probability plot of the residuals

• Residuals and Studentized residuals

• Cook's distance

• Leverage

Page 20: Lecture 9: Simple Linear Regression

Model Diagnostics

Residuals and Studentized Residual

Residuals are used to identify outlying or extreme $Y$ observations.

• Residuals:

$$e_i = y_i - \hat{y}_i$$

• Studentized residuals:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where $h_{ii}$ is the $i$th element on the diagonal of the hat matrix.

While the residuals $e_i$ can have substantially different sampling variances if their standard deviations differ markedly, the studentized residuals $r_i$ have constant variance.
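For simple linear regression, the leverage $h_{ii}$ has a closed form that makes this standardization explicit:

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$

Observations with $x_i$ far from $\bar{x}$ therefore have larger leverage and a smaller residual standard deviation $s\sqrt{1 - h_{ii}}$.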

Page 21: Lecture 9: Simple Linear Regression

Model Diagnostics

Deleted Residuals and Studentized Deleted Residual

If $y_i$ is far outlying, the regression line may be influenced to come close to $y_i$, yielding a fitted value $\hat{y}_i$ near $y_i$, so that the residual $e_i$ will be small and will not disclose that $y_i$ is outlying.

A refinement that makes residuals more effective for detecting $y$ outliers is to measure the $i$th residual when the fitted regression is based on all the cases except the $i$th one.

• Deleted residuals:

$$d_i = y_i - \hat{y}_{i(i)} = \frac{e_i}{1 - h_{ii}}$$

where $\hat{y}_{i(i)}$ is the fitted value of $y_i$ with the $i$th case omitted.

• Studentized deleted residuals:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$

where $s_{(i)}$ is $s$ computed with the $i$th case omitted.

Page 22: Lecture 9: Simple Linear Regression

Model Diagnostics

Cook's Distance

Cook's $D$ statistic is a measure of the influence of an observation on the regression line. The statistic is defined as follows:

$$D_k = \frac{1}{(p+1)s^2} \sum_{i=1}^{n} \left[\hat{y}_{i(k)} - \hat{y}_i\right]^2$$

where $\hat{y}_{i(k)}$ is the fitted value of the $i$th observation when the $k$th observation is omitted from the model, and $p$ is the number of explanatory variables.

• The values of $D_k$ assess the impact of the $k$th observation on the estimated regression coefficients.

• Values of $D_k$ greater than 1 imply undue influence on the model coefficients.

Page 23: Lecture 9: Simple Linear Regression

Model Diagnostics

Leverage

Leverage is a measure of how unusual the $x$ value of a point is, relative to the $x$ observations as a whole.

• The leverage of observation $i$ is $h_{ii}$, the $i$th diagonal element of the hat matrix.

• A leverage value $h_{ii}$ is usually considered to be large if it is more than twice the mean leverage value, $(p+1)/n$, i.e.

$$h_{ii} > \frac{2(p+1)}{n}$$

where $p$ is the number of predictors in the regression model. For the alcohol data this cutoff is $2(1+1)/15 \approx 0.27$.

• Another suggested guideline is:
  • $h_{ii} > 0.5 \implies$ high leverage
  • $0.2 \le h_{ii} \le 0.5 \implies$ moderate leverage
  • $h_{ii} < 0.2 \implies$ low leverage

Page 24: Lecture 9: Simple Linear Regression

Model Diagnostics

Diagnostic Plots

ODS graphics are used to generate diagnostic plots:

ods graphics on;

proc reg data=drinking;

model cirrhosis=alcohol;

run;

ods graphics off;

• "Raw" residuals against predicted values

• Standardized/Studentized residuals against predicted values

• Studentized residuals against leverage

• Cook's D against observation number

• Histogram of residuals

In both residual plots there is a large negative residual. In the residuals vs. leverage plot, one observation has a large leverage. The first observation has a very large Cook's D.

Page 25: Lecture 9: Simple Linear Regression

Model Diagnostics

Identifying Outlying Observations and Influential Points

To identify the observations with large residuals, high leverage, or large Cook's distance, we need to save these quantities.

proc reg data=drinking;

model cirrhosis=alcohol;

output out = drinkingfit

predicted = yhat

student = student /*Studentized residuals*/

rstudent = rstudent /*Studentized deleted residuals*/

h = leverage

cookd = cooksd;

run;

proc print data = drinkingfit;

where abs(rstudent) > 2 or leverage >.3 or cooksd > 1;

run;

• Influential point: France (Cook's D = 1.54)

• X-outlier and influential point: France (leverage = 0.64)

• Y-outlier: Australia (Studentized deleted residual = −2.55)

Page 26: Lecture 9: Simple Linear Regression

Model Diagnostics

Problem of Outlying Observations

We've found that France is an X-outlier and an influential point. Outlying observations may distort the values of the estimated regression coefficients. Let us simply drop it and refit the model:

ods graphics on;

proc reg data=drinking;

model cirrhosis=alcohol;

where country ne 'France';

run;

ods graphics off;

• The residuals look more "normal".

• No influential points.

• Two marginally large residuals.

Page 27: Lecture 9: Simple Linear Regression

Models without an Intercept

Models without an Intercept

In some applications of simple linear regression, a model without an intercept is required. The model is of the following form:

$$y_i = \beta x_i + \varepsilon_i$$

In this case, application of the least-squares method gives the following estimator of $\beta$:

$$\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
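In SAS, the noint option of the model statement (used on a later slide) fits exactly this model. As a sketch, the closed-form estimator can be checked against PROC REG, assuming the universe data set with variables velocity and distance introduced next:

/* closed-form estimate: sum(x*y) / sum(x*x) */
proc sql;
    select sum(distance * velocity) / sum(distance**2) as beta_hat
    from universe;
quit;

/* should match the coefficient reported by PROC REG */
proc reg data=universe;
    model velocity = distance / noint;
run;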

Page 28: Lecture 9: Simple Linear Regression

Models without an Intercept

Example: Estimating the Age of the Universe

The dominant motion in the universe is the smooth expansion known as Hubble's Law:

$$V = H_0 D$$

where

• $V$ is the recessional velocity, the observed velocity of the galaxy away from us, usually in km/sec

• $H_0$ is Hubble's "constant", in km/sec/Mpc

• $D$ is the distance to the galaxy in Mpc

Based on this law, we can easily estimate the age of the universe:

$$\text{age} = \frac{1}{H_0}$$

Page 29: Lecture 9: Simple Linear Regression

Models without an Intercept

Example: Estimating the Age of the Universe (cont.)

Wood (2006) gives the relative velocities and distances of 24 galaxies, according to measurements made using the Hubble Space Telescope.

Observation   Galaxy     Velocity (km/sec)   Distance (mega-parsec)
1             NGC0300      133                 2.00
2             NGC0925      664                 9.16
3             NGC1326A    1794                16.14
...           ...          ...                  ...
24            NGC7331      999                14.72

where "mega-parsec" is abbreviated Mpc and

$$1 \text{ Mpc} = 3.09 \times 10^{19} \text{ km}$$

Page 30: Lecture 9: Simple Linear Regression

Models without an Intercept

Scatterplot

After reading in the data file, we can create a scatterplot to visualize the relationship between velocity and distance.

proc sgplot data=universe;

scatter y=velocity x=Distance;

yaxis label='Velocity (km/sec)';

xaxis label='Distance (mega-parsec)';

run;

• The yaxis and xaxis statements are used to add the units of measurement to the axis labels.

• The diagram shows a clear, strong relationship between velocity and distance.

• The scatterplot also suggests that the intercept is plausibly zero.

Page 31: Lecture 9: Simple Linear Regression

Models without an Intercept

Model Diagnostics

There are two large Studentized deleted residuals (absolute value $> 2$) and two large leverages ($> 2/24 \approx 0.08$). We can identify these observations by saving these values.

proc reg data=universe;

model velocity= distance / noint;

output out=regout

predicted=pred rstudent=rstudent h=leverage;

run; quit;

proc print data=regout;

where abs(rstudent)>2;

run;

proc print data=regout;

where leverage>.08;

run;

Page 32: Lecture 9: Simple Linear Regression

Models without an Intercept

Age of the Universe

Now we can use the estimated value of $\beta$ to find an approximate value for the age of the universe:

$$\text{Age} = \frac{1}{\hat\beta \ \text{km/sec/Mpc}} \times \frac{3.09 \times 10^{19} \ \text{km}}{1 \ \text{Mpc}} \times \frac{1 \ \text{year}}{3.15 \times 10^{7} \ \text{sec}}$$

It is approximately 12.8 billion years.
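As a sketch of this arithmetic in SAS: the slides do not quote the fitted slope directly, so the value below ($\hat\beta \approx 76.6$ km/sec/Mpc) is an assumed placeholder consistent with the 12.8-billion-year answer.

data age;
    beta_hat  = 76.6;                /* assumed slope estimate (km/sec/Mpc) */
    age_sec   = 3.09e19 / beta_hat;  /* 1/H0 converted to seconds           */
    age_years = age_sec / 3.15e7;    /* seconds -> years                    */
    put age_years=;                  /* about 1.28e10, i.e. 12.8 billion    */
run;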
