Stat 112: Notes 2
• Today's class: Section 3.3.
  – Full description of the simple linear regression model.
  – Checking the assumptions of the simple linear regression model.
  – Inferences for the simple linear regression model.
Wages and Education
• A random sample of 100 men (ages 18-70) was surveyed about their weekly wages in 1988 and their education (part of the March 1988 U.S. Current Population Survey) (in file wagedatasubset.JMP).
• How much more on average do men with one extra year of education make?
• If a man has a high school diploma but no further education, what's the best prediction of his earnings?
• Regression addresses these two questions. X = Education, Y = Weekly Wage.
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-2500) vs. educ (5-20)]
Simple Linear Regression Model
[Figure: Scatterplot "Bivariate Fit of wage By educ" with least squares line, wage (0-2500) vs. educ (5-20)]

Linear Fit: wage = -89.74965 + 51.225264*educ

Summary of Fit
RSquare                  0.139941
RSquare Adj              0.131165
Root Mean Square Error   331.48
The mean of weekly wages is estimated to increase by b_1 = 51.23 dollars for each extra year of education. The typical size of the error from using a man's education to predict his wages is about RMSE = 331.48 dollars.
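The slope, intercept, and RMSE above come from the usual least squares formulas; here is a minimal sketch by hand, using hypothetical (x, y) pairs standing in for (educ, wage) rather than the CPS data.

```python
# Least squares fit and RMSE by hand, mirroring JMP's Fit Line output.
# The (x, y) pairs below are hypothetical, not the wage data.
import math

x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [250, 300, 380, 420, 500, 560, 600, 700]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2);  b0 = ybar - b1*xbar
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# RMSE divides by n - 2 because two coefficients were estimated
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rmse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(b0, b1, rmse)
```

With an intercept in the model, the residuals sum to zero by construction, which is a handy sanity check on any hand computation.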
Sample vs. Population
• We can view the data \((X_1, Y_1), \ldots, (X_n, Y_n)\) as a sample from a population.
• Our goal is to learn about the relationship between X and Y in the population:
  – We don't care about the particular 100 men sampled but about the population of US men ages 18-70.
  – From Notes 1, we don't care about the relationship between tracks counted and the density of deer for the particular sample, but about the relationship in the population of all tracks; this enables us to predict the density of deer in the future from the number of tracks counted.
Simple Linear Regression Model
The simple linear regression model:

\[ E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i, \quad Y_i = \beta_0 + \beta_1 X_i + e_i, \quad e_i \sim N(0, \sigma_e^2) \]

The \(e_i\) are called disturbances and represent the deviation of \(Y_i\) from its mean given \(X_i\). The disturbances are estimated by the residuals \(\hat{e}_i = Y_i - (b_0 + b_1 X_i)\).
Assumptions of the Simple Linear Regression Model
For each value of the explanatory variable X = x, there is a subpopulation of outcomes (responses) Y for units with X = x. Assumptions of the simple linear regression model:
1. Linearity: The means of the subpopulations fall on a straight line function of the explanatory variable.
2. Constant variance: The subpopulation standard deviations are all equal (to \(\sigma_e\)).
3. Normality: The subpopulations are all normally distributed.
4. Independence: The selection of an outcome from any of the subpopulations is independent of the selection of any other outcomes.
Checking the Assumptions
Simple Linear Regression Model for Population: \(Y_i = \beta_0 + \beta_1 x_i + e_i\). Before making any inferences using the simple linear regression model, we need to check the assumptions.

Based on the data \((X_1, Y_1), \ldots, (X_n, Y_n)\):
1. We estimate \(\beta_0\) and \(\beta_1\) by the least squares estimates \(b_0\) and \(b_1\).
2. We estimate the disturbances \(e_i\) by the residuals \(\hat{e}_i = Y_i - \hat{E}(Y_i \mid X_i) = Y_i - (b_0 + b_1 X_i)\).
3. We check if the residuals approximately satisfy:
   (1) Linearity: \(E(\hat{e}_i) = 0\) for all ranges of \(X_i\).
   (2) Constant variance: \(Var(\hat{e}_i)\) is constant for all ranges of \(X_i\).
   (3) Normality: \(\hat{e}_i\) are approximately normally distributed.
   (4) Independence: \(\hat{e}_i\) are independent (only a worry for time series data).
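One informal way to carry out checks (1) and (2) above is to group the residuals by ranges of X and compare the per-group means and spreads. A sketch with hypothetical residuals (not the wage data):

```python
# Group residuals by x, then compare the per-group mean (linearity:
# should be near 0 everywhere) and spread (constant variance: should
# be similar across groups). The (x, residual) pairs are hypothetical.
import statistics

pairs = [(5, -20), (5, 15), (10, -10), (10, 12),
         (15, -8), (15, 11), (20, -14), (20, 14)]

bins = {}
for xi, ei in pairs:
    bins.setdefault(xi, []).append(ei)

for xi in sorted(bins):
    group = bins[xi]
    print(xi, statistics.mean(group), statistics.pstdev(group))
```

In practice the residual plot described below conveys the same information visually, which is why it is the standard tool.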
Residual Plot
A useful tool for checking the assumptions is the residual plot. The residual for observation i is \(\hat{e}_i = y_i - \hat{E}(y_i \mid x_i) = y_i - (b_0 + b_1 x_i)\). The residual plot is a plot of the residuals \(\hat{e}_i\) versus \(x_i\). It is constructed in JMP by, after fitting the least squares line, clicking the red triangle next to Linear Fit and clicking Plot Residuals.
[Figure: Residual plot, Residual (-500 to 1500) vs. educ (5-20)]
Checking Linearity Assumption
To check if the linearity assumption holds (i.e., the model for the mean is correct), check if \(E(\hat{e}_i)\) is zero for each range of \(X_i\).
[Figure: Residual plot, Residual (-500 to 1500) vs. educ (5-20)]
The linearity assumption appears reasonable, although individuals with very high education and those with very low education appear to earn more than expected (most of their residuals are positive). We will consider a nonlinear model for this data in Chapter 5; for now we'll assume linearity is okay.
Violation of Linearity
For a sample of McDonald's restaurants: Y = Revenue of Restaurant, X = Mean Age of Children in Neighborhood of Restaurant.
[Figure: Scatterplot "Bivariate Fit of Revenue By Age", Revenue (800-1300) vs. Age (2.5-15.0)]
[Figure: Residual plot, Residual (-200 to 300) vs. Age (2.5-15.0)]
The mean of the residuals is negative for small and large ages and positive for intermediate ages, so linearity appears to be violated (we will see what to do when linearity is violated in Chapter 5).
Checking Constant Variance
To check that the constant variance assumption holds, check that there is no pattern in the spread of the residuals as X varies.
[Figure: Residual plot, Residual (-500 to 1500) vs. educ (5-20)]
Constant variance appears reasonable.
Checking Normality
To check normality, we can look at whether the overall distribution of the residuals looks approximately normal by making a histogram of the residuals. Save the residuals by clicking the red triangle next to Linear Fit after Fit Line and then clicking Save Residuals. Then click Analyze, Distribution and put the saved residuals column into Y, Columns. The histogram should be approximately bell shaped if the normality assumption holds.
[Figure: Histogram of "Residuals wage", -500 to 1500]
The residuals from the wage data have an approximately bell shaped histogram, although there is some indication of skewness to the right. The normality assumption seems roughly reasonable. We will look at more formal tools for assessing normality in Chapter 6.
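A crude numeric companion to the histogram: under normality, roughly 95% of residuals should lie within 2 RMSE of zero. A sketch with hypothetical residuals (not the wage data):

```python
# Count the fraction of residuals within 2*RMSE of zero; under
# normality this should be roughly 0.95. The residuals are hypothetical.
import math

residuals = [-310, -150, -90, -40, -10, 0, 25, 60, 120, 200, 380, 650]
n = len(residuals)

rmse = math.sqrt(sum(e * e for e in residuals) / (n - 2))
within = sum(1 for e in residuals if abs(e) <= 2 * rmse)
print(within / n)
```

A fraction far below 0.95, or a markedly lopsided count of positive versus negative residuals, would echo the skewness visible in the histogram.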
Checking Assumptions
• It is important to check the assumptions of a regression model because the inferences depend on the assumptions approximately holding. The assumptions don’t need to hold exactly but only approximately.
• We will study more about checking assumptions and how to deal with violations of the assumptions in Chapters 5 and 6.
Inferences
Simple Linear Regression Model for Population:

\[ E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i, \quad Y_i = \beta_0 + \beta_1 X_i + e_i, \quad e_i \sim N(0, \sigma_e^2) \]

Data: \((X_1, Y_1), \ldots, (X_n, Y_n)\).
The least squares estimates \(b_0\) and \(b_1\) will typically not be exactly equal to the true \(\beta_0\) and \(\beta_1\).
Inferences: Draw conclusions about \(\beta_0\) and \(\beta_1\) based on the data \((X_1, Y_1), \ldots, (X_n, Y_n)\).
1. Point estimates: Best estimates of \(\beta_0\) and \(\beta_1\).
2. Confidence intervals: Ranges of plausible values of \(\beta_0\) and \(\beta_1\).
3. Hypothesis tests: Test whether it is plausible that \(\beta_0\) and \(\beta_1\) equal certain values.
Sampling Distribution of b0, b1
• The sampling distribution of \(b_0, b_1\) describes the probability distribution of the estimates over repeated samples \((x_1, y_1), \ldots, (x_n, y_n)\) from the simple linear regression model.
• Understanding the sampling distribution is the key to drawing inferences from the sample to the population.
Sampling distribution in wage data
• To see how the least squares estimates can differ over different samples from the population, we consider the "population" to be all 25,632 men surveyed in the March 1988 Current Population Survey in wagedata1988.JMP and the sample to be random samples of size 100 like the one in wagedatasubset.JMP.
"Population":
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-18000) vs. educ (0-18)]
Linear Fit: wage = -19.06983 + 50.414381*educ
\(b_0 = -19.07\), \(b_1 = 50.41\)
Samples of wage data
• To take samples in JMP, click the Tables menu, then click Subset, then click the circle next to Random Sample Size and set the sample size. JMP will create a new data subset which is a random sample of the original data set.
Sample 1:
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-2500) vs. educ (2-20)]
Linear Fit: wage = -288.6577 + 71.530586*educ
\(b_0 = -288.66\), \(b_1 = 71.53\)
Sample 2:
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-3000) vs. educ (0-20)]
Linear Fit: wage = 188.82961 + 38.453459*educ
\(b_0 = 188.83\), \(b_1 = 38.45\)
Sampling distributions
• Only the sample, not the population, is usually available, so we need to understand the sampling distribution.
• Sampling distribution of \(b_1\):
  – \(E(b_1) = \beta_1\)
  – \(Var(b_1) = \dfrac{\sigma_e^2}{(n-1) s_x^2}\), where \(s_x^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\)
  – The sampling distribution is normally distributed.
  – Even if the normality assumption fails, the sampling distribution of \(b_1\) is still approximately normal if n > 30.
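The repeated-sampling idea can be checked by simulation; a sketch with hypothetical parameter values (loosely patterned on the wage fit, not the CPS data):

```python
# Simulate the sampling distribution of b1 under the model
# y = beta0 + beta1*x + e, e ~ N(0, sigma_e^2), with a fixed design x.
# All parameter values below are hypothetical.
import math
import random
import statistics

random.seed(0)
beta0, beta1, sigma_e = -90.0, 50.0, 330.0
n = 100

# Fixed x values, playing the role of the 100 education values
x = [random.randint(8, 18) for _ in range(n)]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)  # equals (n-1) * s_x^2

slopes = []
for _ in range(2000):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma_e) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    slopes.append(b1)

# Unbiasedness: mean of the simulated slopes should be near beta1.
# Spread: their standard deviation should be near sigma_e / sqrt(sxx).
print(statistics.mean(slopes), statistics.stdev(slopes),
      sigma_e / math.sqrt(sxx))
```

The simulated mean and standard deviation should line up with \(E(b_1) = \beta_1\) and \(\sqrt{Var(b_1)}\) from the formulas above.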
Properties of \(b_0\) and \(b_1\) as estimators of \(\beta_0\) and \(\beta_1\)
• Unbiased estimators: The mean of the sampling distribution is equal to the population parameter being estimated.
• Consistent estimators: As the sample size n increases, the probability that the estimator will be as close as you specify to the true parameter converges to 1.
• Minimum variance estimator: The variance of \(b_1\) is smaller than the variance of any other linear unbiased estimator of \(\beta_1\), say \(b_1^*\).
Confidence Intervals
• Point estimate: \(b_1\).
• Confidence interval: range of plausible values for the true slope \(\beta_1\).
• A \((1-\alpha) \times 100\%\) confidence interval is \(b_1 \pm t_{n-2, \alpha/2} \, s_{b_1}\), where \(s_{b_1} = \dfrac{s_e}{\sqrt{(n-1) s_x^2}}\) is an estimate of the standard deviation of \(b_1\) and \(s_e = \text{RMSE}\). Typically we use a 95% CI.
• The 95% CI is approximately \(b_1 \pm 2 s_{b_1}\). In general, 95% CIs for a parameter are approximately point estimate \(\pm\) 2*Standard Error(point estimate), where the standard error of the point estimate is an estimate of the standard deviation of the point estimate.
Computing Confidence Interval with JMP
In the Fit Line output in JMP, information for computing the confidence interval for \(\beta_1\) is given under Parameter Estimates.

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  -89.74965   173.4267    -0.52     0.6060
educ       51.225264   12.82813    3.99      0.0001

Std Error of the slope for educ = \(s_{b_1}\).
Approximate 95% confidence interval for \(\beta_1\): \(b_1 \pm 2 s_{b_1} = 51.225 \pm 2 \times 12.828 = (25.57, 76.88)\).
The exact 95% confidence interval can be computed by moving the mouse to Parameter Estimates, right clicking, clicking Columns, and then clicking Lower 95% and Upper 95%.

Parameter Estimates
Term       Lower 95%   Upper 95%
Intercept  -433.9092   254.40995
educ       25.768251   76.682276

Exact 95% confidence interval for \(\beta_1\): (25.77, 76.68).
Interpretation: The increase in mean wages for one extra year of education is likely to be between 25.77 and 76.68 dollars, based on the sample in wagedatasubset.JMP.
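The approximate interval is just arithmetic on the two numbers from the Parameter Estimates table:

```python
# Approximate 95% CI: b1 +/- 2 * SE(b1), using the values from
# JMP's Parameter Estimates output for educ.
b1 = 51.225264
se_b1 = 12.82813

lo, hi = b1 - 2 * se_b1, b1 + 2 * se_b1
print(round(lo, 2), round(hi, 2))  # 25.57 76.88, matching the notes
```

The exact interval (25.77, 76.68) is slightly narrower because the exact t critical value for n - 2 = 98 degrees of freedom is a bit less than 2.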
Summary
• We have described the assumptions of the simple linear regression model and how to check them.
• We have come up with a method of describing the uncertainty in our estimates of the slope and the intercept via confidence intervals.
• Note: These confidence intervals are only accurate if the assumptions of the simple linear regression model are approximately correct.
• Next class: Hypothesis tests.