Stat 112: Notes 2
• Today's class: Section 3.3.
  – Full description of the simple linear regression model.
  – Checking the assumptions of the simple linear regression model.
  – Inferences for the simple linear regression model.
Wages and Education
• A random sample of 100 men (ages 18-70) was surveyed about their weekly wages in 1988 and their education (part of the March 1988 U.S. Current Population Survey) (in file wagedatasubset.JMP).
• How much more on average do men with one extra year of education make?
• If a man has a high school diploma but no further education, what's the best prediction of his earnings?
• Regression addresses these two questions. X = Education, Y = Weekly Wage.
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-2500) vs. educ (5-20)]
Simple Linear Regression Model
[Figure: Scatterplot "Bivariate Fit of wage By educ" with least squares line, wage (0-2500) vs. educ (5-20)]

Linear Fit: wage = -89.74965 + 51.225264*educ

Summary of Fit
RSquare                  0.139941
RSquare Adj              0.131165
Root Mean Square Error   331.48
The mean of weekly wages is estimated to increase by b_1 = 51.23 dollars for each extra year of education. The typical size of the error from using a man's education to predict his wages is about RMSE = 331.48 dollars.
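The slope, intercept, and RMSE above come from the usual least squares formulas; here is a minimal sketch by hand, using hypothetical (x, y) pairs standing in for (educ, wage) rather than the CPS data.

```python
# Least squares fit and RMSE by hand, mirroring JMP's Fit Line output.
# The (x, y) pairs below are hypothetical, not the wage data.
import math

x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [250, 300, 380, 420, 500, 560, 600, 700]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2);  b0 = ybar - b1*xbar
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# RMSE divides by n - 2 because two coefficients were estimated
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rmse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(b0, b1, rmse)
```

With an intercept in the model, the residuals sum to zero by construction, which is a handy sanity check on any hand computation.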
Sample vs. Population
• We can view the data \((X_1, Y_1), \ldots, (X_n, Y_n)\) as a sample from a population.
• Our goal is to learn about the relationship between X and Y in the population:
  – We don't care about the particular 100 men sampled but about the population of US men ages 18-70.
  – From Notes 1, we don't care about the relationship between tracks counted and the density of deer for the particular sample, but about the relationship in the population of all tracks; this enables us to predict the density of deer in the future from the number of tracks counted.
Simple Linear Regression Model
The simple linear regression model:

\[ E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i, \quad Y_i = \beta_0 + \beta_1 X_i + e_i, \quad e_i \sim N(0, \sigma_e^2) \]

The \(e_i\) are called disturbances and represent the deviation of \(Y_i\) from its mean given \(X_i\). The disturbances are estimated by the residuals \(\hat{e}_i = Y_i - (b_0 + b_1 X_i)\).
Assumptions of the Simple Linear Regression Model
For each value of the explanatory variable X = x, there is a subpopulation of outcomes (responses) Y for units with X = x. Assumptions of the simple linear regression model:
1. Linearity: The means of the subpopulations fall on a straight line function of the explanatory variable.
2. Constant variance: The subpopulation standard deviations are all equal (to \(\sigma_e\)).
3. Normality: The subpopulations are all normally distributed.
4. Independence: The selection of an outcome from any of the subpopulations is independent of the selection of any other outcomes.
Checking the Assumptions
Simple Linear Regression Model for Population: \(Y_i = \beta_0 + \beta_1 x_i + e_i\). Before making any inferences using the simple linear regression model, we need to check the assumptions.

Based on the data \((X_1, Y_1), \ldots, (X_n, Y_n)\):
1. We estimate \(\beta_0\) and \(\beta_1\) by the least squares estimates \(b_0\) and \(b_1\).
2. We estimate the disturbances \(e_i\) by the residuals \(\hat{e}_i = Y_i - \hat{E}(Y_i \mid X_i) = Y_i - (b_0 + b_1 X_i)\).
3. We check if the residuals approximately satisfy:
   (1) Linearity: \(E(\hat{e}_i) = 0\) for all ranges of \(X_i\).
   (2) Constant variance: \(Var(\hat{e}_i)\) is constant for all ranges of \(X_i\).
   (3) Normality: \(\hat{e}_i\) are approximately normally distributed.
   (4) Independence: \(\hat{e}_i\) are independent (only a worry for time series data).
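One informal way to carry out checks (1) and (2) above is to group the residuals by ranges of X and compare the per-group means and spreads. A sketch with hypothetical residuals (not the wage data):

```python
# Group residuals by x, then compare the per-group mean (linearity:
# should be near 0 everywhere) and spread (constant variance: should
# be similar across groups). The (x, residual) pairs are hypothetical.
import statistics

pairs = [(5, -20), (5, 15), (10, -10), (10, 12),
         (15, -8), (15, 11), (20, -14), (20, 14)]

bins = {}
for xi, ei in pairs:
    bins.setdefault(xi, []).append(ei)

for xi in sorted(bins):
    group = bins[xi]
    print(xi, statistics.mean(group), statistics.pstdev(group))
```

In practice the residual plot described below conveys the same information visually, which is why it is the standard tool.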
Residual Plot
A useful tool for checking the assumptions is the residual plot. The residual for observation i is \(\hat{e}_i = y_i - \hat{E}(y_i \mid x_i) = y_i - (b_0 + b_1 x_i)\). The residual plot is a plot of the residuals \(\hat{e}_i\) versus \(x_i\). It is constructed in JMP by, after fitting the least squares line, clicking the red triangle next to Linear Fit and clicking Plot Residuals.
[Figure: Residual plot, Residual (-500 to 1500) vs. educ (5-20)]
Checking Linearity Assumption
To check if the linearity assumption holds (i.e., the model for the mean is correct), check if \(E(\hat{e}_i)\) is zero for each range of \(X_i\).
[Figure: Residual plot, Residual (-500 to 1500) vs. educ (5-20)]
The linearity assumption appears reasonable, although individuals with very high education and those with very low education appear to earn more than expected (most of their residuals are positive). We will consider a nonlinear model for this data in Chapter 5; for now we'll assume linearity is okay.
Violation of Linearity
For a sample of McDonald's restaurants: Y = Revenue of Restaurant, X = Mean Age of Children in Neighborhood of Restaurant.
[Figure: Scatterplot "Bivariate Fit of Revenue By Age", Revenue (800-1300) vs. Age (2.5-15.0)]
[Figure: Residual plot, Residual (-200 to 300) vs. Age (2.5-15.0)]
The mean of the residuals is negative for small and large ages and positive for intermediate ages, so linearity appears to be violated (we will see what to do when linearity is violated in Chapter 5).
Checking Constant Variance
To check that the constant variance assumption holds, check that there is no pattern in the spread of the residuals as X varies.
[Figure: Residual plot, Residual (-500 to 1500) vs. educ (5-20)]
Constant variance appears reasonable.
Checking Normality
To check normality, we can look at whether the overall distribution of the residuals looks approximately normal by making a histogram of the residuals. Save the residuals by clicking the red triangle next to Linear Fit after Fit Line and then clicking Save Residuals. Then click Analyze, Distribution and put the saved residuals column into Y, Columns. The histogram should be approximately bell shaped if the normality assumption holds.
[Figure: Histogram of "Residuals wage", -500 to 1500]
The residuals from the wage data have an approximately bell shaped histogram, although there is some indication of skewness to the right. The normality assumption seems roughly reasonable. We will look at more formal tools for assessing normality in Chapter 6.
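A crude numeric companion to the histogram: under normality, roughly 95% of residuals should lie within 2 RMSE of zero. A sketch with hypothetical residuals (not the wage data):

```python
# Count the fraction of residuals within 2*RMSE of zero; under
# normality this should be roughly 0.95. The residuals are hypothetical.
import math

residuals = [-310, -150, -90, -40, -10, 0, 25, 60, 120, 200, 380, 650]
n = len(residuals)

rmse = math.sqrt(sum(e * e for e in residuals) / (n - 2))
within = sum(1 for e in residuals if abs(e) <= 2 * rmse)
print(within / n)
```

A fraction far below 0.95, or a markedly lopsided count of positive versus negative residuals, would echo the skewness visible in the histogram.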
Checking Assumptions
• It is important to check the assumptions of a regression model because the inferences depend on the assumptions approximately holding. The assumptions don’t need to hold exactly but only approximately.
• We will study more about checking assumptions and how to deal with violations of the assumptions in Chapters 5 and 6.
Inferences
Simple Linear Regression Model for Population:

\[ E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i, \quad Y_i = \beta_0 + \beta_1 X_i + e_i, \quad e_i \sim N(0, \sigma_e^2) \]

Data: \((X_1, Y_1), \ldots, (X_n, Y_n)\).
The least squares estimates \(b_0\) and \(b_1\) will typically not be exactly equal to the true \(\beta_0\) and \(\beta_1\).
Inferences: Draw conclusions about \(\beta_0\) and \(\beta_1\) based on the data \((X_1, Y_1), \ldots, (X_n, Y_n)\).
1. Point estimates: Best estimates of \(\beta_0\) and \(\beta_1\).
2. Confidence intervals: Ranges of plausible values of \(\beta_0\) and \(\beta_1\).
3. Hypothesis tests: Test whether it is plausible that \(\beta_0\) and \(\beta_1\) equal certain values.
Sampling Distribution of b0, b1
• The sampling distribution of \(b_0, b_1\) describes the probability distribution of the estimates over repeated samples \((x_1, y_1), \ldots, (x_n, y_n)\) from the simple linear regression model.
• Understanding the sampling distribution is the key to drawing inferences from the sample to the population.
Sampling distribution in wage data
• To see how the least squares estimates can differ over different samples from the population, we consider the "population" to be all 25,632 men surveyed in the March 1988 Current Population Survey in wagedata1988.JMP and the sample to be random samples of size 100 like the one in wagedatasubset.JMP.
"Population":
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-18000) vs. educ (0-18)]
Linear Fit: wage = -19.06983 + 50.414381*educ
\(b_0 = -19.07\), \(b_1 = 50.41\)
Samples of wage data
• To take samples in JMP, click the Tables menu, then click Subset, then click the circle next to Random Sample Size and set the sample size. JMP will create a new data subset which is a random sample of the original data set.
Sample 1:
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-2500) vs. educ (2-20)]
Linear Fit: wage = -288.6577 + 71.530586*educ
\(b_0 = -288.66\), \(b_1 = 71.53\)
Sample 2:
[Figure: Scatterplot "Bivariate Fit of wage By educ", wage (0-3000) vs. educ (0-20)]
Linear Fit: wage = 188.82961 + 38.453459*educ
\(b_0 = 188.83\), \(b_1 = 38.45\)
Sampling distributions
• Only the sample, not the population, is usually available, so we need to understand the sampling distribution.
• Sampling distribution of \(b_1\):
  – \(E(b_1) = \beta_1\)
  – \(Var(b_1) = \dfrac{\sigma_e^2}{(n-1) s_x^2}\), where \(s_x^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\)
  – The sampling distribution is normally distributed.
  – Even if the normality assumption fails, the sampling distribution of \(b_1\) is still approximately normal if n > 30.
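The repeated-sampling idea can be checked by simulation; a sketch with hypothetical parameter values (loosely patterned on the wage fit, not the CPS data):

```python
# Simulate the sampling distribution of b1 under the model
# y = beta0 + beta1*x + e, e ~ N(0, sigma_e^2), with a fixed design x.
# All parameter values below are hypothetical.
import math
import random
import statistics

random.seed(0)
beta0, beta1, sigma_e = -90.0, 50.0, 330.0
n = 100

# Fixed x values, playing the role of the 100 education values
x = [random.randint(8, 18) for _ in range(n)]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)  # equals (n-1) * s_x^2

slopes = []
for _ in range(2000):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma_e) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    slopes.append(b1)

# Unbiasedness: mean of the simulated slopes should be near beta1.
# Spread: their standard deviation should be near sigma_e / sqrt(sxx).
print(statistics.mean(slopes), statistics.stdev(slopes),
      sigma_e / math.sqrt(sxx))
```

The simulated mean and standard deviation should line up with \(E(b_1) = \beta_1\) and \(\sqrt{Var(b_1)}\) from the formulas above.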
Properties of \(b_0\) and \(b_1\) as estimators of \(\beta_0\) and \(\beta_1\)
• Unbiased estimators: The mean of the sampling distribution is equal to the population parameter being estimated.
• Consistent estimators: As the sample size n increases, the probability that the estimator will be as close as you specify to the true parameter converges to 1.
• Minimum variance estimator: The variance of \(b_1\) is smaller than the variance of any other linear unbiased estimator of \(\beta_1\), say \(b_1^*\).
Confidence Intervals
• Point estimate: \(b_1\).
• Confidence interval: range of plausible values for the true slope \(\beta_1\).
• A \((1-\alpha) \times 100\%\) confidence interval is \(b_1 \pm t_{n-2, \alpha/2} \, s_{b_1}\), where \(s_{b_1} = \dfrac{s_e}{\sqrt{(n-1) s_x^2}}\) is an estimate of the standard deviation of \(b_1\) and \(s_e = \text{RMSE}\). Typically we use a 95% CI.
• The 95% CI is approximately \(b_1 \pm 2 s_{b_1}\). In general, 95% CIs for a parameter are approximately point estimate \(\pm\) 2*Standard Error(point estimate), where the standard error of the point estimate is an estimate of the standard deviation of the point estimate.
Computing Confidence Interval with JMP
In the Fit Line output in JMP, information for computing the confidence interval for \(\beta_1\) is given under Parameter Estimates.

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  -89.74965   173.4267    -0.52     0.6060
educ       51.225264   12.82813    3.99      0.0001

Std Error of the slope for educ = \(s_{b_1}\).
Approximate 95% confidence interval for \(\beta_1\): \(b_1 \pm 2 s_{b_1} = 51.225 \pm 2 \times 12.828 = (25.57, 76.88)\).
The exact 95% confidence interval can be computed by moving the mouse to Parameter Estimates, right clicking, clicking Columns, and then clicking Lower 95% and Upper 95%.

Parameter Estimates
Term       Lower 95%   Upper 95%
Intercept  -433.9092   254.40995
educ       25.768251   76.682276

Exact 95% confidence interval for \(\beta_1\): (25.77, 76.68).
Interpretation: The increase in mean wages for one extra year of education is likely to be between 25.77 and 76.68 dollars, based on the sample in wagedatasubset.JMP.
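The approximate interval is just arithmetic on the two numbers from the Parameter Estimates table:

```python
# Approximate 95% CI: b1 +/- 2 * SE(b1), using the values from
# JMP's Parameter Estimates output for educ.
b1 = 51.225264
se_b1 = 12.82813

lo, hi = b1 - 2 * se_b1, b1 + 2 * se_b1
print(round(lo, 2), round(hi, 2))  # 25.57 76.88, matching the notes
```

The exact interval (25.77, 76.68) is slightly narrower because the exact t critical value for n - 2 = 98 degrees of freedom is a bit less than 2.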
Summary
• We have described the assumptions of the simple linear regression model and how to check them.
• We have come up with a method of describing the uncertainty in our estimates of the slope and the intercept via confidence intervals.
• Note: These confidence intervals are only accurate if the assumptions of the simple linear regression model are approximately correct.
• Next class: Hypothesis tests.