Prediction concerning the response Y

Where does this topic fit in?

• Model formulation

• Model estimation

• Model evaluation

• Model use

Translating two research questions into two reasonable statistical answers

• What is the mean weight, μ, of all American women, aged 18-24?
    – If we want to estimate μ, what would be a good estimate?

• What is the weight, y, of a randomly selected American woman, aged 18-24?
    – If we want to predict y, what would be a good prediction?

[Figure: scatterplot of weight versus height]

Could we do better by taking into account a person’s height?

$\bar{y} = 158.8$

$\hat{w} = -266.5 + 6.1h$

One thing to estimate (μy) and one thing to predict (y)

[Figure: college entrance test score versus high school GPA]

$\mu_Y = E(Y) = \beta_0 + \beta_1 x$

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Two different research questions

• What is the mean response μY when the predictor value is xh?

• What value will a new observation Ynew be when the predictor value is xh?

Example: Skin cancer mortality and latitude

• What is the expected (mean) mortality rate for all locations at 40° N latitude?

• What is the predicted mortality rate for one new randomly selected location at 40° N?

[Figure: regression plot of mortality versus latitude]

Mortality = 389.189 - 5.97764 Latitude
S = 19.1150   R-Sq = 68.0%   R-Sq(adj) = 67.3%


$\hat{y} = 389.19 - 5.9776(40) = 150.1$

“Point estimators”

$\hat{y}_h = b_0 + b_1 x_h$

is the best answer to each research question.

That is, it is:

• the best guess of the mean response at xh

• the best guess of a new observation at xh

But, as always, to be confident in the answer to our research question, we should put an interval around our best guess.
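As a small worked example, the point estimate for the skin cancer data can be computed directly from the fitted line reported on the regression plot. The function name `point_estimate` is only an illustrative choice.

```python
# Coefficients from the fitted line on the slide:
# Mortality = 389.189 - 5.97764 Latitude
b0 = 389.189
b1 = -5.97764


def point_estimate(x_h):
    """Return y-hat at predictor value x_h for the fitted line above."""
    return b0 + b1 * x_h


# Best guess for both the mean response and a new observation at 40 degrees N.
y_hat_40 = point_estimate(40.0)
print(round(y_hat_40, 2))
```

The same number serves as the answer to both research questions; only the interval placed around it differs.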

It is dangerous to “extrapolate” beyond the scope of the model.

[Figure: regression plot of colonies versus conc, linear fit]

colonies = 16.0667 + 1.61576 conc
S = 2.67546   R-Sq = 66.8%   R-Sq(adj) = 63.5%

It is dangerous to “extrapolate” beyond the scope of the model.

[Figure: regression plot of colonies versus conc, quadratic fit]

colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2
S = 2.74819   R-Sq = 69.6%   R-Sq(adj) = 64.5%
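One way to see the danger is to compare what the two fitted equations above predict inside and outside the range of the data. This is only a sketch: the slides do not list the observed concentrations, so treating conc = 3 as inside the scope and conc = 10 as outside it is an assumption made for illustration.

```python
def linear_fit(conc):
    """Fitted line from the first regression plot."""
    return 16.0667 + 1.61576 * conc


def quadratic_fit(conc):
    """Fitted quadratic from the second regression plot."""
    return 15.0205 + 3.22113 * conc - 0.276956 * conc ** 2


# Assumed (hypothetical) evaluation points: one inside the observed range,
# one well beyond it.
inside, outside = 3.0, 10.0

gap_inside = abs(linear_fit(inside) - quadratic_fit(inside))
gap_outside = abs(linear_fit(outside) - quadratic_fit(outside))

print(f"conc={inside}: linear={linear_fit(inside):.2f}, "
      f"quadratic={quadratic_fit(inside):.2f}, gap={gap_inside:.2f}")
print(f"conc={outside}: linear={linear_fit(outside):.2f}, "
      f"quadratic={quadratic_fit(outside):.2f}, gap={gap_outside:.2f}")
```

The two models have similar R-Sq values, yet they disagree far more at the larger concentration, and the data alone cannot say which one is right there.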

A confidence interval for the population mean response μY

… when the predictor value is xh

Again, what are we estimating?

[Figure: college entrance test score versus high school GPA]

$\mu_Y = E(Y) = \beta_0 + \beta_1 x$

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

(1-α)100% t-interval for mean response μY

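Formula in notation:

$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$

Formula in words:

Sample estimate ± (t-multiplier × standard error)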

Predicted Values for New Observations

New Obs     Fit  SE Fit      95.0% CI            95.0% PI
1        150.08    2.75  (144.56, 155.61)  (111.23, 188.93)

Values of Predictors for New Observations

New Obs   Lat
1        40.0
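The 95% CI in this output can be reproduced by hand from the Fit and SE Fit columns. The sketch below assumes n = 49 observations (so 47 degrees of freedom), which the slides do not state; the t-multiplier 2.0117 is the corresponding t-table value.

```python
# Values read from the Minitab output above.
fit = 150.08      # y-hat at latitude 40
se_fit = 2.75     # standard error of the fit

# Assumption: n = 49, so the multiplier is the 0.975 quantile of the
# t distribution with n - 2 = 47 degrees of freedom.
t_mult = 2.0117

# Sample estimate +/- (t-multiplier x standard error)
half_width = t_mult * se_fit
ci = (fit - half_width, fit + half_width)
print(f"95% CI for the mean response: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The result agrees with Minitab's interval up to rounding of the printed Fit and SE Fit values.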

Factors affecting the length of the confidence interval for μY

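$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$

• As the confidence level decreases, …

• As MSE decreases, …

• As the sample size increases, …

• The more spread out the predictor values, …

• The closer xh is to the sample mean, …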
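To explore these factors numerically, the half-width of the interval can be written as a function of each ingredient and varied one at a time. All input values below are hypothetical, and the sketch holds the other inputs fixed while one changes (in real data, changing n would also change the t-multiplier and the spread of the x values).

```python
import math


def ci_half_width(t_mult, mse, n, x_h, x_bar, sxx):
    """Half-width of the t-interval for the mean response at x_h.

    sxx is the sum of squared deviations of the x values from x_bar.
    """
    se = math.sqrt(mse * (1.0 / n + (x_h - x_bar) ** 2 / sxx))
    return t_mult * se


# Hypothetical baseline inputs, chosen only for illustration.
base = dict(t_mult=2.0, mse=4.0, n=20, x_h=8.0, x_bar=5.0, sxx=100.0)


def vary(**changes):
    """Half-width after overriding some baseline inputs."""
    return ci_half_width(**{**base, **changes})


base_width = vary()
print("baseline               ", round(base_width, 3))
print("lower confidence level ", round(vary(t_mult=1.7), 3))
print("smaller MSE            ", round(vary(mse=2.0), 3))
print("larger sample size     ", round(vary(n=80), 3))
print("more spread-out x      ", round(vary(sxx=400.0), 3))
print("x_h at the sample mean ", round(vary(x_h=5.0), 3))
```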

Does the estimate of μY when xh = 1 vary more here …?

[Figure: scatterplot of y versus x]

Var          N  StDev
yhat(x=1)    5  0.320

… or here?

[Figure: scatterplot of y versus x]

Var          N  StDev
yhat(x=1)    5  2.127

Does the estimate of μY vary more when xh = 1 or when xh = 5.5?

[Figure: scatterplot of y versus x]

Var            N  StDev
yhat(x=1)      5  2.127
yhat(x=5.5)    5  0.512


New Obs     Fit  SE Fit     95.0% CI          95.0% PI
1        150.08    2.75  (144.6, 155.6)  (111.2, 188.93)
2        221.82    7.42  (206.9, 236.8)  (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations

New Obs   Latitude
1             40.0
2             28.0

Mean of Lat = 39.533


When is it okay to use the confidence interval for μY formula?

• When xh is a value within the scope of the model
    – xh does not have to be one of the actual x values in the data set.

• When the “LINE” assumptions are met.
    – The formula works okay even if the error terms are only approximately normal.
    – If you have a large sample, the error terms can even deviate substantially from normality.

Prediction interval for a new response Ynew

Again, what are we predicting?

[Figure: college entrance test score versus high school GPA]

$\mu_Y = E(Y) = \beta_0 + \beta_1 x$

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

(1-α)100% prediction interval for new response Ynew

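Formula in notation:

$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$

Formula in words:

Sample prediction ± (t-multiplier × standard error)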



New Obs     Fit  SE Fit      95.0% CI            95.0% PI
1        150.08    2.75  (144.56, 155.61)  (111.23, 188.93)

Values of Predictors for New Observations

New Obs   Lat
1        40.0
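The 95% PI in this output can also be rebuilt by hand. Since MSE × (1 + 1/n + …) equals MSE plus the squared SE Fit, only S = 19.1150 from the regression output and the SE Fit column are needed. As a sketch, it again assumes n = 49 (47 degrees of freedom), which the slides do not state.

```python
import math

# Values read from the Minitab output.
fit = 150.08      # y-hat at latitude 40
se_fit = 2.75     # standard error of the fit
s = 19.1150       # S = sqrt(MSE) from the regression output

# Assumption: n = 49, so the multiplier is the 0.975 quantile of the
# t distribution with 47 degrees of freedom.
t_mult = 2.0117

# Standard error of the prediction: variation in Y plus variation in y-hat.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)

pi = (fit - t_mult * se_pred, fit + t_mult * se_pred)
print(f"95% PI for a new response: ({pi[0]:.2f}, {pi[1]:.2f})")
```

Nearly all of the prediction standard error comes from S itself, which is why the PI is so much wider than the CI at the same latitude.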

When is it okay to use the prediction interval for Ynew formula?

• When xh is a value within the scope of the model
    – xh does not have to be one of the actual x values in the data set.

• When the “LINE” assumptions are met.
    – The formula for the prediction interval depends strongly on the assumption that the error terms are normally distributed.

What’s the difference in the two formulas?

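Confidence interval for μY:

$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$

Prediction interval for Ynew:

$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$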

Prediction of Ynew if the mean μY is known

[Figure: normal curve of mortality, with an area of 0.95 marked]

Suppose it were known that the mean skin cancer mortality at xh = 40° N is 150 deaths per million (with variance 400).

What is the predicted skin cancer mortality in Columbus, Ohio?
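With the mean and variance known, the question can be answered from the normal curve alone. The sketch below assumes mortality at 40° N is normally distributed, as the normal curve on this slide suggests, and uses Python's standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist

# Known (by supposition) mean and variance at 40 degrees N.
mean = 150.0
variance = 400.0
sd = variance ** 0.5   # standard deviation = 20

# Assumption: mortality at this latitude follows a normal distribution.
dist = NormalDist(mu=mean, sigma=sd)

# Central 95% range for a single new location such as Columbus, Ohio.
lower = dist.inv_cdf(0.025)
upper = dist.inv_cdf(0.975)
print(f"best guess: {mean:.0f}, 95% range: ({lower:.1f}, {upper:.1f})")
```

Under these assumptions the best prediction is the mean itself, and the range is roughly the mean plus or minus two standard deviations.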

And then reality sets in

• The mean μY is not known.
    – Estimate it with the predicted response $\hat{y}$.
    – The cost of using $\hat{y}$ to estimate μY is the variance of $\hat{y}$.

• The variance σ² is not known.
    – Estimate it with MSE.

Variance of the prediction

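The variation in the prediction of a new response depends on two components:

1. the variation due to estimating the mean μY with $\hat{y}_h$

2. the variation in Y

$\sigma^2 + \sigma^2(\hat{Y}_h)$

which is estimated by:

$MSE + MSE \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] = MSE \left[ 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]$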
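A short derivation shows why the two components simply add. It assumes the new observation is independent of the sample used to fit the line, so that Ynew and the fitted value are independent:

```latex
\begin{aligned}
\operatorname{Var}(Y_{new} - \hat{y}_h)
  &= \operatorname{Var}(Y_{new}) + \operatorname{Var}(\hat{y}_h) \\
  &= \sigma^2 + \sigma^2 \left[ \frac{1}{n}
     + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] \\
  &= \sigma^2 \left[ 1 + \frac{1}{n}
     + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]
\end{aligned}
```

Replacing the unknown σ² with MSE gives the quantity under the square root in the prediction interval.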

What’s the effect of the difference in the two formulas?

• A (1-α)100% confidence interval for μY at xh will always be narrower than a (1-α)100% prediction interval for Ynew at xh.

• The confidence interval’s standard error can approach 0, whereas the prediction interval’s standard error cannot get close to 0.
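Both intervals can be computed side by side from raw data to check these two points. The sketch below uses a small made-up data set (not from the slides) and takes the t-multiplier as an input, here the t-table value 3.182 for 3 degrees of freedom.

```python
import math


def intervals(x, y, x_h, t_mult):
    """Return (y_hat, ci, pi, mse) for simple linear regression at x_h."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # least squares slope
    b0 = y_bar - b1 * x_bar             # least squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    y_hat = b0 + b1 * x_h
    dist_term = 1.0 / n + (x_h - x_bar) ** 2 / sxx
    se_mean = math.sqrt(mse * dist_term)          # for the mean response
    se_new = math.sqrt(mse * (1.0 + dist_term))   # for a new response
    ci = (y_hat - t_mult * se_mean, y_hat + t_mult * se_mean)
    pi = (y_hat - t_mult * se_new, y_hat + t_mult * se_new)
    return y_hat, ci, pi, mse


# Made-up illustrative data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t_mult = 3.182   # 0.975 quantile of t with n - 2 = 3 degrees of freedom

y_hat, ci, pi, mse = intervals(x, y, 3.0, t_mult)
print("fit:", round(y_hat, 3))
print("95% CI:", tuple(round(v, 3) for v in ci))
print("95% PI:", tuple(round(v, 3) for v in pi))
```

The extra "1 +" inside the square root keeps the prediction standard error at or above the square root of MSE no matter how large the sample is, while the confidence interval's standard error shrinks as n and the spread of x grow.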

Confidence intervals and prediction intervals for response in Minitab

• Stat >> Regression >> Regression …

• Specify response and predictor(s).

• Select Options…
    – In “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values.
    – Specify confidence level (default is 95%).

• Click on OK. Click on OK.

• Results appear in session window.



[Figure: Minitab worksheet column C6 containing the values 40 and 28]


New Obs     Fit  SE Fit     95.0% CI          95.0% PI
1        150.08    2.75  (144.6, 155.6)  (111.2, 188.93)
2        221.82    7.42  (206.9, 236.8)  (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations

New Obs   Latitude
1             40.0
2             28.0

Mean of Lat = 39.533


A plot of the confidence interval and prediction interval in Minitab

• Stat >> Regression >> Fitted line plot …

• Specify predictor and response.

• Under Options …
    – Select Display confidence bands.
    – Select Display prediction bands.
    – Specify desired confidence level (95% default).

• Select OK. Select OK.



[Figure: regression plot of mortality versus latitude, showing the fitted regression line with 95% CI and 95% PI bands]

Mortality = 389.189 - 5.97764 Latitude
S = 19.1150   R-Sq = 68.0%   R-Sq(adj) = 67.3%
