Describing Relationships: Regression, Prediction and Causation Chapter 15 plus extra November 5, 2012 Prediction Predictions for a Scatter Plot Anatomy of a Line The Regression Line Regression and Least Squares Regression Fallacy

Describing Relationships: Regression, Prediction and Causation · faculty.washington.edu/grover4/class11au12.pdf · 2012. 11. 5.


Describing Relationships: Regression, Prediction and Causation

Chapter 15 plus extra

November 5, 2012

Prediction
Predictions for a Scatter Plot
Anatomy of a Line
The Regression Line
Regression and Least Squares
Regression Fallacy


1.0 Prediction

If we have two quantitative variables X and Y that are linearly related to each other, then knowing the particular value of X for one individual can help us to estimate (or predict) the value of Y for that individual.

We will explore the best prediction of the response variable (Y) given a value of the explanatory variable (X) in football-shaped data clouds.

What is the likely size of the prediction error?


1.1 Fundamental Principle of Prediction

Incoming students at a large law school have an average L.S.A.T. score of 163 and an S.D. of 8. You may assume the histogram of these data approximately follows a normal curve. Tomorrow one of these students will be chosen at random.

What is your best guess for their score?

The guess will be compared to their actual score to see how far off it is. What is the likely size of the error in your guess?
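One way to explore these two questions numerically: with no other information about the student, guessing the average gives the smallest r.m.s. error, and that error equals the S.D. The sketch below checks this on a short list of made-up scores (the numbers are hypothetical, chosen only so that the average is 163; they are not the law school's data).

```python
import math

def rms_error(scores, guess):
    """Root-mean-square distance between one fixed guess and every score."""
    return math.sqrt(sum((s - guess) ** 2 for s in scores) / len(scores))

# Hypothetical scores, made up for illustration only.
scores = [151, 155, 159, 163, 163, 167, 171, 175]

average = sum(scores) / len(scores)
sd = rms_error(scores, average)  # the S.D. is the r.m.s. deviation from the average

# Try every whole-number guess from 140 to 189 and keep the one with the smallest error.
best_guess = min(range(140, 190), key=lambda g: rms_error(scores, g))

print("average:", average)
print("best whole-number guess:", best_guess)
print("r.m.s. error of that guess (the S.D.):", round(sd, 2))
```

The search lands on the average itself, and the error attached to that guess is the S.D. of the list.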


2.0 Predictions for a Scatter Plot

[Figure: two scatter plots of son's height (inches) against father's height (inches).]


2.0 Predictions for a Scatter Plot

[Figure: scatter plot of son's height (inches) against father's height (inches).]

The graph of averages shows the average son’s height for each father’s height.

It is close to a straight line in the middle.

At the ends, it is quite bumpy.


2.0 Prediction for a Scatter Plot

Use the mean of the relevant sub-group of data as our predictor.

S.D. of the group gives the “likely size” of the error in our prediction.
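The sub-group method can be sketched in a few lines: group the data by the X value, then report each group's mean as the prediction and its S.D. as the likely size of the error. The `graph_of_averages` helper and the height pairs below are hypothetical, made up for illustration; they are not the father and son data shown in the scatter plots.

```python
import math
from collections import defaultdict

def graph_of_averages(pairs):
    """Map each x value to (mean of y, S.D. of y) for the points with that x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    summary = {}
    for x, ys in sorted(groups.items()):
        mean = sum(ys) / len(ys)
        sd = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys))
        summary[x] = (mean, sd)
    return summary

# Hypothetical (father, son) heights in inches, rounded to the nearest inch.
pairs = [(64, 65), (64, 67), (64, 69),
         (68, 67), (68, 69), (68, 71),
         (72, 69), (72, 71), (72, 73)]

summary = graph_of_averages(pairs)
for x, (mean, sd) in summary.items():
    print(f"fathers {x} in.: predict {mean:.1f} in., likely error about {sd:.1f} in.")
```

With real data the groups at the extremes contain few points, which is one reason the graph of averages gets bumpy at the ends.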


2.0 Prediction for a Scatter Plot

[Figure: scatter plot of son's height (inches) against father's height (inches).]

The regression method for prediction uses a line fit to the graph of averages.

It smooths away some of the chance variation in the group averages.


2.1 Example

The figure below is based on a representative sample of married couples in New York. The graph shows the average income of the wives (Y), given their husbands' income (X). The regression line is plotted too. Predict the income of wives when the husband makes $60,000.


3.0 Anatomy of a Line

Any line can be described by its slope and intercept.

The y-intercept is the height of the line when x is 0.

The slope is the rate at which y increases, per unit increase in x.

The equation of a line can be written in terms of its slope and intercept:

y = slope × x + intercept


3.1 Father and Son Example

Average height of fathers ≈ 68 in., SD of height of fathers ≈ 2.7 in.

Average height of sons ≈ 69 in., SD of height of sons ≈ 2.7 in.

r ≈ 0.5

S.D. line: the line where a 1 SD change in X is matched by a 1 SD change in Y.

[Figure: scatter plot showing the S.D. line and the regression line.]


3.1 Father and Son Example

Take the fathers who are 72 inches tall:

average X + 1.5 × SD of X ≈ 72 in.

The average height of their sons is only 71 inches.

average Y + 0.5 × 1.5 × SD of Y ≈ 71 in.

What about the fathers who are 64 inches tall?

average X − 1.5 × SD of X ≈ 64 in.

The average height of their sons is 67 inches, only 0.75 SD below the sons' average.

average Y − 0.5 × 1.5 × SD of Y ≈ 67 in.


3.1 Father and Son Example

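[Figure: regression line with fathers' height on the horizontal axis and sons' height on the vertical axis. Marked on the axes: average fathers' ht. 68 in., average sons' ht. 69 in. A run of 1.5 × SD of fathers' ht. corresponds to a rise of r × 1.5 × SD of sons' ht.]

Slope = (r × 1.5 × SD of sons' ht.) / (1.5 × SD of fathers' ht.) = 0.5 in./in.

68 in. × 0.5 in./in. = 34 in.

Intercept = 69 in. − 34 in. = 35 in.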
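The arithmetic for the father and son line can be checked with a short script. This is only a sketch using the rounded summary figures quoted on these slides (averages 68 in. and 69 in., both SDs 2.7 in., r about 0.5); the name `predict_son` is introduced here for illustration.

```python
# Rounded summary statistics from the father and son example.
avg_father, sd_father = 68.0, 2.7
avg_son, sd_son = 69.0, 2.7
r = 0.5

# Rise over run: r SDs of sons' height for each SD of fathers' height.
slope = r * sd_son / sd_father

# The line passes through the point of averages, which fixes the intercept.
intercept = avg_son - slope * avg_father

def predict_son(father_height):
    """Regression estimate of a son's height (inches) from his father's height."""
    return slope * father_height + intercept

print("slope:", slope, "in./in.")
print("intercept:", intercept, "in.")
for father in (64, 68, 72):
    print(f"father {father} in. -> predicted son {predict_son(father):.1f} in.")
```

The predictions for 72-inch and 64-inch fathers come out at about 71 and 67 inches, matching the group averages worked out from standard units.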


4.0 The Regression Line

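The regression line for predicting Y from X has the form:

$$Y = \text{slope} \times X + \text{intercept}$$

where

$$b = \text{slope} = \frac{r \times \text{S.D. of } Y}{\text{S.D. of } X},$$

$$a = \text{intercept} = \bar{Y} - b\,\bar{X} = \bar{Y} - \left(\frac{r \times \text{S.D. of } Y}{\text{S.D. of } X}\right)\bar{X}.$$

Here $\bar{X}$ and $\bar{Y}$ are the averages of X and Y.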
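When only raw data are available, the five summary statistics have to be computed first. The sketch below does that and then applies the slope and intercept formulas. It assumes r is computed as the average product of the two variables in standard units (a definition not restated on these slides), and the data pairs and the `regression_line` name are hypothetical.

```python
import math

def regression_line(xs, ys):
    """Return (slope, intercept) for predicting y from x using r and the two S.D.s."""
    n = len(xs)
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - avg_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - avg_y) ** 2 for y in ys) / n)
    # Assumed definition: r is the average of the products of standard units.
    r = sum(((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
            for x, y in zip(xs, ys)) / n
    slope = r * sd_y / sd_x
    intercept = avg_y - slope * avg_x
    return slope, intercept

# Hypothetical data, made up for illustration.
xs = [64, 66, 68, 70, 72]
ys = [67, 67, 69, 71, 71]

slope, intercept = regression_line(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Dividing by n rather than n − 1 in both S.D.s is a choice made here; the slope is the same either way as long as r and the S.D.s use the same divisor.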


4.1 Prediction from a Regression Line

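The predicted value of Y for a given value of X, say $X^*$, has the form:

$$Y^* = b\,X^* + a = \left(\frac{r \times \text{S.D. of } Y}{\text{S.D. of } X}\right)X^* + \bar{Y} - \left(\frac{r \times \text{S.D. of } Y}{\text{S.D. of } X}\right)\bar{X}$$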
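It may help to regroup the terms of this formula so that it matches the standard-units reasoning used in the father and son example. The rearrangement below is a sketch using only the slope and intercept definitions, with $\bar{X}$ and $\bar{Y}$ denoting the averages of X and Y:

```latex
\begin{align*}
Y^* &= b\,X^* + a \\
    &= b\,X^* + \bar{Y} - b\,\bar{X} \\
    &= \bar{Y} + b\,(X^* - \bar{X}) \\
    &= \bar{Y} + r \times \frac{X^* - \bar{X}}{\text{S.D. of } X} \times \text{S.D. of } Y .
\end{align*}
```

In words: convert $X^*$ to standard units, multiply by r, and move that many S.D.s of Y away from the average of Y. For 72-inch fathers this is 69 + 0.5 × 1.5 × 2.7 ≈ 71 in.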


4.2 Example

A university has made a statistical analysis of the relationship between Math S.A.T. scores (ranging from 200 to 800) and first year G.P.A.s (ranging from 0 to 4.0), for students who complete the first year. The results:

average S.A.T. score = 550, S.D. = 80,
average first year G.P.A. = 2.6, S.D. = 0.6,
r = 0.4

The scatter diagram is football shaped. A student is chosen at random, and has an S.A.T. of 650. Predict this individual’s first year G.P.A.
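One way to work this example is in standard units: convert the S.A.T. score, multiply by r, and convert back to G.P.A. points. The sketch below uses only the summary figures quoted in the example; the function name `predict_gpa` is introduced here for illustration.

```python
# Summary statistics quoted in the example.
avg_sat, sd_sat = 550.0, 80.0
avg_gpa, sd_gpa = 2.6, 0.6
r = 0.4

def predict_gpa(sat):
    """Regression estimate of first year G.P.A. from a Math S.A.T. score."""
    z_sat = (sat - avg_sat) / sd_sat   # S.A.T. score in standard units
    z_gpa = r * z_sat                  # predicted G.P.A. in standard units
    return avg_gpa + z_gpa * sd_gpa    # back to G.P.A. points

print(round(predict_gpa(650), 2))
```

A score of 650 is 1.25 S.D.s above average, so the predicted G.P.A. is 0.4 × 1.25 = 0.5 S.D.s above average, or about 2.9.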


5.0 Regression and Least Squares

The regression line is familiarly referred to as the least squares line. This is because, among all lines, it minimizes the sum of the squares of the vertical distances from the data points to the line.

[Figure 15.3: a data point and its vertical distance to the regression line, plotted on x and y axes.]
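The least squares property can be checked by brute force on a tiny data set. In the sketch below the four data points are hypothetical; for them r = 0.6 and the two S.D.s are equal, so the regression formulas give slope 0.6 and intercept 1.0. The code searches a grid of candidate lines and reports which one has the smallest sum of squared vertical distances.

```python
def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the data points to a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: averages are both 2.5, the S.D.s are equal, and r = 0.6,
# so the regression formulas give slope 0.6 and intercept 1.0.
xs = [1, 2, 3, 4]
ys = [2, 1, 4, 3]

# Candidate lines: slopes 0.0 to 1.2 and intercepts 0.0 to 2.0, in steps of 0.1.
candidates = [(s / 10, a / 10) for s in range(0, 13) for a in range(0, 21)]

best_slope, best_intercept = min(
    candidates, key=lambda line: sum_of_squares(xs, ys, line[0], line[1])
)

print("best line on the grid:", best_slope, best_intercept)
print("its sum of squares:", round(sum_of_squares(xs, ys, best_slope, best_intercept), 2))
```

A grid search is used only to make the minimization visible; it is not how the regression line is computed in practice.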


6.0 The Regression Fallacy

In virtually every scatterplot with less than perfect correlation, the data points that are extreme along the x axis tend not to be as extreme on the y axis. This is called the regression effect.

Definition: Thinking that the regression effect must be due to something important, not just chance error, is called the regression fallacy.


6.1 Example

An instructor standardizes both her midterm and the final each semester so the class average is 50 and the S.D. is 10 on both tests. The correlation between the tests is around 0.5. One semester she took all the students who scored below 30 on the midterm and gave them special tutoring. On average, they gained 10 points on the final. She claims that her tutoring worked. Can you give her an alternate explanation?
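To weigh the instructor's claim, it helps to see what the regression method alone predicts for students with low midterm scores, with no tutoring at all. The sketch below uses the figures from the example and assumes the midterm and final scores form a football-shaped scatter, which the example does not state; `predicted_final` is a name introduced here.

```python
# Figures from the example: both tests have average 50 and S.D. 10, with r about 0.5.
# Assumption (not stated in the example): the scatter diagram is football shaped.
avg, sd, r = 50.0, 10.0, 0.5

def predicted_final(midterm):
    """Regression estimate of the final score from the midterm score."""
    z_midterm = (midterm - avg) / sd   # midterm score in standard units
    return avg + r * z_midterm * sd    # only r times as many S.D.s from average

for midterm in (30, 25, 20):
    final = predicted_final(midterm)
    print(f"midterm {midterm}: predicted final {final:.1f}, gain {final - midterm:.1f}")
```

A midterm score of 30 is 2 S.D.s below average, so the predicted final is only 1 S.D. below average, a gain of 10 points. Lower midterm scores give larger predicted gains, so an average gain of 10 points in this group is about what the regression effect would produce by itself.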