Anthony Greene 1
Regression
Using Correlation To Make Predictions
Making a prediction
ẑy = r·zx
This formula gives the predicted z-score of y from a known z-score of x and a known correlation r.
Note what happens for positive and negative values of r and for high and low values of r and for near-zero values of r.
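As a quick sketch (Python is not part of these slides, and the function name `predict_zy` is made up for illustration), the rule ẑy = r·zx can be tried out directly for the cases the slide mentions:

```python
# Hypothetical illustration of the slide's prediction rule: z_y_hat = r * z_x.
def predict_zy(r, zx):
    """Predicted z-score of y, given correlation r and the z-score of x."""
    return r * zx

# Strong positive r: the prediction stays far from the mean (z = 0).
print(predict_zy(0.9, 2.0))   # 1.8
# Near-zero r: the prediction collapses toward the mean.
print(predict_zy(0.05, 2.0))  # 0.1
# Negative r: the prediction flips sign.
print(predict_zy(-0.9, 2.0))  # -1.8
```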
Graph of y = 5 – 3 x
y-Intercept and Slope
For a linear equation y = a + bx, the constant a is the y-intercept and the constant b is the slope.
x and y are related variables
Straight-line graphs of three linear equations
y = a + bx
a = y-intercept
b = slope (rise/run)
Graphical Interpretation of Slope
The straight-line graph of the linear equation y = a + bx slopes upward if b > 0, slopes downward if b < 0, and is horizontal if b = 0.
Graphical interpretation of slope
Four data points
Scatter plot
Two possible straight-line fits to the data points
Determining how well the data points are fit by Line A vs. Line B
Least-Squares Criterion
The straight line that best fits a set of data points is the one having the smallest possible sum of squared errors. Recall that the sum of squared errors is error variance.
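To make the criterion concrete, here is a small Python sketch (the data points and the two candidate lines are made up) that scores two lines by their sum of squared errors, as in the Line A vs. Line B comparison:

```python
# Made-up data points and two candidate lines, scored by the
# least-squares criterion: smaller SSE = better fit.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

def sse(a, b):
    """Sum of squared vertical errors for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

line_A = sse(0.1, 2.0)   # a line close to the data
line_B = sse(1.0, 1.0)   # a poorer fit
print(line_A < line_B)   # True: Line A wins under least squares
```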
Regression Line and Regression Equation
Regression line: the straight line that best fits a set of data points according to the least-squares criterion.
Regression equation: the equation of the regression line.
The best-fit line minimizes the squared vertical distances between the actual data points and the predicted values
Residual, e, of a data point
Notation Used in Regression and Correlation
We define SSx, SP, and SSy by
SSx = Σ(x − Mx)²
SP = Σ(x − Mx)(y − My)
SSy = Σ(y − My)²
Or the computational formulas:
SSx = Σx² − (Σx)²/n
SP = Σxy − (Σx)(Σy)/n
SSy = Σy² − (Σy)²/n
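A quick Python check (not part of the slides; the data set is made up) that the definitional and computational formulas give the same values:

```python
# Verify that the definitional and computational formulas for
# SSx, SP, and SSy agree on a small made-up data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)
Mx, My = sum(xs) / n, sum(ys) / n

# Definitional forms (deviations from the means).
SSx_def = sum((x - Mx) ** 2 for x in xs)
SP_def  = sum((x - Mx) * (y - My) for x, y in zip(xs, ys))
SSy_def = sum((y - My) ** 2 for y in ys)

# Computational forms (raw sums only, no means needed).
SSx_comp = sum(x * x for x in xs) - sum(xs) ** 2 / n
SP_comp  = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
SSy_comp = sum(y * y for y in ys) - sum(ys) ** 2 / n

print(abs(SSx_def - SSx_comp) < 1e-9)  # True
print(abs(SP_def - SP_comp) < 1e-9)    # True
print(abs(SSy_def - SSy_comp) < 1e-9)  # True
```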
Regression Equation
The regression equation for a set of n data points is
ŷ = a + bx, where b = SP/SSx and a = (1/n)(Σy − bΣx) = My − bMx
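A short Python sketch (made-up data, not from the slides) of computing the slope and intercept from SP and SSx:

```python
# Compute b = SP/SSx and a = My - b*Mx on a small made-up data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)
Mx, My = sum(xs) / n, sum(ys) / n

SSx = sum((x - Mx) ** 2 for x in xs)
SP = sum((x - Mx) * (y - My) for x, y in zip(xs, ys))

b = SP / SSx        # slope
a = My - b * Mx     # intercept, same as (1/n)(Σy - bΣx)

print(round(b, 10))     # 0.8
print(round(a, 10))     # 0.6
yhat = a + b * 3.0      # prediction at x = 3 (the mean of x)
print(round(yhat, 10))  # 3.0: the point (Mx, My) lies on the line
```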
The relationship between b and r
b = SP/SSx and r = SP/√(SSx·SSy), so b = r(sy/sx); this follows because sx = √(SSx/n) and sy = √(SSy/n) have the same n.
• That is, the regression slope is just the correlation coefficient scaled up to the right size for the variables x and y.
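A numerical sanity check in Python (made-up data, not from the slides) that b = SP/SSx equals r(sy/sx):

```python
# Check numerically that the slope b = SP/SSx equals r * (sy/sx).
import math

xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [1.0, 3.0, 2.0, 6.0, 8.0]
n = len(xs)
Mx, My = sum(xs) / n, sum(ys) / n

SSx = sum((x - Mx) ** 2 for x in xs)
SSy = sum((y - My) ** 2 for y in ys)
SP = sum((x - Mx) * (y - My) for x, y in zip(xs, ys))

b = SP / SSx
r = SP / math.sqrt(SSx * SSy)
sx, sy = math.sqrt(SSx / n), math.sqrt(SSy / n)  # the n's cancel in sy/sx

print(abs(b - r * sy / sx) < 1e-12)  # True
```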
ŷ = bx + a
Recall a = My − bMx, so ŷ = bx + My − bMx = My + b(x − Mx).
Recall b = r(sy/sx), so ŷ = My + r(sy/sx)(x − Mx).
Rearranging, (ŷ − My)/sy = r(x − Mx)/sx, that is, ẑy = r·zx.
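The derivation can be checked numerically. This Python sketch (made-up data, not from the slides) confirms that the raw-score form a + bx and the z-score form give the same prediction:

```python
# Check that y_hat = a + b*x agrees with the z-score form z_y_hat = r*z_x.
import math

xs = [1.0, 3.0, 4.0, 6.0, 8.0]
ys = [2.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)
Mx, My = sum(xs) / n, sum(ys) / n
SSx = sum((x - Mx) ** 2 for x in xs)
SSy = sum((y - My) ** 2 for y in ys)
SP = sum((x - Mx) * (y - My) for x, y in zip(xs, ys))
sx, sy = math.sqrt(SSx / n), math.sqrt(SSy / n)
r = SP / math.sqrt(SSx * SSy)

b = SP / SSx
a = My - b * Mx

x0 = 5.0
yhat_line = a + b * x0             # raw-score form
zx = (x0 - Mx) / sx
yhat_z = My + r * zx * sy          # undo the z-score: My + z_y_hat * sy

print(abs(yhat_line - yhat_z) < 1e-9)  # True
```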
Criterion for Finding a Regression Line
Before finding a regression line for a set of data points, draw a scatter diagram. If the data points do not appear to be scattered about a straight line, do not determine a regression line.
Linear regression requires linear data:
(a) Data points scattered about a curve
(b) Inappropriate straight-line fit to the data
Higher-order regression equations exist but are outside the range of this course.
Uniform Variance
[Figure: math proficiency scores (0–100) plotted by grade (1–5), illustrating roughly uniform variance across grades.]
Assumptions for Regression Inferences
Table for obtaining the three sums of squares for the used car data
Regression line and data points for used car data
What is a fair asking price for a 2.5 year old car?
ŷ = 195.47 − 20.26x
ŷ = 195.47 − 20.26(2.5)
ŷ = 144.82
So since the price unit is $100s, the best prediction is $14,482
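As a sketch (the function name is made up; the fitted equation ŷ = 195.47 − 20.26x with price in $100s is from the slide), the plug-in step can be written as a small Python function:

```python
# Prediction step for the used-car example: price is in hundreds of
# dollars, x is the car's age in years.
def predicted_price_dollars(age_years):
    y_hat = 195.47 - 20.26 * age_years  # price in $100s
    return round(y_hat * 100)           # convert to dollars

print(predicted_price_dollars(2.5))  # 14482
```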
Extrapolation in the used car example
Sums of Squares in Regression
Total sum of squares, SST: the variation in the observed values of the response variable:
SST = Σ(y − My)² = SSy
Regression sum of squares, SSR: the variation in the observed values of the response variable that is explained by the regression:
SSR = Σ(ŷ − My)² = SP²/SSx
Error sum of squares, SSE: the variation in the observed values of the response variable that is not explained by the regression:
SSE = Σ(y − ŷ)² = SSy − SP²/SSx
Regression Identity
The total sum of squares equals the regression sum of squares plus the error sum of squares. In symbols,
SST = SSR + SSE.
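A quick Python check of the identity on made-up data (the identity holds exactly for a least-squares fit):

```python
# Verify the regression identity SST = SSR + SSE numerically.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 2.5, 4.0, 3.5, 5.5, 6.0]
n = len(xs)
Mx, My = sum(xs) / n, sum(ys) / n
SSx = sum((x - Mx) ** 2 for x in xs)
SP = sum((x - Mx) * (y - My) for x, y in zip(xs, ys))
b = SP / SSx
a = My - b * Mx
yhat = [a + b * x for x in xs]

SST = sum((y - My) ** 2 for y in ys)
SSR = sum((yh - My) ** 2 for yh in yhat)
SSE = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))

print(abs(SST - (SSR + SSE)) < 1e-9)  # True
```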
Graphical portrayal of regression for used cars
y = a + bx
What sort of things could regression be used for?
Wherever a known correlation exists, regression can be used to predict a new score. Examples:
1. If you knew that there was a past correlation between the amount of study time and the grade on an exam, you could make a good prediction about the grade before it happened.
2. If you knew that certain features of a stock correlate with its price, you can use regression to predict the price before it happens.
Regression Example: Low Correlation
[Scatter plot: height (y-axis, 0–90) vs. weight (x-axis, 0–350) for the data below.]
Find the regression equation for predicting height based on knowledge of weight. The existing data are for 10 male stats students.
X Y XY X² Y²
287.00 75.00 21,525.00 82,369.00 5,625.00
300.00 71.00 21,300.00 90,000.00 5,041.00
255.00 80.00 20,400.00 65,025.00 6,400.00
180.00 69.00 12,420.00 32,400.00 4,761.00
130.00 70.00 9,100.00 16,900.00 4,900.00
215.00 77.00 16,555.00 46,225.00 5,929.00
165.00 71.00 11,715.00 27,225.00 5,041.00
240.00 71.00 17,040.00 57,600.00 5,041.00
160.00 72.00 11,520.00 25,600.00 5,184.00
150.00 65.00 9,750.00 22,500.00 4,225.00
Σ 2,082.00 721.00 151,325.00 465,844.00 52,147.00
SSx = Σx² − (Σx)²/n = 465,844 − 433,472.4 = 32,371.6
SP = Σxy − (Σx)(Σy)/n = 151,325 − 150,112.2 = 1,212.8
b = SP/SSx, so b = 1,212.8/32,371.6 = 0.037
a = (1/n)(Σy − bΣx), so a = 0.1(721 − 78.0) = 64.3
So ŷ = 0.037x + 64.3

X Y XY X² Y²
Σ 2,082 721 151,325 465,844 52,147
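The same computation can be redone in Python from the column totals, which also shows the unrounded values (note that b = 1,212.8/32,371.6 is closer to 0.037 than 0.03):

```python
# Recompute the height-on-weight regression from the column totals
# (n = 10, Σx = 2,082, Σy = 721, Σxy = 151,325, Σx² = 465,844).
n = 10
sum_x, sum_y = 2082.0, 721.0
sum_xy, sum_x2 = 151325.0, 465844.0

SSx = sum_x2 - sum_x ** 2 / n       # 465,844 - 433,472.4
SP = sum_xy - sum_x * sum_y / n     # 151,325 - 150,112.2
b = SP / SSx
a = (sum_y - b * sum_x) / n

print(round(SSx, 1))  # 32371.6
print(round(SP, 1))   # 1212.8
print(round(b, 3))    # 0.037
print(round(a, 1))    # 64.3
```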
ŷ = 0.037x + 64.3
[Scatter plot with the fitted regression line: height (0–90) vs. weight (0–350).]
Regression Example: High Correlation
Find the regression equation for predicting the probability of a teenage suicide attempt based on weekly heroin usage.
[Figure: attempt probability (0–100 scale) by weekly usage (1–7), shown for 2000, 2001, and 2002.]
X Y XY X² Y²
1 0.2 0.2 1 0.04
1 0.31 0.31 1 0.0961
1 0.18 0.18 1 0.0324
2 0.27 0.54 4 0.0729
2 0.38 0.76 4 0.1444
2 0.46 0.92 4 0.2116
3 0.9 2.7 9 0.81
3 0.58 1.74 9 0.3364
3 0.45 1.35 9 0.2025
4 0.84 3.36 16 0.7056
4 0.74 2.96 16 0.5476
4 0.68 2.72 16 0.4624
5 0.85 4.25 25 0.7225
5 0.78 3.9 25 0.6084
5 0.73 3.65 25 0.5329
6 0.88 5.28 36 0.7744
6 0.82 4.92 36 0.6724
6 0.78 4.68 36 0.6084
7 0.92 6.44 49 0.8464
7 0.85 5.95 49 0.7225
7 0.91 6.37 49 0.8281
Σ 84 13.51 63.18 420 9.9779
n = 21
SSx = Σx² − (Σx)²/n = 420 − 336 = 84
SP = Σxy − (Σx)(Σy)/n = 63.18 − 54.04 = 9.14
b = SP/SSx, so b = 9.14/84 = 0.109
a = (1/n)(Σy − bΣx), so a = (1/21)(13.51 − 9.156) = 0.207
So ŷ = 0.109x + 0.207

X Y XY X² Y²
Σ 84 13.51 63.18 420 9.9779
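Redoing this computation in Python from the column totals (note that a comes out to 0.208 before rounding; a value of 0.207 results from rounding b to 0.109 first):

```python
# Recompute the heroin-usage regression from the column totals
# (n = 21, Σx = 84, Σy = 13.51, Σxy = 63.18, Σx² = 420).
n = 21
sum_x, sum_y = 84.0, 13.51
sum_xy, sum_x2 = 63.18, 420.0

SSx = sum_x2 - sum_x ** 2 / n     # 420 - 336 = 84
SP = sum_xy - sum_x * sum_y / n   # 63.18 - 54.04 = 9.14
b = SP / SSx                      # ≈ 0.109
a = (sum_y - b * sum_x) / n       # ≈ 0.208 unrounded

print(round(b, 3))  # 0.109
print(round(a, 3))  # 0.208
```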
Why Is It Called Regression?
• For low correlations, the predicted value is close to the mean.
• For zero correlations, the prediction is the mean.
• Only for perfect correlations (R² = 1.0) do the predicted scores show as much variation as the actual scores.
• Since perfect correlations are rare, we say that the predicted scores show regression toward the mean.
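These bullets can be demonstrated numerically: the standard deviation of the predicted scores is |r| times the standard deviation of the actual scores, so unless |r| = 1 the predictions cluster closer to the mean. A Python sketch on made-up data:

```python
# Regression toward the mean: sd(y_hat) = |r| * sd(y), so predictions
# vary less than actual scores whenever |r| < 1.
import math

xs = [1.0, 2.0, 4.0, 5.0, 8.0]
ys = [3.0, 1.0, 6.0, 4.0, 9.0]
n = len(xs)
Mx, My = sum(xs) / n, sum(ys) / n
SSx = sum((x - Mx) ** 2 for x in xs)
SSy = sum((y - My) ** 2 for y in ys)
SP = sum((x - Mx) * (y - My) for x, y in zip(xs, ys))
r = SP / math.sqrt(SSx * SSy)
b = SP / SSx
a = My - b * Mx

yhat = [a + b * x for x in xs]

def sd(vs):
    """Population standard deviation."""
    m = sum(vs) / len(vs)
    return math.sqrt(sum((v - m) ** 2 for v in vs) / len(vs))

print(abs(sd(yhat) - abs(r) * sd(ys)) < 1e-9)  # True
print(sd(yhat) < sd(ys))                       # True whenever |r| < 1
```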