Upload
dr-singh
View
237
Download
1
Embed Size (px)
Citation preview
8/3/2019 Statistics-Linear Regression and Correlation Analysis
1/50
1
Linear Regressionand
Correlation Analysis
Prof Rushen Chahal
8/3/2019 Statistics-Linear Regression and Correlation Analysis
2/50
2
Chapter Goals
To understand the methods for displaying and describingrelationship among two variables
Models
Linear regression
Correlations Frequency tables
Methods for Studying Relationships Graphical
Scatter plots
Line plots 3-D plots
8/3/2019 Statistics-Linear Regression and Correlation Analysis
3/50
3
Two Quantitative Variables
The response variable, also called thedependent variable, is the variable we want
to predict,and is usually denoted by y.
The explanatory variable, also called theindependent variable, is the variable thatattempts to explain the response, and isdenoted byx.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
4/50
4
Scatter Plots and Correlation
A scatter plot (or scatter diagram) is usedto show the relationship between twovariables
Correlation analysis is used to measurestrength of the association (linearrelationship) between two variables
Only concerned with strength of therelationship
No causal effect is implied
8/3/2019 Statistics-Linear Regression and Correlation Analysis
5/50
5
Example
The following graphshows the scatter plotofExam 1 score (x)
and Exam 2 score (y)for 354 students in aclass.
Is there a relationship
between x and y?
8/3/2019 Statistics-Linear Regression and Correlation Analysis
6/50
6
Scatter Plot Examples
y
x
y
x
y
y
x
x
Linear relationships Curvilinear relationships
8/3/2019 Statistics-Linear Regression and Correlation Analysis
7/50
7
Scatter Plot Examples
y
x
y
x
No relationship(continued)
8/3/2019 Statistics-Linear Regression and Correlation Analysis
8/50
8
Correlation Coefficient
The population correlation coefficient (rho) measures the strength of theassociation between the variables
The sample correlation coefficient ris an estimate of and is used tomeasure the strength of the linearrelationship in the sample observations
(continued)
8/3/2019 Statistics-Linear Regression and Correlation Analysis
9/50
9
Features of and r
Unit free
Range between -1 and 1
The closer to -1, the stronger thenegative linear relationship
The closer to 1, the stronger the positivelinear relationship
The closer to 0, the weaker the linearrelationship
8/3/2019 Statistics-Linear Regression and Correlation Analysis
10/50
10
Examples of Approximate r Values
yy
x
y
x
x
y
x
y
x
Tag with appropriate value: -1, -.6, 0, +.3, 1
8/3/2019 Statistics-Linear Regression and Correlation Analysis
11/50
11
Earlier Example
Correlations
.
.
.
.
earson orre a on
g. -ta e )
earson orre at on
g. -ta e )
Exam1
Exam2
orrea ton s sgn cant at t e . eve.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
12/50
12
Questions?
What kind of relationship would you expectin the following situations:
Age (in years) of a car, and its price.
Number of calories consumed per day andweight.
Height and IQ of a person.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
13/50
13
Exercise
Identify the two variables that vary and decidewhich should be the independent variable andwhich should be the dependent variable.
Sketch a graph that you think best representsthe relationship between the two variables.
1. The size of a persons vocabulary over his orher lifetime.
2. The distance from the ceiling to the tip of theminute hand of a clock hung on the wall.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
14/50
14
Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable basedon the value of at least one independent variable.
Explain the impact of changes in an independentvariable on the dependent variable.
Dependent variable: the variable we wish to
explain.Independent variable: the variable used to
explain the dependent variable.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
15/50
15
Simple Linear Regression
Model Only one independent variable, x.
Relationship between x and y isdescribed by a linear function.
Changes in y are assumed to be
caused by changes in x.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
16/50
16
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT
Linear
No Relationship
8/3/2019 Statistics-Linear Regression and Correlation Analysis
17/50
17
ni ,...,2,1; !! ii10i XY
Linear component
Population Linear Regression
The population regression model:
Populationy intercept
PopulationSlopeCoefficient
RandomError term,
or residual
Dependent
Variable
IndependentVariable
Random Errorcomponent
8/3/2019 Statistics-Linear Regression and Correlation Analysis
18/50
18
Linear Regression Assumptions
The Error terms i, i=1, 2. , n areindependent and i ~ Normal (0,
2).
The Error terms i, i=1, 2. , n have constant
variance 2
. The underlying relationship between the X
variable and the Y variable is linear.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
19/50
19
Population Linear Regression(continued)
Random Errorfor this xi value
Y
X
Observed Valueof y
i
for xi
Predicted Valueof yi for xi
ii10i XY !
xi
Slope = 1
Y-Intercept = 0
i
8/3/2019 Statistics-Linear Regression and Correlation Analysis
20/50
20
i10i XbbY !
The sample regression line provides an estimateof the population regression function.
Estimated Regression Model
Estimate ofthe regressiony-intercept
Estimate of theregression slopeEstimated
(or predicted)y value
Independent
variable
The individual random error terms ei have a mean of zero
The Regression Function: iiii XEXYE 1010 )()( FFIFF !!
8/3/2019 Statistics-Linear Regression and Correlation Analysis
21/50
21
Earlier Example
8/3/2019 Statistics-Linear Regression and Correlation Analysis
22/50
22
ResidualA residualis the difference between the observed
response yiand the predicted response i. Thus, foreach pair of observations (xi,yi), the i
th residual is
ei= yi i= yi (b0 + b1xi)
Least Squares Criterion b0 and b1 are obtained by finding the
values of b0 and b1 that minimize the sum
of the squared residuals.
2
10
22
x))b(b(y
)y(ye
!
!
8/3/2019 Statistics-Linear Regression and Correlation Analysis
23/50
23
b0 is the estimated average value of
y when the value of x is zero.
b1 is the estimated change in the
average value of y as a result of a
one-unit change in x.
Interpretation of theSlope and the Intercept
8/3/2019 Statistics-Linear Regression and Correlation Analysis
24/50
24
The Least Squares Equation
The formulas for b1 and b0 are:
algebraic equivalent:
!
nxx
n
yxxy
b2
2
1
)(
!
21 )(
))((
xx
yyxxb
xbyb 10 !and
8/3/2019 Statistics-Linear Regression and Correlation Analysis
25/50
25
Finding the Least Squares Equation
The coefficients b0 and b1 willusually be found using computersoftware, such as Excel, Minitab, or
SPSS.
Other regression measures will alsobe computed as part of computer-based regression analysis.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
26/50
26
Simple Linear RegressionExample
A real estate agent wishes to examine therelationship between the selling price of a
home and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (y) = house price in$1000s
Independent variable (x) = square feet
8/3/2019 Statistics-Linear Regression and Correlation Analysis
27/50
27
Sample Data for House PriceModel
House Price in $1000s(y)
Square Feet(x)
245 1400
312 1600279 1700
308 1875
199 1100
219 1550405 2350
324 2450
319 1425
255 1700
8/3/2019 Statistics-Linear Regression and Correlation Analysis
28/50
28
Model Summary
. a . . .
Adjusted
Std. Error of
re c ors: ons an , quare ee.
Coefficientsa
. . . .
. . . . .
ons an
quare ee
1
.
Unstandardized Standardized
.
epen en ar a e: ouse r ce.
SPSS Output
The regression equation is:
feet)(square0.11098.248pricehouse !
8/3/2019 Statistics-Linear Regression and Correlation Analysis
29/50
29
0
50
100
150
200
250
300
350
400
450
0 500 1000 1500 2000 2500 3000
Square Feet
HousePrice($1000s)
Graphical Presentation
House price model: scatter plot andregression line
feet)(square0.11098.248pricehouse !
Slope= 0.110
Intercept
= 98.248
8/3/2019 Statistics-Linear Regression and Correlation Analysis
30/50
30
Interpretation of the Intercept, b0
b0 is the estimated average value of Y when
the value of X is zero (if x = 0 is in the rangeof observed x values)
Here, no houses had 0 square feet, so
b0
= 98.24833 just indicates that, for houses
within the range of sizes observed,
$98,248.33 is the portion of the house price
not explained by square feet
feet)(square0.11098.248pricehouse !
8/3/2019 Statistics-Linear Regression and Correlation Analysis
31/50
31
Interpretation of theSlope Coefficient, b1
b1 measures the estimated change in
the average value of Y as a result of
a one-unit change in X
Here, b1 = .10977
tells us that the averagevalue of a house increases by .10977($1000)
= $109.77, on average, for each additional
one square foot of size
feet)(square0.1097798.24833pricehouse !
8/3/2019 Statistics-Linear Regression and Correlation Analysis
32/50
32
Least Squares Regression Properties
The sum of the residuals from the leastsquares regression line is 0 ( ).
The sum of the squared residuals is a
minimum (minimized ). The simple regression line always passes
through the mean of the y variable and the
mean of the x variable.
The least squares coefficients b0 and b1 are
unbiased estimates of 0 and 1.
0)( ! yy
2)( yy
8/3/2019 Statistics-Linear Regression and Correlation Analysis
33/50
33
Exercise
The growth of children from early childhood throughadolescence generally follows a linear pattern. Data onthe heights of female Americans during childhood, fromfour to nine years old, were compiled and the leastsquares regression line was obtained as = 32 + 2.4xwhere is the predicted height in inches, and xis age in
years. Interpret the value of the estimated slope b1 = 2. 4. Would interpretation of the value of the estimated
y-intercept, b0 = 32, make sense here? What would you predict the height to be for a female
American at 8 years old? What would you predict the height to be for a femaleAmerican at 25 years old? How does the quality of thisanswer compare to the previous question?
8/3/2019 Statistics-Linear Regression and Correlation Analysis
34/50
34
The coefficient of determination is theportion of the total variation in thedependent variable that is explained byvariation in the independent variable.
The coefficient of determination is alsocalled R-squared and is denoted as R2
Coefficient of Determination, R2
1R02ee
8/3/2019 Statistics-Linear Regression and Correlation Analysis
35/50
35
Coefficient of Determination, R2
(continued)
Note: In the single independent variable case, the coefficientof determination is
where:R2 = Coefficient of determinationr = Simple correlation coefficient
22
rR !
8/3/2019 Statistics-Linear Regression and Correlation Analysis
36/50
36
Examples of ApproximateR2 Values
y
x
y
x
y
x
y
x
8/3/2019 Statistics-Linear Regression and Correlation Analysis
37/50
37
Examples of ApproximateR2 Values
R2 = 0
No linear relationshipbetween x and y:
The value of Y does not
depend on x. (None of thevariation in y is explainedby variation in x)
y
xR2 = 0
8/3/2019 Statistics-Linear Regression and Correlation Analysis
38/50
38
SPSS OutputModel Summary
. a . . .1
Adjusted
Std. Error of
re c ors: ons an , quare ee.
ANOVAb
. . . . a
. .
.
egress on
es ua
ota
1
Sum of
.
re c ors: ons an , quare eea.
e en en ar a e: ouse r ce.
Coefficientsa
. . . .
. . . . .
onstant
quare eet
1.
Unstandardized Standardized
.
epen en ar a e: ouse r ce.
8/3/2019 Statistics-Linear Regression and Correlation Analysis
39/50
39
Standard Error of Estimate
The standard deviation of the variation ofobservations around the regression line iscalled the standard error of estimate
The standard error of the regressionslope coefficient (b1) is given by sb1
Is
8/3/2019 Statistics-Linear Regression and Correlation Analysis
40/50
40
Model Summary
. a . . .1
Adjusted
Std. Error of
re c ors: ons an , quare eea.
Coefficientsa
. . . .
. . . . .
(Constant)
quare ee
1
.
Unstandardized Standardized
.
epen en ar a e: ouse r ce.
SPSS Output
41.33032s !
0.03297s1b
!
8/3/2019 Statistics-Linear Regression and Correlation Analysis
41/50
41
Comparing Standard Errors
y
y y
x
x
x
y
x
1bssmall
1bslarge
Issmall
Islarge
Variation of observed y valuesfrom the regression line
Variation in the slope of regressionlines from different possible samples
8/3/2019 Statistics-Linear Regression and Correlation Analysis
42/50
42
Inference about the Slope: t-Test t-test for a population slope
Is there a linear relationship between X and Y? Null and alternate hypotheses
H0: 1 = 0 (no linear relationship)
H1: 1 { 0 (linear relationship does exist) Test statistic:
Degree of Freedom:
1b
11
s
bt
!
2nd.f. !
where:
b1 = Sample regression slopecoefficient
1 = Hypothesized slopesb1 = Estimator of the standard
error of the slope
8/3/2019 Statistics-Linear Regression and Correlation Analysis
43/50
43
House Pricein $1000s
(y)
Square Feet(x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
(sq.ft.)0.109898.25pricehouse !
Estimated Regression Equation:
The slope of this model is 0.1098
Does square footage of the houseaffect its sales price?
Example: Inference about the Slope:t Test
(continued)
8/3/2019 Statistics-Linear Regression and Correlation Analysis
44/50
44
Inferences about the Slope:t Test Example - Continue
H0: 1 = 0
HA: 1 { 0
Test Statistic: t = 3.329
There is sufficient evidencethat square footage affects
house price
From Excel output:
Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039
1bs t*
b1
Decision: Reject H0
Conclusion:
Reject H0Reject H0
E/2=.025
-t/2Do not reject H0
0t(1-/2)
E/2=.025
-2.3060 2.3060 3.329
d.f. = 10-2 = 8
8/3/2019 Statistics-Linear Regression and Correlation Analysis
45/50
45
Confidence Interval for the Slope
Confidence Interval Estimate of the Slope:
Excel Printout for House Prices:
At 95% level of confidence, the confidence interval forthe slope is (0.0337, 0.1858)
1b/211sb Es t
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
d.f. = n - 2
8/3/2019 Statistics-Linear Regression and Correlation Analysis
46/50
46
Confidence Interval for the Slope
Since the units of the house price variable is$1000s, we are 95% confident that the averageimpact on sales price is between $33.70 and$185.80 per square foot of house size
Coefficient
s
Standard
Error t Stat P-value Lower 95%
Upper
95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship betweenhouse price and square feet at the .05 level of significance
8/3/2019 Statistics-Linear Regression and Correlation Analysis
47/50
47
Residual Analysis
PurposesExamine for linearity assumption
Examine for constant variance for alllevels of x
Evaluate normal distribution assumption
Graphical Analysis of Residuals
Can plot residuals vs. xCan create histogram of residuals tocheck for normality
8/3/2019 Statistics-Linear Regression and Correlation Analysis
48/50
48
Residual Analysis for Linearity
Not Linear Linear
x
residua
ls
x
y
x
y
x
residua
ls
8/3/2019 Statistics-Linear Regression and Correlation Analysis
49/50
49
Residual Analysis for Constant Variance
Non-constant variance Constant variance
x x
y
x x
y
resid
uals
resid
uals
8/3/2019 Statistics-Linear Regression and Correlation Analysis
50/50
50
House Price Model Residual Plot
-60
-40
-20
0
20
40
60
80
0 1000 2000 3000
Square Feet
Residuals
Example: Residual Output
RESIDUAL OUTPUT
Predicted
House
Price Residuals
1 251.92316 -6.923162
2 273.87671 38.12329
3 284.85348 -5.853484
4 304.06284 3.937162
5 218.99284 -19.99284
6 268.38832 -49.38832
7 356.20251 48.79749
8 367.17929 -43.17929
9 254.6674 64.33264
10 284.85348 -29.85348