8/3/2019 Presentation Stats Updated
REGRESSION MODELS
By:
Ayush Sharma 09
Mickey Haldia 19
Prerna Makhijani 29
Sanoj George 39
Sushant Jaggi 49
Nitish Dorle 59
Example
Year    Population on Farm (in millions)
1935    32.1
1940    30.5
1945    24.4
1950    23.0
1955    19.1
1960    15.6
1965    12.5
Scatter Plot
[Figure: scatter plot of population on farm (in millions) against year, 1930 to 1970]
Correlation Coefficient (r)
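It is a measure of the strength of the linear relationship between two variables and is calculated using the following formula:
$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} $$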
Interpretation
After calculating we find r = -0.993
There is a strong negative correlation.
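The value of r quoted above can be checked against the farm population table. The following is a minimal sketch in plain Python; the function name `pearson_r` is only an illustrative choice, and the deviation-from-the-mean form used here is algebraically equivalent to the usual sum-based formula.

```python
import math

# Farm population example from the slides: year vs. population on farm (millions).
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def pearson_r(x, y):
    """Sample correlation coefficient from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(years, population)
print(round(r, 3))  # -0.993
```

A value this close to -1 is what the slide describes as a strong negative correlation.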
Coefficient of Determination
Squaring the correlation coefficient (r) gives the proportion of the variation in the y-variable that is explained by the variation in the x-variable.
To relate x and y, the regression equation is calculated using the least squares technique.
Regression equation: Y = a + bX
Slope of the regression line:
$$ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} $$
To continue with the example
We found r = -0.993. By squaring we get the
Coefficient of Determination (R^2) = 0.987
[Figure: scatter plot of population on farm (in millions) against year, with the fitted regression line y = -0.671x + 1,330.350, R^2 = 0.987]
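The fitted line and R^2 for the farm example can be reproduced with a short least squares calculation. This is a sketch in plain Python; `fit_line` and `r_squared` are illustrative names rather than anything from the original slides, and the intercept is taken as the mean of Y minus the slope times the mean of X.

```python
# Farm population example from the slides: year vs. population on farm (millions).
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def fit_line(x, y):
    """Least squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def r_squared(x, y, a, b):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


a, b = fit_line(years, population)
r2 = r_squared(years, population, a, b)
print(round(b, 3), round(a, 2), round(r2, 3))  # -0.671 1330.35 0.987
```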
Interpretation
We conclude that 98.7% of the variation in farm population can be explained by the progression of time.
Theoretically, population is the dependent variable (y-axis) and time is the independent variable (x-axis).
Assumptions of the Regression Model
The following assumptions are made about the errors:
a) The errors are independent
b) The errors are normally distributed
c) The errors have a mean of zero
d) The errors have a constant variance (regardless of the value of X)
Patterns of Indicating Errors
[Figure: plots of the error against X]
Estimating the Variance
The error variance is estimated by the mean squared error (MSE):
$$ s^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n - k - 1} $$
where n = number of observations in the sample
k = number of independent variables
Therefore the standard deviation will be
$$ s = \sqrt{\mathrm{MSE}} $$
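As an illustration, SSE, MSE and s can be computed for the farm population example (n = 7, k = 1). This is a sketch in plain Python with illustrative function names. The line is refitted inside the code rather than using the rounded coefficients from the chart, because rounding the slope to three decimals noticeably distorts the errors when X is a year near 1950.

```python
import math

# Farm population example from the slides: year vs. population on farm (millions).
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def fit_line(x, y):
    """Least squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def error_variance(x, y, k=1):
    """Return (SSE, MSE, s) with MSE = SSE / (n - k - 1)."""
    a, b = fit_line(x, y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (len(x) - k - 1)
    return sse, mse, math.sqrt(mse)


sse, mse, s = error_variance(years, population)
print(f"SSE={sse:.3f}  MSE={mse:.3f}  s={s:.3f}")
```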
Testing the Model for Significance
MSE and the coefficient of determination ($r^2$) do not provide a good measure of accuracy when the sample size is small.
In this case, it is necessary to test the model for significance.
The linear model is given by
$$ Y = \beta_0 + \beta_1 X + \epsilon $$
Null hypothesis: $\beta_1 = 0$, so there is no linear relationship between X and Y
Alternate hypothesis: $\beta_1 \neq 0$, so there is a linear relationship
Steps in Hypothesis Test for a Significant
Regression Model
1. Specify the null and alternative hypotheses.
2. Select the level of significance ($\alpha$). Common values are between 0.01 and 0.05.
3. Calculate the value of the test statistic using the formula:
F = MSR/MSE
4. Make a decision using one of the following methods:
a) Reject the null hypothesis if F calculated > F table
b) Reject the null hypothesis if p-value < $\alpha$
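The steps above can be sketched in a few lines of Python. The sums of squares are the ones reported in the first Jenny Wilson Realty ANOVA table in this deck (two independent variables, 14 observations). The critical value `F_TABLE = 3.98` is an assumption: it is the figure usually given in a printed F table for $\alpha$ = 0.05 with 2 and 11 degrees of freedom and should be checked against whichever table is in use. The function names are illustrative.

```python
def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE, with MSR = SSR / k and MSE = SSE / (n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


def reject_null(f_calculated, f_table):
    """Decision rule (a): reject the null hypothesis if F calculated > F table."""
    return f_calculated > f_table


# Regression and residual sums of squares from the ANOVA output in the deck.
SSR = 13313936968
SSE = 6502131603
N_OBS = 14
K_VARS = 2

# Assumed critical value for alpha = 0.05 with (2, 11) degrees of freedom.
F_TABLE = 3.98

f_calc = f_statistic(SSR, SSE, N_OBS, K_VARS)
print(round(f_calc, 3), reject_null(f_calc, F_TABLE))  # 11.262 True
```

Rule (b) leads to the same decision here, since the Significance F reported in that output (about 0.0022) is below 0.05.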
Multiple Regression Analysis
More than one independent variable
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon $$
where
Y = dependent variable (response variable)
$X_i$ = ith independent variable (predictor variable or explanatory variable)
$\beta_0$ = intercept (value of Y when all $X_i$ = 0)
$\beta_i$ = coefficient of the ith independent variable
k = number of independent variables
$\epsilon$ = random error
To estimate the values of these coefficients, a sample is taken and the following equation is developed:
$$ \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k $$
where
$\hat{Y}$ = predicted value of Y
$b_0$ = sample intercept (and is an estimate of $\beta_0$)
$b_i$ = sample coefficient of the ith variable (and is an estimate of $\beta_i$)
Jenny Wilson Realty

Selling Price ($)   Square Footage   Age   Condition
95000               1926             30    Good
119000              2069             40    Excellent
124800              1720             30    Excellent
135000              1396             15    Good
142800              1706             32    Mint
145000              1847             38    Mint
159000              1950             27    Mint
165000              2323             30    Excellent
182000              2285             26    Mint
183000              3752             35    Good
200000              2300             18    Good
211000              2525             17    Good
215000              3800             40    Excellent
219000              1740             12    Mint

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.819680305
R Square             0.671875802
Adjusted R Square    0.612216857
Standard Error       24312.60729
Observations         14

ANOVA
             df   SS            MS        F        Significance F
Regression   2    13313936968   6.7E+09   11.262   0.002178765
Residual     11   6502131603    5.9E+08
Total        13   19816068571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   146630.89      25482.08287      5.75427   0.0001    90545.20735   202717      90545         202717
SF          43.819366      10.28096507      4.26218   0.0013    21.19111495   66.448      21.191        66.448
AGE         -2898.686      796.5649421      -3.639    0.0039    -4651.91386   -1145       -4651.9       -1145.5

Notes on the output:
The p-values are used to test the individual variables for significance
The coefficient of determination $r^2$ (R Square)
The regression coefficients (Coefficients column)
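The coefficients in the summary output can be reproduced from the 14 houses in the table. The sketch below is one possible approach, not necessarily how the spreadsheet computes them: it builds the normal equations and solves them by Gaussian elimination in plain Python. The helper names `solve` and `fit_multiple` are illustrative.

```python
# Jenny Wilson Realty data from the slide: square footage, age, selling price.
square_feet = [1926, 2069, 1720, 1396, 1706, 1847, 1950,
               2323, 2285, 3752, 2300, 2525, 3800, 1740]
age = [30, 40, 30, 15, 32, 38, 27, 30, 26, 35, 18, 17, 40, 12]
price = [95000, 119000, 124800, 135000, 142800, 145000, 159000,
         165000, 182000, 183000, 200000, 211000, 215000, 219000]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple(rows, y):
    """Least squares estimates [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


b0, b1, b2 = fit_multiple(list(zip(square_feet, age)), price)
print(round(b0, 2), round(b1, 3), round(b2, 3))
```

The printed values should agree with the Intercept, SF and AGE rows of the output to the precision shown there. For larger or badly scaled problems a numerical library would be a safer choice than hand-written elimination.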
Binary or Dummy Variables
Indicator Variable
Assigned a value of 1 if a particular condition is met, 0 otherwise
The number of dummy variables must equal one less than the number of categories of a qualitative variable
The Jenny Wilson Realty example:
X3 = 1 for excellent condition
= 0 otherwise
X4 = 1 for mint condition
= 0 otherwise
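The coding rule above can be written as a small function. This is a sketch: the name `encode_condition` is illustrative, and it treats "good" as the baseline category, which is what the definitions of X3 and X4 imply (three categories, two dummy variables).

```python
def encode_condition(condition):
    """Return (X3, X4) for a house condition, with 'good' as the baseline."""
    label = condition.strip().lower()
    if label not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    x3 = 1 if label == "excellent" else 0  # X3 = 1 for excellent condition
    x4 = 1 if label == "mint" else 0       # X4 = 1 for mint condition
    return x3, x4


for condition in ["GOOD", "Excellent", "Mint"]:
    print(condition, encode_condition(condition))
```

A house in good condition gets (0, 0), so its effect is absorbed into the intercept, and the coefficients of X3 and X4 measure the difference from that baseline.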
Jenny Wilson Realty

Selling Price ($)   Square Footage   Age   X3 (Exc.)   X4 (Mint)   Condition
95000               1926             30    0           0           Good
119000              2069             40    1           0           Excellent
124800              1720             30    1           0           Excellent
135000              1396             15    0           0           Good
142800              1706             32    0           1           Mint
145000              1847             38    0           1           Mint
159000              1950             27    0           1           Mint
165000              2323             30    1           0           Excellent
182000              2285             26    0           1           Mint
183000              3752             35    0           0           Good
200000              2300             18    0           0           Good
211000              2525             17    0           0           Good
215000              3800             40    1           0           Excellent
219000              1740             12    0           1           Mint

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94762
R Square             0.89798
Adjusted R Square    0.85264
Standard Error       14987.6
Observations         14

ANOVA
             df   SS            MS      F         Significance F
Regression   4    17794427451   4E+09   19.8044   0.000174421
Residual     9    2021641120    2E+08
Total        13   19816068571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   121658         17426.61432      6.9812    6.5E-05   82236.71393   161080      82236.71      161080
SF          56.4276        6.947516792      8.122     2E-05     40.71122594   72.144      40.71123      72.144
AGE         -3962.82       596.0278736      -6.6487   9.4E-05   -5311.12866   -2614.5     -5311.129     -2614.5
X3(Exc.)    33162.6        12179.62073      2.7228    0.0235    5610.432651   60714.9     5610.433      60715
X4(Mint)    47369.2        10649.26942      4.4481    0.0016    23278.92699   71459.6     23278.93      71460

The coefficient of age is negative, indicating that the price decreases as a house gets older.
Model Building
The value of $r^2$ can never decrease when more variables are added to the model
Adjusted $r^2$ is often used to determine if an additional independent variable is beneficial
The adjusted $r^2$ is
$$ \text{Adjusted } r^2 = 1 - \frac{\mathrm{SSE}/(n - k - 1)}{\mathrm{SST}/(n - 1)} $$
A variable should not be added to the model if it causes the adjusted $r^2$ to decrease
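As a check on the two Jenny Wilson Realty outputs, adjusted r^2 can be computed from r^2, n and k. The sketch below assumes one common form, 1 - (1 - r^2)(n - 1)/(n - k - 1), which follows from SSE/SST = 1 - r^2. The function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R Square values reported in the two summary outputs (n = 14 houses).
two_variable_model = adjusted_r2(0.671875802, n=14, k=2)   # SF, AGE
four_variable_model = adjusted_r2(0.89798, n=14, k=4)      # SF, AGE, X3, X4

print(round(two_variable_model, 5), round(four_variable_model, 5))  # 0.61222 0.85264
```

Adjusted r^2 rises from about 0.612 to about 0.853 when the two condition dummies are added, so by the rule above they are worth keeping in the model.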
Multiple Regression
Sales/Decision to buy = B0 + B1 * Price
Sales/Decision to buy = B0 + B1 * (Price)^3 + B2 * (Design)^2 + B3 * (Performance)
L = (Price)^3
M = (Design)^2
N = (Performance)
Sales/Decision to buy = B0 + B1 * L + B2 * M + B3 * N
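The substitution above keeps the model linear in its coefficients, so ordinary multiple regression still applies once the inputs are transformed. The sketch below assumes the trailing 3 and 2 in the slide are exponents. The function names and the coefficient values in the usage line are hypothetical and only show the mechanics.

```python
def transform(price, design, performance):
    """Map raw inputs to (L, M, N) = (price^3, design^2, performance)."""
    return price ** 3, design ** 2, performance


def predict(coefficients, price, design, performance):
    """Evaluate B0 + B1*L + B2*M + B3*N for coefficients [B0, B1, B2, B3]."""
    b0, b1, b2, b3 = coefficients
    l_term, m_term, n_term = transform(price, design, performance)
    return b0 + b1 * l_term + b2 * m_term + b3 * n_term


# Hypothetical coefficients, not estimated from any data in the slides.
example = predict([1.0, 2.0, 3.0, 4.0], price=2, design=3, performance=4)
print(example)  # 60.0
```

In practice the columns L, M and N would be computed for every observation and then passed to the same least squares routine used for any other multiple regression.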
Pitfalls In Regression
A high correlation does not mean one variable is causing a
change in another. (For example, some regressions have shown a
significantly positive relation between individuals' college
GPA and future salary.)
Values of the independent variable that are above or below
the ones from the sample should not be used for prediction.
The number of independent variables that should be used
in the model is limited by the number of observations.