
8/3/2019 Presentation Stats Updated


REGRESSION MODELS

By:

Ayush Sharma 09

Mickey Haldia 19

Prerna Makhijani 29

Sanoj George 39

Sushant Jaggi 49

Nitish Dorle 59



Example

Year    Population on Farm (in millions)
1935    32.1
1940    30.5
1945    24.4
1950    23.0
1955    19.1
1960    15.6
1965    12.5



Scatter Plot

[Figure: scatter plot of population on farm (in millions) against year, 1930 to 1970]



Correlation Coefficient (r)

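It is a measure of the strength of the linear relationship between two variables and is calculated using the following formula:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$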
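As a rough check, the correlation coefficient for the farm population example can be computed directly from the data table. The sketch below uses the deviation form of the Pearson formula; the function name `pearson_r` is an illustrative choice, not something taken from the slides.

```python
import math

# Farm population example from the slides: year and population in millions.
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(years, population)
print(round(r, 3))  # -0.993
```

The result agrees with the value r = -0.993 quoted in the interpretation slide.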



Interpretation

After calculating we find r = -0.993

There is a strong negative correlation.



Coefficient of Determination

Squaring the correlation coefficient (r) gives the proportion of the variation in the y-variable that is explained by the variation in the x-variable.

To relate x and y, the regression equation is calculated using the least squares technique.

Regression equation: $Y = a + bX$

Slope of the regression line: $b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$
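The least squares calculation can be sketched for the farm population data as follows. The slope uses the standard least squares formula, and the intercept is taken as a = mean(Y) - b * mean(X), which is the usual companion formula but is an assumption here because the slides do not show it. The helper name `fit_line` is illustrative.

```python
# Farm population example from the slides: year and population in millions.
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared X deviations.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(years, population)
print(f"Y = {a:.3f} + ({b:.3f}) X")
```

The fitted values come out close to the equation y = -0.671x + 1,330.350 given in the example.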



To continue with the example

We found r = -0.993. By squaring we get the coefficient of determination ($R^2$) = 0.987.

Regression equation: y = -0.671x + 1,330.350 ($R^2$ = 0.987)

[Figure: regression line fitted to the scatter plot; x-axis: Year (1930 to 1970), y-axis: Population on Farm (in millions)]



Interpretation

We conclude that 98.7% of the variation in farm population can be explained by the progression of time.

Here, population is the dependent variable (y-axis) and time is the independent variable (x-axis).



Assumptions of the Regression Model

The following assumptions are made about the errors:

a) The errors are independent

b) The errors are normally distributed

c) The errors have a mean of zero

d) The errors have a constant variance (regardless of the value of X)
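A simple way to start examining these assumptions is to compute the residuals (observed minus fitted values) for the farm population example. The sketch below refits the line and lists the residuals by year. For a least squares line with an intercept the residuals average to zero by construction, so that check is only a sanity test. Independence, normality and constant variance have to be judged from the pattern of the residuals against X.

```python
# Farm population example from the slides: year and population in millions.
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(population) / n

# Least squares slope and intercept for Y = a + bX.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, population)) / sum(
    (x - mean_x) ** 2 for x in years
)
a = mean_y - b * mean_x

# Residual (error) = observed value minus fitted value.
residuals = [y - (a + b * x) for x, y in zip(years, population)]
mean_residual = sum(residuals) / n

for x, e in zip(years, residuals):
    print(x, round(e, 3))
print("mean residual is approximately zero:", abs(mean_residual) < 1e-6)
```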



Patterns of Indicating Errors

[Figure: plots of error against X]



Estimating the Variance

The error variance is measured by the MSE:

$$s^2 = MSE = \frac{SSE}{n - k - 1}$$

where n = number of observations in the sample

k = number of independent variables

Therefore the standard deviation will be

$$s = \sqrt{MSE}$$
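As an illustration, the formula can be applied to the figures reported later in the deck for the two-variable Jenny Wilson Realty regression (residual sum of squares 6502131603, 14 observations, 2 independent variables). The variable names are illustrative.

```python
import math

# Figures taken from the two-variable Jenny Wilson Realty ANOVA table in the slides.
sse = 6502131603  # residual (error) sum of squares
n = 14            # number of observations
k = 2             # number of independent variables (SF and AGE)

mse = sse / (n - k - 1)  # estimate of the error variance
s = math.sqrt(mse)       # standard deviation of the errors

print(round(mse), round(s, 2))
```

The resulting standard deviation is consistent with the "Standard Error" of 24312.60729 shown in that summary output.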



Testing the Model for Significance

MSE and the coefficient of determination ($r^2$) do not provide a good measure of accuracy when the sample size is small.

In this case, it is necessary to test the model for significance.

The linear model is given by

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Null hypothesis: $\beta_1 = 0$, i.e. there is no linear relationship between X and Y

Alternate hypothesis: $\beta_1 \neq 0$, i.e. there is a linear relationship between X and Y



Steps in Hypothesis Test for a Significant Regression Model

1. Specify null and alternative hypotheses.

2. Select the level of significance ($\alpha$). Common values are between 0.01 and 0.05.

3. Calculate the value of the test statistic using the formula:

F = MSR/MSE

4. Make a decision using one of the following methods:

a) Reject the null hypothesis if F calculated > F table

b) Reject the null hypothesis if p-value < $\alpha$
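The steps above can be sketched for the farm population example. The code relies on two things that are not in the slides and should be treated as assumptions: the identity SSR = Sxy^2 / Sxx, which holds for simple linear regression, and the critical value 6.61 for alpha = 0.05 with 1 and 5 degrees of freedom, taken from a standard F table.

```python
# Farm population example from the slides: year and population in millions.
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]

n = len(years)  # number of observations
k = 1           # one independent variable (year)

mean_x = sum(years) / n
mean_y = sum(population) / n
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, population))
sst = sum((y - mean_y) ** 2 for y in population)  # total sum of squares

ssr = sxy ** 2 / sxx  # regression sum of squares (simple linear regression)
sse = sst - ssr       # error sum of squares

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

# Assumed table value: F(0.05; df1 = 1, df2 = 5) from a standard F table.
F_TABLE = 6.61
reject_null = f_stat > F_TABLE

print(round(f_stat, 1), reject_null)
```

Because the calculated F is far above the table value, the null hypothesis of no linear relationship would be rejected under these assumptions.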



Multiple Regression Analysis

More than one independent variable

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon$$

where

Y = dependent variable (response variable)

$X_i$ = ith independent variable (predictor variable or explanatory variable)

$\beta_0$ = intercept (value of Y when all $X_i$ = 0)

$\beta_i$ = coefficient of the ith independent variable

k = number of independent variables

$\epsilon$ = random error

To estimate the values of these coefficients, a sample is taken and the following equation is developed:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where

$\hat{Y}$ = predicted value of Y

$b_0$ = sample intercept (and is an estimate of $\beta_0$)

$b_i$ = sample coefficient of the ith variable (and is an estimate of $\beta_i$)



Jenny Wilson Realty

Selling Price ($)   Square Footage   Age   Condition
95000               1926             30    Good
119000              2069             40    Excellent
124800              1720             30    Excellent
135000              1396             15    Good
142800              1706             32    Mint
145000              1847             38    Mint
159000              1950             27    Mint
165000              2323             30    Excellent
182000              2285             26    Mint
183000              3752             35    Good
200000              2300             18    Good
211000              2525             17    Good
215000              3800             40    Excellent
219000              1740             12    Mint

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.819680305
R Square             0.671875802
Adjusted R Square    0.612216857
Standard Error       24312.60729
Observations         14

ANOVA
             df   SS            MS        F        Significance F
Regression   2    13313936968   6.7E+09   11.262   0.002178765
Residual     11   6502131603    5.9E+08
Total        13   19816068571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   146630.89      25482.08287      5.75427   0.0001    90545.20735   202717      90545         202717
SF          43.819366      10.28096507      4.26218   0.0013    21.19111495   66.448      21.191        66.448
AGE         -2898.686      796.5649421      -3.639    0.0039    -4651.91386   -1145       -4651.9       -1145.5

R Square is the coefficient of determination $r^2$.

The Coefficients column gives the regression coefficients.

The p-values are used to test the individual variables for significance.
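The regression coefficients in the summary output come from a least squares fit of selling price on square footage and age. A minimal sketch of that fit, using only the standard library, is shown below. It solves the normal equations (X'X) b = X'y by Gaussian elimination. This is one common way to obtain least squares estimates and is not necessarily how the spreadsheet output in the slides was produced. The helper names `solve` and `fit` are illustrative.

```python
# (selling price, square footage, age) for the 14 houses in the table.
houses = [
    (95000, 1926, 30), (119000, 2069, 40), (124800, 1720, 30),
    (135000, 1396, 15), (142800, 1706, 32), (145000, 1847, 38),
    (159000, 1950, 27), (165000, 2323, 30), (182000, 2285, 26),
    (183000, 3752, 35), (200000, 2300, 18), (211000, 2525, 17),
    (215000, 3800, 40), (219000, 1740, 12),
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(predictor_rows, ys):
    """Least squares coefficients [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(row) for row in predictor_rows]  # leading 1 for b0
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


prices = [h[0] for h in houses]
predictors = [h[1:] for h in houses]
b0, b1, b2 = fit(predictors, prices)
print(b0, b1, b2)
```

The printed values can be compared with the Intercept, SF and AGE coefficients in the summary output.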



Binary or Dummy Variables

Indicator variable

Assigned a value of 1 if a particular condition is met, 0 otherwise

The number of dummy variables must equal one less than the number of categories of a qualitative variable

The Jenny Wilson Realty example:

X3 = 1 for excellent condition, = 0 otherwise

X4 = 1 for mint condition, = 0 otherwise
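The coding scheme for the three house conditions can be written as a small function. "Good" acts as the baseline category, so it is represented by both dummy variables being 0. The function name `encode_condition` is an illustrative choice.

```python
def encode_condition(condition):
    """Return (X3, X4) for a house condition; 'good' is the baseline category."""
    label = condition.strip().lower()
    if label not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    x3 = 1 if label == "excellent" else 0  # X3 = 1 for excellent condition
    x4 = 1 if label == "mint" else 0       # X4 = 1 for mint condition
    return x3, x4


for condition in ["GOOD", "Excellent", "Mint"]:
    print(condition, encode_condition(condition))
```

Three categories need only two dummy variables, because the third category is implied when both are 0.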



Jenny Wilson Realty

Selling Price ($)   Square Footage   Age   X3 (Exc.)   X4 (Mint)   Condition
95000               1926             30    0           0           Good
119000              2069             40    1           0           Excellent
124800              1720             30    1           0           Excellent
135000              1396             15    0           0           Good
142800              1706             32    0           1           Mint
145000              1847             38    0           1           Mint
159000              1950             27    0           1           Mint
165000              2323             30    1           0           Excellent
182000              2285             26    0           1           Mint
183000              3752             35    0           0           Good
200000              2300             18    0           0           Good
211000              2525             17    0           0           Good
215000              3800             40    1           0           Excellent
219000              1740             12    0           1           Mint

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94762
R Square             0.89798
Adjusted R Square    0.85264
Standard Error       14987.6
Observations         14

ANOVA
             df   SS            MS      F         Significance F
Regression   4    17794427451   4E+09   19.8044   0.000174421
Residual     9    2021641120    2E+08
Total        13   19816068571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   121658         17426.61432      6.9812    6.5E-05   82236.71393   161080      82236.71      161080
SF          56.4276        6.947516792      8.122     2E-05     40.71122594   72.144      40.71123      72.144
AGE         -3962.82       596.0278736      -6.6487   9.4E-05   -5311.12866   -2614.5     -5311.129     -2614.5
X3 (Exc.)   33162.6        12179.62073      2.7228    0.0235    5610.432651   60714.9     5610.433      60715
X4 (Mint)   47369.2        10649.26942      4.4481    0.0016    23278.92699   71459.6     23278.93      71460

The coefficient of age is negative, indicating that the price decreases as a house gets older.



Model Building

The value of $r^2$ can never decrease when more variables are added to the model

Adjusted $r^2$ is often used to determine if an additional independent variable is beneficial

The adjusted $r^2$ is

$$\text{Adjusted } r^2 = 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)}$$

where SST is the total sum of squares

A variable should not be added to the model if it causes the adjusted $r^2$ to decrease
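The adjusted value can be checked against the two Jenny Wilson Realty summary outputs. The sketch below uses the equivalent form 1 - (1 - r2)(n - 1)/(n - k - 1), which follows from r2 = 1 - SSE/SST. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R Square values and sample sizes as reported in the two summary outputs.
two_variable_model = adjusted_r2(0.671875802, n=14, k=2)  # SF and AGE
four_variable_model = adjusted_r2(0.89798, n=14, k=4)     # SF, AGE, X3, X4

print(round(two_variable_model, 5), round(four_variable_model, 5))
```

Both results match the "Adjusted R Square" figures in the outputs, and the adjusted value rises when the condition dummies are added, so by the rule above those variables are worth including.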



Multiple Regression

Sales/Decision to buy = B0 + B1 * Price

Sales/Decision to buy = B0 + B1 * (Price)^3 + B2 * (Design)^2 + B3 * (Performance)

L = (Price)^3

M = (Design)^2

N = (Performance)

Sales/Decision to buy = B0 + B1 * L + B2 * M + B3 * N
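The substitution above turns a model that is nonlinear in the original variables into one that is linear in L, M and N, so ordinary multiple regression can be applied to the transformed columns. The sketch below assumes the trailing 3 and 2 in the slide are exponents. The coefficient values and inputs are purely hypothetical and are used only to show the mechanics.

```python
def transform(price, design, performance):
    """Map the raw variables to the transformed variables (L, M, N)."""
    return price ** 3, design ** 2, performance


def predict(coefficients, price, design, performance):
    """Evaluate B0 + B1*L + B2*M + B3*N for one observation."""
    b0, b1, b2, b3 = coefficients
    l, m, n = transform(price, design, performance)
    return b0 + b1 * l + b2 * m + b3 * n


# Hypothetical coefficients and inputs, chosen only for illustration.
example_coefficients = (1.0, 1.0, 1.0, 1.0)
print(transform(2, 3, 4))
print(predict(example_coefficients, 2, 3, 4))
```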



Pitfalls In Regression

A high correlation does not mean one variable is causing a change in another. (Some regressions have shown a significantly positive relation between individuals' college GPA and future salary.)

Values of the independent variable that are above or below the ones from the sample should not be used for prediction.

The number of independent variables that should be used in the model is limited by the number of observations.
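The danger of predicting outside the sample range can be illustrated with the farm population equation y = -0.671x + 1,330.350, which was fitted to the years 1935 to 1965. The function names below are illustrative, and the year 2000 is an arbitrary choice of an out-of-range input.

```python
# Rounded coefficients of the farm population regression from the slides.
SLOPE = -0.671
INTERCEPT = 1330.350
SAMPLE_YEARS = (1935, 1965)  # range of years in the sample


def predict_population(year):
    """Predicted farm population (in millions) for a given year."""
    return INTERCEPT + SLOPE * year


def in_sample_range(year):
    """True if the year lies within the range used to fit the model."""
    return SAMPLE_YEARS[0] <= year <= SAMPLE_YEARS[1]


for year in (1950, 2000):
    print(year, round(predict_population(year), 2), in_sample_range(year))
```

Inside the sample range the prediction is plausible, but for the year 2000 the same equation gives a negative population, which shows that the fitted relationship cannot be trusted beyond the data it was estimated from.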

