Regression for class teaching

Preview:

Regression

Understanding

Example

Diagnostics

Regression With SPSS

Transformation

Regression

What is Regression

A statistical technique used to relate two or more variables.

Objective: use the independent variable(s) to predict the value of the dependent variable.

Example

For a given value of advertisement expenditure, how much in sales will be generated?

With a given diet plan, how much weight will an individual be able to lose?

With a unit increase in greenhouse gases, how much will the temperature rise?

Regression Understanding

A layman's question: Suppose we want to find out how much the age of a car helps us determine the price of the car.

A layman's answer: The older the car, the ______ will be the price.

Regression in simple words: As the age of the car increases by one year, the price of the car is estimated to decrease by a certain amount.

Regression in statistical terms: Y(Estimated) = b0 + b1 X

Regression Understanding

Data set: age & price of the cars.

What relation do you see? A negative relationship.

Age:   1  2  1  2  3  4  3  4  3
Price: 90 85 93 84 80 74 81 76 79

A convenient way to look at it (what is this tool called?)

[Scatter plot: Price (70 to 90) against Age (1 to 4), showing the downward trend]

How to Show it Statistically

Y(E) = b0 + b1 X

Y(E) = 97 – 5 X

Y = 97 – 5 X + E

Term: What it is

Y(E): Dependent variable whose behavior is to be determined

X: Independent variable whose effect is to be determined

b0: Intercept, the value of Y(E) when X = 0

b1: Estimated change in Y in response to a unit change in X

E: Difference between the actual and estimated values
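The line above can be recovered from the age/price data tabulated earlier with ordinary least squares. A minimal numpy sketch (the data values are the ones on the earlier slide; the exact coefficients come out near the rounded 97 and –5 shown above):

```python
import numpy as np

# Age/price data tabulated on the earlier slide.
age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)

# Least-squares slope (b1) and intercept (b0).
b1 = np.sum((age - age.mean()) * (price - price.mean())) / np.sum((age - age.mean()) ** 2)
b0 = price.mean() - b1 * age.mean()

print(f"Price(Estimated) = {b0:.2f} + ({b1:.2f}) * Age")
```

The exact fit is roughly Price = 96.25 – 5.40 Age, which the slide rounds to 97 – 5X.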

Assessing the Goodness of Fit: Graphical Way

Goodness of fit means how well the model fits the actual data. Smaller residuals mean a good fit; larger residuals mean a poor fit.

[Panels: Bad Fit, Good Fit, Perfect Fit]

Assessing the Goodness of Fit: Statistical Way

Expected Y (the mean of Y), Estimated Y (the fitted value), Actual Y (the observed value)

SST = Σ (Actual – Expected)²

SSR = Σ (Estimated – Expected)²

SSE = Σ (Actual – Estimated)²

A good model is one in which SSE is the lowest; for a perfect fit, SSE = 0.

SST = SSR + SSE, so R² = SSR/SST = 1 – SSE/SST
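The decomposition and both R² identities can be verified numerically on the same age/price data. A numpy sketch, with the fitted line playing the role of "Estimated" and the mean of Y the role of "Expected":

```python
import numpy as np

age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)

# Fit the line, then decompose the variation in price.
b1 = np.sum((age - age.mean()) * (price - price.mean())) / np.sum((age - age.mean()) ** 2)
b0 = price.mean() - b1 * age.mean()
estimated = b0 + b1 * age        # Estimated Y (fitted values)
expected = price.mean()          # Expected Y (the mean)

sst = np.sum((price - expected) ** 2)      # total variation
ssr = np.sum((estimated - expected) ** 2)  # variation explained by the line
sse = np.sum((price - estimated) ** 2)     # leftover (residual) variation

r2 = ssr / sst
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  R2={r2:.3f}")
```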

Residual Analysis

Why? The purpose of modeling is to predict (interpolate), and the interpolation can only be correct when the assumptions about the behavior of the data hold true.

Assumptions: the response variable is independent, is normally distributed, has constant variance, and has a straight-line relation with the IV.

Residual Analysis

Assumption / in terms of the response variable / in terms of the residuals:

Independence: the response variable is independent / the random errors are independent

Normality: the response variable is normally distributed / the random errors are normally distributed

Constant variance: the response variable has constant variance / the random errors have constant variance

Linearity: the response variable has a straight-line relation with the IV / the random errors have a straight-line relation with the IV

Inferring About the Population

Assumption / what it means / how to check it:

Expected value of residual: E(e_i) = 0; check: no apparent pattern in the residual plot.

Variance of residual: σe1 = σe2 = … = σei (constant); check: the residual plot has a consistent spread.

Distribution of residual: normal; check: the histogram is symmetric or normal (histogram & probability plot of the residuals).

Dependency of residuals: independent; check: no systematic pattern in the residuals over the observation order.

Relationship b/w IV & DV: linear; check: scatter plot.

The Three Conditions Shown Together

As the distribution is symmetric, the mean of the error-term distribution will be zero.

The distribution of the error term is shown to be normally distributed.

The variance of the error term for different values of x appears to be the same.

Residual Analysis: Types of Residuals

Normal (raw) residual, RESID: Y – Y(Estimated)

Standardized residual, ZRESID: {Y – Y(Estimated)} / standard error of the residuals

Studentized residual, SRESID: {Y – Y(Estimated)} / varying (case-specific) standard error of the residuals
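The three residual types can be computed by hand. A numpy sketch on the age/price data, using the common textbook definitions: ZRESID divides by the overall residual standard error, while SRESID divides by a case-specific standard error built from the leverages:

```python
import numpy as np

age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)
n = len(age)

X = np.column_stack([np.ones(n), age])            # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
fitted = X @ beta

raw = price - fitted                               # RESID
s = np.sqrt(np.sum(raw ** 2) / (n - 2))            # residual standard error
zresid = raw / s                                   # ZRESID

# SRESID: each residual divided by its own, leverage-adjusted standard error.
H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
h = np.diag(H)                                     # leverage of each case
sresid = raw / (s * np.sqrt(1 - h))

print(np.round(sresid, 2))
```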

Influential Observation

Outliers: observations with a large error (residual).

Leverage points: observations distinct from the other cases on the basis of their independent-variable values.

Influential observation: a value whose inclusion can change the coefficients of the regression line.

Any value can be an influential observation.

Outliers With Residuals

Unstandardized residuals cannot tell us how big a residual must be before it counts as big. Using the properties of the normal distribution helps us make a rule for deciding large or small.

The model is unacceptable when:

any case has |SR| > 3.29 (rule of 3.29)

1% or more of cases have |SR| > 2.58 (rule of 2.58)

5% or more of cases have |SR| > 1.96 (rule of 1.96)
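A sketch of how the three rules could be screened, with a made-up vector of standardized residuals (the values are illustrative only):

```python
import numpy as np

# Assumed standardized residuals from some fitted model (illustrative values).
zresid = np.array([0.5, -1.2, 2.7, -0.3, 1.1, -2.0, 0.8, -0.9, 1.6, -0.1])

pct_above_196 = np.mean(np.abs(zresid) > 1.96) * 100   # % of cases beyond 1.96
pct_above_258 = np.mean(np.abs(zresid) > 2.58) * 100   # % of cases beyond 2.58
any_above_329 = bool(np.any(np.abs(zresid) > 3.29))    # any extreme case at all

flags = []
if any_above_329:
    flags.append("a case with |SR| > 3.29")
if pct_above_258 >= 1:
    flags.append("1% or more cases with |SR| > 2.58")
if pct_above_196 >= 5:
    flags.append("5% or more cases with |SR| > 1.96")

print(flags)
```

With only ten cases the percentage rules are crude, but the screening logic is the same at any sample size.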

Identifying Influential Cases

"I will look at the world without you": the regression is run with a particular case removed, and that case's value is then predicted.

How it looks: if this adjusted predicted value is similar to the original predicted value, then the case is not an influential observation.

Identifying Influential Cases

Adjusted predicted value: the predicted value of a case obtained without including that case in the fit.

DFFit: original predicted value – adjusted predicted value

Deleted residual: original observed value – adjusted predicted value

Studentized deleted residual: deleted residual / its standard error
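These leave-one-out quantities can be computed directly by refitting the model without each case. A numpy sketch on the age/price data:

```python
import numpy as np

age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)
n = len(age)
X = np.column_stack([np.ones(n), age])

beta, *_ = np.linalg.lstsq(X, price, rcond=None)
predicted = X @ beta

dffit = np.empty(n)
deleted_resid = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i                        # drop case i
    beta_i, *_ = np.linalg.lstsq(X[mask], price[mask], rcond=None)
    adj_pred = X[i] @ beta_i                        # adjusted predicted value
    dffit[i] = predicted[i] - adj_pred              # DFFit
    deleted_resid[i] = price[i] - adj_pred          # deleted residual
```

A handy check: for least squares the deleted residual equals the ordinary residual divided by (1 – leverage), so the loop is not strictly necessary.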

Influential Cases

Compare the coefficients with (xa, ya) included and with (xa, ya) not included:

Large change in the coefficients: influential observation.

No large change in the coefficients: not an influential observation.

Influential Cases (Adjusted Predicted Value)

DFFit = difference = predicted value (PV) – adjusted predicted value (APV)

Large difference: influential observation.

Small difference: not an influential observation.

Influential Cases (Adjusted Predicted Value)

Deleted residual (DR) = original value (OV) – adjusted predicted value (APV)

Studentized deleted residual (SDR) = DR / SE

SDR can be compared across different regression models.

Identifying Influential Cases

Measure / what is it? / observation is influential if:

Cook's distance: a measure of the overall influence of the case on the model; influential if CD > 1.

Leverage: the influence of the observed value on the predicted value; average leverage (AL) = (k+1)/n, and a case is flagged if its leverage exceeds 2(k+1)/n or 3(k+1)/n.

Mahalanobis distance: the distance of a case from the means of the predictor variables; use the Barnett & Lewis table.
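Leverage and Cook's distance can be computed from the hat matrix. A numpy sketch on the age/price data, using the standard Cook's distance formula D_i = e_i² / ((k+1) MSE) * h_i / (1 - h_i)²:

```python
import numpy as np

age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)
n, k = len(age), 1                                  # k = number of predictors

X = np.column_stack([np.ones(n), age])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                      # leverage values

avg_leverage = (k + 1) / n                          # average leverage = (k+1)/n
high_leverage = h > 2 * avg_leverage                # common cut-off: 2(k+1)/n

beta, *_ = np.linalg.lstsq(X, price, rcond=None)
e = price - X @ beta
mse = np.sum(e ** 2) / (n - k - 1)
cooks_d = (e ** 2 / (mse * (k + 1))) * (h / (1 - h) ** 2)
influential = cooks_d > 1                           # rule of thumb: CD > 1
```

On this toy data set exactly one case (the third: age 1, price 93) comes out just above the CD > 1 cut-off.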

Identifying Influential Cases

DfBeta: the difference between a parameter estimated with and without the case; scale sensitive, therefore it does not provide a good critical value.

Standardized DfBeta (DfBeta / standard error): the same difference on a standardized scale; the observation is influential if it exceeds 1 (some rules use 2).

Covariance ratio (CVR): measures whether the case affects the variance of the regression parameters. Consider deleting the case if CVR < 1 - 3(k+1)/n; do not delete the case if CVR > 1 + 3(k+1)/n.

k = number of predictors

Heteroscedasticity

What is it? Changing variance at different levels of the predictor.

[Scatter plot of residuals against y: the spread of the residuals increases with y]

Multicollinearity

What is it? A strong correlation between the predictor variables.

Effects: untrustworthy b's, restricted R², difficulty in picking the right variable, inflated standard errors, non-significant b's, and b's that vary from sample to sample.

If a new variable that is strongly correlated with the first one is included, R² will not increase.

Multicollinearity: Measures

VIF (Variance Inflation Factor): VIF = 1/(1 - R²), where R² comes from regressing one predictor on the remaining predictors. Interpretation: the lower the value the better; VIF < 10 is the usual cut-off.

Durbin-Watson (a check of residual independence rather than of multicollinearity): the range of values is between 0 and 4; 0 = positive autocorrelation, 4 = negative autocorrelation, 2 = no autocorrelation. The desired value is 2 or near it.
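Both measures are easy to compute by hand. A numpy sketch on simulated data (the variable names and the deliberately collinear pair are assumptions for illustration): VIF comes from an auxiliary regression, Durbin-Watson from the first differences of the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)          # built to track x1 closely
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def r_squared(y, X):
    """R^2 of an OLS fit of y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return 1 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)

# VIF for x1: regress x1 on the other predictor(s), then 1 / (1 - R^2).
r2_aux = r_squared(x1, np.column_stack([np.ones(n), x2]))
vif_x1 = 1 / (1 - r2_aux)

# Durbin-Watson on the residuals of the full model (case order matters).
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

Because x2 tracks x1 closely, the VIF comes out well above 10, while the independently drawn errors leave the Durbin-Watson statistic near 2.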

Measures of Multicollinearity

Measure / desired behavior / critical value:

VIF: the lower the better; VIF > 10 signals a problem.

Tolerance: the higher the better; T < 0.1 signals a problem.

Eigenvalues: eigenvalues of similar magnitude are better; very small eigenvalues signal a problem.

Variance proportions: each dimension should be related to a separate variable.

Checking Assumptions Through Plots

Normality: P-P plot of the standardized residuals; Q-Q plot of the standardized residuals.

Heteroscedasticity & outliers: scatter plot of standardized residuals against standardized predicted values; scatter plot of residuals against standardized predicted values.

Transformation of a Variable

Reason: a nonlinear relation is translated into a linear one, and the methods of estimation and interpretation for linear relations are well known.

How: justified theoretically and by diagnostic plots; transform X, Y, or both.

Transformation of a Variable

Function / form / transform / linear form:

Reciprocal: Y = α + β/x; X' = 1/x; Y = α + βX'

Exponential: Y = αe^(βx); Y' = ln(Y); Y' = ln α + βx

Power: Y = αx^β; Y' = log(Y), X' = log(X); Y' = log α + βX'

Log: Y = α + β log x; X' = log(x); Y = α + βX'
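The exponential row can be demonstrated numerically: taking logs of Y = αe^(βx) gives ln Y = ln α + βx, so an ordinary linear fit on (x, ln Y) recovers α and β. A numpy sketch on simulated data (α = 2 and β = 0.7 are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 5, 50)
alpha, beta = 2.0, 0.7
# Multiplicative noise keeps Y positive so the log is always defined.
y = alpha * np.exp(beta * x) * np.exp(rng.normal(scale=0.05, size=x.size))

# Linearize: ln(Y) = ln(alpha) + beta * x, then fit a straight line.
y_log = np.log(y)
X = np.column_stack([np.ones(x.size), x])
coef, *_ = np.linalg.lstsq(X, y_log, rcond=None)
alpha_hat, beta_hat = np.exp(coef[0]), coef[1]

print(f"alpha_hat={alpha_hat:.3f}, beta_hat={beta_hat:.3f}")
```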

Regression through SPSS

Coefficients: b0 & b1, tested with t.

Model fit: SST = SSR + SSE, tested with F = MSR/MSE.

Assumptions: e is independent; e is normally distributed; e has constant variance; e has a straight-line relation with the IV; no multicollinearity.

Data Set

Variables: Study Time, Interest, Marks.

Saved casewise statistics: standardized predicted values, standardized residuals, deleted residuals, adjusted predicted values, studentized residuals, studentized deleted residuals.


Normality: normal probability plot of the standardized residuals; histogram of the standardized residuals; K-S and Shapiro-Wilk tests.

Getting the residuals & standardized residuals.



[Scatter plot: standardized residuals against standardized predicted values, both axes from -3 to 3, showing a random, patternless spread (the recommended pattern)]
