
    Linear Regression Analysis

    Correlation

    Simple Linear Regression

    The Multiple Linear Regression Model

    Least Squares Estimates

    R2 and Adjusted R2

    Overall Validity of the Model (F test)

    Testing for Individual Regressors (t test)

    Problem of Multicollinearity

    Gaurav Garg (IIM Lucknow)


    Smoking and Lung Capacity
    Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
    We might ask a group of people about their smoking habits and measure their lung capacities.

    Cigarettes (X)   Lung Capacity (Y)
          0                 45
          5                 42
         10                 33
         15                 31
         20                 29


    Scatter plot of the data
    We can see that as smoking goes up, lung capacity tends to go down; the two variables move in opposite directions.

    [Scatter plot of Cigarettes (X) versus Lung Capacity (Y)]


    Height and Weight
    Consider the following data of heights and weights of 5 women swimmers:
    Height (inches): 62  64  65  66  68
    Weight (pounds): 102 108 115 128 132
    We can observe that weight increases with height.

    [Scatter plot of Height versus Weight]


    Sometimes two variables are related to each other.
    The values of both variables are paired.
    A change in the value of one affects the value of the other.
    Usually these two variables are two attributes of each member of the population. For example:
    Height and Weight
    Advertising Expenditure and Sales Volume
    Unemployment and Crime Rate
    Rainfall and Food Production
    Expenditure and Savings


    We have already studied one measure of relationship between two variables: covariance.
    Covariance between two random variables X and Y is given by
    Cov(X, Y) = E(XY) - E(X) E(Y)
    For n paired observations on variables X and Y,
    Cov(X, Y) = (1/n) Σᵢ (xᵢ - x̄)(yᵢ - ȳ)


    Properties of Covariance:

    Cov(X+a, Y+b) = Cov(X, Y) [not affected by change in location]

    Cov(aX, bY) = ab Cov(X, Y) [affected by change in scale]

    Covariance can take any value from -∞ to +∞.
    Cov(X, Y) > 0 means X and Y change in the same direction.
    Cov(X, Y) < 0 means X and Y change in opposite directions.
    If X and Y are independent, Cov(X, Y) = 0 [the converse may not be true].
    Covariance is not unit free, so it is not a good measure of the relationship between two variables.
    A better measure is the correlation coefficient; it is unit free and takes values in [-1, +1].


    Correlation
    Karl Pearson's correlation coefficient is given by
    r = Corr(X, Y) = Cov(X, Y) / √[Var(X) Var(Y)]
    When the joint distribution of X and Y is known:
    Cov(X, Y) = E(XY) - E(X) E(Y),   Var(X) = E(X²) - [E(X)]²,   Var(Y) = E(Y²) - [E(Y)]²
    When n paired observations on X and Y are available:
    Cov(X, Y) = (1/n) Σᵢ (xᵢ - x̄)(yᵢ - ȳ),   Var(X) = (1/n) Σᵢ (xᵢ - x̄)²,   Var(Y) = (1/n) Σᵢ (yᵢ - ȳ)²
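    These formulas can be checked numerically. A minimal sketch (not part of the original slides), assuming NumPy is available, applied to the cigarettes / lung-capacity data from the earlier slide:

        import numpy as np

        x = np.array([0, 5, 10, 15, 20])            # cigarettes (X)
        y = np.array([45, 42, 33, 31, 29])          # lung capacity (Y)

        cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # (1/n) * sum of cross-deviations
        r = cov_xy / np.sqrt(x.var() * y.var())             # population variances (1/n), as defined above
        print(round(r, 4))                                  # about -0.9615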


    Properties of Correlation Coefficient

    Corr(aX+b, cY+d) = Corr(X, Y); it is unit free.
    It measures the strength of relationship on a scale of -1 to +1.
    So, it can be used to compare the relationships of various pairs of variables.
    Values close to 0 indicate little or no correlation.
    Values close to +1 indicate very strong positive correlation.
    Values close to -1 indicate very strong negative correlation.


    Scatter Diagram

    [Scatter diagrams: positively correlated, negatively correlated, weakly correlated, strongly correlated, not correlated]


    The correlation coefficient measures the strength of a linear relationship.
    r = 0 does not necessarily imply that there is no correlation.
    A relationship may exist, but not a linear one.
    [Scatter plots of nonlinear relationships for which r ≈ 0]


    x       y     (x-x̄)   (y-ȳ)   (x-x̄)²   (y-ȳ)²   (x-x̄)(y-ȳ)
    1.25    125   -0.90     45    0.8100    2025      -40.50
    1.75    105   -0.40     25    0.1600     625      -10.00
    2.25     65    0.10    -15    0.0100     225       -1.50
    2.00     85   -0.15      5    0.0225      25       -0.75
    2.50     75    0.35     -5    0.1225      25       -1.75
    2.25     80    0.10      0    0.0100       0        0.00
    2.70     50    0.55    -30    0.3025     900      -16.50
    2.50     55    0.35    -25    0.1225     625       -8.75
    17.20   640    0         0    1.560     4450      -79.75
                                  (SSX)     (SSY)     (SSXY)

    r = Cov(X, Y) / √[Var(X) Var(Y)] = SSXY / √(SSX · SSY) = -79.75 / √(1.56 × 4450) = -0.957


    Alternative Formulas for Sum of Squares

    x        y       x²        y²       x·y
    1.25     125     1.5625    15625    156.25
    1.75     105     3.0625    11025    183.75
    2.25      65     5.0625     4225    146.25
    2.00      85     4.0000     7225    170.00
    2.50      75     6.2500     5625    187.50
    2.25      80     5.0625     6400    180.00
    2.70      50     7.2900     2500    135.00
    2.50      55     6.2500     3025    137.50
    17.20    640    38.54      55650   1296.25

    SSX = Σx² - (Σx)²/n = 38.54 - (17.20)²/8 = 1.56
    SSY = Σy² - (Σy)²/n = 55650 - (640)²/8 = 4450
    SSXY = Σxy - (Σx)(Σy)/n = 1296.25 - (17.20)(640)/8 = -79.75

    r = Cov(X, Y) / √[Var(X) Var(Y)] = SSXY / √(SSX · SSY) = -79.75 / √(1.56 × 4450) = -0.957


    Smoking and Lung Capacity Example

    Cigarettes (X)    X²     X·Y      Y²    Lung Capacity (Y)
          0            0       0    2025          45
          5           25     210    1764          42
         10          100     330    1089          33
         15          225     465     961          31
         20          400     580     841          29
    Sum: 50          750    1585    6680         180


    r_xy = [n ΣXY - (ΣX)(ΣY)] / √{[n ΣX² - (ΣX)²] [n ΣY² - (ΣY)²]}
         = [(5)(1585) - (50)(180)] / √{[(5)(750) - 50²] [(5)(6680) - 180²]}
         = (7925 - 9000) / √[(3750 - 2500)(33400 - 32400)]
         = -1075 / √[(1250)(1000)]
         = -0.9615


    Regression Analysis
    Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
    Dependent variable: the variable you wish to explain.
    Independent variables: the variables used to explain the dependent variable.
    Regression analysis is used to:
    Predict the value of the dependent variable based on the value of the independent variable(s).
    Explain the impact of changes in an independent variable on the dependent variable.


    Types of Relationships

    [Scatter plots illustrating linear relationships and curvilinear relationships]


    Types of Relationships

    [Scatter plots illustrating strong relationships and weak relationships]


    Types of Relationships

    [Scatter plots illustrating no relationship]


    Simple Linear Regression Analysis
    The simplest mathematical relationship is Y = a + bX + error (linear).
    Changes in Y are related to the changes in X.
    What are the most suitable values of a (intercept) and b (slope)?
    [Diagram of the line y = a + b·x, showing the intercept a and the slope b]


    Method of Least Squares
    [Diagram: an observed point (xᵢ, yᵢ), the fitted line Ŷ = a + bX, and the error yᵢ - (a + bxᵢ)]
    The best fitted line is the one for which all the errors are minimum.


    We want to fit a line for which all the errors are minimum.
    That is, we want to obtain values of a and b in Y = a + bX + error for which all the errors are minimum.
    To minimize all the errors together, we minimize the sum of squares of errors (SSE):
    SSE = Σᵢ (Yᵢ - a - bXᵢ)²


    To get the values of a and b which minimize SSE, we proceed as follows:
    ∂SSE/∂a = 0  ⇒  -2 Σᵢ (Yᵢ - a - bXᵢ) = 0  ⇒  Σᵢ Yᵢ = n·a + b Σᵢ Xᵢ            ...(1)
    ∂SSE/∂b = 0  ⇒  -2 Σᵢ Xᵢ(Yᵢ - a - bXᵢ) = 0  ⇒  Σᵢ XᵢYᵢ = a Σᵢ Xᵢ + b Σᵢ Xᵢ²    ...(2)
    Equations (1) and (2) are called normal equations.
    Solve the normal equations to get a and b.


    Solving the above normal equations, we get
    b = [n Σ XᵢYᵢ - (Σ Xᵢ)(Σ Yᵢ)] / [n Σ Xᵢ² - (Σ Xᵢ)²] = Σ (Xᵢ - X̄)(Yᵢ - Ȳ) / Σ (Xᵢ - X̄)² = SSXY / SSX
    a = Ȳ - b X̄


    The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b. Thus, the LSE of a and b are given by
    b = SSXY / SSX,    a = Ȳ - b X̄
    Also, the correlation coefficient between X and Y is
    r = Cov(X, Y) / √[Var(X) Var(Y)] = SSXY / √(SSX · SSY) = b √(SSX / SSY)
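    A minimal sketch (assumed NumPy helper, not from the slides) that computes these least squares estimates for the worked x-y example:

        import numpy as np

        x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
        y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

        ssx  = np.sum((x - x.mean()) ** 2)                  # about 1.56
        ssxy = np.sum((x - x.mean()) * (y - y.mean()))      # about -79.75
        b = ssxy / ssx                                      # slope, about -51.12
        a = y.mean() - b * x.mean()                         # intercept, about 189.91
        print(round(a, 2), round(b, 2))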


    From the table computed earlier: SSX = 1.56, SSY = 4450, SSXY = -79.75, with X̄ = 17.20/8 = 2.15 and Ȳ = 640/8 = 80.


    r = SSXY / √(SSX · SSY) = -0.957
    b = SSXY / SSX = -51.12,    a = Ȳ - b X̄ = 189.91
    Fitted line is Ŷ = 189.91 - 51.12 X
    [Scatter plot of the data with the fitted line]


    Fitted line is Ŷ = 189.91 - 51.12 X.
    189.91 is the estimated mean value of Y when the value of X is zero.
    -51.12 is the change in the average value of Y as a result of a one-unit change in X.
    We can predict the value of Y for some given value of X.
    For example, at X = 2.15 the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.


    Residuals: eᵢ = Yᵢ - Ŷᵢ
    A residual is the unexplained part of Y.
    The smaller the residuals, the better the utility of the regression.
    The sum of residuals is always zero; the least squares procedure ensures that.
    Residuals play an important role in investigating the adequacy of the fitted model.
    We obtain the coefficient of determination (R2) using the residuals.
    R2 is used to examine the adequacy of the fitted linear model to the given data.


    Coefficient of Determination
    [Diagram decomposing the deviation Y - Ȳ into (Ŷ - Ȳ) and (Y - Ŷ)]
    Total Sum of Squares:       SST = Σᵢ (Yᵢ - Ȳ)²
    Regression Sum of Squares:  SSR = Σᵢ (Ŷᵢ - Ȳ)²
    Error Sum of Squares:       SSE = Σᵢ (Yᵢ - Ŷᵢ)²
    Also, SST = SSR + SSE


    The fraction of SST explained by regression is given by R2:
    R2 = SSR / SST = 1 - (SSE / SST)
    Clearly, 0 ≤ R2 ≤ 1.
    When SSR is close to SST, R2 will be close to 1. This means that regression explains most of the variability in Y (the fit is good).
    When SSE is close to SST, R2 will be close to 0. This means that regression does not explain much of the variability in Y (the fit is not good).
    R2 is the square of the correlation coefficient between X and Y (proof omitted).
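    A small sketch (illustrative, not from the slides) computing SST, SSE and R2 for the fitted line of the worked example:

        import numpy as np

        x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
        y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

        y_hat = 189.91 - 51.12 * x                 # fitted values from the estimated line
        sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
        sse = np.sum((y - y_hat) ** 2)             # error sum of squares
        print(round(1 - sse / sst, 3))             # R^2, about 0.916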


    r = +1 or r = -1, R2 = 1: perfect linear relationship; 100% of the variation in Y is explained by X.
    0 < R2 < 1: weaker linear relationship; some but not all of the variation in Y is explained by X.
    R2 = 0: no linear relationship; none of the variation in Y is explained by X.


    Coefficient of Determination: R2 = (4450 - 370.5)/4450 = 0.916
    Correlation Coefficient: r = -0.957
    Coefficient of Determination = (Correlation Coefficient)²

    X      Y     Ŷ       (Y-Ȳ)  (Y-Ŷ)  (Ŷ-Ȳ)   (Y-Ȳ)²   (Y-Ŷ)²   (Ŷ-Ȳ)²
    1.25   125   126.0    45    -1.0    46.0    2025      1.00    2116.00
    1.75   105   100.5    25     4.5    20.5     625     20.25     420.25
    2.25    65    74.9   -15    -9.9    -5.1     225     98.00      26.01
    2.00    85    87.7     5    -2.2     7.7      25      4.84      59.29
    2.50    75    62.1    -5    12.9   -17.7      25    166.41     313.29
    2.25    80    74.9     0     5.1    -5.1       0     26.01      26.01
    2.70    50    51.9   -30    -1.9   -28.1     900      3.61     789.61
    2.50    55    62.1   -25    -7.1   -17.9     625     50.41     320.41
    17.20  640                                  4450    370.54    4079.46


    Example:
    Watching television also reduces the amount of physical exercise, causing weight gains.
    A sample of fifteen 10-year-old children was taken.
    The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
    Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.
    Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.
    Y = -24.709 + 0.967 X and R2 = 0.768

    TV:         42  34  25  35  37  38  31  33  19  29  38  28  29  36  18
    Overweight: 18   6   0  -1  13  14   7   7  -9   8   8   5   3  14  -7
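    As a quick check (a sketch, not part of the original slides), fitting this data with numpy.polyfit should land close to the stated coefficients and R2:

        import numpy as np

        tv = np.array([42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18])
        ow = np.array([18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7], dtype=float)

        b, a = np.polyfit(tv, ow, deg=1)                       # slope, intercept
        r2 = 1 - np.sum((ow - (a + b * tv)) ** 2) / np.sum((ow - ow.mean()) ** 2)
        print(round(a, 3), round(b, 3), round(r2, 3))          # near -24.709, 0.967, 0.768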


    [Line plot comparing observed Y and predicted Y for the 15 children]


    Standard Error
    Consider a dataset. All the observations cannot be exactly equal to the arithmetic mean (AM).
    Variability of the observations around the AM is measured by the standard deviation.
    Similarly, in regression, all Y values cannot be the same as the predicted Y values.
    Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
    It is given by
    S_YX = √[SSE / (n - 2)] = √[Σᵢ (Yᵢ - Ŷᵢ)² / (n - 2)]


    Assumptions
    The relationship between X and Y is linear.
    Error values are statistically independent.
    All the errors have a common variance (homoscedasticity): Var(eᵢ) = σ², where E(eᵢ) = 0 and eᵢ = Yᵢ - Ŷᵢ.
    No distributional assumption about the errors is required for the least squares method.


    Independence
    [Residual-vs-X plots: a systematic pattern in the residuals indicates the errors are not independent; a random scatter indicates independence]


    Equal Variance
    [Residual-vs-X plots: a funnel-shaped spread indicates unequal variance (heteroscedastic); a uniform band indicates equal variance (homoscedastic)]


    TV Watching and Weight Gain Example
    [Scatter plot of X and Y; scatter plot of X and the residuals]


    The Multiple Linear Regression Model
    In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
    We assume that Y is regressed on only one regressor variable X.
    In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
    For example:
    Cost -> Labor cost, Electricity cost, Raw material cost
    Salary -> Education, Experience
    Sales -> Cost, Advertising Expenditure


    Example:

    A distributor of frozen dessert pies wants to

    evaluate factors which influence the demand

    Dependent variable:

    Y: Pie sales (units per week)

    Independent variables:

    X1: Price (in $)
    X2: Advertising Expenditure ($100s)
    Data are collected for 15 weeks.


    Week   Pie Sales   Price ($)   Advertising ($100s)
      1       350        5.50            3.3
      2       460        7.50            3.3
      3       350        8.00            3.0
      4       430        8.00            4.5
      5       350        6.80            3.0
      6       380        7.50            4.0
      7       430        4.50            3.0
      8       470        6.40            3.7
      9       450        7.00            3.5
     10       490        5.00            4.0
     11       340        7.20            3.5
     12       300        7.90            3.2
     13       440        5.90            4.0
     14       450        5.00            3.5
     15       300        7.00            2.7


    Using the given data, we wish to fit a linear function of the form
    Yᵢ = β0 + β1 X1i + β2 X2i + εᵢ,   i = 1, 2, ..., 15,
    where
    Y: Pie sales (units per week)
    X1: Price (in $)
    X2: Advertising Expenditure ($100s)
    Fitting means we want to get the values of the regression coefficients, denoted by β.
    The original values of the βs are not known; we estimate them using the given data.


    The Multiple Linear Regression Model
    Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X1, X2, ..., Xk).
    Multiple linear regression model with k independent variables:
    Yᵢ = β0 + β1 X1i + β2 X2i + ... + βk Xki + εᵢ,   i = 1, 2, ..., n
    (β0 is the intercept, β1, ..., βk are the slopes, and εᵢ is the random error.)


    Multiple Linear Regression Equation
    The intercept and slopes are estimated using observed data.
    Multiple linear regression equation with k independent variables:
    Ŷᵢ = b0 + b1 X1i + b2 X2i + ... + bk Xki,   i = 1, 2, ..., n
    (b0 is the estimate of the intercept; b1, ..., bk are the estimates of the slopes.)


    Multiple Regression Equation
    Example with two independent variables:
    Ŷ = b0 + b1 X1 + b2 X2
    [Three-dimensional sketch of the fitted regression plane over the (X1, X2) plane]


    Estimating Regression Coefficients
    The multiple linear regression model is
    Yᵢ = β0 + β1 X1i + β2 X2i + ... + βk Xki + εᵢ,   i = 1, 2, ..., n.
    In matrix notation, Y = Xβ + ε, where Y is the n×1 vector of responses, X is the n×(k+1) design matrix whose first column is all ones and whose remaining columns hold the regressor values, β is the (k+1)×1 vector of coefficients, and ε is the n×1 vector of random errors.


    Assumptions
    The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
    Random errors are independent.
    Random errors have the same variance (homoscedasticity): Var(εᵢ) = σ².
    In the long run, the mean effect of the random errors is zero: E(εᵢ) = 0.
    No assumption on the distribution of the random errors is required for the least squares method.


    In order to find the estimate of β, we minimize
    S(β) = Σᵢ εᵢ² = (Y - Xβ)'(Y - Xβ) = Y'Y - 2β'X'Y + β'X'Xβ.
    We differentiate S(β) with respect to β and equate to zero, i.e., ∂S/∂β = 0.
    This gives
    b = (X'X)⁻¹ X'Y.
    b is called the least squares estimator of β.
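    A minimal sketch of b = (X'X)⁻¹X'Y in NumPy for the pie-sales data (the variable names are illustrative); it should land near the estimates reported on the next slides:

        import numpy as np

        price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                          7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
        adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                          3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
        sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                          450, 490, 340, 300, 440, 450, 300], dtype=float)

        X = np.column_stack([np.ones_like(price), price, adv])   # design matrix with intercept column
        b = np.linalg.solve(X.T @ X, X.T @ sales)                # b = (X'X)^-1 X'Y
        print(np.round(b, 2))                                    # near [306.53, -24.98, 74.13]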


    Example: Consider the pie example.
    We want to fit the model Yᵢ = β0 + β1 X1i + β2 X2i + εᵢ, where
    Y: Pie sales (units per week)
    X1: Price (in $)
    X2: Advertising Expenditure ($100s)
    Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
    LSE of intercept β0 (b0): 306.53
    LSE of slope β1, Price (b1): -24.98
    LSE of slope β2, Advertising (b2): 74.13
    Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.


    Sales = 306.53 - 24.98 (X1) + 74.13 (X2)
    b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
    b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while the selling price is kept fixed.


    Prediction:
    Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350.
    Sales = 306.53 - 24.98 X1 + 74.13 X2
          = 306.53 - 24.98 (5.50) + 74.13 (3.5)
          = 428.62
    Predicted sales is 428.62 pies. Note that advertising is in $100s, so X2 = 3.5.


    Y     X1    X2    Predicted Y   Residual
    350   5.5   3.3      413.77      -63.80
    460   7.5   3.3      363.81       96.15
    350   8.0   3.0      329.08       20.88
    430   8.0   4.5      440.28      -10.31
    350   6.8   3.0      359.06       -9.09
    380   7.5   4.0      415.70      -35.74
    430   4.5   3.0      416.51       13.47
    470   6.4   3.7      420.94       49.03
    450   7.0   3.5      391.13       58.84
    490   5.0   4.0      478.15       11.83
    340   7.2   3.5      386.13      -46.16
    300   7.9   3.2      346.40      -46.44
    440   5.9   4.0      455.67      -15.70
    450   5.0   3.5      441.09        8.89
    300   7.0   2.7      331.82      -31.85

    Ŷ = 306.52619 - 24.97509 X1 + 74.13096 X2


    [Line plot comparing observed Y and predicted Y for the 15 weeks]


    Coefficient of Determination
    The coefficient of determination (R2) is obtained using the same formula as in simple linear regression:
    R2 = SSR / SST = 1 - (SSE / SST)
    R2 is the proportion of variation in Y explained by the regression.
    Total Sum of Squares:       SST = Σᵢ (Yᵢ - Ȳ)²
    Regression Sum of Squares:  SSR = Σᵢ (Ŷᵢ - Ȳ)²
    Error Sum of Squares:       SSE = Σᵢ (Yᵢ - Ŷᵢ)²
    Also, SST = SSR + SSE.


    Since SST = SSR + SSE and all three quantities are non-negative, 0 ≤ SSR ≤ SST.
    So 0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R2 ≤ 1.
    When R2 is close to 0, the linear fit is not good and the X variables do not contribute much in explaining the variability in Y.
    When R2 is close to 1, the linear fit is good.
    In the previously discussed example, R2 = 0.5215.
    If we consider Y and X1 only, R2 = 0.1965.
    If we consider Y and X2 only, R2 = 0.3095.


    Adjusted R2
    If one more regressor is added to the model, the value of R2 will increase, regardless of the contribution of the newly added regressor.
    So an adjusted value of R2 is defined, called the adjusted R2:
    Adj R2 = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]
    The adjusted R2 will increase only if the additional variable contributes to explaining the variation in Y.
    For our example, Adjusted R2 = 0.4417.
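    A small sketch (not from the slides) of how these two quantities are computed, using the SSE and SST values from the ANOVA table shown a few slides later, with n = 15 and k = 2:

        sse, sst, n, k = 27033.31, 56493.33, 15, 2
        r2     = 1 - sse / sst                               # about 0.5215
        adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # about 0.4417
        print(round(r2, 4), round(adj_r2, 4))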


    F-Test for Overall Significance
    We check whether there is a linear relationship between the regressors (X1, X2, ..., Xk) as a whole and the response (Y).
    We use the F test statistic to test
    H0: β1 = β2 = ... = βk = 0 (no regressor is significant)
    H1: at least one βi ≠ 0 (at least one regressor affects Y)
    The technique of Analysis of Variance is used.
    Assumptions:
    n > k, Var(εᵢ) = σ², E(εᵢ) = 0.
    The εᵢ are independent; this implies Corr(εᵢ, εⱼ) = 0 for i ≠ j.
    The εᵢ have a normal distribution: εᵢ ~ N(0, σ²). [NEW ASSUMPTION]


    The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE):
    SST = Σᵢ (Yᵢ - Ȳ)²,    SSE = Σᵢ eᵢ² = Σᵢ (Yᵢ - Ŷᵢ)²,    SSR = SST - SSE,
    where the eᵢ are called the residuals.


    Analysis of Variance Table

    Source              df       SS     MS     Fc
    Regression          k        SSR    MSR    MSR/MSE
    Residual or Error   n-k-1    SSE    MSE
    Total               n-1      SST

    Test statistic: Fc = MSR / MSE ~ F(k, n-k-1)
    For the previous example, we wish to test H0: β1 = β2 = 0 against H1: at least one βi ≠ 0.

    ANOVA Table
    Source              df    SS         MS         F        F(2,12)(0.05)
    Regression           2    29460.03   14730.01   6.5386   3.89
    Residual or Error   12    27033.31    2252.78
    Total               14    56493.33

    Since 6.5386 > 3.89, H0 is rejected at the 5% level of significance.
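    A minimal sketch of the same F test in code; using scipy.stats.f for the critical value is an assumed choice, not something the slides prescribe:

        from scipy.stats import f

        ssr, sse, n, k = 29460.03, 27033.31, 15, 2
        msr, mse = ssr / k, sse / (n - k - 1)
        f_calc = msr / mse                           # about 6.54
        f_crit = f.ppf(0.95, k, n - k - 1)           # about 3.89
        print(round(f_calc, 2), round(f_crit, 2), f_calc > f_crit)   # True -> reject H0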


    Individual Variables: Tests of Hypothesis
    We test whether there is a linear relationship between a particular regressor Xj and Y.
    Hypotheses:
    H0: βj = 0 (no linear relationship)
    H1: βj ≠ 0 (a linear relationship exists between Xj and Y)
    We use a two-tailed t-test.
    If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.


    Test statistic:
    Tc = bj / √(MSE · Cjj) ~ Student's t with (n - k - 1) degrees of freedom,
    where bj is the least squares estimate of βj, Cjj is the (j, j)th element of the matrix (X'X)⁻¹, and MSE is obtained from the ANOVA table.


    In our example, MSE = 2252.7755 and the elements Cjj of (X'X)⁻¹ are computed from the data.
    To test H0: β1 = 0 against H1: β1 ≠ 0, Tc = -2.3057.
    To test H0: β2 = 0 against H1: β2 ≠ 0, Tc = 2.8548.
    Two-tailed critical values of t at 12 d.f. are
    3.0545 for the 1% level of significance,
    2.6810 for the 2% level of significance,
    2.1788 for the 5% level of significance.
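    These t statistics can be reproduced from the data. A minimal sketch (assumed NumPy implementation, variable names illustrative):

        import numpy as np

        price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                          7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
        adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                          3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
        sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                          450, 490, 340, 300, 440, 450, 300], dtype=float)

        X = np.column_stack([np.ones_like(price), price, adv])
        b = np.linalg.solve(X.T @ X, X.T @ sales)           # least squares estimates
        C = np.linalg.inv(X.T @ X)                          # gives the C_jj elements
        mse = np.sum((sales - X @ b) ** 2) / (15 - 2 - 1)   # about 2252.8, as in the ANOVA table
        print(np.round(b / np.sqrt(mse * np.diag(C)), 3))   # slopes: about -2.306 and 2.855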

    Standard Error


    Consider a dataset. All the observations cannot be exactly equal to the arithmetic mean (AM).
    Variability of the observations around the AM is measured by the standard deviation.
    Similarly, in regression, all Y values cannot be the same as the predicted Y values.
    Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
    It is given by
    S_YX = √[SSE / (n - k - 1)] = √[Σᵢ (Yᵢ - Ŷᵢ)² / (n - k - 1)]


    Assumption of Linearity
    [Residual plots: a curved pattern in the residuals indicates a relationship that is not linear; a random scatter supports linearity]


    Assumption of Equal Variance
    We assume that Var(εᵢ) = σ², i.e., the variance is constant for all observations.
    This assumption is examined by looking at the plot of the predicted values Ŷᵢ against the residuals eᵢ = Yᵢ - Ŷᵢ.


    Residual Analysis for Equal Variance
    [Residual-vs-Ŷ plots: a funnel-shaped spread indicates unequal variance; a uniform band indicates equal variance]


    Assumption of Uncorrelated Residuals
    The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation. It is given by
    d = Σᵢ₌₂ⁿ (eᵢ - eᵢ₋₁)² / Σᵢ₌₁ⁿ eᵢ²
    The value of d always lies between 0 and 4.
    d = 2 indicates no autocorrelation.
    Small values (d < 2) indicate that successive error terms are positively correlated.
    If d > 2, successive error terms are negatively correlated.
    Values of d greater than 3 or less than 1 are alarming.
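    A minimal sketch (not from the slides) computing d for the pie-sales residuals listed in the earlier prediction table:

        import numpy as np

        # residuals e_i of the pie-sales fit, as listed in the earlier prediction table
        e = np.array([-63.80, 96.15, 20.88, -10.31, -9.09, -35.74, 13.47, 49.03,
                      58.84, 11.83, -46.16, -46.44, -15.70, 8.89, -31.85])
        d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # Durbin-Watson statistic
        print(round(d, 3))                             # values near 2 suggest no autocorrelation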

    Residual Analysis for Independence


    (Uncorrelated Errors)
    [Residual-vs-Ŷ plots: a systematic pattern indicates the errors are not independent; a random scatter indicates independence]

    Assumption of Normality


    When we use the F test or t test, we assume that ε1, ε2, ..., εn are normally distributed.
    This assumption can be examined with a histogram of the residuals.
    [Histograms of residuals: skewed (not normal) vs. bell-shaped (normal)]


    Normality can also be examined using a Q-Q plot or normal probability plot.
    [Q-Q plots: points deviating from the reference line (not normal) vs. points along the line (normal)]

    Standardized Regression Coefficient


    In a multiple linear regression, we may like to know which regressor contributes more.
    We obtain standardized estimates of the regression coefficients.
    For that, we first standardize the observations, using the means Ȳ, X̄1, X̄2 and the sample standard deviations
    s_Y = √[Σᵢ (Yᵢ - Ȳ)²/(n-1)],   s_X1 = √[Σᵢ (X1i - X̄1)²/(n-1)],   s_X2 = √[Σᵢ (X2i - X̄2)²/(n-1)].


    Standardize all Y, X1 and X2 values as follows:
    Standardized Yᵢ = (Yᵢ - Ȳ) / s_Y,   Standardized X1i = (X1i - X̄1) / s_X1,   Standardized X2i = (X2i - X̄2) / s_X2.
    Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
    These coefficients are dimensionless (unit-free) and can be compared.
    Look for the regression coefficient having the highest magnitude; the corresponding regressor contributes the most.
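    A minimal sketch (assumed NumPy implementation) of this standardize-and-refit procedure for the pie-sales data:

        import numpy as np

        price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                          7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
        adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                          3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
        sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                          450, 490, 340, 300, 440, 450, 300], dtype=float)

        def standardize(v):
            return (v - v.mean()) / v.std(ddof=1)      # sample s.d. with 1/(n-1), as defined above

        Xs = np.column_stack([standardize(price), standardize(adv)])
        beta = np.linalg.lstsq(Xs, standardize(sales), rcond=None)[0]
        print(np.round(beta, 3))                       # unit-free slopes, near [-0.461, 0.570]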

    Standardized Data


    Ŷ = 0 - 0.461 X1 + 0.570 X2 (standardized)
    Since |-0.461| < 0.570, X2 contributes the most.

    Week   Pie Sales   Price ($)   Advertising ($100s)
      1      -0.78       -0.95          -0.37
      2       0.96        0.76          -0.37
      3      -0.78        1.18          -0.98
      4       0.48        1.18           2.09
      5      -0.78        0.16          -0.98
      6      -0.30        0.76           1.06
      7       0.48       -1.80          -0.98
      8       1.11       -0.18           0.45
      9       0.80        0.33           0.04
     10       1.43       -1.38           1.06
     11      -0.93        0.50           0.04
     12      -1.56        1.10          -0.57
     13       0.64       -0.61           1.06
     14       0.80       -1.38           0.04
     15      -1.56        0.33          -1.60


    Note that:
    Adjusted R2 can be negative.
    Adjusted R2 is always less than or equal to R2:
    Adj R2 = 1 - (1 - R2)(n - 1) / (n - k - 1)
    Inclusion of the intercept term is not necessary; it depends on the problem, and the analyst may decide on this.
    The F statistic can also be written in terms of R2:
    Fc = [(n - k - 1) / k] · R2 / (1 - R2)

    Example: The following data were collected on sales, the number of advertisements published, and advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

    Sales (0,000 Rs)   Ads (Nos.)   Adv Ex (000 Rs)
        43.6              12            13.9
        38.0              11            12
        30.1               9             9.3
        35.3               7             9.7
        46.4              12            12.3
        34.2               8            11.4
        30.2               6             9.3
        40.7              13            14.3
        38.5               8            10.2
        22.6               6             8.4
        37.6               8            11.2
        35.2              10            11.1

    ANOVA
    Model          Sum of Squares   df   Mean Square    F       Sig.
    1 Regression      309.986        2     154.993      9.741   .006
      Residual        143.201        9      15.911
      Total           453.187       11
    a. Predictors: (Constant), Ex_Adv, No_Adv
    b. Dependent Variable: Sales

    Coefficients
    Model           B       Std. Error   Beta    t       Sig.
    1 (Constant)   6.584      8.542              .771    .461
      No_Adv        .625      1.120      .234    .558    .591
      Ex_Adv       2.139      1.470      .611   1.455    .180
    a. Dependent Variable: Sales

    F test: p-value < 0.05, so H0 is rejected; not all βs are zero.
    t tests: all p-values > 0.05, so no H0 is rejected, suggesting β0 = 0, β1 = 0, β2 = 0.
    CONTRADICTION

    Multicollinearity


    We assume that the regressors are independent variables.
    When we regress Y on regressors X1, X2, ..., Xk, we assume that all regressors X1, X2, ..., Xk are statistically independent of each other.
    All the regressors affect the values of Y, but one regressor does not affect the values of another regressor.
    Sometimes, in practice, this assumption is not met; we face the problem of multicollinearity.
    The correlated variables contribute redundant information to the model.

    Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.
    Some indications of strong multicollinearity:
    Coefficient signs may not match prior expectations.
    A large change in the value of a previous coefficient when a new variable is added to the model.
    A previously significant variable becomes insignificant when a new independent variable is added.
    The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
    A regressor has a large standard error yet is still significant.
    MSE is very high and/or R2 is very small.

    EXAMPLES IN WHICH THIS MIGHT HAPPEN:


    Miles per gallon vs. horsepower and engine size
    Income vs. age and experience
    Sales vs. number of advertisements and advertising expenditure
    Variance Inflationary Factor:
    VIFj is used to measure the multicollinearity generated by variable Xj. It is given by
    VIFj = 1 / (1 - Rj²)
    where Rj² is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.
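    A minimal sketch (not from the slides) of the VIF computation for the two-regressor sales example, where each Rj² reduces to the squared correlation between the two regressors:

        import numpy as np

        no_adv = np.array([12, 11, 9, 7, 12, 8, 6, 13, 8, 6, 8, 10], dtype=float)
        ex_adv = np.array([13.9, 12, 9.3, 9.7, 12.3, 11.4, 9.3, 14.3, 10.2, 8.4, 11.2, 11.1])

        r_sq = np.corrcoef(no_adv, ex_adv)[0, 1] ** 2    # R_j^2 from regressing one X on the other
        vif = 1.0 / (1.0 - r_sq)                         # same VIF for both regressors here
        print(round(vif, 2))                             # values above 5 signal strong multicollinearity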


    If VIFj > 5, Xj is highly correlated with the other independent variables.
    Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
    The LSE b cannot be obtained when the matrix X'X is singular.
    The matrix X'X becomes singular when the columns of X have exact linear dependence, i.e., when any eigenvalue of X'X is zero.
    Thus, a near-zero eigenvalue is also an indication of multicollinearity.
    Methods of dealing with multicollinearity:
    Collecting additional data
    Variable elimination


    We may use the method of variable elimination.
    In practice, if Corr(X1, X2) is more than 0.7 or less than -0.7, we eliminate one of them.
    Techniques:
    Stepwise (based on ANOVA)
    Forward Inclusion (based on Correlation)
    Backward Elimination (based on Correlation)

    Stepwise Regression


    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε
    Step 1: Run 5 simple linear regressions:
    Y = β0 + β1X1
    Y = β0 + β2X2
    Y = β0 + β3X3
    Y = β0 + β4X4
    Y = β0 + β5X5
    Suppose X4 gives the best significant fit.
    Step 2: Run 4 two-variable linear regressions:
    Y = β0 + β4X4 + β1X1
    Y = β0 + β4X4 + β2X2
    Y = β0 + β4X4 + β3X3
    Y = β0 + β4X4 + β5X5


    Suppose X3 gives the best addition in Step 2.
    Step 3: Run 3 three-variable linear regressions:
    Y = β0 + β3X3 + β4X4 + β1X1
    Y = β0 + β3X3 + β4X4 + β2X2
    Y = β0 + β3X3 + β4X4 + β5X5
    Suppose none of these models has p-values < 0.05.
    STOP: the best model is the one with X3 and X4 only.

    Example: The following data were collected on sales, the number of advertisements published, and advertising expenditure for 12 months. Fit a regression model to predict the sales.

    (The Sales, Ads and Adv Ex data are as listed in the earlier table.)

    Summary Output 1: Sales Vs. No_Adv


    Model Summary
    Model    R       R Square   Adjusted R Square   Std. Error of the Estimate
    1        .781    .610       .571                4.20570
    a. Predictors: (Constant), No_Adv

    ANOVA
    Model          Sum of Squares   df   Mean Square    F        Sig.
    1 Regression      276.308        1     276.308     15.621    .003
      Residual        176.879       10      17.688
      Total           453.187       11
    a. Predictors: (Constant), No_Adv
    b. Dependent Variable: Sales

    Coefficients
    Model            B       Std. Error   Beta    t       Sig.
    1 (Constant)   16.937      4.982              3.400   .007
      No_Adv        2.083       .527      .781    3.952   .003
    a. Dependent Variable: Sales

    Summary Output 2: Sales Vs. Ex_Adv


    Model Summary
    Model    R       R Square   Adjusted R Square   Std. Error of the Estimate
    1        .820    .673       .640                3.84900
    a. Predictors: (Constant), Ex_Adv

    ANOVA
    Model          Sum of Squares   df   Mean Square    F        Sig.
    1 Regression      305.039        1     305.039     20.590    .001
      Residual        148.148       10      14.815
      Total           453.187       11
    a. Predictors: (Constant), Ex_Adv
    b. Dependent Variable: Sales

    Coefficients
    Model            B       Std. Error   Beta    t       Sig.
    1 (Constant)    4.173      7.109              .587    .570
      Ex_Adv        2.872       .633      .820   4.538    .001
    a. Dependent Variable: Sales

    Summary Output 3: Sales Vs. No_Adv & Ex_Adv


    Model Summary
    Model    R       R Square   Adjusted R Square   Std. Error of the Estimate
    1        .827    .684       .614                3.98888
    a. Predictors: (Constant), Ex_Adv, No_Adv

    ANOVA
    Model          Sum of Squares   df   Mean Square    F       Sig.
    1 Regression      309.986        2     154.993      9.741   .006
      Residual        143.201        9      15.911
      Total           453.187       11
    a. Predictors: (Constant), Ex_Adv, No_Adv
    b. Dependent Variable: Sales

    Coefficients
    Model            B       Std. Error   Beta    t       Sig.
    1 (Constant)    6.584      8.542              .771    .461
      No_Adv         .625      1.120      .234    .558    .591
      Ex_Adv        2.139      1.470      .611   1.455    .180
    a. Dependent Variable: Sales

    Qualitative Independent Variables


    Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
    To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
    Repair time is believed to be related to two factors:
    Number of months since the last maintenance service
    Type of repair problem (mechanical or electrical)

    Data for a sample of 10 service calls are given:


    Let Y denote the repair time and X1 the number of months since the last maintenance service.
    The regression model that uses X1 only to regress Y is Y = β0 + β1X1 + ε.

    Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
         1                    2                  electrical            2.9
         2                    6                  mechanical            3.0
         3                    8                  electrical            4.8
         4                    3                  mechanical            1.8
         5                    2                  electrical            2.9
         6                    7                  electrical            4.9
         7                    9                  mechanical            4.2
         8                    8                  mechanical            4.8
         9                    4                  electrical            4.4
        10                    6                  electrical            4.5

    Using least squares method, we fitted the model as


    Ŷ = 2.1473 + 0.3041 X1,   R2 = 0.534
    At the 5% level of significance, we reject H0: β0 = 0 (using the t test) and H0: β1 = 0 (using the t and F tests).
    X1 alone explains 53.4% of the variability in repair time.
    To introduce the type of repair into the model, we define a dummy variable
    X2 = 1 if the type of repair is electrical, 0 if the type of repair is mechanical.
    The regression model that uses X1 and X2 to regress Y is Y = β0 + β1X1 + β2X2 + ε.
    Is the new model improved?
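    A minimal sketch (assumed NumPy implementation, not from the slides) that fits both models from the Johnson Filtration data, so the two R2 values can be compared to answer that question:

        import numpy as np

        months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6], dtype=float)
        electrical = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1], dtype=float)   # dummy X2
        hours = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

        def fit_and_r2(X, y):
            X1 = np.column_stack([np.ones(len(y)), X])              # add intercept column
            b = np.linalg.lstsq(X1, y, rcond=None)[0]
            r2 = 1 - np.sum((y - X1 @ b) ** 2) / np.sum((y - y.mean()) ** 2)
            return b, r2

        b1, r2_1 = fit_and_r2(months.reshape(-1, 1), hours)                   # X1 only, R^2 about 0.534
        b2, r2_2 = fit_and_r2(np.column_stack([months, electrical]), hours)   # X1 and the dummy X2
        print(np.round(b1, 3), round(r2_1, 3))
        print(np.round(b2, 3), round(r2_2, 3))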

    Summary


    Multiple linear regression model: Y = Xβ + ε
    The least squares estimate of β is given by b = (X'X)⁻¹X'Y.
    R2 and adjusted R2.
    Using ANOVA (F test), we examine whether all the βs are zero or not.
    A t test is conducted for each regressor separately; using it, we examine whether the β corresponding to that regressor is zero or not.
    Problem of multicollinearity: VIF, eigenvalues.
    Dummy variables.
    Examining the assumptions: common variance, independence, normality.