

Page 1:

Stat 112: Lecture 15 Notes

• Finish Chapter 6:
  – Review on Checking Assumptions (Sections 6.4-6.6)
  – Outliers and Influential Points (Section 6.7)

• Homework 4 is due this Thursday.

• Please let me know of any ideas you want to discuss for the final project.

Page 2:

Review of Checking and Remedying Assumptions

1. Linearity: E(Y | X_1 = x_1, ..., X_K = x_K) = β_0 + β_1 x_1 + ... + β_K x_K. Check the residual-by-predicted plot and the residual plot for each variable for a pattern in the mean of the residuals. Remedies: transformations and polynomials. To see if a remedy works, check the new residual plots for a pattern in the mean of the residuals.

2. Constant variance: The standard deviation of Y for the subpopulation of units with X_1 = x_1, ..., X_K = x_K is the same for all subpopulations. Check the residual-by-predicted plot for a pattern in the spread of the residuals. Remedy: transformation of Y. To see if the remedy works, check the residual-by-predicted plot for the transformed-Y regression.

3. Normality: The distribution of Y for the subpopulation of units with X_1 = x_1, ..., X_K = x_K is normally distributed for all subpopulations. Check the histogram of the residuals for a bell shape and the normal quantile plot of the residuals for an approximately straight line. Remedy: transformation of Y. To see if the remedy works, check the histogram and normal quantile plot of the transformed-Y regression residuals.

Page 3:

Checking whether a transformation of Y works for remedying nonconstant variance

1. Create a new column with the transformation of the Y variable by right-clicking in the new column, clicking Formula, and putting in the appropriate formula for the transformation (note: Log is contained in the class of transcendental functions).
2. Fit the regression of the transformation of Y on the X variables.
3. Check the residual-by-predicted plot to see if the spread of the residuals appears constant over the range of predicted values.
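The steps above are JMP point-and-click. Purely as an illustration, the same check can be sketched in Python with numpy on simulated data (hypothetical, standing in for a series whose spread grows with the level, like the sales example that follows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data whose error spread grows with x, as in the
# residual plots that motivate a log transformation of Y.
x = np.linspace(1, 100, 200)
y = np.exp(0.02 * x + 4 + rng.normal(0, 0.2, size=x.size))

def abs_resid_trend(response, x):
    """Correlation between |residual| and fitted value from a least
    squares line; values near 0 suggest roughly constant spread."""
    slope, intercept = np.polyfit(x, response, 1)
    fitted = intercept + slope * x
    resid = response - fitted
    return np.corrcoef(np.abs(resid), fitted)[0, 1]

raw_trend = abs_resid_trend(y, x)          # spread trends upward with fitted values
log_trend = abs_resid_trend(np.log(y), x)  # roughly constant spread

print(raw_trend, log_trend)
```

If the transformation works, the second number should sit much closer to zero than the first, which is the numeric analogue of the residual-by-predicted plot flattening out.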

Page 4:

Example 6.3 from the book: Telecore is a high-tech company located in Fort Worth, Texas. The company produces a part called a fibre-optic connector (FOC) and wants to generate a prediction of the sales of FOCs based on the week.

Response: SALES
Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   4703.7694    512.4686     9.18     <.0001
Week          72.458877    3.340064  21.69     <.0001

[Figure: residual by predicted plot of SALES Residual vs. SALES Predicted]

Nonconstant variance with spread of residuals increasing as the predicted values increase.

Page 5:

We try transforming Sales to Log(Sales).

Response: Log Sales
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   8.7390327   0.034298    254.80    0.0000
Week        0.0053746   0.000224     24.04    <.0001

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio    Prob > F
Week     1       1    44.796634        578.0861   <.0001

[Figure: residual by predicted plot of Log Sales Residual vs. Log Sales Predicted]

Log Sales has approximately constant variance.

Page 6:

Outliers in Residuals

• Standardized residual: the residual divided by its estimated standard deviation (roughly, the residual divided by the RMSE).
• Under the normality assumption, 95% of standardized residuals should be between -2 and 2, and 99% should be between -3 and 3.

• An observation with a standardized residual above 3 or below -3 is considered to be an outlier in its residual, i.e., its Y value is unusual given its explanatory variables. It is worth looking further at the observation to see if any reasons for the large magnitude residual can be identified.
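For readers working outside JMP, this rule of thumb can be sketched in Python on hypothetical data (this uses the crude residual/RMSE version of standardization; software-standardized residuals typically also adjust for leverage):

```python
import numpy as np

# Hypothetical data: y tracks x except for one unit whose Y value
# is unusual given its explanatory variable.
x = np.arange(20, dtype=float)
y = 3.0 + 2.0 * x
y[10] += 25.0                       # plant an outlier in the residual

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Crude standardization: residual / RMSE (the slide's rule of thumb).
p = 2                               # coefficients in simple regression
rmse = np.sqrt(np.sum(resid**2) / (len(x) - p))
z = resid / rmse

outliers = np.where(np.abs(z) > 3)[0]
print(outliers)                     # → [10]
```

Only the planted observation crosses the ±3 line; its Y value is unusual given its X, so it would be worth investigating further.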

Page 7:

philacrimerate.JMP outliers in residuals

Bivariate Fit of HousePrice By CrimeRate

[Figure: scatterplot of HousePrice vs. CrimeRate, with Gladwyne, Haverford, Phila, N, and Phila, CC labeled]

Gladwyne, Villanova, and Haverford are outliers in residuals (their house prices are considerably higher than one would expect given their crime rates).

Page 8:

Influential Points and Leverage Points

• Influential observation: a point that, if removed, would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.

• Leverage point: a point that is an outlier in the X direction and so has the potential to be influential. It will be influential if its residual is of moderately large magnitude.

Page 9:

Which Observations Are Influential?

Bivariate Fit of HousePrice By CrimeRate

[Figure: scatterplot of HousePrice vs. CrimeRate with three linear fits; Gladwyne, Haverford, Phila, N, and Phila, CC labeled]

All observations:                   HousePrice = 176629.41 - 576.90813 CrimeRate
Without Center City Philadelphia:   HousePrice = 225233.55 - 2288.6894 CrimeRate
Without Gladwyne:                   HousePrice = 173116.43 - 567.74508 CrimeRate

Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.

Page 10:

Excluding Observations from Analysis in JMP

• To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation.

• To put the observation back into the analysis, go to the row of the observation, click Rows and then click Exclude/Unexclude. The red circle should no longer appear next to the observation.

Page 11:

Formal measures of leverage and influence

• Leverage: "hat values" (JMP calls them Hats).
• Influence: Cook's Distance (JMP calls them Cook's D Influence).
• To obtain them in JMP, click Analyze, Fit Model, and put the Y variable in Y and the X variables in the Model Effects box. Click Run Model. After the model is fit, click the red triangle next to Response, click Save Columns, and then click Hats for leverages and Cook's D Influence for Cook's distances.
• To sort observations by Cook's Distance or leverage, click Tables, Sort, and then put the variable you want to sort by in the By box.

Page 12:

Distributions: Cook's D Influence HousePrice and h HousePrice

[Figure: histograms of Cook's D Influence HousePrice (Haverford, Gladwyne, and Phila, CC labeled) and h HousePrice (Phila, CC labeled)]

Center City Philadelphia has both high influence (Cook's Distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observations have high influence or high leverage.

Page 13:

Rules of Thumb for High Leverage and High Influence

• High leverage: any observation with a leverage (hat value) > (3 * number of coefficients in the regression model)/n has high leverage, where the number of coefficients in the regression model is 2 for simple linear regression and n is the number of observations.

• High influence: any observation with a Cook's Distance greater than 1 has high influence.
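As a sketch of how the two measures behind these rules are computed (standard formulas: the leverages are the diagonal of the hat matrix X(X'X)^-1 X', and Cook's distance is D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2)), here is a small Python example with hypothetical data containing one point that is far out in X and unusual in Y:

```python
import numpy as np

# Hypothetical simple-regression data with one point that is far out
# in the X direction and also has an unusual Y value.
x = np.arange(1.0, 11.0)            # x = 1..10
y = x.copy()                        # y = x exactly for these points
x = np.append(x, 100.0)             # leverage point...
y = np.append(y, -200.0)            # ...that is also an outlier in Y

n = len(x)
X = np.column_stack([np.ones(n), x])   # design matrix (intercept + x)
p = X.shape[1]                          # number of coefficients (2 here)

# Hat values: diagonal of X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = np.sum(resid**2) / (n - p)         # mean squared error

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_d = resid**2 / (p * s2) * h / (1 - h)**2

high_leverage = h > 3 * p / n           # the slide's rule of thumb
high_influence = cooks_d > 1

print(np.where(high_leverage)[0], np.where(high_influence)[0])
```

Only the planted point (the last row) crosses both thresholds, matching the rules of thumb above.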

Page 14:

What to Do About Suspected Influential Observations?

See the flowchart attached to the end of the slides.

Does removing the observation change the substantive conclusions?
• If not, you can say something like "Observation x has high influence relative to all other observations, but we tried refitting the regression without Observation x and our main conclusions didn't change."

Page 15:

• If removing the observation does change substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
  – If yes, omit the observation and proceed.
  – If no, does the observation have high leverage (outlier in the explanatory variables)?
    • If yes, omit the observation and proceed. Report that the conclusions only apply to a limited range of the explanatory variable.
    • If no, not much can be said. More data (or clarification of the influential observation) are needed to resolve the question.

Page 16:

General Principles for Dealing with Influential Observations

• General principle: Delete observations from the analysis sparingly – only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.

Page 17:

Influential Points, High Leverage Points, and Outliers in Multiple Regression

• As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook's distances (use Save Columns to save Cook's D Influence and Hats).

• High influence points: Cook's distance > 1.

• High leverage points: any point with a hat value greater than (3 * (number of explanatory variables + 1))/n has high leverage. These are points for which the explanatory variables are an outlier in a multidimensional sense.

• Use the same guidelines for dealing with influential observations as in simple linear regression.

• A point that has an unusual Y given its explanatory variables is one with a residual that is more than 3 RMSEs away from zero.

Page 18:

Multiple regression, modeling and outliers, leverage and influential points

Pollution Example

• The data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959-1961.
• The variables are:
  – y (MORT) = total age-adjusted mortality in deaths per 100,000 population;
  – PRECIP = mean annual precipitation (in inches);
  – EDUC = median number of school years completed for persons 25 and older;
  – NONWHITE = percentage of the 1960 population that is nonwhite;
  – NOX = relative pollution potential of NOx (related to the amount of tons of NOx emitted per day per square kilometer);
  – SO2 = relative pollution potential of SO2.

Page 19:

Scatterplot Matrix

• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables as well as potential outliers and influential points.

• Scatterplot matrix in JMP: click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.
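A numeric companion to the scatterplot matrix is the correlation matrix (reported on a later slide for the pollution data). As a sketch, with hypothetical simulated columns standing in for three of the pollution variables:

```python
import numpy as np

# Hypothetical stand-in for the pollution data: each row is a city;
# a real analysis would load pollution.JMP instead.
rng = np.random.default_rng(1)
precip = rng.uniform(10, 60, size=60)
nox = rng.lognormal(mean=2.0, sigma=1.0, size=60)    # skewed, "crunched"
mort = 800 + 2 * precip + rng.normal(0, 30, size=60)

data = np.column_stack([mort, precip, nox])
corr = np.corrcoef(data, rowvar=False)   # columns are variables

print(np.round(corr, 2))   # symmetric, with ones on the diagonal
```

Each off-diagonal entry is the pairwise correlation that the corresponding panel of the scatterplot matrix displays visually.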

Page 20:

[Figure: scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, and SO2]

Page 21:

Crunched Variables

• When an X variable is "crunched" (meaning that most of its values are close together and a few are far apart), there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable to the log of the variable.
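Crunching shows up directly in the leverages: for a simple regression, h_i = 1/n + (x_i - xbar)^2 / Sxx, so the few far-apart values take almost all the leverage. A small Python sketch with a hypothetical crunched X:

```python
import numpy as np

def hat_values(x):
    """Leverages for a simple regression with intercept:
    h_i = 1/n + (x_i - xbar)^2 / Sxx."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return 1.0 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)

# Hypothetical crunched X: most values close together, a few far apart.
x = np.array([1, 2, 2, 3, 3, 4, 4, 5, 200, 400], dtype=float)

h_raw = hat_values(x)
h_log = hat_values(np.log(x))

print(h_raw.max(), h_log.max())   # the log transform spreads the values out
```

The maximum leverage drops after the log transformation, which is why the log is the usual remedy for a crunched explanatory variable.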

Page 22:

2. a) From the scatterplot of MORT vs. NOX we see that the NOX values are crunched together very tightly. A log transformation of NOX is needed.

b) The curvature in MORT vs. SO2 indicates a log transformation of SO2 may be suitable.

After the two transformations we have the following correlations:

           MORT     PRECIP   EDUC     NONWHITE  NOX      SO2      Log(NOX)  Log(SO2)
MORT       1.0000   0.5095  -0.5110   0.6437   -0.0774   0.4259   0.2920    0.4031
PRECIP     0.5095   1.0000  -0.4904   0.4132   -0.4873  -0.1069  -0.3683   -0.1212
EDUC      -0.5110  -0.4904   1.0000  -0.2088    0.2244  -0.2343   0.0180   -0.2562
NONWHITE   0.6437   0.4132  -0.2088   1.0000    0.0184   0.1593   0.1897    0.0524
NOX       -0.0774  -0.4873   0.2244   0.0184    1.0000   0.4094   0.7054    0.3582
SO2        0.4259  -0.1069  -0.2343   0.1593    0.4094   1.0000   0.6905    0.7738
Log(NOX)   0.2920  -0.3683   0.0180   0.1897    0.7054   0.6905   1.0000    0.7328
Log(SO2)   0.4031  -0.1212  -0.2562   0.0524    0.3582   0.7738   0.7328    1.0000

Page 23:

[Figure: scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, SO2, Log(NOX), and Log(SO2)]

Page 24:

Response: MORT
Summary of Fit
RSquare                  0.688278
RSquare Adj              0.659415
Root Mean Square Error   36.30065

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   940.6541     94.05424    10.00     <.0001
PRECIP      1.9467286    0.700696     2.78     0.0075
EDUC        -14.66406    6.937846    -2.11     0.0392
NONWHITE    3.028953     0.668519     4.53     <.0001
Log NOX     6.7159712    7.39895      0.91     0.3681
Log SO2     11.35814     5.295487     2.14     0.0365

[Figure: residual by predicted plot of MORT Residual vs. MORT Predicted, and plot of Cook's Distances; New Orleans, LA labeled in both]

New Orleans has a Cook's Distance greater than 1, so New Orleans may be influential.

Page 25:

Labeling Observations

• To have points identified by a certain column, go to the column, click Columns, and click Label (click Unlabel to unlabel).

• To label a row, go to the row, click Rows, and click Label.

Page 26:

Multiple Regression with New Orleans

Summary of Fit
RSquare                      0.688278
RSquare Adj                  0.659415
Root Mean Square Error       36.30065
Mean of Response             940.3568
Observations (or Sum Wgts)   60

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      5    157115.28        31423.1       23.8462   <.0001
Error      54   71157.80         1317.7
C. Total   59   228273.08

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   940.6541     94.05424    10.00     <.0001
PRECIP      1.9467286    0.700696     2.78     0.0075
EDUC        -14.66406    6.937846    -2.11     0.0392
NONWHITE    3.028953     0.668519     4.53     <.0001
Log NOX     6.7159712    7.39895      0.91     0.3681
Log SO2     11.35814     5.295487     2.14     0.0365

Multiple Regression without New Orleans

Summary of Fit
RSquare                      0.724661
RSquare Adj                  0.698686
Root Mean Square Error       32.06752
Mean of Response             937.4297
Observations (or Sum Wgts)   59

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      5    143441.28        28688.3       27.8980   <.0001
Error      53   54501.26         1028.3
C. Total   58   197942.54

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   852.3761     85.9328      9.92     <.0001
PRECIP      1.3633298    0.635732     2.14     0.0366
EDUC        -5.666948    6.52378     -0.87     0.3889
NONWHITE    3.0396794    0.590566     5.15     <.0001
Log NOX     -9.898442    7.730645    -1.28     0.2060
Log SO2     26.032584    5.931083     4.39     <.0001

Removing New Orleans has a large impact on the coefficients of Log NOX and Log SO2; in particular, it reverses the sign of the Log NOX coefficient.
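A single influential observation reversing the sign of a fitted coefficient, as New Orleans does here, can be reproduced on a toy data set (hypothetical numbers, not the pollution data):

```python
import numpy as np

# Hypothetical data: a clean positive relationship for ten points,
# plus one influential observation far out in X with a very low Y.
x = np.arange(1.0, 11.0)
y = 5.0 + 1.0 * x                  # slope is exactly +1 for these points
x = np.append(x, 100.0)
y = np.append(y, -200.0)           # the "New Orleans" of this toy example

slope_all, _ = np.polyfit(x, y, 1)        # fit with all observations
slope_wo, _ = np.polyfit(x[:-1], y[:-1], 1)  # fit without the last one

print(slope_all, slope_wo)   # opposite signs
```

Refitting without the one observation flips the slope from negative back to positive, which is exactly the kind of check the flowchart asks for.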

Page 27:

Leverage Plots

• A "simple regression view" of a multiple regression coefficient. For x_j, plot the residual of y (from the regression without x_j) against the residual of x_j (from the regression of x_j on the rest of the x's); both axes are recentered.

• Slope: the coefficient for that variable in the multiple regression.

• Distances from the points to the least squares line are the multiple regression residuals. The distance from a point to the horizontal line is its residual if the explanatory variable were not included in the model.

• Useful to identify, for x_j, outliers, leverage points, and influential points. (Use them the same way as in simple regression to identify the effect of points on the regression coefficient of a particular variable.)
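The "slope equals the multiple regression coefficient" property of a leverage plot (also called an added-variable plot; the result is the Frisch-Waugh-Lovell theorem, not named in the slides) can be checked numerically. A sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical multiple regression data: y on x1 and x2 (plus intercept).
n = 60
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)       # correlated predictors
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

def ols_resid(response, design):
    """Residuals from a least squares fit of response on design columns."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return response - design @ beta

ones = np.ones(n)
others = np.column_stack([ones, x1])     # everything except x2

# Leverage (added-variable) plot coordinates for x2:
ry = ols_resid(y, others)                # y with the other x's removed
rx = ols_resid(x2, others)               # x2 vs. the rest of the x's

av_slope = np.sum(rx * ry) / np.sum(rx * rx)   # slope through the origin

# Compare with the coefficient of x2 in the full multiple regression.
full_X = np.column_stack([ones, x1, x2])
full_beta, *_ = np.linalg.lstsq(full_X, y, rcond=None)

print(av_slope, full_beta[2])   # the two agree
```

Because the plotted points carry the multiple regression residuals, an observation that stands apart in this plot is exactly one that is an outlier or leverage point for that particular coefficient.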

Page 28:

[Figure: leverage plots of MORT Leverage Residuals vs. PRECIP (P=0.0075), EDUC (P=0.0392), NONWHITE (P<.0001), Log NOX (P=0.3681), and Log SO2 (P=0.0365); New Orleans enlarged in each]

The enlarged observation New Orleans is an outlier for estimating each coefficient and has high leverage for estimating the coefficients of interest on Log NOX and Log SO2. Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.