Learn to Test for Heteroscedasticity in SPSS With Data From the China Health and Nutrition Survey (2006) © 2015 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

How-to Guide for IBM® SPSS® Statistics Software

Introduction

In this guide you will learn how to detect heteroscedasticity following a linear

regression model in IBM® SPSS® Statistics Software (SPSS), using a practical

example to illustrate the process. You will find links to the example dataset and

you are encouraged to replicate this example. An additional practice example

is suggested at the end of this guide. The example assumes you have already

opened the data file in SPSS.

Contents

1. Heteroscedasticity

2. An Example in SPSS: Blood Pressure and Age in China

2.1 The SPSS Procedure

2.2 Exploring the SPSS Output

3. Your Turn

1 Heteroscedasticity

Linear regression models estimated via Ordinary Least Squares (OLS) rest on

several assumptions, one of which is that the variance of the residuals from the

model is constant and unrelated to the independent variable(s). Constant variance

is called homoscedasticity, while non-constant variance is called

heteroscedasticity. This example illustrates how to detect heteroscedasticity

following the estimation of a simple linear regression model.

2 An Example in SPSS: Blood Pressure and Age in China

This example uses two variables from the 2006 China Health and Nutrition

Survey:

• A person’s systolic blood pressure (systolic).

• A person’s age, measured in years (age).

There are 9178 respondents in this survey. Systolic blood pressure measures

the pressure in a person’s arteries when their heart beats (contracts and pumps

blood). In this dataset, this variable ranges from 70 to 240 with a mean of about

122 and a standard deviation of 18.14. Age is measured in years, and in this

dataset it ranges from 17 to 95 with a mean of about 49 and a standard deviation

of 15.19. Both of these variables are continuous, making them appropriate for

simple regression.

2.1 The SPSS Procedure

Before producing the simple regression model, it is a good idea to look at each

variable separately. However, in the interest of space, we forgo doing so here.

Readers should explore the SAGE Research Methods Dataset examples

associated with Simple Regression and Multiple Regression for more information.

You estimate a simple regression model in SPSS by selecting from the menu:

Analyze → Regression → Linear

In the Linear Regression dialog box that opens, move the systolic blood pressure

variable (systolic) into the Dependent: window and move the age variable (age)

into the Independent(s): window.

Figure 1 shows what this looks like in SPSS.

Figure 1: Selecting simple regression from the Analyze menu in SPSS.

Because we want to explore whether there is evidence of heteroscedasticity

among the residuals of this regression, we also want to produce a scatter plot

that plots the standardized residuals on the Y-axis and the standardized predicted

values of the dependent variable on the X-axis.

First, we click the “Plots…” button on the right-hand side of the Linear Regression

dialog box. That opens a second dialog box. In this second dialog box, move

*ZRESID into the open box under Y: and *ZPRED into the open box under X: as

shown in Figure 2.

Figure 2: Producing a two-way scatter plot of standardized residuals and

standardized predicted values for a regression model in the Linear Regression:

Plots dialog box in SPSS.

Once you are done, click Continue in this dialog box, and then click OK to perform

the analysis.
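For readers who want to see what the two plotted quantities are, the following is a minimal Python sketch; it is not part of the SPSS procedure. It assumes that *ZRESID is each residual divided by the standard error of the estimate and that *ZPRED is each predicted value rescaled to mean 0 and standard deviation 1. The function name and the six toy data points are hypothetical and are not drawn from the survey.

```python
import math

def standardized_pred_and_resid(x, y):
    """Fit y = a + b*x by OLS and return (zpred, zresid) lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    pred = [intercept + slope * xi for xi in x]
    resid = [yi - pi for yi, pi in zip(y, pred)]

    # Assumed definition: residual / standard error of the estimate,
    # where the standard error uses n - 2 degrees of freedom.
    std_error = math.sqrt(sum(e * e for e in resid) / (n - 2))
    zresid = [e / std_error for e in resid]

    # Assumed definition: predicted values rescaled to mean 0, SD 1.
    mean_pred = sum(pred) / n
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred) / (n - 1))
    zpred = [(p - mean_pred) / sd_pred for p in pred]
    return zpred, zresid

# Hypothetical toy data (ages and systolic readings), for illustration only.
ages = [20, 30, 40, 50, 60, 70]
systolic = [105, 115, 118, 125, 124, 140]
zpred, zresid = standardized_pred_and_resid(ages, systolic)
for zp, zr in zip(zpred, zresid):
    print(f"{zp:6.2f} {zr:6.2f}")
```

Each printed row is one point on the scatter plot: the first column is the X coordinate and the second is the Y coordinate.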

2.2 Exploring the SPSS Output

Figure 3 presents five tables of results that are produced by the simple linear

regression procedure in SPSS. The fifth table is produced because we asked

SPSS to produce plots using the standardized residuals. The fourth table in Figure

3, outlined in red, includes the results of the regression model itself.

Figure 3: Simple regression of systolic blood pressure on age, 2006 China Health

and Nutrition Survey.

The first three tables in Figure 3 report the independent variable(s) entered into

the model, some summary fit statistics for the regression model, and an analysis

of variance for the model as a whole. While detailed examination of these tables

is beyond the scope of this example, we note in the second table that R Square

measures the proportion of the variance in the dependent variable explained by

the model, which in this case consists of a single independent variable. An R

Square of 0.157 means that approximately 15.7% of the variance in systolic blood

pressure is accounted for by age.

The fourth table in Figure 3, outlined in red, presents the estimates of the

intercept, or constant, and the slope coefficient. The results report an estimate

of the intercept (or constant) as equal to approximately 98.56. The constant of a

simple regression model can be interpreted as the average expected value of the

dependent variable when the independent variable equals zero. In this case, our

independent variable, age, is never close to zero (the youngest respondent is 17), so the constant by itself does not

tell us much.

The slope coefficient linking age to systolic blood pressure

is estimated to be approximately 0.47. This represents the average marginal effect

of age on systolic blood pressure, and can be interpreted as the expected change

on average in the dependent variable for a one-unit increase in the independent

variable. For this example, that means that every increase in age of 1 year is

associated with an average increase of about 0.47 in systolic blood pressure.

The fourth table in Figure 3 reports that this estimate is statistically significantly

different from zero, with a p-value well below 0.001. This leads us to reject the

null hypothesis of no linear relationship and conclude that there does appear to be a positive relationship

between a person’s age and their systolic blood pressure in China.
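As a worked illustration of the intercept and slope estimates above, the short Python sketch below plugs ages into the fitted line using the rounded coefficients reported in the text (98.56 and 0.47). The function name is hypothetical, and because the coefficients are rounded the outputs are approximate.

```python
INTERCEPT = 98.56  # estimated constant reported in the text (rounded)
SLOPE = 0.47       # estimated slope for age reported in the text (rounded)

def predicted_systolic(age):
    """Predicted systolic blood pressure from the fitted line."""
    return INTERCEPT + SLOPE * age

# Predicted value for a 50-year-old.
print(f"{predicted_systolic(50):.2f}")  # 122.06

# Difference in predicted values between ages 60 and 50 (ten slope units).
print(f"{predicted_systolic(60) - predicted_systolic(50):.2f}")  # 4.70
```

At age 50 the fitted line gives about 122, close to the sample mean of systolic blood pressure reported earlier, which is what we would expect near the mean age of about 49.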

Figure 4 presents a plot with the standardized residuals of this regression on the

Y-axis and the standardized predicted values of the dependent variable on the

X-axis. Figure 4 shows that the vertical spread of the residuals is relatively low for

respondents with lower predicted levels of systolic blood pressure. However, as

we move left to right and the predicted level of systolic blood pressure increases,

we see the vertical spread of the residuals also increasing. This spread appears to

shrink somewhat at the very highest predicted values for systolic blood pressure.

Overall, Figure 4 shows a pattern in the variance of the residuals, meaning that

we appear to have evidence of heteroscedasticity.

Figure 4: Two-way scatter plot of standardized residuals from the regression

shown in the fourth table of Figure 3 on the Y-axis and standardized predicted values

of the dependent variable from that regression on the X-axis, 2006 China Health

and Nutrition Survey.
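One rough way to put numbers on the visual pattern in a plot like Figure 4 is to split the cases into groups by predicted value and compare the spread of the residuals in each group. The Python sketch below does this on simulated data. The function name, the choice of three groups, and the simulated values are assumptions made for illustration and are not results from the survey.

```python
import math
import random

def residual_sd_by_group(pred, resid, groups=3):
    """Sort cases by predicted value, split them into equal-sized groups,
    and return the standard deviation of the residuals in each group."""
    pairs = sorted(zip(pred, resid))
    size = len(pairs) // groups
    sds = []
    for g in range(groups):
        start = g * size
        # The last group absorbs any leftover cases.
        end = len(pairs) if g == groups - 1 else start + size
        chunk = [r for _, r in pairs[start:end]]
        mean = sum(chunk) / len(chunk)
        var = sum((r - mean) ** 2 for r in chunk) / (len(chunk) - 1)
        sds.append(math.sqrt(var))
    return sds

# Simulated example: the residual spread grows with the predicted value.
random.seed(1)
pred = [random.uniform(100, 150) for _ in range(3000)]
resid = [random.gauss(0, 0.2 * (p - 90)) for p in pred]
sds = residual_sd_by_group(pred, resid)
print([round(s, 1) for s in sds])
```

Standard deviations that rise or fall steadily across the groups point in the same direction as a fan shape in the plot; roughly equal values are what homoscedasticity would produce.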

Unfortunately, the Linear Regression procedure in SPSS does not include any formal tests of heteroscedasticity.

Users can create macros within SPSS to perform specific functions not built into

the software, but that process is beyond the scope of this example. Example code

for a macro that includes the Breusch–Pagan test, and a tutorial video on how to

use it, can be found at the following links:

http://www.spsstools.net/Syntax/RegressionRepeatedMeasure/Breusch-PaganAndKoenkerTest.sps

https://www.youtube.com/watch?v=3QcX4jqPn14

Applying the steps of the Breusch–Pagan test to this example results in a test

statistic of 652.33. When compared to a chi-squared distribution with 1 degree of

freedom, the resulting p-value falls well below the standard 0.05 level. Thus we

have clear evidence to reject the null hypothesis of homoscedasticity and conclude

that the residuals of this regression model are heteroscedastic.
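To make the logic of the test concrete, here is a minimal Python sketch of a Breusch–Pagan statistic for a simple regression, together with Koenker's studentized variant. It is an illustration under stated assumptions, not the linked SPSS macro: the function names and the simulated data are hypothetical, and the source does not say which variant produced the 652.33 reported above. The p-value helper relies on the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
import math
import random

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def breusch_pagan(x, y):
    """Return (bp, koenker) test statistics for one predictor."""
    n = len(x)
    a, b = simple_ols(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / n       # ML variance estimate
    g = [e * e / sigma2 for e in resid]          # scaled squared residuals
    a2, b2 = simple_ols(x, g)                    # auxiliary regression on x
    g_bar = sum(g) / n
    explained = sum((a2 + b2 * xi - g_bar) ** 2 for xi in x)
    total = sum((gi - g_bar) ** 2 for gi in g)
    bp = explained / 2                           # Breusch-Pagan LM statistic
    koenker = n * explained / total              # n * R-squared variant
    return bp, koenker

def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2))

# Simulated data whose error spread grows with x (illustration only).
random.seed(42)
x = [random.uniform(17, 95) for _ in range(2000)]
y = [98.56 + 0.47 * xi + random.gauss(0, 0.3 * xi) for xi in x]
bp, koenker = breusch_pagan(x, y)
print(chi2_pvalue_1df(bp) < 0.05, chi2_pvalue_1df(koenker) < 0.05)

# Is the statistic reported in the text significant at the 0.05 level?
print(chi2_pvalue_1df(652.33) < 0.05)
```

With 1 degree of freedom the 0.05 critical value is about 3.84, so a statistic of 652.33 lies far in the upper tail, consistent with the conclusion above.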

3 Your Turn

Download this sample dataset and see if you can replicate these results. Then

repeat the analysis, this time replacing the systolic blood pressure variable with a

variable measuring diastolic blood pressure (diastolic) as the dependent variable

and then explore whether or not there is evidence of heteroscedasticity in the

residuals of the regression.

IBM® SPSS® Statistics software (SPSS) screenshots Republished Courtesy of

International Business Machines Corporation, © International Business Machines

Corporation. SPSS Inc. was acquired by IBM in October, 2009. IBM, the IBM logo,

ibm.com, and SPSS are trademarks or registered trademarks of International

Business Machines Corporation, registered in many jurisdictions worldwide. Other

product and service names might be trademarks of IBM or other companies. A

current list of IBM trademarks is available on the Web at “IBM Copyright and

trademark information” at http://www.ibm.com/legal/copytrade.shtml.
