11/4/2015Slide 1 SOLVING THE PROBLEM Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the

04/20/23 Slide 1

SOLVING THE PROBLEM

Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the data satisfies the assumption of linearity in a scatterplot of the raw data, provided the spread of the residuals is equal for all of the predicted values in the residual plot, and provided there are no outliers impacting the linear model. When the relationship we are analyzing does not meet these criteria, the use of regression analysis can still be justified if re-expressing one or both variables reduces the non-linear pattern in the scatterplot, equalizes the variance in the residual plot, and reduces the distance of outliers from the other cases in the distributions.

Clues that re-expression might be effective in linearizing the relationship are severe skewness in one or both variables (skewness outside the range from -1.0 to +1.0) and a Spearman's rho that is greater than Pearson's r.

There is no guarantee that re-expression will produce a scatterplot that satisfies the assumptions of linear regression. When it does not, we are left with the choice of deciding that the violations are not of serious consequence or choosing an alternative strategy for modeling the relationship.

04/20/23 Slide 2

To solve these problems, we will first assess the conformity of the relationship to regression assumptions. Second, we will examine the criteria that suggest that re-expression might be effective. Third, we will examine the model using re-expressed variables to assess conformity to regression assumptions.

Finally, if the model using raw data supports the regression assumptions, we will interpret the direction and strength of the relationship. If it was necessary to re-express one or both variables, we will interpret the strength of the re-expressed model, provided it satisfies the regression assumptions.

If the model still violates the conditions for a linear model, we will not interpret the direction and strength of the relationship. This is the convention in our homework problems; in a real application, I would consider interpreting the relationship, attaching a caution which identifies the violation of the assumptions.
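The homework convention described above can be sketched as a small decision function. This is only an illustration: the function name `choose_model` and its two boolean arguments are hypothetical and are not part of SPSS or of the problem statement.

```python
def choose_model(raw_ok: bool, reexpressed_ok: bool) -> str:
    """Return which model, if any, is interpreted under the homework convention.

    raw_ok: the raw-data model satisfies the regression assumptions.
    reexpressed_ok: the re-expressed model satisfies the regression assumptions.
    """
    if raw_ok:
        return "interpret raw model"
    if reexpressed_ok:
        return "interpret re-expressed model"
    return "do not interpret"


# Example: the raw data violate an assumption, the re-expressed data do not.
print(choose_model(raw_ok=False, reexpressed_ok=True))
```

In a real application the last branch might instead return an interpretation with a caution attached, as the slide notes.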

In these problems, outliers are defined as cases that have a Cook’s distance greater than 0.5, and hence have a larger influence on the regression solution than other cases included in the analysis.
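To make the 0.5 cutoff concrete, the sketch below computes Cook's distance by hand for a simple linear regression, using the standard textbook formula (squared residual over p times the mean squared error, multiplied by leverage over one minus leverage, squared). The data are hypothetical and the function name `cooks_distances` is an assumption for illustration; SPSS produces these values itself through the Save dialog.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated coefficients: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append(e ** 2 / (p * mse) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: the last case sits far from the others and off the trend.
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
d = cooks_distances(x, y)

# Apply the cutoff used in these problems (Cook's distance greater than 0.5).
influential = [i for i, di in enumerate(d) if di > 0.5]
print(influential)
```

Only the final case is flagged here, because it combines high leverage with a large residual.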

04/20/23 Slide 3

The introductory statement in the question indicates:
• The data set to use (world2007.sav)
• The task to accomplish (simple linear regression)
• The variables to use in the analysis: the independent variable population median age in years [agemdn] and the dependent variable infant mortality rate [infmort]

04/20/23 Slide 4

The second paragraph tells us how to re-express the variables, should it be necessary.

04/20/23 Slide 5

The first statement asks about the size of the sample. To answer this question, we run the linear regression in SPSS.

04/20/23 Slide 6

To compute a simple linear regression, select Regression > Linear from the Analyze menu.

04/20/23 Slide 7

First, move the dependent variable, infmort, to the Dependent text box.

Second, move the independent variable, agemdn, to the Independent(s) list box.

Third, click on the Statistics button to request basic descriptive statistics.

04/20/23 Slide 8

First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.

Second, click on the Continue button to close the dialog box.

04/20/23 Slide 9

Next, click on the Plots button to request the residual plot.

Though we do not need it to answer the immediate question, producing the residual plot now will save us time later in the problem.

04/20/23 Slide 10

First, move *ZRESID (for standardized residuals) to the Y axis text box.

Second, move *ZPRED (for standardized predictions) to the X axis text box.

Third, click on the Continue button to close the dialog box.
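For readers who want to see what the two plotted quantities represent, here is a minimal sketch. It assumes that standardized residuals are residuals divided by the square root of the mean squared error, and that standardized predictions are z-scores of the predicted values; check the SPSS documentation for its exact computation. The data and the function name `standardized_plot_values` are hypothetical.

```python
import math


def standardized_plot_values(x, y):
    """Return (zpred, zresid) for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    # Assumed standardizations (see the lead-in above).
    mean_pred = sum(predicted) / n
    sd_pred = math.sqrt(sum((pi - mean_pred) ** 2 for pi in predicted) / (n - 1))
    root_mse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    zpred = [(pi - mean_pred) / sd_pred for pi in predicted]
    zresid = [e / root_mse for e in residuals]
    return zpred, zresid


# Hypothetical data, used only to exercise the function.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
zpred, zresid = standardized_plot_values(x, y)
```

Plotting `zresid` on the vertical axis against `zpred` on the horizontal axis gives the residual plot requested in the dialog.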

04/20/23 Slide 11

Next, click on the Save button to include Cook’s distance in the output.

04/20/23 Slide 12

Mark the check box for Cook’s distance to include this value in the data view and the output.

Click on the Continue button to close the dialog box.

04/20/23 Slide 13

Click on the OK button to request the output.

04/20/23 Slide 14


In the table of Descriptive Statistics, we see that the number of cases that have valid data for both variables is 190.

04/20/23 Slide 15

The number of cases with valid data to analyze the relationship between "population median age in years" and "infant mortality rate" was 190, out of the total of 192 cases in the data set.

Mark the check box for a correct statement.
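The count of valid cases reflects cases with data on both variables. A minimal sketch of that counting rule follows, using hypothetical values with `None` standing in for a missing value; the function name `count_valid_pairs` is an assumption for illustration.

```python
def count_valid_pairs(x, y):
    """Count cases where both variables have a non-missing value."""
    return sum(1 for a, b in zip(x, y) if a is not None and b is not None)


# Hypothetical data: two of the five cases are missing a value on one variable.
agemdn = [18.2, 25.1, None, 40.3, 33.0]
infmort = [60.5, 22.1, 15.0, None, 7.4]
n_valid = count_valid_pairs(agemdn, infmort)
print(n_valid)
```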

04/20/23 Slide 16

The next statement asks us to examine a scatterplot to evaluate the assumption of linearity.

04/20/23 Slide 17

To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

04/20/23 Slide 18

In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.

Click on the Define button to go to the next step.

04/20/23 Slide 19

First, move the dependent variable infmort to the Y axis text box.

Second, move the independent variable agemdn to the X axis text box.

Third, click on the OK button to produce the plot.

04/20/23 Slide 20

The scatterplot appears in the SPSS output window.

To facilitate our determination about the linearity of the plot, we will add a linear fit line, a loess fit line, and a confidence interval to the plot.

See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.

04/20/23 Slide 21

The criterion we use for evaluating linearity is a comparison of the loess fit line to the linear fit line. If the loess fit line falls within a 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations over the lines of the confidence interval are ignored.

The pattern of points in the scatterplot shows an obvious curve indicating non-linearity.

The assumption of linearity is not satisfied.

04/20/23 Slide 22

The pattern of points in the scatterplot showed an obvious curve indicating non-linearity. The assumption of linearity is not satisfied.

The check box is not marked.

We can try re-expressing one or both variables to see if we can improve the linearity of the relationship sufficiently to justify the use of linear regression to analyze the relationship.

04/20/23 Slide 23

The next statement asks us to examine the residual plot for evidence that the assumptions of linearity or homogeneity of variance are violated.

We will not examine the residual plot when we have a clear violation of the assumption of linearity in the scatterplot.

We would leave this statement unchecked when we violate linearity because it is difficult to evaluate homogeneity of variance when the plot is non-linear.

However, we will examine the residual plot just for practice.

If we violate linearity, homogeneity of variance, or have influential cases, we jump ahead to the question comparing r and rho.

04/20/23 Slide 24

If we add a loess fit line to the residual plot, we see that the non-linearity in the scatterplot is supported by the residual plot.

Just for practice, we examine the residual plot for linearity and homogeneity of variance.

04/20/23 Slide 25

The violation of homogeneity of variance is also clearly visible in the residual plot. The spread of the residuals is narrow for low predicted values, but substantially larger for high predicted values.

04/20/23 Slide 26

The next statement asks us to examine the impact of influential cases on the analysis. As with outliers, we might consider eliminating influential cases to improve the fit for the majority of cases.

Since re-expression will alter the distance used to identify influential cases, we will not evaluate Cook's distances until we satisfy the assumptions of linearity and homogeneity.

While we would leave this statement unchecked when we violate linearity or homogeneity of variance, we will check for the presence of influential cases just for practice.

04/20/23 Slide 27

Summary information about Cook’s Distances is found in the table of Residual Statistics.

The maximum Cook’s distance is .150, less than the cutoff of .50 specified for the problems.

We do not have cases with very high values of Cook’s distance for the variables in this relationship.

04/20/23 Slide 28

The next statement asks us to compare Spearman’s rho to Pearson’s r to assess our expectations for the effectiveness of re-expressing the variables.

To compare rho and r, we compute the correlation coefficients in SPSS.

04/20/23 Slide 29

To compute correlations, select Correlate > Bivariate from the Analyze menu.

04/20/23 Slide 30

First, move the variables agemdn and infmort to the Variables list box.

Second, mark the check box for Spearman and leave the check box for Pearson marked.

Third, click on the OK button to produce the output.

04/20/23 Slide 31

Spearman's rho (-0.886) was larger in absolute value than Pearson's r (-0.732). The feasibility of re-expressing the data to correct for the violation of regression assumptions is supported.
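The comparison can be illustrated with a small pure-Python sketch. Spearman's rho is computed here as Pearson's r on the ranks of the data, with ties given average ranks. The data are hypothetical (a curved but strictly decreasing relationship), and the function names are assumptions for illustration rather than anything SPSS exposes.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r computed on ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical curved, strictly decreasing relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [100, 50, 25, 12, 6, 3, 1.5, 0.8]
r = pearson_r(x, y)
rho = spearman_rho(x, y)

# Compare absolute values, ignoring the sign of both coefficients.
print(abs(rho) > abs(r))
```

Because rho depends only on the ordering of the values, it stays high for a monotonic curve while r is pulled down by the curvature, which is why a larger rho hints that re-expression may help.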

04/20/23 Slide 32

Spearman's rho (-0.886) was larger in absolute value than Pearson's r (-0.732). The feasibility of re-expressing the data to correct for the violation of regression assumptions is supported.

The check box for a correct answer is marked.

04/20/23 Slide 33

The next statement asks us which transformations we used to try to induce linearity in the scatterplot.

We should re-express variables that have skewness less than -1.0 or greater than +1.0.

04/20/23 Slide 34

We will use the Descriptives procedure to obtain skewness for both variables.

Select Descriptive Statistics > Descriptives from the Analyze menu.

04/20/23 Slide 35

First, move the variables agemdn and infmort to the Variable(s) list box.

Second, click on the Options button to specify our choice for statistics.

04/20/23 Slide 36

Next, mark the check boxes for Kurtosis and Skewness in addition to the defaults marked by SPSS.

Finally, click on the Continue button to close the dialog box.

04/20/23 Slide 37

Click on the OK button to produce the output.

04/20/23 Slide 38

The skewness for "infant mortality rate" [infmort] was 1.470. The skewness for "population median age in years" [agemdn] was 0.456.

Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale. Since the skew for the independent variable "population median age in years" [agemdn] (0.456) was greater than -1.0 and less than +1.0, we do not attempt to correct violation of assumptions by re-expressing it.
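The skewness rule can be sketched as follows. The skewness function uses the adjusted sample formula, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations; that this matches the statistic SPSS reports is an assumption to verify. The function names are hypothetical, and the cutoff is treated as inclusive, following the "equal to or greater than" wording above.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness (assumed to match the reported statistic)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


def needs_reexpression(skew):
    """Apply the rule: re-express when skewness is <= -1.0 or >= +1.0."""
    return skew <= -1.0 or skew >= 1.0


# Skewness values reported in the output for this problem.
print(needs_reexpression(1.470))  # infmort
print(needs_reexpression(0.456))  # agemdn
```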

04/20/23 Slide 39

Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale.

We mark the check box for a correct statement.

04/20/23 Slide 40

The next statement asks whether or not the relationship with the re-expressed variable, LG_infmort, supports the assumption of linearity.

04/20/23 Slide 41

We first create the transformed variable, the logarithm of infmort.

Select the Compute Variable command from the Transform menu.

04/20/23 Slide 42

First, type the name for the re-expressed variable in the Target Variable text box.

The directions for the problem give us the formula for the transformation: Use the formula LG10(infmort) to create the log transformation of infant mortality rate [LG_infmort].
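For comparison outside SPSS, the same base-10 log re-expression can be sketched in Python. The values are hypothetical, and the function name `log10_reexpress` is an assumption; this sketch leaves a value missing (`None`) when the input is missing or not positive, since the logarithm is undefined there.

```python
import math


def log10_reexpress(values):
    """Return log10 of each value; None where the value is missing or not positive."""
    return [math.log10(v) if v is not None and v > 0 else None for v in values]


# Hypothetical infant mortality rates, including one missing value.
infmort = [100.0, 10.0, 2.5, None]
lg_infmort = log10_reexpress(infmort)
print(lg_infmort)
```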

Second, type the formula in the Numeric Expression text box.

Third, click on the OK button to compute the transformation.

04/20/23 Slide 43

Next, we create the scatterplot for the relationship with the re-expressed variable.

To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

04/20/23 Slide 44

In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.

Click on the Define button to go to the next step.

04/20/23 Slide 45

First, move the dependent variable LG_infmort to the Y axis text box.

Second, move the independent variable agemdn to the X axis text box.

Third, click on the OK button to produce the plot.

04/20/23 Slide 46

The scatterplot looks linear, but to make sure we will add fit lines and a confidence interval.

04/20/23 Slide 47

The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.

See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.

04/20/23 Slide 48

The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.


04/20/23 Slide 49

The next statement asks whether or not the residual plot supports the assumptions of linearity and equal variance.

04/20/23 Slide 50

We next do the regression analysis, creating the residual plot in the process.

To compute a simple linear regression, select Regression > Linear from the Analyze menu.

04/20/23 Slide 51

First, move the dependent variable, LG_infmort, to the Dependent text box.

Second, move the independent variable, agemdn, to the Independent(s) list box.

Third, click on the Statistics button to request basic descriptive statistics.

04/20/23 Slide 52

First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.

Second, click on the Continue button to close the dialog box.

04/20/23 Slide 53

Next, click on the Plots button to request the residual plot.

04/20/23 Slide 54

First, move *ZRESID (for standardized residuals) to the Y axis text box.

Second, move *ZPRED (for standardized predictions) to the X axis text box.

Third, click on the Continue button to close the dialog box.

04/20/23 Slide 55

Next, click on the Save button to include Cook’s distance in the output.

04/20/23 Slide 56

Mark the check box for Cook’s distance to include this value in the data view and the output.

Click on the Continue button to close the dialog box.

04/20/23 Slide 57

Click on the OK button to request the output.


04/20/23 Slide 58

The pattern of points in the residual plot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied, confirming the finding of linearity in the scatterplot for the data.

The pattern of points in the residual plot shows equal spread across the standardized predictions of the log transformation of "infant mortality rate" [LG_infmort]. The assumption of equal variance is satisfied.

04/20/23 Slide 59

The pattern of points in the residual plot does not show an obvious curve indicating non-linearity and shows equal spread across the standardized predictions.


04/20/23 Slide 60

The next statement asks about the presence of influential cases after the variable is re-expressed.

04/20/23 Slide 61

Summary information about Cook’s Distances is found in the table of Residual Statistics.

No cases had a Cook's distance of 0.5 or greater, so none qualified as influential cases.

04/20/23 Slide 62

Since there were no Cook’s distances greater than 0.5, we mark the check box as correct.

04/20/23 Slide 63

The next statement asks about the direction of the relationship between the variables.

04/20/23 Slide 64

The slope for the regression equation between "population median age in years" [agemdn] and the log transformation of "infant mortality rate" [LG_infmort] was -.05. The negative value for the slope means that scores for the two variables change in the opposite direction. Higher scores on the variable "population median age in years" were associated with lower scores on the log transformation of "infant mortality rate".
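Because the dependent variable is on a base-10 log scale, the slope can also be read as a multiplicative change in the original units. The sketch below is a worked illustration of that arithmetic only; it uses the rounded slope shown above, so the result is approximate.

```python
# Rounded slope for LG_infmort regressed on agemdn, as reported above.
slope = -0.05

# A one-year increase in median age changes log10(infmort) by the slope,
# which multiplies the predicted infant mortality rate by 10 ** slope.
factor = 10 ** slope
percent_change = (factor - 1) * 100

print(round(factor, 3), round(percent_change, 1))
```

Under these assumptions, each additional year of median age is associated with a predicted infant mortality rate of roughly 0.89 times its previous value, a decrease of about 11 percent.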

04/20/23 Slide 65

The negative slope implies an inverse relationship in which increases in one variable are associated with lower scores on the other variable. The statement that "countries who had a higher median age in years had a lower infant mortality rate" is correct.

We mark the statement as correct.

04/20/23 Slide 66

The next statement asks about the strength of the relationship based on Tukey’s criteria.

04/20/23 Slide 67

To answer the question about the strength of the relationship, we look to the Model Summary table.

The rule of thumb attributed to Tukey characterizes an R² between 0.0 and 0.04 as very weak; 0.04 to 0.16 as weak; 0.16 to 0.36 as moderate; 0.36 to 0.64 as strong; and greater than 0.64 as very strong. Using this rule, the relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was correctly characterized as a very strong relationship (R² = 78.2%).

04/20/23 Slide 68

The relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was correctly characterized as a very strong relationship (R² = 78.2%).


04/20/23 Slide 69

The final statement asks about the strength of the relationship based on Cohen’s criteria.

04/20/23 Slide 70

Applying Cohen's criteria for effect size (less than 0.01 = trivial; 0.01 up to 0.09 = weak or small; 0.09 up to 0.25 = moderate; 0.25 or greater = strong or large), the relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was incorrectly characterized as a moderate relationship.

The relationship should have been characterized as a strong relationship (R² = 78.2%).
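Both rules of thumb used in these problems (the Tukey thresholds quoted on slide 67 and the Cohen thresholds above) amount to looking up R² in a table of cutoffs. A minimal sketch follows; the names `TUKEY`, `COHEN`, and `label_strength` are hypothetical, and a value exactly equal to a cutoff is assigned to the higher category, which is an assumption where the slides leave the boundary ambiguous.

```python
from bisect import bisect_right

# (cutoffs, labels): labels[i] applies to values below cutoffs[i];
# the final label applies at or above the last cutoff.
TUKEY = ([0.04, 0.16, 0.36, 0.64],
         ["very weak", "weak", "moderate", "strong", "very strong"])
COHEN = ([0.01, 0.09, 0.25],
         ["trivial", "weak", "moderate", "strong"])


def label_strength(r_squared, scale):
    """Return the verbal label for r_squared under the given scale."""
    cutoffs, labels = scale
    return labels[bisect_right(cutoffs, r_squared)]


r_squared = 0.782  # R² reported in the Model Summary for this problem
print(label_strength(r_squared, TUKEY))
print(label_strength(r_squared, COHEN))
```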

04/20/23 Slide 71

The relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was incorrectly characterized as a moderate relationship.

The check box for a correct statement is not marked.

04/20/23 Slide 72

Because of the large number of steps needed to solve this problem, we will outline the process.

Dependent variable and independent variable both quantitative?
• Yes
• No: Stop. Remaining statements are not marked.

There is no explicit question about level of measurement, but it should always be something we consider.

Sample size stated correctly?
• Yes: Mark statement check box.
• No: Do not mark check box.

04/20/23 Slide 73

Scatterplot supports linearity assumption? (Yes / No)

If we don’t satisfy linearity, we go to the comparison of r and rho.

Residual plot supports linearity/homogeneity? (Yes / No)

If we don’t satisfy homogeneity, we go to the comparison of r and rho.

04/20/23 Slide 74

Cook’s distance < 0.5 for all cases? (Yes / No)

When we satisfy the linear conditions, we bypass the questions on re-expression.

Spearman’s rho > Pearson’s r? (Yes / No)

We use the comparison of r and rho to frame our expectations of re-expressing variables. Note: we compare absolute values, ignoring the sign of both r and rho.

04/20/23 Slide 75

Re-express variables with skew ≤ -1.0 or ≥ 1.0.

We re-express one or both variables, and re-test for a linear relationship.

Scatterplot supports linearity assumption?
• Yes
• No: Do not mark check box.

Residual plot supports linearity/homogeneity?
• Yes
• No: Do not mark check box.

04/20/23 Slide 76

Cook’s distance < 0.5 for all cases?
• Yes
• No: Do not mark check box. Stop. Linear model is not appropriate.

If we support a linear model with either the raw data or the re-expressed data, we interpret direction and strength.

Direction of relationship (b) interpreted correctly? (Yes / No)

This is the strategy for our homework problems. In reality, we might choose to interpret the relationship even though assumptions were violated.

04/20/23 Slide 77

Strength of relationship (R²) interpreted correctly? (Yes / No)


