arline-gray
04/20/23 Slide 1
SOLVING THE PROBLEM
Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the data satisfies the assumption of linearity in a scatterplot of the raw data, provided the spread of the residuals is equal for all of the predicted values in the residual plot, and provided there are no outliers impacting the linear model. When the relationship we are analyzing does not meet these criteria, the use of regression analysis can still be justified if re-expressing one or both variables reduces the non-linear pattern in the scatterplot, equalizes the variance in the residual plot, and reduces the distance of outliers from the other cases in the distributions.
Clues that re-expression might be effective in linearizing the relationship are severe skewing of one or both variables (skewness outside the range from -1.0 to +1.0) and a Spearman's rho that is greater than Pearson's r.
There is no guarantee that re-expression will produce a scatterplot that satisfies the assumptions of linear regression. When it does not, we are left with the choice of determining that the violations are not of serious consequence, or choosing an alternative strategy for modeling the relationship.
04/20/23 Slide 2
To solve these problems, we will first assess the conformity of the relationship to regression assumptions. Second, we will examine the criteria that suggest that re-expression might be effective. Third, we will examine the model using re-expressed variables to assess conformity to regression assumptions.
Finally, if the model using raw data supports the regression assumptions, we will interpret the direction and strength of the relationship. If it was necessary to re-express one or both variables, we will interpret the direction and strength of the re-expressed model, provided it satisfies the regression assumptions.
If the model still violates the conditions for a linear model, we will not interpret the direction and strength of the relationship. This is the convention in our homework problems; in a real application, I would consider interpreting the relationship, attaching a caution which identifies the violation of the assumptions.
In these problems, outliers are defined as cases that have a Cook’s distance greater than 0.5, and hence have a larger influence on the regression solution than other cases included in the analysis.
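SPSS computes Cook’s distance for us, but it can help to see what goes into the number. The following is a minimal sketch in Python, assuming the commonly given formula for simple linear regression (squared residual, leverage, and mean squared error). The function name and the small data set are hypothetical and are not taken from world2007.sav.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # number of estimated coefficients: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        # leverage of a case in simple linear regression
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: the last case lies far from the others on x and off their trend.
x = [1, 2, 3, 4, 5, 6, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 20.0]

d = cooks_distances(x, y)
# Apply the cutoff used in these problems: Cook's distance greater than 0.5.
influential = [i for i, di in enumerate(d) if di > 0.5]
print(influential)
```

With this made-up data, only the last case should exceed the 0.5 cutoff, because it combines high leverage with a non-zero residual.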
04/20/23 Slide 3
The introductory statement in the question indicates:
• The data set to use (world2007.sav)
• The task to accomplish (simple linear regression)
• The variables to use in the analysis: the independent variable population median age in years [agemdn] and the dependent variable infant mortality rate [infmort]
04/20/23 Slide 4
The second paragraph tells us how to re-express the variables, should it be necessary.
04/20/23 Slide 5
The first statement asks about the size of the sample. To answer this question, we run the linear regression in SPSS.
04/20/23 Slide 6
To compute a simple linear regression, select Regression > Linear from the Analyze menu.
04/20/23 Slide 7
First, move the dependent variable, infmort, to the Dependent text box.
Second, move the independent variable, agemdn, to the Independent(s) list box.
Third, click on the Statistics button to request basic descriptive statistics.
04/20/23 Slide 8
First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.
Second, click on the Continue button to close the dialog box.
04/20/23 Slide 9
Next, click on the Plots button to request the residual plot.
Though we do not need it to answer the immediate question, producing the residual plot now will save us time later in the problem.
04/20/23 Slide 10
First, move *ZRESID (for standardized residuals) to the Y axis text box.
Second, move *ZPRED (for standardized predictions) to the X axis text box.
Third, click on the Continue button to close the dialog box.
04/20/23 Slide 11
Next, click on the Save button to include Cook’s distance in the output.
04/20/23 Slide 12
First, mark the check box for Cook’s distance to include this value in the data view and the output.
Second, click on the Continue button to close the dialog box.
04/20/23 Slide 13
Click on the OK button to request the output.
04/20/23 Slide 14
In the table of Descriptive Statistics, we see that the number of cases that have valid data for both variables is 190.
04/20/23 Slide 15
The number of cases with valid data to analyze the relationship between "population median age in years" and "infant mortality rate" was 190, out of the total of 192 cases in the data set.
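The count reported by SPSS includes only cases with valid data on both variables. A minimal sketch of that counting rule in Python follows. The values are hypothetical placeholders (not from world2007.sav), and `None` stands in for a missing value.

```python
def count_valid_pairs(x_values, y_values):
    """Count cases that have a non-missing value on both variables."""
    return sum(
        1
        for x, y in zip(x_values, y_values)
        if x is not None and y is not None
    )


# Hypothetical values; None marks missing data.
agemdn = [18.2, None, 39.5, 27.1, 22.0]
infmort = [85.0, 12.3, 4.1, None, 30.2]

n_valid = count_valid_pairs(agemdn, infmort)
print(n_valid, "of", len(agemdn), "cases have valid data on both variables")
```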
Mark the check box for a correct statement.
04/20/23 Slide 16
The next statement asks us to examine a scatterplot to evaluate the assumption of linearity.
04/20/23 Slide 17
To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.
04/20/23 Slide 18
In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.
Click on the Define button to go to the next step.
04/20/23 Slide 19
First, move the dependent variable infmort to the Y axis text box.
Second, move the independent variable agemdn to the X axis text box.
Third, click on the OK button to produce the plot.
04/20/23 Slide 20
The scatterplot appears in the SPSS output window.
To facilitate our determination about the linearity of the plot, we will add a linear fit line, a loess fit line, and a confidence interval to the plot.
See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.
04/20/23 Slide 21
The criterion we use for evaluating linearity is a comparison of the loess fit line to the linear fit line. If the loess fit line falls within a 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations over the lines of the confidence interval are ignored.
The pattern of points in the scatterplot shows an obvious curve indicating non-linearity.
The assumption of linearity is not satisfied.
04/20/23 Slide 22
The pattern of points in the scatterplot showed an obvious curve indicating non-linearity. The assumption of linearity is not satisfied.
The check box is not marked.
We can try re-expressing one or both variables to see if we can improve the linearity of the relationship sufficiently to justify the use of linear regression to analyze the relationship.
04/20/23 Slide 23
The next statement asks us to examine the residual plot for evidence that the assumptions of linearity or homogeneity of variance are violated.
We will not examine the residual plot when we have a clear violation of the assumption of linearity in the scatterplot.
We would leave this statement unchecked when we violate linearity because it is difficult to evaluate homogeneity of variance when the plot is non-linear.
However, we will examine the residual plot just for practice.
If we violate linearity, homogeneity of variance, or have influential cases, we jump ahead to the question comparing r and rho.
04/20/23 Slide 24
Just for practice, we examine the residual plot for linearity and homogeneity of variance.
If we add a loess fit line to the residual plot, we see that the non-linearity in the scatterplot is supported by the residual plot.
04/20/23 Slide 25
The violation of homogeneity of variance is also clearly visible in the residual plot. The spread of the residuals is narrow for low predicted values, but substantially larger for high predicted values.
04/20/23 Slide 26
The next statement asks us to examine the impact of influential cases on the analysis. As with outliers, we might consider eliminating influential cases to improve the fit for the majority of cases.
Since re-expression will alter the distance used to identify influential cases, we will not evaluate Cook's distances until we satisfy the assumptions of linearity and homogeneity.
While we would leave this statement unchecked when we violate linearity or homogeneity of variance, we will check for the presence of influential cases just for practice.
04/20/23 Slide 27
Summary information about Cook’s Distances is found in the table of Residual Statistics.
The maximum Cook’s distance is .150, less than the cutoff of .50 specified for the problems.
We do not have cases with very high values of Cook’s distance for the variables in this relationship.
04/20/23 Slide 28
The next statement asks us to compare Spearman’s rho to Pearson’s r to assess our expectations for the effectiveness of re-expressing the variables.
To compare rho and r, we compute the correlation coefficients in SPSS.
04/20/23 Slide 29
To compute correlations, select Correlate > Bivariate from the Analyze menu.
04/20/23 Slide 30
First, move the variables agemdn and infmort to the Variables list box.
Second, mark the check box for Spearman and leave the check box for Pearson marked.
Third, click on the OK button to produce the output.
04/20/23 Slide 31
Spearman's rho (-0.886) was larger in absolute value than Pearson's r (-0.732). The feasibility of re-expressing the data to correct for the violation of regression assumptions is supported.
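The comparison is made on absolute values, ignoring the signs of r and rho. As a minimal sketch of what the two coefficients measure, the Python below computes Pearson's r directly and Spearman's rho as Pearson's r on average ranks. The function names and the curved example data are hypothetical; the last lines apply the rule to the values reported above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as Pearson's r on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def re_expression_supported(r, rho):
    """Rule used in these problems: |rho| greater than |r|."""
    return abs(rho) > abs(r)


# Hypothetical curved but monotonic data: rho is -1, r is weaker.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [100 / xi for xi in x]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(round(r, 3), round(rho, 3), re_expression_supported(r, rho))

# Values reported in the SPSS output for agemdn and infmort.
print(re_expression_supported(-0.732, -0.886))
```

A curved but steadily decreasing relationship keeps the rank order perfectly, so rho stays strong while r is pulled toward zero, which is why a larger |rho| hints that re-expression may straighten the plot.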
04/20/23 Slide 32
Spearman's rho (-0.886) was larger in absolute value than Pearson's r (-0.732). The feasibility of re-expressing the data to correct for the violation of regression assumptions is supported.
The check box for a correct answer is marked.
04/20/23 Slide 33
The next statement asks us which transformations we used to try to induce linearity in the scatterplot.
We should re-express variables that have skewness less than -1.0 or greater than +1.0.
04/20/23 Slide 34
We will use the Descriptives procedure to obtain skewness for both variables.
Select Descriptive Statistics > Descriptives from the Analyze menu.
04/20/23 Slide 35
First, move the variables agemdn and infmort to the Variable(s) list box.
Second, click on the Options button to specify our choice for statistics.
04/20/23 Slide 36
Next, mark the check boxes for Kurtosis and Skewness in addition to the defaults marked by SPSS.
Finally, click on the Continue button to close the dialog box.
04/20/23 Slide 37
Click on the OK button to produce the output.
04/20/23 Slide 38
The skewness for "infant mortality rate" [infmort] was 1.470. The skewness for "population median age in years" [agemdn] was 0.456.
Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale. Since the skew for the independent variable "population median age in years" [agemdn] (0.456) was greater than -1.0 and less than +1.0, we do not attempt to correct violation of assumptions by re-expressing it.
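The skewness rule can be sketched in Python as follows. This assumes the adjusted sample skewness formula, which to my knowledge is the one statistical packages such as SPSS report, and it treats values exactly at -1.0 or +1.0 as needing re-expression. The function names and the small data sets are hypothetical.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness (assumed to match the statistic SPSS reports)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    cubed = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


def needs_re_expression(skew):
    """Rule used in these problems: skew at or beyond -1.0 or +1.0."""
    return skew <= -1.0 or skew >= 1.0


# Values reported in the SPSS Descriptives output.
print(needs_re_expression(1.470))  # infmort
print(needs_re_expression(0.456))  # agemdn

# Hypothetical data: one symmetric set and one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]
print(sample_skewness(symmetric), sample_skewness(right_skewed))
```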
04/20/23 Slide 39
Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale.
We mark the check box for a correct statement.
04/20/23 Slide 40
The next statement asks whether or not the relationship with the re-expressed variable, LG_infmort, supports the assumption of linearity.
04/20/23 Slide 41
We first create the transformed variable, the logarithm of infmort.
Select the Compute Variable command from the Transform menu.
04/20/23 Slide 42
First, type the name for the re-expressed variable in the Target Variable text box.
The directions for the problem give us the formula for the transformation: Use the formula LG10(infmort) to create the log transformation of infant mortality rate [LG_infmort].
Second, type the formula in the Numeric Expression text box.
Third, click on the OK button to compute the transformation.
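For readers working outside SPSS, the same base-10 log transformation can be sketched in Python. The helper name `lg10` and the data values are hypothetical. Returning `None` for missing or non-positive values is an assumption meant to mimic a missing result, since the logarithm is undefined there.

```python
import math


def lg10(value):
    """Base-10 logarithm; None when the value is missing or not positive."""
    if value is None or value <= 0:
        return None
    return math.log10(value)


# Hypothetical infant mortality rates; None marks a missing value.
infmort = [100.0, 10.0, 4.5, None]
lg_infmort = [lg10(v) for v in infmort]
print(lg_infmort)
```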
04/20/23 Slide 43
Next, we create the scatterplot for the relationship with the re-expressed variable.
To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.
04/20/23 Slide 44
In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.
Click on the Define button to go to the next step.
04/20/23 Slide 45
First, move the dependent variable LG_infmort to the Y axis text box.
Second, move the independent variable agemdn to the X axis text box.
Third, click on the OK button to produce the plot.
04/20/23 Slide 46
The scatterplot looks linear, but to make sure we will add fit lines and a confidence interval.
04/20/23 Slide 47
The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.
See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.
04/20/23 Slide 48
The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.
The check box for a correct answer is marked.
04/20/23 Slide 49
The next statement asks whether or not the residual plot supports the assumptions of linearity and equal variance.
04/20/23 Slide 50
We next do the regression analysis, creating the residual plot in the process.
To compute a simple linear regression, select Regression > Linear from the Analyze menu.
04/20/23 Slide 51
First, move the dependent variable, LG_infmort, to the Dependent text box.
Second, move the independent variable, agemdn, to the Independent(s) list box.
Third, click on the Statistics button to request basic descriptive statistics.
04/20/23 Slide 52
First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.
Second, click on the Continue button to close the dialog box.
04/20/23 Slide 53
Next, click on the Plots button to request the residual plot.
04/20/23 Slide 54
First, move *ZRESID (for standardized residuals) to the Y axis text box.
Second, move *ZPRED (for standardized predictions) to the X axis text box.
Third, click on the Continue button to close the dialog box.
04/20/23 Slide 55
Next, click on the Save button to include Cook’s distance in the output.
04/20/23 Slide 56
First, mark the check box for Cook’s distance to include this value in the data view and the output.
Second, click on the Continue button to close the dialog box.
04/20/23 Slide 57
Click on the OK button to request the output.
04/20/23 Slide 58
The pattern of points in the residual plot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied, confirming the finding of linearity in the scatterplot for the data.
The pattern of points in the residual plot shows equal spread across the standardized predictions of the log transformation of "infant mortality rate" [LG_infmort]. The assumption of equal variance is satisfied.
04/20/23 Slide 59
The pattern of points in the residual plot does not show an obvious curve indicating non-linearity and shows equal spread across the standardized predictions.
The check box for a correct answer is marked.
04/20/23 Slide 60
The next statement asks about the presence of influential cases after the variable is re-expressed.
04/20/23 Slide 61
Summary information about Cook’s Distances is found in the table of Residual Statistics.
No cases had a Cook's distance greater than 0.5, so none qualify as influential cases.
04/20/23 Slide 62
Since there were no Cook’s distances greater than 0.5, we mark the check box as correct.
04/20/23 Slide 63
The next statement asks about the direction of the relationship between the variables.
04/20/23 Slide 64
The slope for the regression equation between "population median age in years" [agemdn] and the log transformation of "infant mortality rate" [LG_infmort] was -.05. The negative value for the slope means that scores for the two variables change in the opposite direction. Higher scores on the variable "population median age in years" were associated with lower scores on the log transformation of "infant mortality rate".
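Because the dependent variable is on a log10 scale, the slope can also be read as a multiplicative change in the original units. A minimal sketch of that arithmetic in Python, assuming the rounded slope of -.05 reported above (the unrounded SPSS coefficient would give a slightly different figure):

```python
# Slope of LG_infmort on agemdn, as rounded in the SPSS output (assumed exact here).
slope = -0.05

# A one-year increase in median age adds `slope` to log10(infmort),
# which multiplies infmort itself by 10 ** slope.
multiplier = 10 ** slope
percent_change = (multiplier - 1) * 100

print(round(multiplier, 3), round(percent_change, 1))
```

Under this assumption, each additional year of median age is associated with an infant mortality rate roughly 11% lower, which is consistent with the negative direction described above.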
04/20/23 Slide 65
The negative slope implies an inverse relationship in which increases in one variable are associated with lower scores on the other variable. The statement that "countries who had a higher median age in years had a lower infant mortality rate" is correct.
We mark the statement as correct.
04/20/23 Slide 66
The next statement asks about the strength of the relationship based on Tukey’s criteria.
04/20/23 Slide 67
To answer the question about the strength of the relationship, we look to the Model Summary table.
The rule of thumb attributed to Tukey characterizes an R² between 0.0 and 0.04 as very weak; 0.04 to 0.16 as weak; 0.16 to 0.36 as moderate; 0.36 to 0.64 as strong; and greater than 0.64 as very strong. Using this rule, the relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was correctly characterized as a very strong relationship (R² = 78.2%).
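The rule of thumb attributed to Tukey can be written as a small lookup. This is only a sketch in Python: the function name is hypothetical, and because the slides do not say which category a value exactly on a lower boundary (0.04, 0.16, 0.36) belongs to, the code assumes it falls in the higher category. A value of exactly 0.64 is treated as strong, since very strong is stated as greater than 0.64.

```python
def tukey_strength(r_squared):
    """Label the strength of a relationship from R² (as a proportion)."""
    if r_squared < 0.04:
        return "very weak"
    if r_squared < 0.16:
        return "weak"
    if r_squared < 0.36:
        return "moderate"
    if r_squared <= 0.64:
        return "strong"
    return "very strong"


# R² from the Model Summary table, 78.2%, written as a proportion.
print(tukey_strength(0.782))
```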
04/20/23 Slide 68
The relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was correctly characterized as a very strong relationship (R² = 78.2%).
The check box for a correct answer is marked.
04/20/23 Slide 69
The final statement asks about the strength of the relationship based on Cohen’s criteria.
04/20/23 Slide 70
Applying Cohen's criteria for effect size (less than 0.01 = trivial; 0.01 up to 0.09 = weak or small; 0.09 up to 0.25 = moderate; 0.25 or greater = strong or large), the relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was incorrectly characterized as a moderate relationship.
The relationship should have been characterized as a strong relationship (R² = 78.2%).
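Cohen's criteria, as listed above, can be sketched the same way. The function name is hypothetical, and "up to" is read here as excluding the upper bound, which is how the stated ranges fit together without overlap.

```python
def cohen_strength(r_squared):
    """Label effect size from R² (as a proportion) using the cutoffs listed above."""
    if r_squared < 0.01:
        return "trivial"
    if r_squared < 0.09:
        return "weak"
    if r_squared < 0.25:
        return "moderate"
    return "strong"


# R² from the Model Summary table, 78.2%, written as a proportion.
label = cohen_strength(0.782)
print(label, "so a 'moderate' characterization would be incorrect:", label != "moderate")
```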
04/20/23 Slide 71
The relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was incorrectly characterized as a moderate relationship.
The check box for a correct statement is not marked.
04/20/23 Slide 72
Because of the large number of steps needed to solve this problem, we will outline the process.
Dependent variable and independent variable both quantitative?
• No: Stop. Remaining statements are not marked.
• Yes: Go on to the question about sample size.
There is no explicit question about level of measurement, but it should always be something we consider.
Sample size stated correctly?
• Yes: Mark statement check box.
• No: Do not mark check box.
04/20/23 Slide 73
Scatterplot supports linearity assumption?
• Yes: Mark statement check box.
• No: Do not mark check box. If we don’t satisfy linearity, we go to the comparison of r and rho.
Residual plot supports linearity/homogeneity?
• Yes: Mark statement check box.
• No: Do not mark check box. If we don’t satisfy homogeneity, we go to the comparison of r and rho.
04/20/23 Slide 74
Cook’s distance < 0.5 for all cases?
• Yes: Mark statement check box.
• No: Do not mark check box.
When we satisfy the linear conditions, we bypass the questions on re-expression.
Spearman’s rho > Pearson’s r?
• Yes: Mark statement check box.
• No: Do not mark check box.
We use the comparison of r and rho to frame our expectations of re-expressing variables. Note: we compare absolute values, ignoring the sign of both r and rho.
04/20/23 Slide 75
We re-express one or both variables, and re-test for a linear relationship.
Re-express variables with skew ≤ -1.0 or ≥ 1.0
Scatterplot supports linearity assumption?
• Yes: Mark statement check box.
• No: Do not mark check box.
Residual plot supports linearity/homogeneity?
• Yes: Mark statement check box.
• No: Do not mark check box.
04/20/23 Slide 76
Cook’s distance < 0.5 for all cases?
• Yes: Mark statement check box.
• No: Do not mark check box.
If the re-expressed model still violates the assumptions: Stop. Linear model is not appropriate.
This is the strategy for our homework problems. In reality, we might choose to interpret the relationship even though assumptions were violated.
If we support a linear model with either the raw data or the re-expressed data, we interpret direction and strength.
Direction of relationship (b) interpreted correctly?
• Yes: Mark statement check box.
• No: Do not mark check box.
04/20/23 Slide 77
Strength of relationship (R²) interpreted correctly?
• Yes: Mark statement check box.
• No: Do not mark check box.