
Intermediate SPSS (2): Hypothesis Testing and Inferential Statistics

Tutorial Goal: Continuation of Intermediate SPSS (1). Building and testing hypotheses using inferential statistics in SPSS. This workshop covers parametric and nonparametric data, concentrating on ANOVAs, linear regression, and logistic regression. Participants learn how to understand, analyze, and report results.

Ok, let's review a little from our last workshop. In statistics, there are four levels of measurement for a variable, and all your analyses follow from what level your variable is. They are NOIR: (N)ominal, (O)rdinal, (I)nterval, (R)atio. Let's talk about each one.

Nominal means that the number simply represents a category of objects. There is no measured difference among the objects or people. Some examples are giving states numbers (N.Y. 1, Connecticut 2, R.I. 3), assigning a number for gender (male 1, female 2), or designating college major (History 1, Business 2, Sociology 3). You are just assigning a number to something.

Ordinal means the larger number for the object is truly larger in some amount. This typically means rank. Some examples are 1st, 2nd, and 3rd places in a contest, or preferences for different movies. However, there is no exactly measured difference among the objects. We don't know definitively how much larger or better 1st is compared to 2nd; we just know 1st is somehow larger than 2nd.

Interval means, like ordinal, that there is a rank for the objects or people, but there is also a measured distance between ranks. Some examples are degrees Celsius or Fahrenheit. We know that the difference between 98 and 99 degrees corresponds to a fixed amount of mercury in a thermometer, and the difference between 42 and 43 degrees is the same amount as between 98 and 99. However, there is no true zero, which would stand for a complete lack of the thing being measured. Zero degrees does not mean there is no mercury, for example.

Ratio means, like interval, that there is a measured distance between ranks, but there is also a true zero. A true zero means a complete lack of the quality being measured. An example is income, where the difference between $10,000 and $11,000 is known and zero means a complete lack of income.

These levels are very important, and we will be discussing them more as we go on. Nominal and ordinal are called nonparametric data, and interval and ratio are called parametric data. The statistical analyses that you can use depend on what level your data are. Specifically, if you can compute a meaningful mean from your data, then you can treat the data as parametric.

Page 2: SPSS 2 Hypothesis Testing and Inferential Statisticshusam.khadash.com/students page_files/articles/test variabes-spss2.pdf · 1 Intermediate SPSS (2) Hypothesis Testing and Inferential

There are some assumptions about data that you should be aware of. These also affect which test you choose.

Nonparametric
• Nominal/Ordinal data
• Random sampling

Parametric
• Interval/Ratio data
• Random sampling
• Normal distribution
• Equal variances of the scores in the populations that the samples come from

Since parametric data carry more assumptions, parametric tests are considered more powerful when those assumptions are met. Powerful means that these tests are better at picking up differences in variables in the population. They are also fairly robust to violations of the assumptions, so if the assumptions are not completely met, you can still get accurate results. The only assumption that's nonnegotiable is the level of measurement. When you choose a statistical analysis, you need to do two things: state your hypotheses and decide on a significance level. We are now going to talk about each one before we go on to the tests. 1. Making hypotheses is an essential part of every test. These hypotheses always deal with how the numbers of your sample relate to the numbers of the population. First, state a null hypothesis and then an alternative hypothesis.

Null Hypothesis (H0) states that the numbers of your sample do not differ significantly from the numbers of the population. For example, you walk into any old restaurant and give an IQ test to 30 customers. H0 says that the mean of their IQs should not differ significantly from the mean IQ of the population. Alternative Hypothesis (HA) states that the numbers of your sample do differ significantly from the numbers of the population. For example, we heard that the restaurant has intelligence-boosting spices in the food, so our HA is that the sample of 30 people from the restaurant has a mean IQ of 130, which is much higher than the population's mean IQ.


2. Significance shows us the likelihood that a particular result is due to chance. Remember back to our normal distribution and IQs. What are the chances that you randomly go to a restaurant, give a group of people an IQ test, and get a mean of 130? Pretty slim: roughly 98% of people have IQs below 130. That's the concept behind significance. We are seeing what the likelihood is of getting a certain result.

p stands for significance. So, if you see a result reported as p < .05, it means that the likelihood that the result is due to chance is less than .05. When doing research, you decide how much likelihood of chance you can tolerate. By convention, significance is set at .05, but it can be set lower, at .01 or less, depending on how strict you want to be. One-tailed or two-tailed refers to where you put this likelihood of chance on your distribution curve. The likelihood of chance is also called alpha.


• One-tailed tests are used if you have a directional hypothesis. You put the whole .05 of chance in the direction of your alternative hypothesis. So, if you hypothesize that your sample mean of 130 will be above the population mean of 100, you put the entire .05 above the mean.

• Two-tailed tests are used when you are not certain in which direction your alternative hypothesis goes. So, if you hypothesize that a sample mean is somehow different from the population's mean, in either a positive or negative direction, then you split the alpha into two parts of .025 and place them at either end of the normal distribution.

[Figures: in a one-tailed test, all .05 due to chance sits in one tail of the distribution; in a two-tailed test, .025 due to chance sits in each tail.]

After you have performed a test, you verbalize the result in a sentence. You also usually report five things: the test result, degrees of freedom (df), sample size, significance, and whether the test was one- or two-tailed.

1. Test Result: Each test has its own mathematical equation. For our purposes in SPSS, we do not need to know the exact mechanics of each equation. We will just discuss the big picture of each test and roughly what it's doing. Basically, for the analyses here, the higher the result, the better our chances of reaching significance and rejecting the H0. However, when reporting, you need to report the result of the equation. This will be pointed out in each of our tests.

2. Degrees of Freedom (df): The df is the number of values that are free to vary, which is the number of observations minus the number of constraints. This point is very technical and really doesn't affect your research; you just need to report it, and only for chi-square, t-tests, correlations, and ANOVAs.

3. Sample Size: The number of cases in your sample.
4. One- or Two-Tailed: Where you put your chance of randomness (only with parametric tests).


5. Significance: You need to report the level of significance that your result reached. APA also suggests reporting effect size; see the Publication Manual of the American Psychological Association for further information.

Of these five things, the test result and significance are the most important. Basically, you need a test result high enough to reach significance. For example, if I were doing a chi-square with 2 df, I would need a test result of at least 5.991 (the critical value) to reach significance. If you reach significance, you can reject the H0 and accept the HA (you always talk in terms of accepting or rejecting hypotheses). Don't worry, though: SPSS does all the math. You only need to understand and report the results.

df\area   .050       .025       .010       .005
1         3.84146    5.02389    6.63490    7.87944
2         5.99146    7.37776    9.21034    10.59663
3         7.81473    9.34840    11.34487   12.83816
4         9.48773    11.14329   13.27670   14.86026
5         11.07050   12.83250   15.08627   16.74960

The value at 2 df under .050, 5.99146, is the minimum critical value to reach significance at p = .05 with 2 df.
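If you ever want to generate these critical values yourself, SPSS's inverse distribution functions can do it. A minimal syntax sketch, assuming you paste it into a new Syntax window (IDF.CHISQ(p, df) returns the chi-square value with cumulative probability p, so .95 gives the .05 critical value):

* Reproduce the first two columns of the critical-value table above.
DATA LIST FREE / df.
BEGIN DATA
1 2 3 4 5
END DATA.
COMPUTE crit_050 = IDF.CHISQ(.95, df).
COMPUTE crit_025 = IDF.CHISQ(.975, df).
EXECUTE.
LIST.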

So, with our restaurant and IQ example, the result would be reported as: "The mean IQ of 130.76 for the 30 eaters at the restaurant was significantly higher than the national average IQ, t(29) = 20.650, p < .001, one-tailed."
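If you want to double-check a reported p value, the cumulative distribution functions work the same way. A sketch for the t result above (CDF.T gives the area below a t value, so one minus it is the one-tailed p):

* One-tailed p for t = 20.650 with df = 29.
DATA LIST FREE / t df.
BEGIN DATA
20.650 29
END DATA.
COMPUTE p_one = 1 - CDF.T(t, df).
EXECUTE.
LIST.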

In this lesson, you will be introduced to four major statistical analyses: ANOVA, simple and multiple linear regression, and logistic regression.

1. ANOVA

Our first statistical analysis is the Analysis of Variance (ANOVA). Basically, it's multiple t-tests: instead of doing a t-test for every pair of groups you have, you can do a quick ANOVA and compare the means of all the groups with each other.


Please open the Anova file.

This is again data from the GSS. Let's do some sociology/political science research. Here we have two variables: livecom1, the number of years a person has lived in a community, and partyid, the person's political party affiliation (0 strong Democrat; 1 not strong Democrat; 2 independent, near Democrat; 3 independent; 4 independent, near Republican; 5 not strong Republican; 6 strong Republican). So, we want to form 7 groups according to how each person views him/herself politically and compare the mean number of years living in a community for each group. We are going to do a One-Way ANOVA: one independent variable, partyid, which is categorical data, and one dependent variable, livecom1, which is parametric data. Remember, categorical data is nominal or ordinal, so you can break the data up into different groups. Also, there is a nonparametric equivalent of the ANOVA, the Kruskal-Wallis test.

You can also do Two-Way (two independent variables) or higher ANOVAs. In addition, there are two ways we deal with our participants or subjects. If each participant experiences only one level, for example one of several drug doses, the ANOVA is called between-subjects. If each participant experiences two or more levels, the ANOVA is called within-subjects. Use the statistics references at the end of the tutorial to learn more.

Let's establish our hypotheses: H0: There is no difference in the number of years a person has lived in a community according to the person's political identity.


HA: People who identify as Democrats have lived longer in their neighborhoods.

1. In the Analyze menu, select Compare Means and left-click One-Way ANOVA.

2. In the One-Way ANOVA dialog, 1) select HOW LONG HAVE YOU LIVED IN THE COMMUNITY? [livecom1], click the arrow, and move it into the Dependent List. 2) Select POLITICAL PARTY AFFILIATION [partyid], click the arrow, and move it into the Factor field. Now, we need to select a few more functions. A simple ANOVA just says that the means of the groups differ somewhere; it does not say how they differ. For that, we need to do post hoc tests. 3) Click on Post Hoc.


3. In the One-Way ANOVA: Post Hoc Multiple Comparisons dialog, select Bonferroni, which is the most widely used and controls for Type I error. Click Continue.

Type I error, or alpha inflation: when you do an ANOVA, you run the risk of inflating your alpha and falsely rejecting the null hypothesis. Bonferroni controls for that.

4. Back in One-Way ANOVA, 1) click on Options. This brings up the One-Way ANOVA: Options dialog. Let's add some nice graphics. 2) Select Means plot. 3) Select Descriptive to give us the means of the groups. 4) Click Continue. 5) Back in the One-Way ANOVA dialog, click OK to run the ANOVA.
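The same analysis can also be run from a Syntax window. A minimal sketch using this file's variable names (livecom1, partyid); the subcommands mirror the dialog choices above:

* One-Way ANOVA of years in community by party affiliation.
* Requests descriptives, a means plot, and Bonferroni post hocs.
ONEWAY livecom1 BY partyid
  /STATISTICS DESCRIPTIVES
  /PLOT MEANS
  /POSTHOC=BONFERRONI ALPHA(0.05).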


Now let’s look at our results. First, we have our Descriptives. If you look at the Mean, you can see that the mean number of years is smallest for Independents and highest for Strong Democrats and Strong Republicans.


In the Multiple Comparisons table, you have the comparisons of the groups. You can see that the comparison of Strong Democrats with Independents is significant. The mean difference is a positive 6.116, so the Strong Democrat group lived more than 6 years longer in their neighborhoods than Independents. If you check Strong Republicans against Independents, you see a significant mean difference of 5.808.

In the ANOVA table: A. The df for your result. B. The F value, which is the test result for the ANOVA. C. The Sig. of .004, meaning there is at least one significant difference among the means. We need to check the Post Hoc Tests to see exactly which means differ.

The Means plot is a nice visual of what's happening. The mean number of years for Strong Democrats and Strong Republicans is noticeably higher than for Independents. Remember, the differences among the other groups were not significant.

So, let's report our results: (F with the between-groups and within-groups df in subscript from the ANOVA table = result, p (<=) number from the ANOVA table.) There is a significant difference in length of residence among groups of different political affiliation, F6,1284 = 3.197, p = .004. Also, we should report our Post Hoc Tests to explain exactly which means were significant.


"Using a Bonferroni post-hoc test (p < .05), the group of Strong Democrats (M = 19.55 years) and Strong Republicans (M = 19.24 years) lived longer in their neighborhoods than Independents (M = 13.43 years)."

2. REGRESSION

This analysis is almost a continuation of correlation. As we have seen, in correlation one variable varies with another; a perfect correlation is when, as X increases one unit of measurement, Y increases an equal unit of measurement. In regression, you use the X variable to predict the value of the Y variable. This can be done in simple regression, where you use one variable to predict another, or multiple regression, where you use two or more variables to predict another. These variables are parametric data. However, if the variable you want to predict is categorical, with a value of 0 or 1 for example, you need to use logistic regression.

First, we need to know the names of the variables: dependent and independent.

Dependent: This is the variable being predicted. It "depends" on the other variable to be explained, so to speak.

Independent: This is the variable (or variables) that predicts the dependent. It is independent and can vary.

So, if we were to regress (make a prediction with variables) our poverty and murder variables, murder would be the dependent variable and poverty would be the independent variable: we are predicting the murder rate from the poverty rate.

The regression equation is ŷ = a + bx, where ŷ = the predicted value of y, a = the intercept, b = the slope (also given as β), and x = the value of the x variable. Don't worry, you are not expected to remember the math. Regression in the social sciences is not usually used for prediction but for theorizing: we want to create a model that predicts the dependent variable so we can understand what's happening.

As with the fit line in correlation, regression is a calculation of the regression line. This line best approximates all of the points. You have the regression line but, more often than not, not all points fall neatly onto it. The distance between the line and an actual observation is a residual.
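To make the equation concrete, here is a worked example with made-up numbers: if a = 2 and b = 0.5, then a case with x = 10 has a predicted value of ŷ = 2 + 0.5(10) = 7. If the observed y for that case is 8, its residual is 8 - 7 = 1.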

[Scatterplot: murder rate (y-axis, 0.00 to 12.00) against poverty rate (x-axis, 7.50 to 17.50) with the fitted regression line; the labeled gap between the Maryland point and the line is a residual.]


• Simple Linear Regression

Ok, simple regression is where we use one independent variable to predict one dependent variable. Using U.N. data, let’s do some population studies research. Let’s try to regress the mortality rate of a country using the caloric intake of its population. Recent U.N. data can be retrieved online (http://unstats.un.org/unsd/databases.htm). 1. In SPSS Data Editor, open the data file for Regression.

2. In the data file, click the Variable View tab at the bottom of the screen. This is a data set of 109 countries containing 8 variables. We want to use the variable food (Per capita daily calories 1985) as the independent variable to predict death (Crude death rate/1000 people), our dependent variable.

3. We need to establish our hypotheses: H0: There is no relationship between a country’s death rate and caloric intake. HA: A country’s death rate can be predicted by the daily caloric intake of its populace.


4. In the Analyze menu, select Regression and left-click Linear.

5. In the Linear Regression dialog, 1) in the variable list, select Crude death rate/1000 people [death], click the arrow, and move it into the Dependent field. Select Per capita daily calories 1985 [food], click the arrow, and move it to the Independent(s) field. 2) Click on Save, since we want the residuals.


6. In the Linear Regression: Save dialog, 1) select Unstandardized under Residuals. This saves the difference between the observed and predicted values of each case. 2) Click Continue.


7. Back in the Linear Regression dialog, click OK.
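For reference, steps 4 through 7 condense into a short syntax sketch (variable names as they appear in the Regression file; /SAVE RESID writes the unstandardized residuals to a new variable):

* Simple linear regression of death rate on daily calories.
REGRESSION
  /DEPENDENT death
  /METHOD=ENTER food
  /SAVE RESID.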


8. In your Output window, scroll down a little so you can see three tables: Model Summary, ANOVA, and Coefficients. Let’s discuss what’s important.

A. R Square (Model Summary) is the measure of the success of our model. It shows how much variance of the dependent variable is captured. For the social sciences, .347 is not bad, but the higher, the better. You can also report the Adjusted R Square; the difference is that the adjusted value does not automatically inflate with additional independent variables. B. Sig. (ANOVA) is the significance of our model. .000 means we reached significance and can reject our H0. C. Beta (Coefficients) is the b (slope) in our regression equation. Beta is negative here, showing a negative relationship: the higher the per capita daily calories, the lower the mortality rate.

You can think of it this way. Look at the Unstandardized Coefficient for a moment. This number does not take the standard deviation into consideration, so it is easier to interpret. You multiply the unstandardized coefficient by the change in the independent variable to get the change in the dependent variable. For every unit increase in the independent variable, there is a unit decrease (the sign is negative) in the dependent variable. So, if you increase food intake by 500 calories, you decrease mortality by 2.5 deaths per 1,000 people (-.005 * 500 = -2.5). Ok, so we can accept our HA. When we report our results, we have to report the model and the Beta. (In reality, you would rarely do a simple regression; you would have multiple independent variables. We'll talk about that shortly.) First our model: F with the df from the Regression and Residual in subscript from the ANOVA table = result, p (<=) number from the ANOVA table, and the R Square (or Adjusted R Square) from the Model Summary. "Caloric intake significantly predicted mortality rates, F1,106 = 56.217, p < .001, R² = .347." You also need to give info on the significant independent variable: Name of Variable, β = result, p (<=) result. "Our independent variable, per capita daily calories, was a significant predictor, β = -.589, p < .001." Now, let's visualize our data with a scatterplot to get a better sense of how our model is doing.


1. In the Output window, go to the Graphs menu, and left-click Scatterplot.

2. In the Scatterplot dialog, make sure Simple is selected and choose Define.


3. In the Simple Scatterplot dialog, move the variable Crude death rate to the Y Axis field, the Per capita daily calories variable to the X Axis field, and the variable country to the Label Cases by field. So, on our graph, mortality is the Y (vertical) axis, calories is the X (horizontal) axis, and the cases are labeled by country name. The dependent variable goes on the Y axis and the independent variable goes on the X axis. Click OK.
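If you prefer syntax, here is a sketch of roughly what SPSS pastes for this chart (to the best of our knowledge, the (NAME) keyword attaches the label variable to the points):

* Scatterplot of death (Y) against food (X), labeled by country.
GRAPH
  /SCATTERPLOT(BIVAR)=food WITH death BY country (NAME).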

This scatterplot really does a great job depicting the relationship. There is a definite negative relationship.


1. Double left-click on the graph to bring up the Chart Editor.

2. Let's put in the regression line: 1) click on the Add Fit Line at Total icon, then close the Properties dialog.


3. We now want to label individual cases. Click on Point Id and your cursor turns into cross-hairs. Label some of the cases above the upper end of the regression line. See a pattern? They're European countries.

Let's now look at our residuals. Remember, they are the distance of each observed case from the regression line; in other words, how far our cases fell from the line, which is our model. 1. Click back into the Data Editor. You now have a new variable, res_1. These are the residuals.


2. We want to sort the residuals from highest to lowest. Right-click the column header for res_1 and left-click Sort Descending.
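The same step as syntax, assuming SPSS named the residual variable res_1 as above:

* Sort cases by residual, largest first.
SORT CASES BY res_1 (D).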

3. If you scroll down, you notice that Spain is the European country with the lowest residual, -.97067. So you can see a pattern: European countries either follow the regression line closely or have higher residuals.

• Multiple Regression

In multiple regression, you set up a model where you use more than one independent variable to explain the dependent variable. The independent variables are usually parametric data, but you can also use dummy variables. This type of variable uses binary data (1 or 0) and can be included in the model. At the end of this tutorial are some suggestions for further reading in statistics. It's highly advisable to study some more and get a better understanding of statistics. Continuing with our mortality rate, let's try to use more variables in addition to our daily calories. Look at our Variable View.


We are interested in pop (1985 population in millions), energy (Per cap energy consumed), gnpcap (Per capita GNP 1985), gnpgro (Annual GNP growth % 65-86), and urban (% population urban 1985). Let's use these variables and daily calories to see if we can predict mortality. First, our hypotheses: H0: The mortality rate of a country is not related to the population size, per capita energy consumed, per capita GNP, annual GNP growth from 1965, percentage of urban population, or daily calories. HA: Population size, per capita energy consumed, per capita GNP, annual GNP growth from 1965, percentage of urban population, and daily calories can predict the mortality rate in a country.

1. In the Analyze menu, select Regression, and left-click Linear.


2. In the Linear Regression dialog, 1) select Crude death rate/1000, click the arrow, and move it to the Dependent field. Leave Per capita daily calories among the independent variables. 2) Select pop (1985 population in millions), energy (Per cap energy consumed), gnpcap (Per capita GNP 1985), gnpgro (Annual GNP growth % 65-86), and urban (% population urban 1985), click the arrow, and move them into Independent(s). There should be 6 independent variables altogether. 3) Don't forget to turn off the residuals by clicking on Save and unchecking Unstandardized. 4) We need to run a multicollinearity test: click on Statistics.


There are several methods to enter the variables, such as Enter, Stepwise, and Remove. The default is Enter. You also need to report which method you used. Usually, if you have a large set of independent variables, more than seven or so, you can use one of the stepwise options to eliminate variables.


3. In the Linear Regression: Statistics dialog, turn on Collinearity diagnostics and click on Continue.

Multicollinearity is when your independent variables correlate with each other. You need to avoid this because it’s really just creating noise in your model. For example, poverty correlates with crime and illiteracy. So, why have all three variables when poverty is a major cause of the other two?


4. Plots help you visualize your data in multiple regression. You can test for normality, linearity, equality of variances and outliers. 1) Back in the Linear Regression dialog, click on Plots. 2) In the Linear Regression: Plots dialog, click on Histogram and Normal Probability Plot. This will give you histograms and plots of the residuals standardized by the standard deviation. 3) Back in the Linear Regression dialog, click OK.
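Steps 1 through 4 again condense to one syntax sketch (variable names from the Regression file; COLLIN and TOL request the collinearity diagnostics, and the RESIDUALS subcommand requests the two plots):

* Multiple regression of death rate on six predictors.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /DEPENDENT death
  /METHOD=ENTER food pop energy gnpcap gnpgro urban
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).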


5. Scroll down to Model Summary and ANOVA. Here you find the model results.

A. In the Model Summary, you find the R Square of .525. So, this model does a better job of explaining the variance of the dependent variable than using daily calories alone.

B. In the ANOVA table, you find the df, F value, and the Sig., all of which you need to report. The Sig. for the model is .000, so we can reject the null hypothesis.


C. The Beta tells you the slope for each coefficient. D. The Sig. tells you whether the variable is significant. E. Tolerance is the value for collinearity. A value under .2 usually means the variable correlates with other independent variables and should not be included. We have three variables that are significant and not correlating: % population urban 1985, Per capita daily calories 1985, and Annual GNP growth % 65-85.

6. Scroll down to the graphics. 1) You have the histogram with the normal curve superimposed; there is really no serious skewness. 2) You have a P-P plot. If there were skewness, you would see observations far away from the line.


When we report, we need to report all the variables, even the insignificant ones. Use the same reporting structure as with simple regression. First our model: F with the df from the Regression and Residual in subscript from the ANOVA table = result, p (<=) number from the ANOVA table, and the Adjusted R Square from the Model Summary. Also, since we have several significant independent variables, make a chart: Name of Variable, B = result, β = result, p (<=) result. "The percentage of annual GNP growth for the years 1965-85, the percentage of urban population in 1985, and the per capita daily calories were statistically significant predictors of each country's mortality rate, F6,92 = 16.914, p < .001, R² = .525."


Also, in multiple regression, it is a good idea to present the independent variables in a chart, and to state which entry method you used, i.e., Enter or Stepwise.

Independent Variable                        B            β       p
Percentage of Urban Population 1985         -.107        -.588   <.001
1985 Population in Millions                 -.005        -.125   .093
Per Capita Daily Calories 1985              -.002        -.285   .018
Per Capita Energy Consumed, kg oil           .001         .240   .225
Per Capita GNP 1985                         3.07E-006     .003   .989
Percentage of Annual GNP Growth 1965-85     -.482        -.287   .006

*Enter Method

APA suggests reporting only standardized Betas for purely theoretical work and unstandardized Betas for purely applied work. Keep the scientific notation. Also, even though APA does not mention it, you often see p values flagged for significance at <.05 or <.01. You will have to discuss this with your advisor or journal editor. In the Output window, please close all the results, but don't close the window.

3. LOGISTIC REGRESSION

In the previous section, you used linear regression to predict the dependent variable. The dependent variable was usually a ratio variable, or continuous data. Sometimes you want a dependent variable that is dichotomous, one which has only two answers. For example, in community health you may want to see if a disease is present or not present. In environmental studies, you can predict whether a mudslide occurs or does not occur. In political science, you can see if a person will participate or not participate in an election. You can also have a more complex categorical dependent variable. For this lesson, you will work with binary logistic regression; however, if the dependent variable has more than two outcomes, you can use multinomial logistic regression. Today, you will try to predict the probability of a person defaulting on a loan. You have a number of variables, some parametric and one categorical, which you will use to try to regress the default probability. 1. Open the data set logistic.sav.


As you can see, 1) there is a default variable with the values 0 for no and 1 for yes. This is a dichotomous variable, so we should use logistic regression. 2) There are eight other variables that we want to use as our independent variables. Please check these variables out in the Variable View. You can see that seven of them are scale data. However, if you look at ed (level of education), you see it is an ordinal variable, i.e., 1 'did not complete high school,' 2 'high school degree,' etc. We will have to treat it differently in our setup of the regression.


1. Let's do our regression. In the Analyze menu, select Regression and then left-click Binary Logistic.


2. 1) Select the variable Previously defaulted in the list on the left and click the arrow to move it into the Dependent field. 2) Select all the other variables and move them into the Covariates field. 3) We will leave the Method as Enter.


3. In order to read the odds ratio, which will be explained later, 1) click on Options. 2) In the Logistic Regression: Options dialog, click on CI for exp(B). This gives the confidence intervals at 95%. 3) Click Continue.


4. One of our covariates, Level of Education (ed), is a categorical variable. You cannot treat it as parametric data like the other covariates; you need to treat it as a dummy variable, broken down into separate coefficients depending on whether a value is present or not. Always create as many coefficients as there are values minus one. That is, we have five values in Level of Education (ed): 1 = did not complete high school, 2 = high school degree, 3 = some college, 4 = college degree, and 5 = post-undergraduate degree, so we need to set up four coefficients, each with values 1 for yes and 0 for no. The first value, "did not complete high school," which has hundreds of observations, is left out: one group is put aside to prevent multicollinearity. The Categorical Variables Codings table looks like this:

Categorical Variables Codings

Level of education               Frequency    (1)      (2)      (3)      (4)
Did not complete high school     372          .000     .000     .000     .000
High school degree               198          1.000    .000     .000     .000
Some college                     87           .000     1.000    .000     .000
College degree                   38           .000     .000     1.000    .000
Post-undergraduate degree        5            .000     .000     .000     1.000

1) In the Logistic Regression dialog, click on the Categorical button, which opens the Logistic Regression: Define Categorical Variables dialog. 2) From the Covariates list, select and move the variable ed over to the Categorical Covariates field, where it appears as ed(Indicator).


5. 1) Select the variable ed(Indicator). Contrast is the method used to compare the categories. 2) Leave it at Indicator, which shows presence or absence of membership. Reference Category is the value that is set to zero and left out. 3) Select First for the first value, Did not complete high school. 4) Click Change, and "(first)" is appended to Indicator in the Categorical Covariates field. 5) Click Continue.


6. Back in the Logistic Regression dialog, click OK.
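The whole logistic setup, as a syntax sketch (covariate names as they appear in logistic.sav; Indicator(1) makes the first category, did not complete high school, the reference):

* Binary logistic regression of prior default on eight covariates,
* with ed treated as categorical and 95% CIs for Exp(B).
LOGISTIC REGRESSION VARIABLES default
  /METHOD=ENTER age ed employ address income debtinc creddebt othdebt
  /CONTRAST (ed)=Indicator(1)
  /PRINT=CI(95).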


7. In the Output window, scroll down to Block 1. (Block 0 is the baseline model without the independent variables.)

In the Omnibus Tests of Model Coefficients, the Model line has the values for the whole regression model:

• Chi-square = 254.801, the model value
• df = 11
• Sig. = .000

So the model is significant. In the Model Summary, we have the estimates of variance accounted for. Since this is not linear regression, this is considered a pseudo R2. The Cox & Snell value is an underestimate. However, you do not need to report these numbers.

In the Classification Table, you find how well the model made predictions. Percentage Correct shows the percentage for each value. You see the model correctly predicted 93% of the observed No cases (people who had not previously defaulted) and 51% of the observed Yes cases (people who had). The Overall Percentage is the accuracy across all cases, which was 82%.

The Variables in the Equation table gives you a lot of information.

B is the b (slope) in our regression equation. Wald is the test value for the coefficient. Sig. is the significance of the variable.


Exp(B) gives you the odds ratio. The odds ratio tells you how much the odds of the outcome (here, having defaulted) change when the independent variable increases by one unit. If the odds ratio is greater than 1 and the lower and upper bounds of the 95% CI don't overlap 1, then the independent variable increases the odds of the outcome. If the odds ratio is less than 1 and the 95% CI doesn't overlap 1, then the independent variable decreases the odds of the outcome.

The Age variable is significant at .044. The Exp(B) is 1.036 and the lower and upper C.I. bounds do not overlap 1, so the odds that someone has previously defaulted increase by about 3% (1.036 - 1.00 = .036) with each one-unit increase in age. The Address variable is significant at .000. The Exp(B) is .900 and the lower and upper C.I. bounds do not overlap 1, so the odds that someone has previously defaulted decrease by about 10% (.900 - 1.00 = -.100) with each one-unit increase in address.

Now we can report our results. Remember, you need to report on the model and the coefficients. First our model: χ² = result and df = result from the Omnibus Tests of Model Coefficients, N = result from the Case Processing Summary, and p (<=) result from the Omnibus Tests of Model Coefficients. Also, since you used the Enter method for the independent variables, make a chart: Name of Variable, β = result, Wald χ² = result, p (<=) result, Odds Ratio = result. First the model: "Logistic regression was performed on our eight predictor variables, othdebt, creddebt, debtinc, income, address, employ, ed, and age, to ascertain whether they significantly predicted whether someone had previously defaulted on a loan. Together the predictors significantly predicted prior default, χ² = 254.80, df = 11, N = 700, p < .001. Table 2.1 shows the logistic coefficient, Wald test, and odds ratio of each predictor. Age, employ, address, debtinc, and creddebt were significant coefficients. The odds ratio for creddebt showed that the odds that someone had previously defaulted increase by 86% with each $1,000 increase in credit card debt."


Now the chart:

Predictor    β       Wald χ²    p        Odds Ratio
Age          .04     4.07       .044     1.04
Ed                   2.66       .616
Ed (1)       -.88    .46        .498     .42
Ed (2)       -.57    .19        .660     .57
Ed (3)       -.52    .16        .688     .59
Ed (4)       -.96    .52        .471     .38
Employ       -.26    60.88      <.001    .77
Address      -.11    20.53      <.001    .90
Income       -.01    1.01       .315     .99
Debtinc      .07     5.34       .021     1.07
Creddebt     .63     30.63      <.001    1.87
Othdebt      -.71    .24        .625     .49

*Enter Method

STATISTICAL ANALYSIS REVIEW

Analysis: Simple and Multiple Linear Regression
Data type: parametric (categorical dummy variables can be used)
Purpose: Simple linear regression predicts the value of one variable from another. Multiple regression predicts the value of one variable from the values of two or more variables.
Reporting: 1. MODEL: F with the df from the Regression and Residual in subscript from the ANOVA table = result, p (<=) number from the ANOVA table, and the Adjusted R Square from the Model Summary. 2. SIGNIFICANT INDEPENDENT VARIABLE(S): Name of Variable, β = result, p (<=) number.

Analysis: Logistic Regression
Data type: categorical variable to be predicted
Purpose: To predict the value of a categorical variable from the value(s) of other variables.
Reporting: Same as above.

Analysis: ANOVA
Data type: parametric and categorical
Purpose: Performs multiple t-tests, comparing the means of three or more groups.
Reporting: F with the between-groups and within-groups df in subscript from the ANOVA table = result, p (<=) number from the ANOVA table. The nonparametric equivalent is Kruskal-Wallis: Using Kruskal-Wallis, χ² = result, p (<=) result.

* Remember, you also need to mention in the text or in the statistical reporting whether your analysis was one- or two-tailed.

Online Statistics Help

Very good outline guides for SPSS by the Department of Psychology at the University of Nebraska (http://www-class.unl.edu/psycrs/statpage/)
The California State University Social Sciences Research and Instructional Council Teaching Resources Depository Home Page (http://www.csubak.edu/ssric/welcome.htm)


Statistics at Texas A&M University (http://www.stat.tamu.edu/spss.php)
Visualization of statistical analyses at EvoTutor (http://www.evotutor.org/Statistics/StatisticsA.html)
A New View of Statistics (http://www.sportsci.org/resource/stats/contents.html)
HyperStat (http://davidmlane.com/hyperstat/)
The Really Easy Statistics Site (http://helios.bto.ed.ac.uk/bto/statistics/tress1.html)

Publishing Guide

American Psychological Association. (2001). Publication Manual of the American Psychological Association (5th ed.). Washington, D.C.: American Psychological Association.
Morgan, S. E., Reichert, T., & Harrison, T. R. (2002). From Numbers to Words. Boston: Allyn and Bacon.
Leech, N. (2005). SPSS for Intermediate Statistics: Use and Interpretation. Mahwah, NJ: Lawrence Erlbaum Associates.
Hesson-McInnis, M. Reporting Statistics in APA Style (http://www.ilstu.edu/~mshesso/apa_stats_format.html)
Kahn, J. Reporting Statistics in APA Style (http://www.ilstu.edu/~jhkahn/apastats.html)

Contact Info: Tom Stieve, [email protected]

© Thomas Stieve