

Size matters: Standard errors in the application

of null hypothesis significance testing in criminology

and criminal justice

SHAWN D. BUSHWAY* and GARY SWEETEN
Department of Criminology and Criminal Justice, University of Maryland, 2220 LeFrak Hall,

College Park, MD, 20742, USA

*corresponding author: E-mail: [email protected]

DAVID B. WILSON
Administration of Justice Program, George Mason University, Manassas, VA, USA

Abstract. Null Hypothesis Significance Testing (NHST) has been a mainstay of the social sciences for

empirically examining hypothesized relationships, and the main approach for establishing the

importance of empirical results. NHST is the foundation of classical or frequentist statistics. The

approach is designed to test the probability of generating the observed data if no relationship exists

between the dependent and independent variables of interest, recognizing that the results will vary from

sample to sample. This paper is intended to evaluate the state of the criminological and criminal justice

literature with respect to the correct application of NHST. We apply a modified version of the

instrument used in two reviews of the economics literature by McCloskey and Ziliak to code 82 articles

in criminology and criminal justice. We have selected three sources of papers: Criminology, Justice

Quarterly, and a recent review of experiments in criminal justice by Farrington and Welsh. We find that

most researchers provide the basic information necessary to understand effect sizes and analytical

significance in tables which include descriptive statistics and some standardized measure of size (e.g.,

betas, odds ratios). On the other hand, few of the articles mention statistical power and even fewer

discuss the standards by which a finding would be considered large or small. Moreover, less than half of

the articles distinguish between analytical significance and statistical significance, and most articles used

the term "significance" in ambiguous ways.

Key words: criminal justice, criminology, Justice Quarterly, regression, review, significance, standard

error, testing

Introduction

Null Hypothesis Significance Testing (NHST) has been a mainstay of the social

sciences for empirically examining hypothesized relationships, and the main

approach for establishing the importance of empirical results. NHST is the

foundation of classical or frequentist statistics founded by Fisher (the clearest

statement is in Fisher 1935), and it has three key steps. First, a null hypothesis of

"no difference" or "no relationship" is established with no specific alternative

hypothesis specified. In the second step, a test statistic is calculated under a number

of distributional assumptions. Finally, if the probability of obtaining the calculated

test statistic is below a certain threshold (typically 0.05 in the social sciences), the

Journal of Experimental Criminology (2006) 2: 1–22 © Springer 2006

DOI: 10.1007/s11292-005-5129-7


null hypothesis is rejected. The approach is designed to test the probability of

generating the observed data if no relationship exists between the dependent and

independent variables of interest, recognizing that the results will vary from

sample to sample. A rejection of the null implies that the observed data was not

generated due to simple sampling variation, and therefore a true difference or

relationship between the key variables exists with only some small chance of Type

I error (i.e., concluding that there is a relationship when in fact there is none).
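The three steps can be made concrete with a small numerical sketch. The following is our illustration, not an example from the paper, using a two-sample z-test on invented group proportions:

```python
import math

def two_sample_z_test(mean1, mean2, sd1, sd2, n1, n2):
    """NHST steps: (1) null of no difference, (2) a test statistic
    computed under distributional assumptions, (3) reject if p < 0.05."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se                       # step 2: test statistic
    p = math.erfc(abs(z) / math.sqrt(2))           # two-sided p-value under N(0, 1)
    return z, p, p < 0.05                          # step 3: threshold decision

# Hypothetical recidivism proportions for treatment vs. control groups
z, p, reject = two_sample_z_test(0.42, 0.50, 0.49, 0.50, 400, 400)
```

Note that rejecting the null here says only that chance alone is an unlikely explanation for the observed difference; it says nothing by itself about whether the difference is large enough to matter.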

NHST has attracted criticism from its inception (e.g., Berkson 1938; Boring

1919) and critics of the approach can be found in psychology (Gigerenzer 1987;

Harlow et al. 1997; Rozeboom 1960), medicine (Marks 1997), economics (Arrow

1959; McCloskey and Ziliak 1996), ecology (Anderson et al. 2000; Johnson 1999)

and criminology (Maltz 1994; Weisburd et al. 2003). The critics can be prone to

flowery hyperbole. Our favorite is from Rozeboom (1997), who stated that "(n)ull

hypothesis testing is surely the most bone-headedly misguided procedure ever

institutionalized in the rote training of science students" (p. 335). The critics can be

placed into three basic categories.

The first group of critics believes that the statistical properties of this approach

are flawed. For example, the approach ignores theoretical effect sizes and Type II

error. From this perspective, alternative approaches such as Bayesian statistics,

Neyman–Pearson decision theory, non-parametric tests, and Tukey exploratory

techniques should at least supplement if not replace NHST (Maltz 1994; Zellner

2004). A second group of critics takes a more philosophical approach, focusing on

the limitations of experimental and correlational designs in social science (Lunt

2004). These critics would prefer more qualitative evidence and theoretical

development and less "rank empiricism." In contrast to the first two groups, the

third group of critics is not interested in replacing NHST, but rather hopes to

correct mistakes in the application of NHST.1

This last group focuses on the difference between analytical significance and

statistical significance. To these critics, the size of the effect should drive the

evaluation of the analysis, not its statistical significance. As such, effects can be

analytically uninteresting (i.e., trivially small given the topic at hand) and

statistically significant. Effects can also be analytically interesting and statistically

non-significant. That is, tests can have low power, such that the researcher cannot

reject the null hypothesis for analytically interesting effect sizes (Cook et al. 1979;

Lipsey 1990; Weisburd et al. 2003). The goal of these reformers is to persuade

researchers to focus on analytical or substantive significance in addition to

statistical significance. In economics, this effort has been led by McCloskey and

Ziliak (1996; Ziliak and McCloskey 2004) who have conducted high profile

reviews of the lead journal in the field (American Economic Review) to document

proper and improper usage of the NHST. The movement is more advanced in

medicine and psychology (Fidler 2002; Thompson 2004) where revised editorial

standards strongly recommend that researchers move away from over-reliance on

P-values and become more focused on effect sizes and confidence intervals (APA

2001). A recent issue of the Journal of Socio-Economics devoted to this topic is the

first attempt to bring together researchers from multiple disciplines (sociology,



psychology, economics, and ecology) to discuss the problem and possible

solutions.2 In this issue, there is marked frustration at the slow adoption of good

practice despite the clear consensus about the standards of good practice.

One possible reason for the lack of real change is that misuse causes no real

harm. Elliott and Granger (2004) and Wooldridge (2004), for example, do not deny

that people do a poor job of reporting their results, but nonetheless argue that

NHST is useful for theory testing and the intelligent reader usually can decipher

the effect sizes for herself. On the other hand, Weisburd et al. (2003) provide a

compelling case that the poor use of the NHST can in fact have profound negative

consequences. Specifically, they are concerned about researchers who accept a null

hypothesis in program evaluation and act as if the effect was actually zero,

concluding that the program does not work. They sampled all program evaluations

used in the Maryland Crime Prevention Report in which null findings were

reported (Sherman et al. 1997). They then investigated whether there was enough

statistical power to reject the null hypothesis of a small but substantively

meaningful effect size of 0.2 or greater. In slightly less than half of the cases

this reasonable effect size could not be rejected, suggesting that the null finding

was not substantively the same as an effect size of zero. Or to put it another way,

tests with low power to identify reasonable effect sizes were being used to

conclude that programs did not work (see also Lipsey et al. 1985, for a similar

conclusion). It is hard to argue that such practice is not both bad statistics and

harmful for the accumulation of knowledge.
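The power problem Weisburd et al. describe is easy to reproduce with a normal-approximation power calculation. The sketch below is our illustration with invented sample sizes; it shows why a null finding from a small study is still compatible with a meaningful effect size of 0.2:

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_two_sample(d, n_per_group, z_crit=1.959964):
    """Approximate power of a two-sided two-sample test to detect a
    standardized effect size d (normal approximation, alpha = 0.05)."""
    ncp = d * math.sqrt(n_per_group / 2)  # noncentrality under the alternative
    return norm_cdf(ncp - z_crit) + norm_cdf(-ncp - z_crit)

# With 50 cases per group, power to detect d = 0.2 is roughly 0.17:
# a null finding says almost nothing about whether a d = 0.2 effect exists.
low = power_two_sample(0.2, 50)
high = power_two_sample(0.2, 500)  # roughly 0.89 with 500 per group
```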

It is not surprising that the first formal review of the use of NHST in

criminology and criminal justice was done in the context of program evaluation.

Thompson (2004) argues that the shift to "meta-analytical" thinking (Cumming

and Finch 2001), with its focus on effect sizes and replicability, is facilitating a real

change in psychology and education research. Replicability is judged by evaluating

the stability of effects across a related literature, a comparison that requires the

use of standardized effect sizes and does not depend on statistical significance.

Meta-analytic thinking has become increasingly common in criminology and

criminal justice (Petrosino 2005; Wilson 2001). We believe, however, that the poor

practice of NHST goes beyond program evaluation and has negative consequences

for knowledge building in all areas of the field. While the change can start in

the program evaluation literature, it must ultimately pervade criminology more

broadly.

This paper is intended to evaluate the state of the criminological and criminal

justice literature with respect to the correct application of NHST. Our assessment

conceptualizes the application of NHST around four basic issues. The first two

concern the size of the coefficients of interest. This broader issue is conceptualized

as having two components: reporting the size of an effect (or presenting

information necessary to determine the size of an effect), and interpreting the size

of an effect, not merely its statistical significance. The third issue concerns the

correct interpretation of non-significant effects, such as reporting confidence

intervals or power analysis. The fourth issue focuses on basic errors in the

application of NHST, such as errors in the specification of the null hypothesis or



relying on statistical significance for the selection of variables in a multivariate

model.

To address these four issues, we adapt the instrument used in economics by

McCloskey and Ziliak (1996) and Ziliak and McCloskey (2004). We used this

instrument to code 82 articles in criminology and criminal justice selected from

three sources: Criminology, the flagship journal of the American Society of

Criminology, Justice Quarterly, the flagship journal of the Academy of Criminal

Justice Science, and a review piece by Farrington and Welsh (2005) on

experiments in criminal justice. In each case our goal is to focus on outlets

representing the best practice in a particular area of the field.

We find very similar results across the outlets. In general, we find both good and

bad practices in the field. On the one hand, most researchers provide the basic

information necessary to understand effect sizes and analytical significance in

tables which include descriptive statistics and some standardized measure of size

(e.g., betas, odds ratios). In fact, most researchers describe the size of their

coefficients in some way. On the other hand, only 31% of the articles mention

power and fewer than 10% of the articles discuss the standards by which a finding

would be considered large or small. None of the articles explicitly test statistical

power with a specific alternative hypothesis. It is not surprising, therefore, that

only 40% of the articles distinguish between analytical significance and statistical

significance, and only about 30% of the articles avoid using the term "significance" in ambiguous ways. In large part, research in this field equates statistical

significance with substantive significance. Researchers need to take the next step

and start to compare effect sizes across studies rather than simply conclude that

they have similar effects solely on the basis of a statistically significant finding in

the same direction as previous work. The paper proceeds in the next section by

discussing the sampling frame, followed by a discussion of the instrument, and

finally, the results.

Materials and methods

Sampling frame

We focus on the prominent articles in three discrete subfields: criminology,

criminal justice, and experimental criminology. As such, we selected articles in the

journals Criminology and Justice Quarterly in the years 2001 and 2002. We

recognize that this strategy presents more of a snapshot than an overview;

however, we would be surprised to find that practice in this two year window was

substantially different than practice in the surrounding years. We selected all

articles that used bivariate analysis, ordinary least squares (OLS) methods (e.g.,

OLS regression, analysis-of-variance), or logistic regression. Although NHST is

also appropriate for the broader class of non-linear models, we followed

McCloskey and Ziliak (1996) by focusing on the simplest models in common

usage.



There were 66 published articles in 2001–2002 in Criminology, and of those, 32

met our eligibility criteria. There were 62 articles in 2001–2002 in Justice

Quarterly and of those, 32 met our eligibility criteria. The most commonly

excluded analyses were hierarchical, structural equation, tobit and count models.

Because we also wanted to include program evaluations in our analysis, we used

the recent Farrington and Welsh (2005) review of experimental studies as our

sampling frame. We would have preferred to pick a journal like the Journal of

Experimental Criminology rather than a review article. However, this journal is

only in its second year, and there was no other unified source of experiments in

criminology. The Farrington and Welsh review, published in this journal,

represented what we believe is the definitive list of experiments in criminology.

We focused on articles that had been published in journals since 1995. Of these 27

articles, we were able to acquire 18 through the University of Maryland library.

Because these articles met the criteria for inclusion in Farrington and Welsh’s

(2005) review, we assume that they represent the state of experimental

criminology, and are not significantly different from the other nine articles.

Coding protocol

The original instrument by McCloskey and Ziliak (1996) had 19 questions. We

eliminated seven questions for one of three reasons: (1) we felt that McCloskey

and Ziliak were taking an extreme position, such as when they requested

simulations to determine if regression coefficients were reasonable, (2) we felt

the question was redundant, or (3) we judged the question to be ambiguous,

making it difficult to code consistently across studies. We also added three

questions, one to explicitly test the issue raised by Weisburd et al. (2003) with

respect to accepting the null hypothesis, and two others to address the use of

confidence intervals to aid in the interpretation of effect sizes, particularly for null

findings. All questions were coded as one for good practice and zero for bad

practice. The questions and coding conventions, grouped by the four main issues,

were:

A. Reporting effect size, or the information necessary to determine effect size

1) Were the units and descriptive statistics for all variables used in bivariate

and multivariate analyses reported? Adequate reporting of descriptive statistics is

necessary to fully assess the magnitude of effects. McCloskey and Ziliak insisted

only on the display of means, but differences between means can only be properly

interpreted in the context of sample variability. As such, to be considered good

practice, we required that means and standard deviations be reported for

continuous variables. Furthermore, we insisted that studies include descriptive

statistics for all variables included in analyses. A number of studies provided only

partial information and therefore were not given credit for good practice on this

item. In this and other items, one could argue that this "mistake" is quite minor.

Nonetheless, this type of omission makes understanding the substantive significance of findings difficult, and complicates the comparison of effect sizes across

studies. We did not determine if the sample sizes in the descriptive statistics table

matched the sample sizes in the analysis although technically this should be the

case.

2) Were coefficients reported in elasticity (% change Y/% change X) form or in

some interpretable form relevant for the problem at hand so that readers can

discern the substantive impact of the regressors? While we did find a few studies

that used elasticities, this practice is uncommon in criminology and criminal

justice. It is much more common to see standardized betas, odds ratios, or other

effect size indices reported. In order to get credit for good practice, the study

author had to provide betas/odds ratios/elasticities in all of the tables in which

multivariate models were reported. This question was not applicable for papers that

did not use multivariate regressions.
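For readers unfamiliar with these indices, the conversions are mechanical. A short sketch with invented coefficients (ours, not from any of the coded studies):

```python
import math

# A hypothetical logistic regression coefficient for a binary predictor
b_logit = 0.405
odds_ratio = math.exp(b_logit)  # ~1.50: the predictor multiplies the odds by about 1.5

# A hypothetical OLS slope rescaled to a standardized beta
slope, sd_x, sd_y = 2.0, 1.5, 10.0
beta = slope * sd_x / sd_y  # 0.30 sd of y per one-sd change in x

# The same slope expressed as an elasticity at the sample means
x_mean, y_mean = 3.0, 12.0
elasticity = slope * x_mean / y_mean  # 0.5% change in y per 1% change in x
```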

3) Did the paper eschew "asterisk econometrics," defined as ranking the

coefficients according to the absolute size of the t-statistics? Reporting coefficients

without a t-statistic, P-value, or standard error counts as "asterisk econometrics."

4) Did the paper present confidence intervals to aid in the interpretation of the

size of coefficients? Any presentation, either in tables or the text was coded as good

practice. Just as a standard deviation provides a context for interpreting a mean by

providing information about the variability in a distribution, confidence intervals

provide a context for interpreting coefficients by providing information on

precision or the plausible range within which the population parameter is likely

to fall. The use of confidence intervals as an adjunct to NHST has been widely

recommended as a method to address weaknesses in NHST (e.g., APA Task Force

on Statistical Inference, 1996).

B. Interpreting effect size, not just statistical significance

5) Did the paper discuss the size of the coefficients? Any mention of the size of

a coefficient in substantive terms in the text of the paper was coded as good

practice. This coding decision was more lenient than that applied by McCloskey

and Ziliak, but nonetheless we still found articles which failed this question.

Simply listing the betas in replication of the table was not sufficient.

6) Did the paper discuss the scientific conversation within which a coefficient

would be judged "large" or "small"? In other words, did the authors explicitly

consider what other authors had found in terms of effect size or what standards

other authors had used to determine importance? Time after time, authors claimed

that their results were similar to prior results in the literature solely on the basis of

a statistically significant finding in the same direction as previous findings. This

question required that some attempt was made to compare effect size across studies

in an attempt to build knowledge. One could argue that this exercise is somewhat

futile in a field like criminology where many of the variables are unique scales or

otherwise constructed variables without inherent meaning, and where treatments

are applied to unique populations. But a comparison of betas or odds ratios is

informative and might encourage a more standardized approach to measurement.

Moreover, we find it hard to understand how criminological understanding can be



advanced simply by knowing the sign and significance of an effect with no

substantive understanding. This is particularly problematic in theory testing, where

every theory has found support in the form of a significant coefficient in a reduced

form model on the variable or variables thought to best represent the theory. Some

index of the size of the observed effect is critical to understanding the importance

and theoretical implications of a finding.

7) After the first use, did the paper avoid using statistical significance as the

only criterion of importance? The other most common reference was to measures

of fit such as R2.

8) In the conclusion section, did the authors avoid making statistical

significance the primary means for evaluating the importance of key variables in

the model? In other words, did the authors keep statistical significance separate

from substantive meaning? This was coded as "not applicable" if the paper did not

center on key independent variables (e.g., exploratory analyses).

9) Did the paper avoid using the word "significance" in ambiguous ways,

meaning "statistically significant" in one sentence and "large enough to matter for

policy or science" in another? We conducted an electronic text search for the word

fragment "signific" and evaluated each usage to determine if the usage was clear.

This item was coded as not applicable if no use of the word fragment "signific" was

found in the article. We did not automatically code a study as using "significance" ambiguously if "significant" was used without qualification (statistical v. substantive); rather, use of the term had to be consistent (if statistically significant was the

default use, then a qualifier must be used if the authors meant "substantively

significant.") One ambiguous usage was enough to get a "bad practice" score on this

question.
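A search of this kind is simple to mechanize. A minimal sketch (the example sentences are invented for illustration, and each hit would still be read in context by a coder):

```python
import re

# Flag every occurrence of the fragment "signific" so a coder can judge
# whether the usage is statistical, substantive, or ambiguous.
text = ("The effect was significant (p < .05). "
        "This is a significant policy problem.")
hits = [m.start() for m in re.finditer(r"signific", text, re.IGNORECASE)]
```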

C. Interpreting statistically non-significant effects

10) Did the paper mention the power of a test? We coded two types of power

discussions. The first type involved some discussion of power (usually sample size)

and its impact on parameter estimates. The second type of power discussion was an

actual test of power in the study. We had some concerns that this question puts

undue emphasis on post-hoc power analysis. While appropriate in some cases,

post-hoc power analysis can lead to mistakes in interpretation because it requires

the assumption of a "known" or hypothetical effect size. A more general approach

advocated in psychology and statistics involves the presentation of confidence

intervals around the coefficients so readers can observe the range of values

consistent with the analysis (APA 2001; Hoenig and Heisey 2001). This practice has

yet to see widespread use in criminology and criminal justice, but we believe that

confidence intervals could easily be presented in most papers.

11) Did the paper make use of confidence intervals to aid in the interpretation

of null findings? A statistically non-significant effect does not, in and of itself,

provide a basis for accepting the null. Recall that NHST assumes the truthfulness

of the null and then determines the probability of the observed data given that

assumption. A statistically non-significant effect merely means that the data could

reasonably occur by chance in a reality where the null hypothesis were true. This



establishes the plausibility of the null but little more. It is fundamentally a weak

conclusion. By providing a range of plausible values for the population effect, a

confidence interval greatly facilitates the interpretation of a null finding. A

confidence interval that includes the null and does not include a substantively

meaningful value provides evidence that the effect is functionally null. A large

confidence interval that includes substantively meaningful values despite being

statistically non-significant, however, leaves open the possibility that the null is not

only false but that a genuine effect of substantive size exists. See Rosenthal and

Rubin (1994) for an interesting discussion of this issue.
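These two situations can be made concrete with a small sketch. The estimates, standard errors, and the 0.2 threshold for a "meaningful" effect below are invented for illustration:

```python
def ci95(estimate, se):
    """95% confidence interval under a normal approximation."""
    half = 1.959964 * se
    return estimate - half, estimate + half

def interpret_null(lo, hi, meaningful=0.2):
    """Classify a coefficient by its CI, per the logic in the text."""
    if not (lo < 0 < hi):
        return "statistically significant"
    if -meaningful < lo and hi < meaningful:
        return "functionally null"   # CI excludes all meaningful values
    return "inconclusive"            # meaningful values remain plausible

# A precise null: CI of roughly (-0.04, 0.08) excludes d = 0.2 either way
a = interpret_null(*ci95(0.02, 0.03))
# An imprecise null: CI of roughly (-0.09, 0.39) still includes large effects
b = interpret_null(*ci95(0.15, 0.12))
```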

12) Did the paper eschew "sign econometrics," meaning remarking on the sign

but not the size or significance of the coefficients? The most common form of "sign

econometrics" is a discussion of the sign of a non-significant coefficient without a

larger justification. A larger justification would include a statement that the effect

sizes were large but not statistically significant because sample sizes were small.

This question is explicitly focused on researchers who place a premium on

statistical significance but report sign as if it is independent of statistical significance. The only case where researchers are justified in reporting the sign of a

non-significant coefficient is when it is substantively meaningful, and there are

sample limitations (Greene 2003, Ch. 8). In general this question applies only to

non-significant coefficients, but we also coded cases where researchers reported a

comparison of two (significant) coefficients without an explicit hypothesis test. In

this case, saying that coefficient A is bigger than coefficient B without considering

statistical significance is focusing on the sign of the difference.

13) In the conclusions, did the authors avoid interpreting a statistically non-significant effect with no power analysis or confidence interval as evidence of no

relationship? We developed this question based on Weisburd et al. (2003). Concluding there is no effect was justified if a confidence interval does not contain

a meaningful effect. This was only coded for papers that were dealing with explicit treatments (the Farrington sample), which was directly comparable to the

Weisburd et al. sample.

D. Avoiding basic errors in the application of NHST

14) Were the proper null hypotheses specified? The most common null

hypothesis is that the coefficient is zero. In fact, this is usually a default position

for researchers who estimate a simple reduced form regression without very much

structure. Yet, this may not be the relevant null hypothesis. In economics, theory

often makes an explicit prediction about the size of a coefficient. We can think of

only one case in criminology where this might be true: in sentencing research,

researchers have begun including the presumptive sentence from sentence

guidelines (Engen and Gainey 2000). If judges follow the guidelines with random

variation, the coefficient should be 1. However, criminologists do sometimes want

to compare coefficients across groups, which implies a non-zero null.
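Testing such a non-zero null requires only recentering the test statistic on the hypothesized value. A hypothetical sketch (the coefficient and standard error are invented):

```python
import math

# H0: judges follow the presumptive sentence exactly, so the coefficient
# on the presumptive sentence equals 1 (not the default null of 0)
b_hat, se = 0.92, 0.03
z = (b_hat - 1.0) / se                 # test against 1, not against 0
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
# Here z is about -2.67 and p < 0.05: a detectable departure from H0
```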

15) Did the paper avoid choosing variables for inclusion solely on the basis of

statistical significance? The standard logic, fairly common in the social sciences, is

that if variables are statistically significant, they should be included even without



theoretical justification. But this approach essentially equates statistical significance with substantive significance. Variables should be included because theory

suggests that this is the process that generates the data. We coded as bad practice

papers for which the only justification for the inclusion of variables was the finding

of significance in prior studies. We admit to some ambivalence here, because this may seem a merely semantic issue, but the approach is nonetheless logically flawed. A much more problematic practice is the use of

stepwise regression. Simulations have shown that stepwise regression can lead to

statistically significant and strong results for variables that are unrelated to the

dependent variable (Freedman 1983). We coded as bad practice any article which

excluded control variables because of a lack of statistical significance.
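Freedman's (1983) result is easy to reproduce in a short simulation. The sample size and screening threshold below are illustrative choices, not Freedman's exact design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 100, 50
X = rng.normal(size=(n, k))   # 50 candidate regressors, all pure noise
y = rng.normal(size=n)        # outcome unrelated to every regressor

def slope_pvalues(X, y):
    """Two-sided p-values for each slope in an OLS fit with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    df = len(y) - Z.shape[1]
    cov = np.sum(resid**2) / df * np.linalg.inv(Z.T @ Z)
    t = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), df)[1:]   # drop the intercept

# Step 1: screen, keeping regressors "significant" at a lenient 0.25 level
keep = slope_pvalues(X, y) < 0.25
# Step 2: refit on the survivors and count nominally significant slopes
p_final = slope_pvalues(X[:, keep], y)
print(f"kept {keep.sum()} of {k} noise regressors; "
      f"{(p_final < 0.05).sum()} appear significant at 0.05 after refitting")
```

Because the same data are used to select and then to test, the refit routinely produces "significant" slopes even though every regressor is pure noise.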

Each article was coded by two coders.3 Two coders were responsible for each set

of articles, so while the identity of coders switched between outlets, they remained

constant within an outlet. The concordance rate was 76% in Criminology, 73% in

Justice Quarterly, and 72% in the Farrington articles. After coding independently,

the coders met to reconcile their decisions. The reconciled coding is reported in

this paper.

Results

The results of the survey suggest that researchers in the field of criminology and criminal justice are not applying NHST blindly, with a majority of studies providing information about the size of effects. We found many examples of good practice, and some authors were quite explicit in their understanding of the limitations of NHST. However, we also found many examples of bad practice. For example, most researchers failed to discuss the size of coefficients in substantive terms or to distinguish clearly between statistically significant effects and substantively significant ones.

We provide the distribution of scores across the items in Table 1. The scores are

very similar across outlets. The average paper received a score of 7. Several papers

received a high score of 10. The lowest was a paper with a score of 2.

Table 1. Descriptive statistics for the number of correct scores by article source.

Source                   Mean  Median  Minimum  Maximum  SD   Percent  N
Justice Quarterly        7.0   7       2        10       1.8  54.5%    32
Criminology              6.8   7       2        10       2.0  52.9%    32
Farrington experiments   7.1   7       2        10       1.8  52.0%    18
Total                    6.9   7       2        10       1.8  53.3%    82

Percent reflects the mean percent correct for applicable items.

Table 2 provides the percentage of studies correctly addressing each item. The (relatively) good news is that most researchers in criminology presented statistical information necessary for an assessment of size. More specifically, roughly three-quarters of the authors reported some standardized version of their coefficient

order to facilitate interpretation,4 and over two-thirds provided descriptive statistics

for all of the variables used in their analysis. The sole exception was confidence intervals, which were reported in only three studies in our sample. It would clearly be better if 100% of authors provided these basic facts; indeed, over a third of these

studies might need to be excluded from a meta-analysis because of the lack of

basic descriptive statistics. Nonetheless, our results did show that the majority of

articles provide the basic building blocks for a comparison of effect size. These

building blocks provide a starting point for the conversation about size as well as

statistical significance. It is also encouraging to note that these results are very similar to the results from the most recent Ziliak and McCloskey (2004) review in economics, which suggests that the problems in criminology are shared with at least one other social science.

Table 2. Survey results, full sample.

Item                                                                      N = 82

Presented statistical information necessary for a determination of the size of an effect
1. Were the units and descriptive statistics reported for all variables
   used in bivariate and multivariate analysis?                           64.6%
2. Were coefficients reported in elasticity form or some other
   interpretable form relevant for the problem at hand so that readers
   could discern the substantive impact of the regressors?                76.1%
3. Did the paper eschew 'asterisk econometrics', defined as ranking the
   coefficients according to the absolute size of the t-statistics?       57.3%
4. Did the paper present confidence intervals to aid in the
   interpretation of the size of coefficients?                             3.7%

Interpretation informed by the size of a coefficient, not merely its statistical significance
5. Did the paper discuss the size of the coefficients?                    75.0%
6. Did the paper discuss the scientific conversation within which a
   coefficient would be judged 'large' or 'small'?                         9.8%
7. After the first use, did the paper avoid using statistical
   significance as the only criterion of importance?                      75.6%
8. In the conclusion section, did the authors avoid making statistical
   significance the primary means for evaluating the importance of key
   variables in the model?                                                44.2%
9. Did the paper avoid using the word "significance" in ambiguous ways,
   meaning "statistically significant" in one sentence and "large enough
   to matter for policy or science" in another?                           31.7%

Correctly handled non-significant effects
10. Did the paper mention the power of a test?                            30.5%
11. Did the paper make use of confidence intervals to aid in the
    interpretation of null findings?                                       0.0%
12. Did the paper eschew 'sign econometrics', meaning remarking on the
    sign but not the size or significance of the coefficients?            58.0%
13. In the conclusions, did the authors avoid interpreting a
    statistically non-significant effect with no power analysis as
    evidence of no relationship?                                          55.6%

Avoided basic errors in the application of statistical significance testing
14. Were the proper null hypotheses specified?                            95.1%
15. Did the paper avoid choosing variables for inclusion solely on the
    basis of statistical significance?                                    78.0%

One area where authors in criminology struggled was in interpreting their

results with respect to the size of effects. Despite a fairly common discussion of

size in the results section, many authors ultimately fell back on statistical

significance as the ultimate arbiter of importance. Perhaps then it is not surprising

that two-thirds of the authors used some form of the word 'significance' in an ambiguous way. It was common to find statements that a variable 'achieved significance' or became 'more/less significant' in the presence of additional variables. Variables were described as being 'modestly significant' or 'highly significant.' In general, the most common error was implying that the strength of the relationship was determined by the size of the p-value. This is simply not true, and can be quite misleading if large samples lead researchers to stress small effects with very large t-values. In keeping with this error, nearly half of the authors practiced 'asterisk econometrics,' rank ordering the coefficients according to the

size of the test statistics. There is simply no scientific justification for equating

statistical significance with substantive significance, either in relative or absolute

terms. Statistical significance should be the starting point in any discussion of

effect size, not the end point.
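The conflation of p-values with effect strength is easy to demonstrate: the same modest effect yields wildly different p-values at different sample sizes. The numbers below are a hypothetical sketch, not data from the reviewed studies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
results = {}
# The same modest standardized effect (a 0.2 SD difference) at two sample sizes
for n in (50, 5000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(0.2, 1.0, n)
    t, p = stats.ttest_ind(treated, control)
    results[n] = (treated.mean() - control.mean(), p)
    print(f"n = {n:5d}: observed difference = {results[n][0]:+.2f}, p = {p:.2g}")
```

The underlying effect is identical in both runs; only the sample size, and hence the p-value, differs. Ranking these two results by their t-statistics would say nothing about their substantive size.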

Very few of the authors (9.8%) presented a discussion of the standard by which

their effects would be considered large or small. Virtually none of the authors

compared the magnitude of their results with other studies, although information is

provided in most cases that would allow this comparison. Gottfredson and

colleagues (2003, evaluation sample) provide a notable exception. They compared

their estimate of the effect of drug treatment courts on recidivism to the average

effect reported in a prior meta-analysis. This simple comparison, usually absent in

other studies, allows the reader to assess the importance of the findings beyond

mere statistical significance. Repeatedly, authors claimed findings similar to those of prior studies based solely on coefficients that were significant in the same direction.

Although this information is relevant, sign and significance represent a very coarse

description which could be made far more precise with a discussion of the effect

sizes in the two studies. Alternatively, theories could be developed more formally

to provide explicit predictions about the size of the coefficients.
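One way to make such cross-study comparisons concrete when studies report odds ratios is the standard logistic approximation that converts a log odds ratio into a Cohen's d equivalent, d = ln(OR)·√3/π. The odds ratios below are purely hypothetical:

```python
import math

def odds_ratio_to_d(odds_ratio):
    """Approximate Cohen's d implied by an odds ratio (logistic approximation:
    d = ln(OR) * sqrt(3) / pi)."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

# Hypothetical numbers: a study's treatment odds ratio on recidivism
# versus an average odds ratio from a prior meta-analysis
study_or, meta_or = 0.65, 0.80
print(f"study d = {odds_ratio_to_d(study_or):.2f}, "
      f"meta-analytic d = {odds_ratio_to_d(meta_or):.2f}")
```

Putting both estimates on the same effect-size scale allows the kind of direct magnitude comparison that Gottfredson and colleagues performed, rather than a bare claim of "significant in the same direction."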

Non-significant effects are problematic in the social sciences. Almost half of the

papers discuss the direction of non-significant effects ('sign econometrics') without

attending to the issue of size or adequately addressing the non-significant nature of

the effect. We found many cases where the authors adhered to strict NHST but

then focused on the sign of the coefficient. For example, in one Criminology article

the authors stated that the original bivariate relationship, although not significant,

was positive. Then, they report that when additional variables were included, the

sign of the relationship changed, "suggesting that the (initial) relationship was spurious." But according to NHST, the relationship was spurious even without the

additional variables. In neither case do we know for certain that the relationship is

substantively spurious. Without a high level of statistical power, substantively


meaningful effects remain plausible even though the null hypothesis cannot be

rejected. Equivocation is difficult to avoid under strict NHST.

Augmenting NHST with confidence intervals can facilitate the interpretation of

null findings (Hoenig and Heisey 2001). In the articles examined, there was

virtually no consideration of Type II error (falsely accepting the null hypothesis).

Less than a third of the authors mentioned the issue of power, and only two out of

82 articles included a power test.5 Moreover, almost half of the evaluations from

the Farrington and Welsh review treated a failure to reject the null hypothesis as equivalent to an effect size of zero, without conducting a power test or examining the

confidence interval. None of the articles in the Justice Quarterly or Criminology

samples conducted a power test, and only one in Justice Quarterly and two in the

experimental sample provided a confidence interval. Authors simply do not

provide an analytical discussion of their ability to find differences that may be

meaningful.
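Such a discussion need not be elaborate. A minimal power sketch, using the usual normal approximation for a two-sided two-sample comparison (the effect size and sample sizes here are illustrative):

```python
import math
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized mean difference d (normal approximation)."""
    se = math.sqrt(2.0 / n_per_group)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - d / se) + norm.cdf(-z_crit - d / se)

for n in (30, 100, 400):
    print(f"n per group = {n:3d}: power to detect d = 0.30 is "
          f"{power_two_sample(0.3, n):.2f}")
```

With 30 cases per group, a study has little chance of detecting a modest effect, so a null result there says very little; with 400 per group the same null result is far more informative.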

In general, the studies included in this sample did well at avoiding basic errors

in NHST. Roughly three-fourths of the studies avoided relying on statistical

significance for the selection of variables in multivariate models and most studies

correctly specified the null hypothesis. A particularly good example of the latter

was Steffensmeier and Demuth (2001). These authors used a population of all

people convicted of a felony or misdemeanor in Pennsylvania, with over 68,000

cases. First, they demonstrated that they understood that populations should not be treated as samples by stating that, "because our data set is not a sample, but contains all reported sentences with complete data, statistical tests of significance do not apply in the conventional sense" (p. 160). They then argued, in a fairly conventional way, for inclusion of the tests, but immediately showed that the tests were not being applied unthinkingly. Specifically, they stated that "because the number of cases included in our analysis is so large, many small sentencing differences among groups or categories often turn out to be significant in the statistical sense. Therefore, we place more emphasis on direction and magnitude of the coefficients than on statistical significance levels . . ." (p. 160).

This is exactly the correct approach to take in their case. In contrast, researchers

conducting studies with small samples would be better served by focusing on

confidence intervals and interpreting the range of possible values for a population

parameter (e.g., from zero to a moderately large effect).
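The point can be made with a single hypothetical null result: a coefficient of 0.15 with a standard error of 0.10 is "not significant," yet its 95% interval still admits sizable effects. The numbers here are invented for illustration:

```python
from scipy.stats import norm

# Hypothetical small-sample "null" result: b = 0.15 with a standard error of 0.10
b, se = 0.15, 0.10
z = norm.ppf(0.975)
lo, hi = b - z * se, b + z * se
print(f"estimate = {b}, 95% CI = [{lo:.2f}, {hi:.2f}]")
# The interval spans zero, so the null is not rejected, but it also includes
# effects as large as roughly 0.35; "no relationship" is not a safe conclusion.
```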

Other examples of good practice involve the specification of a null hypothesis.

Over 90% of the articles received a positive rating on this question, even in cases

when the null was non-zero. For example, Koons-Witt (2002) wanted to explore

the impact of gender on sentencing decisions before and after the onset of

sentencing guidelines. Therefore, she correctly specified the null hypothesis as

equality between the coefficients on gender in separate models.6 Moreover,

researchers often specify models in which a mediating variable is included, and the

coefficient on the original variable is expected to decline. For example, Kleck and

Chiricos (2002) hypothesized that the coefficient on unemployment in a regression

of crime and unemployment would change when explicit measures of motivation

and opportunity were included. In this case, they correctly specified the null

hypothesis as the coefficient in the original equation. Throughout the discussion, Kleck and Chiricos explicitly focused on the change in the coefficient. However, we did find examples in which researchers incorrectly concluded that the effect had been mediated because the coefficient was no longer significantly different from zero.

Table 3. Results by item and source of study.

                                                      Criminology  Justice Quarterly  Farrington
Item                                                  2001–2002    2001–2002          experiments
                                                      N = 32       N = 32             N = 18

Presented statistical information necessary for a determination of the size of an effect
1. Were the units and descriptive statistics
   reported for all variables used in bivariate
   and multivariate analysis?                         72.7%        59.4%              55.6%
2. Were coefficients reported in elasticity form
   or some other interpretable form relevant for
   the problem at hand so that readers could
   discern the substantive impact of the regressors?  78.1%        86.2%              45.5%a
3. Did the paper eschew 'asterisk econometrics',
   defined as ranking the coefficients according
   to the absolute size of the t-statistics?          69.7%        56.3%              33.3%
4. Did the paper present confidence intervals to
   aid in the interpretation of the size of
   coefficients?                                      0.0%         3.1%               11.1%

Interpretation informed by the size of a coefficient, not merely its statistical significance
5. Did the paper discuss the size of the
   coefficients?                                      65.6%        87.5%              70.6%
6. Did the paper discuss the scientific
   conversation within which a coefficient would
   be judged 'large' or 'small'?                      6.1%         3.1%               27.8%
7. After the first use, did the paper avoid using
   statistical significance as the only criterion
   of importance?                                     66.7%        84.4%              72.2%
8. In the conclusion section, did the authors
   avoid making statistical significance the
   primary means for evaluating the importance of
   key variables in the model?                        35.7%        38.7%              66.7%
9. Did the paper avoid using the word
   "significance" in ambiguous ways, meaning
   "statistically significant" in one sentence and
   "large enough to matter for policy or science"
   in another?                                        24.2%        37.5%              33.3%

Correctly handled non-significant effects
10. Did the paper mention the power of a test?        30.3%        28.1%              33.3%
11. Did the paper make use of confidence intervals
    to aid in the interpretation of null findings?    0.0%         0.0%               0.0%
12. Did the paper eschew 'sign econometrics',
    meaning remarking on the sign but not the size
    or significance of the coefficients?              57.6%        61.3%              50.0%
13. In the conclusions, did the authors avoid
    interpreting a statistically non-significant
    effect with no power analysis as evidence of
    no relationship?                                  –            –                  55.6%

Avoided basic errors in the application of statistical significance testing
14. Were the proper null hypotheses specified?        90.9%        96.9%              94.4%
15. Did the paper avoid choosing variables for
    inclusion solely on the basis of statistical
    significance?                                     75.8%        71.9%              88.9%

a Only 11 of the 18 experiments were coded on this question because the results were simple mean comparisons. Of the remaining seven, regressions were occasionally run to support the main finding, and several articles did not report these regressions in tabular form.
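For comparisons of coefficients across independent samples, such as the before-and-after guideline models above, the z-statistic described by Brame et al. (1998) is straightforward to compute. The coefficients and standard errors below are hypothetical:

```python
import math
from scipy.stats import norm

def coef_equality_z(b1, se1, b2, se2):
    """z-statistic for equality of coefficients estimated on two
    independent samples (the test described by Brame et al. 1998)."""
    return (b1 - b2) / math.sqrt(se1**2 + se2**2)

# Hypothetical gender coefficients before and after sentencing guidelines
z = coef_equality_z(b1=-0.40, se1=0.12, b2=-0.15, se2=0.10)
p = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Note that, per their footnote 11, Brame et al. describe this test as appropriate for OLS and count models; extending it to logit or probit requires additional assumptions.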

Table 3 provides the results by source. For the most part, our results did not

vary by source. The Criminology papers were more likely to provide descriptive

statistics and avoid asterisk econometrics than Justice Quarterly papers, but Justice

Quarterly papers were more likely to talk about size and avoid ambiguous usage of

significance. In no way could it be said that authors in Criminology scored

systematically better than the authors from the two other sources. The main

difference was that the articles in the Farrington and Welsh sample were almost

seven times more likely to discuss some standard of size. This makes sense

because the experimental studies often reported program effect sizes in a research

arena with prior results. Nonetheless, the vast majority of experiments also did not

discuss the standard by which magnitude could be evaluated. In general, the

pattern across the article sources was far more similar than different. Clearly, the

problems are fairly endemic across the field and are not solely the responsibility of

any one group of substantive researchers.

Discussion

NHST is the dominant approach to drawing inferences regarding research

hypotheses in quantitative criminology and criminal justice. Although alternatives

like Bayesian inference exist, the purpose of this article is not to convince the field

to abandon NHST. Rather, we would like to encourage more thoughtful

application. In essence, we want to put NHST in its place: as a tool to facilitate

the inferential process, not as the end game for quantitative research.

The fundamental limitation of NHST is that it does not provide information

about size. As the title of this article states, size matters. To state that there is an effect immediately raises the question: how big? This requires attention to size and a scholarly

discussion that addresses the substantive significance of findings. Therefore, it is

not surprising that we believe our single most troubling finding was the lack of a

serious attempt in most articles to place the magnitude of the effect in a context or

even to attend to the issue of size. A research study should be placed in the context

of past work or theoretical predictions. Simply reporting the coefficient without

any attempt to validate or otherwise establish the magnitude of the effect within

the literature or policy framework risks creating a large body of independent

research with no cumulative advance in knowledge. This is particularly evident

when studies of a common research hypothesis with different sample sizes arrive at

different conclusions based solely on NHST. Without attending to size, researchers

may conclude that the empirical research base has led to an equivocal conclusion

regarding the hypothesis. However, focusing on size may tell a more consistent

story, or at least a story that is not determined by the sample size of the studies but


rather by the size of the empirical relationships examined. It is the latter, after all,

in which we are truly interested. On the positive side, many of the key ingredients

for a substantive discussion of size, like descriptive statistics and standardized

coefficients, were reported in most of the studies in our review. All of the basic

tools are there for researchers to take the next step and compare effect sizes across

studies rather than simply concluding that they have similar findings solely on

the basis of a significant finding in the same direction as previous work.

The role of sample size in determining statistical significance is also underappreciated. This is evident in our study in the near-total absence of serious discussion of statistical power. In a research world in which sample sizes range from a few

dozen to 68,000, this is particularly alarming, and strikes us as fundamentally

unwise. Not all research designs have equal ability to identify the same effect

size. A discussion about the relative power of a test to identify an effect and an

awareness of the confidence intervals around an effect seem to be both reasonable

and essential for a good evaluation of the value of the study.

One potential criticism of this type of discussion about NHST comes from the well-known econometrician Edward Leamer (2004). In his criticism of McCloskey and

Ziliak, he states that it is ultimately not size, but models, that matter. Too much

emphasis on the minutiae of NHST threatens to take the attention away from the

important question of whether the model provides insight into the question of

interest. During our coding, we were often frustrated by the lack of discussion

about the source of causal identification in the regression models, and the general

lack of understanding about the limitations of observational studies with controls

for observables to identify causality. While the criminology and criminal justice

articles were not substantially different from the economics articles with respect to

the use of NHST, we feel confident in stating that the application of causal models

based on observational data is in fact substantially more thoughtful in economics.

We were also frustrated by the lack of attention to measurement issues in many

of the papers we reviewed. Often authors would raise key issues with respect to

measurement only to proceed without addressing them. One anonymous reviewer

made a compelling argument that this problem in criminology is far more pressing

than any discussion about NHST. We do not necessarily disagree, but we

nonetheless think that issues surrounding the appropriate use of NHST deserve

attention by criminologists.

The goal of NHST is admirable: to protect against the acceptance of a research

hypothesis when the observed data can be explained by sampling variation. This

simple goal, however, has taken on a hegemonic role in the practice of social

scientific research. Critical thinking about the meaningfulness of findings in

scientific and practical terms is often lacking. Size matters. Large effects have

different theoretical and practical importance than small effects. A binary accept/

reject approach to hypothesis testing advances our field far less than approaches

that explicitly assess whether observed effects are of a size consistent with

theoretical expectations or are large enough to matter in a practical or policy

context. This requires reasoned argumentation and scientific discourse, rather than

a reliance on an arbitrary and binary decision rule (i.e., p ≤ 0.05). The former


requires greater skill and scholarly effort but also promises greater advancement

for our field.

Acknowledgement

The authors wish to thank Emily Owens for help with the coding and University of

Maryland’s Program for Economics of Crime and Justice Policy for generous

financial support. We also wish to thank Michael Maltz and two anonymous

reviewers for helpful advice. All errors remain our own.

Notes

1. It should be noted that there is no controversy about what constitutes correct use: the problem is unambiguously in the application.

2. This issue has received less attention in criminology and criminal justice than in these other disciplines. Maltz's 1994 paper in the Journal of Research in Crime and Delinquency is the best-known paper on the problems with NHST in criminology.

3. Three coders were used in the analysis. All three coders have successfully completed at least four upper-level courses in econometrics in an economics department and are at least 4th-year PhD students. All three coders read the McCloskey and Ziliak (1996) piece and participated in pilot coding and reconciling as a group.

4. Almost every author who reported logit coefficients also reported the odds ratio. We suspect this is because statistics packages now provide this information easily. However, there was some confusion about the interpretation of the odds ratio. For example, we found one paper in Justice Quarterly in which the authors interpreted the coefficients as the odds ratio, even though the odds ratio was also provided in the table. We recommend that authors simply report the change in probability associated with each x. This is far more intuitive, and several statistics packages, including Stata, now provide this information with a simple command (dlogit).

5. Even the two papers which conducted explicit power tests did not provide satisfying discussions of power. Both papers were ambiguous about the effect size for which they estimated power. Furthermore, one of the papers suggested that the low power associated with their statistical tests minimized the probability of Type II error. In fact, by definition, low power is associated with a higher probability of Type II error.

6. Koons-Witt cited Brame et al. (1998) to justify the test statistic used to test the null hypothesis. Citation of this paper was very common in our sample. However, Brame et al. stated that their test is appropriate only for OLS and count models. This test can be used with logit or probit models if the researcher assumes "that both the functional form and the dispersion of the residual term for the latent response variable are identical for the two groups being compared" (p. 259, fn11). These are strong assumptions which are unlikely to be met in most cases. These assumptions were never justified, or even stated, in the papers that compared coefficients from logit or probit models.


References (Asterisk indicates papers in sample.)

*Agnew, R. (2002). Experienced, vicarious, and anticipated strain: An exploratory study on

physical victimization and delinquency. Justice Quarterly 19, 603Y632.

*Agnew, R., Brezina, T., Wright, J. P. & Cullen, F. T. (2002). Strain, personality traits, and

delinquency: Extending general strain theory. Criminology 40, 43Y72.

*Alpert, G. P. & MacDonald, J. M. (2001). Police use of force: An analysis of organizational

characteristics. Justice Quarterly 18, 393Y409.

Anderson, D. R., Burnham, K. P. & Thompson, W. L. (2000). Null hypothesis testing:

Problems, prevalence, and an alternative. Journal of Wildlife Management 64, 912Y923.

APA (2001). Publication manual of the American Psychological Association, (5th edition),

Washington, DC: American Psychological Association.

APA Task Force on Statistical Inference (1996, December). Task Force on Statistical

Inference initial report. Washington, DC: American Psychological Association. Available:

http://www.apa.org/science/tfsi.html.

*Armstrong, T. A. (2003). The effect of moral reconation therapy on the recidivism of

youthful offenders: A randomized experiment. Criminal Justice and Behavior 30,

668Y687.

Arrow, K. J. (1959). Decision theory and the choice of a level of significance for the t-test.

In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow & H. B. Mann (Eds.),

Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp.

70Y78). Stanford, CA: Stanford University Press.

*Baller, R. D., Anselin, L., Messner, S. F., Deane, G. & Hawkins, D. F. (2001). Structural

covariates of U.S. county homicide rates: Incorporating spatial effects. Criminology 39,

561Y590.

*Baumer, E. P. (2002). Neighborhood disadvantage and police notification by victims of

violence. Criminology 40, 579Y616.

Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the

chi-square test. Journal of the American Statistical Association 33, 526Y536.

*Bernburg, J. G. & Thorlindsson, T. (2001). Routine activities in social context: A closer

look at the role of opportunity in delinquent behavior. Justice Quarterly 18, 543Y568.

*Borduin, C. M., Mann, B. J., Cone, L. T., Henggeler, S. W., Fucci, B. R., Blaske, D. M. &

Williams, R. A. (1995). Multisystemic treatment of serious juvenile offenders: Long-term

prevention of criminality and violence. Journal of Consulting and Clinical Psychology 63,

569Y578.

Boring, E. G. (1919). Mathematical vs. scientific importance. Psychological Bulletin 16,

335Y338.

*Braga, A. A., Weisburd, D. L., Waring, E. J., Mazerolle, L. G., Spelman, W. & Gajewski,

F. (1999). Problem-oriented policing in violent crime places: A randomized controlled

experiment. Criminology 37, 541Y580.

Brame, R., Paternoster, R., Mazerolle, P. & Piquero, A. (1998). Testing for the equality of

maximum-likelihood regression coefficients between two independent equations. Journal

of Quantitative Criminology 14, 245Y261.

*Broidy, L. M. (2001). A test of general strain theory. Criminology 39, 9Y36.

*Burruss, G. M. Jr., & Kempf-Leonard, K. (2002). The questionable advantage of defense

counsel in juvenile court. Justice Quarterly 19, 37Y68.

*Campbell, F. A., Ramey, C. T., Pungello, E., Sparling, J. & Miller-Johnson, S. (2002).

SIZE MATTERS 17

Page 18: Size matters: Standard errors in the application of null hypothesis significance ...gasweete/crj604/readings/2006-Bushway... · 2006-06-08 · Size matters: Standard errors in the

Early childhood education: Young adult outcomes from the Abercedarian project. Applied

Developmental Science 6, 42Y57.

*Cernkovich, S. A., & Giordano, P. C. (2001). Stability and change in antisocial behavior: The transition from adolescence to early adulthood. Criminology 39, 371–410.

*Chermak, S., McGarrell, E. F. & Weiss, A. (2001). Citizens’ perceptions of aggressive traffic enforcement strategies. Justice Quarterly 18, 365–392.

Cook, T. D., Gruder, C. L., Hennigan, K. M. & Flay, B. R. (1979). The history of the sleeper effect: Some logical pitfalls in accepting the null hypothesis. Psychological Bulletin 86, 662–679.

*Copes, H., Kerley, K. R., Mason, K. A. & Van Wyk, J. (2001). Reporting behavior of fraud victims and Black’s theory of law: An empirical assessment. Justice Quarterly 18, 343–364.

Cumming, G. & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and non-central distributions. Educational and Psychological Measurement 61, 532–575.

*Curry, G. D., Decker, S. H. & Egley, A. Jr. (2002). Gang involvement and delinquency in a middle school population. Justice Quarterly 19, 275–292.

*Dawson, M. & Dinovitzer, R. (2001). Victim cooperation and the prosecution of domestic violence in a specialized court. Justice Quarterly 18, 593–622.

*DeJong, C., Mastrofski, S. D. & Parks, R. B. (2001). Patrol officers and problem solving: An application of expectancy theory. Justice Quarterly 18, 31–62.

*Dugan, J. R. & Everett, R. S. (1998). An experimental test of chemical dependency therapy for jail inmates. International Journal of Offender Therapy and Comparative Criminology 42, 360–368.

*Dunford, F. W. (2000). The San Diego Navy Experiment: An assessment of interventions for men who assault their wives. Journal of Consulting and Clinical Psychology 68, 468–476.

Elliott, G. & Granger, C. W. J. (2004). Evaluating significance: Comments on “size matters”. The Journal of Socio-Economics 33, 547–550.

*Engel, R. S. & Silver, E. (2001). Policing mentally disordered suspects: A reexamination of the criminalization hypothesis. Criminology 39, 225–252.

Engen, R. L. & Gainey, R. R. (2000). Modeling the effects of legally relevant and extralegal factors under sentencing guidelines: The rules have changed. Criminology 38, 1207–1230.

*Exum, M. L. (2002). The application and robustness of the rational choice perspective in the study of intoxicated and angry intentions to aggress. Criminology 40, 933–966.

Farrington, D. P. & Welsh, B. C. (2005). Randomized experiments in criminology: What have we learned in the last two decades? Journal of Experimental Criminology 1, 9–38.

*Feder, L. & Dugan, L. (2002). A test of the efficacy of court-mandated counseling for domestic offenders: The Broward experiment. Justice Quarterly 19, 343–376.

*Felson, R. B. & Ackerman, J. (2001). Arrest for domestic and other assaults. Criminology 39, 655–676.

*Felson, R. B. & Haynie, D. L. (2002). Pubertal development, social factors, and delinquency among adolescent boys. Criminology 40, 967–988.

*Felson, R. B., Messner, S. F., Hoskin, A. W. & Deane, G. (2002). Reasons for reporting and not reporting domestic violence to the police. Criminology 40, 617–648.

Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement 62, 749–770.

*Finn, M. A. & Muirhead-Steves, S. (2002). The effectiveness of electronic monitoring with violent male parolees. Justice Quarterly 19, 293–312.

Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver and Boyd.

Freedman, D. A. (1983). A note on screening regression equations. The American Statistician 37, 152–155.

*Garner, J. H., Maxwell, C. D. & Heraux, C. G. (2002). Characteristics associated with the prevalence and severity of force used by the police. Justice Quarterly 19, 705–746.

Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Kruger, G. Gigerenzer & M. S. Morgan (Eds.), The probabilistic revolution. Vol. II: Ideas in the sciences (pp. 11–33). Cambridge, MA: MIT Press.

*Golub, A., Johnson, B. D., Taylor, A. & Liberty, H. J. (2002). The validity of arrestees’ self-reports: Variations across questions and persons. Justice Quarterly 19, 477–502.

*Gottfredson, D. C., Najaka, S. S. & Kearly, B. (2003). Effectiveness of drug treatment courts: Evidence from a randomized trial. Criminology and Public Policy 2, 171–196.

*Greenberg, D. F. & West, V. (2001). State prison populations and their growth, 1971–1991. Criminology 39, 615–654.

Greene, W. H. (2003). Econometric analysis (5th edition). Upper Saddle River, NJ: Prentice-Hall.

Harlow, L. L., Mulaik, S. A. & Steiger, J. H. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.

*Harmon, T. R. (2001). Predictors of miscarriages of justice in capital cases. Justice Quarterly 18, 949–968.

*Hay, C. (2001). Parenting, self-control, and delinquency: A test of self-control theory. Criminology 39, 707–736.

*Henggeler, S. W., Melton, G. B., Brondino, M. J., Scherer, D. G. & Hanley, J. H. (1997). Multisystemic therapy with violent and chronic juvenile offenders and their families: The role of treatment fidelity in successful dissemination. Journal of Consulting and Clinical Psychology 65, 821–833.

*Hennigan, K. M., Maxson, C. L., Sloane, D. & Ranney, M. (2002). Community views on crime and policing: Survey mode effects on bias in community surveys. Justice Quarterly 19, 565–587.

Hoenig, J. M. & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician 55, 19–24.

*Inciardi, J. A., Martin, S. S., Butzin, C. A., Hopper, R. M. & Harrison, L. D. (1997). An effective model of prison-based treatment for drug-involved offenders. Journal of Drug Issues 27, 261–278.

*Ireland, T. O., Smith, C. A. & Thornberry, T. P. (2002). Developmental issues in the impact of child maltreatment on later delinquency and drug use. Criminology 40, 359–400.

Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management 63, 763–772.

*Kaminski, R. J. & Marvell, T. B. (2002). A comparison of changes in police and general homicides: 1930–1998. Criminology 40, 171–190.

*Kautt, P. & Spohn, C. (2002). Cracking down on black drug offenders? Testing for interactions among offenders’ race, drug type, and sentencing strategy in federal drug sentences. Justice Quarterly 19, 1–36.

*Kempf-Leonard, K., Tracy, P. E. & Howell, J. C. (2001). Serious, violent, and chronic juvenile offenders: The relationship of delinquency career types to adult criminality. Justice Quarterly 18, 449–478.

*Killias, M., Aebi, M. & Ribeaud, D. (2000). Does community service rehabilitate better than short-term imprisonment? Results of a controlled experiment. Howard Journal 39, 40–57.

*Kingsnorth, R. F., MacIntosh, R. C. & Sutherland, S. (2002). Criminal charge or probation violation? Prosecutorial discretion and implications for research in criminal court processing. Criminology 40, 553–578.

*Kleck, G. & Chiricos, T. (2002). Unemployment and property crime: A target-specific assessment of opportunity and motivation as mediating factors. Criminology 40, 649–679.

*Koons-Witt, B. A. (2002). The effect of gender on the decision to incarcerate before and after the introduction of sentencing guidelines. Criminology 40, 297–328.

*Kramer, J. H. & Ulmer, J. T. (2002). Downward departures for serious violent offenders: Local court “corrections” to Pennsylvania sentencing guidelines. Criminology 40, 897–932.

Leamer, E. E. (2004). Are the roads red? Comments on “size matters”. The Journal of Socio-Economics 33, 555–557.

Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Newbury Park, CA: Sage.

Lipsey, M. W., Crosse, S., Dunkle, J., Pollard, J. & Stobart, G. (1985). Evaluation: The state of the art and the sorry state of the science. In D. S. Cordray (Ed.), Utilizing prior research in evaluation planning (New Directions for Program Evaluation, No. 27, pp. 7–28). San Francisco: Jossey-Bass.

Lunt, P. (2004). The significance of the significance test controversy: Comments on “size matters”. The Journal of Socio-Economics 33, 559–564.

*Maguire, E. R. & Katz, C. M. (2002). Community policing, loose coupling, and sensemaking in American police agencies. Justice Quarterly 19, 503–536.

Maltz, M. D. (1994). Deviating from the mean: The declining significance of significance. Journal of Research in Crime and Delinquency 31, 434–463.

Marks, H. M. (1997). The progress of experiment: Science and therapeutic reform in the United States 1900–1990. Cambridge, UK: Cambridge University Press.

*Marlowe, D. B., Festinger, D. S., Lee, P. A., Schepise, M. M., Hazzard, J. E. R., Merrill, J. C., Mulvaney, F. D. & McLellan, A. T. (2003). Are judicial status hearings a key component of drug court? During-treatment data from a randomized trial. Criminal Justice and Behavior 30, 141–162.

*Marquart, J. W., Barnhill, M. B. & Balshaw-Biddle, K. (2001). Fatal attraction: An analysis of employee boundary violations in a southern prison system, 1995–1998. Justice Quarterly 18, 877–910.

*Mastrofski, S. D., Reisig, M. D. & McClusky, J. D. (2002). Police disrespect toward the public: An encounter-based analysis. Criminology 40, 519–552.

*McCarthy, B., Hagan, J. & Martin, M. J. (2002). In and out of harm’s way: Violent victimization and the social capital of fictive street families. Criminology 40, 831–865.

McCloskey, D. N. & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature 34, 97–114.

*McNulty, T. L. (2001). Assessing the race–violence relationship at the macro level: The assumption of racial invariance and the problem of restricted distributions. Criminology 39, 467–490.

*Meehan, A. J. & Ponder, M. C. (2002). Race & place: The ecology of racial profiling African American motorists. Justice Quarterly 19, 399–430.

*Menard, S., Mihalic, S. & Huizinga, D. (2001). Drugs and crime revisited. Justice Quarterly 18, 269–300.

*Mills, P. E., Cole, K. N., Jenkins, J. R. & Dale, P. S. (2002). Early exposure to direct instruction and subsequent juvenile delinquency: A prospective examination. Exceptional Children 69, 85–96.

*Ortmann, R. (2000). The effectiveness of social therapy in prison: A randomized experiment. Crime and Delinquency 46, 214–232.

*Peterson, D., Miller, J. & Esbensen, F.-A. (2001). The impact of sex composition on gangs and gang member delinquency. Criminology 39, 411–440.

Petrosino, A. (2005). From Martinson to meta-analysis: Research reviews and the US offender treatment debate. Evidence & Policy: A Journal of Research, Debate and Practice 1, 149–172.

*Piquero, A. R. & Brezina, T. (2001). Testing Moffitt’s account of adolescence-limited delinquency. Criminology 39, 353–370.

*Pogarsky, G. (2002). Identifying “deterrable” offenders: Implications for research on deterrence. Justice Quarterly 19, 431–452.

*Rebellon, C. J. (2002). Reconsidering the broken homes/delinquency relationship and exploring its mediating mechanism(s). Criminology 40, 103–136.

*Rhodes, W. & Gross, M. (1997). Case management reduces drug use and criminality among drug-involved arrestees: An experimental study of an HIV prevention intervention. Washington, DC: National Institute of Justice.

*Richards, H. J., Casey, J. O. & Lucente, S. W. (2003). Psychopathy and treatment response in incarcerated female substance abusers. Criminal Justice and Behavior 30, 251–276.

Rosenthal, R. & Rubin, D. B. (1994). The counternull value of an effect size: A new statistic. Psychological Science 5, 329–334.

Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin 57, 416–428.

Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. In L. L. Harlow, S. A. Mulaik & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335–392). Mahwah, NJ: Lawrence Erlbaum Associates.

*Scheider, M. C. (2001). Deterrence and the base rate fallacy: An examination of perceived certainty. Justice Quarterly 18, 63–86.

*Schnebly, S. M. (2002). An examination of the impact of victim, offender, and situational attributes on the deterrent effect of defensive gun use: A research note. Justice Quarterly 19, 377–398.

*Schwartz, M. D., DeKeseredy, W. S., Tait, D. & Alvi, S. (2001). Male peer support and a feminist routine activities theory: Understanding sexual assault on the college campus. Justice Quarterly 18, 623–650.

Sherman, L. W., Gottfredson, D., MacKenzie, D., Eck, J., Reuter, P. & Bushway, S. (1997). Preventing crime: What works, what doesn’t, what’s promising: A report to the United States Congress. Washington, DC: National Institute of Justice.

*Silver, E. (2002). Mental disorder and violent victimization: The mediating role of involvement in conflicted relationships. Criminology 40, 191–212.

*Simons, R. L., Stewart, E., Gordon, L. C., Conger, R. D. & Elder, G., Jr. (2002). A test of life-course explanations for stability and change in antisocial behavior from adolescence to young adulthood. Criminology 40, 401–434.

*Spohn, C. & Holleran, D. (2001). Prosecuting sexual assault: A comparison of charging decisions in sexual assault cases involving strangers, acquaintances, and intimate partners. Justice Quarterly 18, 651–688.

*Spohn, C. & Holleran, D. (2002). The effect of imprisonment on recidivism rates of felony offenders: A focus on drug offenders. Criminology 40, 329–358.

*Steffensmeier, D. & Demuth, S. (2001). Ethnicity and judges’ sentencing decisions: Hispanic–Black–White comparisons. Criminology 39, 145–178.

*Stewart, E. A., Simons, R. L. & Conger, R. D. (2002). Assessing neighborhood and social psychological influences on childhood violence in an African-American sample. Criminology 40, 801–829.

*Swanson, J. W., Borum, R., Swartz, M. S., Hiday, V. A., Wagner, H. R. & Burns, B. J. (2001). Can involuntary outpatient commitment reduce arrests among persons with severe mental illness? Criminal Justice and Behavior 28, 156–189.

*Taylor, B. G., Davis, R. C. & Maxwell, C. D. (2001). The effects of a group batterer treatment program: A randomized experiment in Brooklyn. Justice Quarterly 18, 171–201.

*Terrill, W. & Mastrofski, S. D. (2002). Situational and officer-based determinants of police coercion. Justice Quarterly 19, 215–248.

Thompson, B. (2004). The “significance” crisis in psychology and education. The Journal of Socio-Economics 33, 607–613.

*van Voorhis, P., Spruance, L. M., Ritchey, P. N., Listwan, S. J. & Seabrook, R. (2004). The Georgia cognitive skills experiment: A replication of Reasoning and Rehabilitation. Criminal Justice and Behavior 31, 282–305.

*Velez, M. B. (2001). The role of public social control in urban neighborhoods: A multi-level analysis of victimization risk. Criminology 39, 837–864.

*Vogel, B. L. & Meeker, J. W. (2001). Perceptions of crime seriousness in eight African-American communities: The influence of individual, environmental, and crime-based factors. Justice Quarterly 18, 301–321.

Weisburd, D., Lum, C. M. & Yang, S.-M. (2003). When can we conclude that treatments or programs “don’t work”? The Annals of the American Academy of Political and Social Science 574, 31–48.

*Weitzer, R. & Tuch, S. A. (2002). Perceptions of racial profiling: Race, class, and personal experience. Criminology 40, 435–456.

Wellford, C. (1989). Towards an integrated theory of criminal behavior. In S. Messner, M. M. Krohn & A. Liska (Eds.), Theoretical integration in the study of deviance and crime: Problems and prospects (pp. 119–128). Albany, NY: State University of New York.

*Wells, L. E. & Weisheit, R. A. (2001). Gang problems in nonmetropolitan areas: A longitudinal assessment. Justice Quarterly 18, 791–824.

*Welsh, W. N. (2001). Effects of student and school factors on five measures of school disorder. Justice Quarterly 18, 911–948.

*Wexler, H. K., Melnick, G., Lowe, L. & Peters, J. (1999). Three-year reincarceration outcomes for Amity in-prison therapeutic community and aftercare in California. Prison Journal 79, 321–336.

Wilson, D. B. (2001). Meta-analytic methods for criminology. Annals of the American Academy of Political and Social Science 578, 71–89.

Wooldridge, J. M. (2004). Statistical significance is okay, too: Comment on “size matters”. The Journal of Socio-Economics 33, 577–579.

*Wright, B. R. E., Caspi, A., Moffitt, T. E. & Silva, P. A. (2001). The effects of social ties on crime vary by criminal propensity: A life-course model of interdependence. Criminology 39, 321–352.

*Wright, J. P., Cullen, F. T., Agnew, R. S. & Brezina, T. (2001). “The root of all evil”? An exploratory study of money and delinquent involvement. Justice Quarterly 18, 239–268.

Zellner, A. (2004). To test or not to test and if so, how? Comments on “size matters?” The Journal of Socio-Economics 33, 581–586.

Ziliak, S. T. & McCloskey, D. N. (2004). Size matters: The standard error of regressions in the American Economic Review. The Journal of Socio-Economics 33, 527–546.
