
Psychological Methods
2000, Vol. 5, No. 4, 496-515

Copyright 2000 by the American Psychological Association, Inc.
1082-989X/00/$5.00 DOI: 10.1037//1082-989X.5.4.496

Combining Independent p Values: Extensions of the Stouffer and Binomial Methods

Richard B. Darlington
Cornell University

Andrew F. Hayes
Dartmouth College

The Stouffer z method, and other popular methods for combining p values from independent significance tests, suffer from three problems: vulnerability to criticisms of the individual studies being pooled, difficulty in handling the "file drawer problem," and vague conclusions. These problems can be reduced or eliminated by supplementing a test of combined probability with a variety of new analyses described here. Along with other advantages, these analyses provide a way to address the file drawer problem without making limiting assumptions about the nature of studies not included in the pooled analysis. These analyses can supplement a traditional meta-analysis, yielding conclusions not provided by widely used meta-analytic procedures.

Today it is rare that a literature search reveals only one study on a topic of interest. Rather, there may be several, or dozens, or even hundreds of studies on a single topic. This fact has led to the rise of meta-analysis, which has revolutionized the way we think about literature reviews. The abundance of studies has also led to the widespread use of tests of combined significance. These tests are often discussed in general works on meta-analysis such as Becker (1994), Glass, McGaw, and Smith (1981), or Rosenthal (1991), but they represent a distinct offshoot from the main branch of meta-analysis, which focuses on effect size. Tests of combined significance are also called probability poolers. The purpose of probability poolers is to show that a positive effect exists in at least some of the studies under analysis, without asking about the size of typical effects or the factors affecting effect size.

Richard B. Darlington, Department of Psychology, Cornell University; Andrew F. Hayes, Amos Tuck School of Business Administration and Department of Psychological and Brain Sciences, Dartmouth College.

A substantial part of the work on this manuscript was done while Andrew F. Hayes was at the University of New England, Armidale, Australia.

Correspondence concerning this article should be addressed to Richard Darlington, Department of Psychology, Cornell University, Ithaca, New York 14853. Electronic mail may be sent to [email protected] or to Andrew.[email protected].

The best-known probability pooler is the Stouffer z method, which is described under the heading, A Brief Review of Probability Poolers and Some Terminology. In a simple application, an analyst might find four independent tests of the same one-sided null hypothesis, all in the same direction but with nonsignificant ps of .159, .133, .111, and .092. Combining these by the Stouffer method yields a pooled p of just below .01. As in this example, a pooled p is often below all the individual ps used to calculate it, even before the lowest individual ps are corrected for having been selected post hoc.

Probability poolers are not used nearly as widely as meta-analytic techniques that emphasize effect size, but there are at least two circumstances in which probability poolers seem especially useful. The first occurs in examples like the one in the previous paragraph, in which the total number of studies is so small that it is not clear (without combining probabilities) that the effect is even statistically significant.

The second use of probability poolers has been neglected, and we hope that this article may stimulate its use. When there are dozens, hundreds, or even thousands of studies on a topic, those studies typically vary on many dimensions: precise nature of an experimental treatment, types of subjects studied, experimental conditions, and so forth. Among hundreds of studies, perhaps only five studied older pregnant women with low-budget treatment programs and with adequate controls. It may be very important to some specialists to know that the treatment in question can work in such circumstances. Probability poolers can fill this need. When the total number of studies is large, it is almost always possible to group the studies into small sets based on important dimensions and thus reach more specific conclusions than one could otherwise.

However, the use of probability poolers has been impeded by three major limitations of these methods. As typically used, they are highly vulnerable to criticisms of individual studies included in the test; they don't handle the "file drawer problem" well; and the conclusions they allow are vague and uninformative. We argue that all these problems can be alleviated by new types of analysis that can yield far more information than a simple one-shot probability pooler.

A Brief Review of Probability Poolers and Some Terminology

We focus throughout on directional hypothesis tests and assume that investigators are trying to show that an effect θ is positive, thereby rejecting the null hypothesis θ ≤ 0. As explained more fully in the section, Vague Conclusions, we assume that all investigators may not in fact be studying exactly the same true effects, so we let θᵢ denote the true effect studied by Experiment i. We assume that even if some or all investigators reported p values as two-tailed, the pooling analyst converts all values to one-tailed values pᵢ, in such a way that a smaller value of pᵢ corresponds to a positive estimate of θᵢ. Thus one-tailed ps in the wrong direction are subtracted from 1. Two-tailed ps are first divided by 2, then subtracted from 1 if the estimate of θᵢ was negative. Thus pᵢ will always be above .5 if the estimate of θᵢ was negative.
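This conversion rule is easy to automate. Here is a minimal sketch in Python (our illustration, not the article's; the function name and the "kind" codes are hypothetical labels of ours):

    def directional_p(p, kind):
        """Convert a reported p value to the one-tailed p_i used for pooling,
        so that a small p_i corresponds to a positive estimate of theta_i."""
        if kind == "one_pos":   # one-tailed p, tested in the positive direction
            return p
        if kind == "one_neg":   # one-tailed p in the wrong direction
            return 1 - p
        if kind == "two_pos":   # two-tailed p, estimate was positive
            return p / 2
        if kind == "two_neg":   # two-tailed p, estimate was negative
            return 1 - p / 2
        raise ValueError(kind)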

Over the last 50 years or so, many different methods for combining independent probabilities have been suggested. Rosenthal (1978) described several of these methods, and their power and validity have been evaluated by Strube and Miller (1986). We shall discuss first the method of adding zs, known to most as Stouffer's test (from Stouffer, Suchman, DeVinney, Star, & Williams, 1949), because it is probably the most widely used and understood of the available methods of combining probabilities from a set of independent hypothesis tests. Our arguments, however, generalize to all commonly used methods of combining probabilities, including Winer's method of adding ts (Winer, 1962), and Fisher's method, which is based on the adding of logged ps (Fisher, 1938).

In the Stouffer method, each pᵢ is transformed to zᵢ, the value of z that cuts off the upper 100pᵢ% of the area under the standard normal curve. For instance, a one-tailed p of exactly .025 is transformed to a z of 1.96. Let k denote the number of tests pooled. Then Stouffer's Z is defined as

Z_Stouffer = Σzᵢ/√k.

If all true effects are zero, then Z_Stouffer is distributed as standard normal. The combined probability used to evaluate the null hypothesis is defined as the proportion of the area under the standard normal distribution to the right of Z_Stouffer. If this resulting combined probability is less than the level of α selected for the test of combined significance, this leads to rejection of the null hypothesis of no effect.
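As a minimal sketch (ours, not from the article), the Python code below applies this definition to the four-study example from the introduction; scipy's norm.isf and norm.sf supply the normal-curve transformations:

    from math import sqrt
    from scipy.stats import norm

    def stouffer_pooled_p(ps):
        """Pool one-tailed ps: transform to zs, sum, divide by sqrt(k)."""
        zs = [norm.isf(p) for p in ps]            # z cutting off the upper 100p%
        return norm.sf(sum(zs) / sqrt(len(zs)))   # area to the right of Z_Stouffer

    print(stouffer_pooled_p([.159, .133, .111, .092]))  # about .0099, just below .01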

Problems with Standard Probability Poolers

The Stouffer method is clearly the probability-pooling method used most often in the psychological literature. However, Stouffer's method and all other well-known probability poolers suffer from three major but avoidable limitations.

Vulnerability to Criticisms of Individual Studies

Probability poolers are extremely vulnerable to criticisms of the individual studies included in the test. Many difficult decisions must be made when deciding whether to include a study in a pooled analysis: whether the study is actually relevant to the question of interest, whether the study was conducted well enough to be included in the analysis, and so on. Ultimately, the investigator conducting the analysis makes the final decision. However, a reader typically cannot tell whether deleting any one study might perhaps change the entire pooled conclusion from significant to nonsignificant. Thus, any reader who believes that even one study in the pooled set was irrelevant, methodologically flawed, or used conditions so rare they are unlikely to recur logically should (and often does) reject the entire pooled conclusion. Given that different readers may have different beliefs about which studies should and should not be included in a test of combined significance, a good proportion of the readers will reject the pooled conclusions outright, perhaps on entirely different grounds. This is an extremely serious limitation in the real world, although methodological discussions of probability poolers typically ignore the point.

Difficulty in Handling the File Drawer Problem

Standard probability poolers also provide no satisfactory means for handling what Rosenthal (1979) has called the file drawer problem. Not every study ever conducted on the topic is likely to come to the attention of the person conducting a pooling analysis. The analyst must typically assume that his or her literature search missed some studies that have been performed, perhaps because they were never published. Because statistically significant results seem more likely to be published than nonsignificant results (see, e.g., Begg, 1994; Dickersin, 1997; Hunt, 1997), a test of combined significance will often reject the null hypothesis not because the evidence supports a particular alternative but because evidence consistent with the null hypothesis is less likely to be included in the test. This form of publication bias therefore makes tests of combined significance very difficult to interpret.

Rosenthal (1979) has proposed a means for handling the file drawer problem when Stouffer's test is used by entertaining a simple question: If Stouffer's test leads to rejection of the null, how many undiscovered studies averaging no effect (mean z = 0) would have to exist to change the outcome of Stouffer's test from statistically significant to nonsignificant? Rosenthal proposed a formula for estimating that number; he called the estimate the "fail-safe N" (FSN). If the FSN is too large given what is known about the field of study, one can conclude that the result of Stouffer's test is immune to the problem of publication bias. For example, suppose the investigator found only 50 studies after an extensive literature search, for which Stouffer's test gives a combined significance of .000000001 and the FSN is computed as 614. In most fields of study, it is unlikely that over 600 studies would elude an investigator, so the investigator can claim that this small combined significance is probably not attributable exclusively to publication bias.
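Rosenthal's estimate is commonly written FSN = (Σz)²/z_α² − k, where z_α = 1.645 for α = .05. A hedged sketch in Python (our code, using the example's numbers) reproduces the FSN of 614:

    from math import sqrt, floor
    from scipy.stats import norm

    def fail_safe_n(pooled_p, k, alpha=.05):
        """Rosenthal's fail-safe N, recovering the sum of zs from the pooled p."""
        sum_z = norm.isf(pooled_p) * sqrt(k)
        return floor((sum_z / norm.isf(alpha)) ** 2 - k)

    print(fail_safe_n(1e-9, 50))    # 614 for the 50-study example above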

Although Rosenthal's method is widely used, some have suggested that the logic underlying it is flawed. The method assumes that the studies that have eluded the investigator have an average z of zero (Rosenthal, 1978, p. 186). But it would seem that if publication bias existed, the average z of the unpublished studies would typically be below zero (Darlington, 1980; Iyengar & Greenhouse, 1988; Thomas, 1985). For instance, suppose 1,000 studies of the same topic yield an average z of zero. For simplicity, suppose all 1,000 studies are completed simultaneously, so there is a moment when all are unpublished. At that moment, the mean z of the unpublished studies is zero. But if the subsequent process of selective publication tends to remove studies with positive zs from this unpublished set, then the remainder must perforce have a negative mean. Ironically, several of the criticisms of Rosenthal's approach themselves languish in file drawers despite serious attempts to publish them. Therefore, most behavioral scientists who use Rosenthal's FSN formula are probably unaware that these criticisms even exist.

Vague Conclusions

The problem of vague conclusions arises especially when the total number of studies is too small to allow a pooling analyst to group them into highly homogeneous subsets. Investigators often mistakenly interpret a statistically significant pooled probability as evidence that the effect size tends to be different from zero or that the corpus of studies supports a general conclusion about the direction of the combined effects. That is, users of Stouffer's test often fail to acknowledge that each individual study typically examines a slightly different form of the effect of interest. Perhaps one study uses male participants and visually presented stimuli, and another uses female participants and orally presented stimuli. It is rarely possible to enumerate all the possible factors that might influence the effect size, and we can almost never assume that the set of studies available is in any sense evenly balanced across these factors. Thus it is fatuous to claim that one is testing an average effect size in any important sense.

Aside from this, consider the effect of different sample sizes. Suppose two studies of a particular trait use men and women, respectively, and show respective means of -1 and +1. But suppose the second sample size is much larger than the first, so that when tests are performed of the null hypothesis of a zero mean and the results are transformed to z values, the zs are -1 and +4. The Stouffer pooled z is then a significant 2.12, although the average of the two sample means is exactly 0.

The reader might object that a weighted average of sample means (weighted by sample size) is positive in this example. But we could create other examples in which the two studies differ not in sample size, but in the power of the significance test used or the sensitivity of the measures used. In a typical use of probability poolers, studies will differ in all these respects. Even in the simple case in which each study uses a sample of size Nᵢ to test a single mean Mᵢ against zero with a known common standard deviation σ, for each study we have zᵢ = Mᵢ/(σ/√Nᵢ) = √Nᵢ Mᵢ/σ. Thus a summation of zs effectively weights each sample mean by √Nᵢ, not by Nᵢ. Therefore, the Stouffer method does not in fact weight studies in proportion to sample size.

Thus, methods like Stouffer's test are not testing anything akin to an "average" effect in the corpus of literature as a whole (cf. Becker, 1987). The only scientifically useful conclusion possible is that at least one of the k pooled effects is positive. Such a conclusion is often unacceptably vague. For instance, if the question of interest concerns the effectiveness of a new medical treatment, and the analysis includes dozens of studies of both sexes and many different ages and ethnic groups around the world, then the final conclusion may be merely that the treatment works for at least one group somewhere in the world, with no ability to be more specific. However, more specific conclusions can be reached.

An Overview

In the remainder of this paper we describe ways to modify a probability pooler to handle the problems just discussed. In principle, any probability pooler can be so modified, but the modifications will typically require new tables. Therefore we focus on just two probability poolers: the Stouffer method, for which we have developed new tables, and the binomial method. In principle, existing tables will suffice for the binomial method, but new, larger tables would be highly useful, and we have developed such tables. For reasons that will be made clear later, we have given the name "Stouffer-max" to our extensions of the Stouffer method. It turns out that Stouffer-max also subsumes as a special case the Bonferroni method or, more precisely, the exact formula for independent tests that the Bonferroni formula approximates.

We have performed several power studies on these methods, though in the present article we report the results of those studies only in the briefest summary form. The Stouffer-max method is generally more powerful than the binomial method, especially when k is small. However, the Stouffer-max method is currently limited by available tables to k ≤ 1,000. The binomial method is simpler conceptually and has no upper limit on the number of studies. Thus, we recommend choosing between these two methods on the basis of the circumstances. Because of the simplicity of the binomial method, we use it to offer our first illustrations of the analyses we advocate.

We will say repeatedly that some conclusion "must be" true. We mean that in the ordinary sense of "probability, expectation, or supposition," as in the sentence, "It must be almost midnight." Specifically, we say a hypothesis must be true when all alternative hypotheses have been rejected, and we use rejected in the same way statisticians have routinely used it for many decades.

The Binomial Probability Pooler

One of the simplest and earliest forms of combining the probabilities from a series of independent hypothesis tests is based on a simple question: Given k independent hypothesis tests in which the null hypothesis is true in every test, what is the probability of getting s or more "positive outcomes," that is, positive results that are statistically significant at a fixed level of significance α? Almost half a century ago, several authors (Brozek & Tiede, 1952; Jones & Fiske, 1953; Wilkinson, 1951) pointed out that this probability can be calculated from the binomial formula

p = Σ (from x = s to k) [k!/(x!(k − x)!)] α^x (1 − α)^(k−x)

and that a small binomial probability will support the claim that a true positive effect does exist in the literature. The binomial method is sometimes considered a "vote-counting" method (Hedges & Olkin, 1980), though some authors reserve that term for the case in which the pooling analysis contains no inferential statistics at all.

Extensions of the Basic Method

But one can easily go well beyond the simple conclusion of the basic binomial method. To illustrate these analyses and their conclusions for a binomial probability pooler, consider a simple hypothetical example. Suppose a literature search reveals 10 studies on a topic, and 4 of the 10 show a statistically significant effect beyond the .05 level in the direction of interest. In our terminology, 4 of the 10 tests have positive outcomes and the other 6 have negative outcomes. A binomial table shows that the one-tailed probability of finding 4 or more of 10 studies significant at the .05 level is only .0010, so the pooled or combined significance level is .001.
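In Python, a minimal sketch of this computation (ours; scipy's binom.sf gives the upper tail of the binomial distribution):

    from scipy.stats import binom

    def binomial_pooled_p(s, k, alpha=.05):
        """P(s or more positive outcomes in k tests when every null is true)."""
        return binom.sf(s - 1, k, alpha)     # sf(s - 1) equals P(X >= s)

    print(binomial_pooled_p(4, 10))          # about .0010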

In this section, we make several claims that will strike many readers as logically questionable. We defend our logic at length in the section, The Logic Behind Our Conclusions. Note that in this example, the positive conclusion from the binomial analysis is invulnerable to criticisms of any of the 6 studies with negative outcomes. Deleting any one of these 6 studies would only improve the pooled result by eliminating one of the negative instances. Thus the positive pooled conclusion is invulnerable, in this case, to criticisms of over half the studies. In fact, deleting any subset of these six studies, or even all six, would have the same result. We will see later that for all of the probability poolers we discuss, one can usually identify a set of studies that do not contribute to a positive pooled result. That result is then invulnerable to criticisms of any or all of those studies.

Furthermore, the positive conclusion from the binomial analysis is more specific in the following sense. Recall the point made earlier that no two studies are really identical (some may use male participants whereas others use female participants, etc.) so it is useful to know exactly which studies are contributing to the positive pooled conclusion. Anyone who has reached such a conclusion and then plans to study the details of the experiments to understand why results can come out positive in this type of analysis is clearly well advised to focus on the four experiments that contributed to the positive pooled conclusion. A similar point is not nearly so obvious when an ordinary probability pooler has been used.

Now we consider the file drawer problem. A binomial table shows that with four results significant at the .05 level, the total number of studies could be as large as 28 and the pooled binomial result would still be significant at just below the .05 level (.0491). Because 10 studies were found, the combined result is still statistically significant at the .05 level even if we imagine that there are as many as 18 studies with negative outcomes that are unpublished or otherwise not found. The important thing to notice about this derivation of the FSN is that, unlike Rosenthal's method, it does not require the assumption that the mean effect size of the missing studies is zero; all the missing effect sizes may be highly negative. If one assumes that any of the missing studies had positive outcomes (i.e., were statistically significant in the positive direction), then FSN must be larger. Thus, the FSN derived with the binomial method is a lower limit on the number of missing studies that would have to exist to threaten the significance of the pooled p value.
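A sketch of this file-drawer search in Python (our code; the function name is ours): increase the hypothetical total k until the pooled p first exceeds .05.

    from scipy.stats import binom

    def max_total_k(s, alpha=.05, crit=.05):
        """Largest total k at which s positive outcomes keep pooled p <= crit."""
        k = s
        while binom.sf(s - 1, k + 1, alpha) <= crit:
            k += 1
        return k

    print(max_total_k(4))    # 28; with 10 studies found, FSN = 28 - 10 = 18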

Now suppose we are concerned that some unknown future critic might point out some reason for excluding one of the four studies counted as having positive outcomes. The binomial method handles this possibility quite simply. The investigator (or critic) merely returns to a binomial table or formula and finds the probability of three or more positive outcomes among nine studies (as discarding one study would lower the total to nine). That probability is .0084. Thus, criticizing or discarding one of the positive outcomes does not change the pooled probability from statistically significant to nonsignificant. Happily for the analyst, it makes no difference which of the four studies with a positive outcome some critic might attack.

What if our unknown future critic attacks not one but two studies? To assess this contingency, we return to the binomial table or formula and find the probability of two (four minus two) or more positive outcomes in eight (10 minus 2) studies. The pooled probability is .0572, so we conclude that the pooled probability is only marginally significant by usual criteria if two of the four positive-outcome studies are attacked. We have, however, shown that the positive pooled conclusion is invulnerable to methodological criticisms of any one study in the analysis.
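The same deletion analysis in Python (again our sketch, with the example's values as defaults): drop d positive outcomes and re-pool the remaining s − d successes among k − d studies.

    from scipy.stats import binom

    def pooled_p_after_deleting(d, s=4, k=10, alpha=.05):
        """Pooled p after a critic discards d of the positive outcomes."""
        return binom.sf(s - d - 1, k - d, alpha)

    print(pooled_p_after_deleting(1))   # about .0084, still significant
    print(pooled_p_after_deleting(2))   # about .0572, only marginally significant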

A critic might disparage a study not for its methodological inadequacy but for the fact that it used unusual conditions that are unlikely to recur. The same sort of analysis can address that possibility by identifying a lower confidence limit on the number of real positive effects. Thus, it increases the specificity of the pooled conclusion. In the present example, that lower limit is 2.

We can also combine this sort of analysis with a file-drawer analysis. A binomial table or formula shows that when one of the positive outcomes is discarded, the pooled result is significant at the .05 level even if the total number of studies is as large as 16. Therefore, if we observe 4 positive outcomes out of 10, we can conclude that there must be at least two real positive effects even if we imagine that our search missed as many as 7 (from 16 − 9) studies with negative outcomes. (Our phrase "real positive effect" of course ignores the possibility that positive effects may result from experimental error. A more precise phrase would be "positive effect not caused by chance." That phrase seems awkwardly long, so we shall continue to use the simpler phrase, with the understanding that it includes this possibility.)

Tables and Macros for Binomial Probability Poolers

Repeatedly using the binomial formula to compute the pooled p can be tedious. Typical binomial tables cover only modest ranges of values of s (the number of positive outcomes) and k (the total number of studies combined). Thus our recommendations might appear impractical. In response, we have created a series of tables and macros to aid in this task. To be as sure as possible that investigators are not prevented from using our techniques by the absence of appropriate tables, we have extended the tables to rather high values of k. However, for the reasons discussed in the introduction, we assume that most uses of these techniques will involve moderately small values of k.

In Appendix A, we have included macros and other commands for SYSTAT, Minitab, and SPSS. These allow analysts to create tables of cumulative binomial probabilities of three types: (a) specifically for analyses in which positive outcomes are deleted one by one, so that both s and k drop by 1 each time; (b) specifically for file-drawer analysis, in which α and s are held constant but k varies; and (c) more general tables covering many values of both s and k. In all cases, the user can choose the values of α, s, and k that the table will cover.

In addition, a website contains large binomial tables of the types illustrated by Tables 1 and 2 of this article. The website tables are approximately 300 and 50 times as large, respectively, as Tables 1 and 2. The website addresses are given in the table notes.

When an analyst chooses some significance value α to distinguish between positive and negative outcomes of individual studies, he or she usually sets α at .5, .1, .05, or .01. For each of these four values of α, tables on the website show all exact binomial ps (i.e., probabilities of s or more positive outcomes) falling between .1 and .0001, for k ≤ 1,000. Table 1 shows the beginning of the website table for α = .05.

The website contains four tables that are similar to Table 2 of this article. Each of these tables shows the maximum k that would make a result significant at a specified level (.05, .01, or .001) for a specified number (up to 1,000) of positive outcomes. Again, α for individual studies may be set at .5, .1, .05, or .01. Table 2 shows part of the website table for α = .05.

Table 2 and its parallel website tables are especially convenient for file-drawer analyses, but we want to emphasize that they can also be used for other purposes. Their major limitation, in comparison to ordinary binomial tables, is that they tell only whether a pooled result is significant beyond .05, .01, or .001, rather than yielding a more exact pooled p. Their major advantage is their compactness for the range of k covered, which allows them to include k values as high as 94,881 (for α = .01) in a table which, when printed, fits on two pages.

The Logic Behind Our Conclusions

The Binomial Case

The previous section has introduced three principles that need substantiation:

1. One can conclude that at least one of k effects must be positive even when no effect is significant individually.

2. Similar conclusions can be drawn about "at least j" effects, where j > 1.

3. The conclusion that a set of studies must contain at least j real positive effects can be applied not only to the total set of k studies, but also to the smaller set of studies that contributed to the pooled conclusion. For a binomial pooler, that is simply the set of studies with positive outcomes.

In our discussion of these issues, we will consider a "trial" to be a single repetition of all k studies. As before, we use the phrase must be true not in an absolute sense, but more in the way statisticians routinely use the word accept, meaning that all possible alternative hypotheses have been found to be inconsistent with the evidence at a specified probability level.

We'll discuss these issues in the context of the following example. We have 100 coins, each of which we flip once. Because our real discussions of probability poolers focus on one-sided hypotheses, let us assume that we care only if coins are biased toward heads, so the null hypothesis is that all 100 coins are either fair or biased toward tails. We wish to test this hypothesis, and related ones, at the .05 level.

Suppose we observe 70 heads in the 100 flips. If we use the exact binomial formula to test the null hypothesis, we find a one-tailed p of .000039, allowing us to conclude that at least one of the coins must be biased toward heads. We reach this conclusion even though we cannot point to any one coin and say with any certainty at all that that one coin is biased. This establishes the first of our three listed points.


Table 1
Cumulative Binomial p Values Between .1 and .0001, for P = .05 and k ≤ 50

[Table body not reproduced here: for each k from 2 to 50, the original lists the smallest qualifying s (sm) and the cumulative binomial ps between .1 and .0001 for successive values of s.]

Note. P is the probability of a positive outcome on each binomial trial; k is the number of trials. Each table entry p is the probability of s or more positive outcomes. sm (for s-minimum) is the smallest s for which a p appears in the row shown. Entries in a row are for successive and increasing values of s. Thus for k = 4, ps of .0140 and .0005 are for s = 2 and s = 3, respectively. If s exceeds the largest s for which a p is shown, then p < .0001. If s falls below the smallest s for which a p is shown, then p > .1. If an entry appears in the table as .0001, then its true value is .0001 or higher, because entries with true values below .0001 are not shown. Much larger tables of this form appear at http://www.psych.cornell.edu/darlington/meta/binomtab.htm


Table 2
Maximum Value of k to Yield a Pooled p ≤ .05 by the Binomial Test at α = .05, for Fixed s

x \ y     0      1      2      3      4      5      6      7      8      9
  0       —      1      7     16     28     40     53     67     81     95
 10     110    125    140    155    171    187    203    219    235    251
 20     268    284    300    317    334    351    367    384    401    418
 30     435    452    469    487    504    521    538    556    573    590
 40     608    625    643    660    678    696    713    731    748    766
 50     784    802    819    837    855    873    891    909    926    944
 60     962    980    998  1,016  1,034  1,052  1,070  1,088  1,106  1,124
 70   1,142  1,160  1,178  1,197  1,215  1,233  1,251  1,269  1,287  1,306
 80   1,324  1,342  1,360  1,378  1,397  1,415  1,433  1,451  1,470  1,488
 90   1,506  1,525  1,543  1,561  1,580  1,598  1,616  1,635  1,653  1,671
100   1,690  1,708  1,727  1,745  1,763  1,782  1,800  1,819  1,837  1,856
110   1,874  1,893  1,911  1,930  1,948  1,967  1,985  2,004  2,022  2,041
120   2,059  2,078  2,096  2,115  2,133  2,152  2,171  2,189  2,208  2,226
130   2,245  2,263  2,282  2,301  2,319  2,338  2,356  2,375  2,394  2,412
140   2,431  2,450  2,468  2,487  2,506  2,524  2,543  2,562  2,580  2,599
150   2,618  2,636  2,655  2,674  2,692  2,711  2,730  2,749  2,767  2,786
160   2,805  2,824  2,842  2,861  2,880  2,899  2,917  2,936  2,955  2,974
170   2,992  3,011  3,030  3,049  3,067  3,086  3,105  3,124  3,143  3,161
180   3,180  3,199  3,218  3,237  3,255  3,274  3,293  3,312  3,331  3,350
190   3,368  3,387  3,406  3,425  3,444  3,463  3,481  3,500  3,519  3,538
200   3,557  3,576  3,595  3,614  3,632  3,651  3,670  3,689  3,708  3,727
210   3,746  3,765  3,783  3,802  3,821  3,840  3,859  3,878  3,897  3,916
220   3,935  3,954  3,973  3,992  4,010  4,029  4,048  4,067  4,086  4,105
230   4,124  4,143  4,162  4,181  4,200  4,219  4,238  4,257  4,276  4,295
240   4,314  4,333  4,352  4,370  4,389  4,408  4,427  4,446  4,465  4,484
250   4,503  4,522  4,541  4,560  4,579  4,598  4,617  4,636  4,655  4,674
260   4,693  4,712  4,731  4,750  4,769  4,788  4,807  4,826  4,845  4,864
270   4,883  4,902  4,922  4,941  4,960  4,979  4,998  5,017  5,036  5,055

Note. s is the number of positive outcomes, α is the probability of each positive outcome, and k is the total number of trials. Let s = x + y. For example, if 12 (10 + 2) individual results are significant with α = .05, then pooled p ≤ .05 only if the total number of studies is 140 or less. Much larger tables of this form appear at http://www.psych.cornell.edu/darlington/meta/failsafe.htm

An exact binomial analysis shows the ordinary 95% one-sided lower confidence limit on the rate of heads to be .61578. That is, testing this null value against 70 observed heads yields a one-tailed p of almost exactly .05. But the hypothesis that every coin has this probability of heads implies that every coin is biased and tells us nothing about a lower confidence limit on the number of biased coins. To maximize the expected number of heads with the fewest biased coins, one should assume that any biased coin is completely biased and comes up heads every time.

Define H25 as the hypothesis that 25 coins always come up heads while the other 75 coins are fair. Under H25, 25 of the 70 observed heads are from biased coins and the other 45 are from fair coins. The probability of observing 45 or more heads from 75 fair coins is .05267, so H25 is just barely consistent with the data at the 95% confidence level (one-tailed). Under the hypothesis H24 (that 76 coins are fair and 24 always come up heads), the probability of 70 or more heads is only .04232, so H24 is not consistent with the observed result. Thus 25 is a lower confidence limit on the number of biased coins.

The probability .05267 was found by computing the binomial probability of 70 or more heads in 100 fair flips (as mentioned, it is .000039), then 69 or more heads in 99 fair flips (it is .000055), and so on, until a probability greater than .05 was found at 45 heads in 75 fair flips. In other words, we imagined deleting one positive case (one observation of a head) at a time from the observed results until a nonsignificant result was found. This establishes the second of our three listed points.
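This deletion procedure is easy to automate. A minimal sketch in Python (ours, not the article's):

    from scipy.stats import binom

    def lower_limit_biased(heads=70, flips=100, crit=.05):
        """Delete heads one at a time until the binomial p first exceeds crit."""
        deleted = 0
        while binom.sf(heads - deleted - 1, flips - deleted, 0.5) < crit:
            deleted += 1          # treat one more head as from a biased coin
        return deleted

    print(lower_limit_biased())   # 25: at least 25 coins biased toward heads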

At this point, we have concluded that the 100 coins must include at least 25 coins biased toward heads. Can we draw the same conclusion about the smaller set of 70 coins with observed heads? To see why we can, consider questions about the number of "false positives." We're thinking of heads as the positive direction and thinking of the null hypothesis as the hypothesis that each coin is fair or negatively biased. We will thus define a false positive as a fair or negatively biased coin that lands on heads. Then all the previous probability statements can be rephrased as statements about the number of false positives.

Here it helps to consider it "unreasonable to expect" a certain outcome if the probability of that outcome is below our specified α. If all 100 coins are fair or negatively biased, then all heads are false positives, and we have seen that it is unreasonable to expect 70 false positives in 100 flips. If 90 coins were fair or negatively biased and the other 10 always came up heads, then we could explain 70 observed heads only by positing 60 (70 − 10) false positives. But as we saw above, it would be unreasonable to expect so many false positives from 90 fair coins. However, we saw that it is reasonable to expect as many as 45 false positives in 75 fair coins. By assuming the other 25 coins always come up heads, this hypothesis can be made consistent with the actual observation of 70 (45 + 25) heads. But 45 is the largest number of false positives that can reasonably be predicted.

But, of course, the 70 observed positives include only real positives and false positives. If there are at most 45 false positives, there must be at least 25 real positives. Therefore we can conclude that there must be 25 real positives (positively biased coins) among the 70 observed positive outcomes, not merely in the larger set of 100 coins. That establishes our final point: that positive statements that we can ascribe to the total set of k experiments can also be ascribed to the smaller set of experiments with positive outcomes.

Extending the Logic to Other Poolers

We now show how similar conclusions can be reached with other probability poolers such as the Stouffer test. We'll discuss Stouffer specifically, but the logic applies to other methods as well.

Suppose the analyst repeatedly deletes the most positive results, one at a time, and repeats the pooled test until nonsignificance is found. We'll call these repeated tests TD1 ("test deleting 1"), TD2, and so on. Each test in the series yields a less positive result than the previous test. That is true even when results from two experiments are equal, because dropping one of the most positive results always yields a less positive pooled result.

TDj drops the j most significant results and tests the rest. TDj tests the null hypothesis that the number of real positive effects does not exceed j. The overall test can be thought of as TD0, as it drops no results, and its null hypothesis of no real positive effects is the hypothesis that the number of real effects does not exceed zero.

To understand the TD logic, consider an example in which there are three real positive effects, so that TD3 is the first test in the series to test a true null hypothesis: that the number of real positive effects does not exceed three. As before, think of a trial as a repetition of all k experiments. Consider first the case in which the three real positive effects are all extremely large, so large that, even in a billion trials, their individual results would always be the three most significant. Thus in every trial, the three results dropped from TD3 would be the three results from experiments with real positive effects. Thus in every trial, TD3 consists simply of applying a probability pooler to a set of k − 3 experiments with no real positive effects. We see that, provided the probability pooler is valid to begin with, it will be valid in this application.

Now consider the more realistic case in which the three real positive effects are much smaller than we just imagined, perhaps quite small though still positive. This change can only lower the sampling distribution of the test statistic for TD3, making that statistic less likely to reach any particular positive level. But because the test was valid before, its bias in this new case can only be negative. This is true even though, in some trials, we discard some results from experiments whose real effects are zero or negative while retaining some results from experiments with real positive effects.

From our definition of a trial as a repetition of each of the k experiments in the analysis, it follows that the conclusion that there must be at least j real positive effects is essentially a statement of a lower confidence limit on the number of such effects.
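A sketch of the TD sequence in Python (our code; for simplicity it re-pools with the plain Stouffer statistic, whereas in practice the Stouffer-max tables described below would ordinarily be used for these repeated tests):

    from math import sqrt
    from scipy.stats import norm

    def td_lower_limit(zs, crit=.05):
        """Drop the j most significant zs, re-pool, and return the first
        nonsignificant j: a lower confidence limit on real positive effects."""
        zs = sorted(zs, reverse=True)
        for j in range(len(zs)):
            rest = zs[j:]
            if norm.sf(sum(rest) / sqrt(len(rest))) >= crit:
                return j
        return len(zs)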

Now consider how a positive conclusion can be stated not for the total set of k studies, but for the smaller set that contributes to a Stouffer z, to which a study is said to contribute only if deleting it from the pooled analysis would lower the pooled z. One can clearly test any study for this property, and in the section, Cutoff z, Total Invulnerability, and Specificity, we'll show that it is computationally even easier than one might assume.


The desired conclusion has the form, "At least j real positive effects must exist within this set of s studies," when we have already concluded that at least j real positive effects must exist within the total set of k studies. Notice that the desired conclusion does not imply the absence of real positive effects outside the set of s studies. That would imply the total absence of false negatives, clearly an unjustifiable conclusion.

We define an error of overspecificity as acceptance of the quoted conclusion when it is false. In simulation studies, we have examined a wide array of cases in which some of the k true effects were zero and others were positive. In each case, a trial consisted of (a) generating k values of z, (b) dropping the highest zs one at a time to reach a conclusion that "at least j of these k true effects must be positive," (c) testing for positive contributions to conclude that "at least j of these s effects must be positive" (where typically s < k), then (d) checking the true effect sizes to see whether the last step had produced an error of overspecificity. All tests were performed at the .05 level. We found that the rate of overspecificity errors increased with the sizes of the positive true effects, but even making those true effects extremely large never raised the rate of overspecificity errors above its proper nominal level of .05.

This conclusion can be supported at an intuitive level, even without a rigorous statistical proof. If we conclude that the total pattern of results is sufficiently positive that it would be unlikely to occur unless there were at least j real positive effects, then clearly it would be even less likely for such a positive pattern of results to occur without those real positive effects contributing to the positive pooled result.

Extending the Stouffer z

Stouffer-max

At this point we have shown how the familiar binomial probability pooler can be extended to increase the specificity and invulnerability of conclusions as well as resistance to the file-drawer problem. We now consider a test we call Stouffer-max, which extends these same advantages to the Stouffer test. Power is the major advantage of Stouffer-max over the binomial pooler; the major disadvantages are a loss of simplicity and a limitation of currently available tables to k ≤ 1,000.

Because the Stouffer pooled Z can be written as Σz/√k, and Σz = k × mean(z), we see the Stouffer pooled Z can also be written Z_Stouffer = mean(z) × √k.

Because √k can be computed without even knowing any experimental results, we can regard mean(z) as the real test statistic. Consider a modification of the Stouffer test in which we compute the mean of just the five highest (most positive) values of z and compare that mean to critical values like those shown in Table 3. Or, more generally, we might compute the mean of the s highest values of z, where s is some positive integer we choose. We'll call this statistic MeanZ. The website document cited in Table 3 contains tables of critical values of MeanZ for s ≤ 50 and k ≤ 1,000, for the following seven values of α: .1, .05, .025, .01, .005, .0025, and .001. We call this the Stouffer-max test because it uses just the highest values of z from individual experiments, rather than all values of z.

Of course, in Stouffer-max only the s most significant experiments contribute at all to a positive pooled result. This automatically increases invulnerability and specificity. And just as with the binomial method, the file-drawer problem can be handled simply by searching the relevant tables for the highest value of k that still leaves the pooled result statistically significant. One can also rank the zs and delete them one at a time, starting at the top, using the Stouffer-max tables to test each new result. Each new deletion lowers k by 1, but the analyst may prefer to keep s constant by adding a new z at the bottom of the included set as each old z is deleted from the top. MeanZ will of course drop each time, unless all zs in that range are equal.

In principle, s can be set anywhere from k down to 1. When s = k, Stouffer-max reduces to the original Stouffer test. When s = 1, the test reduces to the Bonferroni method or, more precisely, to the exact formula for independent tests that the Bonferroni approximates:

corrected p = 1 − (1 − smallest individual p)^k.

Thus both Stouffer and the Bonferroni relatives can be thought of as extreme cases of Stouffer-max.
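Critical values of MeanZ are not available in closed form; the article's tables were built by simulation (see the note to Table 3 and Appendix B). A minimal sketch of that approach in Python (ours):

    import numpy as np

    def meanz_critical(s, k, alpha=.05, reps=200_000, seed=0):
        """Approximate the upper-alpha critical value of the mean of the
        s highest of k independent standard normal zs (requires s <= k)."""
        rng = np.random.default_rng(seed)
        z = rng.standard_normal((reps, k))
        z.sort(axis=1)                        # ascending: top s are the last s
        meanz = z[:, -s:].mean(axis=1)
        return np.quantile(meanz, 1 - alpha)

    print(meanz_critical(3, 10))              # about 1.77 (printed table: 1.772)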

When we first developed Stouffer-max, we were thinking of it as a simple technique like chi-square, which everyone uses in much the same way. But we have since begun to think of it as being more like factor analysis, with many ways to use it and controversy about the best ways to proceed. We first describe one possible procedure that we very tentatively suggest as the "standard," but then suggest others.

A Basic Strategy for Using Stouffer-max

Stouffer-max examines relations among four quantities: the number of the highest zs omitted, the number s of the next highest zs averaged to compute MeanZ, the number of studies assumed to be still hidden, and a significance level p.


Table 3
Critical Values of MeanZ(s,k) for Tests at α = .05

[Table body not reproduced here: the original lists critical values for s = 2 to 10 (columns) and k = 2 to 50 (rows). For example, the critical value of MeanZ(3,10) is 1.772, and MeanZ(2,2), which is equivalent to the ordinary Stouffer test with k = 2, has the critical value 1.163.]

Note. MeanZ(s,k) is the mean of the s highest of k mutually independent values of z, where each z has a standard normal distribution. These values were found by simulation methods, but all values shown here are estimated to be accurate to 3 decimal places. See Appendix B for further discussion. Much larger tables of this form appear at www.psych.cornell.edu/darlington/cmz.htm


In ordinary significance tests, we compute p as a function of other quantities, but in the current context another approach may be more reasonable. In asking whether the effect in question exists at all, we're considering three threats to validity: the possibility that a positive pooled result may be due to sampling error, to the file-drawer problem, or to flaws in a few studies. The first systematic quantitative techniques for assessing sampling error were developed before the 20th century, and the first such techniques for assessing the file-drawer problem were introduced in 1978. We believe this is the first published article to propose systematic techniques for assessing the problem of flaws in a few studies from a large group of pooled studies.

Consider why scientists have largely settled on the .05 level in assessing chance. Why not .25 or .10 or .001? The .05 level was chosen to represent a level that would be convincing to a substantial majority of readers, though not to everyone. Are there comparable levels for the other two threats to validity? In a set of studies taken seriously by most experts in an area, it might be reasonable to assume that there are in fact fairly serious flaws in 1 study in 10, or perhaps even 1 in 5. That might suggest that omitting the top 10% of studies is roughly comparable to the .05 level of classic significance tests, whereas omitting the top 20% is roughly comparable to the .01 level.

We find it more difficult to suggest any general rule for the file-drawer problem. On one hand, there are areas of research in which every study worth considering is funded by a major agency, so that the existence of each study is known to most specialists as soon as it starts, let alone by the time results appear. On the other hand, there are many areas in which studies are inexpensive and easy, and there are doubtless many unreported studies. Nevertheless, we suggest that specialists in an area might be able to suggest a fail-safe number for that area that would be convincing to a substantial majority of researchers in the same field.

Thus, we suggest that an analyst attempt to choose a number of potentially flawed studies, a number of hidden studies, and a significance level that will be convincing to a substantial majority of readers. These will determine the number of the most significant studies to be dropped, the level of k used, and α.

The next problem is to select some value of s for the first test. To avoid the charge that s was selected post hoc to maximize significance, we suggest using a simple arbitrary rule: setting s as close as possible to half the number of undropped studies, rounding down when that number is odd. Because 50 is the highest value of s in the tables, this means s should be set to 50 if the number of studies exceeds 100.

If that first test is significant, then the researcher can successively test MeanZ values for smaller values of s, continuing until a nonsignificant result is found, and then focus on the smallest value of s that yields a significant result. This yields the most specific conclusion possible.

To illustrate, suppose eight studies produce z values of 2.385, 2.247, 1.925, 1.864, 1.636, 1.164, 1.144, and 0.405. If we feel that as many as one in five studies may have some substantial flaw, then we drop the top two of these eight studies. Suppose we also assume that the number of hidden studies might be as high as 50% of the number of found studies. Then we'll assume there are 12 studies total. But because we're dropping the top 2, we work with k = 10. Suppose we also use α = .05. After dropping the top two, we have six left, so we set s equal to half that, or 3. The mean of the top three remaining zs is (1.925 + 1.864 + 1.636)/3 = 1.808. Entering the tables with s = 3 and k = 10, we find a critical mean z of 1.772, so this result is significant at the .05 level. When we repeat the process for s = 2, the result is nonsignificant. We thus conclude that even if one discards the top two of the eight zs and assumes four other studies with lower zs were not found, there still must be at least one real positive effect among the remaining top three studies. Or if we assume no serious methodological flaws in any of the eight studies, we conclude there must be at least three real positive effects among the top five studies. We can reach these conclusions despite the fact that even the highest of the eight zs is nonsignificant after a Bonferroni correction.
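
For readers who wish to script this procedure, a minimal sketch in Python (used here for illustration only) reproduces the example above. The dictionary CMZ is a stand-in for the published tables and contains only the one entry this example requires.

CMZ = {(3, 10, .05): 1.772}   # (s, k, alpha) -> tabled critical mean z

zs = sorted([2.385, 2.247, 1.925, 1.864, 1.636, 1.164, 1.144, 0.405],
            reverse=True)
dropped = 2                    # studies assumed flawed (top zs dropped)
remaining = zs[dropped:]       # six studies left
k = 10                         # 8 found - 2 dropped + 4 assumed hidden
s = len(remaining) // 2        # half the undropped studies, rounded down
mean_z = sum(remaining[:s]) / s
print(round(mean_z, 3))                 # 1.808
print(mean_z > CMZ[(s, k, .05)])        # True: significant at the .05 level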

Some Computations Common to Several More Complex Strategies

We suggest several other strategies below, but most of them share a computational step that we describe in this section. The step is also useful in the "basic strategy" just described. That step, which is easily automated, is the computation of the MeanZ values for every possible combination of s and number of top zs dropped. Let z(i) denote the ith z, when zs are ranked from high to low. First take the "average" of z(1), then the average of z(1) and z(2), then the average of z(1), z(2), and z(3), up to the average of all k. Then drop z(1) and take the "average" of z(2), then the average of z(2) and z(3), then the average of z(2), z(3), and z(4), and so on, producing (k - 1) averages in this step. Then drop z(2) and compute (k - 2) more averages. Keep repeating. In the last step, you have dropped the top (k - 1) zs, so you "average" just z(k). This total procedure produces k(k + 1)/2 "averages," of which k(k - 1)/2 are real averages of 2+ values and k are "averages" of single numbers. Most analysts can find some easy way to compute all these values on a personal computer, though the precise procedures they use will vary widely. For those with programming skills, the process is easily programmed.
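
As a concrete illustration, a short Python sketch that produces all k(k + 1)/2 "averages" just described:

def all_mean_zs(zs):
    # Returns a dict mapping (drops, s) to the mean of the s highest
    # remaining zs after dropping the `drops` highest; k(k+1)/2 entries.
    z = sorted(zs, reverse=True)          # z[0] is z(1), the highest
    k = len(z)
    means = {}
    for drops in range(k):
        total = 0.0
        for s in range(1, k - drops + 1):
            total += z[drops + s - 1]
            means[(drops, s)] = total / s
    return means

table = all_mean_zs([2.385, 2.247, 1.925, 1.864, 1.636, 1.164, 1.144, 0.405])
print(round(table[(2, 3)], 3))            # 1.808, as in the earlier example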

Cutoff z, Total Invulnerability, and Specificity

Suppose a pooled conclusion remains significant at an acceptable level after the two most significant individual studies are dropped from the pool. It is then reasonable to call the pooled conclusion invulnerable to criticisms of those two studies or any two studies. The term invulnerable seems appropriate even though dropping the two top studies would make the pooled conclusion less significant.

It seems reasonable to use an even stronger term to describe the case in which deleting a study for possible methodological inadequacies actually improves (lowers) the pooled p. As we saw earlier, this occurs in the binomial test when studies with negative outcomes are discarded. It can also arise in the Stouffer-max test or in the original Stouffer test when one or more of the lower zs is deleted. Because we want a stronger term than invulnerability to describe this case, the term total invulnerability seems reasonable. We will say that a pooled conclusion is totally invulnerable to criticisms of a study or a set of studies if dropping any or all of those studies would only improve the pooled p, or at least keep it constant. Later we'll stretch this definition a bit from mathematical purity and call a conclusion totally invulnerable to criticisms of a study unless dropping the study would move the pooled p to a higher range, where ranges are defined by the breakpoints .1, .05, .025, .01, .005, .0025, and .001. Purists might want to call this condition "almost totally invulnerable."

It is quite easy to identify the studies with total invulnerability in the original Stouffer test. A bit of algebra shows that if z_i for one study is equal to

z_i = (sum of the zs of the other k - 1 studies) × (√(k/(k - 1)) - 1),

then Stouffer's Z and its associated p value are unchanged if that study is deleted from the analysis. In our terminology, a pooled conclusion is totally invulnerable to criticisms of studies with zs below this cutoff value, because dropping any or all such studies would only raise the pooled Z. As k increases, the cutoff z approaches the simpler expression 0.5 mean(z) quite rapidly from below, because the part of cutoff z in parentheses approaches 1/(2k). Even when k is only 11, the two values are within 2.4% of each other. Thus when k is above 10, it may be reasonable to treat the cutoff z as simply 0.5 mean(z), with the actual cutoff z always being slightly below the calculated value.
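
In Python, the cutoff can be computed for every study at once; this sketch simply transcribes the formula given above and is offered only as an illustration:

import math

def cutoff_zs(zs):
    # For each study, the z at which deleting it would leave Stouffer's Z
    # unchanged; a study whose z falls below its cutoff can only raise
    # the pooled Z when deleted.
    k = len(zs)
    total = sum(zs)
    factor = math.sqrt(k / (k - 1)) - 1
    return [(total - z) * factor for z in zs]

zs = [2.385, 2.247, 1.925, 1.864, 1.636, 1.164, 1.144, 0.405]
for z, cut in zip(zs, cutoff_zs(zs)):
    print(z, round(cut, 3), "below cutoff" if z < cut else "above cutoff")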

In Stouffer-max, a pooled conclusion is of course totally invulnerable to criticisms of studies not included in the computation of MeanZ, since deleting any or all of those studies would lower k while leaving MeanZ unchanged. In addition, suppose a result is significant at an acceptable level with s = 10, but dropping the 10th study (that is, the 10th-highest z) from the entire analysis (so s and k both drop by 1) produces a new result at least as significant as before. Then the conclusion based on s = 10 is totally invulnerable to criticisms of the 10th study as well as studies numbered 11 and higher. And if dropping the ninth study again produces a result at least as significant as the original one, then the original result is totally invulnerable to Study 9 as well. This process can be continued until total invulnerability is no longer found.

The reader can now see more clearly why it may be convenient to compute every possible MeanZ early in the analysis. If that has been done, the process just sketched is simply a matter of taking already-computed values of MeanZ and comparing them to the appropriate entries in the tables.
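
A Python sketch of that scan appears below. The function cmz(s, k, alpha) stands for a lookup into the published tables (or the simulation described in Appendix B); it is assumed here rather than shown, and for simplicity the sketch checks only that each smaller result remains significant at the chosen level rather than comparing exact p values.

def invulnerability_scan(zs, s, k, alpha, cmz):
    # Repeatedly drop the sth-highest z (s and k both fall by 1) while
    # the new MeanZ still exceeds its critical value; returns the final
    # (s, k) for which significance was retained.
    z = sorted(zs, reverse=True)
    while s > 1:
        if sum(z[:s - 1]) / (s - 1) <= cmz(s - 1, k - 1, alpha):
            break
        s, k = s - 1, k - 1
    return s, k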

Total invulnerability is always specific to certain studies, whereas "ordinary" invulnerability applies to all studies. For instance, if an analyst reports that a pooled result is still significant even with the top two zs deleted, that means that the result is invulnerable to criticisms of any two of the pooled studies.

In our "basic strategy," we suggested finding thelowest s that yields a significant result, in order toincrease specificity. Especially when k is low, it turnsout that the process described in this section may stillincrease the range of total invulnerability, even if theabove-mentioned step has been performed. That isbecause a given value of MeanZ is more significantthe lower k is. Thus, simply lowering s by 1 may raisethe pooled p, perhaps making it nonsignificant,

Page 14: Combining Independent Values: Extensions of the …psych.colorado.edu/~willcutt/pdfs/Darlington_2000.pdf · 2012-12-05 · Combining Independent p Values: ... studies not included

PROBABILITY POOLERS 509

whereas lowering both s and k may retain the originalsignificance or even improve it.

By identifying the range of total invulnerability, we are actually dividing studies into those that contribute to the pooled conclusion and those that don't. But as we saw earlier, a positive pooled result can be considered to be specific to the studies that do contribute. Thus when we find this dividing point, we are finding two things simultaneously: the set of studies for which the pooled conclusion is totally invulnerable to flaws, and the set to which the conclusion is specific.

Analyzing the Tradeoff Between Invulnerability and File Drawers

If we assume all studies are well executed and do a simple file-drawer analysis, we will find a higher FSN than if we do that analysis after dropping one or more of the highest zs. If all the aforementioned MeanZ values have been computed, it is quite simple to express FSN as a function of the number dropped. For instance, in our earlier example, suppose we determine s by the aforementioned rule, as half of the undropped studies. If we round down in choosing s, then we use s-values of 4, 3, 3, 2, 2, 1, 1, and 1 when dropping 0, 1, 2, ..., 7 zs, respectively. This produces the eight MeanZ values of 2.105, 2.012, 1.808, 1.750, 1.400, 1.164, 1.144, and 0.405, respectively. Using α = .05, the last three of these eight MeanZs are below all the critical mean z (CMZ) values in the tables. The highest k-values that make the first five MeanZ values significant are 31, 16, 10, 5, and 2, respectively. The first MeanZ was the mean of the four highest of eight, so FSN for it is 31 - 8 = 23. The second was the mean of the three highest of seven, so FSN for it is 16 - 7 = 9. The next FSN is 10 - 6 = 4, and the next (with three dropped) is 5 - 5 = 0. Thus if we assume that as many as three of the eight observed studies are flawed, the pooled result is nonsignificant unless we assume there are no missing studies. If we tried to compute the next FSN, we would find 2 - 4 = -2. FSN cannot be negative; this result is simply nonsignificant, even with no hidden studies. But we have found, without excessive computation, the FSN value for every number of dropped studies from zero to three.
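
This bookkeeping is easily scripted. In the Python sketch below, the helper highest_significant_k is assumed rather than shown; it would scan a CMZ table (or the simulation in Appendix B) for the largest k at which the given MeanZ still exceeds its critical value, returning None if there is none.

def fsn_by_drops(zs, alpha, highest_significant_k):
    # For each number of dropped top zs: s is half the undropped studies
    # (rounded down, minimum 1), and FSN is the largest still-significant
    # k minus the number of undropped studies; a negative difference
    # means nonsignificance even with no hidden studies.
    z = sorted(zs, reverse=True)
    rows = []
    for drops in range(len(z)):
        left = z[drops:]
        s = max(1, len(left) // 2)
        mean_z = sum(left[:s]) / s
        k_max = highest_significant_k(mean_z, s, alpha)
        fsn = None if k_max is None else k_max - len(left)
        rows.append((drops, s, round(mean_z, 3), fsn))
    return rows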

The Fisher and Fisher-max Methods

Years before the Stouffer test was proposed, Fisher (1938) proposed a probability pooler based on the chi-square distribution. Both the Fisher and Stouffer methods are exact in the sense that if the individual ps entering them are distributed exactly correctly (that is, distributed uniformly from 0 to 1), then the pooled p will also be so distributed. Nevertheless, the two methods do not yield identical pooled ps. As a general rule, the Stouffer method yields a numerically smaller (thus more significant) pooled p than the Fisher method if the entering ps are fairly similar in value, whereas the Fisher method yields a more significant result if the entering ps range widely. Thus one can state as a general, though vague, rule that the Stouffer method will be more powerful than the Fisher method if the experiments being pooled are fairly similar to each other in sample size and other characteristics affecting power, and the Fisher method will be more powerful otherwise.
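
Both poolers can be computed in a few lines. The following Python sketch (using the SciPy library) implements the two standard formulas: Fisher pools -2 Σ ln p against a chi-square distribution with 2k degrees of freedom, and Stouffer refers Σz/√k to the standard normal.

from math import log, sqrt
from scipy.stats import chi2, norm

def fisher_pool(ps):
    # Fisher: -2 * sum(ln p) ~ chi-square with 2k df under the null.
    return chi2.sf(-2 * sum(log(p) for p in ps), 2 * len(ps))

def stouffer_pool(ps):
    # Stouffer: sum of z over sqrt(k), where z has upper-tail area p.
    return norm.sf(sum(norm.isf(p) for p in ps) / sqrt(len(ps)))

similar = [.15, .12, .11, .09]       # ps similar in value
spread = [.0001, .35, .40, .45]      # ps ranging widely
print(stouffer_pool(similar), fisher_pool(similar))   # Stouffer smaller here
print(stouffer_pool(spread), fisher_pool(spread))     # Fisher smaller here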

In our own power studies, we have found that dropping the most significant results rapidly increases the power of Stouffer relative to Fisher. That seems reasonable in retrospect, as dropping the most significant results of course increases the homogeneity of the remaining results, and Stouffer performs best in such cases.

In 1996, the first author posted on his website a description of a method that could be called Fisher-max because it extends the Fisher method in the same way the Stouffer-max method extends the Stouffer method. That is, an analyst selects just the s most significant of k results, and then combines them by the Fisher method. Larger tables for the Fisher-max method were posted in 1998. However, we now feel that the power characteristics just mentioned generally recommend the Stouffer-max over the Fisher-max method, so we shall not describe the Fisher-max method further here. The method remains posted at www.psych.cornell.edu/darlington/index.htm for anyone interested in exploring it further.

Summary

When standard methods are used to pool the significance probabilities of independent experiments that study the same question, the pooled conclusions are typically vague, vulnerable to criticisms of individual studies, and subject to the file drawer problem. However, probability poolers can be extended to address all three of these problems simultaneously. In principle, any probability-pooling method can be so extended, though the extension typically requires new tables. Such tables are presented here for the binomial and Stouffer probability poolers, together with descriptions of the extensions. Our file drawer methods, unlike the method by Rosenthal (1979), allow one to assume that all the hidden studies showed highly negative effects.

References

Becker, B. J. (1987). Applying tests of combined significance in meta-analysis. Psychological Bulletin, 102, 164-171.

Becker, B. J. (1994). Combining significance levels. In H. Cooper & L. Hedges (Eds.), The handbook of research synthesis (pp. 215-230). New York: Russell Sage Foundation.

Begg, C. B. (1994). Publication bias. In H. Cooper & L. Hedges (Eds.), The handbook of research synthesis (pp. 399-409). New York: Russell Sage Foundation.

Brozek, J., & Tiede, K. (1952). Reliable and questionable significance in a series of statistical tests. Psychological Bulletin, 49, 339-341.

Darlington, R. B. (1980). Another peek in the file drawers. Unpublished manuscript, Cornell University.

Darlington, R. B. (1990). Regression and linear models. New York: McGraw-Hill.

Dickersin, K. (1997). How important is publication bias? A synthesis of available data. AIDS Education and Prevention, 9(Supplement A), 15-21.

Fisher, R. A. (1938). Statistical methods for research workers (7th ed.). London: Oliver & Boyd.

Glass, G., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.

Hedges, L., & Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological Bulletin, 88, 359-369.

Hunt, M. (1997). How science takes stock: The story of meta-analysis. New York: Russell Sage Foundation.

Iyengar, S., & Greenhouse, J. B. (1988). Selection models and the file drawer problem. Statistical Science, 3, 109-117.

Jones, L. V., & Fiske, D. W. (1953). Models for testing the significance of combined results. Psychological Bulletin, 50, 375-382.

Neter, J., Wasserman, W., & Kutner, M. H. (1990). Applied linear statistical models. Homewood, IL: Irwin.

Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin, 85, 185-193.

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). The American soldier: Adjustment during army life (Vol. 1). Princeton, NJ: Princeton University Press.

Strube, M. J., & Miller, R. H. (1986). Comparison of power rates for combined probability procedures: A simulation study. Psychological Bulletin, 99, 407-415.

Thomas, H. (1985). On the "file drawer" problem. Unpublished manuscript, Pennsylvania State University.

Wilkinson, B. (1951). A statistical consideration in psychological research. Psychological Bulletin, 48, 156-158.

Winer, B. J. (1962). Statistical principles in experimental design. New York: McGraw-Hill.

Appendix A

SYSTAT, Minitab, and SPSS Commands and Macros for Generating Tables of Cumulative Binomial Probabilities

For all three program packages, we describe "deletion tables" for dropping the most significant studies. Each row in this type of table is for both k and s that are 1 less than their previous values. For instance, suppose a researcher finds 10 positive outcomes in 40 studies. Then for a deletion analysis, the researcher wants to know the cumulative binomial probabilities for k = 40 and s = 10, k = 39 and s = 9, k = 38 and s = 8, and so on. A deletion table is a one-column table showing probabilities of this type.

For SPSS, we give a program for producing general binomial tables. Each such table will be for just one value of α (the probability of success on each trial), but for many values of s and k.

For SYSTAT and Minitab, we describe how to create first-pass tables for file-drawer analysis. This type of table is used to get a rough idea of the highest value of k for which the observed results are still significant. This table applies to a single value of s and a single value of p, but to many values of k. The desired values of k may range so high that it is impractical to include all values of k in the range of interest. For instance, one might want to study k-values of 40, 50, 60, ..., 100, 125, 150, 175, 200, 300, 400, ..., 1,000, etc.

Page 16: Combining Independent Values: Extensions of the …psych.colorado.edu/~willcutt/pdfs/Darlington_2000.pdf · 2012-12-05 · Combining Independent p Values: ... studies not included

PROBABILITY POOLERS 511

For SYSTAT and Minitab, we also describe how to create "final-pass" tables for file-drawer analysis. This type of table is just like a first-pass table, except that it covers consecutive values of k over just a limited range. For instance, suppose a first-pass table was used to find that the largest k still yielding significant results is between 125 and 150. Then a final-pass table might include every value of k within that range.

For SPSS, we make no distinction between first-pass and final-pass tables, giving a program that generates probabilities for all values of k within the range specified by the user.

A note to the programmer trying to understand these commands and macros: In all of these packages, the basic binomial command yields the probability of s or fewer positive outcomes, while we want the probability of s or more. Thus we repeatedly use the fact that the probability of s or more positive outcomes is 1 minus the probability of s - 1 or fewer such outcomes.

SYSTAT

The SYSTAT command "NCF(s, k, alpha)" returns the probability of s or fewer positive outcomes in k trials, where alpha is the probability of each outcome. Any or all of the three arguments (the quantities in parentheses) may be either single values or columns of data in the SYSTAT worksheet. At the far left of the SYSTAT worksheet is a column consisting of row numbers (1, 2, 3, ...) with the heading "CASE." That column can be used in SYSTAT commands through the word CASE.

In SYSTAT, a deletion table can be created with two commands of the following form:

repeat Q
let p = 1 - ncf(s - case, k + 1 - case, alpha)

The user should replace the values as follows. Replace Q by the number of probabilities he or she wants to calculate. Replace s by the observed number of positive outcomes. Replace k + 1 by 1 more than the total number of studies. Replace alpha by the significance criterion used in defining positive outcomes, typically .5, .1, .05, or .01. Thus, if 10 studies out of 40 were significant beyond the .1 level, the commands might read

repeat 6
let p = 1 - ncf(10 - case, 41 - case, .1)

The first entry in the returned column will be the probability of s or more positive outcomes in k studies, the second will be the probability of s - 1 or more such outcomes in k - 1 studies, etc. Since the first value of CASE is 1, the first value of p calculated will be 1 - ncf(s - 1, k, alpha). To understand why that should be the first value, review the last paragraph before the SYSTAT section.

For file-drawer tables, first create a column labeled k, showing the values of k you wish to analyze. Then run the single command

let p = 1 - ncf(s - 1, k, alpha)

Thus for the previous example the command might be

let p = 1 - ncf(9, k, .1)

Because there is a column with the heading k, SYSTAT will interpret the entry k in this command as referring to that column, and will return a value of p for each of those values of k.

Minitab

In Minitab, deletion tables and file-drawer tables are most easily constructed by invoking the macros described below. These macros were written for Minitab 12 and are not guaranteed to work in earlier versions of Minitab. A macro can be typed in any ordinary text editor such as the Windows Notepad, and then saved in the subdirectory entitled "Macros" under the Minitab directory. Be sure the macro is saved in text or ASCII format and use the suffix "mac" to identify the file as a macro. After saving the macro, double-check to make sure the operating system or text editor has not added any extra suffix such as "txt" after the "mac" suffix; delete any such extra suffix.

The following macro creates a binomial deletion table. It is designed to be placed in a file named "binomdel.mac".

macro
binomdel s k remove tp rescol
mconstant s sm k bp tp n i remove
mcolumn rescol
do i = 0:remove
let n = k-i
let sm = s-1-i
cdf sm bp;
binomial n tp.
let rescol(i+1) = 1-bp
enddo
endmacro

This macro has five "arguments": values you enter when calling the macro. They are: (a) s, (b) k, (c) the largest number of positive outcomes you want to remove, (d) alpha, the probability of each positive outcome, and (e) the column into which you want results to be placed. In Minitab, a column number should be preceded with a c; thus c8 is column 8.

In Minitab, you invoke a macro by going to the command window and typing in the macro name preceded by "%" and followed by the arguments, in the order just listed. For instance, suppose you have observed 10 positive outcomes in 40 studies with α = .1, you want to study the effect of removing up to 6 of the positive outcomes, and you want the calculated probabilities placed in Column 4. Then execute the command

%binomdel 10 40 6 .1 c4

The Minitab macro below is for binomial file-drawer analyses and is designed to be placed in the file "binomfd.mac".

macro
binomfd s tp nn ncol rescol
mconstant s i n bp tp fp fail nn
mcolumn ncol rescol
let fp = 1 - tp


do i = 1:nn
let n = ncol(i)
let fail = n - s
cdf fail bp;
binomial n fp.
let rescol(i) = bp
enddo
endmacro

The macro assumes you have created a column of values of k you want to study, for a single value of s and a single value of alpha. The macro has five arguments: (a) s, (b) alpha, (c) the number of values of k entered, (d) the column in which values of k have been entered, and (e) the column into which the macro's results should be placed.

For instance, suppose you have s = 10, α = .1, you have placed 30 values of k in column 4, and you want the macro's results placed in column 5. Then enter the command

%binomfd 10 .1 30 c4 c5

SPSS

The SPSS transformation command, cdf.binomial(s, k, alpha), returns the probability of s or fewer positive outcomes in k trials, where alpha is the probability of each outcome. This transformation command is used within the command sets below to generate a full binomial table, a deletion table, and a file-drawer table. The commands below can be entered into a syntax window and executed as a batch. The user needs to modify a few of the commands to customize the table to the needs of the analysis.

The first command set generates a full binomial probability table. In this program, the minimum and maximum values of k to be displayed as rows in the table are set in Line 3 (currently set for k between 1 and 100). The command, loop #k = 20 to 200., for instance, would generate a table for k between 20 and 200, inclusive. The smallest value for s to be displayed is set in Line 5 (currently set for s = 1), and the minimum and maximum values for s in the table are defined in Line 6. Currently, the program produces p values for s between 1 and 12. If the user desired p values for s between 3 and 12, for instance, Line 5 should be changed to read "compute #s = 3." and Line 6 should be changed to read "do repeat p = s_3 to s_12." The significance criterion used in defining positive outcomes is set in Line 8 (currently set for .05).

new file.
input program.
loop #k = 1 to 100.
compute k = #k.
compute #s = 1.
do repeat p = s_1 to s_12.
do if #s < #k+1.
compute p = 1-cdf.binomial(#s-1, #k, .05).
end if.
compute #s = #s+1.
end repeat.
end case.
end loop.
end file.
end input program.
formats all (f8.7).
execute.

The next command set generates a deletion table showing the p value for pairs of k and s, where k and s are decreased by 1 in each row. The starting values of k and s are set in Lines 4 and 5, and the maximum number of deletions is set in Line 3. The alpha level defining a positive outcome is set in Line 6. It is important that the number of deletions set in Line 3 is no greater than the initial value of s. The program is currently configured to generate a deletion table with 10 rows, containing p values for (k = 30, s = 10), (k = 29, s = 9), (k = 28, s = 8), and so on, using α = .05 as the criterion for significance in the individual studies. To generate a deletion table with 15 rows starting at k = 50 and s = 20, for instance, Line 3 would be changed to read "loop #k = 1 to 15.", Line 4 would be changed to read "compute k = 50-#k+1.", and Line 5 would be changed to read "compute s = 20-#k+1."

new file.
input program.
loop #k = 1 to 10.
compute k = 30-#k+1.
compute s = 10-#k+1.
compute p = 1-cdf.binomial(s-1, k, .05).
end case.
end loop.
end file.
end input program.
formats p (f8.7).
execute.

The next command set generates a file drawer table showing the p values for a fixed value of s and increasing values of k. As currently configured, the command set generates a table for s = 3 and k = 3 to 1,000, using α = .05 as the criterion for a positive outcome in the individual studies. The value for s is set in Line 3, and the maximum value of k displayed is set in Line 4. To generate a file drawer table for s = 10, maximum k = 2,000, for instance, Line 3 would be changed to read "compute #s = 10.", and Line 4 would be changed to read "loop #k = #s to 2000."

new file.
input program.
compute #s = 3.
loop #k = #s to 1000.
compute k = #k.
compute p = 1-cdf.binomial(#s-1, k, .05).
end case.
end loop.
end file.
end input program.
formats p (f8.7).
execute.
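
For readers working outside these three packages, the same tables can be produced in a few lines of Python with the SciPy library, whose binom.sf(s - 1, k, alpha) gives exactly the needed probability of s or more positive outcomes. The sketch below is offered only as an illustration.

from scipy.stats import binom

def deletion_table(s, k, alpha, n_del):
    # P(s-i or more positives in k-i trials), for i = 0 .. n_del-1.
    return [binom.sf(s - i - 1, k - i, alpha) for i in range(n_del)]

def file_drawer_table(s, alpha, k_values):
    # P(s or more positives in k trials) for each k supplied.
    return [binom.sf(s - 1, k, alpha) for k in k_values]

print(deletion_table(10, 40, .1, 6))               # 10 positives in 40 studies
print(file_drawer_table(10, .1, range(40, 101, 10)))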


Appendix B

The Stouffer-max Tables


The second part of this Appendix gives our estimates of the accuracy of the entries in the Stouffer-max tables, and does not require that the first part of the Appendix be read first.

How the Tables Were Generated

Assume that k mutually independent significance levels have been converted to zs and have been sorted from highest to lowest. Define the test statistic MeanZ(s,k) as the mean of the s highest values of z. Define CMZ(s,k,α) (pronounced critical mean z) as the critical value for MeanZ(s,k) at the α level of significance. We used simulation methods to estimate values of CMZ(s,k,α) for values of s from 2 to 50, for values of k from s to 1,000, and for the following seven values of α: .1, .05, .025, .01, .005, .0025, and .001. No estimates are needed for s = 1 because those values can be calculated exactly by the well-known formula repeated in the main paper.

Our tabled values of CMZ range from 0.181 (for s = k = 50, α = .1) to 4.141 (for s = 2, k = 1,000, α = .001). All values of CMZ are reported to three decimal places, like the values just given. Therefore, we tried to estimate all values of CMZ to three decimal places so that the tabled values could be regarded as uninfluenced by sampling error. We believe we met that goal for most of the entries in the tables and came close to the goal for the rest. Support for that claim appears in the next section of this Appendix. The current section describes how the tables were generated.

Consider the problem of generating CMZ values for a particular value of k, such as k = 100. One generates a column of 100 artificial values of z, sorts them from highest to lowest, then computes the cumulative sum of the first 50 entries (because we studied values of s only up to 50), then divides each sum by the appropriate value (from 1 to 50) to transform it into a value of MeanZ. Then repeat this many times to find the critical values for each value of MeanZ.
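
A condensed Python version of one such simulation block appears below, offered only as a sketch; the published tables average many blocks and then smooth across k, so values from this sketch will be noisier than the tabled ones.

import numpy as np

def cmz_block(k, s_max=50, alphas=(.1, .05, .025, .01, .005, .0025, .001),
              trials=40_000, seed=0):
    # One block: `trials` columns of k standard normals; running means of
    # the sorted top min(s_max, k) values; upper-alpha order statistic.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((trials, k))
    top = -np.sort(-z, axis=1)[:, :min(s_max, k)]    # sorted high to low
    mean_z = np.cumsum(top, axis=1) / np.arange(1, top.shape[1] + 1)
    mean_z.sort(axis=0)                              # each column ascending
    out = {}
    for a in alphas:
        idx = trials - int(a * trials)               # e.g., 2,000th from top
        for s in range(1, top.shape[1] + 1):
            out[(s, a)] = mean_z[idx, s - 1]
    return out

est = cmz_block(10, trials=400_000)
print(round(est[(3, .05)], 3))    # should be near the tabled 1.772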

This process requires little extra computation to increase the number of values of s or alpha studied, but all that computation pertains to a single value of k. When the entire process is repeated for k = 101, entirely new random numbers are used. Thus, the 343 (7 × 49) values of CMZ computed for a single value of k are not mutually independent, but CMZ values for different values of k are statistically independent. Therefore, it is reasonable to save computing time by taking all values of CMZ computed for a single combination of s and alpha and fitting a smooth curve to those values across the many values of k. That is the method we used. The curve we fitted was not a straight line, but for simplicity imagine for a moment that we were fitting

straight lines to segments of such a curve. One could have much more faith that the 21 true CMZ values for k from 810 to 830 would approximate a straight line than that the 21 true CMZ values for k from 10 to 30 would do so. For our purposes, k values from 810 to 830 are much closer together than k values from 10 to 30. Thus it is reasonable to fit a smooth curve to values of log(k) rather than to values of k. However, that line of thought suggests that one wants more data for smaller values of k than for larger values. Because the derivative of log(k) with respect to k is 1/k, it follows that, on a log scale, the difference between adjacent values of k is roughly proportional to 1/k. Thus, as a rough guide, the amount of data one would want for each value of k is proportional to 1/k. We followed that rule roughly.

We used regression methods to fit smooth curves, so it seemed prudent to have a certain amount of data for values of k higher than we planned to report (that is, for k > 1,000) to keep the regression curve running smoothly over the entire range we planned to report. Therefore, we actually computed data for k values up to 1,200.

We estimated CMZ values in blocks of 40,000 trials each. In any one block, we computed 40,000 values of MeanZ for each value of s. Then we sorted those 40,000 values from highest to lowest. Then, for instance, we took the 2,000th sorted value as that block's estimate of CMZ for α = .05, as 2,000/40,000 = .05. But we used many blocks (of 40,000 trials each) for most values of k and averaged the CMZ values across blocks. The following table shows the number of blocks used for each value of k. It actually shows only those values of k at which the number of blocks changed from the previous k value. Each value of k is followed, in parentheses, by the number of blocks used. Thus 1,666 blocks of 40,000 trials each were used for k = 3, and 1,500 blocks were used for k = 4. The number of blocks changed for every k up to 33, then generally declined for higher values of k. For various scheduling reasons, we occasionally used more blocks for some higher values of k than for some lower values. There's nothing wrong with that; it merely means we had to run more blocks to reach our overall goals of accuracy.

3 (1,666), 4 (1,500), 5 (1,400), 6 (1,332), 7 (1,284), 8 (1,250), 9 (1,222), 10 (1,200), 11 (450), 12 (415), 13 (380), 14 (355), 15 (330), 16 (310), 17 (290), 18 (275), 19 (260), 20 (250), 21 (235), 22 (225), 23 (215), 24 (205), 25 (200), 26 (240), 27 (235), 28 (225), 29 (220), 30 (215), 31 (210), 32 (205), 33 (200), 50 (110), 60 (109), 64 (108), 68 (107), 71 (81), 72 (80), 78 (79), 84 (78), 92 (77), 101 (42), 112 (41), 126 (40), 141 (30), 144 (29), 168 (28), 201 (37), 251 (36), 301 (6), 334 (5), 501 (4), 915 (5), 1,001 (4), 1,053 (3), 1,137 (2), 1,140 (1)


The number of separate data points (one for each value of k) in any given regression curve ranged from 1,199 (1,200 - 1) when s = 2, down to 1,151 (1,200 - 49) when s = 50. But these data points were not weighted equally. Weighted least squares was used, with each data point given a weight equal to the number of blocks used for that value of k. But the first data point in each regression, for which k = s, can be calculated exactly by the Stouffer formula. So that was done, and this exact value was arbitrarily given a weight of 1 million.

Overall, we ran 35,482 blocks of 40,000 trials each, making almost 1.42 billion total trials. Each "trial" was a column of 3 to 1,200 values of z; the total number of zs generated exceeded 185 billion. Each trial contributed to all 343 values of CMZ (for 343 different combinations of s and α) for that trial's value of k, so all 185 billion values of z contributed to each regression. These calculations do not include the arbitrary weight of 1 million given to the one data point in each regression that could be calculated exactly; these data points presumably further improved the accuracy of the regressions. It was especially useful that these exact values fell at the far left edge of each regression (at the lowest value of k in the regression), as ordinarily regression predictions are least accurate at the edges.

Quadratic spline regression was used, with 14 break-points. The total number of terms in such a regression is 1 for each break-point, plus overall linear and quadratic terms, plus an additive constant, making 17 parameters altogether. This seemed to produce good fit to the data.

In weighted least squares, as in ordinary least squares, one can calculate for each data point a value called SEPRED. SEPRED for ordinary least squares is discussed by Darlington (1990, p. 355), whereas Neter, Wasserman, and Kutner (1990, pp. 418-420) describe weighted least squares. SEPRED stands for "standard error of prediction." As the name implies, it is the estimated standard error with which true Y is estimated at that point in the regression. These SEPRED values are the standard error (SE) values described in the next section.

A typical regression curve was nearly straight for higher values of k and became increasingly curved (with negative second derivative) for lower values of k. Thus, we concentrated the break-points on the left end of the curve. As mentioned, the first data point in each curve was for k = s. Then break-points were placed at data points 2, 3, 5, 8, 13, 20, 30, 45, 60, 75, 100, 140, 200, and 300. The concentration on the left was not nearly as great as might first appear from these values, as a log scale was used for k. Thus for a typical regression, 300 is actually about 80% of the way along the scale rather than the 25% one might calculate from the fact that 300/1,200 = .25. Because curvature increased toward the left, we defined the spline terms from right to left rather than the usual left to right.

The GAUSS program fitting the curves is shown below. The output of the program is contained in the three matrices SEPRM, STRM, and HATYM. Each of these matrices is 1,000 rows by 350 columns, though the first 7 columns of each matrix were left blank because there was no need to calculate values for s = 1. As mentioned earlier, more than 1,000 values were actually computed, but values for k > 1,000 were not saved. Matrix HATYM contains the estimates of CMZ that appear in the tables, Matrix SEPRM contains the aforementioned values of SEPRED, and Matrix STRM contains standardized residuals. These were inspected to determine where the curve was not fitting the data well. First the overall form of the regression, and then later the specific break-points in the quadratic spline regression, were selected to minimize the absolute values of the largest residuals. Although it took hundreds of hours of computer time to compute and sort the 185 billion z values used to calculate the unsmoothed CMZ values, the curve-smoothing process itself took only a few minutes of computer time.

format /rd 3,0;
strm = zeros(1000,350);
seprm = strm;
hatym = strm;
let ncuts = 2 3 5 8 13 20 30 45 60 75 100 140 200 300;
nseg = rows(ncuts)+1;
j = 7;
do until j == 350;
j = j+1;
y = z10[begv[j]:1200,j];
n = rows(y);
lnsq = ln(seqa(begv[j],1,n));
xcuts = lnsq[ncuts];
w = wfull8[begv[j]:1200];
w[1] = 1000000;
x = ones(n,nseg+2);
x[.,2] = lnsq[n]-lnsq;
x[.,3] = x[.,2].^2;
i = 3;
do until i == nseg+2;
i = i+1;
x[.,i] = (lnsq[1:ncuts[i-3]]-xcuts[i-3])|zeros(n-ncuts[i-3],1);
x[.,i] = x[.,i].^2;
endo;
q = sqrt(w);
yq = y.*q;
xq = x.*q;
b = yq/xq;
haty = x*b;
e = y-haty;
mse = e'(w.*e)/(rows(x)-cols(x));
ci = invpd(x'(w.*x));
hws = sumc(ci*x'.*x');
str = (e.*q)./sqrt((1-hws)*mse);
hatym[1:n-200,j] = trimr(haty,0,200);
strm[1:n-200,j] = trimr(str,0,200);
sepred = sqrt(mse*hws);
seprm[1:n-200,j] = trimr(sepred,0,200);
locate 17,1;
j;;
endo;

The Accuracy of the Stouffer-max Tables

Think of the CMZ values as being arranged in 343 columns, each of which pertains to a particular combination of s and α and contains values of k ranging from s to 1,000. We calculated an estimated SE for each CMZ value in the tables. Then we took the maximum estimated SE value for each of the 343 columns. The 343 columns can be characterized as follows:

Conditions            NC*   max(SE)
s = 2, α = .001         1   .000608
s > 2, α = .001        48   .000481
s = 2, α = .0025        1   .000403
All other cases       293   .000298

*NC = number of columns meeting the condition.

Page 20: Combining Independent Values: Extensions of the …psych.colorado.edu/~willcutt/pdfs/Darlington_2000.pdf · 2012-12-05 · Combining Independent p Values: ... studies not included

PROBABILITY POOLERS 515

Consider first the 293 columns for which all values of SE fell at or below .000298. Because .001/.000298 = 3.35, in those columns an error of .001 or larger is made only if an estimate falls 3.35 or more SEs from its expected value. The probability that any one error will reach this level is about 1 in 1,200. Therefore, for these 293 columns we have near certainty that a reported value of CMZ is accurate to three decimal places.

Now consider the 48 columns for which s > 2 and α = .001, plus the 1 column for which s = 2 and α = .0025. The maximum SE in these 49 columns is .000481. Thus in these columns, an error of .001 or larger requires a value to fall over 2 SEs from its expected value. Thus for these values, we have 95% confidence in accuracy to three decimal places.

By the same argument, in the one remaining column, for which s = 2 and α = .001, we calculate 90% confidence in accuracy to three decimal places.

Thus, overall, the tables can be characterized as being accurate to three decimal places, but with some qualifications to those statements for the very smallest values of s and α.

Received May 29, 1998
Revision received August 20, 1999
Accepted August 30, 2000
