NULL HYPOTHESIS TESTING: PROBLEMS, PREVALENCE, AND AN ALTERNATIVE


DAVID R. ANDERSON,1 Colorado Cooperative Fish and Wildlife Research Unit, Room 201 Wagar Building, Colorado State University, Fort Collins, CO 80523, USA

KENNETH P. BURNHAM,1 Colorado Cooperative Fish and Wildlife Research Unit, Room 201 Wagar Building, Colorado State University, Fort Collins, CO 80523, USA


WILLIAM L. THOMPSON, U.S. Forest Service, Rocky Mountain Research Station, 316 E. Myrtle St., Boise, Idaho 83702, USA

Abstract: This paper presents a review and critique of statistical null hypothesis testing in ecological studies in general, and wildlife studies in particular, and describes an alternative. Our review of Ecology and the Journal of Wildlife Management found the use of null hypothesis testing to be pervasive. The estimated number of P-values appearing within articles of Ecology exceeded 8,000 in 1991 and has exceeded 3,000 in each year since 1984, whereas the estimated number of P-values in the Journal of Wildlife Management exceeded 8,000 in 1997 and has exceeded 3,000 in each year since 1994. We estimated that 47% (SE = 3.9%) of the P-values in the Journal of Wildlife Management lacked estimates of means or effect sizes, or even the sign of the difference in means or other parameters. We find that null hypothesis testing is uninformative when no estimates of means or effect size and their precision are given. Contrary to common dogma, tests of statistical null hypotheses have relatively little utility in science and are not a fundamental aspect of the scientific method. We recommend their use be reduced in favor of more informative approaches. Toward this objective, we describe a relatively new paradigm of data analysis based on Kullback-Leibler information. This paradigm is an extension of likelihood theory and, when used correctly, avoids many of the fundamental limitations and common misuses of null hypothesis testing. Information-theoretic methods focus on providing a strength of evidence for an a priori set of alternative hypotheses, rather than a statistical test of a null hypothesis. This paradigm allows the following types of evidence for the alternative hypotheses: the rank of each hypothesis, expressed as a model; an estimate of the formal likelihood of each model, given the data; a measure of precision that incorporates model selection uncertainty; and simple methods to allow the use of the set of alternative models in making formal inference. We provide an example of the information-theoretic approach using data on the effect of lead on survival in spectacled eider ducks (Somateria fischeri). Regardless of the analysis used, we strongly recommend that inferences based on a priori considerations be clearly separated from those resulting from some form of data dredging.

JOURNAL OF WILDLIFE MANAGEMENT 64(4):912-923

Key words: AIC, Akaike weights, Ecology, information theory, Journal of Wildlife Management, Kullback-Leibler information, model selection, null hypothesis, P-values, significance tests.

1 Employed by U.S. Geological Survey, Division of Biological Resources. E-mail: [email protected]

Theoretical and applied ecologists continually strive for rigorous, objective approaches for making valid inference concerning science questions. The dominant, traditional approach has been to frame the question in terms of 2 contrasting statistical hypotheses: 1 representing no difference between population parameters of interest (i.e., the null hypothesis, H0) and the other representing either a unidirectional or a bidirectional alternative (i.e., the alternative hypothesis, Ha). These hypotheses basically correspond to different models. For example, when comparing 2 groups of interest, the assumption is that they are from the same population, so that the difference between their true means is 0 (i.e., H0 is μ1 - μ2 = 0, or μ1 = μ2).

A test statistic is computed from sample data and compared to its hypothesized null distribution to assess the consistency of the data with the null hypothesis. More extreme values of the test statistic suggest that the sample data are not consistent with the null hypothesis. A substantially arbitrary level (α) is often preset to serve as a cutoff (i.e., the basis for a decision) for statistically significant versus statistically nonsignificant results. This procedure has various names, including null hypothesis testing, significance testing, and null hypothesis significance testing. In fact, this procedure is a hybridization of Fisher's (1928) significance testing and Neyman and Pearson's (1928, 1933) hypothesis testing (Gigerenzer et al. 1989, Goodman 1993, Royall 1997).



There are a number of problems with the application of the null hypothesis testing approach, some of which we present herein (Carver 1978, Cohen 1994, Nester 1996). Although doubts among statisticians concerning the utility of null hypothesis testing are hardly new (Berkson 1938, 1942; Yates 1951; Cox 1958), criticisms have increased in the scientific literature in recent years (Fig. 1). Over 300 references now exist in the scientific literature that warn of the limitations of statistical null hypothesis testing. A list of citations is located at http://www.cnr.colostate.edu/~anderson/thompson1.html and http://www.cnr.colostate.edu/~anderson/nestech.html. The former website also includes a link to a list of papers supporting the use of tests. We believe that few wildlife biologists and ecologists are aware of the debate regarding null hypothesis testing among statisticians. Discussion and debate have been particularly evident in the social sciences, where at least 3 special features (Journal of Experimental Education 61(4); Psychological Science 8(1); Research in the Schools 5(2)) and 2 edited books (Morrison and Henkel 1970, Harlow et al. 1997) have debated the utility of null hypothesis tests in scientific research. The ecological sciences have lagged behind other disciplines with respect to awareness and discussion of problems associated with null hypothesis testing (Fig. 1; Yoccoz 1991; Cherry 1998; Johnson 1999).

We present information concerning the prevalence of null hypothesis testing by reviewing papers in Ecology and the Journal of Wildlife Management. We chose Ecology because it is a basic ecology journal that reflects statistical usage in the ecological field as a whole. We chose the Journal of Wildlife Management as an applied journal for comparison. We review theoretical or philosophical problems with the null hypothesis testing approach as well as its common misuses. We offer a practical, theoretically sound alternative to null hypothesis testing and provide an example of its use. We conclude with our views concerning data analysis and the presentation of scientific results, as well as our recommendations for changes in editorial and review policies of biological and ecological journals.

PROBLEMS WITH NULL HYPOTHESIS OR SIGNIFICANCE TESTING

The fundamental problem with the null hypothesis testing paradigm is not that it is wrong (it is not), but that it is uninformative in most cases, and of relatively little use in model or variable selection. Statistical tests of null hypotheses are logically poor (e.g., the arbitrary declaration of significance). Berkson (1938) was one of the first statisticians to object to the practice.

The most curious problem with null hypothesis testing, as the primary basis for data analysis and inference, is that nearly all null hypotheses are false on a priori grounds (Johnson 1995). Consider the null H0: θ0 = θ1 = θ2 = . . . = θ5, where θ0 is an expected control response and the others are ordered treatment responses (e.g., different nitrogen levels applied to agricultural fields). This null hypothesis is almost surely false as stated. Even the application of sawdust would surely make some difference in response. The rejection of this strawman hardly advances science (Savage 1957), nor does it give meaningful insights for conservation, planning, management, or further research. These issues should properly focus on the estimation of effects or differences and their precision, not on testing a trivial (uninformative) null. Other general examples of a priori false null hypotheses include (1) H0: μc = μa (mean growth rate is equal in control vs. aluminum-treated bullfrogs, Rana catesbeiana); (2) H0: Sjc = Sjt (survival probability in week j is the same for control vs. lead-dosed gull chicks, Larus sp.); and (3) H0: ρ = 0 (zero correlation between variables Y and X). Johnson (1999) provided additional examples of null hypotheses that are obviously false.



A further problem is that a particular α-level is without theoretical basis and is therefore arbitrary except for the adoption of conventional values (commonly 0.1, 0.05, or 0.01, but often 0.15 in stepwise variable selection). The resulting dichotomy sorts results into the meaningless categories significant and nonsignificant and is relatively uninformative. This Neyman-Pearson approach is an arbitrary reject or not-reject decision when the substantive issue is one of strength of evidence concerning a scientific issue (Royall 1997) or estimation of the size of an effect.

Consider an example from a recent issue of the Wildlife Society Bulletin: "Response rates did not vary among areas (χ² = 16.2, 9 df, P = 0.06)." Had the P-value been 0.01 lower, the conclusion would have been that significant differences were found, and the estimates and their precision given. Alternatively, had the arbitrary α-level been 0.10 initially, the result would have been quite different (i.e., response rates varied among areas, χ² = 16.2, 9 df, P = 0.06). Here, as in most cases, the null hypothesis was false on a priori grounds. Many examples can be found where contradictory or nonsensical results have been reported (Johnson 1999). Legal hearings concerning scientific issues are unproductive and lead to confusion when 1 party claims significance (based on α = 0.1), whereas the opposing party argues nonsignificance (based on α = 0.05).

The cornerstone of null hypothesis testing, the P-value, has problems as an inferential tool that stem from its very definition, its application in observational studies, and its interpretation (Cherry 1998, Johnson 1999). The P-value is defined as the probability of obtaining a test statistic at least as extreme as the observed one, conditional on the null hypothesis being true. There are 2 important points to consider about this definition. First, a P-value is based not only on the observed result (the data collected), but also on more extreme results that were never observed; thus it is more a statement about data that never occurred than it is a concise statement about the actual, observed event (i.e., the data). Bayesians (people making statistical inferences using Bayes' theorem; Gelman et al. 1995) find this property of P-values objectionable; they tend to avoid null hypothesis testing altogether. Second, a P-value is explicitly conditional on the null hypothesis (i.e., it is computed based on the distribution of the test statistic assuming the null hypothesis is true). The null distribution of the test statistic (e.g., often assumed to be F, t, z, or χ²) may closely match the actual sampling distribution of that statistic in strict experiments, but this property does not hold in observational studies. In these latter studies, the distribution of the test statistic is unknown because its distribution under the null hypothesis is not deducible from the study design. Consequently, the form of the distribution is not known, only naively assumed, which makes interpretation of test results problematic.

It has long been known and criticized that the P-value is dependent on sample size (Berkson 1938). One can always reject a null hypothesis with a large enough sample, even if the true difference is trivially small. This points to the difference between statistical significance and biological importance raised by Yoccoz (1991) and many others before and since. Another problem is that using a fixed α-level (e.g., 0.1) to decide to reject or not reject the null hypothesis makes little sense as sample size increases. Here, even when the null hypothesis is true and sample size is infinite, a Type I error (rejecting a null that is true) still occurs with probability α (e.g., 0.1), and therefore this approach is not consistent (theoretically, α should go to zero as n goes to infinity). Still another issue is that the P-value does not provide information about either the size or the precision of the estimated effect. The solution here is to merely present the estimate of effect size and a measure of its precision.
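The sample-size dependence is easy to see directly. The short Python sketch below is our own illustration (not part of the original analysis): it evaluates the two-sided P-value of a simple two-sample z-test at a fixed, biologically trivial true difference as sample size grows, while the effect size and its standard error remain the informative summary.

# Illustration: with a fixed, trivially small true difference, the P-value of a
# two-sample z-test shrinks toward 0 as n grows; the effect estimate does not change.
import math
from statistics import NormalDist

TRUE_DIFF = 0.05   # hypothetical, biologically trivial difference in means (SD units)
for n in (50, 500, 5_000, 50_000):                 # sample size per group
    se = math.sqrt(2.0 / n)                        # SE of the difference in means (sigma = 1)
    z = TRUE_DIFF / se                             # z statistic at the true difference
    p = 2.0 * (1.0 - NormalDist().cdf(z))          # two-sided P-value
    print(f"n per group = {n:>6d}   effect = {TRUE_DIFF:.2f}   SE = {se:.4f}   P = {p:.3g}")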


A pervasive problem in the use of P-values is their misinterpretation as evidence for either the null or the alternative hypothesis (see Ellison 1996 for recent examples of such misuse). The proper interpretation of the P-value is based on the probability of the data given the null hypothesis, not the converse. We cannot accept or prove the null hypothesis, only fail to reject it. The P-value cannot validly be taken as the probability that the null hypothesis is true, although this is often the interpretation given. Similarly, the magnitude of the P-value does not indicate a proper strength of evidence for the alternative hypothesis (i.e., the probability of Ha, given the data), but rather the degree of consistency (or inconsistency) of the data with H0 (Ellison 1996). Phrases such as highly significant (often denoted as ** or even ***) only reinforce this error in the interpretation of P-values (Royall 1997).

Presentation of only P-values also limits the effectiveness of (future) meta-analyses. There is a strong publication bias whereby only significant P-values tend to get reported (accepted) in the literature (Hedges and Olkin 1985:285-290, Iyengar and Greenhouse 1988). Thus, the published literature is itself biased in favor of results arbitrarily deemed significant. It is important to present parameter estimates (effect size) and their precision from any well designed study, regardless of the outcome; these become the relevant data for a meta-analysis.

A host of other problems exist in the null hypothesis testing paradigm, but we will mention only a few. We generally lack a rigorous theory for testing null hypotheses when a model contains nuisance parameters (e.g., sampling probabilities in capture-recapture studies). The distribution of the likelihood ratio test statistic between models that are not nested is unknown, and this makes comprehensive analysis problematic. Given the prevalence of null hypothesis testing, we warn against the invalid notion of post hoc or retrospective power analysis (Goodman and Berlin 1994, Gerard et al. 1998) and note that this practice has become more common in recent years.

The central issues here are twofold. First, scientists are fundamentally interested in estimates of the magnitude of the differences and their precision, the so-called effect size. Is the difference trivial, small, medium, or large? Is this difference biologically meaningful? This is an estimation problem. Second, one often wants to know if the differences are large enough to justify inclusion in a model to be used for inference in more complex science settings. This is a model selection problem. These central issues that further our understanding and knowledge are not properly addressed with statistical hypothesis testing. Statistical science is much more than merely significance testing, even though many statistics courses are still offered with an unfounded emphasis on null hypothesis testing (Schmidt 1996). Many statisticians question the practical utility of hypothesis testing (i.e., the arbitrary α-levels, the false null hypotheses being tested, and the notion of significance) and stress the value of estimation of effect size and associated precision (Goodman and Royall 1988, Graybill and Iyer 1994:35).

PREVALENCE OF FALSE NULL HYPOTHESES AND P-VALUES

We randomly sampled 20 papers in the Articles section from each volume of Ecology for years 1978-97 to assess the prevalence of trivial null hypotheses and associated P-values in published ecological studies. We then randomly sampled 20 papers from each volume of the Journal of Wildlife Management (JWM) for years 1994-98 for comparison. In each sampled article, we noted whether the null hypotheses tested seemed at all plausible. In addition, we counted the number of P-values and equivalent symbols, such as statistics with superscripted asterisks or comparisons specifically marked nonsignificant. We tallied the number of cases where only a P-value was given (some papers also provided the test statistic, degrees of freedom, or sample size), without an estimate of effect size, its sign, or its precision, even in an associated table, for papers appearing in JWM during the 1994-98 period. However, our counts did not include comparisons that were both nonsignificant and unlabeled or unspecified, nor did they include all possible statistical comparisons or tests. Consequently, ours is an underestimate of the total number of statistical tests and associated P-values contained within each article.
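The paper does not spell out the estimator behind the per-volume totals reported in Tables 1 and 2. A natural choice, sketched below in Python under that assumption, scales the mean count in the 20 sampled papers (and the standard error of that mean) by the number of articles published in the volume; the counts and volume size shown are hypothetical.

# One plausible estimator for the per-volume totals (our assumption, not the
# authors' stated formula): scale the sample mean and its SE by the volume size.
import math

def estimate_volume_total(sampled_counts, n_articles_in_volume):
    n = len(sampled_counts)
    mean = sum(sampled_counts) / n
    var = sum((c - mean) ** 2 for c in sampled_counts) / (n - 1)
    se_mean = math.sqrt(var / n)
    return n_articles_in_volume * mean, n_articles_in_volume * se_mean

# Hypothetical counts of P-values in 20 sampled articles and a hypothetical volume size.
counts = [4, 31, 12, 0, 25, 9, 17, 44, 6, 13, 22, 3, 38, 10, 15, 7, 28, 19, 11, 26]
total, se = estimate_volume_total(counts, n_articles_in_volume=250)
print(f"estimated total P-values in the volume: {total:.0f} (SE {se:.0f})")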

In the 347 sampled articles in Ecology containing null hypothesis tests, we found few examples of null hypotheses that seemed biologically plausible. Perhaps 5 of 95 articles in JWM contained ≥1 null hypothesis that could be considered a plausible alternative. Only 2 of 95 articles in JWM incorporated biological importance into the interpretations of results; the remainder merely used statistical significance.



Table 1. Median, mean (SE), and range of the number of P-values per article, and estimated total (SE) number of P-values per year, based on a random sample of 20 papers each year from the Articles section of Ecology for 1978-97.

In the vast majority of cases, the null hypotheses we found in both journals seemed to be obviously false on biological grounds even before these studies were undertaken. A major research failing seems to be the exploration of uninteresting or even trivial questions. Common examples included null hypotheses assuming survival probabilities were the same between juveniles and adults of a species, assuming no correlation or relationship existed between variables of interest, assuming density of a species remained the same across time, assuming net primary production rates were constant across sites and years, and assuming growth rates did not differ among individuals or species.

We estimate that there have been a minimum of several thousand P-values appearing in every volume of Ecology (Table 1) and JWM (Table 2) in recent years. Given the conservatism of our counting procedure, the number of null hypothesis tests that were actually performed in each study was probably much larger. Approximately 47% (SE = 3.9%) of the P-values that we counted in JWM appeared alone, without estimated means, differences, effect sizes, or associated measures of precision. Such results, we maintain, are particularly uninformative (e.g., not even the sign of the difference being indicated). The key problem here is the general failure to explore more relevant questions and to report informative summary statistics (e.g., estimates of effect size and their precision), even when significance was found. The secondary problem is not recognizing the arbitrariness of α, hence perpetuating an arbitrary classification of results as significant or not significant.

A PRACTICAL ALTERNATIVE TO NULL HYPOTHESIS TESTING

We advocate Chamberlin's (1890, 1965) concept of multiple working hypotheses, rather than a single statistical null versus an alternative; this seems like superior science. However, this approach leads to the multiple testing problem in statistical hypothesis testing, and to arbitrariness in the choice of α-level and of which hypothesis is to serve as the null. Although commonly used in practice, significance testing is a poor approach to model selection and variable selection in regression analysis, discriminant function analysis, and similar procedures (Akaike 1974, McQuarrie and Tsai 1998:427-428).

Akaike (1973, 1974) developed data analysis procedures that are now called information-theoretic because they are based on Kullback-Leibler (1951) information. Kullback-Leibler information is a fundamental quantity in the sciences and has earlier roots back to Boltzmann's concept of entropy.



Table 2. Median, mean (SE), and range of the number of P-values per article, and estimated total (SE) number of P-values per year, based on a random sample of 20 papers (excluding Invited Papers and Comment/Reply articles) each year from the Journal of Wildlife Management for years 1994-98.

The Kullback-Leibler information between conceptual truth, f, and approximating model g is defined for continuous functions as the integral

I(f, g) = \int f(x) \log_e\left(\frac{f(x)}{g(x \mid \theta)}\right) dx,

where f and g are n-dimensional probability distributions. Kullback-Leibler information, denoted I(f, g), is the information lost when model g is used to approximate truth, f. The right-hand side looks difficult to understand; however, it can be viewed as a statistical expectation of the natural logarithm of the ratio of f (full reality) to g (approximating model). That is, Kullback-Leibler information could be written as

I(f, g) = E_f\left[\log_e\left(\frac{f(x)}{g(x \mid \theta)}\right)\right],

where the expectation is taken with respect to full reality, f. Using the property of logarithms, this expression can be further simplified as the difference between 2 expectations,

I(f, g) = E_f[\log_e(f(x))] - E_f[\log_e(g(x \mid \theta))].

Clearly, full reality is unknown, but it is fixed across models; thus a further simplification can be written as

I(f, g) = C - E_f[\log_e(g(x \mid \theta))],

where the expectation of the logarithm of full reality drops out into a simple scaling constant, C. Thus, the focus in model selection is on the term E_f[\log_e(g(x \mid \theta))].

One seeks an approximating model (hypothesis) that loses as little information as possible about truth; this is equivalent to minimizing I(f, g) over the set of models of interest (we assume there are R a priori models, each representing a hypothesis, in the candidate set). Obviously, Kullback-Leibler information by itself will not aid in data analysis, as both truth (f) and the parameters (θ) are unknown to us.

Model Selection Criteria

Akaike (1973) found a formal relationship between Kullback-Leibler information (a dominant paradigm in information and coding theory) and maximum likelihood (the dominant paradigm in statistics; deLeeuw 1992). This finding makes it possible to combine estimation and model selection under a single theoretical framework: optimization. Akaike's breakthrough was deriving an estimator of the expected, relative Kullback-Leibler information, based on the maximized log-likelihood function. This led to Akaike's information criterion (AIC),

\mathrm{AIC} = -2\log_e(\mathcal{L}(\hat{\theta} \mid \mathrm{data})) + 2K,

where \log_e(\mathcal{L}(\hat{\theta} \mid \mathrm{data})) is the value of the maximized log-likelihood over the unknown parameters (θ), given the data and the model, and K is the number of parameters estimated in that approximating model. There is a simple transformation of the estimated residual sum of squares (RSS) to obtain the value of \log_e(\mathcal{L}(\hat{\theta} \mid \mathrm{data})) when using least squares, rather than likelihood, methods. The value of AIC for least squares models is merely

\mathrm{AIC} = n\,\log_e(\hat{\sigma}^2) + 2K,

where n is sample size and \hat{\sigma}^2 = \mathrm{RSS}/n. Such quantities are easy to compute once the RSS values for each model are available using standard computer software.

Assuming a set of a priori candidate models (hypotheses) has been defined and well supported, AIC is computed for each of the approximating models in the set (i.e., g_i, i = 1, 2, . . . , R). The model where AIC is minimized is selected as best for the empirical data at hand. This concept is simple, compelling, and is based on deep theoretical foundations (i.e., Kullback-Leibler information).
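To give the quantity I(f, g) a concrete feel, the Python sketch below evaluates it for simple discrete distributions; the definition in the text uses an integral, and the sum here is the discrete analog. The distributions themselves are hypothetical.

# Numeric illustration of Kullback-Leibler information for discrete distributions.
import math

f  = [0.5, 0.3, 0.2]          # "truth" f over 3 outcomes
g1 = [0.48, 0.32, 0.20]       # a close approximating model
g2 = [0.20, 0.30, 0.50]       # a poor approximating model

def kl(f, g):
    # discrete analog of I(f, g) = E_f[log(f/g)]
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

print(f"I(f, g1) = {kl(f, g1):.4f}   # little information lost")
print(f"I(f, g2) = {kl(f, g2):.4f}   # much more information lost")
print(f"I(f, f)  = {kl(f, f):.4f}   # zero when the model equals truth")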

Page 12: NULL HYPOTHESIS TESTING: PROBLEMS, PREVALENCE…€¦ · NULL HYPOTHESIS TESTING: PROBLEMS, PREVALENCE, ... This paper presents a review and critique of statistical ... We find that
Page 13: NULL HYPOTHESIS TESTING: PROBLEMS, PREVALENCE…€¦ · NULL HYPOTHESIS TESTING: PROBLEMS, PREVALENCE, ... This paper presents a review and critique of statistical ... We find that


The AIC is not a test in any sense: no single hypothesis (i.e., model) is made to be the null, no arbitrary α-level is set, and no notion of significance is needed. Instead, there is the concept of a best inference, given the data and the set of a priori models, and further developments provide a strength of evidence for each of the models in the set.

It is important to use a modified criterion (called AICc) when K is large relative to sample size n,

\mathrm{AIC}_c = -2\log_e(\mathcal{L}(\hat{\theta} \mid \mathrm{data})) + 2K + \frac{2K(K + 1)}{n - K - 1},

and this should be used unless n/K > about 40 (Burnham and Anderson 1998). As sample size increases, AICc converges to AIC; thus, if in doubt, always use AICc, as the final term is trivial to compute. Both AIC and AICc are estimates of expected (relative) Kullback-Leibler information and are useful in the analysis of real data in the "noisy" sciences.
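As a worked illustration of the least-squares form of AIC and the small-sample correction AICc, the Python sketch below computes both from residual sums of squares. The candidate models, RSS values, and sample size are hypothetical, and K is counted here to include the estimated σ².

# AIC and AICc for least-squares models, following the formulas in the text:
#   AIC  = n * log(RSS / n) + 2K
#   AICc = AIC + 2K(K + 1) / (n - K - 1)
import math

def aic_least_squares(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def aicc_least_squares(rss, n, k):
    return aic_least_squares(rss, n, k) + (2 * k * (k + 1)) / (n - k - 1)

n = 60  # hypothetical sample size
candidates = {          # model name: (RSS, number of estimated parameters K)
    "g1: intercept only": (418.2, 2),   # K includes sigma^2 as an estimated parameter
    "g2: + nitrogen":     (301.7, 3),
    "g3: + nitrogen^2":   (298.9, 4),
}
for name, (rss, k) in candidates.items():
    print(f"{name:20s}  AIC = {aic_least_squares(rss, n, k):7.2f}"
          f"  AICc = {aicc_least_squares(rss, n, k):7.2f}")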

Ranking Models

The evidence for each of the alternative models can best be assessed by rescaling AIC values such that the model with the minimum AIC (or AICc) has a value of 0, i.e.,

\Delta_i = \mathrm{AIC}_i - \min \mathrm{AIC}.

The Δi values are easy to interpret and allow a quick comparison and ranking of the candidate models.

Likelihood of a Model, Given the Data

The quantity exp(-Δi/2) provides the likelihood of model gi, given the data. These are likelihood functions in the same sense that \mathcal{L}(\theta \mid \mathrm{data}, g_i) is the likelihood of the parameters θ, given the data (x) and the model (gi). It is convenient to normalize these values such that they sum to 1, giving the Akaike weights

w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)}.


The wi, called Akaike weights, can be interpreted as approximate probabilities that model i is, in fact, the Kullback-Leibler best model in the set of models considered. Akaike weights are a measure of the weight of evidence that model i is the actual Kullback-Leibler best model in the set. The relative likelihood of model i versus model j is just wi/wj. Inference here is conditional on both the data and the set of a priori models considered.
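The Python sketch below traces the full chain from AIC values to Δi, model likelihoods, normalized Akaike weights, and an evidence ratio; the AIC values themselves are hypothetical.

# Delta_i, model likelihoods, and Akaike weights from a set of AIC values.
import math

aic = {"g1": 1240.1, "g2": 1242.3, "g3": 1247.6, "g4": 1251.0}  # hypothetical AIC values

best = min(aic.values())
delta = {m: a - best for m, a in aic.items()}                 # Delta_i = AIC_i - min AIC
rel_lik = {m: math.exp(-d / 2.0) for m, d in delta.items()}   # likelihood of each model
total = sum(rel_lik.values())
weights = {m: rl / total for m, rl in rel_lik.items()}        # Akaike weights, sum to 1

for m in aic:
    print(f"{m}: Delta = {delta[m]:5.2f}   w = {weights[m]:.3f}")
print(f"evidence ratio w(g1)/w(g2) = {weights['g1'] / weights['g2']:.2f}")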

Unconditional Sampling Variance

Typically, estimates of sampling variance are conditional on a given model. When model selection has been done, a variance component due to uncertainty in model selection should be incorporated into estimates of precision, such that these estimates are unconditional on the selected model, but still conditional on the models in the set. An estimator of the unconditional variance for the parameter θ from the selected (best) model is

\widehat{\mathrm{var}}(\hat{\theta}) = \left[\sum_{i=1}^{R} w_i \sqrt{\widehat{\mathrm{var}}(\hat{\theta}_i \mid g_i) + (\hat{\theta}_i - \bar{\hat{\theta}})^2}\right]^2,

where

\bar{\hat{\theta}} = \sum_{i=1}^{R} w_i \hat{\theta}_i

and \widehat{\mathrm{var}}(\hat{\theta}_i \mid g_i) is the estimated sampling variance of \hat{\theta}_i conditional on model g_i.

Multi-model Inference

Rather than base inferences on a single selected best model from an a priori set of models, inference can be based on the entire set of models (multi-model inference, MMI). Such inferences can be made if a parameter, θ, is in common over all models (as θi in model gi), or if our goal is prediction. Then, by using the weighted average, \bar{\hat{\theta}} = \sum_i w_i \hat{\theta}_i, we are basing point inference on the entire set of models.



Table 3. Example of multi-model inference based on models and results presented in Tables 3 and 4 in Grand et al. (1998) on the effect of lead poisoning on spectacled eider annual survival probability; φ denotes annual survival probability, and subscripts e and u denote exposed or unexposed to lead, respectively. From each model we get an estimate of the lead effect on survival (effect = φ̂e - φ̂u) and an estimated conditional standard error, ŝe(effect | gi), given the model; see text for further explanation.

Model gi        K      Δi      wi      Lead effect (φ̂e - φ̂u)   ŝe(effect | gi)
{φl, p.}        3      0.00    0.673   0.337                    0.105
{φs+l, p.}      4      2.07    0.239   0.335                    0.148
{φs*l, p.}      5      4.11    0.086   0.330                    0.216
{φ., p.}        2     12.71    0.001   0.000                    0.000
{φs, p.}        3     14.25    0.001   0.000                    0.000
Model averaged                         0.335                    0.125

Notation follows that of Lebreton et al. (1992): s = site, l = lead exposure, and a dot (.) denotes a parameter held constant; φ = annual survival probability, p = recapture probability.

This approach has both practical and philosophical advantages (Gardner and Altman 1986, Henderson 1993, Goodman and Berlin 1994). Where a model-averaged estimator can be used, it often has better precision and reduced bias compared to the estimator of that parameter from only the selected best model.

An Example: Lead-Effect on Spectacled Eider Survival

Grand et al. (1998) evaluated the effect of lead exposure on annual survival probability (φ) of female spectacled eiders. Data were from 3 years of a larger capture-recapture study at 2 sites on the Yukon-Kuskokwim Delta in Alaska. Nesting female eiders were captured in May-June of 1994-96. At capture in 1994 and 1995, blood was drawn to use in determining lead exposure (assumed to be from ingested lead pellets). Grand et al. (1998) classified each female either as exposed or unexposed to lead. For analysis of the lead-effect on annual survival they used 5 models determined a priori (but partly based on analysis of the larger data set). They used program MARK (White and Burnham 1999, White et al. 2000) to model the capture-recapture data and estimate model parameters.

The parameterization of all 5 models was structurally similar in that each model was based on an annual probability of survival (φ) and a recapture probability (p), conditional on a bird being alive at the beginning of year j. Grand et al. (1998) let the recapture probabilities be constant across years, denoted as p., and let the survival probabilities vary by lead exposure (l) and site (s). The notation is standard in the capture-recapture literature (Lebreton et al. 1992). Thus, model {φl, p.} represents the hypothesis that annual survival probability varied by lead exposure, but not year or site, while recapture probability was constant over years, sites, and exposure. Model {φs+l, p.} represented the hypothesis that annual survival was constant across years, but varied by site and exposure, with no interaction. Model {φs*l, p.} represented the hypothesis that survival was constant across years and varied by site and exposure, with an interaction term, s × l. Model {φ., p.} assumed that both survival and recapture probabilities were constant across years, sites, and exposure, while model {φs, p.} assumed that survival varied by site, but not year or exposure. Thus, empirical support for a hypothesized lead effect must stem from models {φl, p.}, {φs+l, p.}, and {φs*l, p.}.

Model selection results presented by Grand et al. (1998) in their Table 3 are basically just the AIC differences, Δi; they base inference about the lead-effect (their Table 4) only on the selected best model. Here we extend their results to incorporate multi-model inference (MMI; Burnham and Anderson 1998). First, we have the Akaike weights, wi, shown in our Table 3. The best model has φ varying by lead exposure, but only has w1 = 0.673 as a strength of evidence for this best model. This weight suggests that model {φl, p.} is not convincingly the best model if other replicate data sets were available. The next 2 models add little support for a site-effect, either with or without an interaction term. This can be seen by considering AIC,

\mathrm{AIC} = -2\log_e(\mathcal{L}(\hat{\phi}, \hat{p})) + 2K.

AIC is an estimator of Kullback-Leibler information loss and embodies the principle of parsimony as a byproduct, not in its derivation.



The first term in AIC is a lack-of-fit component and gets smaller as more parameters are fitted in the model; however, the second component gets larger as a penalty for adding additional parameters. Thus, it can be seen that AIC enforces a trade-off between bias and variance as the number of parameters is increased. From Table 3, one can see that Δ for model {φs+l, p.} with K = 4 parameters increased by 2 units over the best model, while model {φs*l, p.} with K = 5 parameters increased by 4 units over the best model. The fit of the first 3 models in Table 3 is nearly identical; the additional hypothesized effect of site, with or without an interaction term, is not supported by the data. In each case, the Δ values increase by about 2 as the number of parameters increases by 1. In total, the evidence for a lead-effect is very strong in that the sum of the Akaike weights for these 3 models is 0.998. Empirical support for the 2 models without a lead-effect is lacking (Table 3), as both models have wi = 0.001. The evidence strongly suggests the presence of an effect on annual survival caused by ingestion of lead.

The evidence that model {φl, p.} is the best over replicated data sets can be judged by the ratio of the Akaike weights of the best model and the second-ranked model. This evidence (e.g., w1/w2 = 0.673/0.238 = 2.8) is insufficient to justify ignoring issues of model selection uncertainty. Hence, from the Akaike weights, it is clear that a lead-effect on survival is required for a model (hypothesis) to be plausible.

Finally, rather than ignore model selection uncertainty, we can use the model-averaged estimate of the lead-effect on annual survival and its unconditional standard error (from Table 3). As is often the case, the model-averaged estimate of effect size is very similar to the estimate from just the best model (0.335 vs. 0.337). However, the unconditional standard error, 0.125, is about 20% larger than the conditional standard error, 0.105, from the best model. This increase reflects model selection uncertainty and is an honest measure of uncertainty in the estimated effect of lead on eider survival probabilities.
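As a check on these numbers, the Python sketch below recomputes the model-averaged lead effect and its unconditional standard error directly from the weights, estimates, and conditional standard errors in Table 3, using the estimators given earlier; small discrepancies reflect rounding of the tabled values.

# Recompute the model-averaged lead effect and its unconditional SE from Table 3.
import math

w       = [0.673, 0.239, 0.086, 0.001, 0.001]   # Akaike weights
effect  = [0.337, 0.335, 0.330, 0.000, 0.000]   # estimated lead effect from each model
cond_se = [0.105, 0.148, 0.216, 0.000, 0.000]   # conditional SE, given each model

avg = sum(wi * ei for wi, ei in zip(w, effect))                  # model-averaged estimate
uncond_se = sum(wi * math.sqrt(si ** 2 + (ei - avg) ** 2)
                for wi, ei, si in zip(w, effect, cond_se))       # unconditional SE
print(f"model-averaged lead effect = {avg:.3f}")       # ~0.335
print(f"unconditional SE           = {uncond_se:.3f}") # ~0.125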

Summary of the Information-Theoretic Approach

The principle of parsimony, or Occam's razor, provides a philosophical basis for model selection; Kullback-Leibler information provides an objective target based on deep, fundamental theory; and information criteria (AIC and AICc), along with likelihood-based inference, provide a practical, general methodology for use in data analysis. Objective data analysis can be rigorously based on these principles without having to assume that the true model is contained in the set of candidate models. There are surely no true models in the biological sciences. Papers using information-theoretic approaches are beginning to appear in theoretical and applied journals in the biological sciences.

At a conceptual level, reasonable data and a good model allow a separation of information and noise. Here, information relates to the structure of relationships, estimates of model parameters, and components of variance. Noise then refers to the residuals: variation left unexplained. We can use the information extracted from the data to make proper inferences. The goal here is an approximating model that minimizes information loss, I(f, g), and properly separates noise (non-information, or entropy) from structural information. In an important sense, we are not trying to model the data; rather, we want to model the information in the data.

Information-theoretic methods are relatively simple to understand and practical to employ across a large class of empirical situations and scientific disciplines. The methods can be computed by hand if necessary (assuming one has the parameter estimates, maximized log-likelihood values, and estimates of var(θ̂ | gi) for each of the R a priori models). The information-theoretic methods are easy to understand, and we believe it is important that people understand the methods they employ. Further material on information-theoretic methods can be found in recent books by Burnham and Anderson (1998) and McQuarrie and Tsai (1998). Akaike's collected works have been recently published by Parzen et al. (1998).

CONCLUSIONS

The overwhelming occurrence of false null hypotheses in our sample of articles from Ecology and JWM seems sobering. Why are such strawmen being continually tested and the results accepted as science? We believe researchers in the applied sciences have been indoctrinated into thinking that statistical null hypothesis testing is a fundamental component of the scientific method. Researchers commonly treat scientific hypotheses and statistical null hypotheses as one and the same, which they are not (Romesburg 1981, Ellison 1996).



As a result, ecologists live or die by the arbitrarily assigned significant P-value (see Nester 1996 for a colorful description of the different types of emotional response to significance testing results).

In the worst, but common, case, only a P-value is presented, without even the sign of the supposed difference. Null hypothesis testing does not represent a fundamental aspect of the scientific method, but rather a pseudoscientific approach that provides a false sense of objectivity and rigor to the analysis and interpretation of research data. Carver (1978:394) offers the extreme statement, ". . . statistical significance testing usually involves a corrupt form of the scientific method and, at best, is of trivial importance . . . ." Much of the statistical software currently available aggravates this situation by computing and displaying quantities related to various tests.

Results from null hypothesis testing lead to relatively little increase in understanding and divert attention from the important issues: estimation of effect size, its sign and its precision, and meaningful mechanistic modeling of predictive and causal relationships. We urge researchers to avoid using the words significant and nonsignificant as if these terms meant something of biological importance. Do not rely on statistical hypothesis tests in the analysis of data from observational studies, do not report only P-values, and avoid reliance on arbitrary α-levels to judge significance. Editors and referees should be wary of trivial null hypotheses being tested, the related P-values, and the implication of supposed significance.

There are alternatives to the traditional null hypothesis testing approach in data analysis. For example, the standard likelihood ratio provides a more realistic basis for strength of evidence (Edwards 1972, 1992; Royall 1997). There is a great deal of current research on Bayesian methods, and practical approaches are forthcoming for use in the sciences. However, the Bayesian approaches seem computationally difficult, and there may continue to be objections of a fundamental nature (Foster 1995, Dennis 1996, Royall 1997) to the use of Bayesian methods in strength-of-evidence assessments and conclusion-oriented, empirical science.

Information-theoretic methods offer a more useful, general approach in the analysis of empirical data than the mere testing of null hypotheses. The information-theoretic paradigm avoids statistical hypothesis testing concepts and focuses on relationships of variables (via model selection) and on the estimation of effect size and measures of its precision. These relatively new approaches are conceptually simpler and easily computable, once the model statistics are available. This paradigm is useful in providing evidence and making inferences from either a single (best) model or from many models (e.g., using MMI based on Akaike weights). Information-theoretic approaches cannot be used unthinkingly; a good set of a priori models is essential, and this involves professional judgment and integration of the science of the issue into the model set.

Increased attention is needed to separate those inferences that rest on a priori considerations from those resulting from some degree of data dredging. Essentially no justifiable theory exists to estimate precision (or test hypotheses, for those still so inclined) when data dredging has taken place (the theory used is for a priori analyses, assuming the model was the only one fit to the data). This glaring fact is either not understood by practitioners and journal editors or is simply ignored. Two types of data dredging include (1) an iterative process whereby patterns and differences observed after initial analysis are chased by repeatedly building new models with these effects included, and (2) analysis of all possible models (unless, perhaps, model averaging is used). Data dredging is a poor approach to making reliable inferences about the sampled population. Both types of data dredging are best reserved for more exploratory investigations that probably should often remain unpublished. The incorporation of a priori considerations is of paramount importance and, as such, editors, referees, and authors should pay much closer attention to these issues and be wary of inferences obtained from post hoc data dredging.

LITERATURE CITED

AKAIKE, H. 1973. Information theory as an extension of the maximum likelihood principle. Pages 267-281 in B. N. Petrov and F. Csaki, editors. Second International Symposium on Information Theory. Akademiai Kiado, Budapest, Hungary.
———. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19:716-723.
BERGER, J. O., AND D. A. BERRY. 1988. Statistical analysis and the illusion of objectivity. American Scientist 76:159-165.
———, AND T. SELLKE. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association 82:112-122.
BERKSON, J. 1938. Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33:526-536.
———. 1942. Tests of significance considered as evidence. Journal of the American Statistical Association 37:325-335.
BUCKLAND, S. T., K. P. BURNHAM, AND N. H. AUGUSTIN. 1997. Model selection: an integral part of inference. Biometrics 53:603-618.
BURNHAM, K. P., AND D. R. ANDERSON. 1998. Model selection and inference: a practical information-theoretic approach. Springer-Verlag, New York, New York, USA.
———, ———, G. C. WHITE, C. BROWNIE, AND K. H. POLLOCK. 1987. Design and analysis methods for fish survival experiments based on release-recapture. American Fisheries Society Monograph 5.
CARVER, R. P. 1978. The case against statistical significance testing. Harvard Educational Review 48:378-399.
CHAMBERLIN, T. 1965 (1890). The method of multiple working hypotheses. Science 148:754-759. (Reprint of 1890 paper in Science.)
CHERRY, S. 1998. Statistical tests in publications of The Wildlife Society. Wildlife Society Bulletin 26:947-953.
COHEN, J. 1994. The earth is round (p < .05). American Psychologist 49:997-1003.
COX, D. R. 1958. Some problems connected with statistical inference. Annals of Mathematical Statistics 29:357-372.
DELEEUW, J. 1992. Introduction to Akaike (1973) information theory and an extension of the maximum likelihood principle. Pages 599-609 in S. Kotz and N. L. Johnson, editors. Breakthroughs in statistics. Volume 1. Springer-Verlag, London, United Kingdom.
DENNIS, B. 1996. Discussion: should ecologists become Bayesians? Ecological Applications 6:1095-1103.
EDWARDS, A. W. F. 1972. Likelihood: an account of the statistical concept of likelihood and its application to scientific inference. Cambridge University Press, London, United Kingdom.
———. 1992. Likelihood: expanded edition. Johns Hopkins University Press, Baltimore, Maryland, USA.
EFRON, B., AND R. TIBSHIRANI. 1993. An introduction to the bootstrap. Chapman & Hall, New York, New York, USA.
ELLISON, A. M. 1996. An introduction to Bayesian inference for ecological research and environmental decision-making. Ecological Applications 6:1036-1046.
FISHER, R. A. 1928. Statistical methods for research workers. Second edition. Oliver and Boyd, London, United Kingdom.
FOSTER, M. R. 1995. Bayes and bust: the problem of simplicity for a probabilist's approach to confirmation. British Journal for the Philosophy of Science 46:399-424.
GARDNER, M. J., AND D. G. ALTMAN. 1986. Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292:746-750.
GELMAN, A., J. B. CARLIN, H. S. STERN, AND D. B. RUBIN. 1995. Bayesian data analysis. Chapman & Hall, New York, New York, USA.
GERARD, P. D., D. R. SMITH, AND G. WEERAKKODY. 1998. Limits of retrospective power analysis. Journal of Wildlife Management 62:801-807.
GIGERENZER, G., Z. SWIJTINK, T. PORTER, L. DASTON, J. BEATTY, AND L. KRUGER. 1989. The empire of chance: how probability changed science and everyday life. Cambridge University Press, Cambridge, United Kingdom.
GOODMAN, S. N. 1993. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology 137:485-496.
———, AND J. A. BERLIN. 1994. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine 121:200-206.
———, AND R. ROYALL. 1988. Evidence and scientific research. American Journal of Public Health 78:1568-1574.
GRAND, J. B., P. L. FLINT, M. R. PETERSEN, AND C. L. MORAN. 1998. Effect of lead poisoning on spectacled eider survival rates. Journal of Wildlife Management 62:1103-1109.
GRAYBILL, F. A., AND H. K. IYER. 1994. Regression analysis: concepts and applications. Duxbury Press, Belmont, California, USA.
HARLOW, L. L., S. A. MULAIK, AND J. H. STEIGER, EDITORS. 1997. What if there were no significance tests? Lawrence Erlbaum Associates, Mahwah, New Jersey, USA.
HEDGES, L. V., AND I. OLKIN. 1985. Statistical methods for meta-analysis. Academic Press, London, United Kingdom.
HENDERSON, A. R. 1993. Chemistry with confidence: should Clinical Chemistry use confidence intervals for analytical and other data? Clinical Chemistry 39:929-935.
IYENGAR, S., AND J. B. GREENHOUSE. 1988. Selection models and the file drawer problem. Statistical Science 3:109-135.
JOHNSON, D. H. 1995. Statistical sirens: the allure of nonparametrics. Ecology 76:1998-2000.
———. 1999. The insignificance of statistical significance testing. Journal of Wildlife Management 63:763-772.
KULLBACK, S., AND R. A. LEIBLER. 1951. On information and sufficiency. Annals of Mathematical Statistics 22:79-86.
LEBRETON, J.-D., K. P. BURNHAM, J. CLOBERT, AND D. R. ANDERSON. 1992. Modeling survival and testing biological hypotheses using marked animals: a unified approach with case studies. Ecological Monographs 62:67-118.
MCQUARRIE, A. D. R., AND C.-L. TSAI. 1998. Regression and time series model selection. World Scientific Press, Singapore.
MORRISON, D. E., AND R. E. HENKEL. 1969. Statistical tests reconsidered. The American Sociologist 4:131-140.
———, AND ———, EDITORS. 1970. The significance test controversy: a reader. Aldine Publishing, Chicago, Illinois, USA.


NESTER, M. R. 1996. An applied statistician's creed. Applied Statistics 45:401-410.
NEYMAN, J., AND E. S. PEARSON. 1928. On the use and interpretation of certain test criteria for purposes of statistical inference, I and II. Biometrika 20A:175-240, 263-294.
———, AND ———. 1933. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A 231:289-337.
PARZEN, E., K. TANABE, AND G. KITAGAWA, EDITORS. 1998. Selected papers of Hirotugu Akaike. Springer-Verlag, New York, New York, USA.
ROMESBURG, H. C. 1981. Wildlife science: gaining reliable knowledge. Journal of Wildlife Management 45:293-313.
ROYALL, R. M. 1997. Statistical evidence: a likelihood paradigm. Chapman & Hall, London, United Kingdom.
SAVAGE, I. R. 1957. Nonparametric statistics. Journal of the American Statistical Association 52:331-344.
SCHMIDT, F. L. 1996. Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychological Methods 1:115-129.
WHITE, G. C., AND K. P. BURNHAM. 1999. Program MARK: survival estimation from populations of marked animals. Bird Study 46 (Supplement):S120-S139.
———, ———, AND D. R. ANDERSON. 2000. Advanced features of program MARK. Pages xxx-xxx in R. Fields, editor. Integrating people and wildlife for a sustainable future: proceedings of the Second International Wildlife Management Congress. The Wildlife Society, Bethesda, Maryland, USA.
YATES, F. 1951. The influence of Statistical Methods for Research Workers on the development of the science of statistics. Journal of the American Statistical Association 46:19-34.
YOCCOZ, N. G. 1991. Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bulletin of the Ecological Society of America 72:106-111.

Received 5 May 1998.
Accepted 20 June 2000.
Associate Editor: Bunck.
