
How Strong is the Confirmation of a Hypothesis by Significant Data?

Thomas Bartelborth*

Preprint: 2013-9-11

Abstract: The aim of this article is to determine how much a hypothesis H is actually confirmed if it has successfully passed a classical significance test. Bayesians have already raised many serious objections against significance testing, but in doing so they have always had to rely on epistemic probabilities and a further Bayesian analysis, which are rejected by classical statisticians. Therefore, I will suggest a purely frequentist evaluation procedure for significance tests that should also be accepted by a classical statistician. This procedure likewise reveals some additional problems of significance tests. Such tests generally offer only incremental support for a hypothesis, although an absolute confirmation is necessary, and they overestimate positive results for small effects, since the confirmation of H in these cases is often rather marginal. In specific cases, for example ESP hypotheses such as precognition, this phenomenon leads too easily to a significant confirmation. I will propose a method for evaluating and supplementing significance tests so that we can avoid their epistemic deficits.

Keywords: significance tests, Bayesianism, confirmation, likelihoodism, falsification, inference to the best explanation.

1 Epistemic Uncertainty and the Logic of Significance Testing

Scientists in most empirical disciplines present the results of their research as soon as their theories have been successfully confirmed by a significance test. Many scientific journals only publish an article when its hypotheses have successfully passed such a test. These tests are carried out to decisively reduce our epistemic uncertainty. However, despite the great importance this test procedure has in science, it remains unclear how strong the confirmation by a successful test of significance is. Does it provide us with a good reason to accept the hypotheses which have been proven to be significant? To answer this question, I will examine how significance testing works in simple examples and will develop a special method to evaluate the reasons we have to believe in significantly confirmed hypotheses.

* University of Leipzig, Institute of Philosophy, Beethovenstraße 15, 04107 Leipzig, email: <[email protected]>.

For the justification of a hypothesis H by data E, we first have to distinguish between different types of confirmation that justify different degrees of confidence in a hypothesis. In most cases the confirmation will be an incremental confirmation, which makes the hypothesis more plausible but does not confirm it to an extent that we should accept it. In order to accept a hypothesis, we need an absolute confirmation. Incremental confirmations will seldom change the epistemic status of a hypothesis significantly. In any case, it would be helpful if we could determine the exact degree of support and if we had a method of adding up or balancing different confirmations of a hypothesis by different data. If we find a sufficient number of incremental confirmations of a hypothesis, we will eventually assume that it has become rational to accept the hypothesis for now. In this case, we would speak of an absolute confirmation and take the hypothesis to be true, at least until further data appear.

The first problem relates to how we can represent different degrees of plausibility or support in the classical framework. For classical epistemology as well as for classical statistics, the only admissible representation of epistemic uncertainty is a distribution of all statements into three categories: first, the belief set G of all accepted propositions; second, the dual set R of rejected propositions; and third, the neutral set N of statements for which we have sufficiently good reasons neither for acceptance nor for rejection. In particular, in classical epistemology we do not work with degrees of belief, unlike the Bayesians, who make finer epistemological distinctions and can thereby add up different incremental confirmations.

In the classical case, the reduction of uncertainty consists in shifting statements H from the neutral set N (by accepting them) to the belief set G when we achieve an absolute confirmation of H. In classical significance testing (as advocated by Sir Ronald Fisher, e.g., Fisher, 1935a, 1935b) two hypotheses are involved: the so-called null hypothesis H0 and an alternative hypothesis H1, which I will call the research hypothesis (or target hypothesis), because it is the hypothesis we want to justify in the end. At the beginning, both hypotheses have to be placed in the neutral set N, and we aim at moving H1 to the belief set G (and accordingly the null hypothesis has to be classified as rejected). To do this, we need an absolute confirmation, which statistical tests of significance should therefore provide. If, on the other hand, the results of these tests were insignificant, both hypotheses should remain in the neutral set.

The logic of significance testing is thus a special case of eliminative induction. In order to confirm the target hypothesis, at first the target or research hypothesis H1 is proposed as a sort of antithesis to a certain null hypothesis H0. If, for example, our research question is whether a new teaching method has really improved the performance of students (which we want to measure with a certain standardized school achievement test), our null hypothesis may be that the mean performance μ (as measured by the school test) has remained the same as before (measured in earlier tests): μ = a. The value a is our null value. If we then get some measurement results represented by a value of a random variable T (and perhaps we find T > a), we have to decide whether the difference from our null value is only a chance variation (or measurement error) or is due to a real effect (μ > a) of the new teaching method. Our research hypothesis H1 in this situation may be that μ > a (one-sided test) or μ ≠ a (two-sided test). For greater simplicity I will always work with one-sided tests μ > a here. In most cases, the null hypothesis represents the skeptical view that the observed deviation from our null value (here: T > a) is only a random fluctuation and represents no real effect (i.e. μ = a).

In order to justify the research hypothesis we must, in the second step, falsify (probabilistically) the null hypothesis. If this is successful, we are then rationally justified in accepting the last remaining possibility: our research hypothesis H1. If the observed data or the value of our test variable T is too far away from the null value (to be interpreted as being due to pure chance), we reject the null hypothesis and speak of a significant result vindicating our research hypothesis. Typically, we use the so-called p-value to decide whether we can reject the null hypothesis. If T takes the value x in a certain empirical study, we construct as our evidence set all values T ≥ x (including in the set all values of T that speak against H0 at least as strongly as the value x in a one-sided test), and then determine the p-value p := P(T ≥ x|H0). Statistics tells us how we can determine this p-value by modeling the test situation in such a way that we obtain a probability distribution for T (the sampling distribution) if we assume H0 to be true.
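To make the definition concrete, here is a minimal sketch of the computation p = P(T ≥ x|H0) for a binomial sampling model of the kind used in the coin examples later in the text; the function name and the particular numbers are merely illustrative.

```python
# One-sided p-value P(T >= x | H0) for T ~ Binomial(n, p0) under H0.
from scipy.stats import binom

def one_sided_p_value(x, n, p0=0.5):
    # binom.sf(x - 1, n, p0) = P(T > x - 1) = P(T >= x): the summed
    # probability of all outcomes at least as extreme as the observed x.
    return binom.sf(x - 1, n, p0)

print(one_sided_p_value(37, 60))  # about 0.05, cf. Table 3 later in the text
```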

Page 4: How Strong is the Confirmation of a Hypothesis by Significant Data?philsci-archive.pitt.edu/9994/1/Bartelborth-significance... · 2013. 9. 11. · Confirmation by Significance Tests

Confirmation by Significance Tests 4

Typically, if p < 0.05 or p < 0.01, we conclude that the result is too improbable (given H0) for us to retain H0. These values are called the significance levels of our test. They determine in which cases we regard the null hypothesis as effectively probabilistically falsified. Intuitively, we reject the null hypothesis on the basis of data which seem very improbable if the null hypothesis is true. Of course, this is no strict deductive falsification, and this may be a first reason for some of the problems with significance testing that will be discussed in the following. The main question is how strong the confirmation given by significant data is for our research hypothesis. In many cases we can admit that we have a (weak) incremental confirmation, but it nevertheless seems to be much weaker than the absolute confirmation we need.

2 Significance Testing and Feeling the Future

A recent debate shows some of the problems of this method. The psychologist Daryl Bem examined phenomena of precognition in nine experiments (Bem 2011). For instance, in the first experiment the participants were supposed to predict on which side of a screen (left or right) certain pre-determined pictures would appear. In other experiments, well-known psychological effects (like priming) were tested to see whether they also work in reverse chronological order. These experiments were designed as significance tests, and in eight experiments the results significantly (and sometimes even highly significantly) confirmed the precognition hypothesis. In the first experiment of predicting the side of the picture, the 100 test subjects (50 women and 50 men), with 36 trials each, succeeded in 53.1% of the cases (p-value 0.011), and if we restrict the test to erotic pictures, they even succeeded in 57.6% of the cases (p-value 0.00008), against a null hypothesis that predicted a hit rate of only 50%. These results were eventually published in a prominent psychology journal.

The critics of these results (e.g. Alcock 2011) have raised many “minor” methodological objections against them, but the criticized methods of Bem are within the range of normal practices of psychological research, and also of that of other sciences. We will presumably have to admit that the test results speak weakly for the target hypothesis H1 that precognition exists. But the crucial question remains whether the confirmation of H1 is strong enough that we should now accept H1, or at least regard it provisionally as a scientifically acceptable hypothesis (as long as we have no other evidence). Since most scientists do not want to accept the precognition hypothesis on the basis of Bem’s results, the method of significance testing once again became the target of criticism. Thus, the example of Bem (2011) seems to be a good illustration of how difficult it is to assess the strength of the confirmation provided by significance tests. In any case, some critics (cf. Wagenmakers et al. 2011) argued that significance testing makes it too easy to confirm a hypothesis, or that the test overstates the meaning of the data for the hypothesis.

My aim here is not to give a review of psi phenomena. There actually are some significant results speaking in favor of psi phenomena, which Bem (2011) mentions. However, most of them have not been positively replicated; and the next difficult question is how we are to incorporate the results of replication studies into our assessment of hypotheses. There is unfortunately no easy rule for doing this. Furthermore, many renowned journals show no willingness to publish the results of replication studies at all; but due to the intense discussion of Bem’s article, the same journal published a large replication study (Galak et al. 2012), in which no evidence for the precognition hypothesis was found. In any case, we need, as a primary matter, highly dependable standard tests that can themselves provide us with absolute confirmations.

Some of the direct criticisms of Bem’s article (2011) are the following: Alcock (2011), for example, has argued, among other things, that Bem actually performed several t-tests at once, and hence the significance level should have been corrected accordingly (perhaps on the basis of the Bonferroni correction); and Alcock also argued against one-sided significance testing, because the error probability is then concentrated on one side, whereby it is easier to obtain significant results. A one-sided test at the 5% level corresponds to a two-sided test with a significant result on each side at the 2.5% level. Both objections are primarily arguments for operating with a smaller level of significance, but Bem’s procedure is common practice in scientific research. In particular, these criticisms do not call into question the general method of significance testing.

We can find other criticisms of Bem's results in Wagenmakers et al. (2011). First, the authors complain that Bem has done an exploratory study and mixed it with a confirmatory study. Hence, they plead for a better separation of these types of research.

Page 6: How Strong is the Confirmation of a Hypothesis by Significant Data?philsci-archive.pitt.edu/9994/1/Bartelborth-significance... · 2013. 9. 11. · Confirmation by Significance Tests

Confirmation by Significance Tests 6

Instead of presenting exploratory findings as confirmatory, one should ideally use a two-step procedure. First, in the absence of strong theory, one can explore the data until one discovers an interesting new hypothesis. But this phase of exploration and discovery needs to be followed by a second phase, one in which the new hypothesis is tested against new data in a confirmatory fashion. (Wagenmakers et al. 2011, 427)

This distinction between exploratory and confirmatory research is, however, rather doubtful from an epistemological point of view and is in practice often difficult to check.

Although the objections mentioned against Bem's results thus seem rather marginal, we do not want to accept Bem’s conclusion. Most scientists will probably see no sufficient reason in Bem's data to believe in precognition and will continue to assume that we still have no absolute confirmation for the hypothesis of precognition. This certainly has to do with the fact that the hypothesis is in conflict with our basic views about the functioning of the world. Precognition is a form of reverse causality for which there seems to be no plausible mechanism. So we have the particular situation that the results of Bem in our example can speak more against the methodology used than in favor of precognition. This is, at least, how some psychologists, such as Wagenmakers et al. (2011), interpret the results.

3 Some Bayesian Objections against Significance Tests

Before I present my own approach for assessing confirmation by significance testing, I want to address some other serious criticisms, mostly put forward by Bayesians. Some Bayesians (e.g. Wagenmakers et al. 2011, Wetzels et al. 2011) argue that we have to look for a better method of hypothesis testing, since significance testing makes it too easy to confirm a theory, and the example of Bem (2011) was only one such case. Of course, they plead for the application of Bayesian methods in evaluating scientific theories and present some of the classical problems of significance testing.

A reasonably comprehensive collection of these problems, with many instructive examples, can be found in Wagenmakers (2007). Wagenmakers divides the objections into three types. The first type of problem shows that the p-value is a function not of actual but of hypothetical data. To determine how strongly certain data confirm a specific research hypothesis (or speak against the null hypothesis), one uses the p-value PVT(E) of the data E (represented by T = x) with respect to a certain test statistic T, which is the sum of the probabilities of all values of T that are at least as extreme as the one actually observed (T = x). That means, instead of working with P(T=x|H0) we use the likelihood P(T∈D|H0) with D := {y | P(T=y|H0) ≤ P(T=x|H0)}. For a one-sided null hypothesis we can normally simplify: D = {y | y ≥ x}. Greco (2011) calls this step from P(T=x|H0) to P(T∈D|H0) “weakening the evidence”. As a consequence of this procedure it may happen that evidence E occurs which has the same likelihood under both theories T1 and T2 (P(E|T1) = P(E|T2)), but the theories suggest different p-values for E, and, therefore, perhaps T1 is confirmed more strongly by E than T2. The Bayesian regards this as implausible. In Bayesian epistemology, often only P(E|T) is taken as an indicator of the confirmation of T by E.
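The following sketch illustrates this “weakening” step for a discrete binomial test statistic; the helper name is hypothetical, and the small tolerance factor merely absorbs floating-point noise when comparing mathematically equal tail probabilities.

```python
# From the point likelihood P(T = x | H0) to P(T in D | H0), assuming
# T ~ Binomial(n, 0.5) under H0 (illustrative numbers only).
from scipy.stats import binom

def weakened_evidence(x, n, p0=0.5):
    px = binom.pmf(x, n, p0)
    # D: all values y whose probability under H0 is at most that of the
    # observed x, i.e. values speaking at least as strongly against H0.
    D = [y for y in range(n + 1) if binom.pmf(y, n, p0) <= px * (1 + 1e-9)]
    return px, sum(binom.pmf(y, n, p0) for y in D)

point, weakened = weakened_evidence(37, 60)
print(point, weakened)  # the point likelihood vs. the much larger p-value
```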

Bayesian methodology is, in this case, based on the so-called law of likelihood, which looks quite plausible at first sight. It says that the likelihood P(E|T) should determine the strength of the confirmation of T provided by E, and especially, when comparing two theories T1 and T2, that the likelihood ratio P(E|T1)/P(E|T2) expresses the relative confirmation E provides to T1 relative to T2. Likelihoodists (such as Royall 1997) often argue in this context that confirmation is in every case (only) comparative. Modern Bayesians also often use the corresponding Bayes factor BF01 as a measure of comparative confirmation that can decide between the null hypothesis H0 and our research hypothesis H1. The Bayes factor coincides with the likelihood ratio when the likelihoods are objective. For the Bayesian, the importance of the Bayes factor BF10 results from its role in the updating of two hypotheses by data E:

P(H1|E) / P(H0|E) = BF10 · P(H1) / P(H0), with BF10 = P(E|H1) / P(E|H0)

The Bayes factor BF01 is the update factor for Bayesians if we consider the ratio of the posterior probabilities of two hypotheses. This establishes its epistemic significance. Two advantages of this factor over significance testing are, firstly, that it is symmetrical in both hypotheses and thus can also constitute a confirmation of the null hypothesis1 and, secondly, that it allows us to quantify the degree of relative confirmation. Furthermore, new data (e.g. replication studies) can be taken into account quickly by multiplying again with the new Bayes factor. In addition, the Bayes factor seems not to depend on the prior probabilities of the hypotheses and thus appears to be an objective magnitude.

1 Rouder et al. (2009) argue that any test procedure that only allows confirmation of the target hypothesis but has no means to confirm the rival null hypothesis is always biased by this fact alone toward the target hypothesis. This seems to me an interesting general point for further epistemological discussions.

However, we still have the problem that the value P(E|H1) for our target hypothesis cannot be easily determined, because it typically still contains a (nuisance) parameter μ (in our above example, the mean school achievement) which we do not know. The hypothesis H1 (which for example says that μ > a) does not directly determine probabilities for the occurrence of certain data E, since it is no point hypothesis (μ = b) but a “long disjunction” of point hypotheses. Therefore, we need a density distribution f(μ|H1) for the parameter μ of H1 in order to integrate it out: P(D|H1) = ∫ P(D|μ, H1)·f(μ|H1) dμ. These calculations will often be somewhat complicated (and we regularly need computer assistance from special Bayesian programs), and, in particular, the dependence on a prior density f(μ|H1) may lead to some objections to this procedure (see below). Furthermore, Bayesians provide us with certain approximations, such as the Savage-Dickey density ratio or the Bayesian Information Criterion (BIC), for an easier application of their methodology (cf. Rouder et al. 2009).
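Here is a small numerical sketch of this marginalization, assuming binomial data (k heads in n tosses) and, purely for illustration, a uniform prior density for the bias p on (0.5, 1) under the one-sided hypothesis H1: p > 0.5; the function name and the choice of prior are my own, not the author's.

```python
# P(D | H1) = integral of P(D | p, H1) * f(p | H1) dp, with a flat
# prior f(p | H1) = 2 on (0.5, 1); compare it with the point null.
from scipy.stats import binom
from scipy.integrate import quad

def marginal_likelihood(k, n):
    integrand = lambda p: binom.pmf(k, n, p) * 2.0  # likelihood times prior
    value, _ = quad(integrand, 0.5, 1.0)
    return value

k, n = 37, 60
print(marginal_likelihood(k, n))  # P(D | H1) under the uniform prior
print(binom.pmf(k, n, 0.5))       # P(D | H0) for the point null p = 0.5
```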

Bayesians, in addition, supply a classification of when the Bayes factor BF10 indicates a more or less strong comparative confirmation of H1 relative to H0. Dating back to Jeffreys, they interpret values of 1–3 as insignificant confirmation of H1, values of 3–10 as weak confirmation, and values above 10 as strong confirmation. While these assessments are not entirely uniform among various Bayesians, we do have an intuitively comprehensible characterization of this magnitude: If we have only two hypotheses and start with the same prior probability P(H1) = 0.5 = P(H0) for both, we can use the Bayes factor to calculate the posterior probability of the hypotheses: P(H1|D) = BF10 / (1 + BF10). Thus we obtain for BF10 = 3 just P(H1|D) = 0.75 (for a Bayes factor of 10 we get the posterior probability 0.91, for BF10 = 20 we get 0.95, and for a Bayes factor of 100 we get 0.99). We can thereby at least interpret a Bayes factor of 10 as indicating an absolute confirmation if we work with a threshold conception of theory choice and regard a probability above 90% as sufficient to accept a theory. In order to facilitate comparisons between classical (three-valued) epistemology and Bayesianism, I will take such a threshold conception of theory acceptance for granted in the rest of the article. If the posterior probability of a theory T arrives at 90% or above, I will thus regard T as absolutely confirmed by our background knowledge and will accept it. If we start with two hypotheses and are at first neutral with respect to them, we should, therefore, assess confirmations with Bayes factors below 10 only as weak incremental confirmations.
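The arithmetic behind these thresholds is simple enough to check directly; this snippet merely reproduces the numbers just quoted, under the stated assumption of equal priors.

```python
# Posterior probability of H1 from the Bayes factor BF10, assuming
# equal prior probabilities P(H1) = P(H0) = 0.5.
def posterior_from_bf(bf10):
    return bf10 / (1.0 + bf10)

for bf in (3, 10, 20, 100):
    print(bf, round(posterior_from_bf(bf), 2))  # 0.75, 0.91, 0.95, 0.99
```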

If we apply these ideas to the experiments of Bem (2011), we obtain in favor of the research hypothesis H1 only three values below 3 and one of 5.9, while the other experiments speak weakly for the null hypothesis, with values of up to 7.6 in its favor (see Wagenmakers et al. 2011). According to this analysis, we have in no experiment convincing data in support of Bem’s precognition hypothesis. The question thus is whether the interpretation of the data as absolutely confirming his hypothesis should mainly be attributed to the Fisherian methodology of null hypothesis testing and not to the power of the data. If this is the case, it would perhaps also explain (at least in part) the great popularity of significance testing among the practitioners of the empirical sciences.2

In their reply, Bem et al. (2011) complain that the Bayes factor in Wagenmakers et al. (2011) is not really objective, since it depends on a prior density for the effect size under H1, which according to Bem is ill chosen because it is too uninformative and does not use our prior knowledge about effect sizes in psychology. Hence, it assigns too much weight to implausibly high values of the effect size and thereby favors the null hypothesis. If we instead take the typical effect sizes and other prior information into account, we get higher Bayes factors, which in most studies indicate substantial confirmation of the precognition hypothesis. If we furthermore combine the evidence across the nine studies by multiplying the nine Bayes factors, we even get extreme evidence in favor of the precognition hypothesis. Of course, Wagenmakers et al. (2011a) do not agree with this diagnosis and present several objections to this procedure, but they are far from being as convincing as their original points against Bem. For my evaluation of significance testing, the lesson is to compare only point hypotheses with other point hypotheses in order to avoid these problems of choosing a subjective prior density.

2 In a similar way, Wetzels et al. (2011) argue for the same conclusion. They have analyzed 855 examples of t-tests (with p-values between 1% and 5%) that were published recently in psychology, using the corresponding Bayes factors, and came to the conclusion that according to a Bayesian analysis we can speak in 70% of these cases only of “anecdotal evidence” with a Bayes factor below three. They claim that significance testing systematically overestimates the strength of the confirmation of the target hypotheses by the data.

In those cases in which a second point hypothesis exists that, in the same way as the null hypothesis, specifies a particular model and thus determines the likelihoods for certain data, likelihoodists in particular (such as Royall 1997) argue for a strictly comparative evaluation of the hypotheses and suggest the likelihood ratio (which corresponds to the Bayes factor) as a measure of the relative confirmation of the two hypotheses. However, likelihoodists do not accept epistemic probabilities in the form of prior probabilities or densities. Only in the case of objective likelihoods do they rely on the law of likelihood, which seems intuitively very plausible at first sight. It says: If P(E|H2) > P(E|H1), then the evidence E favors (prima facie) hypothesis H2 over hypothesis H1.3

Unfortunately, Fitelson (2011) has argued against the law of likelihood with certain counterexamples: Consider a well-mixed typical deck of cards from which we randomly draw a card K, and two competing hypotheses: (H1) “K is a black card” and (H2) “K is the ace of spades”. Our information is: (E) “K is a spade”. Now, indeed, P(E|H2) = 1 > P(E|H1) = 0.5. Thus, according to the law of likelihood, E supports H2 more than H1; but, on the other hand, H1 follows deductively from our information E, and logical derivability seems to be stronger than the purely probabilistic support of the likelihood comparison. Of course, we have here a special situation, which is due to the logical relations that obtain: H2 entails E, and E entails H1.

This is obviously not a common situation in hypothesis testing, and thus we should not immediately reject the law of likelihood altogether. But it remains an open question whether it should be mandatory. Consider a further simple example: Suppose for two diseases X and Y we find that in both cases 20% of the ill persons have a maximum temperature of 39°, yet for Y, additionally, in 10% of the cases temperatures may rise up to 40°. (E) John now has one of the diseases and a maximum temperature of 39°. Then we get: P(E|X) = 0.2 = P(E|Y). Are we thereby committed to the opinion that both hypotheses are supported alike by E? One can argue in this direction, but one need not. We can also argue that such a high fever is a pretty extreme value for disease X but is not quite as rare for disease Y and, hence, speaks a little more for Y. Our intuitions are not entirely clear in such cases. The law of likelihood therefore remains plausible, but it is not so well established that it could decide the debate in favor of Bayesianism.

3 If, e.g., H1 and H2 assert that a certain disease X1 or X2, respectively, is present, and for X1 certain symptoms E can be expected while this is not the case for X2, then the occurrence of the symptoms E speaks in favor of H1 relative to H2.

As a consequence, in comparing significance testing with Bayesianism, I will determine the posterior probability of the target hypothesis, which we can calculate in simple examples and which presumably gives us a better intuitive assessment of the strength of the confirmation than the Bayes factor. To facilitate such a comparison, I am again assuming that a probability of about 90% corresponds to the acceptance of a hypothesis. Of course, we could also choose other thresholds, but for an intuitive evaluation of the classical approach this setting turns out to be helpful.

A second objection of Wagenmakers (2007) against significance testing goes in a similar direction as the first one. The calculation of p-values depends not only on the data but also on our sampling plans or stopping rules. Hence, the data D do not determine the confirmation of the hypotheses; rather, it is essential for the test which intentions the experimenter had when he planned it. It may be that two scientists have exactly the same data D, but the data support a hypothesis H with different strength for the two scientists just because they had different sampling plans in mind. According to the Bayesians, this demonstrates an unnecessary subjective influence on theory choice, which for Bayesians should be based only on the data and perhaps further objective background knowledge, but not on the goals of the scientists collecting the data. Bayesians, therefore, have developed some examples in which these influences appear entirely strange.

Consider a simple example: We toss a coin 20 times and it lands heads 15 times. If we designed the experiment for exactly 20 throws, then this result is a significant deviation from the null hypothesis that the coin is fair (calculated using the standard binomial distribution). But we will get another p-value if we use a different sampling plan, for example: toss the coin until you get heads 15 times (calculated by the negative binomial distribution), or even: toss the coin until you get a significant result. For the last case, William Feller (1968) has already shown that we arrive at a significant result with probability one. These cases show why it is necessary for significance testing to take into account the intentions of the experimenter in planning the experiment. This subjective influence on the test results is of course irritating, but certainly not a knock-out criterion for the Fisherian approach. The classical statistician can point out that Bayesians also have to use subjective elements in their theory evaluations, such as the prior probabilities of hypotheses and perhaps some likelihoods (especially the so-called catch-all likelihood).
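The last stopping rule can be made vivid with a small simulation; this is only a sketch under simplifying assumptions (a fair coin, a one-sided test toward heads, a test after every toss), and the helper names and run counts are my own choices.

```python
# Optional stopping: "toss the coin until you get a significant result".
# Although the coin is fair, the proportion of runs that ever reach
# nominal significance climbs far above the 5% level as the toss budget
# grows; Feller's result says it approaches 1 in the limit.
import random
import numpy as np
from scipy.stats import binom

def critical_values(max_n, alpha=0.05):
    # for each n, the smallest head count k with P(X >= k | n, 0.5) < alpha
    ks = []
    for n in range(1, max_n + 1):
        pvals = binom.sf(np.arange(n + 1) - 1, n, 0.5)  # P(X >= k), k = 0..n
        hits = pvals < alpha
        ks.append(int(np.argmax(hits)) if hits.any() else n + 1)
    return ks

def ever_significant(ks, rng):
    heads = 0
    for n, k in enumerate(ks, start=1):
        heads += rng.random() < 0.5
        if heads >= k:        # significant at this interim look: stop here
            return True
    return False

rng = random.Random(1)
for budget in (20, 100, 500):
    ks = critical_values(budget)
    rate = sum(ever_significant(ks, rng) for _ in range(2000)) / 2000
    print(budget, rate)       # the rate rises well above 0.05
```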

The third type of criticism of significance testing from a Bayesian perspective is that the p-value is often regarded as a measure of the strength of the confirmation. Several classical statisticians assumed (see Wagenmakers 2007, 787 ff.) that the smaller the p-value and the greater the number n of tested subjects, the better the support of the target hypothesis, since the evidence against the null hypothesis is measured by the p-value and large trials give us the most reliable evidence against it. Wagenmakers mentions, for example, Fisher, for whom the p-value gives “the strength of the evidence against the hypothesis” (Fisher, 1958, p. 80). But this assumption (called the p-postulate by Wagenmakers) cannot be maintained if we take into account certain Bayesian considerations first presented by Lindley in 1957, which are now known as Lindley’s paradox. It shows, at least, that a smaller p-value need not be accompanied by a higher posterior probability of the target hypothesis. More specifically, it can be shown that for a constant p-value and an increasing number n, the posterior probability of the null hypothesis converges (slowly) to one for all plausible prior probabilities. Bayesians considered this phenomenon a fatal flaw of significance testing and of the p-postulate, since in this case the same p-value leads to different confirmations of the null hypothesis, and we even observe, in contrast to the p-postulate, a decreasing plausibility of the target hypothesis for greater values of n.
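Lindley's observation can be checked numerically; the following sketch assumes a coin-tossing model with H0: p = 0.5, a flat prior for p under H1, and equal prior probabilities for the two hypotheses, all of which are my illustrative choices rather than Lindley's own setup. For each n we pick the smallest head count whose one-sided p-value falls below 0.05, so the classical evidence is held roughly fixed while n grows.

```python
# Posterior probability of H0 at a just-significant result, as n grows.
import numpy as np
from scipy.stats import binom

def posterior_h0(n, alpha=0.05):
    pvals = binom.sf(np.arange(n + 1) - 1, n, 0.5)  # one-sided P(X >= k)
    k = int(np.argmax(pvals < alpha))               # just-significant count
    m0 = binom.pmf(k, n, 0.5)                       # P(data | H0)
    m1 = 1.0 / (n + 1)  # P(data | H1): exact integral of the binomial
                        # pmf over a flat prior for p on [0, 1]
    return k, m0 / (m0 + m1)                        # equal priors on H0, H1

for n in (50, 500, 5000, 50000):
    print(n, *posterior_h0(n))  # P(H0 | data) creeps toward 1 with n
```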

Berger & Delampady (1987) suggested a possible answer which the clas-sical statistician can give: The point-null hypothesis = a is only an idealiza-tion actually representing the more complex case where the value lies in asmall interval around a, and it would even be unusual for a Bayesian togive a strictly positive probability to such a point hypothesis. In the Baye-sian account we eventually compare two models which do not fit well to-gether, namely a point hypothesis and an interval hypothesis. This idealiza-tion produces only small deviations for the classical statistician but is moreproblematic for a Bayesian (cf. Berger & Delampady 1987, 321). In any case,for the more realistic representation of a null hypothesis as small interval,the convergence which the results in Lindsey’s paradox suggest cannot bereproduced in the Bayesian framework, but the probability of the null hy-pothesis converges in many cases to (see Berger & Delampady 1987, 322).

Page 13: How Strong is the Confirmation of a Hypothesis by Significant Data?philsci-archive.pitt.edu/9994/1/Bartelborth-significance... · 2013. 9. 11. · Confirmation by Significance Tests

13 Thomas Bartelborth

Nonetheless, important differences also remained in the article of Berger &Delampady between measures of confirmation based on p-values and thecorresponding Bayesian calculations that culminate for the authors in theconclusion:

The overwhelming conclusion is that P-values are typically at least an order of magnitude smaller than Bayes factors or posterior probabilities for H0. This would indicate that, say, claiming that a P-value of .05 is significant evidence against a precise hypothesis is sheer folly; the actual Bayes factor may well be near 1, and the posterior probability of H0 near 1/2. (Berger & Delampady 1987, 326)

Consequently, the Bayesians have, after all, raised several serious objections to significance testing and certainly offer a number of important alternative methods. Furthermore, it seems to be too easy to confirm ESP hypotheses by tests of significance. However, there is a certain incommensurability between the classical and the Bayesian approach. Bayesian methodology works with epistemic probabilities and normally specifies no threshold value for accepting a scientific hypothesis, whereas the classical statistician rejects epistemic probabilities altogether, works primarily with frequencies or likelihoods of the form P(E|H), and has no place for degrees of belief in a hypothesis. We therefore have to look for a convincing and more direct epistemological evaluation of the Fisherian method that classical statisticians can accept.

This new assessment method will be based on the relative frequencies with which the Fisherian method gives us the right results, but these frequencies subsequently have to be interpreted as certain expectations similar to epistemic probabilities. This is the only way to derive rational expectations from relative frequencies. Though this form of statistical syllogism is officially appreciated neither by Bayesians nor by classical statisticians (since it is the basis of an inductive logic), both approaches have to rely on it if they want to provide epistemologically relevant results (cf. Franklin 2001, Campbell & Franklin 2004, and Bartelborth 2012, ch. 5.2).

4 Epistemological Evaluation: The Filter Analogy

The question remains how we can directly assess significantly confirmed hypotheses H in the classical frequentist framework. The likelihoods P(E|H) of the data E offer no direct evidence for the probability of the hypothesis H. But we want to know whether the hypothesis is probably true if certain data occur, and answering that question seems to require (prior) epistemic probabilities. Of course, an error probability (or significance level) of 5% does not mean that significantly confirmed hypotheses are false in only 5% of the cases, although we sometimes come across misleading remarks in statistics textbooks hinting in this direction. How, then, can we evaluate significance tests epistemologically? We want to know whether a significantly confirmed target hypothesis H1 is true, or at least supported so well that we should accept it, at least preliminarily.

The Bayesian determines the epistemic probability of H for this purpose, but the classical statistician accepts only objective likelihoods or relative frequencies. Thus, I will base my assessment of the plausibility of significantly confirmed hypotheses solely on relative frequencies. For this purpose we have to use the following analogy (cf. Bartelborth 2012: ch. 6.4): If we have a true null hypothesis H0 (and thus a false target hypothesis H) and we perform a significance test on H0 with significance level α, then our test statistic will still fall into the rejection region in α·100% of the cases, and we would, therefore, falsely reject the null hypothesis. If we set α = 5%, that means that of 100 false target hypotheses which we send into the test, on average 5 of them will be confirmed by it. The test works as a filter that filters out 95 of 100 false target hypotheses which we have sent into the filter.

Of course, we still do not know how many of the target hypotheses which pass the filter are true. In order to determine that magnitude we additionally have to know the β-error, which gives the proportion of true target hypotheses that are also stopped by the filter. Even if the target hypothesis is true (and the null hypothesis is therefore false), the null hypothesis will not always be rejected. In β·100% of these cases we still obtain an insignificant result, and the true target hypothesis will not be confirmed as significant. These target hypotheses will remain stuck in the filter and therefore remain in the neutral area, since only those research hypotheses will be accepted that have passed the filter. Altogether, α·100% of the false target hypotheses and (1−β)·100% of the true ones will pass the filter and thus be accepted. Since the β-error increases with decreasing significance level α (our filter lets fewer hypotheses pass overall), the two error types depend on each other; but in general we do not know the β-error and can, at best, estimate it. The basic idea of significance testing is, of course, that only few false and many true target hypotheses will pass the filter, whereby passing the filter can be regarded as an indicator of the truth of these hypotheses.

However, to determine the overall effect of the filter exactly, we also need the ratio of true to false hypotheses that are sent into the filter. Suppose that, of the hypotheses which we send into the significance filter, 100·q% are true and 100·(1−q)% are false; then I will call q the prior truth-proportion. Now we can figure out how many of the significantly confirmed hypotheses (that means the hypotheses that pass the filter) are true on average. We have to determine (or estimate) the three parameters α, β, and q to arrive at the posterior truth-proportion w:

w = q·(1−β) / (q·(1−β) + (1−q)·α)

Thus, a significance test is a method that lets a certain proportion of target hypotheses pass through the filter as significantly confirmed. In particular, the majority of the false hypotheses are filtered out, while the true hypotheses will mostly pass through the filter. In the ideal case, we obtain with this filtering a very high proportion of true target hypotheses as output, as is depicted in the following graph:

Figure 1: Significance testing considered as a filter for hypotheses

The posterior truth-proportion offers the best method for a direct epistemic evaluation of significance testing, because it tells us how many true research hypotheses we find among the (significantly confirmed) hypotheses passing the filter and therefore characterizes the performance of the filter. If we want to find out the plausibility we should assign to a specific hypothesis H that is confirmed by a test, we additionally have to rely on the above-mentioned statistical syllogism. It says that if x·100% of some kind of objects have the characteristic F, and we have an object a of this kind and have no further knowledge of whether it has F, then we should give Fa the epistemic probability x or the degree of plausibility x (P(Fa) = x). There are good reasons to use this form of empirical calibration for our degrees of plausibility, but I will not discuss them here and rather just accept this intuitive principle at this point. It seems to be the only plausible method by which we can rationally learn from purely frequentist data and use them as a guide to our life.

If we use this type of reasoning in our example (and how else can we use statistical information?), then we should take the posterior truth-proportion w as our best estimate for evaluating the epistemic support which our data provide for the hypothesis H. If we have further knowledge about the truth of our research hypotheses, that can of course be used to estimate the prior truth-proportion. In any case, a first question is how the prior truth-proportion influences the posterior truth-proportion. We can acknowledge here similarities to the Bayesian updating procedure, but the values are different, as we will see later.

First of all, suppose that we use a significance test with a fixed value α = 0.05 and an estimated β = 0.2. In this case the dependence of the posterior truth-proportion on the prior truth-proportion is obvious. However, we can improve the test procedure if we choose a smaller significance level of α = 0.01 or even α = 0.001 and thereby increase the posterior truth-proportion (if the β-error is not too large). I assume that β rises in these cases only to 0.5 and 0.8 respectively, but we will see later that β can rise to critical values in concrete cases. That depends mainly on where the actual parameter that we want to determine is located.

prior truth-proportion q     0     0.001   0.01   0.1   0.3   0.5    0.6
α = 0.05  (β = 0.2)          0     1.6     14     64    87    94     96
α = 0.01  (β = 0.5)          0     5       34     85    95    97     98
α = 0.001 (β = 0.8)          0     17      67     96    99    99.5   99.7

Table 1: Posterior truth-proportions w in percent
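The table can be recomputed (up to rounding) from the filter formula above; the function name is my own, and the β values are taken as given rather than estimated.

```python
# Posterior truth-proportion w = q(1-beta) / (q(1-beta) + (1-q)alpha):
# true hypotheses passing the filter over all hypotheses passing it.
def posterior_truth_proportion(q, alpha, beta):
    true_pass = q * (1 - beta)     # true hypotheses that pass
    false_pass = (1 - q) * alpha   # false hypotheses that slip through
    return true_pass / (true_pass + false_pass)

for alpha, beta in ((0.05, 0.2), (0.01, 0.5), (0.001, 0.8)):
    row = [round(100 * posterior_truth_proportion(q, alpha, beta), 1)
           for q in (0.001, 0.01, 0.1, 0.3, 0.5, 0.6)]
    print(alpha, row)  # compare the rows of Table 1
```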

In the second column we can see that only false hypotheses pass through the filter if all the hypotheses we think of and send into it are false. This is a first explicit hint that significantly confirmed hypotheses need be neither true nor even plausible. Furthermore, if a scientist is very productive, testing many hypotheses of which only every tenth is true, and he works with a significance level of 5%, then his significant results will be true on average in only 64% of the cases. For most research questions that does not seem to be a sufficient certainty to rely on, for example in the case of therapies. Nevertheless, if the applied p-value is much smaller than 5%, the posterior truth-proportion clearly rises, provided the β-error does not become too large. The p-value is thus a first helpful indicator of the strength of the confirmation of a target hypothesis by significant data. However, there are also other parameters to consider, and we will later see that the β-errors can become very large (even approaching 1) and can then certainly also affect the posterior truth-proportion substantially. Thus, in particular, we cannot determine the posterior truth-proportion by means of the p-value alone. We also have to take the β-error and the prior truth-proportion into consideration.

Moreover, at the least, it seems clear that the strong dependence of the posterior truth-proportion w on our prior truth-proportion q has the consequence that only plausible hypotheses should be sent into the filter if we want to have a good reason to believe that significantly confirmed hypotheses are true. The classical statistician must therefore, quite similarly to the Bayesian, first evaluate the initial plausibility of his research hypotheses before testing them with a significance test, since otherwise many false target hypotheses will pass the filter and the posterior truth-proportion becomes too low. This is obviously a serious problem for the testing of ESP hypotheses such as Bem’s precognition hypothesis. In such cases we have to at least estimate the relevant parameters in order to assess the epistemic status of the hypothesis after successful tests with the help of the posterior truth-proportion.

How can we accomplish such an informal plausibility assessment? At a minimum, I would like to mention some aspects which we have to take into consideration: We have, first of all, to examine whether the target hypotheses can indeed explain our data and whether they can do so better than any alternative hypotheses. To this purpose we have to search explicitly for alternative explanations of the data and have to determine how many plausible alternative explanations for the data are to be found. Furthermore, we have to assess how well our research hypothesis fits with our other accepted scientific theories. The Bayesian attempts to provide an exact number for this initial plausibility of our hypotheses as their prior probability (which is certainly a difficult enterprise), but at least we have to give an informal assessment of the initial plausibility, which tells us whether they are plausible candidates for the best explanations of our data.

From now on I will always assume that the target hypotheses and the null hypotheses have (at the start) the same initial plausibility and thus that our prior truth-proportion for the tests is approximately q = 0.5. But even in this favorable situation we will encounter further epistemic problems for the practice of significance testing.

5 Problems of Significance Testing in Concrete Examples

We will now look at some problems of classical significance testing in a simple example. It involves hypotheses about a proportion. Our null hypothesis always says that a particular coin is fair (i.e. P(head) = 0.5). The data we consider at first are the number of heads in 60 tosses of the coin. The test statistic T (the number of heads in the sample) has a binomial distribution and can therefore be calculated easily. This situation corresponds, e.g., to a hypothesis that postulates a proportion of 50% women in the total population of all students in England who enroll in a particular year in philosophy. In this case we can choose a random sample of 60 students for whom we determine, as a test statistic, how many women are to be found in the sample. If we assume that the population of all students is far greater than 60, we can again assume a binomial distribution for the test statistic. We will consider, for simplicity, only the one-sided target hypothesis H1 that the coin is biased in favor of heads (or that we find more women in the population than men). In the example of Bem, the null hypothesis corresponds to a random hit rate of just 0.5, and the target hypotheses are designed as one-sided and postulate that the true hit rate is greater than 0.5.

The first problem is that the target hypothesis in our case consists of a large “disjunction” of the different values p > 0.5. Therefore, it provides no objective likelihoods and accordingly no specific β-error. To obtain determinate values we could adopt the Bayesian strategy and assume a prior density for p > 0.5 (indicating with which probability these values will occur) and then remove the unwanted parameter by integrating it out (see above), but that cannot be accepted by the classical statistician, because it necessarily depends on using epistemic probabilities. Moreover, we would thereby compare a point hypothesis with a weaker interval hypothesis, which can lead to miscalculations due to the additional free parameter. Actually, we know that some concrete value p = r is the true value specifying the probability with which our coin comes up heads. The true hypothesis is hence a point hypothesis with some value r which we do not know. We can now compare our null hypothesis with different hypotheses Hr (saying p = r) for several values of r and examine in each case how well different data (which will significantly confirm Hr) really support the research hypothesis Hr.

I will, therefore, consider different hypothetical values of p and then examine in each case whether the significance test interprets the incoming data intuitively correctly (as confirming or not) and whether significant results will actually support our target hypothesis so strongly that we should regard it as absolutely confirmed and accept it. For this to obtain, I will require a posterior truth-proportion of at least 90%. The general (interval) research hypothesis p > 0.5, on the other hand, provides us with a certain kind of averaging over all such hypotheses, which we will consider here in isolation. Furthermore, we can compare the results with the corresponding Bayesian results for specific (point) hypotheses, and to facilitate this comparison I will always assume that the prior truth-proportion (in the classical case) and the prior probability (in the Bayesian case) is in each example 1/2.

In order to uncover some of the problems of significance testing, we will consider different specific target hypotheses, according to which H1 states that (approximately) P(head) = 0.7 or some other concrete value down to 0.51. In each of these scenarios we will assume that either the null or the target hypothesis is true and can therefore also give a Bayesian analysis as an additional benchmark for our truth-proportion assessment. We can see, e.g., that if the null hypothesis and the target hypothesis make very similar predictions (are close-by hypotheses), then the posterior truth-proportion of the target hypothesis shows that our hypothesis choice becomes a form of guessing. This is a problem well known to the Bayesian, but it does not seem to be similarly obvious for classical significance testing.

In our example, we consider the α-errors 0.05, 0.01, and 0.001, respectively, calculate the corresponding β-errors for different target hypotheses, and examine how they affect the posterior truth-proportions. For 60 tosses of a coin, the acceptance set for the null hypothesis extends for α = 0.05 up to 36 times heads (k = 36), for α = 0.01 up to 39 times heads, and for α = 0.001 up to 42 times heads. In addition, I assume an optimistic prior truth-proportion of 0.5, and I will also determine the Bayesian posterior probability updated with a result that falls just within the significant range, whereby I assume a prior probability of 0.5. Note that only significant results will be considered in the table.

Furthermore, I have marked hypotheses with posterior truth-proportions and probabilities of more than 90% as absolutely confirmed, to ease the comparison between the two methodologies. The values for w in the table give us the posterior truth-proportion that we can (on average) expect for certain significance levels and a certain definite target hypothesis H1 in 60 trials. A target hypothesis H1 that has successfully passed a significance test has to be regarded as plausible to the degree w, since this is the proportion of true hypotheses in its new reference class. The Bayesian calculation provides us with a different evaluation, but we have to keep in mind that I always update with the lowest value that is significant.

H1: 100·p =      70     65     60    59     58    57     55     52     51
β (α = 0.05)     6      25     55    61     67    72     82     92     94
w (α = 0.05)     95     94     90    89     87    85     78     63     56
P(H1|k=37)       67     82     83    83     81    80     75     63     57
β (α = 0.01)     24     55     82    86     89    92     96     98     99
w (α = 0.01)     98.7   98     95    93     92    89     82     60     51
P(H1|k=40)       96     97     95    93     92    90     85     68     60
β (α = 0.001)    55     83     96    97     98    98.6   99.4   99.8   99.9
w (α = 0.001)    99.8   99.4   98    97     95    93     86     60     48
P(H1|k=43)       99.7   99.5   98    97.7   97    95.6   91     73     62

Table 2: Posterior truth-proportions w (and β-errors) for α = 0.05, α = 0.01, and α = 0.001 with n = 60 and different target hypotheses H1 (values in percent)
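The quantities behind Table 2 can be recomputed directly; this sketch assumes the binomial test with n = 60 described above, uses the critical head counts 37, 40, and 43 implied by the acceptance sets just given, and checks a few of the columns (the helper names are mine).

```python
# beta-error, posterior truth-proportion w (q = 0.5), and the Bayesian
# posterior P(H1 | k) at the just-significant outcome, for point
# alternatives p1 against H0: p = 0.5 with equal priors.
from scipy.stats import binom

N = 60
CRITICAL = {0.05: 37, 0.01: 40, 0.001: 43}  # smallest significant head counts

def beta_error(p1, k):
    return binom.cdf(k - 1, N, p1)           # P(T < k | H1): a missed effect

def truth_proportion(alpha, beta, q=0.5):
    return q * (1 - beta) / (q * (1 - beta) + (1 - q) * alpha)

def bayes_posterior(p1, k):
    l1, l0 = binom.pmf(k, N, p1), binom.pmf(k, N, 0.5)
    return l1 / (l1 + l0)                    # equal priors on H1 and H0

for alpha, k in CRITICAL.items():
    for p1 in (0.7, 0.6, 0.55, 0.51):
        b = beta_error(p1, k)
        print(alpha, p1, round(100 * b), round(100 * truth_proportion(alpha, b)),
              round(100 * bayes_posterior(p1, k)))  # cf. the table columns
```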

We can see in Table 2, at first, that for α = 0.05 and a greater distance between the null and the target hypotheses, significance testing generates greater truth-proportions than the probabilities provided by Bayesian updating. This seems to be an advantage of significance testing over the Bayesian approach for this special situation of large or middle-sized effects and only few trials. But that impression is misleading. In fact, the significance filter works well on average, because in the given situation (in which we assume that H0 or H1 is true) k = 37 only seldom occurs. As a result, in 95% of the cases the hypotheses that have passed the filter are true. The filter is definitely reliable on average, but it does not tell us the degree of confirmation given by the specific datum k = 37. That is provided by the Bayesian analysis, and we see (which seems obvious in this case) that the confirmation of the target hypothesis by this datum is almost negligible. In that case, why do we not opt for the Bayesian analysis as the correct way to determine the confirmation? The classical statistician will answer that we normally are not in the comfortable position of knowing which point target hypothesis is the only candidate next to the null hypothesis. We just have the null hypothesis and can determine how probable certain data are if we assume it. Therefore, we have to be content with the averaging test procedure of significance testing. In Table 2 we only examine how this test procedure works by looking at different possible point hypotheses, but we are rarely in a position to specify one of them as privileged.

Moreover, we can see in Table 2 that the p-postulate has some truth in it in such situations, which can be established further in the following comparison with Bayesian posterior probabilities.

k          31   33   35   36   37   38    39    40    41    42     43     45
p-value    45   26   12   8    5    2.6   1.4   0.7   0.3   0.13   0.05   0.007
P(H1|k)    1    6    27   46   67   82    92    96    98    99     99.7   99.94

Table 3: p-values (in percent) for n = 60 and k times heads (and P(H1|k) for the target hypothesis p = 0.7)

Nevertheless, from the Bayesian point of view, only for smaller α-errors do we get considerable conformity with the proper judgment. For α = 0.01 and for α = 0.001 we find in Table 2 great similarities between the assessments of the two approaches.

Another important effect is that with decreasing distance between the two hypotheses the β-error increases dramatically (up to 1−α) and, thus, the posterior truth-proportion drops considerably. We already know this phenomenon from Bayesian calculations, but in the process of classical significance testing the phenomenon does not appear, because the method requires only a falsification of the null hypothesis and does not actually proceed comparatively. In cases of small effects the Fisherian methodology indicates a significant confirmation although the posterior truth-proportion shows the confirmation to be rather weak. In the more extreme cases we can hardly speak of a notable confirmation at all, e.g. when the truth-proportion falls clearly below 80%. Significance testing obviously falls short of its goals in these cases, which means that these data have no real impact on the question which of the two hypotheses is probably true. The filter lets as many false as true hypotheses pass and thereby loses its original function, which was based on the assumption that the filter predominantly withholds the false hypotheses but mostly lets the true ones pass. So, especially in the problematic cases of only small effects, we must use other methods or, at least, work with a much smaller α-error.

Intuitively, we can describe this phenomenon in the following way: if there is only a small distance between the null and the target hypothesis, our data are often not sufficient to decide between the hypotheses. Both hypotheses are similarly compatible with the data, since the hypotheses tell us similar things about the world, i.e., the likelihoods for the data diverge only slightly. In these cases, we obviously need more data to decide between the hypotheses than the data we need in order to get a significant confirmation in the Fisherian methodology. Hence, the data are often clearly over-interpreted by the significance test. This problem is overlooked by the classical statistician because he does not use comparative methods, and therefore the distance between the hypotheses plays no role in his procedure. The only aspect taken into consideration is whether the data are probabilistically compatible with the null hypothesis; it is not asked how compatible they are with the target hypothesis. In particular, no comparison takes place in the form of the Bayes factor (or likelihood ratios) as in other approaches. I will call the resulting problem the over-interpretation of data for small effects.

This problem presumably relates to the experiments of Bem (2011), but certainly also to many results in other disciplines. This corresponds to the assessment of Wagenmakers et al. (2011), who reached it, however, in a different way (by the calculation of Bayes factors). Accordingly, the classical statistician arrives too easily at significant results, especially for small effects, and these results provide only a small and incremental confirmation of the target hypothesis, but not the required absolute confirmation. In order to determine the validity of a significance test, we thus have to estimate the effects and then the truth-proportions before and after the experiment.

Our next question is how the number of trials or objects of investigation influences the results. Here we find another phenomenon that should worry us:

H1: P(heads) = p =    0.7   0.65    0.6   0.59   0.57   0.55   0.52   0.51

α = 0.05, n = 20 (null hypothesis maintained up to and including k = 14)
w                      94     92     89     88     86     83     78     75
P(H1|k = 15)           92   89.6     83     82     77     71     60     55

α = 0.05, n = 200 (null hypothesis maintained up to and including k = 112)
w                      95     95   94.5     94     92     88     70     58
P(H1|k = 113)         0.2     20     77     81     84     83     71     62

α = 0.01, n = 200 (null hypothesis maintained up to and including k = 115)
w                      99     99   98.6     98     97     95     80     67
P(H1|k = 116)           5     75     94     95     94     92     77     65

α = 0.001, n = 200 (null hypothesis maintained up to and including k = 122)
w                    99.9   99.9   99.7   99.6     99     97     81     64
P(H1|k = 123)          87     99   99.5   99.4     99     97     84     71

α = 0.05, n = 1000 (null hypothesis maintained up to and including k = 526)
w                      95     95     95     95     95     95     87     75
P(H1|k = 527)           0      0      0    0.1      9     60     80     70

α = 0.01, n = 1000 (null hypothesis maintained up to and including k = 537)
w                      99     99     99     99     99     99     93     80
P(H1|k = 538)           0      0      1      7     69     93     90     80

Table 4: Posterior truth-proportions for different values of n, α and p (all data in percent)

First, table 4 shows that an increase in the number of trials prima facie also leads to an improvement of the posterior truth-proportions, similar to a reduction of the α-error. However, the problem that we know from table 2 is not entirely avoided by larger numbers and a smaller α-error. In addition, the increase in numbers reveals a new problem for situations of large effects. If the probability of heads according to our target hypothesis is about 0.7, then the significant datum of 113 times heads in 200 tosses, or 527 times heads in 1000 tosses, intuitively speaks in favor of the null hypothesis, since if the target hypothesis is true we would on average expect 140 times heads in the first case and 700 times heads in the second. Therefore, the data are closer to the results we would normally expect if the null hypothesis is true than to those we would expect if the target hypothesis is true. Nevertheless, in significance tests the null hypothesis is regarded as falsified by these data.
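We can check this with the binomial likelihoods directly; a minimal sketch (the values in the comments are approximate):

```python
from scipy.stats import binom

n, k = 200, 113
print(binom.sf(k - 1, n, 0.5))   # p-value ~ 0.04: significant at alpha = 0.05
print(binom.pmf(k, n, 0.5))      # likelihood under H0: ~ 1e-2
print(binom.pmf(k, n, 0.7))      # likelihood under H1 (p = 0.7): ~ 1e-5
# The likelihood ratio of roughly a thousand to one favors H0,
# although the significance test rejects H0 on this datum.
```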

This problem is also reproduced in the corresponding Bayesian probabilities and corresponds to Lindley's paradox. But according to significance testing these data speak against the null hypothesis and are therefore misinterpreted by the Fisherian methodology. This is once more the case because significance testing is not really comparative. It only asks whether the data speak against the null hypothesis, but not whether they actually are in favor of a specific target hypothesis. We can call this the problem of misinterpretation for large effects and large numbers n. Significance tests are more trustworthy for smaller numbers of trials. We can see this problem in the following diagram:

Figure 2: Distances between the data and the two hypotheses

In a significance test we obtain a significant test result d if d is greater than the acceptance threshold 0.5 + x (with some x > 0) for the null hypothesis. But with increasing n, x becomes smaller and smaller. If the true hypothesis H1 lies further to the right, then we can see that (with increasing n) an increasing number of significant data d are nearer to H0 than to H1. These data are misinterpreted by significance testing. They intuitively speak in favor of H0, which is represented in the diagram by their greater distance to H1 than to H0.
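How quickly the threshold 0.5 + x approaches 0.5 can be seen numerically; a minimal sketch (the helper name k_crit is my own) computing the smallest significant number of heads for the one-sided test:

```python
from scipy.stats import binom

def k_crit(n, p0=0.5, alpha=0.05):
    """Smallest k with P(X >= k | H0: p = p0) <= alpha (one-sided test)."""
    k = 0
    while binom.sf(k - 1, n, p0) > alpha:
        k += 1
    return k

for n in (20, 60, 200, 1000):
    k = k_crit(n)
    print(n, k, round(k / n, 3))  # threshold fraction 0.5 + x; x shrinks with n
# -> 20 15 0.75 / 60 37 0.617 / 200 113 0.565 / 1000 527 0.527
```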

In consequence, it seems that the significance testing methodology is caught in a dilemma: for small effects it is subject to the danger of an over-interpretation of the data, and for large effects we even find a wrong interpretation of the data, and this in the favorable situation that every second hypothesis tested is a true one. This is a powerful argument for, first, estimating the initial plausibility of the hypothesis and, second, estimating the size of the effect and on this basis estimating the posterior truth-proportion. At a minimum, the examples have shown the posterior truth-proportion to be a better indicator of the strength of the confirmation than the p-value. In addition, the problems strongly suggest using other evaluation methods, such as Bayesian updating, at least in parallel to the significance tests.

The problem is further illustrated by an example that Beck-Bornholdt and Dubben (1996) published in Nature, but which has not always been interpreted correctly. A newer version of it can be found in Beck-Bornholdt & Dubben (1998). In this example we choose as the null hypothesis H0 that Jim has played in a lottery, and as the target hypothesis that this is not the case. We now obtain the evidence E that Jim was paid the first prize of the lottery. Intuitively, we can conclude from E that Jim has (probably) played in the lottery. But we further assume that the chance of winning the first prize in the lottery is only one in a million. Therefore the p-value for our evidence E is P(E|H0) = 0.000001, and according to the logic of significance testing (corresponding to a probabilistic modus tollens, compare Greco 2011) we should reject the null hypothesis. The problem once again results from the fact that significance testing is not comparative. Although E has a small probability on the assumption of the null hypothesis, it has an even much smaller probability if we take the target hypothesis to be true. Such a capital error of the lottery company (to pay the first prize to someone who has not even participated in the lottery) has certainly occurred less often than once in a million payouts. Therefore, the resulting likelihood ratio (or the resulting Bayes factor) is a far better indicator and speaks clearly in favor of our null hypothesis. The Fisherian methodology is obviously not designed for such cases, but it is noticeable that significance testing is the only one of the three approaches we consider that does not draw the right conclusions from the evidence in this situation.
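In numbers, the comparison might look as follows; a minimal sketch in which the likelihood of E under the target hypothesis (a mistaken payout) is an illustrative assumption of mine:

```python
# Lottery example as a likelihood comparison.
p_e_given_h0 = 1e-6  # P(E | H0): Jim played and won the first prize
p_e_given_h1 = 1e-9  # P(E | H1): Jim never played but was paid anyway
                     # (illustrative assumption: such errors are far rarer)

bayes_factor = p_e_given_h0 / p_e_given_h1
print(bayes_factor)  # 1000.0: E favors H0 by three orders of magnitude,
                     # although the p-value under H0 is "highly significant"
```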

The point is that in this example we accept, at first, a piece of information of the form H0 ∨ H1 with two distinct hypotheses H0 and H1, both of which determine the likelihoods for many data. Then it seems obvious that we should base a comparison of the two hypotheses on the likelihoods P(E|H0) and P(E|H1). Prima facie, the hypothesis for which E is more probable explains the data E better (cf. Strevens 2000), and it therefore fits the data E better than the rival hypothesis. The problem with the Fisherian methodology is that, firstly, it cannot use the special information at hand (H0 ∨ H1) and, secondly, it cannot really compare how well the data fit the first and the second hypothesis. In many situations in science we agree on a certain finite list of hypotheses and assume one of them to be the correct answer to a scientific problem, but the significance testing methodology cannot use this background knowledge and cannot concentrate on a comparison between the hypotheses, which seems the natural strategy for a further evaluation of the hypotheses.

Significance tests seem to be more apt for cases in which one concrete point hypothesis (the null hypothesis) determines the likelihoods for many data but the rival hypothesis (our target hypothesis) is, e.g., of a disjunctive type (or an interval hypothesis) and provides no particular likelihoods for the data. Then only two options remain: first, we can try to falsify the null hypothesis or, second, we can work with (meta-)probabilities for the disjuncts of the target hypothesis (or a prior density) in order to calculate likelihoods for the target hypothesis. These additional probabilities for the parts of the target hypothesis are normally epistemic probabilities, as in the Bayesian approach, which the classical statistician wants to avoid. He thus has to bite the bullet in these cases, live with the problems of significance testing that we have acknowledged above, and can additionally use the estimation of the posterior truth-proportion.
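The second option can be made concrete as follows; a minimal sketch assuming, purely for illustration, a uniform prior density over an interval target hypothesis p ∈ (0.5, 1] (the helper name marginal_likelihood is my own):

```python
from scipy.integrate import quad
from scipy.stats import binom

def marginal_likelihood(k, n, lo=0.5, hi=1.0):
    """P(E | H1) for the interval hypothesis p in (lo, hi],
    averaging the binomial likelihood over a uniform prior density."""
    value, _ = quad(lambda p: binom.pmf(k, n, p) / (hi - lo), lo, hi)
    return value

k, n = 37, 60
bf = marginal_likelihood(k, n) / binom.pmf(k, n, 0.5)
print(round(bf, 2))  # Bayes factor of the interval hypothesis against H0: ~1.6
```

On this assumed prior the significant datum k = 37 yields only a modest Bayes factor, in line with the over-interpretation worry discussed above.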

For the cases of clear misinterpretation of the data, he can argue that these critical data will occur only with a very low frequency. Only in these rare cases will the test procedure mislead us. Nevertheless, significance testing cannot correctly reconstruct those situations in which the data speak in favor of the null hypothesis; if in such cases the data are improbable from the viewpoint of the null hypothesis, it will misinterpret them. We have to see this deficit of the Fisherian methodology clearly in order to avoid systematic errors in our inductions from certain data.

Greco (2011), who defends the Fisherian methodology, even argues that the cases with large effects occur only seldom. He believes this to be the case at least for hypotheses about continuous magnitudes which are rather evenly distributed over a certain region. For hypotheses about proportions, on the other hand, he suggests a maximum likelihood comparison, but does not give the details of how exactly the comparison should proceed and in which cases we should accept certain hypotheses. Furthermore, for small effects the problem remains that the data are often over-interpreted.

Nevertheless, the problem of misinterpretation for larger effects seems to be tolerable if the posterior truth-proportion remains high for significantly confirmed hypotheses in these cases. If H0 is true, it will only seldom be rejected. However, a more accurate analysis of the data is provided by the Bayesian methodology, especially for the cases in which different determinate (point) hypotheses compete, as in the lottery example. But, to be sure, the debate will continue.

6 Conclusion

The filter analogy facilitated a direct frequentist assessment of the support that significant data give to some target hypotheses. It reveals some weaknesses of significance testing and indicates how we can improve it. Firstly, significance testing is just as dependent on an assessment of the prior plausibility of hypotheses as the Bayesian methodology. If the prior plausibility is low, as in the example of the precognition hypothesis of Bem, the significantly confirming data result only in a weak (or even negligible) support for the hypothesis. In any case, we achieve only a weak incremental confirmation and not the absolute one required in classical epistemology for accepting the hypothesis. But even if we send only sufficiently plausible hypotheses into the filter, the Fisherian methodology still shows further problems. We run into a dilemma: if we have only small effects, significant data result only in low posterior truth-proportions, i.e. the data are often over-interpreted; and if we deal with large effects, the posterior truth-proportions may be sufficient, but we nevertheless find an obvious misinterpretation of the data.

In the first situation the filter loses its proper function because the β-error has become so great that the rate 1−β at which true hypotheses pass is no longer large compared with the α-error, so that almost as many false as true target hypotheses pass the filter. That seems to be a problem for the precognition hypothesis even if we granted it a sufficient prior plausibility. This result corresponds to the objections of Wagenmakers et al. (2011), who obtain only low values of the Bayes factor for Bem's hypothesis.

In the second situation, in which the effects are greater, we have to deal with the problem of misinterpretation. For α = 0.05 and n = 200 a result of 113 times heads is already significant, and we can reject the null hypothesis. But if the real probability of heads is 0.7, this datum obviously speaks more in favor of the null than of the target hypothesis P(heads) = 0.7. The significance test cannot deal with this problem satisfactorily because it is not really comparative. Even smaller α-errors such as α = 0.01 (or two-sided tests) cannot completely solve the problem. The need for a comparative analysis is also reflected in the lottery example.

Consequently, to assess the confirmation by a significant result, we must firstly discuss the prior plausibility of the target hypothesis and secondly determine the posterior truth-proportion based on estimates of the effect size. Furthermore, it seems helpful to make some Bayesian calculations alongside and to use, e.g., the Bayes factor as an additional indicator of the support given to the target hypothesis by the data. Altogether, we should not only consider whether the data are significant in the sense of Fisher for the hypothesis; we have to provide a more comprehensive assessment of the confirmation the data give to our target hypothesis, which also includes an analysis of the intuitive explanatory power the hypotheses have with respect to the data, and we have to consider how good the explanations are that alternative hypotheses can provide.

Bibliography

Alcock, J. (2011): Back from the Future: Parapsychology and the Bem Affair, Skeptical Inquirer, URL: http://www.csicop.org/specialarticles/show/back_from_the_future.

Bartelborth, T. (2012): Die erkenntnistheoretischen Grundlagen induktiven Schließens, E-Book, URN: http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-84565.

Beck-Bornholdt, H. P. & Dubben, H. H. (1996): Is the pope an alien?, Nature, 381, p. 730.

Beck-Bornholdt, H. P. & Dubben, H. H. (1998): Der Hund, der Eier legt. Erkennen von Fehlinformationen durch Querdenken, Hamburg: Rowohlt.

Bem, D. J. (2011): Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect, Journal of Personality and Social Psychology, 100, 1–19.

Bem, D. J., Utts, J. & Johnson, W. O. (2011): Reply: Must Psychologists Change the Way They Analyze Their Data?, Journal of Personality and Social Psychology, 101, 716–719.

Berger, J. O. & Delampady, M. (1987): Testing precise hypotheses, Statistical Science, 2, 317–352.

Campbell, S. & Franklin, J. (2004): Randomness and the Justification of Induction, Synthese, 138, 79–99.

Feller, W. (1968): An Introduction to Probability Theory and its Applications, Volume I, 3rd edition, Wiley.

Fisher, R. A. (1935a): The design of experiments, Edinburgh: Oliver & Boyd.

Fisher, R. A. (1935b): The logic of inductive inference, Journal of the Royal Statistical Society, 98, 39–82.

Fisher, R. A. (1958): Statistical methods for research workers, 13th edition, New York: Hafner.

Fitelson, B. (2011): Favoring, Likelihoodism, and Bayesianism, Philosophy and Phenomenological Research, 83 (3), 666–672.

Franklin, J. (2001): Resurrecting logical probability, Erkenntnis, 55, 277–305.

Galak, J., LeBoeuf, R. A., Nelson, L. D. & Simmons, J. P. (2012): Correcting the past: Failures to replicate psi, Journal of Personality and Social Psychology, 103, 933–948.

Greco, D. (2011): Significance Testing in Theory and Practice, British Journal for the Philosophy of Science, 62, 607–637.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D. & Iverson, G. (2009): Bayesian t tests for accepting and rejecting the null hypothesis, Psychonomic Bulletin & Review, 16 (2), 225–237.

Royall, R. M. (1997): Statistical Evidence: A Likelihood Paradigm, New York: Chapman & Hall/CRC.

Strevens, M. (2000): Do Large Probabilities Explain Better?, Philosophy of Science, 67, 366–390.

Wagenmakers, E.-J. (2007): A practical solution to the pervasive problems of p values, Psychonomic Bulletin & Review, 14, 779–804.

Wagenmakers, E.-J., Wetzels, R., Borsboom, D. & van der Maas, H. L. J. (2011): Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011), Journal of Personality and Social Psychology, 100, 426–432.

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., Kievit, R. & van der Maas, H. L. J. (2011a): Yes, psychologists must change the way they analyze their data: Clarifications for Bem, Utts, & Johnson (2011), manuscript: http://dl.dropbox.com/u/1018886/ClarificationsForBemUttsJohnson.pdf.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J. & Wagenmakers, E.-J. (2011): Statistical evidence in experimental psychology: An empirical comparison using 855 t tests, Perspectives on Psychological Science, 6, 291–298.