The ASA (2016) Statement on P-values: How to Stop Refighting the Statistics Wars
The CLA Quantitative Methods Collaboration Committee & Minnesota Center for Philosophy of Science
April 8, 2016
Deborah G. Mayo


Page 1: Mayo minnesota 28 march 2 (1)

The ASA (2016) Statement on P-values: How to Stop Refighting the Statistics Wars

The CLA Quantitative Methods Collaboration Committee & Minnesota Center for Philosophy of Science

April 8, 2016

Deborah G Mayo

Page 2: Mayo minnesota 28 march 2 (1)

2

Brad Efron“By and large, Statistics is a prosperous and happy country, but it is not a completely peaceful one. Two contending philosophical parties, the Bayesians and the frequentists, have been vying for supremacy over the past two-and-a-half centuries. …Unlike most philosophical arguments, this one has important practical consequences. The two philosophies represent competing visions of how science progresses….” (2013, p. 130)

Page 3: Mayo minnesota 28 march 2 (1)

3

Today’s Practice: Eclectic

O Use of eclectic tools, little handwringing over foundations

O Bayes-frequentist unifications

O Scratch a bit below the surface and foundational problems emerge…

O Not just 2: family feuds within (Fisherian, Neyman-Pearson; tribes of Bayesians, likelihoodists)

Page 4: Mayo minnesota 28 march 2 (1)

4

Why are the statistics wars more serious today?

O Replication crises led to programs to restore credibility: fraud busting, reproducibility studies

O Taskforces, journalistic reforms, and debunking treatises

O Proposed methodological reforms, many welcome (preregistration), some quite radical

Page 5: Mayo minnesota 28 march 2 (1)

5

I was a philosophical observer at the ASA P-value “pow wow”

Page 6: Mayo minnesota 28 march 2 (1)

6

“Don’t throw out the error control baby with the bad statistics bathwater” The American Statistician

Page 7: Mayo minnesota 28 march 2 (1)

7

O “Statistical significance tests are a small part of a rich set of:

“techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033)

O These I call error statistical methods (or sampling theory).”

Page 8: Mayo minnesota 28 march 2 (1)

8

One Rock in a Shifting Scene

O “Birnbaum calls it the ‘one rock in a shifting scene’ in statistical practice

O “Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data”

Page 9: Mayo minnesota 28 march 2 (1)

9

Error Statistics

O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes

O The inference may be in error

O It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities)

Page 10: Mayo minnesota 28 march 2 (1)

10

“p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function t = t(y) of the data, to be called the test statistic, such that

• the larger the value of t the more inconsistent are the data with H0;

• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…the p-value corresponding to any t as p = p(t) = P(T ≥ t; H0)”

(Mayo and Cox 2006, p. 81)
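As a concrete illustration, here is a minimal Python sketch (not from the slides; the function, data, and settings are hypothetical) of the definition p = P(T ≥ t; H0) for a one-sided z test with σ known:

```python
import numpy as np
from scipy import stats

# Minimal sketch: a one-sided z test of H0: mu = 0 with sigma known,
# illustrating p = p(t) = P(T >= t; H0).
def z_test_p_value(y, mu0=0.0, sigma=1.0):
    n = len(y)
    t = (np.mean(y) - mu0) / (sigma / np.sqrt(n))  # test statistic t = t(y)
    p = 1 - stats.norm.cdf(t)                      # P(T >= t; H0), T ~ N(0, 1) under H0
    return t, p

rng = np.random.default_rng(1)
y = rng.normal(loc=0.2, scale=1.0, size=50)        # simulated data with a small true effect
t, p = z_test_p_value(y)
print(f"t = {t:.2f}, p = {p:.3f}")
```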

Page 11: Mayo minnesota 28 march 2 (1)

11

O “Clearly, if even larger differences than t occur fairly frequently under H0 (p-value is not small), there’s scarcely evidence of incompatibility 

O But a small p-value doesn’t warrant inferring a genuine effect H, let alone a scientific conclusion H*–as the ASA document correctly warns (Principle 3)”

 

Page 12: Mayo minnesota 28 march 2 (1)

12

A Paradox for Significance Test Critics

Critic: It’s much too easy to get a small P-value

You: Why do they find it so difficult to replicate the small P-values others found?

Is it easy or is it hard?

Page 13: Mayo minnesota 28 march 2 (1)

13

O R.A. Fisher: it’s easy to lie with statistics by selective reporting (he called it the “political principle”)

O Sufficient finagling—cherry-picking, P-hacking, significance seeking—may practically guarantee a researcher’s preferred claim C gets support, even if it’s unwarranted by evidence

O Note: Rejecting a null taken as support for some non-null claim C

Page 14: Mayo minnesota 28 march 2 (1)

14

Severity Requirement:

O If data x0 agree with a claim C, but the test procedure had little or no capability of finding flaws with C (even if the claim is incorrect), then x0 provide poor evidence for C

O Such a test fails a minimal requirement for a stringent or severe test

O My account: severe testing based on error statistics

Page 15: Mayo minnesota 28 march 2 (1)

15

Two main views of the role of probability in inference (not in ASA doc)

O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0. (e.g., Bayesian, likelihoodist)—with regard for inner coherency

O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)

Page 16: Mayo minnesota 28 march 2 (1)

16

What happened to using probability to assess the error probing capacity by the severity criterion?

O Neither “probabilism” nor “performance” directly captures it

O Good long-run performance is a necessary, not a sufficient, condition for severity

O That’s why frequentist methods can be shown to have howlers

 

Page 17: Mayo minnesota 28 march 2 (1)

17

O Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs—

O It’s that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data

Page 18: Mayo minnesota 28 march 2 (1)

18

A claim C is not warranted _______

O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)

O Performance: unless it stems from a method with low long-run error

O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C

Page 19: Mayo minnesota 28 march 2 (1)

19

O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation. False!

O I claim, error probabilities play a crucial role in appraising well-testedness

O It’s crucial to be able to say, C is highly believable or plausible but poorly tested

O Probabilists can allow for the distinct task of severe testing (you may not have to entirely take sides in the stat wars)

Page 20: Mayo minnesota 28 march 2 (1)

20

The ASA doc gives no sense of different tools for different jobs

O To use an eclectic toolbox in statistics, it’s important not to expect agreement on numbers from methods evaluating different things

O A p-value isn’t ‘invalid’ because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks*, 2015)

*Editors of a journal, Basic and Applied Social Psychology

Page 21: Mayo minnesota 28 march 2 (1)

21

O ASA Principle 2 says a p-value ≠ posterior but one doesn’t get the sense of its role in error probability control

O It’s not that I’m keen to defend many common uses of significance tests

O The criticisms are often based on misunderstandings; consequently so are many “reforms”

Page 22: Mayo minnesota 28 march 2 (1)

22

Biasing selection effects:

O One function of severity is to identify problematic selection effects (not all are)

O Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered or incapable of being assessed

O Picking up on these alterations is precisely what enables error statistics to be self-correcting—

Page 23: Mayo minnesota 28 march 2 (1)

23

Nominal vs actual significance levels

The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)

Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent! (Selvin, 1970, p. 104)

From Morrison & Henkel’s The Significance Test Controversy (1970)!
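The 64% figure is simple arithmetic if the twenty tests are treated as independent, each at the 0.05 level; a minimal sketch:

```python
# Sketch of the arithmetic behind Selvin's 64%, assuming 20 independent
# tests of true nulls, each at the 0.05 level:
p_at_least_one_significant = 1 - 0.95 ** 20
print(round(p_at_least_one_significant, 2))  # 0.64
```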

Page 24: Mayo minnesota 28 march 2 (1)

24

O They were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” level

O There are many more ways you can be wrong with hunting (different sample space)

O Here’s a case where a p-value report is invalid

Page 25: Mayo minnesota 28 march 2 (1)

25

You report: Such results would be difficult to achieve under the assumption of H0

When in fact such results are common under the assumption of H0

(Formally):

O You say Pr(P-value ≤ Pobs; H0) = Pobs, which is small

O But in fact Pr(P-value ≤ Pobs; H0) is high
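A small simulation sketch (assumed settings: 20 independent two-sided z tests with every null true, reporting only the smallest p-value) illustrates how the actual probability departs from the nominal one:

```python
import numpy as np
from scipy import stats

# Sketch under assumed settings (20 independent two-sided z tests, all null
# hypotheses true, n = 30 per test): hunt for the smallest p-value and report
# only that one. The reported p-value is below 0.05 far more often than 5%.
rng = np.random.default_rng(0)
n_sim, n_tests, n = 10_000, 20, 30

hits = 0
for _ in range(n_sim):
    data = rng.normal(0.0, 1.0, size=(n_tests, n))       # every null is true
    z = data.mean(axis=1) / (1.0 / np.sqrt(n))
    p = 2 * (1 - stats.norm.cdf(np.abs(z)))               # two-sided p-values
    if p.min() <= 0.05:                                   # report the best-looking result
        hits += 1

print(hits / n_sim)  # roughly 0.64, not 0.05
```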

Page 26: Mayo minnesota 28 march 2 (1)

26

O Nowadays, we’re likely to see the tests blamed

O My view: Tests don’t kill inference, people do

O Even worse are those statistical accounts where the abuse vanishes!

Page 27: Mayo minnesota 28 march 2 (1)

27

On some views, taking account of biasing selection effects “defies scientific sense”

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010)

(To his credit, he’s open about this; he heads the Meta-Research Innovation Center at Stanford)

Page 28: Mayo minnesota 28 march 2 (1)

28

Technical activism isn’t free of philosophy

Ben Goldacre (of Bad Science) in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new "technical activism”:

The editors at Annals of Internal Medicine,… repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” [from results] produced after a trial began; ….they say that their expertise allows them to permit — and even solicit — undeclared outcome-switching

Page 29: Mayo minnesota 28 march 2 (1)

29

His paper: “Make journals report clinical trials properly”

O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy!

Page 30: Mayo minnesota 28 march 2 (1)

30

Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses

P(x0;H1)/P(x0;H0)

The data x0 are fixed, while the hypotheses vary
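As a toy illustration of the ratio (hypothetical numbers, not from the slides):

```python
from scipy import stats

# Toy illustration: the likelihood ratio P(x0; H1)/P(x0; H0) for one
# observation x0 from N(mu, 1), comparing H1: mu = 1 with H0: mu = 0.
x0 = 1.5
lik_H1 = stats.norm.pdf(x0, loc=1.0, scale=1.0)
lik_H0 = stats.norm.pdf(x0, loc=0.0, scale=1.0)
print(lik_H1 / lik_H0)  # about 2.7; x0 is fixed while the hypotheses vary
```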

Page 31: Mayo minnesota 28 march 2 (1)

31

Jimmy Savage on the LP:

O “According to Bayes' theorem,…. if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, p. 17)

Page 32: Mayo minnesota 28 march 2 (1)

32

All error probabilities violate the LP (even without selection effects):

Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space (Lindley 1971, p. 436) 

The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects (Rosenkrantz, 1977, p. 122)

Page 33: Mayo minnesota 28 march 2 (1)

33

Paradox of Optional Stopping:

Error probing capacities are altered not just by cherry picking and data dredging, but also via data dependent stopping rules:

Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.

Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule:

Page 34: Mayo minnesota 28 march 2 (1)

34

“Trying and trying again”

O Keep sampling until H0 is rejected at the 0.05 level, i.e., keep sampling until |M| ≥ 1.96 σ/√n

O Trying and trying again: having failed to rack up a 1.96 σ difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96 σ difference

Page 35: Mayo minnesota 28 march 2 (1)

35

Nominal vs. Actual significance levels again:

O With n fixed the Type 1 error probability is 0.05

O With this stopping rule the actual significance level differs from, and will be greater than 0.05

O Violates Cox and Hinkley’s (1974) “weak repeated sampling principle”
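A quick simulation sketch (assumed settings: batches of 10 observations, at most 100 in total) shows the inflation described here:

```python
import numpy as np

# Sketch with assumed settings: H0: mu = 0 is true, sigma = 1, and we add
# batches of 10 observations (up to 100) until |M| >= 1.96*sigma/sqrt(n).
rng = np.random.default_rng(42)
n_sim, batch, n_max = 5_000, 10, 100

rejections = 0
for _ in range(n_sim):
    x = np.empty(0)
    while x.size < n_max:
        x = np.append(x, rng.normal(0.0, 1.0, size=batch))
        if abs(x.mean()) >= 1.96 / np.sqrt(x.size):        # "significant": stop and reject
            rejections += 1
            break

print(rejections / n_sim)  # well above the nominal 0.05
```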

Page 36: Mayo minnesota 28 march 2 (1)

36

1959 Savage Forum

Jimmy Savage audaciously declared:

“optional stopping is no sin”, so the problem must be with significance levels

Peter Armitage:

“thou shalt be misled” 

if thou dost not know the person tried and tried again (p. 72)

Page 37: Mayo minnesota 28 march 2 (1)

37

O “The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)

O However, the same p-hacked hypothesis can occur in Bayes factors; optional stopping can be guaranteed to exclude true nulls from HPD intervals

Page 38: Mayo minnesota 28 march 2 (1)

38

With One Big Difference:

O The direct grounds to criticize inferences as flouting error statistical control are lost

O They condition on the actual data

O Error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)

Page 39: Mayo minnesota 28 march 2 (1)

39

Tension: Does Principle 4 Hold for Other Approaches?

O “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches”

(They include Bayes factors and likelihood ratios as “alternative measures of evidence”)

O They appear to extend “full reporting and transparency” (Principle 4) to all methods

O Some controversy: should it apply only to “p-values and related statistics”?

Page 40: Mayo minnesota 28 march 2 (1)

40

How might probabilists block intuitively unwarranted inferences (without error probabilities)?

A subjective Bayesian might say:

If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences)

Page 41: Mayo minnesota 28 march 2 (1)

41

Rescued by beliefs

O That could work in some cases (it still wouldn’t show what researchers had done wrong): a battle of beliefs

O Besides, researchers sincerely believe their hypotheses 

O So now you’ve got two sources of flexibility, priors and biasing selection effects

Page 42: Mayo minnesota 28 march 2 (1)

42

No help with our most important problem

O How to distinguish the warrant for a single hypothesis H with different methods

(e.g., one has biasing selection effects, another, pre-registered results and precautions)?

Page 43: Mayo minnesota 28 march 2 (1)

43

Most Bayesians are “conventional”

O Eliciting subjective priors too difficult, scientists reluctant to allow subjective beliefs to overshadow data

O Default, or reference priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006)

Page 44: Mayo minnesota 28 march 2 (1)

44

O A classic conundrum: no general non-informative prior exists, so most are conventional

O “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, p. 299)

O Prior probability: An undefined mathematical construct for obtaining posteriors (giving highest weight to data, or satisfying invariance, or matching or….)

 

Page 45: Mayo minnesota 28 march 2 (1)

45

Conventional Bayesian Reforms are touted as free of selection effects

O Jim Berger gives us “conditional error probabilities” CEPs

O “[I]t is considerably less important to disabuse students of the notion that a frequentist error probability is the probability that the hypothesis is true, given the data”, since under his new definition “a CEP actually has that interpretation”

O “CEPs do not depend on the stopping rule”(“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” 2003)

Page 46: Mayo minnesota 28 march 2 (1)

46

By and large the ASA doc highlights classic foibles

“In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1935, p. 14)

(“isolated” low P-value ≠> H: statistical effect)

Page 47: Mayo minnesota 28 march 2 (1)

47

Statistical ≠> substantive (H ≠> H*)

“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions” (Gigerenzer et al. 1989, pp. 95–96)

Page 48: Mayo minnesota 28 march 2 (1)

48

O Flaws in alternative H* have not been probed by the test,

O The inference from a statistically significant result to H* fails to pass with severity 

O “Merely refuting the null hypothesis is too weak to corroborate” substantive H*, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called ‘a highly improbable coincidence’” (Meehl and Waller 2002, p.184)

Page 49: Mayo minnesota 28 march 2 (1)

49

O Encouraged by something called NHSTs (null hypothesis significance tests) that supposedly allow moving from statistical to substantive claims

O If defined that way, they exist only as abuses of tests

O ASA doc ignores Neyman-Pearson (N-P) tests

Page 50: Mayo minnesota 28 march 2 (1)

50

Neyman-Pearson (N-P) Tests:

A null and an alternative hypothesis, H0 and H1, that exhaust the parameter space

O So the fallacy of rejection (H → H*) is impossible

O Rejecting the null only indicates statistical alternatives

Page 51: Mayo minnesota 28 march 2 (1)

51

P-values Don’t Report Effect Sizes (Principle 5)

Who ever said to just report a P-value?

O “Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms” (Mayo and Cox 2006, Mayo and Spanos 2006)

Page 52: Mayo minnesota 28 march 2 (1)

52

To Avoid Inferring a Discrepancy Beyond What’s Warranted: the large n problem

O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
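A minimal numerical sketch of this point, in the spirit of the severity computations in Mayo and Spanos (2006); the particular numbers (μ1 = 0.2, n = 25 and 400) are hypothetical:

```python
import numpy as np
from scipy import stats

# Sketch of the severity computation with hypothetical numbers: one-sided test
# of H0: mu <= 0, sigma = 1, and an observed mean that is *just* significant at
# alpha = 0.025 (z = 1.96).  SEV(mu > mu1) = P(Xbar <= xbar_obs; mu = mu1).
def severity_mu_greater(mu1, n, sigma=1.0, z_alpha=1.96):
    xbar_obs = z_alpha * sigma / np.sqrt(n)                 # just-significant observed mean
    return stats.norm.cdf((xbar_obs - mu1) / (sigma / np.sqrt(n)))

for n in (25, 400):
    print(n, round(severity_mu_greater(mu1=0.2, n=n), 3))
# n = 25  -> ~0.83: decent evidence of a discrepancy mu > 0.2
# n = 400 -> ~0.02: the same alpha-significant result warrants no such discrepancy
```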

Page 53: Mayo minnesota 28 march 2 (1)

53

O What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze? [The larger sample size is like the one that goes off with burnt toast]

Page 54: Mayo minnesota 28 march 2 (1)

54

What About the Fallacy of Non-Significant Results?

O Non-Replication occurs with non-significant results, but there’s much confusion in interpreting them

O No point in running replication research if you view negative results as uninformative

Page 55: Mayo minnesota 28 march 2 (1)

55

O They don’t warrant 0 discrepancy

O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds

O If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ1 (where μ1 = μ0 + γ), then the data are good evidence that μ < μ1

O Akin to power analysis (Cohen, Neyman) but sensitive to x0
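A minimal sketch of setting such an upper bound (hypothetical numbers: one-sided test of H0: μ ≤ 0, σ = 1, n = 100, observed mean 0.1; the severity function is an illustration in the spirit of the slides, not taken from them):

```python
import numpy as np
from scipy import stats

# Sketch: upper bound from a non-significant result (z = 1.0, not significant
# at 0.05).  SEV(mu < mu1) = P(Xbar > xbar_obs; mu = mu1), the probability of a
# larger difference than observed were a discrepancy mu1 to exist.
def severity_mu_less(mu1, xbar_obs, n, sigma=1.0):
    return 1 - stats.norm.cdf((xbar_obs - mu1) / (sigma / np.sqrt(n)))

xbar_obs, n = 0.1, 100
for mu1 in (0.05, 0.2, 0.3):
    print(mu1, round(severity_mu_less(mu1, xbar_obs, n), 3))
# mu1 = 0.05 -> ~0.31: cannot rule out discrepancies this small
# mu1 = 0.2  -> ~0.84: fairly good grounds that mu < 0.2
# mu1 = 0.3  -> ~0.98: strong grounds that mu < 0.3
```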

Page 56: Mayo minnesota 28 march 2 (1)

56

Improves on Confidence Intervals

“This is akin to confidence intervals (which are dual to tests), but we get around their shortcomings:

O We do not fix a single confidence level,

O The evidential warrant for different points in any interval is distinguished”

O Go beyond “performance goal” to give inferential construal

Page 57: Mayo minnesota 28 march 2 (1)

57

Simple Fisherian Tests Have Important Uses

O Model validation:

George Box calls for ecumenism because “diagnostic checks and tests of fit”, he argues, “require frequentist theory significance tests for their formal justification” (Box 1983, p. 57)

Page 58: Mayo minnesota 28 march 2 (1)

58

“What we are advocating…is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data, rather than entering into a contest with some alternative model” (Gelman & Shalizi 2013, p. 20)

O Fraudbusting and forensics: finding data too good to be true (Simonsohn 2013)

Page 59: Mayo minnesota 28 march 2 (1)

59

Concluding remarks: Reforms without Philosophy of Statistics are Blind

O I end my commentary: “Failing to understand the correct (if limited) role of simple significance tests threatens to throw the error control baby out with the bad statistics bathwater”

O Avoid refighting the same wars, or banning methods based on cookbook uses long lampooned

O Makes no sense to banish tools for testing assumptions the other methods require and cannot perform

Page 60: Mayo minnesota 28 march 2 (1)

60

O Don’t expect agreement on numbers from methods evaluating different things

O Recognize different roles of probability: probabilism, long-run performance, probativism (severe testing)

O Probabilisms may enable rather than block illicit inferences due to biasing selection effects

O Main paradox of the “replication crisis”

Page 61: Mayo minnesota 28 march 2 (1)

61

Paradox of Replication

O Critic: It’s too easy to satisfy standard significance thresholds

O You: Why do replicationists find it so hard to achieve them with preregistered trials?

O Critic: Most likely the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs

O You: So, replication researchers want methods that pick up on and block these biasing selection effects

O Critic: Actually the “reforms” recommend methods where selection effects make no difference

Page 62: Mayo minnesota 28 march 2 (1)

62

Either you care about error probabilities or not

O If not, experimental design principles (e.g., RCTs) may well go by the board

O Not enough to have a principle: we must be transparent about data-dependent selections

O Your statistical account needs a way to make use of the information

O “Technical activists” are not free of conflicts of interest and of philosophy

Page 63: Mayo minnesota 28 march 2 (1)

63

Granted, error statistical improvements are needed

O An inferential construal of error probabilities wasn’t clearly given (Birnbaum); that is my goal

O It’s not long-run error control (performance), but severely probing flaws today

O I also grant that an error statistical account needs to say more about how it will use background information

Page 64: Mayo minnesota 28 march 2 (1)

64

Future ASA project

O Look at the “other approaches” (Bayes factors, LRs, Bayesian updating)

O What is it for a replication to succeed or fail on those approaches? (can’t be just a matter of prior beliefs in the hypotheses)

Page 65: Mayo minnesota 28 march 2 (1)

65

Finally, it should be recognized that often better statistics cannot help

O Rather than search for more “idols”, do better science, get better experiments and theories

O One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest

Page 66: Mayo minnesota 28 march 2 (1)

66

Be ready to admit questionable science

O The scientific status of an inquiry is questionable if it is unable to distinguish a poorly run study from a poor hypothesis

O Or if it continually violates minimal requirements for severe testing

Page 67: Mayo minnesota 28 march 2 (1)

67

Non-replications often construed as simply weaker effects

Two that didn’t replicate in psychology:

O Belief in free will and cheating

O Physical distance (of points plotted) and emotional closeness

Page 68: Mayo minnesota 28 march 2 (1)

68

Page 69: Mayo minnesota 28 march 2 (1)

69

The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model

O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

O (4) Proper inference requires full reporting and transparency

O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result

O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

Page 70: Mayo minnesota 28 march 2 (1)

70

Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); Mayo and Spanos (2006): SEV

FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist

FEV/SEV (significant result): a statistically significant d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as δ absent

Page 71: Mayo minnesota 28 march 2 (1)

71

Test T+ (Normal testing): H0: μ ≤ μ0 vs. H1: μ > μ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then μ ≤ M0 + kε σ/√n passes the test T+ with severity (1 – ε)

(FEV/SEV): If d(x) is statistically significant, then μ > M0 – kε σ/√n passes the test T+ with severity (1 – ε)

where P(d(X) > kε) = ε and M0 is the observed sample mean

Page 72: Mayo minnesota 28 march 2 (1)

72

References

O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder.” Statistical Science 18(1): 1–12; 28–32.

O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385–402.

O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225(5237): 1033.

O Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and D. F. J. Wu, 51–84. New York: Academic Press.

O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.

O Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276–304. Cambridge: Cambridge University Press.

O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126–46.

O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.

O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17(1): 69–78.

Page 73: Mayo minnesota 28 march 2 (1)

73

O Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76–80.

O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Kruger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.

O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.

O Goldacre, B. 2016. “Make Journals Report Clinical Trials Properly.” Nature 530(7588); online 02 Feb 2016.

O Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor.” Annals of Internal Medicine 130: 1005–1013.

O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.

O Mayo, D. G. and Cox, D. R. 2010. “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 1–27. Cambridge: Cambridge University Press. First appeared in The Second Erich L. Lehmann Symposium: Optimality (2006), Lecture Notes–Monograph Series, Vol. 49, Institute of Mathematical Statistics, 247–275.

O Mayo, D. G. and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323–357.

Page 74: Mayo minnesota 28 march 2 (1)

74

O Mayo, D. G. and Spanos, A. 2011. “Error Statistics.” In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 152–198. Handbook of the Philosophy of Science, Vol. 7. The Netherlands: Elsevier.

O Meehl, P. E. and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283–300.

O Morrison, D. E. and Henkel, R. E., eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

O Pearson, E. S. and Neyman, J. 1930. “On the Problem of Two Samples.” In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. First published in Bull. Acad. Pol. Sci. 73–96.

O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

O Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter.

O Simonsohn, U. 2013. “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone.” Psychological Science 24(10): 1875–1888.

O Trafimow, D. and Marks, M. 2015. “Editorial.” Basic and Applied Social Psychology 37(1): 1–2.

O Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician.

Page 75: Mayo minnesota 28 march 2 (1)

75

Abstract

If a statistical methodology is to be adequate, it needs to register how “questionable research practices” (QRPs) alter a method’s error probing capacities. If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. The goal of severe testing is the linchpin for (re)interpreting frequentist methods so as to avoid long-standing fallacies at the heart of today’s statistics wars. A contrasting philosophy views statistical inference in terms of posterior probabilities in hypotheses: probabilism. Presupposing probabilism, critics mistakenly argue that significance and confidence levels are misinterpreted, exaggerate evidence, or are irrelevant for inference. Recommended replacements—Bayesian updating, Bayes factors, likelihood ratios—fail to control severity.