
Page 1: Severe Testing: The Key to Error Correction

Severe Testing: The Key to Error Correction

Deborah G. Mayo, Virginia Tech

March 17, 2017, “Understanding Reproducibility and Error Correction in Science”

Page 2: Severe Testing: The Key to Error Correction

2

Statistical Crisis of Replication

O Statistical ‘findings’ disappear when others look for them.

O Beyond the social sciences to genomics, bioinformatics, and medicine (Big Data)

O Methodological reforms (some welcome, others radical)

O Need to understand philosophical, statistical, historical issues

Page 3: Severe Testing: The Key to Error Correction

3

American Statistical Association (ASA): Statement on P-values

“The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban p-values” (ASA, Wasserstein & Lazar 2016, p. 129)

Page 4: Severe Testing: The Key to Error Correction

4

I was a philosophical observer at the ASA P-value “pow wow”

Page 5: Severe Testing: The Key to Error Correction

5

“Don’t throw out the error control baby with the bad statistics bathwater” The American Statistician

Page 6: Severe Testing: The Key to Error Correction

6

O The most used methods are the most criticized

O Statistical significance tests are a small part of a rich set of:

“techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033)

O These I call error statistical methods (or sampling theory).

Page 7: Severe Testing: The Key to Error Correction

7

Error Statistics

O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes

O The inference may be in error

O It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities)

Page 8: Severe Testing: The Key to Error Correction

8

“p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, to be called the test statistic, such that

• the larger the value of T the more inconsistent are the data with H0;

• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…the p-value corresponding to any t_obs is p = p(t_obs) = Pr(T ≥ t_obs; H0)”

(Mayo and Cox 2006, p. 81)
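
To make the definition concrete, here is a minimal sketch (Python, not from the original slides) of the computation Pr(T ≥ t_obs; H0) for a one-sided test of a Normal mean with σ known; the data and parameter values are illustrative assumptions.

```python
from scipy import stats
import numpy as np

def p_value_one_sided(y, mu0, sigma):
    """p = Pr(T >= t_obs; H0), with test statistic T = sqrt(n)*(mean(y) - mu0)/sigma."""
    n = len(y)
    t_obs = np.sqrt(n) * (np.mean(y) - mu0) / sigma
    return stats.norm.sf(t_obs)  # survival function: Pr(Z >= t_obs) when H0 is true

# Illustrative data: 25 observations, testing H0: mu = 0 (sigma = 1 known)
rng = np.random.default_rng(1)
y = rng.normal(loc=0.4, scale=1.0, size=25)
print(p_value_one_sided(y, mu0=0.0, sigma=1.0))
```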

Page 9: Severe Testing: The Key to Error Correction

9

Testing Reasoning

O If even larger differences than t_obs occur fairly frequently under H0 (i.e., the P-value is not small), there’s scarcely evidence of incompatibility with H0

O A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than t_obs were H0 true

O This indication isn’t evidence of a genuine statistical effect H, let alone a scientific conclusion H*

Stat-Sub fallacy: H => H*

Page 10: Severe Testing: The Key to Error Correction

10

O I’m not keen to defend many long-lampooned uses of significance tests

O I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely tested

O The criticisms are often based on misunderstandings; consequently so are many “reforms”

Page 11: Severe Testing: The Key to Error Correction

11

Replication Paradox (for Significance Test Critics)

Critic: It’s much too easy to get a small P-value

You: Why do they find it so difficult to replicate the small P-values others found?

Is it easy or is it hard?

Page 12: Severe Testing: The Key to Error Correction

12

Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration’s replication project in psychology

OSC: Reproducibility Project: Psychology, 2011-15 (Science 2015): a crowd-sourced effort to replicate 100 articles (led by Brian Nosek, U. VA)

Page 13: Severe Testing: The Key to Error Correction

13

O R.A. Fisher: it’s easy to lie with statistics by selective reporting, not the test’s fault

O Sufficient finagling—cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence

(biasing selection effects, need to adjust P-values)

Note: Support for some preferred claim H is by rejecting a null hypothesis

O H hasn’t passed a severe test

Page 14: Severe Testing: The Key to Error Correction

14

Severity Requirement:

If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H

(“too cheap to be worth having” Popper)

O Such a test fails a minimal requirement for a stringent or severe test

O My account: severe testing based on error statistics (requires reinterpreting tests)

Page 15: Severe Testing: The Key to Error Correction

15

This alters the role of probability: typically just two roles

O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (e.g., Bayesian, likelihoodist)—with regard for inner coherency

O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)

Page 16: Severe Testing: The Key to Error Correction

16

What happened to using probability to assess the error-probing capacity by the severity requirement?

O Neither “probabilism” nor “performance” directly captures it

O Good long-run performance is a necessary, not a sufficient, condition for severity

 

Page 17: Severe Testing: The Key to Error Correction

17

O Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs—

O It’s that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data

Key to revising the role of error probabilities

Page 18: Severe Testing: The Key to Error Correction

18

A claim C is not warranted _______

O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)

O Performance: unless it stems from a method with low long-run error

O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C

Page 19: Severe Testing: The Key to Error Correction

19

O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation. False!

O I claim, error probabilities play a crucial role in appraising well-testedness

O It’s crucial to be able to say, C is highly believable or plausible but poorly tested

Page 20: Severe Testing: The Key to Error Correction

20

Biasing selection effects:

O One function of severity is to identify problematic selection effects (not all are)

O Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered or incapable of being assessed

O Picking up on these alterations is precisely what enables error statistics to be self-correcting—

Page 21: Severe Testing: The Key to Error Correction

21

Nominal vs. actual significance levels

Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent! (Selvin, 1970, p. 104)

From Morrison & Henkel’s Significance Test Controversy (1970!)
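
A quick arithmetic sketch of Selvin’s point (the 20-test setup is from the quote; the independence assumption is mine): if twenty null differences are examined and the most promising is tested at the nominal 5% level, the probability that at least one of the twenty reaches that level under H0 is about 64%.

```python
# Probability that at least one of 20 independent null tests is "significant"
# at the nominal .05 level (assuming independence).
nominal_alpha = 0.05
k = 20
actual_alpha = 1 - (1 - nominal_alpha) ** k
print(round(actual_alpha, 2))  # 0.64
```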

Page 22: Severe Testing: The Key to Error Correction

22

O Morrison and Henkel were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” level

O There are many more ways you can be wrong with hunting (different sample space)

Page 23: Severe Testing: The Key to Error Correction

23

Spurious P-Value

You report: Such results would be difficult to achieve under the assumption of H0

When in fact such results are common under the assumption of H0

(Formally):

O You say Pr(P-value ≤ P_obs; H0) ≈ P_obs (small)

O But in fact Pr(P-value ≤ P_obs; H0) = high
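
A simulation sketch of the formal point (the hunting scenario with 20 looks is an assumed illustration): when the reported P-value is the smallest of 20 independent null P-values, Pr(P-value ≤ .05; H0) is high rather than .05.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, k = 100_000, 20
# Under H0 each P-value is Uniform(0, 1); report only the smallest of k looks
p_reported = rng.uniform(size=(reps, k)).min(axis=1)
print((p_reported <= 0.05).mean())  # ~ 0.64, not 0.05
```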

Page 24: Severe Testing: The Key to Error Correction

24

Scapegoating

O Nowadays, we’re likely to see the tests blamed

O My view: Tests don’t kill inferences, people do

O Even worse are those statistical accounts where the abuse vanishes!

Page 25: Severe Testing: The Key to Error Correction

25

On some views, taking account of biasing selection effects “defies scientific sense”

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010)

(To his credit, he’s open about this; he heads the Meta-Research Innovation Center at Stanford)

Page 26: Severe Testing: The Key to Error Correction

26

Technical activism isn’t free of philosophy

Ben Goldacre (of Bad Science), in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new “technical activism”:

The editors at Annals of Internal Medicine … repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” [from results] produced after a trial began; … they say that their expertise allows them to permit – and even solicit – undeclared outcome-switching

Page 27: Severe Testing: The Key to Error Correction

27

His paper: “Make journals report clinical trials properly”

O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy!

Page 28: Severe Testing: The Key to Error Correction

28

Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses

P(x0;H1)/P(x0;H0)

The data x0 are fixed, while the hypotheses vary
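
A standard illustration of the LP’s import (the numbers are an assumed example, not from the slide): with 9 successes in 12 Bernoulli trials, the likelihood ratio of H1: p = 0.75 to H0: p = 0.5 is the same whether n = 12 was fixed in advance (binomial) or sampling stopped at the 3rd failure (negative binomial), since the constants cancel; error probabilities, by contrast, depend on which sampling plan generated the data.

```python
from scipy import stats

x, r = 9, 3  # 9 successes observed; 3 failures (12 trials in all)
lik_binom  = lambda p: stats.binom.pmf(x, x + r, p)    # n fixed at 12 in advance
lik_nbinom = lambda p: stats.nbinom.pmf(x, r, 1 - p)   # sample until the 3rd failure
print(lik_binom(0.75) / lik_binom(0.5))    # ~ 4.81
print(lik_nbinom(0.75) / lik_nbinom(0.5))  # ~ 4.81: same ratio, different sample space
```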

Page 29: Severe Testing: The Key to Error Correction

29

All error probabilities violate the LP (even without selection effects):

Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space (Lindley 1971, p. 436) 

The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects (Rosenkrantz, 1977, p. 122)

Page 30: Severe Testing: The Key to Error Correction

30

Paradox of Optional Stopping:

Error probing capacities are altered not just by cherry picking and data dredging, but also via data dependent stopping rules:

Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.

Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule:

Page 31: Severe Testing: The Key to Error Correction

31

“Trying and trying again”

O Keep sampling until H0 is rejected at 0.05 level

i.e., keep sampling until |M| ≥ 1.96 σ/√n

O Trying and trying again: Having failed to rack up a 1.96 σ difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96 σ difference

Page 32: Severe Testing: The Key to Error Correction

32

Nominal vs. Actual significance levels again:

O With n fixed the Type 1 error probability is 0.05

O With this stopping rule the actual significance level differs from, and will be greater than 0.05

O Violates Cox and Hinkley’s (1974) “weak repeated sampling principle”
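
A simulation sketch of the inflation (batch size, the cap on n, and other details are assumptions for illustration): sample in batches under H0: μ = 0 with σ = 1, stopping as soon as |√n · x̄| ≥ 1.96; the proportion of runs that ever reject is well above the nominal 0.05, and approaches 1 as the cap on n grows.

```python
import numpy as np

rng = np.random.default_rng(42)

def rejects_under_optional_stopping(max_n=500, batch=10):
    """Keep adding observations under H0 until |sqrt(n)*mean| >= 1.96 or n hits max_n."""
    x = np.empty(0)
    while x.size < max_n:
        x = np.append(x, rng.normal(size=batch))  # N(0, 1): H0 is true
        if abs(np.sqrt(x.size) * x.mean()) >= 1.96:
            return True
    return False

reps = 2000
print(np.mean([rejects_under_optional_stopping() for _ in range(reps)]))  # well above 0.05
```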

Page 33: Severe Testing: The Key to Error Correction

33

O “The ASA (p. 131) correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)

O They don’t mention that the same p-hacked hypothesis can occur in Bayes factors, credibility intervals, likelihood ratios

Page 34: Severe Testing: The Key to Error Correction

34

With One Big Difference:

O The direct grounds to criticize inferences as flouting error statistical control are lost

O They condition on the actual data

O Error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)

Page 35: Severe Testing: The Key to Error Correction

35

How might probabilists block intuitively unwarranted inferences

(without error probabilities)?

A subjective Bayesian might say:

If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences)

Page 36: Severe Testing: The Key to Error Correction

36

Rescued by beliefs

O That could work in some cases (it still wouldn’t show what researchers had done wrong)—battle of beliefs

O Besides, researchers sincerely believe their hypotheses 

O So now you’ve got two sources of flexibility, priors and biasing selection effects

Page 37: Severe Testing: The Key to Error Correction

37

No help with our most important problem

O How to distinguish the warrant for a single hypothesis H with different methods

(e.g., one has biasing selection effects, another, pre-registered results and precautions)?

Page 38: Severe Testing: The Key to Error Correction

38

Most Bayesians use “default” priors

O Eliciting subjective priors too difficult, scientists reluctant to allow subjective beliefs to overshadow data

O Default, or reference priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006)

Page 39: Severe Testing: The Key to Error Correction

39

O “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Default priors may not even be probabilities…” (Cox and Mayo 2010, p. 299)

O Default Bayesian Reforms are touted as free of selection effects

O “…Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, p. 100)  

Page 40: Severe Testing: The Key to Error Correction

40

Granted, some are prepared to abandon the LP for model testing

In an attempted meeting of the minds, Andrew Gelman and Cosma Shalizi say:

O “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”, which “might be seen as using modern statistics to implement the Popperian criteria of severe tests” (2013, p. 10).

O An open question

Page 41: Severe Testing: The Key to Error Correction

41

The ASA doc highlights classic foibles that block replication

“In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1935, p. 14)

“isolated” low P-value ≠> H: statistical effect

Page 42: Severe Testing: The Key to Error Correction

42

Statistical ≠> substantive (H ≠> H*)

“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions” (Gigerenzer et al. 1989, pp. 95-6)

Page 43: Severe Testing: The Key to Error Correction

43

The Problem is with so-called NHST (“null hypothesis significance testing”)

O NHSTs supposedly allow moving from statistical to substantive hypotheses

O If defined that way, they exist only as abuses of tests

O ASA doc ignores Neyman-Pearson (N-P) tests

Page 44: Severe Testing: The Key to Error Correction

44

Neyman-Pearson (N-P) tests:

Null and alternative hypotheses H0, H1 that are exhaustive

H0: μ ≤ 12 vs H1: μ > 12

O So this fallacy of rejection (H => H*) is impossible

O Rejecting the null only indicates statistical alternatives (how discrepant from null)

Page 45: Severe Testing: The Key to Error Correction

45

P-values Don’t Report Effect Sizes (Principle 5)

Who ever said to just report a P-value?

O “Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms” (Mayo and Cox 2006, Mayo and Spanos 2006)

Page 46: Severe Testing: The Key to Error Correction

46

To Avoid Inferring a Discrepancy Beyond What’s Warranted:

The large n problem:

O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
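
A small numerical sketch of this point (the sample sizes are illustrative assumptions): the sample mean that is just significant at α = .05 sits closer and closer to the null value μ0 as n grows, so the same “significant” verdict licenses a smaller discrepancy.

```python
import numpy as np

sigma, mu0, z_alpha = 1.0, 0.0, 1.96
for n in (25, 100, 10_000):
    just_significant_mean = mu0 + z_alpha * sigma / np.sqrt(n)  # smallest mean rejected at .05
    print(n, round(just_significant_mean, 3))
# 25 -> 0.392, 100 -> 0.196, 10000 -> 0.02
```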

Page 47: Severe Testing: The Key to Error Correction

47

O What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze?

O [The larger sample size is like the one that goes off with burnt toast]

Page 48: Severe Testing: The Key to Error Correction

48

What About Fallacies of Non-Significant Results?

O They don’t warrant 0 discrepancy

O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds

O If you very probably would have observed a more impressive (smaller) P-value than you did, were μ > μ1 (μ1 = μ0 + γ), then the data are good evidence that μ < μ1

O Akin to power analysis (Cohen, Neyman) but sensitive to x0
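
A sketch of that reasoning in code (the observed mean, σ, and n are assumed numbers; the setup is the one-sided test T+ of the later FEV/SEV slide): the severity with which μ < μ0 + γ passes is the probability of a larger observed difference, were μ = μ0 + γ.

```python
from scipy import stats
import numpy as np

def severity_mu_less_than(xbar, mu1, sigma, n):
    """SEV(mu < mu1) = Pr(observing a larger sample mean than xbar; mu = mu1)."""
    return stats.norm.sf((xbar - mu1) / (sigma / np.sqrt(n)))

# Illustration: H0: mu <= 0, sigma = 1, n = 100, observed mean 0.1 (not significant)
for gamma in (0.1, 0.2, 0.3):
    print(gamma, round(severity_mu_less_than(0.1, 0.0 + gamma, 1.0, 100), 3))
# 0.1 -> 0.5, 0.2 -> 0.841, 0.3 -> 0.977: larger discrepancies are ruled out with high severity
```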

Page 49: Severe Testing: The Key to Error Correction

49

O There’s another kind of fallacy behind a move that’s supposed to improve replication, but it confuses notions from significance testing and leads to “most findings are false”

O Fake replication crisis.

Page 50: Severe Testing: The Key to Error Correction

50

Diagnostic Screening Model of Tests: urn of nulls

(“most findings are false”)

O If we imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true

O Consider just 2 possibilities: H0: no effect, H1: meaningful effect (all else ignored)

O Take the prevalence of 90% as Pr(the H0 you picked) = .9, Pr(H1) = .1

O Rejecting H0 with a single (just) .05 significant result, cherry-picking to boot

Page 51: Severe Testing: The Key to Error Correction

51

The unsurprising result is that most “findings” are false: Pr(H0 | finding with a P-value of .05) > .5

Pr(H0 | finding with a P-value of .05) ≠ Pr(P-value of .05 | H0)

(Only the second one is a Type 1 error probability)

Major source of confusion… (Berger and Sellke 1987, Ioannidis 2005, Colquhoun 2014)
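
The arithmetic behind the screening claim, as a sketch (the power value, and the inflated α under hunting, are assumed for illustration; the slide gives only the 90% prevalence and the .05 level):

```python
def prob_H0_given_reject(alpha, power, prev_H0):
    """Pr(H0 | 'finding') by Bayes' theorem over the urn of hypotheses."""
    return (prev_H0 * alpha) / (prev_H0 * alpha + (1 - prev_H0) * power)

prev_H0 = 0.9   # 90% of the nulls in the urn are true
power = 0.5     # assumed Pr(reject; H1)
print(round(prob_H0_given_reject(0.05, power, prev_H0), 2))  # 0.47 at the nominal alpha = .05
print(round(prob_H0_given_reject(0.36, power, prev_H0), 2))  # 0.87 with alpha inflated by hunting
# Neither equals Pr(P-value of .05; H0) = .05, the Type 1 error probability
```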

Page 52: Severe Testing: The Key to Error Correction

52

O A: Announce a finding (a P-value of .05)

O Not properly Bayesian (not even empirical Bayes), not properly frequentist

O Where does the high prevalence come from?

Page 53: Severe Testing: The Key to Error Correction

53

Concluding Remark

O If replication research and reforms are to lead to error correction, they must correct errors: they don’t always do that

O They do when they encourage preregistration, control error probabilities & require good design (RCTs, checking model assumptions)

O They don’t when they permit tools that lack error control

Page 54: Severe Testing: The Key to Error Correction

54

Don’t Throw Out the Error Control Baby

O The main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, multiple testing

O These biasing selection effects are picked up by tools that assess error control (performance or severity)

O Reforms based on “probabilisms” enable rather than check unreliable results due to biasing selection effects

Page 55: Severe Testing: The Key to Error Correction

55

Repligate

O Replication research has pushback: some call it methodological terrorism (enforcing good science or bullying?)

O My gripe is that replications, at least in social psychology, should go beyond the statistical criticism

Page 56: Severe Testing: The Key to Error Correction

56

Non-replications construed as simply weaker effects

O One of the non-replications: cleanliness and morality: Does unscrambling soap words make you less judgmental?

“Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. …Chronicle of Higher Ed.

Page 57: Severe Testing: The Key to Error Correction

57

…Turns out, it did. Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education)

O Focusing on the P-values ignores larger questions of measurement in psych & the leap from the statistical to the substantive:

H => H*

O Increasingly the basis for experimental philosophy; needs philosophical scrutiny

Page 58: Severe Testing: The Key to Error Correction

58

The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model

O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

O (4) Proper inference requires full reporting and transparency

O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result

O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

Page 59: Severe Testing: The Key to Error Correction

59

Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then μ < M0 + kεσ/√n passes the test T+ with severity (1 – ε)

(FEV/SEV): If d(x) is statistically significant, then μ > M0 – kεσ/√n passes the test T+ with severity (1 – ε)

where P(d(X) > kε) = ε
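
A numerical sketch of these bounds (treating M0 as the observed sample mean; σ, n, and the observed means are illustrative assumptions):

```python
from scipy import stats
import numpy as np

def sev_bound_upper(M0, sigma, n, eps):
    """Non-significant case: mu < M0 + k_eps*sigma/sqrt(n) with severity 1 - eps."""
    return M0 + stats.norm.isf(eps) * sigma / np.sqrt(n)

def sev_bound_lower(M0, sigma, n, eps):
    """Significant case: mu > M0 - k_eps*sigma/sqrt(n) with severity 1 - eps."""
    return M0 - stats.norm.isf(eps) * sigma / np.sqrt(n)

# sigma = 1, n = 25, eps = 0.05 (k_eps ~ 1.645), mu0 = 0
print(round(sev_bound_upper(0.2, 1.0, 25, 0.05), 2))  # 0.53: mu < 0.53 for a non-significant M0 = 0.2
print(round(sev_bound_lower(0.4, 1.0, 25, 0.05), 2))  # 0.07: mu > 0.07 for a significant M0 = 0.4
```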

Page 60: Severe Testing: The Key to Error Correction

60

References

O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder.” Statistical Science 18(1): 1-12; 28-32.

O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385–402.

O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225(5237) (March 14): 1033.

O Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and D. F. J. Wu, 51-84. New York: Academic Press.

O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.

O Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276–304. Cambridge: Cambridge University Press.

O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126-46.

O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.

O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17(1): 69–78.

Page 61: Severe Testing: The Key to Error Correction

61

O Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.

O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Kruger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.

O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.

O Goldacre, B. 2016. “Make journals report clinical trials properly.” Nature 530(7588); online 02 Feb 2016.

O Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor.” Annals of Internal Medicine 130: 1005–1013.

O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.

O Mayo, D. G. and Cox, D. R. 2010. “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. This paper first appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

O Mayo, D. G. and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323–357.

Page 62: Severe Testing: The Key to Error Correction

62

O Mayo, D. G. and Spanos, A. 2011. “Error Statistics.” In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 152–198. Handbook of the Philosophy of Science, Volume 7. The Netherlands: Elsevier.

O Meehl, P. E. and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283–300.

O Morrison, D. E. and Henkel, R. E. (eds.) 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

O Pearson, E. S. and Neyman, J. 1930. “On the problem of two samples.” In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99-115. Berkeley: University of California Press. First published in Bull. Acad. Pol. Sci., 73-96.

O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

O Selvin, H. 1970. “A critique of tests of significance in survey research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

O Simonsohn, U. 2013. “Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone.” Psychological Science 24(10): 1875-1888.

O Trafimow, D. and Marks, M. 2015. “Editorial.” Basic and Applied Social Psychology 37(1): 1-2.

O Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process, and purpose.” The American Statistician.

Page 63: Severe Testing: The Key to Error Correction

63

Abstract

If a statistical methodology is to be adequate, it needs to register how “questionable research practices” (QRPs) alter a method’s error probing capacities. If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. The goal of severe testing is the linchpin for (re)interpreting frequentist methods so as to avoid long-standing fallacies at the heart of today’s statistics wars. A contrasting philosophy views statistical inference in terms of posterior probabilities in hypotheses: probabilism. Presupposing probabilism, critics mistakenly argue that significance and confidence levels are misinterpreted, exaggerate evidence, or are irrelevant for inference. Recommended replacements–Bayesian updating, Bayes factors, likelihood ratios–fail to control severity.