
Page 1: Controversy Over the Significance Test Controversy

Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises

Controversy Over the Significance Test Controversy

Philosophy of Science Association Biennial Meeting, November 4, 2016

Deborah G. Mayo (Virginia Tech)

Page 2: Controversy Over the Significance Test Controversy

“Science is in Crisis!”

O Once high-profile failures of replication spread beyond the social sciences to genomics and bioinformatics, people started to worry about scientific credibility

O Replication research, methodological activism, fraudbusting, statistical forensics

Page 3: Controversy Over the Significance Test Controversy

Methodological Reforms without philosophy of statistics are blind

Proposed methodological reforms are being adopted; many are welcome (e.g., preregistration), some quite radical

Without a better understanding of the philosophical, statistical, and historical issues, many are likely to fail

Page 4: Controversy Over the Significance Test Controversy

American Statistical Association (ASA): Statement on P-values

“The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban P-values” (ASA 2016)

Page 5: Controversy Over the Significance Test Controversy

2015: the ASA brought together members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values

Page 6: Controversy Over the Significance Test Controversy

I was a ‘philosophical observer’ at the ASA P-value “pow wow”

Page 7: Controversy Over the Significance Test Controversy

My commentary: “Don’t throw out the error control baby with the bad statistics bathwater”, The American Statistician (Mayo 2016)

Page 8: Controversy Over the Significance Test Controversy

Error Statistics

Statistics: the collection and modeling of data, and the drawing of inferences from data to claims about aspects of the processes that generated them

The inference may be in error

It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities)

Significance tests (R.A. Fisher) are a small part of an error statistical methodology

Page 9: Controversy Over the Significance Test Controversy

“p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, to be called the test statistic, such that

• the larger the value of T the more inconsistent are the data with H0;

• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…the p-value corresponding to any tobs is p = Pr(T ≥ tobs; H0)”

(Mayo and Cox 2006, p. 81)
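To make the definition concrete, here is a minimal Python sketch (my illustration, not from the talk) for a one-sided Normal (z) test; the null mean, σ, and the sample are hypothetical stand-ins.

import numpy as np
from scipy import stats

mu0, sigma, n = 12.0, 2.0, 25              # hypothetical null mean, known sigma, sample size
rng = np.random.default_rng(1)
y = rng.normal(12.8, sigma, n)             # hypothetical observed sample
t_obs = (y.mean() - mu0) / (sigma / np.sqrt(n))   # test statistic T = t(y)
p = 1 - stats.norm.cdf(t_obs)              # p = Pr(T >= t_obs; H0)
print(f"t_obs = {t_obs:.2f}, p = {p:.3f}")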

Page 10: Controversy Over the Significance Test Controversy

Testing Reasoning

O If even larger differences than tobs occur fairly frequently under H0 (the P-value is not small), there’s scarcely evidence of incompatibility with H0

O A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than tobs were H0 true.

O This indication isn’t evidence of a genuine statistical effect H, let alone a scientific conclusion H*

The statistical-substantive (Stat-Sub) fallacy: H ⇒ H*

Page 11: Controversy Over the Significance Test Controversy

Neyman-Pearson (N-P) tests: a null and an alternative hypothesis, H0 and H1, that are exhaustive

H0: μ ≤ 12 vs. H1: μ > 12

O So this fallacy of rejection (H ⇒ H*) is impossible

Rejecting the null only indicates statistical alternatives (how discrepant from the null)

Page 12: Controversy Over the Significance Test Controversy

I’m not keen to defend many long-lampooned uses of significance tests

I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely tested

The criticisms are often based on misunderstandings; consequently so are many “reforms”

Page 13: Controversy Over the Significance Test Controversy

A paradox for significance test critics

Critic: It’s much too easy to get small P-values.

You: Then why do researchers find it so difficult to replicate the small P-values in published reports?

Is it easy or is it hard?

Page 14: Controversy Over the Significance Test Controversy

Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration’s replication project in psychology

OSC, Reproducibility Project: Psychology, 2011-15 (Science 2015): a crowd-sourced effort to replicate 100 studies (led by Brian Nosek, University of Virginia)

Page 15: Controversy Over the Significance Test Controversy

R.A. Fisher: it’s easy to lie with statistics by selective reporting (his “political principle”)

Sufficient finagling (cherry-picking, P-hacking, significance seeking, multiple testing, the look-elsewhere effect) may practically guarantee that a preferred claim H gets support, even if it’s unwarranted by the evidence (the verification fallacy)

(biasing selection effects, need to adjust P-values)

Note: rejecting a null is taken as support for some non-null claim H

Page 16: Controversy Over the Significance Test Controversy

O You report: Such results would be difficult to achieve under the assumption of H0

O When in fact such results are common under the assumption of H0

The ASA (p. 131) correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)

You say: Pr(P-value ≤ Pobs; H0) = Pobs (small)

But in fact: Pr(P-value ≤ Pobs; H0) = high*

*Note: P-values measure distance from H0 in reverse; the smaller the P-value, the greater the incompatibility
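A small simulation (my illustration, not from the talk; the 20 tests per study and the .05 threshold are assumptions) shows how reporting only the smallest of many null P-values makes the actual probability high:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_tests, n_obs, alpha = 10_000, 20, 30, 0.05
hits = 0
for _ in range(n_studies):
    z = rng.normal(0.0, 1.0, (n_tests, n_obs))   # H0 true for every test
    t = z.mean(axis=1) * np.sqrt(n_obs)          # z statistics
    p = 1 - stats.norm.cdf(t)                    # one-sided P-values
    hits += p.min() <= alpha                     # report only the "best" result
print(f"actual Pr(some P <= {alpha}; H0) ~ {hits / n_studies:.2f}")  # ~ 0.64, not 0.05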

Page 17: Controversy Over the Significance Test Controversy

Minimal (Severity) Requirement for evidence

If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H

(“too cheap to be worth having” Popper)

Such a test fails a minimal requirement for a stringent or severe test

My account: severe testing based on error statistics (requires reinterpreting tests)

Page 18: Controversy Over the Significance Test Controversy

This alters the role of probability; typically just two roles are recognized:

Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0.

(e.g., Bayesian, likelihoodist)—with regard for inner coherency

Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)

Page 19: Controversy Over the Significance Test Controversy

What happened to using probability to assess error probing capacity and severity?

Neither “probabilism” nor “performance” directly captures it

Good long-run performance is a necessary, not a sufficient, condition for severity

Page 20: Controversy Over the Significance Test Controversy

A claim H is not warranted _______

O Probabilism: unless H is true or probable (or gets a probability boost, is made comparatively firmer)

O Performance: unless it stems from a method with low long-run error

O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about H

Page 21: Controversy Over the Significance Test Controversy

Problems with selective reporting, cherry-picking, stopping when the data look good, and P-hacking are not problems about long runs:

It’s that we cannot say, about the case at hand, that it has done a good job of avoiding the sources of misinterpreting data

Page 22: Controversy Over the Significance Test Controversy

O If you assume probabilism, error probabilities seem relevant for inference only through misinterpretation. False!

O They play a key role in appraising well-testedness

O It’s crucial to be able to say: H is believable or plausible, but this is a poor test of it

O With this in mind consider a continuation of the paradox of replication

Page 23: Controversy Over the Significance Test Controversy

Critic: It’s too easy to satisfy standard significance thresholds

You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)?

Critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs)

You: So, the replication researchers want methods that pick up on, adjust, and block these biasing selection effects.

Critic: Actually, the “reforms” recommend methods on which the need to alter P-values due to data dredging vanishes

Page 24: Controversy Over the Significance Test Controversy

Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms (Bayes factors, posteriors), the import of the data is via the ratios of likelihoods of hypotheses

P(x0;H1)/P(x0;H0) for x0 fixed

They condition on the actual data; error probabilities, by contrast, take into account other outcomes that could have occurred but did not (the sampling distribution)
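To see the vanishing act, consider optional stopping: peek after each observation and stop as soon as P < .05. The likelihoods of the final data do not register the stopping rule, so LP-based measures see nothing to adjust; the actual error probability, however, is inflated far beyond .05. A sketch (my illustration; the cap of 100 looks is an assumption):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_max, alpha = 5_000, 100, 0.05
rejections = 0
for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, n_max)              # H0: mu = 0 is true
    for n in range(2, n_max + 1):                # peek after every observation
        z = x[:n].mean() * np.sqrt(n)
        if 2 * (1 - stats.norm.cdf(abs(z))) < alpha:   # two-sided P-value
            rejections += 1                      # stop at the first "significant" look
            break
print(f"Pr(reject at some look; H0) ~ {rejections / n_trials:.2f}")  # far above 0.05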

Page 25: Controversy Over the Significance Test Controversy

All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436) 

“The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz 1977, p. 122)

Page 26: Controversy Over the Significance Test Controversy

Today’s Meta-research is not free of philosophy of statistics

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…

But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010)

(To his credit, he’s open about this; he co-directs the Meta-Research Innovation Center (METRICS) at Stanford)

Page 27: Controversy Over the Significance Test Controversy

Sum-up so far:

O The main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, P-hacking

O These problems are picked up by a concern for performance or severity (both are violated in abuses of tests)

O Reforms based on “probabilisms” enable rather than check unreliable results due to biasing selection effects: “Bayes factors can be used in the complete absence of a sampling plan” (Bayarri, Benjamin, Berger, and Sellke 2016)

O Probabilists may find other ways to block bad inferences: background beliefs (for the discussion)

Page 28: Controversy Over the Significance Test Controversy

A few remarks on interconnected issues that cry out for philosophical insight…

Page 29: Controversy Over the Significance Test Controversy

1. Replication research

O Aims to use significance tests correctly

O Preregistered, avoids P-hacking, designed to have high power (a sketch of the power calculation follows below)

O Free of the “perverse incentives” of usual research: replications are guaranteed to be published
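A sketch of the "high power" design step (my illustration; the standardized effect of 0.4 and the .05/.80 conventions are assumptions): the power of the one-sided z test at a given n, and the n needed for 80% power.

import numpy as np
from scipy import stats

alpha, effect = 0.05, 0.4                  # assumed standardized effect (mu1 - mu0)/sigma
z_a = stats.norm.ppf(1 - alpha)            # rejection cutoff for the one-sided z test
n = 50
power = 1 - stats.norm.cdf(z_a - effect * np.sqrt(n))
print(f"power at n = {n}: {power:.2f}")    # ~ 0.88
z_b = stats.norm.ppf(0.80)
n_needed = int(np.ceil(((z_a + z_b) / effect) ** 2))
print(f"n for 80% power: {n_needed}")      # ~ 39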

Page 30: Controversy Over the Significance Test Controversy

Repligate

O Replication research has met with pushback: some call it methodological terrorism (enforcing good science or bullying?)

O I’m (largely) on the pro-replication side, but the replicationists need to go further…

Page 31: Controversy Over the Significance Test Controversy

Non-replications construed as simply weaker effects

O One of the non-replications, cleanliness and morality: does unscrambling soap words make you less judgmental?

“Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. …”

Page 32: Controversy Over the Significance Test Controversy

…Turns out, it did. Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education)

O Focusing on the P-values ignores larger questions of measurement in psychology and the leap from the statistical to the substantive: H ⇒ H*

O Such studies are increasingly the basis for experimental philosophy, and need philosophical scrutiny (free will/cheating: another non-replication)

Page 33: Controversy Over the Significance Test Controversy

2. Philosophy and History of Statistics

O What actually happened: N-P tests aimed to put Fisherian tests on logical ground (a theory of generating tests)

O All was hunky-dory until Fisher and Neyman began fighting (from 1935), almost entirely due to professional and personality disputes

O What’s read into what happened: a huge philosophical difference is read into their in-fighting

Page 34: Controversy Over the Significance Test Controversy

Long-run error control (performance) vs. inference

O (Neyman) N-P methods → performance only, irrelevant for inference

O (Fisher) P-values are inferential (in some sense)

Contemporary work begins here:

O The only way for P-values to be inferential is to misinterpret them as posterior probabilities!

O All of error statistics is deemed problematic

Page 35: Controversy Over the Significance Test Controversy

It’s the method, stupid

O Even if it were true that Fisher and Neyman held rival philosophies (inferential vs. behavioral performance), we should look at what the methods do (beyond the “inconsistent hybrid” of Gigerenzer 2004)

O Instead, P-value users rob themselves of features from N-P tests that they need (an animal called NHST)

O Many say P-values must be reconciled with posteriors in some way

Page 36: Controversy Over the Significance Test Controversy

3. Diagnostic Screening Model of Tests: an urn of nulls

(focus on science-wise error-rate performance)

O Imagine randomly selecting hypotheses from an urn of nulls, 90% of which are true

O Consider just 2 possibilities: H0: no effect, H1: meaningful effect; all else ignored

O Take the prevalence of 90% as Pr(H0 you picked) = .9, Pr(H1) = .1

O Reject H0 with a single (just) .05 significant result, cherry-picking to boot

Page 37: Controversy Over the Significance Test Controversy

The unsurprising result is that most “findings” are false:

Pr(H0 | findings with a P-value of .05) > .5

Pr(H0 | findings with a P-value of .05) ≠ Pr(P-value of .05 | H0)

A major source of confusion… Neyman on steroids (not an N-P type 1 error probability)

(Berger and Sellke 1987, Ioannidis 2005, Colquhoun 2014)
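The arithmetic behind the slide, as a sketch (the .9 prevalence of true nulls is from the slide; the power of .2 and the dredging-inflated alpha are my illustrative assumptions):

pr_H0, alpha, power = 0.9, 0.05, 0.2       # prevalence of true nulls; assumed alpha, power

def pr_H0_given_reject(pr_H0, alpha, power):
    """Pr(H0 | P <= alpha) over the imagined urn of null hypotheses."""
    false_pos = alpha * pr_H0                # true nulls that nonetheless reject
    true_pos = power * (1 - pr_H0)           # real effects that reject
    return false_pos / (false_pos + true_pos)

print(f"{pr_H0_given_reject(pr_H0, alpha, power):.2f}")   # ~ 0.69 > .5
print(f"{pr_H0_given_reject(pr_H0, 0.30, power):.2f}")    # ~ 0.93 if cherry-picking inflates alpha to .30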

Page 38: Controversy Over the Significance Test Controversy

4. Shifts in Philosophy of Statistics

Decoupling of methods from traditional philosophies

O Some Bayesians reject probabilism…

O They’re interested in “using modern statistics to implement the Popperian criteria of severe tests.” (Gelman and Shalizi 2013, p. 10)

O “Bayesian methods have seen huge advances in the past few decades. It is time for Bayesian philosophy to catch up…” (ibid. p. 79)  

Page 39: Controversy Over the Significance Test Controversy
Page 40: Controversy Over the Significance Test Controversy

The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model

O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

O (4) Proper inference requires full reporting and transparency

O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result

O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

Page 41: Controversy Over the Significance Test Controversy

The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model

O (2) P-values do NOT measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

O (3) Scientific conclusions and business or policy decisions should NOT be based only on whether a p-value passes a specific threshold

O (4) Proper inference requires full reporting and transparency

O (5) A p-value, or statistical significance, does NOT measure the size of an effect or the importance of a result

O (6) By itself, a p-value does NOT provide a good measure of evidence regarding a model or hypothesis

Page 42: Controversy Over the Significance Test Controversy

Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); Mayo and Spanos (2006): SEV

FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist

FEV/SEV (significant result): a statistically significant d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as δ absent

Page 43: Controversy Over the Significance Test Controversy

Test T+ (Normal testing): H0: μ ≤ μ0 vs. H1: μ > μ0, σ known

(FEV/SEV): If d(x0) is not statistically significant, then μ ≤ M0 + kε σ/√n passes the test T+ with severity (1 − ε)

(FEV/SEV): If d(x0) is statistically significant, then μ > M0 − kε σ/√n passes the test T+ with severity (1 − ε)

where Pr(d(X) > kε) = ε and M0 is the observed sample mean
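A sketch checking these severity bounds numerically (my illustration; μ0 = 12, σ = 2, n = 100, and the observed mean M0 = 12.4 are hypothetical):

import numpy as np
from scipy import stats

mu0, sigma, n = 12.0, 2.0, 100
se = sigma / np.sqrt(n)                     # 0.2
M0 = 12.4                                   # observed mean; d(x0) = 2.0, statistically significant

def sev_mu_greater(mu1):
    """SEV(mu > mu1) = Pr(d(X) <= d(x0); mu = mu1) after a significant result."""
    return stats.norm.cdf((M0 - mu1) / se)

eps = 0.05
k = stats.norm.ppf(1 - eps)                 # Pr(d(X) > k) = eps
print(f"SEV(mu > {M0 - k * se:.2f}) = {sev_mu_greater(M0 - k * se):.2f}")  # 0.95 = 1 - eps
print(f"SEV(mu > {mu0:.1f}) = {sev_mu_greater(mu0):.2f}")                  # 0.98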

Page 44: Controversy Over the Significance Test Controversy

References

• Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

• Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. 2016. “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90-103.

• Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder.” Statistical Science 18(1): 1-12; 28-32.

• Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385-402.

• Berger, J. O. and Sellke, T. 1987. “Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion and Rejoinder).” Journal of the American Statistical Association 82(397): 112-22; 135-9.

• Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (Letter to the Editor).” Nature 225(5237): 1033.

• Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and C. F. J. Wu, 51-84. New York: Academic Press.

• Colquhoun, D. 2014. “An Investigation of the False Discovery Rate and the Misinterpretation of P-values.” Royal Society Open Science 1(3): 140216.

• Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126-46.

Page 45: Controversy Over the Significance Test Controversy

• Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.

• Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276-304. Cambridge: Cambridge University Press.

• Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.

• Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17(1): 69-78.

• Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8-38; 76-80.

• Gigerenzer, G. 2004. “Mindless Statistics.” Journal of Socio-Economics 33(5): 587-606.

• Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.

• Gilbert, D. Twitter post: https://twitter.com/dantgilbert/status/470199929626193921

• Gill, comment on “Suspicion of Scientific Misconduct by Jens Forster” by Neuroskeptic, May 6, 2014, Discover Magazine blog: http://blogs.discovermagazine.com/neuroskeptic/2014/05/06/suspicion-misconduct-forster/#.Vynr3j-scQ0

• Goldacre, B. 2008. Bad Science. HarperCollins Publishers.

• Goldacre, B. 2016. “Make Journals Report Clinical Trials Properly.” Nature 530(7588): 7.

Page 46: Controversy Over the Significance Test Controversy

• Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor.” Annals of Internal Medicine 130: 1005-1013.

• Handwerk, B. 2015. “Scientists Replicated 100 Psychology Studies, and Fewer than Half Got the Same Results.” Smithsonian Magazine, August 27, 2015. http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist

• Hasselman, F. and Mayo, D. 2015. “seveRity” (R program), April 17. Retrieved from osf.io/k6w3h

• Ioannidis, J. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2(8): 0696-0701.

• Levelt Committee, Noort Committee, Drenth Committee. 2012. “Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel.” Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/

• Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.

• Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

• Mayo, D. G. 2016. “Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary.” The American Statistician, online March 7, 2016. http://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108

• Mayo, D. G. Error Statistics Philosophy blog: errorstatistics.com

Page 47: Controversy Over the Significance Test Controversy

• Mayo, D. G. and Cox, D. R. 2010. “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. First published in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Vol. 49, Institute of Mathematical Statistics, 247-275.

• Mayo, D. G. and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323-357.

• Mayo, D. G. and Spanos, A. 2011. “Error Statistics.” In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 152-198. Handbook of the Philosophy of Science, Vol. 7. The Netherlands: Elsevier.

• Meehl, P. E. and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283-300.

• Morrison, D. E. and Henkel, R. E., eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

• Open Science Collaboration (Nosek, B. et al.). 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251).

• Pearson, E. S. and Neyman, J. 1930. “On the Problem of Two Samples.” In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99-115. Berkeley: University of California Press. First published in Bul. Acad. Pol. Sci., 73-96.

Page 48: Controversy Over the Significance Test Controversy

• Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

• Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

• Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

• Simonsohn, U. 2013. “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone.” Psychological Science 24(10): 1875-1888.

• Smithsonian Magazine (see Handwerk).

• Trafimow, D. and Marks, M. 2015. “Editorial.” Basic and Applied Social Psychology 37(1): 1-2.

• Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process, and Purpose.” The American Statistician 70(2): 129-133.

• Link to the ASA statement and commentaries (under supplemental): http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108