Mayo O&M slides (4-28-13)


Page 1: Mayo O&M slides (4-28-13)


Methodology and Ontology in Statistical Modeling: Some error statistical reflections

Our presentation falls under the second of the bulleted questions for the conference:

How do methods of data generation, statistical modeling, and inference influence the construction and appraisal of theories?

Statistical methodology can influence what we think we’re finding out about the world, in the most problematic ways, traced to such facts as:

• All statistical models are false
• Statistical significance is not substantive significance
• Statistical association is not causation
• No evidence against a statistical null hypothesis is not evidence the null is true
• If you torture the data enough, they will confess (or just omit unfavorable data)

These points are ancient (lying with statistics; “lies, damned lies, and statistics”). People are discussing these problems more than ever (big data), but it is rarely realized how much certain methodologies are at the root of the current problems.

Page 2: Mayo O&M slides (4-28-13)


All Statistical Models are False

Take the popular slogan in statistics and elsewhere: “all statistical models are false!” What the “all models are false” charge boils down to:

(1) the statistical model of the data is at most an idealized and partial representation of the actual data generating source.

(2) a statistical inference is at most an idealized and partial answer to a substantive theory or question.

• But we already know our models are idealizations: that’s what makes them models

• Reasserting these facts is not informative.
• Yet they are taken to have various (dire) implications about the nature and limits of statistical methodology.

• Neither of these facts precludes using such models to find out true things.

• On the contrary, it would be impossible to learn about the world if we did not deliberately falsify and simplify.

Page 3: Mayo O&M slides (4-28-13)


• Notably, the “all models are false” slogan is followed up by “But some are useful.”
• Their usefulness, we claim, lies in being capable of adequately capturing an aspect of a phenomenon of interest.
• Then a hypothesis asserting its adequacy (or inadequacy) is capable of being true!

Note: All methods of statistical inference rest on statistical models.

What differentiates accounts is how well they step up to the plate in checking adequacy, and in learning despite violations of statistical assumptions (robustness).

Page 4: Mayo O&M slides (4-28-13)


Statistical significance is not substantive significance

Statistical models (as they arise in the methodology of statistical inference) live somewhere between

1. Substantive questions, hypotheses, theories H

2. Statistical models of phenomena, experiments, data: M

3. Data x

What statistical inference has to do is afford adequate link-ups among these levels (reporting precision, accuracy, reliability).

Page 5: Mayo O&M slides (4-28-13)


Recent Higgs reports on evidence of a real (Higgs-like) effect (July 2012, March 2013)

Researchers define a “global signal strength” parameter

H0: μ = 0 corresponds to the background (null hypothesis),

μ > 0 to background + Standard Model Higgs boson signal,

but μ is only indirectly related to parameters in substantive models

As is typical of so much of actual inference (experimental and non), testable predictions are statistical:

They deduced what would be expected statistically from background alone (compared to the 5 sigma observed)

in particular, alluding to an overall test S:

Pr(Test S would yield d(X) > 5 standard deviations; H0) ≤ .0000003.

This is an example of an error probability
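As a rough, illustrative check (not the collaboration's actual analysis, and assuming the test statistic d(X) is approximately standard normal under H0), the quoted bound is just the upper-tail area of a standard normal beyond 5 standard deviations:

```python
# Illustrative sketch only: assumes d(X) ~ N(0, 1) under H0.
from scipy.stats import norm

# One-sided tail area beyond 5 standard deviations under the null.
p_five_sigma = norm.sf(5)  # survival function, 1 - CDF
print(f"Pr(d(X) > 5 sigma; H0) = {p_five_sigma:.2e}")  # about 2.9e-07, i.e. <= .0000003
```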

Page 6: Mayo O&M slides (4-28-13)


The move from statistical report to evidence

The inference actually detached from the evidence can be put in any number of ways:

There is strong evidence for H: a Higgs (or a Higgs-like) particle.

An implicit principle of inference is

Why do data x0 from a test S provide evidence for rejecting H0 ?

Because, were H0 a reasonably adequate description of the process generating the data, H0 would (very probably) have survived the test (with respect to the question).

Yet statistically significant departures were generated: July 2012, March 2013 (from 5 to 7 sigma).

The inference that the observed difference is “real” (non-fluke) has been put to a severe test.

Philosophers often call it an “argument from coincidence”

(This is a highly stringent level; apparently, in this arena of particle physics, smaller observed effects often disappear.)

Page 7: Mayo O&M slides (4-28-13)


Even so we cannot infer to any full theory

That’s what’s wrong with the slogan “Inference to the Best Explanation”:

Some explanatory hypothesis T entails a statistically significant effect.
Statistical effect x is observed.
Therefore, data x are good evidence for T.

The problem: Pr(T “fits” data x; T is false) = high.

And in other, less theoretical fields, the perils of “theory-laden” interpretation of even genuine statistical effects are great.

[Babies look statistically significantly longer when red balls are picked from a basket with few red balls: Does this show they are running, at some intuitive level, a statistical significance test, recognizing statistically surprising results? It’s not clear]

Page 8: Mayo O&M slides (4-28-13)


The general worry reflects an implicit requirement for evidence:

Minimal Requirement for Evidence: If data are in accordance with a theory T, but the method would have issued so good a fit even if T is false, then the data provide poor or no evidence for T.

The basic principle isn’t new; we find it in Peirce, Popper, Glymour…. What’s new is finding a way to use error probabilities from frequentist statistics (error statistics) to cash it out, to resolve controversies in statistics, and even to give a foundation for rival accounts.
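A minimal sketch of how this requirement gets violated (purely hypothetical numbers, not from any real study): hunt through many pure-noise predictors and report only the best “fit.” The method would issue an impressive-looking fit even though every candidate hypothesis is false, so by the minimal requirement the fit is poor evidence.

```python
# Hypothetical illustration: every predictor is pure noise, yet hunting for
# the best correlate routinely yields a "good" fit with the outcome.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_predictors, n_sims = 30, 20, 5000
hits = 0  # simulations in which the best fit clears a nominal .05 cutoff

for _ in range(n_sims):
    y = rng.standard_normal(n_subjects)                  # outcome: pure noise
    X = rng.standard_normal((n_subjects, n_predictors))  # predictors: pure noise
    r = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_predictors)]
    if max(r) > 0.36:  # roughly the two-sided .05 cutoff for a correlation, n = 30
        hits += 1

print(f"Best-of-{n_predictors} fit looks 'significant' in {hits / n_sims:.0%} of simulations")
```

The point is not the exact number, only that the hunting procedure would very probably produce as good a fit even with no real effects, which is exactly what the minimal requirement rules out as evidence.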

Page 9: Mayo O&M slides (4-28-13)


Dirty Hands: But these statistical assessments, some object, depend on methodological choices in specifying statistical methods; outputs are influenced by discretionary judgments (the “dirty hands” argument).

While it is obvious that human judgments and human measurements are involved, this observation (like “all models are false”) is too trivial to distinguish how different accounts handle threats of bias and unwarranted inference.

Regardless of the values behind choices in collecting, modeling, and drawing inferences from data, I can critically evaluate how good a job has been done (test too sensitive, not sensitive enough, violated assumptions).

Page 10: Mayo O&M slides (4-28-13)


An even more extreme argument moves from “models are false,” to “models are objects of belief,” to “therefore statistical inference is all about subjective probability.” By the time we get to the “confirmatory stage” we’ve made so many judgments; why fuss over a few subjective beliefs at the last part?

George Box (a well known statistician) “the confirmatory stage of an investigation…will typically occupy, perhaps, only the last 5 per cent of the experimental effort. The other 95 per cent—the wondering journey that has finally led to that destination---involves many heroic subjective choices (what variables? What levels? What scales?, etc. etc…. Since there is no way to avoid these subjective choices…why should we fuss over subjective probability?” (70)

It is one thing to say our models are objects of belief, and quite another to convert the entire task to modeling beliefs.

We may call this a shift from phenomena to epiphenomena (Glymour 2010).

Yes, there are assumptions, but we can test them, or at least discern how they may render our inferences less precise, or completely wrong.

Page 11: Mayo O&M slides (4-28-13)


The choice isn’t full-blown truth or degrees of belief.

We may warrant models (and inferences) to various degrees, such as by assessing how well corroborated they are.

Some try to adopt this perspective of testing their statistical models, but give us tools with very little power to find violations.

• Some of these same people, ironically, say that since we know our model is false, the criterion of high power to detect falsity is not of interest (Gelman).
• Knowing something is an approximation is not to pinpoint where it is false, or how to get a better model.

[Unless you have methods with power to probe this approximation, you will have learned nothing about where the model stands up and where it breaks down, what flaws you can rule out, and which you cannot.]
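A hedged sketch of the “very little power” worry (hypothetical setup, using a generic normality check rather than anyone’s specific diagnostic): with only 15 observations, a standard misspecification test can easily miss a real heavy-tailed violation, so a non-rejection tells us little about where the model breaks down.

```python
# Hypothetical illustration: low power of an assumption check in a small sample.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
n_obs, n_sims, alpha = 15, 2000, 0.05
rejections = 0

for _ in range(n_sims):
    x = rng.standard_t(df=3, size=n_obs)  # data truly violate normality (heavy tails)
    _, p_value = shapiro(x)               # normality check on the sample
    if p_value < alpha:
        rejections += 1

print(f"Power to detect the violation with n = {n_obs}: about {rejections / n_sims:.0%}")
# Typically well below the conventional 80% benchmark: "no violation detected"
# is weak reassurance here.
```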

Page 12: Mayo O&M slides (4-28-13)


Back to our question: How do methods of data generation, statistical modeling, and analysis influence the construction and appraisal of theories at multiple levels?

• All statistical models are false
• Statistical significance is not substantive significance
• Statistical association is not causation
• No evidence against a statistical null hypothesis is not evidence the null is true
• If you torture the data enough, they will confess (or just omit unfavorable data)

These facts open the door to a variety of antiquated statistical fallacies, and the “all models are false,” “dirty hands,” and “it’s all subjective” arguments encourage them, from popularized to sophisticated research in the social sciences, medicine, and social psychology.

“We’re more fooled by noise than ever before, and it’s because of a nasty phenomenon called ‘big data’. With big data, researchers have brought cherry-picking to an industrial level.” (Taleb, Fooled by randomness 2013)

It’s not big data; it’s big mistakes about methodology and modeling.

Page 13: Mayo O&M slides (4-28-13)


This business of cherry picking falls under a more general issue of “selection effects” that I have been studying and writing about for many years.

Selection effects come in various forms and are given different names: double counting, hunting with a shotgun (for statistical significance), looking for the pony, look-elsewhere effects, data dredging, multiple testing, p-value hacking.

One common example: a published result of a clinical trial alleges a statistically significant benefit (of a given drug for a given disease) at a small level, .01, but ignores 19 other non-significant trials. Such selective reporting actually makes it easy to find a positive result on one factor or another, even if all are spurious. The probability that the procedure yields erroneous rejections differs from, and will be much greater than, 0.01 (nominal vs. actual significance levels). How to adjust for hunting and multiple testing is a separate issue (e.g., false discovery rates).
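A quick illustration of the nominal/actual gap using the numbers from this example (20 independent trials, each tested at the .01 level, all nulls true); the figures apply to this hypothetical setup only:

```python
# Nominal vs. actual significance under selective reporting (illustrative numbers).
alpha = 0.01       # nominal level of each individual trial
n_trials = 20      # 1 reported "significant" trial + 19 ignored non-significant ones

# Probability that at least one of 20 independent true-null trials reaches
# nominal significance by chance alone.
actual = 1 - (1 - alpha) ** n_trials
print(f"Nominal level per trial: {alpha}")
print(f"Chance of at least one spurious 'finding': {actual:.3f}")  # about 0.18
```

Reporting only the one “significant” trial thus conceals an actual error rate roughly eighteen times the nominal .01.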

Page 14: Mayo O&M slides (4-28-13)


If one reports results selectively, or stops when the data look good, etc., it becomes easy to prejudge hypotheses: your favored hypothesis H might be said to have “passed” the test, but it is a test that lacks stringency or severity.

(our minimal principle for evidence again)

• Selection effects alter the error probabilities of tests and estimation methods, so at least methods that compute them can pick up on the influences

• If, on the other hand, they are reported in the same way, significance testing’s basic principles are being twisted, distorted, and invalidly used.

• It is not a problem about long runs either.

We cannot say of the case at hand that it has done a good job of avoiding the source of misinterpretation, since the procedure makes it so easy to find a fit even if the hypothesis is false.
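A minimal simulation (hypothetical numbers) of “stop when the data look good”: keep collecting batches and test at the nominal .05 level after each batch, stopping at the first “significant” result. With the null true throughout, the actual probability of ever declaring significance is several times the nominal level, which is what an error-probability assessment picks up and a report of the final nominal p-value hides.

```python
# Illustrative optional stopping: peeking at a nominal .05 z-test after each
# batch inflates the actual probability of a false rejection, even though the
# null (mean = 0, known sigma = 1) is true throughout.
import numpy as np

rng = np.random.default_rng(2)
n_sims, batch_size, max_batches, z_crit = 5000, 10, 20, 1.96
false_rejections = 0

for _ in range(n_sims):
    data = np.array([])
    for _ in range(max_batches):
        data = np.concatenate([data, rng.standard_normal(batch_size)])
        z = data.mean() * np.sqrt(len(data))  # z statistic with sigma known to be 1
        if abs(z) > z_crit:                   # "the data look good" -> stop and report
            false_rejections += 1
            break

print(f"Nominal level: 0.05; actual error rate with peeking: {false_rejections / n_sims:.2f}")
```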

Page 15: Mayo O&M slides (4-28-13)


The growth of fallacious statistics is due to the acceptability of methods that declare themselves free from such error-probabilistic encumbrances (e.g., Bayesian accounts).

Popular methods of model selection (AIC, and others) suffer from similar blind spots

Whole new fields have arisen for discerning spurious statistics and non-replicable results (statistical forensics); all use error statistical methods to identify flaws.

(Stan Young, Uri Simonsohn, Brad Efron, Baggerly and Coombes)

• All statistical models are false
• Statistical significance is not substantive significance
• Statistical association is not causation
• No evidence against a statistical null hypothesis is not evidence the null is true
• If you torture the data enough, they will confess (or just omit unfavorable data)

To us, the list is not a list of embarrassments but of justifications for the account we favor.

Page 16: Mayo O&M slides (4-28-13)


Models are false:
• Does not prevent finding out true things with them.

Discretionary choices in modeling:
• Do not entail we are only really learning about beliefs.
• Do not prevent critically evaluating the properties of the tools you chose.

A methodology that uses probability to assess and control error probabilities has the basis for pinpointing the fallacies (statistical forensics, meta-statistical analytics).

These models work because they need only capture rather coarse properties of the phenomena being probed: the error probabilities assessed are approximately related to the actual ones.

These problems are intertwined with testing the assumptions of statistical models. The person from whom I’ve learned the most about this is Aris Spanos, who will now turn to that.