!"
The Confluence Between Stat Science
and Phil Science: Shallow vs. Deep
Explorations
Deborah G. Mayo
June 21, 2010
To begin, we might probe a familiar philosophy-
statistics analogy:
Popper is to frequentists as Carnap is to
Bayesians
“In opposition to [the] inductivist attitude, I
assert that C(H,x) must not be interpreted as
the degree of corroboration of H by x, unless
x reports the results of our sincere efforts to
overthrow H. The requirement of sincerity
cannot be formalized—no more than the
inductivist requirement that e must represent
our total observational knowledge." (Popper
1959, p. 418)
"Observations or experiments can be
accepted as supporting a theory (or a
hypothesis, or a scientific assertion) only if
these observations or experiments are
severe tests of the theory---or in other words,
only if they result from serious attempts to
refute the theory, ….” (Popper, 1994, p. 89)
Did BP representatives have good evidence that
H: the cement seal is adequate (no gas should
leak)?
Not if they kept decreasing the pressure until
H passed, rather than performing a more
severe test (as they were supposed to) called
"a cement bond log” (using acoustics)
Passing this overall "test" made it too easy for
H to pass, even if false.
When we reason this way, we are insisting that
Weakest Requirement for a Genuine Test:
Agreement between data x and H fails to
count in support of a hypothesis or claim H, if
so good an agreement was (virtually) assured
even if H is false—no test at all!
The Severe Tester Standpoint: is admittedly
skeptical, demanding—it places the burden of
proof differently than accounts built on positive
instances or agreement or the like.
Popperian Standpoint: we give our theories a
hard time so that we are spared…
Yet Popper's computations never gave him a
way to characterize severity adequately.
Aside: Popper wrote to me “I regret never
having learned statistics”
(I replied, “not as much as I do”)
!&
Oddly, modern-day Popperians seem to retain
the limitations that emasculated Popper:
The “critical rationalists” deny that the
method they recommend —accept the
hypothesis that is best-tested so far—is reliable
(Musgrave)
They are right: the best-tested so far may be
poorly tested, but it does not suffice to simply
claim that this is the definition of a rational
procedure… ("I know of nothing more
rational…")
But we could draw on statistical method to
implement “a theory of criticism” (not saying
it’s the only place)
Phil Sci → Stat Sci → Phil Sci
Popperian aim → stat method → improved Popper
(Albert will talk about critical rationalism)
Interestingly (though apparently not in Popper’s
case), work in philosophy of statistics in the 70’s
and 80’s was as likely to be engaged in by statistical
practitioners as by philosophers—futuristic
That has dwindled, but I hope to revive it
We should clarify the relationship between
Frequentist stat ↔ frequentist philo
O-Bayesian stat ↔ O-Bayesian philo
(the last two of which our next two speakers
represent: Williamson, Bernardo)
Insights from statistical practitioners as to the
actual use of their methods could correct some
long-repeated misconstruals in philosophical
work
Stat Sci → Phil Sci
In an obscure article, found in the attic, Neyman
responds to Carnap’s criticism of “Neyman’s
frequentism”i:
“When Professor Carnap criticizes some attitudes
which he represents as consistent with my
(“frequentist”) point of view, I readily join him in
his criticism without, however, accepting the
responsibility for the criticized paragraphs”. (p. 13)
To Carnap, Neyman is an “inductive straight
ruler”:
If all you know is p% of H’s have been C’s then
the frequentist infers there is a high probability
that p% of H’s will be C in a long series of trials.
This overlooks, Neyman retorts, “that applying
any theory of inductive inference requires a
theoretical model of some phenomena, not merely
the phenomena themselves”.
The role of the statistical model is tomorrow’s
focus... I want to focus on Neyman’s other gripe....
“I am concerned with the term ‘degree of
confirmation’ introduced by Carnap. …We have
seen that the application of the locally best one-
sided test to the data…failed to reject the hypothesis
[that the 26 observations come from a source in
which the null hypothesis is true]. The question is:
does this result 'confirm' the hypothesis that H0 is
true [of the particular data set]?".
"Locally best one-sided Test T:
A sample X = (X1, …, Xn), each Xi Normal,
N(μ, σ²) (NIID), σ assumed known;
H0: μ ≤ μ0 against H1: μ > μ0.
Test statistic d(X) is the sample mean X̄,
standardized: d*(X) = (X̄ − μ0)/σx, with σx = σ/√n.
Test fails to reject the null: d*(x0) ≤ cα.
Carnap says yes…
….the attitude described is dangerous.
…the chance of detecting the presence [of
discrepancy δ from the null], when only [this
number] of observations are available, is
extremely slim, even if [δ is present].
“One may be confident in the absence of that
discrepancy only if the power to detect it were
high”.
(1) P(d(X) > cα; μ = μ0 + δ): the power to detect δ
Just missing the cut-off cα is the worst case
It is more informative to look at the probability
of getting a worse fit than you did
(2) P(d(X) > d(x0); μ = μ0 + δ): the "attained power",
a measure of the severity (or degree of
corroboration) of the inference μ < μ0 + δ
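Since (1) and (2) are just Normal tail areas, a quick sketch can make the contrast concrete. This is an illustration only: n = 26 echoes Neyman's example, while σ = 1, α = .05, the observed d*(x0) = 1.5, and the discrepancy δ = 0.1 are assumed values, not Neyman's.

```python
# Power vs. "attained power" for the locally best one-sided Normal test.
from scipy.stats import norm

n, sigma, alpha = 26, 1.0, 0.05      # n from Neyman's example; sigma, alpha assumed
sigma_x = sigma / n ** 0.5           # standard error of the sample mean
c_alpha = norm.ppf(1 - alpha)        # cut-off for d*(X) = (Xbar - mu0)/sigma_x

def power(delta):
    """(1) P(d*(X) > c_alpha; mu = mu0 + delta)."""
    # Under mu = mu0 + delta, d*(X) ~ N(delta/sigma_x, 1).
    return 1 - norm.cdf(c_alpha - delta / sigma_x)

def attained_power(d_obs, delta):
    """(2) P(d*(X) > d_obs; mu = mu0 + delta): prob. of a worse fit than observed."""
    return 1 - norm.cdf(d_obs - delta / sigma_x)

print(power(0.1))                # slim: little chance of detecting so small a delta
print(attained_power(1.5, 0.1))  # severity for mu < mu0 + 0.1, given d*(x0) = 1.5
```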
While data-dependent, the reasoning using (2)
is still in sync with Neyman’s argument here...ii
"I am not claiming it is part of the N-P
school,
"I recommend moving away from the idea
that we are to “sign up” for N-P or Fisherian
“paradigms”
Perhaps if Neyman had talked more to
philosophers, he would have made explicit
this statistical philosophy (close to Egon
Pearson's)
STAT-PHIL-STAT (-Phil)
e.g., Neyman corrects Carnapian
conception → improved Neyman
'"
As a philosopher of statistics I am
• supplying the tools with an interpretation
and an associated philosophy of inference….
Using a distinct term, “error statistics” frees us
from the bogeymen and bogywomen of the so-
called “classical” statistical methods—or so I
intend.
Nor do my comments turn on whether one
replaces frequencies with “propensities”
(whatever they are).
Error (Probability) Statistics
"What is key on the statistics side: The
probabilities refer to the distribution of
statistic d(x) (sampling distribution)—
applied to events
"What is key on the philosophical side: error
probabilities may* be used to quantify
probativeness or severity of tests (for a
given inference)
*they do not always or automatically give this
"The standpoint of the severe prober, or the
severity principle, directs us to obtain error
probabilities that are relevant to determining
well-testedness
The Two Tenets
I now turn to my diagnosis of central
disagreements, criticisms and attempted
(Bayesian-Frequentist) unifications
They turn on two tenets as to:
What we need
1. Probabilism: the role of probability in
inference is to assign a degree of belief,
support, confirmation (given by mathematical
probability)
(an adequate statistical account requires a
(posterior) probability assignment to
hypotheses)
What we get from "frequentist" methods
2. Radical Behavioristic Construal: The
frequentist’s central aim is assuring low long-
run error (left vague)
(a frequentist method is adequate so long as it has a
low probability of leading to errors in a long-run
series of repetitions of the experiment).
'&
I. The criticisms of frequentist statistical
methods take the form of one or both:
"Error probabilities do not supply posterior
probabilities in hypotheses (tenet #1)
"The radical behavioristic standpoint leads to
counterintuitive appraisals of inferences—this
needs examination (tenet #2)
II. An adequate rescue (i.e., "unification")
would be a way to satisfy #1, and avoid the
counterintuitive examples licensed by #2.
I consider key examples of each
Unifications via Confidence Interval
Estimation
Unificationists point to the easy agreement on
numbers (at least in some common contexts)
Berger: "Fisher, Jeffreys and Neyman supported
use of the same estimation and confidence
procedures
(X̄ − 1.96σx < μ < X̄ + 1.96σx) 95% (2-sided) CI
or
(μ < X̄ + 1.96σx) the upper (97.5%) CI interval
estimation rule for the example with a Normal
mean,
but insisted on assigning it fiducial, objective
Bayesian and frequentist interpretations,
respectively. While the debate over
interpretation can be strident, statistical
practice is little affected as long as the reported
numbers are the same.” (Berger 2003, p. 1)
“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
(Jim Berger 2003)
"Statistical practice is little affected" by interpretive
controversies…? (too complacent?)
A central criticism of CI’s stems from claiming they
are invariably misinterpreted, namely by arguing….
P(μ < X̄ + 2σx) = .975
Observe mean x̄0
Therefore, P(μ < x̄0 + 2σx) = .975.
While this instantiation is fallacious*, since we
know the claim:
μ < x̄0 + 2σx
is true or false, we instantiate anyway: we’re
addicted to probabilism (or perhaps because it is
assumed they are otherwise useless):
* I call it, “Fallacy of Probabilistic Instantiation”
What underlies that criticism is tenet #1: the
theory that inductive inference has to have a
post-data probability assignment to hypotheses
(degrees of belief, support ….)
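A small simulation (mine, with placeholder values μ = 0, σ = 1, n = 25) may help fix why the instantiation is fallacious: the .975 is a property of the rule over repetitions, while any one realized bound is simply true or false.

```python
# Coverage is a long-run property of the estimator, not of a realized interval.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 1.0, 25, 100_000
sigma_x = sigma / np.sqrt(n)

xbars = rng.normal(mu, sigma_x, size=reps)  # sample means from repeated trials
covered = mu < xbars + 2 * sigma_x          # did the upper bound cover mu?
print(covered.mean())                       # ~0.977: the long-run property

x0 = xbars[0]                               # one fixed data set
print(mu < x0 + 2 * sigma_x)                # simply True or False, not ".975 probable"
```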
Here’s a rival philosophical theory:
The role of probability in scientific induction is
to quantify (and control) how well-tested or
well-probed hypotheses are…
(highly probable vs. highly probed)
If one wants a post-data measure, one can write:
SEV(! < X0 + %"x) to abbreviate:
The severity with which test T with data x
passes claim:
(! < X0 + %"x).
One can consider a series of upper confidence
bounds…
SEV(! < X0 + 0"x) = .5
SEV(! < X0 + .5"x) = .7
SEV(! < X0 + 1"x) = .84
SEV(! < X0 + 1.5"x) = .93
SEV(! < X0 + 1.98"x) = .975
But aren’t I just using this as another way to say
how probable each claim is?
No. This would lead to inconsistencies (if we
mean mathematical probability), but the main
thing is, or so I argue, probability gives the
wrong logic for “how well-tested”
(or “corroborated”) a claim is
What Would a Logic of Severe Testing be?
If H has passed a severe test T with x, there is
no problem in saying x warrants H, or, if one
likes, x warrants believing in H….
(H could be the confidence interval estimate)
If SEV(H ) is high, its denial is low, i.e.,
SEV(~H ) is low
But it does not follow that a severity assessment
should obey the probability calculus, or be a
posterior probability….
For example, if SEV(H ) is low, SEV(~H )
could be high or low
……in conflict with a probability assignment
(there may be a confusion of ordinary language
use of “probability”)
)"
For an extreme case, to just assume
H: the cement seal is adequate
(regardless of data) yields a very poor evidential
warrant for H, and also a poor warrant for ~H
Other points lead me to deny probability yields a
logic we want for well-testedness
(e.g., problem of irrelevant conjunctions, the tacking
paradox for Bayesians and hypothetico-deductivists:
If SEV(T, x, H) is low, then, even if SEV(T, x, J)
is high, SEV(T, x, J & H) is low.)
Quick note:
This is different from what I call the
Rubbing off construal: The procedure is rarely
wrong, therefore, the probability it is wrong in
this case is low.
Still too behavioristic
The long-run reliability of the rule is a
necessary but not a sufficient condition to
infer H (with severity)
The reasoning instead is counterfactual:
H: μ < x̄0 + 1.96σx
(i.e., μ < CIu)
passes severely because, were this inference
false and the true mean μ > CIu,
then, very probably, we would have observed a
larger sample mean.
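The number behind "very probably" is easy to supply (a sketch with placeholder values x̄0 = 0, σx = 1; only the 1.96 matters):

```python
# Counterfactual check: were mu as large as CIu, a sample mean larger than
# the one observed would occur with probability ~.975.
from scipy.stats import norm

x0bar, sigma_x = 0.0, 1.0            # placeholder values for illustration
CIu = x0bar + 1.96 * sigma_x         # the upper confidence bound
print(1 - norm.cdf((x0bar - CIu) / sigma_x))   # P(Xbar > x0bar; mu = CIu) = 0.975
```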
A very well-known criticism of frequentist
confidence intervals (many Bayesians say it is a
reductio) stems from assuming that confidence
levels must give degrees of belief—a version
of tenet #1 about the role of probability.
Given some additional constraints on the
parameter, a 95% confidence interval might be
known to be correct ("trivial intervals")
But there’s no inconsistency—confidence levels are
not degrees of probability or belief in the interval
In our construal, the trivial interval informs us that
no parameter values are ruled out with severity…..
“Viewed as a single statement [the trivial interval] is
trivially true, but, on the other hand, viewed as a
statement that all parameter values are consistent
with the data at a particular level is a strong
statement about the limitations of the data.”
(Cox and Hinkley 1974, p. 226)
Other serious criticisms of frequentist statistics
are based on assuming the second tenet: how are
the methods to be used in scientific contexts?
Radical Behavioristic Standpoint
The test rule is a good one so long as it has a low
probability of leading to errors in a long-run
series (a low frequency of errors in long-run
repetitions of the experiment).
In a strict behavioristic approach, that’s all that
matters….
)&
Disjunctive Tests
Rather than give the technical statistical example
(Cox 1958) I jump to the kind of howler that the
frequentist account is alleged to permit:
Oil Exec: Our inference that H: the cement is
fine is highly reliable
Senator: But didn’t you just say no cement bond
log was performed, when it initially failed…?
Oil Exec: That’s what we did on April 20, but
usually we do—I’m giving the average.
We use a randomizer that most of the time
directs us to run the gold-standard check on
pressure, but with small probability tells us to
assume the pressure is fine, or keep decreasing
the pressure of the test till it passes….
A report of the average error might be relevant
for some contexts (insurance?) but the oil rep
gives a highly misleading report of the stringency
of the actual test H managed to "pass" from the
data on April 20.
This violates the severity criterion: it would be
easy to ensure evidence that H is true, even
when H is false
Cox uncovered and gave a way to solve the
problem years ago: in a disjunctive experiment,
"if it is known which experiment produced the
data, inferences about μ are appropriately
drawn in terms of the sampling distribution of
the experiment known to have been performed"
(Cox: Weak Conditionality Principle).
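The arithmetic behind the howler is simple; here is a sketch with hypothetical numbers (the probabilities are mine, not BP's or Cox's):

```python
# Cox's point: the unconditional (average) error rate can look respectable
# even when the branch actually run is nearly worthless as a test.
p_stringent = 0.9          # prob. the randomizer picks the gold-standard check
err_stringent = 0.01       # prob. the stringent check passes a faulty seal
err_lax = 0.99             # prob. the lax branch passes a faulty seal

avg_error = p_stringent * err_stringent + (1 - p_stringent) * err_lax
print(avg_error)           # 0.108: the "reliable on average" report
print(err_lax)             # 0.99: the error probability relevant on April 20
```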
Nevertheless, we see in numerous books and
papers that the frequentist cannot really avail
herself of ways out (of reporting the irrelevant
average error rate)….
Digging beneath these arguments exposes a
slightly different twist on tenet #2:
If an account considers outcomes that could
have occurred in hypothetical repetitions of this
experiment (which we do in considering the
distribution of d(x))
then it must also consider experiments that could
have been run (in evaluating the experiment that
was run)….
So if one is not prepared to average over
irrelevant experiments, then one cannot
consider any data set other than the one
observed;
this would entail no use of error probabilities.
Or, in other words, we are led to the (strong)
likelihood principle (LP)
Aside: There is a famous “radical” argument
purporting to show that error statistical
principles entail the Likelihood Principle (LP)
(Birnbaum, 1962), but the argument is
flawed—invalid or unsound (Mayo 2010).
I think you are familiar with the LPiii:
One way to illustrate its violation in frequentist
statistics is via the “Optional Stopping Effect”.
We have a random sample from a Normal
distribution with mean μ and standard deviation σ,
i.e., Xi ~ N(μ, σ), and we test H0: μ = 0 vs. H1: μ ≠ 0.
Stopping rule:
Keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96σ/√n).
With this stopping rule the actual significance level
differs from, and will be greater than .05.
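A simulation (my sketch; the cap of 1,000 observations and 2,000 repetitions are arbitrary choices) shows how far the actual error rate drifts from the nominal .05:

```python
# Optional stopping: sample until |Xbar| >= 1.96*sigma/sqrt(n), up to n_max.
import numpy as np

rng = np.random.default_rng(1)
reps, n_max, sigma = 2000, 1000, 1.0
rejected = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, size=n_max)   # data generated under H0: mu = 0
    n = np.arange(1, n_max + 1)
    means = np.cumsum(x) / n                 # running sample means
    if np.any(np.abs(means) >= 1.96 * sigma / np.sqrt(n)):
        rejected += 1
print(rejected / reps)    # well above .05, and it approaches 1 as n_max grows
```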
Fifty years ago, almost to the day, not far from
here (subjective) Bayesian Savage announced to
a group of eminent statisticians that “optional
stopping is no sin”.
The likelihood principle emphasized in Bayesian
statistics implies, … that the rules governing when
data collection stops are irrelevant to data
interpretation. (Edwards, Lindman, Savage 1963, p.
239).
It’s quite relevant to the error statistician: This
ensures high or maximal probability of error, —
violates weak severity
The LP violates the weak repeated sampling rule
(Cox and Hinkley)
*"
Equivalently with (2-sided) Confidence Intervals
Keep sampling until the 95% confidence
interval excludes 0
Berger and Wolpert (The Likelihood Principle, 1988)
concede that using this stopping rule
"has thus succeeded in getting the [Bayesian]
conditionalist to perceive that θ ≠ 0, and has
done so honestly." (pp. 80-81)
This is a striking admission—especially as they
assign a probability of .95 to the truth of the
interval estimate:
θ = ȳ ± 1.96σ/√n
(the "Bayesian credibility interval")
*While the stopping rule is determined by the
experimenter's intentions as to when to
stop, that doesn't mean taking it into
account is to take intentions into account
(the "argument from intentions")
Savage’s Sleight of hand (quick sketch)
1. At the Savage Forum Armitage showed how the
same thing can happen for a Bayesian (where
the posterior = the confidence level)…and “thou
shalt be misled!” if thou dost not know the
stopping rule…
2. Savage does care; he says no way it can
happen—any more than we can build a
perpetual motion machine!
"Savage switches to a very different case—where
the null and alternative are both (point)
hypotheses that have been fixed before the data,
and the test is restricted to these two preselected
values
"Here, the probability of erroneously finding
evidence in favor of the alternative is low, but
that’s not the optional stopping example
"To this day defenders of the LP do likewise
(Royale, Sober?iv)
Stephen Senn, intriguingly, remarks:
The late and great George Barnard, through his
promotion of the likelihood principle, probably
did as much as any statistician in the second
half of the last century to undermine the
foundations of the then dominant Neyman-
Pearson framework and hence prepare the way
for the complete acceptance of Bayesian ideas
… by the De Finetti-Lindley limit of 2020.
(Senn 2004)
Many do view Barnard as having that effect, but
he himself rejected the LP (as he makes clear at
the Savage Forum—to Savage’s shock)
2010 Update:
It has long been accepted that the Bayesians'
foundational superiority over error statisticians
stems from Bayesians upholding, while
frequentists violate, the likelihood principle (LP).
Frequentists concede that they (the
Bayesians) are coherent, we are not…
What then to say about leading default
Bayesians (Bernardo 2005; Berger 2004)?
Admitting that “violation of principles such as
the likelihood principle is the price that has to
be paid for objectivity.” (Berger, 2004).
Although they are using very different
conceptions of "objectivity"—there
seems to be an odd sort of agreement between
them and the error statistician.
*&
(i) Do the concessions of reference Bayesians
bring them into the error statistical fold?
I think not (though I will leave this as an open
question). I base my (tentative) "no" on:
(a) While they may still have some ability to
ensure low error probabilities in the long
run—some might therefore say they may
be regarded as "frequentists"—
this would not make them error
statisticians….
(b) While using error probabilities relevant to
assessing well-testedness in the case at
hand entails violating the LP, the
converse is not true.
“Bayesian statistics is about making probability statements,
frequentist statistics is about evaluating probability
statements. …. Bayesians can feel free, or even obliged, to
evaluate the frequency properties of their procedures.
Conversely, …. frequentists can .. use Bayesian methods to
derive statistical procedures with good frequency
properties." (Gelman 2004)
!"
(ii) What are its (Bayesian) foundations?
Impersonal priors may not be construed as
measuring beliefs or even probabilities—they
are often improper.
If prior probabilities represent background
information, why do they differ according to
the experimental model?
They are mere “reference points” for getting
posteriors, but how shall they be interpreted?
Quick summary:
My diagnosis and Rx
Underlying the disagreements, criticisms and
attempted unifications are the following two
tenets (which are fleshed out in a variety of
ways) as to:
What we need
1. Probabilism: the role of probability in
inference is to assign a degree of belief,
support, confirmation (given by mathematical
probability)
What we get from "frequentist" methods
2. Radical Behavioristic Construal: The
frequentist’s central aim is assuring low long-
run error
I. The criticisms of frequentist statistical
methods take the form of one or both:
Error probabilities do not supply posterior
probabilities in hypotheses (tenet #1)
The radical behavioristic standpoint leads to
counterintuitive appraisals of inferences (tenet
#2)
II. An adequate rescue (i.e., "unification")
would be a way to satisfy #1, and avoid the
counterintuitive examples licensed by #2.
“Unifications” (e.g., between frequentists and
Bayesians) therefore, proceed by equating the
probabilities in #1 to the error probability
quantities in #2
!&
An alternative theory or philosophy replaces
these tenets with 2 other ones:
1* The role of probability in inference is to quantify
how reliably or severely claims have been tested
2* Reject radical behaviorism: control of long-run
error probabilities, while necessary, is not sufficient
for good tests
the severity principle directs us to the
relevant error probabilities, thereby avoiding the
classic counterintuitive examples
Where differences remain (disagreements on
numbers, e.g., p-values vs. posteriors), we should
recognize the difference in the goals promoted
Error (Probability) Statistics
Error probabilities may be used to quantify
probativeness or severity of tests (for a given
inference)
The standpoint of the severe prober, or the
severity principle, directs us to obtain error
probabilities that are relevant to determining
well-testedness
The logic of severe testing (or corroboration)
is not probability logic
With our correction to tenets #1 and #2 a
completely different picture emerges
It may well be that there's a role for both, but
that's not to unify them, or claim the
differences don't matter.
i "The Problem of Inductive Inference" (Neyman, 1955).
ii One must be careful to recognize that in the case of a
significant result x, the more powerful or sensitive the test,
the smaller the discrepancy from the null that is warranted.
This is discussed at length elsewhere.
iii "According to Bayes's theorem, P(x|µ) ... constitutes the
entire evidence of the experiment, that is, it tells all that the
experiment has to tell. More fully and more precisely, if y
is the datum of some other experiment, and if it happens
that P(x|µ) and P(y|µ) are proportional functions of µ (that
is, constant multiples of each other), then each of the two
data x and y have exactly the same thing to say about the
values of µ…" (Savage 1962, p. 17.)
iv To be fair, they may be prepared to limit their accounts to
a comparative assessment of point against point, thereby
preventing one from ever having two-sided or even one-
sided tests with complex alternatives. But this leaves the
account restricted to highly artificial examples, which also
suffers from the inadequacy of a purely comparative
account. Worse, by constructing the point alternative post
data, in order to perfectly fit the data (maximally likely
alternative), one is assured of rejecting the null in favor of
the maximally likely alternative, even if the null is true! So
we are back to violating severity (and the weak sampling
principle).
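A sketch of that last point (my illustration, with n = 10 and σ = 1 assumed): setting the point alternative equal to the observed mean guarantees a likelihood ratio favoring it, even when the data come from the null.

```python
# Post-data "maximally likely alternative": the likelihood ratio always
# favors mu1 = xbar over the true null mu = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 10
xbar = rng.normal(0.0, 1.0, size=n).mean()   # data generated under the null
mu1 = xbar                                   # alternative constructed to fit the data
se = 1 / np.sqrt(n)                          # sd of the sampling distribution of Xbar
lr = norm.pdf(xbar, loc=mu1, scale=se) / norm.pdf(xbar, loc=0.0, scale=se)
print(lr)   # always >= 1, whatever the data say
```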