!"
The Confluence Between Stat Science
and Phil Science: Shallow vs. Deep
Explorations
Deborah G. Mayo
June 21, 2010
To begin, we might probe a familiar philosophy-
statistics analogy:
Popper is to frequentists as Carnap is to
Bayesians
“In opposition to [the] inductivist attitude, I
assert that C(H,x) must not be interpreted as
the degree of corroboration of H by x, unless
x reports the results of our sincere efforts to
overthrow H. The requirement of sincerity
cannot be formalized—no more than the
inductivist requirement that e must represent
our total observational knowledge." (Popper
1959, p. 418)
"Observations or experiments can be
accepted as supporting a theory (or a
hypothesis, or a scientific assertion) only if
these observations or experiments are
severe tests of the theory---or in other words,
only if they result from serious attempts to
refute the theory, ….” (Popper, 1994, p. 89)
Did BP representatives have good evidence that
H: the cement seal is adequate (no gas should
leak)?
Not if they kept decreasing the pressure until
H passed, rather than performing a more
severe test (as they were supposed to) called
"a cement bond log” (using acoustics)
Passing this overall "test" made it too easy for
H to pass, even if false.
When we reason this way, we are insisting that
Weakest Requirement for a Genuine Test:
Agreement between data x and H fails to
count in support of a hypothesis or claim H, if
so good an agreement was (virtually) assured
even if H is false—no test at all!
The Severe Tester Standpoint: is admittedly
skeptical, demanding—it places the burden of
proof differently than accounts built on positive
instances or agreement or the like.
Popperian Standpoint: we give our theories a
hard time so that we are spared…
Yet Popper's computations never gave him a
way to characterize severity adequately.
Aside: Popper wrote to me “I regret never
having learned statistics”
(I replied, “not as much as I do”)
!&
Oddly, modern-day Popperians seem to retain
the limitations that emasculated Popper:
The “critical rationalists” deny that the
method they recommend —accept the
hypothesis that is best-tested so far—is reliable
(Musgrave)
They are right: the best-tested so far may be
poorly tested, but it does not suffice to simply
claim that this is the definition of a rational
procedure… ("I know of nothing more
rational…")
But we could draw on statistical method to
implement “a theory of criticism” (not saying
it’s the only place)
Phil Sci → Stat Sci → Phil Sci
Popperian aim → stat method → improved Popper
(Albert will talk about critical rationalism)
Interestingly (though apparently not in Popper’s
case), work in philosophy of statistics in the 70’s
and 80’s was as likely to be engaged in by statistical
practitioners as by philosophers—futuristic
That has dwindled, but I hope to revive it
We should clarify the relationship between
Frequentist stat ↔ frequentist philo
O-Bayesian stat ↔ O-Bayesian philo
(the last two of which our next two speakers
represent: Williamson, Bernardo)
Insights from statistical practitioners as to the
actual use of their methods could correct some
long-repeated misconstruals in philosophical
work
Stat Sci → Phil Sci
In an obscure article, found in the attic, Neyman
responds to Carnap’s criticism of “Neyman’s
frequentism”i:
“When Professor Carnap criticizes some attitudes
which he represents as consistent with my
(“frequentist”) point of view, I readily join him in
his criticism without, however, accepting the
responsibility for the criticized paragraphs”. (p. 13)
To Carnap, Neyman is an “inductive straight
ruler”:
If all you know is p% of H’s have been C’s then
the frequentist infers there is a high probability
that p% of H’s will be C in a long series of trials.
This overlooks, Neyman retorts, “that applying
any theory of inductive inference requires a
theoretical model of some phenomena, not merely
the phenomena themselves”.
The role of the statistical model is tomorrow’s
focus... I want to focus on Neyman’s other gripe....
“I am concerned with the term ‘degree of
confirmation’ introduced by Carnap. …We have
seen that the application of the locally best one-
sided test to the data…failed to reject the hypothesis
[that the 26 observations come from a source in
which the null hypothesis is true]. The question is:
does this result 'confirm' the hypothesis that H0 is
true [of the particular data set]?".
"Locally best one-sided Test T:
A sample X = (X1, …, Xn), each Xi Normal,
N(μ, σ²) (NIID), σ assumed known;
H0: μ ≤ μ0 against H1: μ > μ0.
Test statistic d(X) is the sample mean X̄,
standardized: d*(X) = (X̄ − μ0)/σx, with σx = σ/√n.
Test fails to reject the null: d*(x0) ≤ cα.
Carnap says yes…
….the attitude described is dangerous.
…the chance of detecting the presence [of
discrepancy δ from the null], when only [this
number] of observations are available, is
extremely slim, even if [δ is present].
“One may be confident in the absence of that
discrepancy only if the power to detect it were
high”.
(1) P(d(X) > cα; μ = μ0 + δ): the power to detect δ
Just missing the cut-off cα is the worst case
It is more informative to look at the probability
of getting a worse fit than you did
(2) P(d(X) > d(x0); μ = μ0 + δ): the "attained power",
a measure of the severity (or degree of
corroboration) of the inference μ < μ0 + δ
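Since (1) and (2) are just Normal tail areas, a quick sketch can make the contrast concrete. This is an illustration only: n = 26 echoes Neyman's example, while σ = 1, α = .05, the observed d*(x0) = 1.5, and the discrepancy δ = 0.1 are assumed values, not Neyman's.

```python
# Power vs. "attained power" for the locally best one-sided Normal test.
from scipy.stats import norm

n, sigma, alpha = 26, 1.0, 0.05      # n from Neyman's example; sigma, alpha assumed
sigma_x = sigma / n ** 0.5           # standard error of the sample mean
c_alpha = norm.ppf(1 - alpha)        # cut-off for d*(X) = (Xbar - mu0)/sigma_x

def power(delta):
    """(1) P(d*(X) > c_alpha; mu = mu0 + delta)."""
    # Under mu = mu0 + delta, d*(X) ~ N(delta/sigma_x, 1).
    return 1 - norm.cdf(c_alpha - delta / sigma_x)

def attained_power(d_obs, delta):
    """(2) P(d*(X) > d_obs; mu = mu0 + delta): prob. of a worse fit than observed."""
    return 1 - norm.cdf(d_obs - delta / sigma_x)

print(power(0.1))                # slim: little chance of detecting so small a delta
print(attained_power(1.5, 0.1))  # severity for mu < mu0 + 0.1, given d*(x0) = 1.5
```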
While data-dependent, the reasoning using (2)
is still in sync with Neyman’s argument here...ii
"I am not claiming it is part of the N-P
school,
"I recommend moving away from the idea
that we are to “sign up” for N-P or Fisherian
“paradigms”
Perhaps if Neyman had talked more to
philosophers, he would have made explicit
this statistical philosophy (close to Egon
Pearson's)
STAT-PHIL-STAT (-Phil)
e.g., Neyman corrects Carnapian
conception → improved Neyman
'"
As a philosopher of statistics I am
• supplying the tools with an interpretation
and an associated philosophy of inference….
Using a distinct term, “error statistics” frees us
from the bogeymen and bogywomen of the so-
called “classical” statistical methods—or so I
intend.
Nor do my comments turn on whether one
replaces frequencies with “propensities”
(whatever they are).
Error (Probability) Statistics
"What is key on the statistics side: The
probabilities refer to the distribution of
statistic d(x) (sampling distribution)—
applied to events
"What is key on the philosophical side: error
probabilities may* be used to quantify
probativeness or severity of tests (for a
given inference)
*they do not always or automatically give this
"The standpoint of the severe prober, or the
severity principle, directs us to obtain error
probabilities that are relevant to determining
well-testedness
The Two Tenets
I now turn to my diagnosis of central
disagreements, criticisms and attempted
(Bayesian-Frequentist) unifications
They turn on two tenets as to:
What we need
1. Probabilism: the role of probability in
inference is to assign a degree of belief,
support, confirmation (given by mathematical
probability)
(an adequate statistical account requires a
(posterior) probability assignment to
hypotheses)
What we get from "frequentist" methods
2. Radical Behavioristic Construal: The
frequentist’s central aim is assuring low long-
run error (left vague)
(a frequentist method is adequate so long as it has a
low probability of leading to errors in a long-run
series of repetitions of the experiment).
'&
I. The criticisms of frequentist statistical
methods take the form of one or both:
"Error probabilities do not supply posterior
probabilities in hypotheses (tenet #1)
"The radical behavioristic standpoint leads to
counterintuitive appraisals of inferences—this
needs examination (tenet #2)
II. An adequate rescue (i.e., "unification")
would be a way to satisfy #1, and avoid the
counterintuitive examples licensed by #2.
I consider key examples of each
Unifications via Confidence Interval
Estimation
Unificationists point to the easy agreement on
numbers (at least in some common contexts)
Berger: "Fisher, Jeffreys and Neyman supported
use of the same estimation and confidence
procedures
(X̄ − 1.96σx < μ < X̄ + 1.96σx) 95% (2-sided) CI
or
(μ < X̄ + 1.96σx) the upper (97.5%) CI interval
estimation rule for the example with a Normal
mean,
but insisted on assigning it fiducial, objective
Bayesian and frequentist interpretations,
respectively. While the debate over
interpretation can be strident, statistical
practice is little affected as long as the reported
numbers are the same.” (Berger 2003, p. 1)
“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
(Jim Berger 2003)
"Statistical practice is little affected" by interpretive
controversies…? (too complacent?)
A central criticism of CI’s stems from claiming they
are invariably misinterpreted, namely by arguing….
P(μ < X̄ + 2σx) = .975
Observe mean x̄0
Therefore, P(μ < x̄0 + 2σx) = .975.
While this instantiation is fallacious*, since we
know the claim:
μ < x̄0 + 2σx
is true or false, we instantiate anyway: we’re
addicted to probabilism (or perhaps because it is
assumed they are otherwise useless):
* I call it, “Fallacy of Probabilistic Instantiation”
What underlies that criticism is tenet #1: the
theory that inductive inference has to have a
post-data probability assignment to hypotheses
(degrees of belief, support ….)
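A small simulation (mine, with placeholder values μ = 0, σ = 1, n = 25) may help fix why the instantiation is fallacious: the .975 is a property of the rule over repetitions, while any one realized bound is simply true or false.

```python
# Coverage is a long-run property of the estimator, not of a realized interval.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 1.0, 25, 100_000
sigma_x = sigma / np.sqrt(n)

xbars = rng.normal(mu, sigma_x, size=reps)  # sample means from repeated trials
covered = mu < xbars + 2 * sigma_x          # did the upper bound cover mu?
print(covered.mean())                       # ~0.977: the long-run property

x0 = xbars[0]                               # one fixed data set
print(mu < x0 + 2 * sigma_x)                # simply True or False, not ".975 probable"
```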
Here’s a rival philosophical theory:
The role of probability in scientific induction is
to quantify (and control) how well-tested or
well-probed hypotheses are…
(highly probable vs. highly probed)
If one wants a post-data measure, one can write:
SEV(! < X0 + %"x) to abbreviate:
The severity with which test T with data x
passes claim:
(! < X0 + %"x).
One can consider a series of upper confidence
bounds…
SEV(! < X0 + 0"x) = .5
SEV(! < X0 + .5"x) = .7
SEV(! < X0 + 1"x) = .84
SEV(! < X0 + 1.5"x) = .93
SEV(! < X0 + 1.98"x) = .975
But aren’t I just using this as another way to say
how probable each claim is?
No. This would lead to inconsistencies (if we
mean mathematical probability), but the main
thing is, or so I argue, probability gives the
wrong logic for “how well-tested”
(or “corroborated”) a claim is
What Would a Logic of Severe Testing be?
If H has passed a severe test T with x, there is
no problem in saying x warrants H, or, if one
likes, x warrants believing in H….
(H could be the confidence interval estimate)
If SEV(H ) is high, its denial is low, i.e.,
SEV(~H ) is low
But it does not follow that a severity assessment
should obey the probability calculus, or be a
posterior probability….
For example, if SEV(H ) is low, SEV(~H )
could be high or low
……in conflict with a probability assignment
(there may be a confusion of ordinary language
use of “probability”)
)"
For an extreme case, to just assume
H: the cement seal is adequate
(regardless of data) yields a very poor evidential
warrant for H, and also a poor warrant for ~H
Other points lead me to deny probability yields a
logic we want for well-testedness
(e.g., problem of irrelevant conjunctions, the tacking
paradox for Bayesians and hypothetico-deductivists:
If SEV(T, x, H) is low, then, even if SEV(T, x, J)
is high, SEV(T, x, J & H) is low.)
Quick note:
This is different from what I call the
Rubbing off construal: The procedure is rarely
wrong, therefore, the probability it is wrong in
this case is low.
Still too behavioristic
The long-run reliability of the rule is a
necessary but not a sufficient condition to
infer H (with severity)
The reasoning instead is counterfactual:
H: μ < x̄0 + 1.96σx
(i.e., μ < CIu)
passes severely because, were this inference
false and the true mean μ > CIu,
then, very probably, we would have observed a
larger sample mean.
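The number behind "very probably" is easy to supply (a sketch with placeholder values x̄0 = 0, σx = 1; only the 1.96 matters):

```python
# Counterfactual check: were mu as large as CIu, a sample mean larger than
# the one observed would occur with probability ~.975.
from scipy.stats import norm

x0bar, sigma_x = 0.0, 1.0            # placeholder values for illustration
CIu = x0bar + 1.96 * sigma_x         # the upper confidence bound
print(1 - norm.cdf((x0bar - CIu) / sigma_x))   # P(Xbar > x0bar; mu = CIu) = 0.975
```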
A very well-known criticism of frequentist
confidence intervals (many Bayesians say it is a
reductio) stems from assuming that confidence
levels must give degrees of belief—a version
of tenet #1 about the role of probability.
Given some additional constraints on the
parameter, a 95% confidence interval might be
known to be correct ("trivial intervals")
But there’s no inconsistency—confidence levels are
not degrees of probability or belief in the interval
In our construal, the trivial interval informs us that
no parameter values are ruled out with severity…..
“Viewed as a single statement [the trivial interval] is
trivially true, but, on the other hand, viewed as a
statement that all parameter values are consistent
with the data at a particular level is a strong
statement about the limitations of the data.”
(Cox and Hinkley 1974, p. 226)
Other serious criticisms of frequentist statistics
are based on assuming the second tenet: how are
the methods to be used in scientific contexts?
Radical Behavioristic Standpoint
The test rule is a good one so long as it has a low
probability of leading to errors in a long-run
series (a low frequency of errors in long-run
repetitions of the experiment).
In a strict behavioristic approach, that’s all that
matters….
)&
Disjunctive Tests
Rather than give the technical statistical example
(Cox 1958) I jump to the kind of howler that the
frequentist account is alleged to permit:
Oil Exec: Our inference that H: the cement is
fine is highly reliable
Senator: But didn’t you just say no cement bond
log was performed, when it initially failed…?
Oil Exec: That’s what we did on April 20, but
usually we do—I’m giving the average.
We use a randomizer that most of the time
directs us to run the gold-standard check on
pressure, but with small probability tells us to
assume the pressure is fine, or keep decreasing
the pressure of the test till it passes….
A report of the average error might be relevant
for some contexts (insurance?) but the oil rep
gives a highly misleading report of the stringency
of the actual test H managed to "pass" from the
data on April 20.
This violates the severity criterion: it would be
easy to ensure evidence that H is true, even
when H is false
Cox uncovered and gave a way to solve the
problem years ago: in a disjunctive experiment,
"if it is known which experiment produced the
data, inferences about μ are appropriately
drawn in terms of the sampling distribution of
the experiment known to have been performed"
(Cox: Weak Conditionality Principle).
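The arithmetic behind the howler is simple; here is a sketch with hypothetical numbers (the probabilities are mine, not BP's or Cox's):

```python
# Cox's point: the unconditional (average) error rate can look respectable
# even when the branch actually run is nearly worthless as a test.
p_stringent = 0.9          # prob. the randomizer picks the gold-standard check
err_stringent = 0.01       # prob. the stringent check passes a faulty seal
err_lax = 0.99             # prob. the lax branch passes a faulty seal

avg_error = p_stringent * err_stringent + (1 - p_stringent) * err_lax
print(avg_error)           # 0.108: the "reliable on average" report
print(err_lax)             # 0.99: the error probability relevant on April 20
```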
Nevertheless, we see in numerous books and
papers that the frequentist cannot really avail
herself of ways out (of reporting the irrelevant
average error rate)….
Digging beneath these arguments exposes a
slightly different twist on tenet #2:
If an account considers outcomes that could
have occurred in hypothetical repetitions of this
experiment (which we do in considering the
distribution of d(x))
then it must also consider experiments that could
have been run (in evaluating the experiment that
was run)….
So if one is not prepared to average over
irrelevant experiments, then one cannot
consider any data set other than the one
observed;
this would entail no use of error probabilities.
Or, in other words, we are led to the (strong)
likelihood principle (LP)
Aside: There is a famous “radical” argument
purporting to show that error statistical
principles entail the Likelihood Principle (LP)
(Birnbaum, 1962), but the argument is
flawed—invalid or unsound (Mayo 2010).
I think you are familiar with the LPiii:
One way to illustrate its violation in frequentist
statistics is via the “Optional Stopping Effect”.
We have a random sample from a Normal
distribution with mean μ and standard deviation σ,
i.e., Xi ~ N(μ, σ), and we test H0: μ = 0 vs. H1: μ ≠ 0.
Stopping rule:
Keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96σ/√n).
With this stopping rule the actual significance level
differs from, and will be greater than .05.
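A simulation (my sketch; the cap of 1,000 observations and 2,000 repetitions are arbitrary choices) shows how far the actual error rate drifts from the nominal .05:

```python
# Optional stopping: sample until |Xbar| >= 1.96*sigma/sqrt(n), up to n_max.
import numpy as np

rng = np.random.default_rng(1)
reps, n_max, sigma = 2000, 1000, 1.0
rejected = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, size=n_max)   # data generated under H0: mu = 0
    n = np.arange(1, n_max + 1)
    means = np.cumsum(x) / n                 # running sample means
    if np.any(np.abs(means) >= 1.96 * sigma / np.sqrt(n)):
        rejected += 1
print(rejected / reps)    # well above .05, and it approaches 1 as n_max grows
```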
Fifty years ago, almost to the day, not far from
here (subjective) Bayesian Savage announced to
a group of eminent statisticians that “optional
stopping is no sin”.
The likelihood principle emphasized in Bayesian
statistics implies, … that the rules governing when
data collection stops are irrelevant to data
interpretation. (Edwards, Lindman, Savage 1963, p.
239).
It’s quite relevant to the error statistician: This
ensures high or maximal probability of error, —
violates weak severity
The LP violates the weak repeated sampling rule
(Cox and Hinkley)
*"
Equivalently with (2-sided) Confidence Intervals
Keep sampling until the 95% confidence
interval excludes 0
Berger and Wolpert (The Likelihood Principle, 1988)
concede that using this stopping rule
"has thus succeeded in getting the [Bayesian]
conditionalist to perceive that θ ≠ 0, and has
done so honestly." (pp. 80-81)
This is a striking admission—especially as they
assign a probability of .95 to the truth of the
interval estimate:
θ = ȳ ± 1.96σ/√n
(the "Bayesian credibility interval")
*While the stopping rule is determined by the
experimenter's intentions as to when to
stop, that doesn't mean taking it into
account is to take intentions into account
(the "argument from intentions")
Savage’s Sleight of hand (quick sketch)
1. At the Savage Forum Armitage showed how the
same thing can happen for a Bayesian (where
the posterior = the confidence level)…and “thou
shalt be misled!” if thou dost not know the
stopping rule…
2. Savage does care; he says no way it can
happen—any more than we can build a
perpetual motion machine!
"Savage switches to a very different case—where
the null and alternative are both (point)
hypotheses that have been fixed before the data,
and the test is restricted to these two preselected
values
"Here, the probability of erroneously finding
evidence in favor of the alternative is low, but
that’s not the optional stopping example
"To this day defenders of the LP do likewise
(Royale, Sober?iv)
Stephen Senn, intriguingly, remarks:
The late and great George Barnard, through his
promotion of the likelihood principle, probably
did as much as any statistician in the second
half of the last century to undermine the
foundations of the then dominant Neyman-
Pearson framework and hence prepare the way
for the complete acceptance of Bayesian ideas
… by the De Finetti-Lindley limit of 2020.
(Senn 2004)
Many do view Barnard as having that effect, but
he himself rejected the LP (as he makes clear at
the Savage Forum—to Savage’s shock)
2010 Update:
It has long been accepted that the Bayesians'
foundational superiority over error statisticians
stems from Bayesians upholding, while
frequentists violate, the likelihood principle (LP).
Frequentists concede that they (the
Bayesians) are coherent, we are not…
What then to say about leading default
Bayesians (Bernardo 2005; Berger 2004)?
Admitting that “violation of principles such as
the likelihood principle is the price that has to
be paid for objectivity.” (Berger, 2004).
Although they are using very different
conceptions of "objectivity"—there
seems to be an odd sort of agreement between
them and the error statistician.
*&
(i) Do the concessions of reference Bayesians
bring them into the error statistical fold?
I think not (though I will leave this as an open
question). I base my (tentative) "no" on:
(a) While they may still have some ability to
ensure low error probabilities in the long
run—some might therefore say they may
be regarded as "frequentists"—
this would not make them error
statisticians….
(b) While using error probabilities relevant to
assessing well-testedness in the case at
hand entails violating the LP, the
converse is not true.
“Bayesian statistics is about making probability statements,
frequentist statistics is about evaluating probability
statements. …. Bayesians can feel free, or even obliged, to
evaluate the frequency properties of their procedures.
Conversely, …. frequentists can .. use Bayesian methods to
derive statistical procedures with good frequency
properties." (Gelman 2004)
!"
(ii) What are its (Bayesian) foundations?
Impersonal priors may not be construed as
measuring beliefs or even probabilities—they
are often improper.
If prior probabilities represent background
information, why do they differ according to
the experimental model?
They are mere “reference points” for getting
posteriors, but how shall they be interpreted?
Quick summary:
My diagnosis and Rx
Underlying the disagreements, criticisms and
attempted unifications are the following two
tenets (which are fleshed out in a variety of
ways) as to:
What we need
1. Probabilism: the role of probability in
inference is to assign a degree of belief,
support, confirmation (given by mathematical
probability)
What we get from "frequentist" methods
2. Radical Behavioristic Construal: The
frequentist’s central aim is assuring low long-
run error
I. The criticisms of frequentist statistical
methods take the form of one or both:
Error probabilities do not supply posterior
probabilities in hypotheses (tenet #1)
The radical behavioristic standpoint leads to
counterintuitive appraisals of inferences (tenet
#2)
II. An adequate rescue (i.e., "unification")
would be a way to satisfy #1, and avoid the
counterintuitive examples licensed by #2.
“Unifications” (e.g., between frequentists and
Bayesians) therefore, proceed by equating the
probabilities in #1 to the error probability
quantities in #2
!&
An alternative theory or philosophy replaces
these tenets with 2 other ones:
1* The role of probability in inference is to quantify
how reliably or severely claims have been tested
2* Reject radical behaviorism: control of long-run
error probabilities, while necessary, is not sufficient
for good tests
the severity principle directs us to the
relevant error probabilities, thereby avoiding the
classic counterintuitive examples
Where differences remain (disagreements on
numbers, e.g., p-values vs. posteriors), we should
recognize the difference in the goals promoted
Error (Probability) Statistics
Error probabilities may be used to quantify
probativeness or severity of tests (for a given
inference)
The standpoint of the severe prober, or the
severity principle, directs us to obtain error
probabilities that are relevant to determining
well-testedness
The logic of severe testing (or corroboration)
is not probability logic
With our correction to tenets #1 and #2 a
completely different picture emerges
It may well be that there's a role for both, but
that's not to unify them, or claim the
differences don't matter.
i "The Problem of Inductive Inference" (Neyman, 1955).
ii One must be careful to recognize that in the case of a
significant result x, the more powerful or sensitive the test,
the smaller the discrepancy from the null that is warranted.
This is discussed at length elsewhere.
iii "According to Bayes's theorem, P(x|µ) ... constitutes the
entire evidence of the experiment, that is, it tells all that the
experiment has to tell. More fully and more precisely, if y
is the datum of some other experiment, and if it happens
that P(x|µ) and P(y|µ) are proportional functions of µ (that
is, constant multiples of each other), then each of the two
data x and y have exactly the same thing to say about the
values of µ…" (Savage 1962, p. 17.)
iv To be fair, they may be prepared to limit their accounts to
a comparative assessment of point against point, thereby
preventing one from ever having two-sided or even one-
sided tests with complex alternatives. But this leaves the
account restricted to highly artificial examples, which also
suffers from the inadequacy of a purely comparative
account. Worse, by constructing the point alternative post
data, in order to perfectly fit the data (maximally likely
alternative), one is assured of rejecting the null in favor of
the maximally likely alternative, even if the null is true! So
we are back to violating severity (and the weak sampling
principle).
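A sketch of that last point (my illustration, with n = 10 and σ = 1 assumed): setting the point alternative equal to the observed mean guarantees a likelihood ratio favoring it, even when the data come from the null.

```python
# Post-data "maximally likely alternative": the likelihood ratio always
# favors mu1 = xbar over the true null mu = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 10
xbar = rng.normal(0.0, 1.0, size=n).mean()   # data generated under the null
mu1 = xbar                                   # alternative constructed to fit the data
se = 1 / np.sqrt(n)                          # sd of the sampling distribution of Xbar
lr = norm.pdf(xbar, loc=mu1, scale=se) / norm.pdf(xbar, loc=0.0, scale=se)
print(lr)   # always >= 1, whatever the data say
```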