
PHIL 6334 - Probability/Statistics Lecture Notes 5:

Post-data severity evaluation

Aris Spanos [Spring 2014]

1 Introduction

Fallacies of Acceptance and Rejection

How is one supposed to interpret 'accept H_0' or 'reject H_0'?

► Unfortunately, in fields like econometrics 'accept H_0' is routinely, but erroneously, interpreted as 'data x_0 provide evidence for H_0', and 'reject H_0' is routinely, but erroneously, interpreted as 'data x_0 provide evidence for some alternative H_1'.

The problem is that neither of these evidential claims can be justified, since both are vulnerable to two classic fallacies.

(a) The fallacy of acceptance: no evidence against H_0 is misinterpreted as evidence for H_0. This fallacy can easily arise when the test in question has low power to detect discrepancies of interest, e.g. when the sample size n is small.

(b) The fallacy of rejection: evidence against H_0 is misinterpreted as evidence for a particular H_1. This fallacy can easily arise when the power of a test is very high, e.g. when the sample size n is very large. This renders N-P rejections, as well as tiny p-values with large n, highly susceptible to this fallacy.

In the statistics literature, as well as in the secondary literatures of several applied fields, there have been numerous attempts to circumvent these two fallacies, but none has succeeded.


The first successful attempt was made by Mayo (1996) by introducing the notion of a post-data severity evaluation.

2 The post-data severity evaluation

2.1 The notion of post-data severity

The post-data severity assessment aims to supplement frequentist testing with a view to bridging the gap between the p-value and the accept/reject rules, on the one hand, and providing evidence for or against a hypothesis in the form of the discrepancy from the null warranted by data x_0, on the other.

► Its key difference from the Bayesian and likelihoodist approaches to testing is that it takes into account the generic capacity of the test in establishing the inferential claim in question.

► The intuition behind this notion is that a rejection of H_0 using a less (more) powerful test provides better (worse) evidence for a departure from H_0. Similarly, an acceptance of H_0 using a less (more) powerful test provides worse (better) evidence for no departure from H_0.

The severity evaluation is a post-data appraisal of the accept/reject and p-value results with a view to providing an evidential interpretation. It can be used to address not only the fallacies of acceptance and rejection but also several additional criticisms of N-P testing. The discussion that follows relies heavily on Mayo and Spanos (2006).

■ A hypothesis H passes a severe test T_α with data x_0 if:
(S-1) x_0 accords with H, and
(S-2) with very high probability, test T_α would have produced a result that accords less well with H than x_0 does, if H were false.


Severity can be viewed as a feature of a test as it relates to particular data x_0 and a specific claim H being considered. Hence, the severity function has three arguments, SEV(T_α; x_0, H), denoting the severity with which H passes T_α with x_0.

Example 1. Let us assume that the appropriate statistical model for data x_0 is the simple (one parameter) Normal model, where σ² is known (table 1).

Table 1 - Simple Normal (one parameter) Model
Statistical GM:        X_t = μ + u_t,  t ∈ ℕ := {1, 2, ...}
[1] Normality:         X_t ~ N(·,·),  x ∈ ℝ         ⎫
[2] Constant mean:     E(X_t) = μ                   ⎬  t ∈ ℕ
[3] Constant variance: Var(X_t) = σ²  [σ² known]    ⎭
[4] Independence:      {X_t, t ∈ ℕ} is an independent process
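To make the model concrete, here is a brief simulation sketch (in Python with numpy, my own choice of language; the parameter values mirror the examples below, and the variable names are mine, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)   # fixed seed for reproducibility
    mu, sigma, n = 12.0, 2.0, 100    # parameter values used in the examples below

    # Statistical GM: X_t = mu + u_t, with u_t ~ NIID(0, sigma^2)
    x = mu + sigma * rng.standard_normal(n)
    print(f"sample mean = {x.mean():.3f}, sample std = {x.std(ddof=1):.3f}")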

Let us consider the hypotheses of interest:

H_0: μ = μ_0  vs.  H_1: μ > μ_0,    (1)

in the context of the simple Normal model (table 1). The optimal (UMP) test T_α for these hypotheses comprises the test statistic and rejection region:

  d(X) = √n(X̄_n − μ_0)/σ,   C_1(α) = {x : d(x) > c_α},    (2)

where X̄_n = (1/n)∑_{t=1}^{n} X_t and c_α is the threshold rejection value. Given that:

  d(X) ~ N(0, 1) under μ = μ_0,    (3)

one can evaluate the type I error probability (significance level) using:

  P(d(X) > c_α; H_0 true) = α,

where α is the type I error, 0 < α < 1. To evaluate the type II error probability one needs to know the sampling distribution of d(X) when H_0 is false.


However, since 'H_0 is false' refers to H_1: μ > μ_0, this evaluation involves all values of μ greater than μ_0 (i.e. μ_1 > μ_0):

  β(μ_1) = P(d(X) ≤ c_α; H_0 false) = P(d(X) ≤ c_α; μ = μ_1), for all μ_1 > μ_0.

The relevant sampling distribution takes the form:

  d(X) ~ N(δ_1, 1) under μ = μ_1,  where δ_1 = √n(μ_1 − μ_0)/σ, for all μ_1 > μ_0.    (4)
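Equation (4) can be checked by simulation (a minimal Python sketch using numpy; the number of replications and the seed are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(1)
    mu0, mu1, sigma, n, reps = 12.0, 12.5, 2.0, 100, 50_000

    # Under mu = mu1, d(X) = sqrt(n)(Xbar_n - mu0)/sigma should be N(delta_1, 1)
    x = mu1 + sigma * rng.standard_normal((reps, n))   # reps samples of size n
    d = np.sqrt(n) * (x.mean(axis=1) - mu0) / sigma
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma          # here delta_1 = 2.5
    print(f"mean of d = {d.mean():.3f} (delta_1 = {delta1:.1f}), std of d = {d.std():.3f}")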

To use the Normal tables one needs to transform √n(X̄_n − μ_0)/σ into √n(X̄_n − μ_1)/σ, using:

  d(X) − δ_1 = √n(X̄_n − μ_0)/σ − √n(μ_1 − μ_0)/σ = √n(X̄_n − μ_1)/σ ~ N(0, 1) under μ = μ_1, for μ_1 > μ_0.    (5)

The power is defined by π(μ_1) = 1 − β(μ_1):

  π(μ_1) = P(d(X) > c_α; μ = μ_1) =
         = P(√n(X̄_n − μ_1)/σ > c_α − √n(μ_1 − μ_0)/σ; μ = μ_1) =
         = P(Z > c_α − δ_1), for all μ_1 ≥ μ_0,

where Z is a generic standard Normal r.v., i.e. Z ~ N(0, 1). For μ_0 = 12, σ = 2, α = .025 (c_α = 1.96), n = 100, and writing γ_1 = μ_1 − μ_0, δ_1 = √n(μ_1 − μ_0)/σ:

  γ_1 = .1, δ_1 = .5:  π(12.1) = P(Z > 1.96 − .5)  = .072,
  γ_1 = .2, δ_1 = 1.0: π(12.2) = P(Z > 1.96 − 1.0) = .169,
  γ_1 = .3, δ_1 = 1.5: π(12.3) = P(Z > 1.96 − 1.5) = .323,
  γ_1 = .5, δ_1 = 2.5: π(12.5) = P(Z > 1.96 − 2.5) = .705,
  γ_1 = .7, δ_1 = 3.5: π(12.7) = P(Z > 1.96 − 3.5) = .938.
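These power values can be reproduced with a few lines of code (a sketch using scipy.stats, an assumed dependency; norm.cdf is the standard Normal cdf):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n, c_alpha = 12.0, 2.0, 100, 1.96      # setup of the example

    for gamma in [0.1, 0.2, 0.3, 0.5, 0.7]:
        delta1 = np.sqrt(n) * gamma / sigma            # delta_1 = sqrt(n)(mu1 - mu0)/sigma
        power = 1 - norm.cdf(c_alpha - delta1)         # pi(mu1) = P(Z > c_alpha - delta_1)
        print(f"pi({mu0 + gamma:.1f}) = {power:.3f}")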


2.2 Severity in the case of reject H_0

Consider the case where μ_0 = 12, σ = 2, n = 100, α = .025 (c_α = 1.96) and x̄_n = 12.6. Evaluating the test statistic yields:

  d(x_0) = √100(12.6 − 12)/2 = 3.0,

which results in rejecting H_0: μ = 12. The p-value confirms the rejection, since:

  p(x_0) = P(d(X) > d(x_0); μ = 12) = .0013.

Let us evaluate the post-data severity in order to establish the discrepancy from the null warranted by test T_α and data x_0 (x̄_n = 12.6).

(S-1). The severity 'accordance' condition (S-1) implies that:

  the rejection of H_0: μ = 12 with d(x_0) = 3.0 accords with H_1,

and the relevant inferential claim is:

  μ > μ_1 = μ_0 + γ, for some γ ≥ 0.    (6)

(S-2). To establish the particular discrepancy γ warranted by data x_0, the post-data severity 'discordance' condition:


"(S-2): with very high probability, test would have pro-duced a result that accords less well with 1 than x0 does, if1 were false."calls for evaluating the probability of the tail events:

"outcomes x that accord less well with 1 than x0 does",

i.e. [x: (x) ≤ (x0)] giving rise to:

(; 1) =P((X) ≤ (x0); 1 is false)==P((X) ≤ (x0); ≤ 1 is true)==P((X) ≤ (x0);=1)

(7)

To evaluate this probability we need to use the same distri-bution under the alternative (4) as in the case of the power,but now instead of using as the threshold we will use (x0)and adjust it as in (5)

(x0)−1=√(−1)

(8)

For instance, for a discrepancy =1 the severity evaluationis:

(; 1=121) =P(√(−121)

≤√100(126−121)

2;=1)

=P( ≤ 25;=1)=994where v N(0 1). Similarly, for a discrepancy =5 theseverity evaluation is:

(; 1=125) =P(√(−125)

≤√100(126−125)

2;=1)

=P( ≤ 05;=1)=691Table 2 reports several such severity evaluations for differentdiscrepancies =1 10


μ_0 = 12, σ = 2, n = 100, and x̄_n = 12.6
Table 2: Reject H_0: μ = 12 vs. H_1: μ > 12

  γ      Relevant claim μ > μ_1 = 12 + γ    Severity P(x: d(X) ≤ d(x_0); μ_1)
  .10    μ > 12.1                           .994
  .20    μ > 12.2                           .977
  .30    μ > 12.3                           .933
  .344   μ > 12.344                         .900
  .40    μ > 12.4                           .841
  .50    μ > 12.5                           .691
  .60    μ > 12.6                           .500
  .70    μ > 12.7                           .309
  .80    μ > 12.8                           .159
  .90    μ > 12.9                           .067
  1.0    μ > 13.0                           .023


The idea of using the post-data severity evaluation in the case of reject H_0 is to establish the largest warranted discrepancy from the null at a certain high threshold, say .90. In this case the warranted discrepancy is γ ≤ .344.

► How does the post-data severity evaluation address the fallacy of rejection? By pointing out the warranted and unwarranted discrepancies from the null and specifying the relevant inferential claim. Table 2 can be reproduced with the short sketch below.
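A minimal Python sketch (using scipy, an assumed dependency; the grid of discrepancies is copied from Table 2):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n, xbar = 12.0, 2.0, 100, 12.6         # the reject case

    for gamma in [.10, .20, .30, .344, .40, .50, .60, .70, .80, .90, 1.0]:
        # By (7)-(8): SEV(mu > mu0 + gamma) = P(d(X) <= d(x0); mu = mu1)
        #           = Phi(sqrt(n)(xbar - mu1)/sigma)
        sev = norm.cdf(np.sqrt(n) * (xbar - mu0 - gamma) / sigma)
        print(f"SEV(mu > {mu0 + gamma:.3f}) = {sev:.3f}")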

2.3 Severity in the case of accept H_0

Consider the case where μ_0 = 12, σ = 2, n = 100, α = .025 (c_α = 1.96) and x̄_n = 12.1. Evaluating the test statistic yields:

  d(x_0) = √100(12.1 − 12)/2 = 0.5,

which results in accepting H_0: μ = 12. The p-value confirms the acceptance, since:

  p(x_0) = P(d(X) > d(x_0); μ = 12) = .309.

Let us evaluate the post-data severity in order to establish the discrepancy from the null warranted by test T_α and data x_0 yielding x̄_n = 12.1.

(S-1). The severity 'accordance' condition (S-1) implies that:

  the acceptance of H_0: μ = 12 with d(x_0) = 0.5 accords with H_0,

and the relevant inferential claim is:

  μ ≤ μ_1 = μ_0 + γ, for some γ ≥ 0.    (9)

(S-2). To establish the particular discrepancy γ warranted by data x_0, the post-data severity 'discordance' condition:


"(S-2): with very high probability, test would have pro-duced a result that accords less well with 0 than x0 does, if0 were false."calls for evaluating the probability of the tail events:

"outcomes x that accord less well with 0 than x0 does",

i.e. [x: (x) (x0)] giving rise to:

(; ≤ 1) =P((X) (x0);=0 is false)==P((X) (x0); 0 is true)==P((X) (x0);=1)

(10)

For a discrepancy =1 the severity evaluation is:

(; ≤ 1=121) =P(√(−121)

√100(121−121)

2;=1)

=P( 00;=1)=500

Similarly, for a discrepancy =5 the severity evaluation is:

(; ≤ 1=125) =P(√(−125)

√100(121−125)

2;=1)

=P( −20;=1)=691Table 3 reports several such severity evaluations for differentdiscrepancies =− 3 7


μ_0 = 12, σ = 2, n = 100, and x̄_n = 12.1
Table 3: Accept H_0: μ = 12 vs. H_1: μ > 12

  γ      Relevant claim μ ≤ μ_1 = 12 + γ    Severity P(x: d(X) > d(x_0); μ_1)
  −.3    μ ≤ 11.7                           .023
  −.2    μ ≤ 11.8                           .067
  −.1    μ ≤ 11.9                           .159
   0     μ ≤ 12.0                           .309
  .10    μ ≤ 12.1                           .500
  .20    μ ≤ 12.2                           .691
  .30    μ ≤ 12.3                           .841
  .356   μ ≤ 12.356                         .900
  .40    μ ≤ 12.4                           .933
  .50    μ ≤ 12.5                           .977
  .60    μ ≤ 12.6                           .994
  .70    μ ≤ 12.7                           .999


The idea of using the post-data severity evaluation in the case of accept H_0 is to establish the smallest warranted discrepancy from the null at a certain high threshold, say .90. In this case the warranted discrepancy is γ ≥ .356.

► How does the post-data severity evaluation address the fallacy of acceptance? By pointing out the warranted and unwarranted discrepancies from the null and specifying the relevant inferential claim. Table 3 can be reproduced with the sketch below.
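The accept-case computation differs from the reject case only in the direction of the tail event (again a Python sketch with illustrative names):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n, xbar = 12.0, 2.0, 100, 12.1         # the accept case

    for gamma in [-.3, -.2, -.1, 0.0, .10, .20, .30, .356, .40, .50, .60, .70]:
        # By (10): SEV(mu <= mu0 + gamma) = P(d(X) > d(x0); mu = mu1)
        #        = 1 - Phi(sqrt(n)(xbar - mu1)/sigma)
        sev = 1 - norm.cdf(np.sqrt(n) * (xbar - mu0 - gamma) / sigma)
        print(f"SEV(mu <= {mu0 + gamma:.3f}) = {sev:.3f}")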

2.4 The large n problem

The large n problem was initially raised by Lindley (1957) in the context of the simple Normal model (table 1), where the variance σ² > 0 is assumed known, by pointing out:

[a] the large n problem: frequentist testing is susceptible to the "fallacious" result that there is always a large enough sample size n for which any point null, say H_0: μ = μ_0, will be rejected by a frequentist α-significance level test.

Lindley claimed that this result is paradoxical because, when viewed from the Bayesian perspective, one can show:

[b] the Jeffreys-Lindley paradox: for certain choices of the prior, the posterior probability of H_0, given a frequentist α-significance level rejection, will approach 1 as n → ∞.

Claims [a] and [b] contrast the behavior of a frequentist test (p-value) and the posterior probability of H_0 as n → ∞, highlighting a potential conflict between the frequentist and Bayesian accounts of evidence.

[c] Bayesian charge: a hypothesis that is well-supported by a Bayes factor can be (misleadingly) rejected by a frequentist test when n is large; see Berger and Sellke (1987), pp. 112-3.

A paradox? No! From the error statistical perspective:


(i) There is nothing fallacious about a small p-value, or a rejection of H_0, when n is large [it is a feature of a consistent frequentist test].

What is paradoxical is why the posterior probability of H_0 goes to 1 as n → ∞, irrespective of the truth or falsity of H_0!

► Hence, the real problem does not lie with the p-value or the accept/reject rules as such, but with how such results are transformed into evidence for or against a particular hypothesis. The problem arises when such accept/reject results are detached from the test itself and treated as providing the same evidence for a particular alternative μ_1, regardless of the power of the test in question, which depends crucially on n.

The large n problem can be addressed using the post-data severity evaluation.

To illustrate this, consider the case where α = .025 (c_α = 1.96), σ = 1, and the observed value of the test statistic in (2) is d(x_0) = 1.97. In this case data x_0 result in rejecting H_0: μ = 12.


The p-value is:

  p(x_0) = P(d(X) > 1.97; μ = 12) = .024.

In the traditional accounts of frequentist testing this result would be interpreted in the same way, irrespective of whether the sample size was n = 25, n = 100, or n = 400. The post-data severity evaluation, however, takes n into account, because n affects the generic capacity (power) of the test. For instance, the severity of inferring μ > 12.1, associated with the same d(x_0) = 1.97, will be different for each sample size:

  SEV(T_α; n = 25;  μ > 12.1) = P(Z ≤ 1.97 − √25(12.1 − 12)/1)  = .93,
  SEV(T_α; n = 100; μ > 12.1) = P(Z ≤ 1.97 − √100(12.1 − 12)/1) = .83,
  SEV(T_α; n = 400; μ > 12.1) = P(Z ≤ 1.97 − √400(12.1 − 12)/1) = .49.

2.5 The problem with the p-value

Viewing the p-value from the severity vantage point, it can be defined as follows:

  'the p-value is the probability of all possible outcomes x ∈ R_X^n that accord less well with H_0 than x_0 does, if H_0 were true.'

Hence, a small p-value can be related to the claim μ > μ_0 passing a severe test, because the probability that test T_α would have produced a result that accords less well with μ > μ_0 than x_0 does (x: d(x) ≤ d(x_0)), if μ > μ_0 were false (H_0 true):

  SEV(T_α; x_0; μ > μ_0) = P(d(X) ≤ d(x_0); μ ≤ μ_0) =
                         = 1 − P(d(X) > d(x_0); μ = μ_0) = 1 − p(x_0),

is very high, i.e. p(x_0) is very low.

► Hence, the key problem with the p-value is that it establishes the existence of some discrepancy γ ≥ 0 from the null, but provides no information concerning its magnitude.


The severity evaluation remedies this because it revolves around the discrepancy γ, being evaluated under the different μ_1 values associated with the inferential claim μ ≥ μ_0 + γ. In this sense, the p-value can be related to the severity evaluation associated with the inferential claim μ ≥ μ_0, where the implicit discrepancy is γ = 0; i.e. in the case of the p-value, SEV is implicitly evaluated under the null!
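The relation SEV(μ > μ_0) = 1 − p(x_0) can be verified numerically (a Python sketch using the reject case of section 2.2):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n, xbar = 12.0, 2.0, 100, 12.6
    d_x0 = np.sqrt(n) * (xbar - mu0) / sigma           # d(x0) = 3.0

    p_value = 1 - norm.cdf(d_x0)                       # p(x0) = P(d(X) > d(x0); mu = mu0)
    sev_gamma0 = norm.cdf(d_x0)                        # SEV(mu > mu0), i.e. gamma = 0
    print(f"p = {p_value:.4f}, 1 - p = {1 - p_value:.4f}, SEV = {sev_gamma0:.4f}")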

3 Conclusions

Neither Fisher's p-value nor the N-P accept/reject rules can provide such an evidential interpretation, primarily because they are vulnerable to two serious fallacies.

(a) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.

(b) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.

These fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy γ from the null warranted by data x_0. This establishes the warranted inferential claim [and thus, the unwarranted ones].

The severity assessment enables one to address the crucial fallacies of acceptance and rejection, as well as the potential arbitrariness and possible abuse of:

[c] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[d] interchanging the null and alternative hypotheses,
[e] manipulating the level of significance in an attempt to get the desired testing result,
[f] the relevant p-value,


[g] observed confidence intervals vs. severity evaluations.

Doesn't the post-data severity evaluation simply replace the original threshold α with a severity threshold? Aren't both equally arbitrary?

No! Any choice that can be discussed in the particular context between different modelers is neither arbitrary nor subjective, but it is debatable! The severity curve provides all possible discrepancies from the null, and the modeler can decide which threshold is appropriate in each case, as in the sketch below.
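For instance, one can invert the severity curve to read off the warranted discrepancy at any chosen threshold (a Python sketch for the reject case of section 2.2; the threshold values are illustrative choices of mine):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n, xbar = 12.0, 2.0, 100, 12.6         # reject case of section 2.2

    # Solve SEV(mu > mu0 + gamma) = Phi(sqrt(n)(xbar - mu0 - gamma)/sigma) = target
    for target in [0.80, 0.90, 0.95]:
        gamma = (xbar - mu0) - sigma * norm.ppf(target) / np.sqrt(n)
        print(f"threshold {target:.2f}: warranted discrepancy gamma = {gamma:.3f}")

At the .90 threshold this returns γ = .344, matching Table 2.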
