Applying the Generalized Partitioning Principle to Control the Generalized Familywise Error Rate

Haiyan Xu¹ and Jason C. Hsu*,²

¹ E22504, Department of Clinical Biostatistics, Johnson & Johnson PRD, L.L.C., USA

² Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, Ohio 43210, USA

Received 14 February 2006, revised 17 July 2006, accepted 31 August 2006

Summary

In multiple testing, strong control of the familywise error rate (FWER) may be unnecessarily stringent in some situations such as bioinformatic studies. An alternative approach, discussed by Hommel and Hoffmann (1988) and Lehmann and Romano (2005), is to control the generalized familywise error rate (gFWER), the probability of incorrectly rejecting more than m hypotheses. This article presents the generalized Partitioning Principle as a systematic technique for constructing gFWER-controlling tests that can take the joint distribution of test statistics into account. The paper is structured as follows. We first review the classical Partitioning Principle, indicating its conditioning nature. Then the generalized Partitioning Principle is introduced, with a set of sufficient conditions that allows it to be executed as a computationally more feasible step-down test. Finally, we show the importance of having some knowledge of the distribution of the observations in multiple testing. In particular, we show that step-down permutation tests require an assumption on the joint distribution of the observations in order to control the familywise error rate.

Key words: Generalized familywise error rate; Generalized Partitioning Principle; Marginals Determine the Joint condition; Permutation tests; Step-down tests.

1 Error Rate in Testing a Single Null Hypothesis

Testing a statistical hypothesis is often part of a decision-making process. In such a setting, error rate should be defined so that it reflects the probability that the testing procedure causes an incorrect decision.

Consider the analysis of gene expression levels from high and low risk patients. One common purpose of such an analysis is to discover differentially expressed genes for intense further study. For this purpose, it is perhaps appropriate to define “differentially expressed” as “not identically distributed”. Another common purpose of such an analysis is to train a prognostic algorithm for predicting risk. Since training of classification algorithms in supervised machine learning proceeds by computing distances between subjects in high and low risk groups, the genes included in the training set should be those with large (standardized) mean distances between the two groups. For this purpose, it is more useful to define “differentially expressed” as “not equal in expression level on average”.

Suppose $k$ genes are probed in comparing expression profiles between low risk and high risk groups. Let $F_{Li}, F_{Hi}$, $i = 1, \ldots, k$, denote the (marginal) distributions of the expression level of the $i$-th gene of a randomly sampled patient from the low and high risk group respectively. Let $\mu_{Li}, \mu_{Hi}$, $i = 1, \ldots, k$, denote the (logarithms of) expected expression level of the $i$-th gene of a randomly sampled patient from the low and high risk group respectively. Let $\theta_i$ denote the difference of the expected (logarithms of) expression levels of the $i$-th gene between the two groups, $\theta_i = \mu_{Hi} - \mu_{Li}$.

* Corresponding author: e-mail: [email protected], Phone: 01 614 292 7663; Fax: 01 614 292 2096

52 Biometrical Journal 49 (2007) 1, 52–67 DOI: 10.1002/bimj.200610307

© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Then, in the discovery situation, the appropriate (marginal) null hypotheses are

$$H^F_{0i}: F_{Hi} = F_{Li} \tag{1}$$

for $i = 1, \ldots, k$, while in the prognostic situation the appropriate (marginal) null hypotheses are

$$H^\theta_{0i}: \theta_i = 0 \tag{2}$$

for $i = 1, \ldots, k$.

For the null hypothesis $H^F_{0i}$, the definition of Type I error rate is unambiguous. But the null hypothesis $H^\theta_{0i}: \theta_i = 0$ is composite in two ways. First, in contrast to $H^F_{0i}$, so long as the expectations are the same, it allows for the possibility that the variances, skewness, and other aspects of the distributions are different. Second, the null hypothesis $H^\theta_{0i}: \theta_i = 0$ leaves unspecified the values of $\theta_j$, $j = 1, \ldots, k$, $j \ne i$.

Type I error of testing $H^\theta_{0i}$ can thus be defined in different ways. Let $\theta = (\theta_1, \ldots, \theta_k)$, and let $\Sigma$ denote generically all nuisance parameters that the observed expression levels depend on (including, for example, covariance of the expression levels for each of the two groups). Let $\theta^0 = (\theta^0_1, \ldots, \theta^0_k)$ and $\Sigma^0$ be the collection of all (unknown) true parameter values. One definition of Type I error rate, given in Pollard and van der Laan (2005) for example, is

$$P_{\theta^0, \Sigma^0}\{\text{Reject } H_{0i}\}, \tag{3}$$

where $\theta^0_i = 0$. In the analysis of gene expression levels, for example, if $H_{0i}$ is the null hypothesis that the $i$-th gene is not differentially expressed, then the values of expression levels for the other genes at which this probability $P_{\theta^0, \Sigma^0}\{\text{Reject } H_{0i}\}$ is computed are the true expression levels for the groups being compared, and the joint distributions under which they are computed are the true distributions for the groups being compared. But since $\theta^0_j$, $j \ne i$, and $\Sigma^0$ are unknown, this probability is difficult to compute directly. Therefore, the traditional definition of Type I error rate, Definition 8.3.3 on p. 361 of Casella and Berger (1990), is

$$\sup_{\theta_i = 0} P_{\theta, \Sigma}\{\text{Reject } H_{0i}\}, \tag{4}$$

where the supremum is taken over all possible $\theta$ and $\Sigma$ subject to $\theta_i = 0$. In the analysis of gene expression levels, for example, if $H_{0i}$ is the null hypothesis that the $i$-th gene is not differentially expressed, then the supremum is taken over all possible true expression levels of the other genes, as well as all possible joint distributions of expression levels of all genes (including all possible correlations). The definition we use is this traditional definition of Type I error rate, as it is more readily verifiable in practice.

In practice, whether the hypothesis being tested is $H^F_{0i}$ or $H^\theta_{0i}$ affects whether the error rate is controlled, and therefore should be clearly stated. For example, as shown in Pollard and van der Laan (2005) and Huang, Xu, Calian, and Hsu (2006), permutation tests control the error rate only for testing $H^F_{0i}$ but not $H^\theta_{0i}$.

2 Error Rates in Multiple Testing

Let $\{H_{0i}: i = 1, \ldots, k\}$ be a family of null hypotheses to be tested. In microarray studies, $H_{0i}$ is the null hypothesis that the $i$-th gene is not differentially expressed between high and low risk patients.

Let $V$ denote the number of incorrectly rejected true null hypotheses. Strong control of the familywise error rate (FWER) keeps the probability of rejecting any true null hypothesis at a pre-specified low level,

$$\mathrm{FWER} = \sup_{\theta, \Sigma} P_{\theta, \Sigma}\{V > 0\} \tag{5}$$

where the supremum is taken over the entire parameter space, to account for all possible values of parameters corresponding to false null hypotheses, as well as values of nuisance parameters.


In multiple testing for differential expressions, the consequence of errors in testing the individual hypotheses depends on whether the purpose is to understand the biological disease mechanism, or to select genes to train a prognostic algorithm toward ultimately making a prognostic device. In the former case, every mistake costs the investigator some time and effort. In the latter, what matters more is that the algorithm has enough sensitivity/specificity, and that the number of genes is small enough for the device to be reasonably priced. In either case, controlling FWER strongly may be unnecessarily stringent. Controlling the generalized familywise error rate (gFWER)

$$\mathrm{gFWER} = \sup_{\theta, \Sigma} P_{\theta, \Sigma}\{V > m\} \tag{6}$$

(where again the supremum is taken over the entire parameter space) keeps the probability of incorrectly rejecting more than a fixed number $m$ of true null hypotheses at a low level.

One approach to controlling gFWER is to use a method that controls the familywise error rate (i.e., controlling gFWER at $m = 0$), and then reject an additional $m$ hypotheses. Indeed, the augmentation test of van der Laan, Dudoit, and Pollard (2004) rejects the null hypotheses that would be rejected by an FWER-controlling method, and then augments the rejections by automatically rejecting the null hypotheses associated with the next $m$ extreme test statistics.

We take a different approach in this article, generalizing the Partitioning Principle of Stefansson, Kim, and Hsu (1988) and Finner and Strassburger (2002) to provide a technique for constructing multiple tests that control gFWER. Our approach not only shows how to take the joint distribution of the test statistics into consideration in controlling gFWER, but also the conditioning nature of data-dependent thresholding by step-wise methods.

In the remaining part of this article, $\theta_i$, $i = 1, \ldots, k$, are univariate parameters of interest, and the nuisance parameter $\Sigma$ is suppressed notationally.

3 Partitioning is Conditioning

A principle of multiple testing which is conditional in nature is the Partitioning Principle of Stefansson, Kim, and Hsu (1988), which was further refined by Finner and Strassburger (2002). Holm’s step-down method and Hochberg’s (1988) step-up method, each comparing p-values against a data-dependent threshold, can be thought of as partition testing, obtained by applying Bonferroni’s inequality and a modification of Simes’ (1986) inequality respectively to component partitioning tests (see Huang and Hsu, 2005). The step-down version of Dunnett’s (1955) method in Marcus, Peritz, and Gabriel (1976) can be thought of as partition testing taking into account the multivariate t joint distribution of the test statistics (see Stefansson, Kim, and Hsu, 1988). The idea of partition testing is that if, in a subspace of the parameter space, only a subset of the null hypotheses are true, then tests for parameter points in that subspace need to be adjusted for the multiplicity of only those true null hypotheses. Partition testing thus applies a different test to each subspace of the parameter space depending on how many null hypotheses are true in that subspace, and then collates the results across the subspaces, as follows.

Let $K = \{1, \ldots, k\}$ and consider testing $H_{0i}: \theta \in \Theta_i$, $i = 1, \ldots, k$. To control FWER (gFWER with $m = 0$), the Partitioning Principle states:

1. Partition $\bigcup_{i \in K} \Theta_i$ into disjoint $\Theta^*_I$, $I \subseteq K$, as follows. For each $I \subseteq K$, let $\Theta^*_I = \bigcap_{i \in I} \Theta_i \cap \left(\bigcap_{j \notin I} \Theta^c_j\right)$. Then $\{\Theta^*_I, I \subseteq K\}$, including $\Theta^*_\emptyset = \bigcap_{j \in K} \Theta^c_j$, partition the parameter space. Note that $\Theta^*_I$ can be interpreted as the part of the parameter space in which exactly $H_{0i}$, $i \in I$, are true, and $H_{0j}$, $j \notin I$, are false.

2. Test each $H^P_{0I}: \theta \in \Theta^*_I$ at level $\alpha$. Since the null hypotheses are disjoint, at most one null hypothesis is true. Therefore, even though no multiplicity adjustment to the levels of the tests is made, the probability of rejecting at least one true null hypothesis is at most $\alpha$.

3. For all $I \subseteq K$, infer $\theta \notin \Theta_I$ if all $H^P_{0J}$ such that $I \subseteq J$ are rejected. That is, infer the intersection null hypothesis $H_{0I}$ is false if all null hypotheses $H^P_{0J}$ implying it are rejected. In terms of useful scientific inference of rejecting the original null hypotheses $H_{0i}$, $i \in K$, since $\Theta_i = \bigcup_{I \ni i} \Theta^*_I$, the partition test rejects $H_{0i}$ if all $H^P_{0I}$ such that $I \ni i$ are rejected.

For example, suppose the desired inferences are $\theta_1 > 0$, or $\theta_2 > 0$, or both. With the alternative hypotheses being $H_{a1}: \theta_1 > 0$ and $H_{a2}: \theta_2 > 0$, the proper null hypotheses are $H_{01}: \theta_1 \le 0$ and $H_{02}: \theta_2 \le 0$.

Partitioning forms the hypotheses

$H^{\cap}_{0}: \theta_1 \le 0$ and $\theta_2 \le 0$

$H^P_{01}: \theta_1 \le 0$ and $\theta_2 > 0$

$H^P_{02}: \theta_1 > 0$ and $\theta_2 \le 0$

and tests each at level $\alpha$. Since the three hypotheses are disjoint, at most one null hypothesis is true. Therefore, even though no multiplicity adjustments to the levels of the tests are made, the probability of rejecting at least one true null hypothesis is at most $\alpha$. Applying the Partitioning Principle, statistical inference proceeds as follows:

If $H^{\cap}_{0}: \theta_1 \le 0$ and $\theta_2 \le 0$ is accepted, no inference is given.

If only $H^{\cap}_{0}: \theta_1 \le 0$ and $\theta_2 \le 0$ is rejected but not $H^P_{01}: \theta_1 \le 0$ and $\theta_2 > 0$ or $H^P_{02}: \theta_1 > 0$ and $\theta_2 \le 0$, the inference is that at least one of $H_{a1}: \theta_1 > 0$ or $H_{a2}: \theta_2 > 0$ is true, but which one cannot be specified.

If $H^{\cap}_{0}: \theta_1 \le 0$ and $\theta_2 \le 0$ and $H^P_{01}: \theta_1 \le 0$ and $\theta_2 > 0$ are rejected but not $H^P_{02}: \theta_1 > 0$ and $\theta_2 \le 0$, the inference is $\theta_1 > 0$.

If $H^{\cap}_{0}: \theta_1 \le 0$ and $\theta_2 \le 0$ and $H^P_{02}: \theta_1 > 0$ and $\theta_2 \le 0$ are rejected but not $H^P_{01}: \theta_1 \le 0$ and $\theta_2 > 0$, the inference is $\theta_2 > 0$.

If $H^{\cap}_{0}: \theta_1 \le 0$ and $\theta_2 \le 0$ and $H^P_{01}: \theta_1 \le 0$ and $\theta_2 > 0$ and $H^P_{02}: \theta_1 > 0$ and $\theta_2 \le 0$ are all rejected, the inference is both $\theta_1 > 0$ and $\theta_2 > 0$.
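The collation rules of this two-parameter example can be written as a small decision function. The sketch below is only an illustration: the function name and its boolean arguments are hypothetical, and each argument is assumed to be the outcome of a level-α test of one of the three disjoint partition hypotheses.

```python
def partition_inference(reject_both_le0: bool, reject_p01: bool, reject_p02: bool) -> str:
    """Collate the three partition tests of the two-parameter example.

    reject_both_le0: the hypothesis "theta1 <= 0 and theta2 <= 0" was rejected.
    reject_p01:      the hypothesis "theta1 <= 0 and theta2 > 0" was rejected.
    reject_p02:      the hypothesis "theta1 > 0 and theta2 <= 0" was rejected.
    """
    if not reject_both_le0:
        # Nothing can be inferred while both parameters may be non-positive.
        return "no inference"
    if reject_p01 and reject_p02:
        return "theta1 > 0 and theta2 > 0"
    if reject_p01:
        # Every partition hypothesis with theta1 <= 0 has been rejected.
        return "theta1 > 0"
    if reject_p02:
        # Every partition hypothesis with theta2 <= 0 has been rejected.
        return "theta2 > 0"
    return "theta1 > 0 or theta2 > 0 (which one cannot be specified)"


if __name__ == "__main__":
    print(partition_inference(True, True, False))
```

The original null hypothesis for θ1 is rejected only when both partition hypotheses containing θ1 ≤ 0 are rejected, which is the third rule of the Partitioning Principle applied to this example.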

Partitioning inference is conditional in the sense that it uses the data to decide in which subspaces the true parameter may lie, and bases its inferences on tests appropriate for those subspaces. The Partitioning Principle has been successfully applied to derive practically useful methods for bioequivalence (Berger and Hsu, 1996), dose-response studies (Hsu and Berger, 1999), and genetic linkage analysis (Lin, Rogers, and Hsu, 2001).

The next section describes how the partitioning principle can be generalized to systematically construct methods controlling gFWER. A key difference between classical partitioning and generalized partitioning is that generalized partition testing must test each individual $H_{0i}$, $i \in K$, in each $\Theta^*_I$.

4 The Generalized Partitioning Principle

The generalized Partitioning Principle states:

P1. Partition the parameter space $\bigcup_{i \in K} \Theta_i$ into disjoint sets: for each $I \subseteq K = \{1, \ldots, k\}$, let $\Theta^*_I = \bigcap_{i \in I} \Theta_i \cap \left(\bigcap_{j \notin I} \Theta^c_j\right)$.

P2. In each $\Theta^*_I$, reject all $H_{0i}$, $i \notin I$, and test $\{H_{0i}: \theta_i \in \Theta_i, i \in I\}$ at gFWER level $\alpha$, controlling $\sup_{\theta \in \Theta^*_I} P_\theta(V > m) \le \alpha$.

P3. For $i \in I$, define $H_{0i}$ to be $I$-rejected to mean $H_{0i}$ is rejected in $\Theta^*_I$. Reject $H_{0i}$ if $H_{0i}$ is $I$-rejected for all $I$, $I \subseteq K$.

The generalized partitioning principle first partitions the parameter space into disjoint sets, so that exactly one $\Theta^*_I$ contains the true parameter $\theta^*$. It then controls gFWER by controlling gFWER within each $\Theta^*_I$.



Consider, for example, $k = 3$, $m = 1$, $\Theta_i = \{\theta_i \le 0\}$. Then there are $2^3 = 8$ $\Theta^*_I$’s. Each $H_{0i}: \theta_i \le 0$ is tested eight times, once in each $\Theta^*_I$, and is ultimately rejected only if it is rejected in each $\Theta^*_I$. Within each $\Theta^*_I$, $H_{0i}$ for $\theta_i > 0$ are automatically rejected, while $H_{0i}$ for $\theta_i \le 0$ are tested with multiplicity adjustments in such a way that gFWER is controlled at level $\alpha$ within $\Theta^*_I$. For $I = \{2, 3\}$ say, $H_{01}$ is automatically rejected while $H_{02}$ and $H_{03}$ are tested so that the probability of rejecting both is no more than $\alpha$. If the test statistics are independent, then each can be tested at level $\sqrt{\alpha}$ within $\Theta^*_{\{2,3\}}$. (Traditional partition testing controlling FWER, or equivalently gFWER with $m = 0$, would test $H_{02}$ and $H_{03}$ within $\Theta^*_{\{2,3\}}$ so that the probability of rejecting at least one is no more than $\alpha$.)

To see that the generalized partitioning principle controls gFWER, suppose, for example, $\theta^* = (1, -1, -1)$. Then, with $m = 1$, gFWER is controlled at level $\alpha$ if the probability of rejecting both $H_{02}$ and $H_{03}$ is no more than $\alpha$. Since $H_{02}$ and $H_{03}$ will be rejected if they are $I$-rejected in $\Theta^*_I$ for all $I \subseteq K$, the probability that they will both be rejected is no more than the probability that both are $I$-rejected for $I = \{2, 3\}$, and that probability is no more than $\alpha$ by P2 of the generalized partitioning principle.

Note that the generalized partitioning principle controlling gFWER with $m > 0$ differs from previous partitioning principles (e.g., the General Partitioning Principle of Finner and Strassburger, 2002) controlling gFWER with $m = 0$ in that generalized partition testing must test each individual $H_{0i}$, $i \in K$, in each $\Theta^*_I$. While such tests are not unique, a reasonable way of testing is as follows.

Suppose $T^I_i$ is a suitable test statistic for testing $H_{0i}$, $i \in I$. Let $|I|$ be the cardinality of $I$ and let $[1], \ldots, [|I|]$ be the indices such that $T^I_{[1]} \le \cdots \le T^I_{[|I|]}$ for $T^I_i$, $i \in I$. Suppose the critical value $c_I$ satisfies

$$\sup_{\theta \in \Theta^*_I} P_\theta\left(T^I_{[|I|-m]} > c_I\right) \le \alpha. \tag{7}$$

Then the $I$-reject generalized partitioning test proceeds as follows:

1. In all $\Theta^*_I$ with $|I| \le m$, $I$-reject all $H_{0i}$, $i \in I$.

2. In $\Theta^*_I$ with $|I| > m$,
   - $I$-reject $H^I_{[|I|]}, H^I_{[|I|-1]}, \ldots, H^I_{[|I|-m+1]}$ for sure;
   - $I$-reject $H^I_{0i}$ for all $i \in I$ if $T^I_{[|I|-m]} > c_I$;
   - Do not $I$-reject any $H^I_{0i}$, $i \in I \setminus \{[|I|], [|I|-1], \ldots, [|I|-m+1]\}$, if $T^I_{[|I|-m]} \le c_I$.

3. Reject $H_{0i}$ if $H_{0i}$ is $I$-rejected in all $\Theta^*_I$ such that $i \in I$.
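For a small number of hypotheses, the I-reject test can be carried out by brute-force enumeration of the subsets I. The following is a minimal sketch under simplifying assumptions that are not part of the definition above: the same statistic T_i is used in every subset, and the critical values are supplied by a caller-provided function `crit` (a hypothetical name) that is assumed to satisfy (7). Hypotheses are indexed from 0.

```python
from itertools import combinations


def generalized_partition_test(stats, m, crit):
    """Return the set of indices i whose H_0i is rejected.

    stats: list of test statistics T_i (large values are evidence against H_0i).
    m:     number of false rejections tolerated by the gFWER criterion.
    crit:  function mapping a tuple of indices I (with |I| > m) to c_I.
    """
    k = len(stats)
    rejected = set(range(k))  # remove i as soon as some I containing i fails to I-reject it
    # Subsets with |I| <= m I-reject everything, so only larger subsets matter.
    for size in range(m + 1, k + 1):
        for subset in combinations(range(k), size):
            order = sorted(subset, key=lambda i: stats[i])  # ascending: order[j-1] is [j]
            pivot = stats[order[size - m - 1]]              # T_[|I|-m]
            if pivot > crit(subset):
                continue                                    # every H_0i, i in I, is I-rejected
            # Only the m largest statistics are I-rejected; the rest are not.
            rejected -= set(order[: size - m])
    return rejected


if __name__ == "__main__":
    # Hypothetical critical values that depend only on |I| and increase with |I|.
    by_size = {2: 1.0, 3: 1.5}
    print(generalized_partition_test([3.0, 2.0, 0.5], 1, lambda I: by_size[len(I)]))
```

The enumeration visits every subset with more than m elements, so it is practical only for small k.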

Theorem 4.1 The I-reject partitioning test controls gFWER strongly at level a.

The proof of this theorem is given in Appendix A.

5 Step-Down Shortcut of Partition Testing

In principle, generalized partition testing calls for testing $\sum_{j=m+1}^{k} \binom{k}{j}$ null hypotheses. In bioinformatics, the number of original hypotheses, $k$, is in the hundreds, thousands, or even tens of thousands. Thus, a full implementation of generalized partition testing is infeasible. For example, when $k = 100$ and $m = 5$, the partitioning test requires more than $10^{30}$ tests.
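The size of this count is easy to check numerically. The sketch below simply evaluates the sum of binomial coefficients with exact integer arithmetic; the function name is hypothetical.

```python
from math import comb


def number_of_partition_tests(k: int, m: int) -> int:
    """Number of subsets I of {1, ..., k} with |I| > m, i.e. sum_{j=m+1}^{k} C(k, j)."""
    return sum(comb(k, j) for j in range(m + 1, k + 1))


if __name__ == "__main__":
    n_tests = number_of_partition_tests(100, 5)
    print(n_tests > 10**30)  # the count exceeds 10^30, as stated in the text
```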

Fortunately, there is often symmetry in the original hypotheses to allow for a step-down shortcut, reducing the maximum number of tests to be performed to no more than $k - m$, often even fewer. For $k = 100$ and $m = 5$ for example, a step-down test requires at most 95 steps. A set of sufficient conditions for such a shortcut to be possible is as follows.

S1. All tests are based on a set of statistics $\{T_i, i = 1, \ldots, k\}$ whose values do not depend on which $\Theta^*_I$ is being tested;

S2. The tests in each $\Theta^*_I$ are of the form described in the previous section;

S3. For $|I| > m$, critical values $c_I$ have the property that if $I \subseteq J$ then $c_I \le c_J$, and $c_I = c_J$ if $|I| = |J|$.



For modeling-based analysis of gene expression levels from microarrays, this set of conditions can be checked.

5.1 Step-down testing

Assume conditions S1–S3 are satisfied. The critical value $c_I$ then depends only on the cardinality $|I|$ of $I$, and will be denoted by $c_{|I|-m}$. Let $[1], [2], \ldots, [k]$ be the (random) indices such that $T_{[1]} \le T_{[2]} \le \cdots \le T_{[k]}$. Generalized step-down testing proceeds as follows.

Step 0

Reject $H_{[k]}, H_{[k-1]}, \ldots, H_{[k-m+1]}$;

Step 1

If $T_{[k-m]} \le c_{k-m}$,

then stop and infer nothing else;

else reject $H_{[k-m]}$ and go to Step 2;

Step 2

If $T_{[k-1-m]} \le c_{k-1-m}$,

then stop;

else reject $H_{[k-1-m]}$ and go to Step 3;

...

Step $k - m$

If $T_{[1]} \le c_1$,

then stop;

else reject $H_{[1]}$ and stop.
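The steps above can be sketched in a few lines of code. This is an illustration only: the function name is hypothetical, hypotheses are indexed from 0, and the critical values c_1, ..., c_{k-m} are assumed to be supplied by the caller as a dictionary keyed by their subscript.

```python
def generalized_step_down(stats, m, crit):
    """Return the set of indices rejected by generalized step-down testing.

    stats: list of test statistics T_1, ..., T_k.
    m:     number of false rejections tolerated.
    crit:  dict with crit[j] = c_j for j = 1, ..., k - m.
    """
    k = len(stats)
    order = sorted(range(k), key=lambda i: stats[i])  # order[j - 1] is the index [j]
    rejected = set(order[k - m:])                     # Step 0: the m largest statistics
    for j in range(k - m, 0, -1):                     # Steps 1, ..., k - m
        idx = order[j - 1]
        if stats[idx] <= crit[j]:
            break                                     # stop and infer nothing else
        rejected.add(idx)
    return rejected


if __name__ == "__main__":
    # Hypothetical critical values c_2 = 1.5 and c_1 = 1.0 for k = 3, m = 1.
    print(generalized_step_down([3.0, 2.0, 0.5], 1, {2: 1.5, 1: 1.0}))
```

At most k − m comparisons are made, in contrast to the full enumeration of subsets required by the unabridged partition test.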

The elements of step-down testing are summarized in Table 1.

Theorem 5.1 Conditions S1–S3 are sufficient for the partitioning test to have the step-down shortcut as described.

The proof of this theorem is given in Appendix B, and some geometric insight is given in Appendix C.


Table 1 gFWER-controlling step-down testing procedure.

Test statistic      Critical value      Step
$T_{[k-m]}$         $c_{k-m}$           $1$
$T_{[k-m-1]}$       $c_{k-1-m}$         $2$
...                 ...                 ...
$T_{[i]}$           $c_i$               $k - m - i + 1$
...                 ...                 ...
$T_{[1]}$           $c_1$               $k - m$


5.2 Special case of the test of Lehmann and Romano

In this section, we show that the step-down test by Lehmann and Romano (2005) is a special case of our generalized step-down testing.

For $i = 1, \ldots, k$, let $p_i$ be a suitably defined p-value for testing $H_{0i}$. Consider the test statistics $T_i = 1 - p_i$, $i = 1, \ldots, k$, and the critical values

$$c_I = 1 - \alpha_{LR} = 1 - \frac{(m+1)\,\alpha}{|I|}.$$

These test statistics and critical values satisfy condition P2 of the generalized partitioning principle, because suppose $\theta \in \Theta^*_I$, then

$$\begin{aligned}
P_\theta(V > m) &= P_\theta(V \ge m+1) \\
&\le \frac{E(V)}{m+1} \qquad (8) \\
&= \frac{E\left[\sum_{i \in I} I(p_i \le \alpha_{LR})\right]}{m+1} \\
&= \frac{\sum_{i \in I} P(p_i \le \alpha_{LR})}{m+1} \\
&\le \frac{\sum_{i \in I} (m+1)\,\alpha/|I|}{m+1} \\
&= \alpha,
\end{aligned}$$

where (8) follows from Markov’s inequality.

They also clearly satisfy conditions S1–S3 for shortcutting the partitioning test. To describe the shortcut version of the generalized partition test, let $(1), \ldots, (k)$ be the (random) indices such that

$$T_{(1)} \ge \cdots \ge T_{(k)}$$

or, equivalently,

$$p_{(1)} \le \cdots \le p_{(k)}.$$

(Note that $((1), \ldots, (k)) = ([k], \ldots, [1])$ in the previous section.) Suppose

$$\alpha_i = \begin{cases} 1, & i < m+1 \\[4pt] \dfrac{(m+1)\,\alpha}{k+m+1-i}, & i \ge m+1 \end{cases} \tag{9}$$

then step-down generalized partition testing proceeds as follows.

If $p_{(1)} > \alpha_1$, reject no null hypotheses. Otherwise, reject $H_{0(1)}, \ldots, H_{0(r)}$ where $r$ is the largest integer satisfying

$$T_{[k]} > c_k, \ldots, T_{[k-r+1]} > c_{k-r+1}$$

or, equivalently,

$$p_{(1)} < \alpha_1, \ldots, p_{(r)} < \alpha_r.$$

This is indeed the test of Lehmann and Romano (2005), who define $\alpha_i$ as

$$\alpha_i = \begin{cases} \dfrac{(m+1)\,\alpha}{k}, & i < m+1 \\[4pt] \dfrac{(m+1)\,\alpha}{k+m+1-i}, & i \ge m+1 \end{cases}$$

for the sake of monotonicity of $\alpha_i$ in $i$ but remark that the $\alpha_i$ can also be defined as (9). This is in agreement with condition S3, which requires monotonicity only for $|I| > m$.
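The two definitions of α_i, and the p-value form of the step-down test, can be sketched as follows. The function names and the `partition_variant` flag are hypothetical; the flag switches between the monotone constants of Lehmann and Romano and the constants in (9). The comparison uses the strict inequality written above, and hypotheses are indexed from 0.

```python
def step_down_alphas(k, m, alpha, partition_variant=False):
    """Return [alpha_1, ..., alpha_k] for the p-value step-down test."""
    alphas = []
    for i in range(1, k + 1):
        if i < m + 1:
            # (9) uses 1 here; the monotone variant uses (m + 1) * alpha / k.
            alphas.append(1.0 if partition_variant else (m + 1) * alpha / k)
        else:
            alphas.append((m + 1) * alpha / (k + m + 1 - i))
    return alphas


def step_down_reject(pvalues, m, alpha, partition_variant=False):
    """Return indices of rejected hypotheses, in order of increasing p-value."""
    k = len(pvalues)
    alphas = step_down_alphas(k, m, alpha, partition_variant)
    order = sorted(range(k), key=lambda i: pvalues[i])
    rejected = []
    for rank, idx in enumerate(order):       # rank 0 corresponds to p_(1)
        if pvalues[idx] < alphas[rank]:
            rejected.append(idx)
        else:
            break                            # stop at the first non-rejection
    return rejected


if __name__ == "__main__":
    p = [0.04, 0.5, 0.6, 0.7]
    print(step_down_reject(p, 1, 0.05))                          # monotone constants
    print(step_down_reject(p, 1, 0.05, partition_variant=True))  # constants from (9)
```

With the constants in (9), the hypotheses with the m smallest p-values are rejected whenever their p-values are below 1, which mirrors Step 0 of the generalized step-down test.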

The augmentation test of van der Laan, Dudoit, and Pollard (2004), on the other hand, is not a special case of generalized partition testing. Their augmentation test rejects the null hypotheses that would be rejected with $m = 0$, and then augments the rejections by automatically rejecting the null hypotheses associated with the next $m$ extreme test statistics. This is in contrast to the generalized partitioning test which automatically rejects the $m$ null hypotheses associated with the $m$ most extreme test statistics, and then compares the remaining test statistics to critical values $c_{k-m}, \ldots, c_1$ which are smaller than the corresponding critical values for $m = 0$.

Option to not reject. With any multiple testing method, there is the option to not reject null hypotheses that can be rejected while keeping the error rate controlled. In particular, in executing the partitioning step-down method controlling gFWER, one has the option to not reject some or all of the null hypotheses corresponding to the $m$ most extreme test statistics, either on scientific grounds or if the test statistics are less than $c_{k-m}$, for example.

5.3 An example

To illustrate the partitioning step-down method, consider a microarray experiment comparing gene expression levels in tumor samples from high risk patients to low risk patients, for the purpose of training a prognostic algorithm. After testing for no differential expressions between the risk groups, expression levels of genes found to be differentially expressed are used to train a prognostic algorithm. An example of such a 2-stage process is reported in van’t Veer et al. (2002).

For such a purpose, it is a good idea to probe only genes already suspected to be involved in the disease, so that each gene can potentially be probed multiple times from each sample, decreasing measurement error. Hsu et al. (2006), for example, reported on an experiment probing 200 genes suspected to be involved in breast cancer. For such an experiment involving genes already suspected to be differentially expressed, it would not be surprising for a large proportion of the null hypotheses of non-differential expressions to be false. If one were to use a multiple testing method controlling the False Discovery Rate (FDR), the number of false rejections can be rather high even though the expected proportion of false rejections is low (cf. Finner and Roter, 2001). A better strategy would be to control the number of false rejections, using a gFWER-controlling method.

Suppose $k = 200$, $\alpha = .05$, and the test statistics are independent and are standard normal under the null hypotheses. For $m = 0$, some of the critical values are:

$$c^{(m=0)}_{200} = 3.656, \quad c^{(m=0)}_{151} = 3.583, \quad c^{(m=0)}_{101} = 3.477,$$

while for $m = 5$, some of the critical values are:

$$c^{(m=5)}_{195} = 2.480, \quad c^{(m=5)}_{146} = 2.377, \quad c^{(m=5)}_{96} = 2.223.$$

Suppose, for $i = 146, \ldots, 195$, $T_{[i]} > c^{(m=5)}_i$, but $T_{[145]} \le c^{(m=5)}_{145}$. Then the partitioning step-down method controlling gFWER at $m = 5$ can reject the null hypotheses corresponding to the 55 genes with the largest test statistic values. If, for $i = 151, \ldots, 200$, $T_{[i]} > c^{(m=0)}_i$, then the augmentation method can reject at least the same null hypotheses as the partitioning step-down method, maybe more. If, on the other hand, it is not the case that for $i = 151, \ldots, 200$, $T_{[i]} > c^{(m=0)}_i$, then the augmentation method will fail to reject some null hypotheses that can be rejected by the partitioning step-down method.

Note that, in our example, the effective threshold for controlling gFWER at $m = 5$ is $c^{(m=5)}_{146} = 2.377$. In contrast to providing a data-dependent assessment of strength of inference as one would with a fixed threshold, partition testing provides a data-dependent threshold with the aim of keeping the true error rate as close to the nominal error rate as possible.
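Critical values of this kind can be approximated with a short calculation. The sketch below is not the authors' code; it assumes that the statistics are compared two-sidedly (as absolute values of independent standard normal variables), so that the number of false rejections among n true null hypotheses is binomial, and it solves for the per-test level by bisection. The function names are hypothetical, and `n_true` plays the role of |I|, so the critical value with subscript 195 for m = 5 corresponds to `n_true = 200`.

```python
from math import comb
from statistics import NormalDist


def prob_more_than_m(n, m, a):
    """P(V > m) when V is Binomial(n, a)."""
    return 1.0 - sum(comb(n, j) * a**j * (1.0 - a) ** (n - j) for j in range(m + 1))


def critical_value(n_true, m, alpha):
    """Two-sided normal critical value c with P(more than m of n_true |Z_i| exceed c) = alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection on the per-test level a
        mid = (lo + hi) / 2.0
        if prob_more_than_m(n_true, m, mid) > alpha:
            hi = mid
        else:
            lo = mid
    return NormalDist().inv_cdf(1.0 - lo / 2.0)


if __name__ == "__main__":
    print(round(critical_value(200, 0, 0.05), 3))
    print(round(critical_value(200, 5, 0.05), 3))
```

Under these assumptions the computed values agree closely with the critical values quoted above for 200 true null hypotheses.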



6 Permutation Methods Require Distributional Assumption for Validity

The gFWER-controlling step-down method of Lehmann and Romano (2005), analogous to the FWER-controlling step-down method of Holm (1979), does not take into account the joint distribution of test statistics. The generalized Partitioning Principle we propose provides a systematic technique of constructing gFWER-controlling methods that make use of information on the joint distribution of test statistics. Care needs to be taken in implementing any multiple testing method which takes the joint distribution of test statistics into account. For example, permutation testing is often thought of as one way to take the joint distribution into account. We will prove, however, that permutation testing as is commonly practiced controls FWER only with an assumption on the joint distribution of the test statistics.

Suppose observations from $n_L$ patients in the low risk group and $n_H$ patients in the high risk group are available. Assume, as in the discovery situation, the individual null hypotheses to be tested are (1): $H^F_{0i}: F_{Li} = F_{Hi}$, $i = 1, \ldots, k$. Consider testing each $H^F_{0i}$ with $T_i = |\bar{Y}_{Li} - \bar{Y}_{Hi}|$, where $\bar{Y}_{Li}$ and $\bar{Y}_{Hi}$ denote the sample means of the (logarithm of) expression levels of gene $i$ from the low risk group and the high risk group, respectively. Let $[1], \ldots, [k]$ denote the indices that order the observed test statistics

$$T_{[1]} \le \cdots \le T_{[k]}.$$

One form of step-down permutation tests aimed to control FWER and gFWER proceeds as in Section 5.1, with the critical value $c_{k-i+1}$ for the $i$-th step computed as the upper $\alpha$ quantile of the maximum of the permuted test statistics $\{T^{perm}_{[1]}, \ldots, T^{perm}_{[k-i+1]}\}$, where $\{T^{perm}_{[1]}, \ldots, T^{perm}_{[k-i+1]}\}$ are the test statistics $T_{[1]}, \ldots, T_{[k]}$ re-computed from random permutations of the pooled sample of $n_L$ observed vectors from low risk patients and $n_H$ high risk patients (with the monotonicity condition S3 enforced algorithmically).

Suppose, for example, $k = 2$, $n_L = 2$, $n_H = 8$, and FWER is to be controlled (i.e., gFWER with $m = 0$). Permutation testing randomly labels two of the ten observed bivariate vectors of observations as from the low risk group and the remaining eight observed bivariate vectors as from the high risk group, performs this random relabeling a large number of times, computing the maximum of $T_1, T_2$ each time. The critical value $c^{perm}_2$ is then computed as the upper $\alpha$ quantile of the empirical distribution of such maximums. At Step 1 of permutation step-down testing, the individual null hypothesis $H^F_{0[2]}: F_{L[2]} = F_{H[2]}$ is inferred to be false if $T_{[2]} > c^{perm}_2$. At Step 2, provided permutation step-down testing has not yet stopped, the individual null hypothesis $H^F_{0[1]}: F_{L[1]} = F_{H[1]}$ is inferred to be false if $T_{[1]} > c^{perm}_1$. We now give an example showing this step-down permutation test does not control the FWER.

Suppose $F_{L\{1,2\}}$ is bivariate normal with means zero, variances one, and correlation zero, while $F_{H\{1,2\}}$ is bivariate normal with means zero, variances one, and correlation 0.9. Thus, both $H^F_{01}$ and $H^F_{02}$ are true. The distribution of $(T^{perm}_1, T^{perm}_2)$, a weighted mixture of $F_{L\{1,2\}}$ and $F_{H\{1,2\}}$, is given by Theorem 2.1 of Huang, Xu, Calian, and Hsu (2006). With different correlations and $n_L \ne n_H$, the distribution of $(T^{perm}_1, T^{perm}_2)$ is somewhat different from the distribution of $(T_1, T_2)$. In fact, for a target FWER of 5%, $c_2 = 2.23$ while $c^{perm}_2 = 2.15$ (based on 100,000 simulations), and the FWER, the probability of rejecting at least one of $H^F_{01}, H^F_{02}$, turns out to be more than 6%.
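The computation of a permutation critical value such as the one for Step 1 can be sketched as follows. This is an illustration under assumptions rather than the procedure used for the simulations reported above: instead of sampling random relabelings, it enumerates all of them (45 when two of ten patients are labeled low risk), and all function names are hypothetical.

```python
from itertools import combinations


def max_abs_mean_difference(data, low_idx):
    """Largest |mean_low - mean_high| over genes for one labeling.

    data:    list of per-patient tuples, one value per gene.
    low_idx: set of patient positions labeled as low risk.
    """
    low = [row for p, row in enumerate(data) if p in low_idx]
    high = [row for p, row in enumerate(data) if p not in low_idx]
    diffs = []
    for g in range(len(data[0])):
        mean_low = sum(row[g] for row in low) / len(low)
        mean_high = sum(row[g] for row in high) / len(high)
        diffs.append(abs(mean_low - mean_high))
    return max(diffs)


def relabeled_maxima(data, n_low):
    """Maximum statistic for every way of labeling n_low patients as low risk."""
    patients = range(len(data))
    return [max_abs_mean_difference(data, set(c)) for c in combinations(patients, n_low)]


def permutation_critical_value(data, n_low, alpha):
    """Smallest observed maximum exceeded by at most a fraction alpha of relabelings."""
    maxima = sorted(relabeled_maxima(data, n_low))
    for c in maxima:
        if sum(v > c for v in maxima) <= alpha * len(maxima):
            return c
    return maxima[-1]


if __name__ == "__main__":
    # Made-up bivariate observations for ten patients, for illustration only.
    toy = [(float(i), float((7 * i) % 10)) for i in range(10)]
    print(permutation_critical_value(toy, 2, 0.05))
```

The relabeling treats each patient's whole vector as a unit, so the reference distribution it produces reflects the pooled joint distribution of the two groups rather than the joint distribution within either group.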

To understand why permutation tests do not control FWER strongly, recall partition testing forms the hypotheses

$$H^P_{0I}: F_{Li} = F_{Hi} \text{ for } i \in I \text{ and } F_{Lj} \ne F_{Hj} \text{ for } j \notin I$$

for each $I \subseteq \{1, \ldots, k\}$ and requires a level-$\alpha$ test for each of $H^P_{0I}$. It infers $F_{Li} \ne F_{Hi}$ if and only if the level-$\alpha$ test for each of $H^P_{0I}$, $i \in I$, rejects. The null hypotheses being tested by permutation testing are, however,

$$H^{perm}_{0I}: F_{LI} = F_{HI}$$

where $F_{LI}$ and $F_{HI}$ are the joint distributions of the expression levels of genes with indices in $I$ from the low risk and high risk groups, respectively. Specifically, the critical value $c^{perm}_{k-i}$ is computed so that the permutation test for $H^{perm}_{0\{[1], \ldots, [k-i]\}}$ is a level-$\alpha$ test. The point that seems to have been missed in current literature is that a level-$\alpha$ permutation test for $H^{perm}_{0I}$ is not necessarily a level-$\alpha$ test for $H^P_{0I}$, according to either definition of Type I error rate (3) or (4), as our bivariate normal example shows. A level-$\alpha$ test for $H^{perm}_{0I}$ would be a level-$\alpha$ test for $H^P_{0I}$ only if the following marginals-determine-the-joint condition holds:

MDJ Let $I_j$, $j = 1, \ldots, n$, be any collection of disjoint subsets of $\{1, \ldots, k\}$, $I_j \subseteq \{1, \ldots, k\}$. If the marginal distributions of the observations are identical between the two groups, $F_{LI_j} = F_{HI_j}$ for all $j = 1, \ldots, n$, then the joint distributions are identical as well, $F_{LI_U} = F_{HI_U}$ where $I_U = \bigcup_{j=1,\ldots,n} I_j$.

The claim that step-down permutation tests strongly control the Familywise Error Rate is sometimes justified by stating the subset pivotality condition given on page 42 of Westfall and Young (1993) is satisfied. In what follows, $\bigcap_{i \in I} H_{0i}$ abbreviates the situation that all the null hypotheses $H_{0i}$, $i \in I$, are true. The complete null hypothesis, denoted by $H^c_0$, refers to the situation that all the null hypotheses $H_{0i}$, $i = 1, \ldots, k$, are true.

Subset Pivotality The distribution of $\{T_1, \ldots, T_k\}$ has the subset pivotality property if the joint distribution of the subvector $\{T_i: i \in J\}$ is identical under the restriction $\bigcap_{i \in J} H_{0i}$ and $H^c_0$, for all subsets $J = \{i_1, \ldots, i_j\}$ of true null hypotheses.

Unfortunately, without the MDJ condition, subset pivotality may not hold for permutation tests. Continuing with our gene expressions example, suppose $k = 3$, $F_{L\{1,2,3\}}$ is trivariate normal with means zero, variances one, and correlations zero, while $F_{H\{1,2,3\}}$ is trivariate normal with means $(0, 0, \mu_{H3})$ and variances one. If the correlation of $F_{H\{1,2\}}$ depends on $\mu_{H3}$, then Theorem 2.1 in Huang, Xu, Calian, and Hsu (2006) shows the joint distribution of the permuted test statistics for genes 1 and 2 could be different under $\bigcap_{i \in \{1,2\}} H_{0i}$ and $H^c_0$ (if $n_L \ne n_H$), because without MDJ different correlations of $F_{H\{1,2\}}$ are allowed under $\bigcap_{i \in \{1,2\}} H_{0i}$. This is an example for which subset pivotality does not hold for permutation tests. In this example, condition MDJ holds if whenever a group of genes have the same means and variances between the low risk and high risk groups, they will have the same correlations as well.

In our experience, the surest way to validate step-down testing is to model the data, check for appropriateness of the model using diagnostic techniques, and verify conditions S1–S3 hold for the model. Hsu et al. (2006) gives an example of successfully modeling gene expression data as a linear model with i.i.d. error vectors, facilitated by statistically designing microarrays and microarray experiments. In that situation re-sampling of the residuals provides appropriate multiplicity adjustment for controlling FWER. Below we demonstrate potential benefits of using information on the joint distribution of test statistics that may be available from modeling in implementing methods controlling gFWER.

For a fixed gFWER level $\alpha$, the sharpest possible critical value for testing each $\Theta_I^*$ is

$$c_I = \inf\Big\{ x \in \mathbb{R} \;\Big|\; \sup_{\theta \in \Theta_I^*} P_\theta\big(T^I_{[|I|-m]} > x\big) \le \alpha \Big\}. \tag{10}$$

Suppose the shortcut condition S1 holds and the distribution of $T_i$ when the null hypothesis $H_{0i}$ is true is the same for all $i$. Denote by $F$ the common marginal null distribution and by $P_0$ the probability computed assuming the null hypotheses involved are true.

For $m = 0$, a popular procedure is to bound $P_0(T_{[|I|]} > x)$ with expressions that depend only on $F$, so that critical values can be computed based on the univariate distribution $F$. For example, Holm's method is based on bounding $P_0(T_{[|I|]} > x)$ by Bonferroni's inequality.

Let $\alpha_c = F(c_{|I|-m})$. Markov's inequality provides the upper bound

$$P_0(V > m) \le \frac{|I|\,\alpha_c}{m+1}. \tag{11}$$
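The bound (11) follows in one line if $\alpha_c$ is read as the marginal probability that a single true-null statistic exceeds the critical value, and $V$ as the number of such exceedances among the $|I|$ true nulls. Both readings are assumptions made here for illustration:

```latex
\begin{align*}
V &= \sum_{i \in I} \mathbf{1}\{T_i > c_{|I|-m}\}, \qquad E_0[V] = |I|\,\alpha_c, \\
P_0(V > m) &= P_0(V \ge m+1) \;\le\; \frac{E_0[V]}{m+1} \;=\; \frac{|I|\,\alpha_c}{m+1}.
\end{align*}
```

Setting the right-hand side equal to $\alpha$ gives $\alpha_c = (m+1)\alpha/|I|$, which for $m = 0$ reduces to the Bonferroni level $\alpha/|I|$.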



If the test statistics are independent, then in fact

$$P_0(V > m) = 1 - \sum_{j=0}^{m} \binom{|I|}{j} \alpha_c^{\,j} (1 - \alpha_c)^{|I|-j}. \tag{12}$$

It can be shown that this quantity (12) is always smaller than the Bonferroni/Markov bound. For $m = 0$, it can often be proven that (12) is also an upper bound on the true familywise error rate (FWER) if the test statistics have a joint distribution satisfying certain properties (Hsu, 1996, Chapter 1). For $m = 0$, it is also known that the difference between critical values based on (11) and (12) is slight (see p. 12 of Hsu, 1996). Consider testing 2-sided null hypotheses $H_{0i}: \theta_i = 0$ when the marginal distribution $F$ of the test statistics is the standard normal distribution. For $k = 200$ and $m = 0$, Figure 1 shows that critical values for the step-down method computed using the Bonferroni/Markov inequality (11) and using the independence assumption (12) differ only slightly.

However, for $m > 0$, the difference becomes larger. For $k = 200$, Figure 2 shows that the critical values computed using Markov's inequality (11) and using the independence assumption (12) differ substantially for $m = 5$. Thus, if the test statistics are independent by design or can be modeled as essentially independent, then the critical values should be computed based on (12).

Figure 1 Critical values for step-down tests when $k = 200$, $m = 0$, and $\alpha = 0.05$. Test statistics are assumed to follow the standard normal distribution. (Horizontal axis: step, 0 to 200; vertical axis: critical values; curves: independence, Bonferroni, correlation = 0.5.)

Figure 2 Critical values for step-down tests when $k = 200$, $m = 5$, and $\alpha = 0.05$. Test statistics are assumed to follow the standard normal distribution. (Horizontal axis: step, 0 to 200; vertical axis: critical values; curves: independence, Markov, correlation = 0.5.)
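As a rough numerical check of this comparison, the following sketch solves (11) and (12) for the critical value at a step with $n$ hypotheses. It assumes two-sided tests with standard normal statistics, so that $\alpha_c = P(|Z| > c)$. The function names and the bisection tolerance are illustrative choices, not part of the original analysis.

```python
from math import comb
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tail_prob_independent(n, m, a_c):
    """Right-hand side of (12): P0(V > m) for n independent tests,
    each exceeding its critical value with probability a_c."""
    at_most_m = sum(comb(n, j) * a_c**j * (1.0 - a_c)**(n - j)
                    for j in range(m + 1))
    return 1.0 - at_most_m


def per_test_level_markov(n, m, alpha):
    """Largest a_c allowed by the Markov bound (11): n * a_c / (m + 1) <= alpha."""
    return min(1.0, alpha * (m + 1) / n)


def per_test_level_independent(n, m, alpha, tol=1e-12):
    """Solve (12) = alpha for a_c by bisection (the tail is increasing in a_c)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail_prob_independent(n, m, mid) > alpha:
            hi = mid
        else:
            lo = mid
    return lo


def two_sided_critical_value(a_c):
    """Critical value c with P(|Z| > c) = a_c for standard normal Z (assumed)."""
    return STD_NORMAL.inv_cdf(1.0 - a_c / 2.0)


if __name__ == "__main__":
    for m in (0, 5):
        c_markov = two_sided_critical_value(per_test_level_markov(200, m, 0.05))
        c_indep = two_sided_critical_value(per_test_level_independent(200, m, 0.05))
        print(m, round(c_markov, 3), round(c_indep, 3))
```

For $n = 200$ and $\alpha = 0.05$, the two critical values printed for $m = 0$ are nearly identical, while for $m = 5$ the Markov-based value is noticeably larger, in line with Figures 1 and 2.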

If $\{T_i, i \in I\}$ have a tractable joint distribution when $H_{0i}$, $i \in I$, are true, then it may be possible to compute $c_{|I|}$ exactly. Such is the case with multiple comparisons in the General Linear Model (GLM), as described in Chapter 7 of Hsu (1996). If the design is treatment variance-balanced, then the standardized statistics

$$T_i = \frac{\hat\theta_i - \theta_i}{\mathrm{SE}(\hat\theta_i - \theta_i)},$$

for multiple comparisons with a control have a multivariate-$t$ distribution with a common correlation $\lambda$ for the numerators (Hsu, 1996, page 200). For one-sided null hypotheses $H_{0i}: \theta_i \le 0$, the sharpest $c_{|I|-m}$ is the solution to

$$\int_0^\infty \int_{-\infty}^{\infty} \sum_{j=|I|-m}^{|I|} \binom{|I|}{j} \left[\Phi\!\left(\frac{\lambda z + c_{|I|-m}\, s}{\sqrt{1-\lambda^2}}\right)\right]^{j} \tag{13}$$

$$\times \left[1 - \Phi\!\left(\frac{\lambda z + c_{|I|-m}\, s}{\sqrt{1-\lambda^2}}\right)\right]^{|I|-j} d\Phi(z)\, \gamma(s)\, ds = 1 - \alpha, \tag{14}$$

where $\Phi$ is the standard normal distribution function and $\gamma$ is the density of the square root of a chi-square random variable divided by its degrees of freedom. For two-sided null hypotheses $H_{0i}: \theta_i = 0$, letting

$$A(c_{|I|-m}, s, \lambda) = \Phi\!\left(\frac{\lambda z + c_{|I|-m}\, s}{\sqrt{1-\lambda^2}}\right) - \Phi\!\left(\frac{\lambda z - c_{|I|-m}\, s}{\sqrt{1-\lambda^2}}\right),$$

the sharpest $c_{|I|-m}$ is the solution to

$$\int_0^\infty \int_{-\infty}^{\infty} \sum_{j=|I|-m}^{|I|} \binom{|I|}{j} \left[A(c_{|I|-m}, s, \lambda)\right]^{j} \tag{15}$$

$$\times \left[1 - A(c_{|I|-m}, s, \lambda)\right]^{|I|-j} d\Phi(z)\, \gamma(s)\, ds = 1 - \alpha. \tag{16}$$

In either case, the critical value $c_{|I|-m}$ can be computed numerically or by simulation. The distribution of test statistics in such a design has a so-called 1-factor structure.
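The simulation route can be sketched as follows, under simplifying assumptions that are not taken from the paper: the error degrees of freedom are treated as infinite (so the statistics are normal rather than $t$), the tests are two-sided, and `rho` denotes the common pairwise correlation, generated through the 1-factor representation $T_i = \sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i$. The critical value is estimated as the empirical $(1-\alpha)$ quantile of the $(m+1)$-th largest $|T_i|$. The function name and defaults are hypothetical.

```python
import random
from math import sqrt, ceil


def simulate_critical_value(n, m, rho, alpha=0.05, reps=4000, seed=0):
    """Monte Carlo estimate of the critical value for n hypotheses.

    Assumes two-sided statistics |T_i|, where (T_1, ..., T_n) is multivariate
    normal with unit variances and common pairwise correlation rho in [0, 1].
    Returns the empirical (1 - alpha) quantile of the (m + 1)-th largest |T_i|.
    """
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    rng = random.Random(seed)
    a, b = sqrt(rho), sqrt(1.0 - rho)
    order_stats = []
    for _ in range(reps):
        z0 = rng.gauss(0.0, 1.0)  # shared factor inducing the correlation
        abs_t = sorted(
            (abs(a * z0 + b * rng.gauss(0.0, 1.0)) for _ in range(n)),
            reverse=True,
        )
        order_stats.append(abs_t[m])  # (m + 1)-th largest statistic
    order_stats.sort()
    # Smallest simulated value x with empirical P(statistic > x) <= alpha.
    index = ceil((1.0 - alpha) * reps) - 1
    return order_stats[index]
```

Calling `simulate_critical_value(200, 5, 0.5)` and `simulate_critical_value(200, 5, 0.0)` gives Monte Carlo counterparts of the correlated and independent settings. The estimates carry simulation error, so `reps` should be increased before drawing conclusions from small differences.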

If the design is variance-balanced and the error degrees of freedom are large, then in the 2-sided case the test statistics $T_i$, $i = 1, \ldots, k$, are essentially absolute values of equally correlated multivariate normal random variables with correlation 0.5. In this case, Figure 1 shows that, for $k = 200$ and $m = 0$, critical values for the step-down method computed for absolute values of equally correlated multivariate normal random variables with correlation 0.5 are substantially smaller than those computed under the independence assumption (12).

Surprisingly, critical values computed under the independence assumption (12) are not necessarily conservative for $m > 0$, even for positively correlated multivariate normal test statistics. For test statistics which are absolute values of equally correlated multivariate normal random variables with correlation 0.5, Figure 2 shows that, for $k = 200$ and $m = 5$, critical values for the step-down method computed for the multivariate normal distribution are substantially larger than those computed under the independence assumption (12).

In summary, for $m > 0$, one approach is to use the Markov inequality (11) to compute conservative critical values. Our approach is to model the data and compute the critical values based on the joint distribution of the test statistics. A third approach, taken by van der Laan, Dudoit, and Pollard (2004), is to estimate the joint distribution of the test statistics by re-sampling.



Appendix A: Proof of Theorem 4.1

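Proof. Without loss of generality, let $\{H_{0i}, i \in K^*\}$ be the non-empty collection of true null hypotheses, that is, $\theta \in \Theta^*_{K^*}$. If $|K^*| \le m$, then it is certain that gFWER is zero. For $|K^*| > m$, we show the I-reject partitioning test controls gFWER strongly at level $\alpha$ as follows.

$$\begin{aligned}
\text{gFWER} &= \sup_{\theta \in \Theta^*_{K^*}} P_\theta\{V > m\} \\
&= \sup_{\theta \in \Theta^*_{K^*}} P_\theta\{\text{reject strictly more than } m\ H_{0i}\text{'s},\ i \in K^*\} \\
&= \sup_{\theta \in \Theta^*_{K^*}} P_\theta\{I\text{-reject } H_{0i} \text{ under } \Theta_I^*\ \forall I \ni i, \text{ for strictly more than } m\ H_{0i}\text{'s},\ i \in K^*\} \\
&\le \sup_{\theta \in \Theta^*_{K^*}} P_\theta\{I\text{-reject strictly more than } m\ H_{0i}\text{'s under } \Theta^*_{K^*}\} \\
&= \sup_{\theta \in \Theta^*_{K^*}} P_\theta\{T^{K^*}_{[|K^*|-m]} > c_{K^*}\} \\
&\le \alpha.
\end{aligned}$$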

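To illustrate the proof, let $k = 3$, $m = 1$, so the set of null hypotheses is

$$H_{01}: \theta_1 = 0, \qquad H_{02}: \theta_2 = 0, \qquad H_{03}: \theta_3 = 0.$$

Suppose $\theta^* \in \Theta^*_{12}$; then

$$\begin{aligned}
\text{gFWER} &= \sup_{\theta^* \in \Theta^*_{12}} P_{\theta^*}\{\text{reject } H_{01} \text{ and reject } H_{02}\} \\
&= P\{I\text{-reject } H_{01} \text{ in } \Theta^*_{12}, \Theta^*_{13}, \Theta^*_{123} \text{ and } I\text{-reject } H_{02} \text{ in } \Theta^*_{12}, \Theta^*_{23}, \Theta^*_{123}\} \\
&\le P\{I\text{-reject } H_{01} \text{ and } H_{02} \text{ in } \Theta^*_{12}\} \\
&\le \alpha.
\end{aligned}$$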

Appendix B: Proof of Theorem 5.1

Proof. Given the sufficient conditions S1–S3, we have $c_I = c_J$ if $|I| = |J|$, so $c_I$ can be simply denoted as $c_{|I|}$.

1. Step 0. Reject all $H_i$, $i = [k], \ldots, [k-m+1]$; go to step 1.
Proof. We need to show that without comparing $T_i$ with any critical values, we can reject $H_i$, $i = [k], \ldots, [k-m+1]$. If $i \in \{[k], \ldots, [k-m+1]\}$ and $i \in I$, then $H_i \in \{H^I_{[|I|]}, H^I_{[|I|-1]}, \ldots, H^I_{[|I|-m+1]}\}$, so $H_i$ is I-rejected under $\Theta_I^*$, $\forall I \ni i$, so $H_i$ is rejected.

2. Step 1. If $T_{[k-m]} \le c_{k-m}$, stop. Else, reject $H_{[k-m]}$, go to step 2.
Proof. We need to show $T_{[k-m]} > c_{k-m}$ ensures $H_{[k-m]}$ is I-rejected in each $\Theta_I^*$, $\forall I \ni [k-m]$. If $I \not\supseteq \{[k], \ldots, [k-m+1]\}$, then $H_{[k-m]} \in \{H^I_{[|I|]}, \ldots, H^I_{[|I|-m+1]}\}$, so $H_{[k-m]}$ is I-rejected. If $I \supseteq \{[k], \ldots, [k-m+1]\}$, then $T^I_{[|I|-m]} = T_{[k-m]} > c_{k-m}$, so all $H_i$, $i \in I$, are I-rejected.

3. Step 2. If $T_{[k-m-1]} \le c_{k-m-1}$, stop. Else, reject $H_{[k-m-1]}$, go to step 3.
Proof. We need to show $T_{[k-m]} > c_{k-m}$ and $T_{[k-m-1]} > c_{k-m-1}$ ensure $H_{[k-m-1]}$ is I-rejected in each $\Theta_I^*$, $\forall I \ni [k-m-1]$. If $I \supseteq$ any subset of size $m$ of $\{[k], [k-1], \ldots, [k-m]\}$, then $T^I_{[|I|-m]} > c_{k-m-1}$ by the fact $T_{[k-m]} > c_{k-m}$ and $T_{[k-m-1]} > c_{k-m-1}$, so $H_{[k-m-1]}$ is I-rejected. If $I \not\supseteq$ any subset of size $m$ of $\{[k], [k-1], \ldots, [k-m]\}$, then $H_{[k-m-1]} \in \{H^I_{[|I|]}, \ldots, H^I_{[|I|-m+1]}\}$, so $H_{[k-m-1]}$ is I-rejected.

4. Step $(d+1)$. If $T_{[k-m-d]} \le c_{k-m-d}$, stop. Else, reject $H_{[k-m-d]}$, go to step $(d+2)$.
Proof. We need to show $T_{[k-m-i]} > c_{k-m-i}$, $\forall i = 0, 1, \ldots, d$, ensure $H_{[k-m-d]}$ is I-rejected in each $\Theta_I^*$, $\forall I \ni [k-m-d]$. If $I \supseteq$ any subset of size $m$ of $\{[k], [k-1], \ldots, [k-m-(d-1)]\}$, then $T^I_{[|I|-m-d]} > c_{k-m-d}$ by the fact $T_{[k-m-i]} > c_{k-m-i}$, $\forall i = 0, 1, \ldots, d$, so $H_{[k-m-d]}$ is I-rejected. If $I \not\supseteq$ any subset of size $m$ of $\{[k], [k-1], \ldots, [k-m-(d-1)]\}$, then $H_{[k-m-d]} \in \{H^I_{[|I|]}, \ldots, H^I_{[|I|-m+1]}\}$, so $H_{[k-m-d]}$ is I-rejected. $\square$
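The step-down procedure whose validity is argued above can be sketched in a few lines. This is a minimal illustration, assuming the shortcut conditions S1–S3 hold so that the critical value depends only on the number of hypotheses. The function name `gfwer_step_down` and the dictionary `crit` (mapping $j$ to $c_j$) are hypothetical conveniences, and indices are 0-based.

```python
def gfwer_step_down(stats, crit, m):
    """Return indices of rejected hypotheses, in order of rejection.

    stats : sequence of test statistics T_1, ..., T_k (larger = more evidence)
    crit  : mapping from j to the critical value c_j, for j = 1, ..., k - m
    m     : number of false rejections tolerated by gFWER
    """
    k = len(stats)
    # order[0] indexes T_[k] (the largest), order[1] indexes T_[k-1], and so on.
    order = sorted(range(k), key=lambda i: stats[i], reverse=True)
    rejected = order[:m]                 # Step 0: reject the m largest outright
    for d in range(max(k - m, 0)):       # Step d + 1 examines H_[k-m-d]
        i = order[m + d]
        if stats[i] <= crit[k - m - d]:
            break                        # stop at the first non-rejection
        rejected.append(i)
    return rejected
```

For example, with statistics `[1.0, 2.0, 0.2]` and all critical values equal to 2.5, nothing is rejected when `m = 0`, whereas with `m = 1` the hypothesis with the largest statistic is rejected in Step 0 without any comparison.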

Appendix C: Geometry of Partition Testing ($k = 2$)

We provide some geometric insights into step-down testing for both partition testing and generalized partition testing for $k = 2$, when the shortcut conditions S1–S3 are satisfied.

Partition testing for $H_{01}: \theta_1 \le 0$ and $H_{02}: \theta_2 \le 0$ tests the three partitioning hypotheses

$$\begin{aligned}
H_0^{\cap} &: \theta_1 \le 0 \text{ and } \theta_2 \le 0, \\
H^P_{01} &: \theta_1 \le 0 \text{ and } \theta_2 > 0, \\
H^P_{02} &: \theta_2 \le 0 \text{ and } \theta_1 > 0.
\end{aligned}$$

It infers $\theta_1 > 0$ when both $H_0^{\cap}$ and $H^P_{01}$ are rejected, and it infers $\theta_2 > 0$ when both $H_0^{\cap}$ and $H^P_{02}$ are rejected.

For partition testing controlling FWER (i.e., $m = 0$), Figure 3 explains why once $H_0^{\cap}$ is rejected, one can skip testing either $H^P_{01}$ or $H^P_{02}$:


Figure 3 Rejection regions for controlling FWER when $k = 2$. $c_2$ and $c_1$ represent the critical values for the first and second steps in the step-down test. (Axes: $\theta_1$, $\theta_2$; marked regions: reject $H_1$ but not $H_2$, reject $H_2$ but not $H_1$, no rejection; reference lines: Line 1, Line 2.)


Suppose $T_1 > T_2$. Call the vertical line with value equal to $c_1$ on the horizontal axis line 1. In terms of $T_1$, the region to the right of line 1 in Figure 3 is the rejection region for $H^P_{01}$. When $H_0^{\cap}$ is rejected, $(T_1, T_2)$ is outside the dot-shaded region, so with $T_1 > T_2$, $(T_1, T_2)$ must be to the right of line 1, and $H^P_{01}$ can be rejected without it actually being tested. Similarly, when $T_2 > T_1$, once $H_0^{\cap}$ is rejected, $H^P_{02}$ can be rejected without it actually being tested.

Thus, as shown in Figure 3, for FWER-controlling step-down testing, both $H^P_{01}$ and $H^P_{02}$ are rejected if $(T_1, T_2)$ is in the unshaded region, while only $H_{01}$ is rejected if $(T_1, T_2)$ is in the triangle-shaded region, and only $H_{02}$ is rejected if $(T_1, T_2)$ is in the circle-shaded region.

Figure 4 displays the rejection regions for the generalized partitioning test with $k = 2$ and $m = 1$. In Figure 4, $T^{\{1,2\}}_{[2]} = T_2$ and $T^{\{1,2\}}_{[1]} = T_1$ above the 45 degree line, while $T^{\{1,2\}}_{[2]} = T_1$ and $T^{\{1,2\}}_{[1]} = T_2$ below the 45 degree line. Thus, for the region above the 45 degree line, $H^P_{02}$ is always rejected and $H^P_{01}$ is rejected if $T_1 > c = c_1$. For the region below the 45 degree line, $H^P_{01}$ is always rejected and $H^P_{02}$ is rejected if $T_2 > c_1$. Thus, as shown in Figure 4, for gFWER-controlling step-down testing with $m = 1$, both $H^P_{01}$ and $H^P_{02}$ are rejected in the blank region, only $H^P_{01}$ is rejected in the dot-shaded region, and only $H^P_{02}$ is rejected in the dash-shaded region.
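The decision regions described for Figures 3 and 4 can be written out directly for $k = 2$. The sketch below is only an illustration: the function names are hypothetical, the returned labels 1 and 2 stand for rejecting the hypothesis about $\theta_1$ and $\theta_2$ respectively, and ties $T_1 = T_2$ are broken arbitrarily in favor of hypothesis 2.

```python
def fwer_regions_k2(t1, t2, c2, c1):
    """m = 0: step-down test with critical values c2 (first step), c1 (second step)."""
    if max(t1, t2) <= c2:
        return set()                     # intersection hypothesis not rejected
    larger, smaller = (1, 2) if t1 > t2 else (2, 1)
    rejected = {larger}                  # rejected without a separate test
    if min(t1, t2) > c1:
        rejected.add(smaller)
    return rejected


def gfwer_regions_k2(t1, t2, c):
    """m = 1: the hypothesis with the larger statistic is always rejected."""
    larger, smaller = (1, 2) if t1 > t2 else (2, 1)
    rejected = {larger}
    if min(t1, t2) > c:
        rejected.add(smaller)
    return rejected
```

With `c2 = 2.2` and `c1 = 1.96`, the point `(3.0, 1.0)` gives `{1}` under FWER control, while `(1.0, 1.0)` gives no rejection. Under gFWER control with `m = 1` and `c = 1.96`, the point `(1.0, 1.0)` still yields one rejection, which reflects the single false rejection that is tolerated.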

Acknowledgements We thank the reviewers, Martin Porsch, and Frank Bretz for very helpful comments, and Violeta Calian for most insightful discussions. This research is supported by NSF Grant Number DMS-0505519.

References

Berger, R. L. and Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests, and equivalence confidence sets. Statistical Science 11, 283–315.

Casella, G. and Berger, R. L. (1990). Statistical Inference. Wadsworth, Pacific Grove, California.

Finner, H. and Roter, M. (2001). On the False Discovery Rate and expected Type I errors. Biometrical Journal 43, 985–1005.

Finner, H. and Strassburger, K. (2002). The partitioning principle: a powerful tool in multiple decision theory. Annals of Statistics 30, 1194–1213.


Figure 4 Rejection regions for controlling gFWER when $k = 2$, $m = 1$. $c$ represents the critical value. (Axes: $\theta_1$, $\theta_2$; marked regions: reject $H_1$ but not $H_2$, reject $H_2$ but not $H_1$; reference lines: Line 1, Line 2.)


Hommel, G. and Hoffmann, T. (1988). Controlled uncertainty. In Bauer, P. e.a. (ed.) Multiple Hypotheses Testing. Springer, Berlin, 154–161.

Hsu, J. C. (1996). Multiple Comparisons: Theory and Methods. Chapman & Hall, London.

Hsu, J. C. and Berger, R. L. (1999). Stepwise confidence intervals without multiplicity adjustment for dose response and toxicity studies. Journal of the American Statistical Association 94, 468–482.

Hsu, J. C., Chang, J., Wang, T., Steingrímsson, E., Magnússon, M. K., and Bergsteinsdottir, K. (2006). Statistically designing microarrays and microarray experiments to enhance sensitivity and specificity. Briefings in Bioinformatics bbl023.

Huang, Y. and Hsu, J. C. (2006). Hochberg's step-up method: Cutting corners off Holm's step-down method. Technical Report 761, Department of Statistics, The Ohio State University.

Huang, Y., Xu, H., Calian, V., and Hsu, J. C. (2006). To permute or to not permute. Bioinformatics 22, 2244–2248.

Lehmann, E. L. and Romano, J. (2005). Generalizations of the familywise error rate. Annals of Statistics 33, 1138–1154.

Lin, S., Rogers, J., and Hsu, J. C. (2001). A confidence set approach for finding tightly linked genomic regions. American Journal of Human Genetics 68, 1219–1228.

Marcus, R., Peritz, E., and Gabriel, K. R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660.

Pollard, K. S. and van der Laan, M. J. (2005). Resampling-based multiple testing: Asymptotic control of type I error and applications to gene expression data. Journal of Statistical Planning and Inference 125, 85–100.

Stefansson, G., Kim, W., and Hsu, J. C. (1988). On confidence sets in multiple comparisons. In Gupta, S. S. and Berger, J. O. (eds.) Statistical Decision Theory and Related Topics IV, Vol. 2, 89–104. Springer-Verlag, New York.

van der Laan, M. J., Dudoit, S., and Pollard, K. S. (2004). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology 3(1).

van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.