# Applying the Generalized Partitioning Principle to Control the Generalized Familywise Error Rate

• Published on
06-Jun-2016


Haiyan Xu¹ and Jason C. Hsu*,²

¹ E22504, Department of Clinical Biostatistics, Johnson & Johnson PRD, L.L.C., USA

² Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, Ohio 43210, USA

Received 14 February 2006, revised 17 July 2006, accepted 31 August 2006

Summary

In multiple testing, strong control of the familywise error rate (FWER) may be unnecessarily stringent in some situations such as bioinformatic studies. An alternative approach, discussed by Hommel and Hoffmann (1988) and Lehmann and Romano (2005), is to control the generalized familywise error rate (gFWER), the probability of incorrectly rejecting more than m hypotheses. This article presents the generalized Partitioning Principle as a systematic technique of constructing gFWER-controlling tests that can take the joint distribution of test statistics into account. The paper is structured as follows. We first review the classical partitioning principle, indicating its conditioning nature. Then the generalized partitioning principle is introduced, with a set of sufficient conditions that allows it to be executed as a computationally more feasible step-down test. Finally, we show the importance of having some knowledge of the distribution of the observations in multiple testing. In particular, we show that step-down permutation tests require an assumption on the joint distribution of the observations in order to control the familywise error rate.

Key words: Generalized familywise error rate; Generalized Partitioning Principle; Marginals Determine the Joint condition; Permutation tests; Step-down tests.

1 Error Rate in Testing a Single Null Hypothesis

Testing a statistical hypothesis is often part of a decision-making process. In such a setting, error rate should be defined so that it reflects the probability that the testing procedure causes an incorrect decision.

Consider the analysis of gene expression levels from high and low risk patients. One common purpose of such an analysis is to discover differentially expressed genes for intense further study. For this purpose, it is perhaps appropriate to define "differentially expressed" as "not identically distributed". Another common purpose of such an analysis is to train a prognostic algorithm for predicting risk. Since training of classification algorithms in supervised machine learning proceeds by computing distances between subjects in high and low risk groups, the genes included in the training set should be those with large (standardized) mean distances between the two groups. For this purpose, it is more useful to define "differentially expressed" as "not equal in expression level on average".

Suppose k genes are probed in comparing expression profiles between low risk and high risk groups. Let $F_{Li}, F_{Hi}$, $i = 1, \ldots, k$, denote the (marginal) distributions of the expression level of the i-th gene of a randomly sampled patient from the low and high risk group respectively. Let $\mu_{Li}, \mu_{Hi}$, $i = 1, \ldots, k$, denote the (logarithms of) expected expression level of the i-th gene of a randomly sampled patient from the low and high risk group respectively. Let $\theta_i$ denote the difference of the expected (logarithms of) expression levels of the i-th gene between the two groups, $\theta_i = \mu_{Hi} - \mu_{Li}$.

* Corresponding author: e-mail: Hsu.1@osu.edu, Phone: 01 614 292 7663; Fax: 01 614 292 2096

Biometrical Journal 49 (2007) 1, 52–67. DOI: 10.1002/bimj.200610307

© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Then, in the discovery situation, the appropriate (marginal) null hypotheses are

$$H^{F}_{0i}: F_{Hi} = F_{Li} \qquad (1)$$

for $i = 1, \ldots, k$; while in the prognostic situation the appropriate (marginal) null hypotheses are

$$H^{\theta}_{0i}: \theta_i = 0 \qquad (2)$$

for $i = 1, \ldots, k$.

For the null hypothesis $H^{F}_{0i}$, the definition of Type I error rate is unambiguous. But the null hypothesis $H^{\theta}_{0i}: \theta_i = 0$ is composite in two ways. First, in contrast to $H^{F}_{0i}$, so long as the expectations are the same, it allows for the possibility that the variances, skewness, and other aspects of the distributions are different. Second, the null hypothesis $H^{\theta}_{0i}: \theta_i = 0$ leaves unspecified the values of $\theta_j$, $j = 1, \ldots, k$, $j \neq i$.

Type I error of testing $H^{\theta}_{0i}$ can thus be defined in different ways. Let $\theta = (\theta_1, \ldots, \theta_k)$, and let $\Sigma$ denote generically all nuisance parameters that the observed expression levels depend on (including, for example, covariance of the expression levels for each of the two groups). Let $\theta^0 = (\theta^0_1, \ldots, \theta^0_k)$ and $\Sigma^0$ be the collection of all (unknown) true parameter values. One definition of Type I error rate, given in Pollard and van der Laan (2005) for example, is

$$P_{\theta^0, \Sigma^0}\{\text{Reject } H_{0i}\}, \qquad (3)$$

where $\theta^0_i = 0$. In the analysis of gene expression levels, for example, if $H_{0i}$ is the null hypothesis that the i-th gene is not differentially expressed, then the values of expression levels for the other genes at which this probability $P_{\theta^0, \Sigma^0}\{\text{Reject } H_{0i}\}$ is computed are the true expression levels for the groups being compared, and the joint distributions under which they are computed are the true distributions for the groups being compared. But since $\theta^0_j$, $j \neq i$, and $\Sigma^0$ are unknown, this probability is difficult to compute directly. Therefore, the traditional definition of Type I error rate, Definition 8.3.3 on p. 361 of Casella and Berger (1990), is

$$\sup_{\theta_i = 0} P_{\theta, \Sigma}\{\text{Reject } H_{0i}\}, \qquad (4)$$

where the supremum is taken over all possible $\theta$ and $\Sigma$ subject to $\theta_i = 0$. In the analysis of gene expression levels, for example, if $H_{0i}$ is the null hypothesis that the i-th gene is not differentially expressed, then the supremum is taken over all possible true expression levels of the other genes, as well as all possible joint distributions of expression levels of all genes (including all possible correlations). The definition we use is this traditional definition of Type I error rate, as it is more readily verifiable in practice.

In practice, whether the hypothesis being tested is $H^{F}_{0i}$ or $H^{\theta}_{0i}$ affects whether the error rate is controlled, and therefore should be clearly stated. For example, as shown in Pollard and van der Laan (2005) and Huang, Xu, Calian, and Hsu (2006), permutation tests control the error rate only for testing $H^{F}_{0i}$ but not $H^{\theta}_{0i}$.

2 Error Rates in Multiple Testing

Let $\{H_{0i}: i = 1, \ldots, k\}$ be a family of null hypotheses to be tested. In microarray studies, $H_{0i}$ is the null hypothesis that the i-th gene is not differentially expressed between high and low risk patients.

Let V denote the number of incorrectly rejected true null hypotheses. Strong control of the familywise error rate (FWER) keeps the probability of rejecting any true null hypothesis at a pre-specified low level,

$$\mathrm{FWER} = \sup_{\theta, \Sigma} P_{\theta, \Sigma}\{V > 0\} \qquad (5)$$

where the supremum is taken over the entire parameter space, to account for all possible values of parameters corresponding to false null hypotheses, as well as values of nuisance parameters.


In multiple testing for differential expressions, the consequence of errors in testing the individual hypotheses depends on whether the purpose is to understand the biological disease mechanism, or to select genes to train a prognostic algorithm toward ultimately making a prognostics device. In the former case, every mistake costs the investigator some time and effort. In the latter, what matters more is that the algorithm has enough sensitivity/specificity, and that the number of genes is small enough for the device to be reasonably priced. In either case, controlling FWER strongly may be unnecessarily stringent. Controlling the generalized familywise error rate (gFWER)

$$\mathrm{gFWER} = \sup_{\theta, \Sigma} P_{\theta, \Sigma}\{V > m\} \qquad (6)$$

(where again the supremum is taken over the entire parameter space) keeps the probability of incorrectly rejecting more than a fixed number m of true null hypotheses at a low level.

One approach to controlling gFWER is to use a method that controls the familywise error rate (i.e., controlling gFWER at m = 0), and then reject an additional m hypotheses. Indeed, the augmentation test of van der Laan, Dudoit, and Pollard (2004) rejects the null hypotheses that would be rejected by an FWER-controlling method, and then augments the rejections by automatically rejecting the null hypotheses associated with the next m extreme test statistics.

We take a different approach in this article, generalizing the Partitioning Principle of Stefansson, Kim, and Hsu (1988) and Finner and Strassburger (2002) to provide a technique for constructing multiple tests that control gFWER. Our approach not only shows how to take the joint distribution of the test statistics into consideration in controlling gFWER, but also shows the conditioning nature of data-dependent thresholding by step-wise methods.

In the remaining part of this article, $\theta_i$, $i = 1, \ldots, k$, are univariate parameters of interest, and the nuisance parameter $\Sigma$ is suppressed notationally.
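The augmentation approach described earlier in this section can be illustrated with a short sketch. It is only an illustration under stated assumptions: Holm's step-down method is used here as the FWER-controlling first stage, p-values stand in for the "extremeness" of the test statistics, and the function names `holm_rejections` and `augmented_rejections` are hypothetical, not taken from the cited papers.

```python
def holm_rejections(pvalues, alpha):
    """Indices rejected by Holm's step-down method at FWER level alpha."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    rejected = []
    for step, i in enumerate(order):
        # Compare the (step + 1)-th smallest p-value with alpha / (k - step).
        if pvalues[i] <= alpha / (k - step):
            rejected.append(i)
        else:
            break  # step-down: stop at the first non-rejection
    return rejected


def augmented_rejections(pvalues, alpha, m):
    """First-stage Holm rejections plus the next m smallest p-values."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    base = holm_rejections(pvalues, alpha)
    # Holm rejects a prefix of the sorted order, so augmentation
    # simply extends that prefix by m (capped at k).
    n_reject = min(k, len(base) + m)
    return sorted(order[:n_reject])


pvals = [0.001, 0.2, 0.03, 0.04, 0.5]
print(augmented_rejections(pvals, alpha=0.05, m=0))
print(augmented_rejections(pvals, alpha=0.05, m=1))
```

With m = 0 the sketch reduces to the first-stage method alone; with m = 1 it additionally rejects the hypothesis with the next smallest p-value, whatever that p-value is.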

3 Partitioning is Conditioning

A principle of multiple testing which is conditional in nature is the Partitioning Principle of Stefansson, Kim, and Hsu (1988), which was further refined by Finner and Strassburger (2002). Holm's step-down method and Hochberg's (1988) step-up method, each comparing p-values against a data-dependent threshold, can be thought of as partition testing, obtained by applying the Bonferroni inequality and a modification of Simes' (1986) inequality respectively to component partitioning tests (see Huang and Hsu, 2005). The step-down version of Dunnett's (1955) method in Marcus, Peritz, and Gabriel (1976) can be thought of as partition testing taking into account the multivariate t joint distribution of the test statistics (see Stefansson, Kim, and Hsu, 1988). The idea of partition testing is that if, in a subspace of the parameter space, only a subset of the null hypotheses are true, then tests for parameter points in that subspace need to be adjusted for the multiplicity of only those true null hypotheses. Partition testing thus applies a different test to each subspace of the parameter space depending on how many null hypotheses are true in that subspace, and then collates the results across the subspaces, as follows.

Let $K = \{1, \ldots, k\}$ and consider testing $H_{0i}: \theta \in \Theta_i$, $i = 1, \ldots, k$. To control FWER (gFWER with m = 0), the Partitioning Principle states:

1. Partition $\bigcup_{i \in K} \Theta_i$ into disjoint $\Theta^*_I$, $I \subseteq K$, as follows. For each $I \subseteq K$, let $\Theta^*_I = \left(\bigcap_{i \in I} \Theta_i\right) \cap \left(\bigcap_{j \notin I} \Theta^c_j\right)$. Then $\{\Theta^*_I, I \subseteq K\}$, including $\Theta^*_\emptyset = \bigcap_{j \in K} \Theta^c_j$, partition the parameter space. Note that $\Theta^*_I$ can be interpreted as the part of the parameter space in which exactly $H_{0i}$, $i \in I$, are true, and $H_{0j}$, $j \notin I$, are false.

2. Test each $H^P_{0I}: \theta \in \Theta^*_I$ at level $\alpha$. Since the null hypotheses are disjoint, at most one null hypothesis is true. Therefore, even though no multiplicity adjustment to the levels of the tests is made, the probability of rejecting at least one true null hypothesis is at most $\alpha$.

3. For all $I \subseteq K$, infer $\theta \notin \Theta_I$ if all $H^P_{0J}$ such that $I \subseteq J$ are rejected. That is, infer the intersection null hypothesis $H_{0I}$ is false if all null hypotheses $H^P_{0J}$ with $I \subseteq J$ are rejected. In terms of the scientific inference of rejecting the original null hypotheses $H_{0i}$, $i \in K$, since $\Theta_i = \bigcup_{I \ni i} \Theta^*_I$, the partition test rejects $H_{0i}$ if all $H^P_{0I}$ such that $I \ni i$ are rejected.
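The collation rule in step 3, together with the remark that Holm's step-down method arises from Bonferroni component tests, can be checked by brute force for small k. The sketch below assumes each $H^P_{0I}$ is tested by the Bonferroni rule "reject if the smallest p-value in I is at most $\alpha/|I|$"; the function names are hypothetical, and the enumeration over all non-empty subsets is exponential in k, so this is meant for illustration only.

```python
from itertools import combinations


def partition_test_bonferroni(pvalues, alpha):
    """Reject H_0i only if every H^P_0I with i in I is rejected."""
    k = len(pvalues)
    rejected = set(range(k))
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            # Bonferroni test of H^P_0I: reject when min p-value <= alpha / |I|.
            if min(pvalues[i] for i in subset) > alpha / size:
                # H^P_0I is accepted, so no H_0i with i in I can be rejected.
                rejected -= set(subset)
    return sorted(rejected)


def holm(pvalues, alpha):
    """Holm's step-down method, written as the usual shortcut."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    rejected = []
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (k - step):
            rejected.append(i)
        else:
            break
    return sorted(rejected)


for pv in ([0.01, 0.04, 0.03], [0.01, 0.02, 0.03], [0.2, 0.3, 0.4]):
    print(pv, partition_test_bonferroni(pv, 0.05), holm(pv, 0.05))
```

On these inputs the two functions return the same rejection sets, which is the sense in which the step-down shortcut replaces testing all $2^k - 1$ non-empty partition hypotheses.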

For example, suppose the desired inferences are $\theta_1 > 0$, or $\theta_2 > 0$, or both. With the alternative hypotheses being $H_{a1}: \theta_1 > 0$ and $H_{a2}: \theta_2 > 0$, the proper null hypotheses are $H_{01}: \theta_1 \le 0$ and $H_{02}: \theta_2 \le 0$. Partitioning forms the hypotheses

$$H^{\cap}_{0}: \theta_1 \le 0 \text{ and } \theta_2 \le 0$$
$$H^{P}_{01}: \theta_1 \le 0 \text{ and } \theta_2 > 0$$
$$H^{P}_{02}: \theta_1 > 0 \text{ and } \theta_2 \le 0$$

and tests each at level $\alpha$. Since the three hypotheses are disjoint, at most one null hypothesis is true. Therefore, even though no multiplicity adjustments to the levels of the tests are made, the probability of rejecting at least one true null hypothesis is at most $\alpha$. Applying the Partitioning Principle, statistical inference proceeds as follows:

If $H^{\cap}_{0}$ is accepted, no inference is given.

If only $H^{\cap}_{0}$ is rejected, but neither $H^{P}_{01}$ nor $H^{P}_{02}$, the inference is that at least one of $H_{a1}: \theta_1 > 0$ or $H_{a2}: \theta_2 > 0$ is true, but which one cannot be specified.

If $H^{\cap}_{0}$ and $H^{P}_{01}$ are rejected but not $H^{P}_{02}$, the inference is $\theta_1 > 0$.

If $H^{\cap}_{0}$ and $H^{P}_{02}$ are rejected but not $H^{P}_{01}$, the inference is $\theta_2 > 0$.

If $H^{\cap}_{0}$, $H^{P}_{01}$, and $H^{P}_{02}$ are all rejected, the inference is both $\theta_1 > 0$ and $\theta_2 > 0$.
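The five cases above amount to a small decision table. As a sketch, the function below encodes them directly; the labels "both", "only1", and "only2" are hypothetical shorthand for the partition hypotheses in which both nulls are true, only $H_{01}$ is true, and only $H_{02}$ is true, respectively.

```python
def partition_inference(rejected):
    """Map the set of rejected partition hypotheses (k = 2) to an inference.

    `rejected` is a set of labels drawn from {"both", "only1", "only2"}.
    """
    if "both" not in rejected:
        # The piece where both nulls are true is accepted: nothing can be inferred.
        return "no inference"
    # theta1 > 0 is inferred when every piece with theta1 <= 0 is rejected,
    # i.e. "both" and "only1"; similarly for theta2 with "both" and "only2".
    infer1 = "only1" in rejected
    infer2 = "only2" in rejected
    if infer1 and infer2:
        return "theta1 > 0 and theta2 > 0"
    if infer1:
        return "theta1 > 0"
    if infer2:
        return "theta2 > 0"
    return "theta1 > 0 or theta2 > 0 (which one unspecified)"


print(partition_inference({"both", "only1"}))
```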

Partitioning inference is conditional in the sense that it uses the data to decide in which subspaces the true parameter may lie, and bases its inferences on tests appropriate for those subspaces. The Partitioning Principle has been successfully applied to derive practically useful methods for bioequivalence (Berger and Hsu, 1996), dose-response studies (Hsu and Berger, 1999), and genetic linkage analysis (Lin, Rogers, and Hsu, 2001).

The next section describes how the partitioning principle can be generalized to systematically construct methods controlling gFWER. A key difference between classical partitioning and generalized partitioning is that generalized partition testing must test each individual $H_{0i}$, $i \in K$, in each $\Theta^*_I$.

4 The Generalized Partitioning Principle

The generalized Partitioning Principle states:

P1. Partition the parameter space $\bigcup_{i \in K} \Theta_i$ into disjoint sets: for each $I \subseteq K = \{1, \ldots, k\}$, let $\Theta^*_I = \left(\bigcap_{i \in I} \Theta_i\right) \cap \left(\bigcap_{j \notin I} \Theta^c_j\right)$.

P2. In each $\Theta^*_I$, reject all $H_{0i}$, $i \notin I$, and test $\{H_{0i}: \theta_i \in \Theta_i, i \in I\}$ at gFWER level $\alpha$, controlling $\sup_{\theta \in \Theta^*_I} P_\theta(V > m) \le \alpha$.

P3. For $i \in I$, define $H_{0i}$ to be I-rejected to mean $H_{0i}$ is rejected in $\Theta^*_I$. Reject $H_{0i}$ if $H_{0i}$ is I-rejected for all $I$, $I \subseteq K$.

The generalized partitioning principle first partitions the parameter space into disjoint sets, so that exactly one $\Theta^*_I$ contains the true parameter $\theta^*$. It then controls gFWER by controlling gFWER within each $\Theta^*_I$.
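To make step P2 concrete, the following sketch computes a per-hypothesis level within a single $\Theta^*_I$. It rests on assumptions that are not part of the principle itself: the test statistics are independent, each true null is rejected with probability exactly c (the least favorable case), and a single common level c is used for all $n = |I|$ hypotheses, so that V follows a Binomial(n, c) distribution. The function names are hypothetical, and this is one possible local test, not the procedure developed in the paper.

```python
from math import comb


def prob_more_than_m(n, c, m):
    """P(V > m) for V ~ Binomial(n, c): n independent true nulls, each rejected w.p. c."""
    return sum(comb(n, v) * c**v * (1 - c) ** (n - v) for v in range(m + 1, n + 1))


def per_test_level(n, m, alpha, tol=1e-12):
    """Largest common level c (up to tol) with P(V > m) <= alpha."""
    if n <= m:
        return 1.0  # at most n <= m false rejections are possible, so reject freely
    lo, hi = 0.0, 1.0
    # Bisection: P(V > m) is increasing in c, and lo always satisfies the constraint.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_more_than_m(n, mid, m) <= alpha:
            lo = mid
        else:
            hi = mid
    return lo


alpha = 0.05
for n in (1, 2, 3):
    print(n, per_test_level(n, m=1, alpha=alpha))
```

For n = 2 and m = 1 the constraint is $c^2 \le \alpha$, so the computed level agrees with $\sqrt{\alpha}$; for m = 0 the same function returns the Šidák level $1 - (1 - \alpha)^{1/n}$.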


Consider, for example, k = 3, m = 1, $\Theta_i = \{\theta_i \le 0\}$. Then there are $2^3 = 8$ sets $\Theta^*_I$. Each $H_{0i}: \theta_i \le 0$ is tested eight times, once in each $\Theta^*_I$, and is ultimately rejected only if it is rejected in each $\Theta^*_I$. Within each $\Theta^*_I$, $H_{0i}$ for $\theta_i > 0$ are automatically rejected, while $H_{0i}$ for $\theta_i \le 0$ are tested with multiplicity adjustments in such a way that gFWER is controlled at level $\alpha$ within $\Theta^*_I$. For $I = \{2, 3\}$ say, $H_{01}$ is automatically rejected while $H_{02}$ and $H_{03}$ are tested so that the probability of rejecting both is no more than $\alpha$. If the test statistics are independent, then each can be tested at level $\sqrt{\alpha}$ within $\Theta^*_{\{2,3\}}$. (Traditional partition testing controlling FWER, or equivalently gFWER with m = 0, would test $H_{02}$ and $H_{03}$ with...