The Quality Preserving Database: A Computational Framework for Encouraging Collaboration, Enhancing Power and Controlling False Discovery

Ehud Aharoni, Hani Neuvirth, and Saharon Rosset

E. Aharoni and H. Neuvirth are with the Machine Learning and Data Mining group, IBM Research Laboratory in Haifa, Haifa University Campus, Mount Carmel, Haifa 31905, Israel. E-mail: {aehud, hani}@il.ibm.com. S. Rosset is with the School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel. E-mail: [email protected].

Manuscript received 1 July 2009; revised 25 Feb. 2010; accepted 21 July 2010; published online 18 Oct. 2010. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2009-07-0115. Digital Object Identifier no. 10.1109/TCBB.2010.105.

Abstract—The common scenario in computational biology in which a community of researchers conduct multiple statistical tests on one shared database gives rise to the multiple hypothesis testing problem. Conventional procedures for solving this problem control the probability of false discovery by sacrificing some of the power of the tests. We suggest a scheme for controlling false discovery without any power loss by adding new samples for each use of the database and charging the user with the expenses. The crux of the scheme is a carefully crafted pricing system that fairly prices different user requests based on their demands while keeping the probability of false discovery bounded. We demonstrate this idea in the context of HIV treatment research, where multiple researchers conduct tests on a repository of HIV samples.

Index Terms—Family-wise error rate, multiple comparisons, Bonferroni method.

1 INTRODUCTION

SCIENCE in the era of the Internet is characterized by enhanced collaboration between mutual-interest research groups around the world. In the case of applicative computational research this is carried out through publicly available databases and prediction web servers.

Examples of such databases can be found in various bioinformatics subdomains. The HIVdb is maintained by Stanford University [1] for the community of anti-HIV treatment researchers. The WTCCC organization collects large-scale data for whole-genome association studies, providing them to selected research groups [2]. In addition to these large, well-established databases, many labs publish their own data sets on the web for the benefit of the scientific community.

Recently, a step forward was suggested by ProMateus [3], a collaborative web server for automatic feature selection for protein binding-site prediction. Such collaborative systems and publicly available databases raise statistical issues that are addressed by this paper.

Publicly available databases serve a community of researchers. When a researcher in the community comes up with a new hypothesis, she can test it on the database. Consider Stanford's HIVdb example. It contains samples of genetic sequences of viruses extracted from HIV patients, together with their level of resistance to each of the anti-HIV drugs. These resistance levels are estimated using costly in vitro measurements. A researcher may propose a hypothesis that a certain mutation, or a combination of mutations, is associated with resistance to a specific drug. A statistical test can be designed to test this hypothesis. The researcher then simply downloads the relevant HIVdb data from the Internet and executes this test (this is a purely statistical test; no experimentation is needed).

Naturally, such a setting suffers from the well-known multiple testing problem: the more hypotheses are tested, the more type-I errors (false discoveries) are expected to occur. This problem is related to the much-maligned publication bias phenomenon [4]. Standard ways of coping with this problem are the Bonferroni correction [5], procedures for controlling the False Discovery Rate [6], and others. All of these approaches require lowering the significance level of the tests so that a bound can be established on some measure of the overall type-I error. The shortcoming of these approaches is that they result in lower power of the tests (lower power means a higher chance of a type-II error, where a "true" discovery is not confirmed). The need to control false discovery across the research community without sacrificing too much power is a widespread problem [7], [8].

In this paper, we propose a novel approach, which we term a "Quality Preserving Database" (QPD), aimed at encouraging collaboration while maintaining the quality of research results by addressing multiple testing issues. Our approach does not require specifying in advance the total number of statistical tests that will be executed, as in the Bonferroni correction and other procedures for controlling false discovery, but assumes an infinite series of tests will be sequentially performed on the database. It further allows each researcher an almost free hand in choosing the characteristics of his or her statistical tests, including power, irrespective of the characteristics of other tests before or after in the series. In addition, our approach maintains control over the expected number of type-I errors, as in the Bonferroni correction.

Central to our approach is the idea of compensation. Imagine a certain researcher A would like to perform a single-tail z-test with known standard deviation σ = 10, significance level 0.05, and power 0.9. Say the database contains 857 samples, which is exactly enough to perform this test.¹ However, another researcher B has just completed a different test on the same database with significance level 0.001 and power 0.44. In order to control type-I errors, the database manager tells researcher A to reduce the significance level of his test to 0.049. Ordinarily this would require researcher A to reduce the power to some value less than 0.9, which is undesirable. Since researcher A's problem is due to researcher B's test, researcher B is asked to compensate by adding six more samples to the database. 863 samples are exactly enough to allow researcher A to execute his test with significance level 0.049 and the original power 0.9.

Of course, requiring actual new samples from a researcher is not always practical. Instead, the required number of samples may be translated to currency. This amount will be the cost researcher B is charged for executing the test, and it will be used for purchasing six new samples from some third party.

¹ See Section 2.1 for an overview of the related nomenclature. A z-test is a hypothesis test about the mean of a normal distribution. Given the requirements for significance level and power, one may compute the necessary number of samples, 857 in our case.
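
To make the arithmetic of this example concrete, the following sketch recomputes the required sample sizes for a one-sided z-test. The effect size δ = 1 is our own assumption (footnote 1 only states that the sample size follows from the chosen level and power), and the normal quantiles come from scipy.

```python
from math import ceil
from scipy.stats import norm

def z_test_sample_size(alpha: float, power: float, sigma: float, delta: float) -> int:
    """Samples needed for a one-sided z-test with known standard deviation sigma.

    Rejecting H0: mu = mu0 in favor of H1: mu > mu0 with level alpha and the
    requested power at effect size delta requires
        n >= ((z_{1-alpha} + z_{power}) * sigma / delta)^2.
    """
    z_alpha = norm.ppf(1.0 - alpha)  # critical value matching the level
    z_power = norm.ppf(power)        # quantile matching the power requirement
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)

# Researcher A's original request: level 0.05, power 0.9, sigma = 10, delta = 1 (assumed).
print(z_test_sample_size(alpha=0.05, power=0.9, sigma=10.0, delta=1.0))   # -> 857
# After the level is reduced to 0.049, six additional samples are required.
print(z_test_sample_size(alpha=0.049, power=0.9, sigma=10.0, delta=1.0))  # -> 863
```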

In the simplistic example just mentioned, the compensation requested for the first statistical test allows for a single additional test. Our goal, however, is to create a compensation system in which an infinite series of tests can be executed while controlling type-I errors. Furthermore, we require that the amount of compensation required for a test not be influenced by the characteristics of other tests in the series. Thus, we define two requirements that such a system should fulfill: stability and fairness. Stability means that the cost charged for a particular test should always be sufficient to compensate for all possible future tests; in other words, it should not adversely affect the costs of future tests. The fairness property requires that users who request more difficult tests (e.g., with higher power demands) be assigned higher costs. We define these requirements more formally below.

We show in this paper two methods that satisfy both stability and fairness; thus both are capable of servicing an infinite series of diverse statistical tests. Furthermore, we prove that one gains more (in terms of testing power) by using the data one has as "payment" for gaining access to a central database than by keeping it to oneself. Thus, our results encourage sound collaboration between researchers as a means of simultaneously controlling publication bias and increasing the likelihood of making new discoveries.

In this paper we provide a formal definition of the QPD and discuss its implications in practice. Finally, to demonstrate a concrete possible application, we elaborate on the HIV treatment research example.

2 FORMAL DEFINITION OF THE QUALITY PRESERVING DATABASE

2.1 Hypothesis Testing Nomenclature

This paper deals with testing of hypotheses about unknown parameters. Let S be some unknown parameter (which can be, for example, the mean of some probability distribution). The "common wisdom" null hypothesis H0 may state that S has some particular value, S = s. A researcher would like to prove the opposite, the "novel" hypothesis H1, i.e., that S ≠ s. An effect size δ must be defined, which determines how large a deviation from H0 is considered interesting from a scientific point of view. (It is also possible that the claims are directed, i.e., that H0 claims that S ≤ s and H1 claims that S > s.)

The test itself is conducted by collecting n samples and calculating from them some statistic Ŝ. If Ŝ falls within a certain predefined range, called the rejection region, then H0 is rejected and H1 is considered proven. Otherwise, H0 is accepted by default. If S is the mean of a probability distribution, then Ŝ would usually be the empirical mean of samples drawn from this distribution, and the rejection region would typically be (−∞, s − z] ∪ [s + z, ∞) for some threshold z.

Since the test is conducted on random samples, a probability of an erroneous conclusion exists. This probability can be bounded by careful analysis of the test at hand and by making some assumptions on the distribution of the samples. In fact, two types of errors can occur. The first type, termed type-I error or false discovery, is when H0 is erroneously rejected. The level of the test is a bound on this type of error, i.e., a level p test guarantees that a type-I error has at most probability p of occurring. When multiple tests are conducted, their type-I errors accumulate. Bonferroni-type procedures may be employed to control the family-wise error (FWE) [5], the probability of any type-I error, by controlling the sum of the levels of all tests. More sophisticated procedures exist for controlling the false discovery rate (FDR) [6], the percentage of discoveries that are false, thus allowing rejection of more null hypotheses. Another measure is called the power of the test. It measures the probability of correctly rejecting H0, i.e., if a test has power β and |S − s| ≥ δ, then H0 will be rejected with probability at least β. The probability of a type-II error, where H0 is not rejected when it should be, is bounded by 1 − power.

Once the desired effect size, level, and power of a test have been established, and given assumptions on the distribution of the data samples, the number of samples n required for this test and the rejection region of the statistic can be calculated. In the approach presented in this paper, the responsibility for establishing these parameters is distributed between the researcher and the database manager. The researcher, who is mainly interested in new discoveries, chooses the effect size and the power. The database manager, who is entrusted with the reliability of the results, then allocates a level for this test and makes sure the database contains enough samples to support all these parameters.
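
To illustrate how, once the power and effect size are fixed, the level of a test becomes a function of the number of samples, the sketch below computes such a "level-sample" function for the one-sided z-test of the Introduction. The closed form L(n) = 1 − Φ(√n·δ/σ − Φ⁻¹(β)) is a standard consequence of the normal model and is our own illustration rather than a formula taken from the paper.

```python
from scipy.stats import norm

def level_sample_z(n: int, power: float, delta: float, sigma: float) -> float:
    """Level L(n) of a one-sided z-test whose rejection threshold is placed so that
    the requested power is attained exactly at effect size delta (known sigma).

        L(n) = 1 - Phi(sqrt(n) * delta / sigma - Phi^{-1}(power))
    """
    return norm.sf((n ** 0.5) * delta / sigma - norm.ppf(power))

# With power 0.9, sigma = 10, and an assumed effect size delta = 1, the 857 samples of
# the Introduction's example support a level of about 0.05, and 863 samples about 0.049.
print(level_sample_z(857, power=0.9, delta=1.0, sigma=10.0))  # ~0.050
print(level_sample_z(863, power=0.9, delta=1.0, sigma=10.0))  # ~0.049
```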

2.2 Definition of QPD

An instance of a "Quality Preserving Database" contains samples of a certain phenomenon. The database sequentially serves user requests for testing hypotheses. For each test, the user is required to pay a certain amount c of new database samples. In this section we assume integer costs; however, the methods described here can be easily extended to support fractions. The exact process of serving a request is as follows (a schematic sketch of this workflow is given after the list):

1. The user files a request. This request contains: the statistic used to determine the test, the desired power and effect size, and the type of probability distribution the data are assumed to have. It is important to emphasize that the request does not contain a desired level. Rather, the level of the test will be allocated by the database manager based on the strategy for managing type-I errors, as explained below.

2. The database manager uses the details in the request to compute what we term the "level-sample" function L(·), which defines the level of the test as a function of the number of samples n. (Note that given the details of the request, the level of the test is completely determined by the number of samples.)

3. The database manager uses L(·) and the database's pricing system to calculate the cost c, which is reported back to the user. The concrete pricing systems are defined in the sections below. This process also produces the actual level L which will be allocated for this test. This detail, however, does not concern the user (as control of type-I errors is the responsibility of the database manager).

4. Steps 1-3 can be repeated until the user is satisfied with the cost c offered to her and she pays the agreed amount.

5. Now all the details of the test are known, including the number of samples that will be added to the database and the level L allocated for it. At this stage the rejection region of the statistic can be computed and the test can be defined as a formal procedure.
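
The following minimal sketch (our own illustration, not code from the paper) maps these five steps onto a small class; the pricing function passed in stands for whichever pricing system of Sections 2.3 or 2.4 the database is configured with, and all names are hypothetical.

```python
from typing import Callable, Tuple

LevelSampleFn = Callable[[int], float]  # maps a database size n to the level L(n)

class QPDManager:
    """Schematic request-serving loop of a Quality Preserving Database."""

    def __init__(self, n_samples: int,
                 pricing: Callable[[LevelSampleFn, int], Tuple[int, float]]):
        self.n = n_samples      # current number of samples in the database
        self.pricing = pricing  # returns (cost c, allocated level L) for a request

    def quote(self, level_fn: LevelSampleFn) -> Tuple[int, float]:
        # Steps 2-3: derive L(.) from the request details and price it.
        return self.pricing(level_fn, self.n)

    def pay_and_run(self, level_fn: LevelSampleFn,
                    run_test: Callable[[float, int], bool]) -> bool:
        # Steps 4-5: the user accepts the quote, the purchased samples are added,
        # and the test is executed at the allocated level on the enlarged database.
        cost, level = self.quote(level_fn)
        self.n += cost
        return run_test(level, self.n)
```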

We propose two versions of how the tests are actually executed. In one version, which we term persistent, the test is executed immediately after step 5 above. The result, either "accept" or "reject," is returned to the user for her to publish if she so wishes. We term the second version volatile. In this version, the user's test is re-executed upon every addition of new samples to the database. Since the number of samples keeps changing, the levels allocated for the tests also change, and their rejection regions are updated accordingly. Hence all the results are indeed volatile: a hypothesis that passed the test at a certain point in time may fail it at another time, and vice versa. However, this version requires lower costs, as will be demonstrated below.

The following sections define the pricing systems of the persistent and the volatile versions. We require two properties of a pricing system. The first property, which we term fairness, states that if one request has level-sample function L_a(·), a second request has level-sample function L_b(·), and ∀n: L_a(n) < L_b(n), then at any particular point in time the first request will be assigned a lower cost than the second request. Plainly put, if a more difficult statistical test is requested, it will cost more.

The second property we term stability. It states that the cost assigned to a particular request a can be bounded by some amount c_a. The stability property prevents the undesirable scenario in which the cost of a request diverges with time. As shown in Section 3, the proposed schemes actually yield costs that decay with time. This certainly satisfies the stability property. Furthermore, we find this an agreeable behavior, since users who delay their requests risk losing the novelty of their hypotheses, but are compensated with lower costs.

Both of our pricing systems aim at bounding the expected number of type-I errors. This is achieved by keeping the sum of the levels of all tests executed on the database below a configurable parameter α. This is also a bound on the FWE [5], the probability of having at least one type-I error.

2.3 Persistent Version Pricing System

In this version, the database manager maintains a bank B of potential type-I errors. This bank B is initialized with α. For each user the database manager "withdraws" a certain amount from B to be used as the level of the test, thus guaranteeing that the sum of the levels of all the tests never exceeds α.

This is achieved as follows. We define the bank B to be a function of the current number n of database samples: B = α q^n, where 0 < q < 1 is a configurable database parameter. If a user pays c new samples, then the bank B will be updated to a smaller value α q^(n+c), and the difference α q^n (1 − q^c) will be allocated for the user's test. If the user's "level-sample" function is L(·), the cost for conducting this particular test will be obtained by finding the minimal c that satisfies

L(n + c) ≤ α q^n (1 − q^c).   (1)

This equation guarantees that the level allocated for this test is just enough to perform it (and maintain the power requirement) given the amount of samples in the database. Proposition 2.1 relates to the fairness and stability of this pricing system.
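
As a rough sketch of how a database manager could evaluate (1), the code below searches for the minimal integer c, reusing the illustrative z-test level-sample function from the Section 2.1 sketch; the choice of σ = 1 for the Fig. 1 configuration is our own assumption.

```python
from scipy.stats import norm

def level_sample_z(n: float, power: float, delta: float, sigma: float) -> float:
    # Level of a one-sided z-test with fixed power and effect size (see the Section 2.1 sketch).
    return norm.sf((n ** 0.5) * delta / sigma - norm.ppf(power))

def persistent_cost(level_fn, n: int, alpha: float, q: float, max_c: int = 10**6) -> int:
    """Minimal integer c with L(n + c) <= alpha * q**n * (1 - q**c), as in (1)."""
    bank = alpha * q ** n  # the remaining type-I error bank B
    for c in range(1, max_c + 1):
        if level_fn(n + c) <= bank * (1.0 - q ** c):
            return c
    raise ValueError("request too demanding for this database configuration")

# Example in the spirit of the Fig. 1 setup: q = 0.999, alpha = 0.05, 1,000 initial
# samples, power 0.9, effect size 0.1 (sigma assumed to be 1).
print(persistent_cost(lambda n: level_sample_z(n, 0.9, 0.1, 1.0),
                      n=1000, alpha=0.05, q=0.999))
```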

Proposition 2.1. The persistent pricing system with parameter 0 < q < 1 satisfies the fairness requirement for any pair of requests, and the stability requirement for any request with "level-sample" function satisfying ∀n: L(n) ≤ b q^n for some constant b.

Proof. Fairness follows immediately from the definition of the pricing system. Let L(·) be a "level-sample" function of a request satisfying ∀n: L(n) ≤ b q^n, and let c_0 = log_q(α / (b + α)). (c_0 is a positive number for 0 < q < 1 and b > 0.) Since α(1 − q^(c_0)) = bα / (b + α) = b q^(c_0), we can rewrite (1) as L(n + c_0) ≤ b q^(n + c_0). This proves c_0 is an upper bound on the cost associated with this request. □

We show in Section 3 that the "level-sample" function decay is indeed exponential in several common cases, including normally distributed data.

2.4 Volatile Version Pricing System

In the volatile version, every time the number of samples n increases, all the requests that had been submitted so far are re-executed. Each test is allocated the lowest possible level that is tolerable given the current database size and the test's predetermined requirements in terms of power and effect size.

The database manager needs to keep the following invariant true at all times: Σ_{i=1}^{t} L_i(n) ≤ α, where L_1(n), L_2(n), ..., L_t(n) are the "level-sample" functions of all requests submitted so far, t is their number, and n is the current number of database samples. When the price of a new request is being negotiated, the database manager will calculate how much n should be increased in order to maintain Σ_{i=1}^{t+1} L_i(n) ≤ α, where the new request's "level-sample" function is L_{t+1}(n). This increase will be the cost c required for it.

Note that if all the "level-sample" functions are the same (for example, if all users use the same statistic and have the same power and effect-size requirements), then each test is performed with level α/t, as in the Bonferroni correction for dealing with the multiple testing problem. The cost c is then simply calculated to compensate for the level reduction when t increases. The scheme we describe is more general, as it can set different costs for different kinds of tests.
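
A corresponding sketch for the volatile scheme (again our own illustration): the cost is the smallest number of new samples that keeps the sum of all allocated levels, including the newcomer's, within the error budget α.

```python
def volatile_cost(existing_level_fns, new_level_fn, n: int, alpha: float,
                  max_c: int = 10**6) -> int:
    """Smallest integer c such that sum_i L_i(n + c) <= alpha once the new request joins."""
    all_fns = list(existing_level_fns) + [new_level_fn]
    for c in range(0, max_c + 1):
        if sum(fn(n + c) for fn in all_fns) <= alpha:
            return c
    raise ValueError("request cannot be accommodated within the error budget alpha")

# Toy usage with the worst-case form b/n allowed by Proposition 3.6 (b = 15 is arbitrary):
# three earlier requests plus a new identical one, starting from 1,000 samples.
level_fn = lambda n: 15.0 / n
print(volatile_cost([level_fn] * 3, level_fn, n=1000, alpha=0.05))  # -> 200

# A cost of 0 is possible when the current database already supports the new request,
# mirroring the "free" initial tests in the simulation of Section 3.2.
```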

We now proceed to show that this pricing system satisfies the fairness and stability requirements. As before, we must make assumptions in order to prove stability. Here, though, we assume a slower decay rate of the "level-sample" function. Section 3 explains why our assumptions are very weak.

Proposition 2.2. The volatile pricing system satisfies the fairness requirement for any pair of requests, and the stability requirement if all requests have "level-sample" functions satisfying ∀n: L(n + 1) ≤ (n / (n + 1)) L(n).

Proof. Fairness follows immediately from the definition of the volatile pricing system.

Let L_1(n), L_2(n), ..., L_t(n) be the "level-sample" functions of all previously submitted requests. A new request with "level-sample" function L_{t+1}(n) is now being submitted. Let n_0 be the smallest integer that satisfies L_{t+1}(n_0) ≤ α, i.e., n_0 is the number of samples that would have been required in order to perform just this test at level α. We prove stability by showing that the cost of submitting this (t + 1)'th request is bounded by n_0. We show that by proving that if n_0 new samples are added to the database, then the sum of levels of all tests (including the new one, L_{t+1}(n + n_0)) is bounded by α:

Σ_{i=1}^{t} L_i(n + n_0) + L_{t+1}(n + n_0)
  ≤ (n / (n + n_0)) Σ_{i=1}^{t} L_i(n) + (n_0 / (n + n_0)) L_{t+1}(n_0)   (2)
  ≤ (n / (n + n_0)) α + (n_0 / (n + n_0)) α = α.   (3)

Inequality (2) uses the assumption that all "level-sample" functions satisfy ∀n: L(n + 1) ≤ (n / (n + 1)) L(n). Inequality (3) relies on L_{t+1}(n_0) ≤ α and on the volatile pricing system invariant Σ_{i=1}^{t} L_i(n) ≤ α. □

This result implies that if one collects enough data to support a certain test at level α, then using this data as currency to access a QPD configured to maintain level α has two advantages. First, it can only enhance the level or power of the test, as is shown in the above proof. Second, the level α will not only bound the type-I error of a single test, but also the FWE of the entire community sharing the database. Thus, our approach encourages collaboration and data sharing.

3 RESULTS

In this section, we provide some theoretical observations regarding the applicability of the QPD scheme, as well as simulation results.

3.1 Applicable Statistical Tests

A crucial point for applying the QPD is to characterize what kinds of statistical tests can be used in this scheme. In Section 2.3, we have shown that in the persistent version fairness is guaranteed for any kind of test. Stability, however, is only guaranteed for tests whose "level-sample" functions L(·) decay exponentially. We shall now show several classes of tests that satisfy these requirements.

We begin by stating a lemma closely related to the Chernoff-Stein Lemma [9], reformulated to suit our purposes.

Lemma 3.1. Let X_1, X_2, ..., X_n be i.i.d. samples drawn from probability distribution Q. Let two hypotheses H0 and H1 state that Q = P_0 and Q = P_1, respectively. A likelihood ratio test is used to decide between them. If L(n) is the "level-sample" function of the test for a given power requirement β, then lim_{n→∞} (1/n) log L(n) = −D(P_1 ∥ P_0), where D is the KL-divergence between the two distributions.

Proof. The result follows immediately from the Chernoff-Stein Lemma [9] and the Neyman-Pearson Lemma [10]. □

Corollary 3.2. Likelihood ratio tests for simple hypotheses are usable with the persistent version.

Proof. Lemma 3.1 proves that the "level-sample" function L(n) of such a test asymptotically tends to e^(−nk) for some constant k. Thus, clearly there exist a constant b and 0 < w < 1 such that ∀n: L(n) ≤ b w^n, satisfying the requirement for stability. □

The following theorem extends this result to uniformly most powerful tests [10].

Theorem 3.3. Let Q be a probability distribution with unknown parameter θ. A statistical test with a null hypothesis H0: θ ≤ θ_0 and an alternative H1: θ > θ_0 that is uniformly most powerful is usable with the persistent version.

Proof. Let L_1(n) be the "level-sample" function of this test having power β measured at the point determined by the effect size, θ = δ. Let L_2(n) be the "level-sample" function of a likelihood ratio test for the hypotheses H′0: θ = θ_0 and H′1: θ = δ with the same power β. If L_1(n) > L_2(n), then the threshold of the likelihood ratio test can be modified to achieve a significance level of L_1(n) but with power greater than β. This contradicts the fact that L_1(n) is the "level-sample" function of a uniformly most powerful test with power β at θ = δ. Thus, L_1(n) ≤ L_2(n), and since Corollary 3.2 has already established that L_2(n) satisfies the requirements of the persistent version, the result follows. □

Corollary 3.4. Testing hypotheses about the mean of a normal distribution with known variance is usable with the persistent version.

Proof. The single-tail version of this type of test is known to be uniformly most powerful [10], hence the result follows from Theorem 3.3. Since the normal distribution is symmetric around the mean, it is easy to show that the "level-sample" function of the double-tail version is bounded by 2L(n), where L(n) is the "level-sample" function of the single-tail version. □

Corollary 3.5. Single-tail tests about the mean of a Bernoulli distribution are usable with the persistent version.

Proof. Tests of this type are known to be uniformly most powerful [10], hence the result follows from Theorem 3.3. □

As for the volatile version, we show in Section 2.4 that in order to reach stability we must have ∀n: L(n + 1) ≤ (n / (n + 1)) L(n). Since this requirement is weaker than exponential decay, a wider range of tests can satisfy it. The following proposition proves that, given some weak assumptions, a statistical test is usable with the volatile version. The assumptions are satisfied for the mean statistic of any distribution with finite variance, and in many other cases.

Proposition 3.6. If L(·) is the "level-sample" function of a statistical test with null hypothesis H0: S = 0 for some parameter S with given power and effect-size requirements, and the test is determined by a statistic Ŝ which is an unbiased estimator of S with var(Ŝ) ≤ a/n for some constant a, then there exists some b such that ∀n: L(n) ≤ b/n.

Proof. Let s be the required effect size. We will implement a suboptimal test procedure in which H0 is rejected if |Ŝ| > s/2. The power of this test procedure is P(|Ŝ| > s/2 | S = ±s) ≥ P(|Ŝ − E(Ŝ)| < s/2). Using the Chebyshev inequality [10], we can bound it from below by 1 − 4a/(n s^2). Since lim_{n→∞} 1 − 4a/(n s^2) = 1, there exists an n_0 such that for all n > n_0 the power requirement is satisfied.

The level of this test procedure is P(|Ŝ| > s/2 | S = 0). Using the Chebyshev inequality, we can bound it by 4a/(n s^2). Since for n > n_0 the choice of the rejection region threshold is not optimal, ∀n > n_0: L(n) ≤ 4a/(n s^2). This result can be easily extended to ∀n: L(n) ≤ b/n by a proper choice of b. □

Proposition 3.6 means that, as a worst-case scenario, the database manager can use L̃(n) = b/n (which clearly satisfies the requirement for stability, ∀n: L̃(n + 1) ≤ (n / (n + 1)) L̃(n)) in place of the actual "level-sample" function, just for the sake of cost calculations.

Many other kinds of tests can be shown empirically to work well with both the persistent and volatile versions. Testing the mean of a normal distribution when the variance is unknown uses the t-distribution, which converges to a normal distribution as n grows. Empirical tests (see Fig. 1) demonstrate that it yields practically the same costs as in the known-variance case for both versions. The Pearson correlation coefficient can be transformed using Fisher's Z transformation [11] to an approximately normally distributed statistic, which suggests it should fit both versions as well. Section 3.2 shows empirical results with these kinds of tests.

Another candidate for the QPD is the Kolmogorov-Smirnov test [10]. It has two variants. One variant is used to determine whether the underlying distribution of the samples differs from a hypothesized one. The second variant is used to determine whether the underlying distributions of two samples differ. It is used, for example, for selecting continuous features in ProMateus [3]. We will describe here the first variant, though our claims can be easily extended to the second variant as well.

The test is performed as follows. Let F(x) be the hypothesized cumulative distribution function. The true underlying distribution is estimated from n samples, resulting in F_n(x). The test statistic is D_n = sup_x |F_n(x) − F(x)|. The null hypothesis H0 states that the samples are indeed drawn from the distribution F(x), and it is rejected if √n · D_n > z for some threshold z.

It has been shown that if H0 is true, then the cumulative distribution function of √n · D_n converges to K(x) = 1 − 2 Σ_{i=1}^{∞} (−1)^(i−1) e^(−2 i^2 x^2) [10]. It is common practice to use K(x) to estimate the level of a Kolmogorov-Smirnov test for large values of n. Proposition 3.7 demonstrates that under this approximation, Kolmogorov-Smirnov tests satisfy the requirements for stability of the persistent version.

Proposition 3.7. If L(·) is the "level-sample" function of a Kolmogorov-Smirnov test with given power and effect-size requirements, when the level of the test is estimated using the K(x) distribution, then there exist b and 0 < w < 1 such that ∀n: L(n) ≤ b w^n.

Proof. Let F_n(x) be the empirical distribution function estimated from n samples. The test statistic is D_n = sup_x |F_n(x) − F(x)|. Let β and s be the desired power and effect size, respectively. We will implement a suboptimal test procedure in which H0 is rejected if D_n > s/2.

We will now prove that there exists n_0 such that for all n > n_0 the power of this test procedure satisfies the power requirement β. Let G(x) be the real cumulative distribution function from which the samples were drawn, and let sup_x |G(x) − F(x)| = s. The Glivenko-Cantelli lemma [10] states that if n → ∞ then sup_x |F_n(x) − G(x)| converges to 0 in probability. Thus, there exists n_0 such that ∀n > n_0: P(sup_x |F_n(x) − G(x)| < s/2) > β. The event sup_x |F_n(x) − G(x)| < s/2 implies the event sup_x |F(x) − F_n(x)| > s/2, since

sup_x |F(x) − F_n(x)| = sup_x |F(x) − G(x) + G(x) − F_n(x)|
  ≥ sup_x (|F(x) − G(x)| − |F_n(x) − G(x)|)
  > sup_x |F(x) − G(x)| − s/2
  = s − s/2 = s/2

(using the elementary inequality |a + b| ≥ |a| − |b|). Thus, P(D_n > s/2) > β and the power requirement is satisfied.

Estimating the level assuming the K(x) distribution for the statistic √n · D_n, we get 2 Σ_{i=1}^{∞} (−1)^(i−1) e^(−i^2 n s^2 / 2). Since the e^(−i^2 n s^2 / 2) elements are monotonically decreasing with i, and since we sum them with alternating signs, the level is clearly bounded by the first element in the sum: 2 e^(−n s^2 / 2). Since for all n > n_0 the choice of rejection region threshold s/2 is suboptimal, ∀n > n_0: L(n) ≤ 2 e^(−n s^2 / 2). This result can be easily extended to ∀n: L(n) ≤ b w^n by a proper choice of b and w. □

Fig. 1. Simulation results.

Fig. 1 shows the costs associated with various kinds of statistical tests. The x-axis can be considered a time line, in which at time i the ith test was submitted. The y-axis shows the cost associated with the test. The simulation was performed using the persistent version with the q factor set to 0.999 and α = 0.05, starting with 1,000 initial samples drawn from a normal distribution. The power of all requests was set to 0.9. Four kinds of tests were simulated: Kolmogorov-Smirnov tests, correlation tests, mean tests with known variance, and mean tests with unknown variance. The effect size was set to 0.1 for all tests except the Kolmogorov-Smirnov test, for which the power was measured by empirically testing the frequency of correctly detecting the difference between two normal distributions with the same mean and two different variances: 1 and 1.6.

The graph shows that all these tests are assigned very similar costs that decrease with time, confirming our assertion that they behave "like" the normal test.

3.2 Simulation for HIV Drug Resistance

The Acquired Immunodeficiency Syndrome (AIDS) is a fatal disease caused by the HIV virus. A major obstacle to anti-HIV treatment is the high rate at which this virus replicates and mutates. If a patient is treated with a drug combination that is not potent enough, then instead of reducing the viral count in this patient's blood it may allow for new, even more resistant, viral generations. Hence, confronting the virus with the best available treatment at its earliest stages is crucial.

One of the methods for identifying the most potent treatment is phenotypic testing. In this approach, the level of resistance of the virus against each of the available drugs is determined in vitro. Since this method is very expensive (on the order of $1,000 per test), there exist several tools that aim to predict the phenotypic resistance from genotypic information. A publicly available database that is often used for this task is the HIVdb maintained by Stanford University [1].

We created a simulation of the QPD with α = 0.05 applied to feature selection in the Stanford database. Examples of features in this context would be specific subsets of mutations in the viral genome that potentially show correlation with the measured drug resistance. We simulated requests for correlation tests with power 0.95 when detecting a correlation coefficient of 0.1. The initial number of samples was set to n = 2,250, which is the number of samples in the Stanford database to date. The level-sample function L(·) of the requests was estimated using the Fisher Z transformation [11].
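
To give a feel for what such a level-sample function looks like, the sketch below uses the standard Fisher Z approximation, atanh(r) ≈ N(atanh(ρ), 1/(n − 3)), for a one-sided test of zero correlation with power 0.95 at ρ = 0.1. This is our own approximation of the setup; the exact dollar costs reported in this section depend on implementation details not spelled out here.

```python
from math import atanh, sqrt
from scipy.stats import norm

def level_sample_corr(n: int, power: float, rho: float) -> float:
    """Approximate level of a one-sided test of H0: correlation = 0 that keeps
    the requested power at true correlation rho, via Fisher's Z transformation."""
    se = 1.0 / sqrt(n - 3)                         # std. error of atanh(r) in the approximation
    threshold = atanh(rho) - norm.ppf(power) * se  # placed so the power requirement is met
    return norm.sf(threshold / se)                 # null probability of exceeding the threshold

# Approximate level supported by the 2,250 initial samples of the Stanford database
# for power 0.95 at rho = 0.1.
print(level_sample_corr(2250, power=0.95, rho=0.1))
```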

We simulated 100 requests of this sort. Fig. 2 shows the results. The x-axis can be considered a time line, in which at time i the ith test was submitted. The y-axis shows the cost associated with this test. In the persistent version, the q factor is set to 0.999. The first test costs $33,660 (assuming $1,000 is charged per sample) and the 100th test costs $3,405. The simulation ends with 3,144 samples in the database. In the volatile version, 27 tests can be executed before the invariant Σ_{i=1}^{t} L_i(n) ≤ α is violated, so we consider them free of charge. The 28th test costs $10,154. The price drops to $2,754 for the 100th test, and it leaves the database with 2,614 samples. In practice, it is possible to charge the first 27 free tests with some of the database initialization expenses, or some of the expected costs of subsequent tests.

Fig. 2. Correlation simulation results.

4 DISCUSSION

We believe that the formal scheme presented above can be implemented in many areas where misleading results due to a lack of quality controls are repeatedly acknowledged [4], [7], [8]. Specifically, in the field of bioinformatics several examples may be mentioned, especially those related to high-throughput, extensively analyzed genomic data such as GWAS studies and the Stanford HIV database mentioned above. Another example is the NIH Influenza Virus Resource [12], often used to identify complex dependencies in the virus genome. All these sources currently suffer from serious quality-related issues due to their unrestrained use. In this section, we discuss some issues that arise when considering applying the theory in practice.

4.1 Mixing Different Types of Requests

In Section 3.1 above, we present simulation results performed separately for four kinds of tests, where each simulation included 40 identically distributed requests. One might wonder what happens if different types of requests are mixed in the same simulation. From (1), it can be deduced that the persistent version is in fact memoryless: the cost of a request depends only on its "level-sample" function L(·), the current number of samples n, and the predefined q factor. Let us assume a certain user A repeatedly files identically distributed requests and is charged costs along a decaying curve like those shown in Fig. 1. If other requests are serviced between user A's requests, those other requests will increase the number of samples in the database. User A will only benefit from this, as it will advance his costs faster along the decaying curve. Note that user A will benefit from any kind of request serviced before his, even requests whose "level-sample" functions do not decay exponentially.


The volatile version is more complicated in that respect, since its pricing system depends on the "level-sample" functions of all previous requests. We have already shown, however, that the stability property holds if the database only services requests whose "level-sample" functions obey the weak requirement of Proposition 2.2.

4.2 Data Management

Clearly, using actual samples as the coin for purchasing the right to use the database has drawbacks. Many databases suffer from data quality issues resulting from combining samples from different sources. If the QPD were a collection of small batches of samples from a multitude of sources, it would aggravate the data quality problem severely; all the more so when not all potential users of the database are trustworthy as suppliers of samples, and since the samples provided by a user may influence the results of her own tests. Thus, it is sensible for the database manager to purchase samples from authorized third parties and charge the database users in standard currency. This arrangement separates data quality management from the pricing system and allows the database manager to handle that issue as in any other database.

Also, the pricing systems described in Section 2.2 can be easily extended to support fractional values. This is very desirable in cases where the cost of acquiring a single sample is high, in which case we would like to tune the pricing system so that each user will have to pay for a small fraction of a sample.

Another possible pitfall is that acquiring a new sample may be a lengthy process. In the ProMateus example mentioned in the introduction, acquiring a new sample, namely a new protein structure, may take as much as three months. The pricing systems described in Section 2.2, however, require that the samples provided per use of the database be available immediately. This problem can be solved by managing the database stock just like any other business, i.e., estimating future activity and buying samples in advance.

Finally, issues of confidentiality and trust should be considered. In some communities, it may be practical to reveal the data of the QPD and trust its users to report the tests they conduct on it and pay for them. Where this is not practical, the data must be kept confidential, and the users will have to submit the tests they wish to perform and receive the results without actually getting the data.

4.3 Persistent versus Volatile

In the persistent version, the researcher immediately receives the test result. If this result is published in a scientific forum, it can be stated that the result was obtained on a QPD configured to keep the FWE at level α. This offers protection from the publication bias phenomenon [4].

The volatile version offers a cheaper alternative that supports a wider variety of statistical tests, because of the slower decay in the level allocation available for each test. The obvious disadvantage is of course that results are volatile: the same discovery that seemed to pass the test at a certain time may fail it at a later time, and vice versa. This may pose difficulties when trying to publish such results in scientific papers. However, since this database serves a community of researchers, it may be appropriate for the database manager to publish a yearly report listing currently supported discoveries along with their authors, instead of individual publications.

4.4 Deriving “True” Results

Recently, some high-profile papers have addressed the overall correctness of research findings based on statistical tests. Ioannidis [13] claims that lack of proper false discovery control essentially implies that "most published research findings are false." While this issue can, in principle, be addressed by multiple comparison corrections, in practice it is impossible to require scientists working independently to correct for each other's testing. Moonesinghe et al. [14] propose replication as a "sanity checking" mechanism for revalidating findings and discarding the nonreproducible ones. Our framework is conceptually different, but can accomplish the goal of ensuring validity. Importantly, our results indicate that in a QPD framework, all researchers will benefit from the collaboration in terms of their chances of making a real discovery (i.e., testing power), while ensuring the validity of their results (i.e., controlling FWE).

4.5 Funding a QPD

Certainly, the situation where payment is required for performing statistical tests will not be easily accepted in the research community. In this section, we address this issue.

In many domains, obtaining samples for research is a costly endeavor. The QPD can be viewed as a mechanism for distributing those costs fairly between the consumers of this data. Consider, for example, two researchers: researcher A, who frequently performs high-power tests, and researcher B, who performs a smaller number of weaker tests. Due to their activity, new samples must be obtained or false discoveries will accumulate unchecked. Thus, it is fair to request that the two researchers participate in financing this task, in an amount proportional to the volume of their activity, as does the QPD.

While the scheme presented in this paper lays all the expenses purely on the database users that perform statistical tests, in real life the expenses may be further distributed between other beneficiaries of the accumulated data, as well as various funds that sponsor research in general. This will result in smaller costs to the researchers, which can be viewed as a tax that ties the number of tests performed by the community to its ability to obtain new samples.

5 CONCLUSIONS

In this paper, we have shown a scheme for employing a well-known technique for controlling the probability of false discovery within a mechanism that fairly distributes the costs involved among the interested parties. We have shown an example where such a technique might prove beneficial, and many more such opportunities exist.

This work can be extended in many respects. Our approach uses the simplest mechanism for controlling false discovery; it should be elaborated to include other techniques for controlling error rates. In particular, if many of the tested "discoveries" are in fact correct (and hence the null H0 should often be rejected), then procedures for controlling FDR [6] offer an approach that could reduce costs significantly. Applying them in conjunction with the pricing systems we presented in this paper is not straightforward, but we believe it is possible, especially with the volatile version.

Known techniques for alpha spending and alpha investing [15] are also natural candidates for inclusion within the QPD framework, as all of these include the notion of an initial pool of "alpha wealth" that is distributed between tests that are sequentially performed. Combining this with the innovative idea of requiring new samples as a fee would most certainly yield a more powerful and flexible tool, capable of controlling both FWE and variants of FDR for lower fees. This may also moderate the currently sharp slope of price decay at the beginning of the database usage, resulting in a more balanced cost sequence.

Furthermore, our simple Bonferroni-like approach is best suited for a series of tests which are independent of each other. In realistic scenarios it is expected that tests will be dependent, all the more so since they rely on mostly the same data. While our results still hold in such cases, an improved scheme may be devised to exploit those dependencies and result in reduced costs.


Finally, we feel that broader and more powerful claims can be stated regarding the applicability of various statistical tests to these schemes, and that tighter bounds on their costs can be achieved. Also, while the error probabilities of classical statistical tests are well studied, this is not the case for some modern and more elaborate procedures, particularly nonparametric methods. Further work is required before such procedures may be incorporated within a QPD framework.

REFERENCES

[1] S.-Y. Rhee, M.J. Gonzales, R. Kantor, B.J. Betts, J. Ravela, and R.W. Shafer, "Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database," Nucleic Acids Research, vol. 31, no. 1, pp. 298-303, 2003.

[2] Wellcome Trust Case Control Consortium, "Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls," Nature, vol. 447, no. 7145, pp. 661-678, http://dx.doi.org/10.1038/nature05911, June 2007.

[3] H. Neuvirth, U. Heinemann, D. Birnbaum, N. Tishby, and G. Schreiber, "Promateus—An Open Research Approach to Protein-Binding Sites Analysis," Nucleic Acids Research, vol. 35, Web Server issue, pp. 543-548, 2007.

[4] R.J. Simes, "Publication Bias: The Case for an International Registry of Clinical Trials," J. Clinical Oncology, vol. 4, pp. 1529-1541, 1986.

[5] Y. Hochberg and A.C. Tamhane, Multiple Comparison Procedures. Wiley, 1987.

[6] Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," J. Royal Statistical Soc. Series B, vol. 57, pp. 289-300, 1995.

[7] H. Tang, J. Peng, P. Wang, M. Coram, and L. Hsu, "Combining Multiple Family-Based Association Studies," BMC Proc., vol. 1, suppl. 1, p. S162, http://www.biomedcentral.com/1753-6561/1/S1/S162, 2007.

[8] E.J.C.G. van den Oord and P.F. Sullivan, "False Discoveries and Models for Gene Discovery," Trends in Genetics, vol. 19, no. 10, pp. 537-542, 2003.

[9] T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley-Interscience, Aug. 1991.

[10] M.H. DeGroot and M.J. Schervish, Probability and Statistics. Addison Wesley, 2002.

[11] D.L. Hawkins, "Using U Statistics to Derive the Asymptotic Distribution of Fisher's Z Statistic," The Am. Statistician, vol. 43, pp. 235-237, 1989.

[12] Y. Bao, P. Bolotov, D. Dernovoy, B. Kiryutin, L. Zaslavsky, T. Tatusova, J. Ostell, and D. Lipman, "The Influenza Virus Resource at the National Center for Biotechnology Information," J. Virology, vol. 82, no. 2, pp. 596-601, http://jvi.asm.org, 2008.

[13] J.P.A. Ioannidis, "Why Most Published Research Findings Are False," PLoS Medicine, vol. 2, p. e124, 2005.

[14] R. Moonesinghe, M. Khoury, and A. Janssens, "Most Published Research Findings Are False—But a Little Replication Goes a Long Way," PLoS Medicine, vol. 4, p. e28, 2007.

[15] D.P. Foster and R.A. Stine, "Alpha-Investing: A Procedure for Sequential Control of Expected False Discoveries," J. Royal Statistical Soc.: Series B (Statistical Methodology), vol. 70, no. 2, pp. 429-444, http://dx.doi.org/10.1111/j.1467-9868.2007.00643.x, Jan. 2008.
