JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 10, Number 2, 2003
© Mary Ann Liebert, Inc.
Pp. 119–142

Estimating Dataset Size Requirements for Classifying DNA Microarray Data

SAYAN MUKHERJEE,1,3 PABLO TAMAYO,1 SIMON ROGERS,4 RYAN RIFKIN,1,3 ANNA ENGLE,3 COLIN CAMPBELL,4 TODD R. GOLUB,1,2 and JILL P. MESIROV1

1 Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139.
2 Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA 02115.
3 McGovern Institute and CBCL, Massachusetts Institute of Technology, Cambridge, MA 02139.
4 Department of Engineering Mathematics, University of Bristol, UK.

ABSTRACT

A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced. The goal is to use existing classification results to estimate dataset size requirements for future classification experiments and to evaluate the gain in accuracy and significance of classifiers built with additional data. The method is based on fitting inverse power-law models to construct empirical learning curves. It also includes a permutation test procedure to assess the statistical significance of classification performance for a given dataset size. This procedure is applied to several molecular classification problems representing a broad spectrum of levels of complexity.

Key words: gene expression profiling, molecular pattern recognition, DNA microarrays, microarray analysis, sample size estimation.

1. INTRODUCTION

Over the last few years, the routine use of DNA microarrays has made possible the creation of large datasets of molecular information characterizing complex biological systems. Molecular classification approaches based on machine learning algorithms applied to DNA microarray data have been shown to have statistical and clinical relevance for a variety of tumor types: leukemia (Golub et al., 1999), lymphoma (Shipp et al., 2001), brain cancer (Pomeroy et al., 2002), lung cancer (Bhattacharjee et al., 2001), and the classification of multiple primary tumors (Ramaswamy et al., 2001, 2002; Yeang et al., 2001). In this context, after having obtained initial or preliminary classification results for a given biological system, one is often left pondering the possibility of embarking on a larger and more systematic study using additional samples. This is usually the case when one tries to improve the accuracy of the original classifier or to provide a more rigorous statistical validation of the existing prediction results. As the process of obtaining additional biological samples is often expensive, involved, and time consuming,


it is desirable to be able to estimate the performance of a classifier for yet unseen, larger dataset sizes. In this situation, one has to address two sets of questions:

1. For a given number of samples, how significant is the performance of a classifier, i.e., are the results better than what one would expect by chance?

2. If we know the answers to (1) for a range of dataset sizes, can we predict the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?

These two questions arise in other classification tasks with high-dimensional data and few samples, such as classifying functional MRI images of patients with neural dysfunction (Golland et al., 2002). In this paper, we develop a methodology for assessing the significance of a classifier's performance via a permutation test. We then fit an inverse power-law model to construct a learning curve with error rates estimated from an existing dataset and use this learning curve to extrapolate error statistics for larger datasets. Power calculations (Adcock, 1997) are a standard approach to estimate the number of data samples required. However, these approaches do not address our dataset size estimation problem for two reasons. First, the assumptions that the underlying data comes from a Gaussian distribution and that variables are independent do not hold. Second, the question addressed by power calculations is: given a particular dataset size, how confident are we of our empirical error estimate? This is very different from asking how the error rate might decrease given more data.

A nontrivial classifier changes its structure as more training data become available, and therefore determining how the error rate might decrease becomes a problem of function extrapolation rather than convergence estimation. In this regard, it is important not to confuse this problem with the more standard problem of estimating the confidence of an error estimate as a function of training set size, i.e., estimating the variance in an observed quantity (the error estimate) as a function of the number of measurements. In general, this latter problem is addressed using power calculations or deviation bounds (Adcock, 1997; Guyon et al., 1998). These methods compute bounds or estimates of a given quantity's deviation from its expected value as a function of the number of observations, or in this case samples. Other methods study the variation produced by technical factors that can be addressed by experimental design or by replicating sample preparation or array hybridization (Cheng and Wong, 2001; Tseng et al., 2001; Kerr and Churchill, 2001a,b). There are also methods to model differential expression across experiments (Lee and Whitmore, 2002) that assess the effect of replication and sample size in increasing the statistical power of ANOVA models. In the context of our problem, these approaches can only help to find bounds on the deviation between the misclassification error rate and its expected value as a function of the number of measurements, i.e., the realizations of the classifier for a given fixed classification dataset size. These standard error estimation methods are therefore not particularly useful in estimating the future performance of a classifier as a function of increasing dataset size with yet unseen additional data. We test our methodology on eight datasets which represent a range of difficulty, or complexity, of classification. In some cases the distinction is quite dramatic, while in others it is more subtle. The examples are drawn from existing cancer classification datasets, where discriminating the morphology of a sample (five sets) represents the "easier" end of the range and predicting treatment outcome (three sets) lies at the other extreme. This hierarchy of difficulty will be reflected by the increase in the dataset size requirements we estimate for these prediction problems. Our results give an indication of the minimal number of samples that are needed to obtain significant performance data and to extrapolate the improvement one might get by building the classifier on a larger dataset.

In the next section, we will give some background on general approaches addressing the problem of estimating classifier performance and learning rates. In Section 3, we describe our methodology in more detail. The results of applying our methodology to molecular classification problems are contained in Section 4. Section 5 summarizes the results of our tests. The proofs and technical details have been collected in the appendices.

2. BACKGROUND AND GENERAL APPROACH

The problem of estimating performance of a classifier for larger, yet unseen sets of examples is a difficult analytical problem. It amounts to developing a model to compute how fast a given classifier "learns," or improves its "fitting" to the data, as a function of dataset size.


In machine learning, a natural way to study classification accuracy as a function of training set size is by building empirical scaling models called learning curves (Cortes et al., 1994). Learning curves estimate the empirical error rate as a function of training set size for a given classifier and dataset. The advantage of this approach is that one avoids making assumptions about the distribution generating the dataset or the distribution of the classification errors. These learning curves are usually well characterized by inverse power laws:

$$e_n = a n^{-\alpha} + b \qquad (1)$$

The variables are the expected error rate $e_n$ given $n$ training samples, the learning rate $a$, the decay rate $\alpha$, and the Bayes error $b$, which is the minimum error rate achievable (Devroye et al., 1997; Duda et al., 2000). Notice that the values of the constants $a$, $b$, and $\alpha$ will change according to the classifier and dataset being studied. Based on this scaling model, as the size of the dataset increases, the misclassification error of a classifier will asymptotically approach $b$. This inverse-power-law "learning" behavior appears to be universal and is observed for many classifiers and types of datasets (Cortes et al., 1994; Shrager et al., 1988). It is in fact observed not only in machine learning but also in human and animal learning (Anderson et al., 1983). It is common to find empirical $\alpha$ values around or less than 1. Besides its empirical prevalence, the power-law model can be motivated analytically and in some cases derived within the statistical mechanics approach to learning. The basic idea behind this approach is to formulate the average error as a set of equations which are then solved via a statistical mechanics replica approach (Hertz et al., 1991) involving integration over the parameters of the classifier. This approach has been applied to various classification algorithms such as Support Vector Machines (Dietrich et al., 2000), Large Margin Perceptrons (Opper et al., 1995), and Adaline and other classifiers based upon Hebbian rules (Opper et al., 1990). The resulting analysis of the classification errors for all of the above algorithms results in inverse power laws of the form (1).
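To make Equation (1) concrete, here is a minimal Python sketch that evaluates the scaling model at several training set sizes; the parameter values are placeholders for illustration, not fitted values from any of the datasets below.

```python
import numpy as np

def learning_curve(n, a, alpha, b):
    """Inverse power-law scaling model of Equation (1): e_n = a*n**(-alpha) + b."""
    return a * np.asarray(n, dtype=float) ** (-alpha) + b

# Placeholder parameters: learning rate a, decay rate alpha (values <= 1 are
# common empirically), and asymptotic Bayes error b.
print(learning_curve([25, 50, 100, 400], a=1.0, alpha=0.5, b=0.05))
```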

Using this power-law scaling model as a basis, one can use the empirical error rates of a classifier over a range of training set sizes drawn from a dataset to fit an inverse-power-law model and then use this model to extrapolate the error rate to larger datasets. In order to make this a practical approach, one also needs a statistical test for classifier significance as a function of training set size. The reason for this is that the inverse-power-law model usually breaks down for small training set sizes, where the model lacks enough data to give accurate predictions. In this case the error rates are large and not significant. If a given classifier's results are not significant, then it is better to exclude them when fitting the learning curve. To directly address this problem, we have included a permutation test for the significance of a classifier as part of our methodology. This test compares the performance of the actual classifier with the performance of random classifiers trained to predict data whose target labels are permuted (randomized). A classifier that is able to "find" structure in the data and produce significant results should outperform its random counterparts most of the time. By fixing a significance level (0.05), we can produce an effective test to eliminate classifiers that are not significant from the fitting of the learning curve. Since classifier performance usually improves with increasing training set size, this significance test also allows us to find the minimum number of samples that produced significant classifiers.

3. METHODOLOGY FOR ESTIMATING ERROR RATES AS A FUNCTION OF DATASET SIZE

Given an arbitrary input dataset and classification algorithm, the methodology we will describe provides the following:

1) A measure of the statistical significance of the classifier built at each training set size. Using this, one can find the minimum training set size for which the classification performance of the classifiers is statistically significant.

2) An analytic expression (power law) for the error rate as a function of the increasing dataset size, as well as similar expressions for the 25th and 75th error rate quantiles. These provide a means to extrapolate the error bar "envelope" for the error rate for larger, yet unseen datasets.


As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.

We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile errors of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and the 25th and 75th error quantiles for a range of training set sizes. For a dataset of size $\ell$, with fixed training set size $n$ and test set size $\ell - n$, $T_1$ train/test realizations were constructed by subsampling the dataset. For each of these $T_1$ realizations an error rate $e_{n,i}$ is computed, and the average of the sequence $\{e_{n,1}, \ldots, e_{n,T_1}\}$,

$$\bar{e}_n = \frac{1}{T_1}\sum_{i=1}^{T_1} e_{n,i},$$

is used as an estimate of the true average error $e_n$. The average error rate is an unbiased estimator of the true error rate of a classifier trained with $n$ samples:

$$E_{D^\ell}\left[\frac{1}{T_1}\sum_{i=1}^{T_1} e_{n,i}\right] = E_{D^n}[e_n],$$

where $E_{D^n}[e_n]$ is the probability of classifying a new sample incorrectly when the classifier was trained with $n$ samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence $\{e_{n,1}, \ldots, e_{n,T_1}\}$ were also fitted to an inverse power law so that we could also estimate the variation in error rates as a function of dataset size. The 25th and 75th percentiles of the sequence $\{e_{n,1}, \ldots, e_{n,T_1}\}$ are good approximations of the 25th and 75th quantiles of error rates of classifiers trained with $n$ samples (see Appendix 1c for proof and details). We did not use the variance of the error rates $e_{n,i}$, because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with $n$ samples and tested on a new sample. Indeed, one can prove that the variance of this statistic is in general optimistic: the variance of the error rates $e_{n,i}$ is less than the variance of classifiers trained with $n$ samples and tested on a new sample (see Appendix 1b for proof and details).

As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.

FIG. 2. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.


Fitting the parameters of the learning curve by minimizing

$$\min_{\alpha, a, b} \sum_{l=1}^{M} \left( a n_l^{-\alpha} + b - \bar{e}_{n_l} \right)^2 \quad \text{subject to } \alpha, a, b \geq 0$$

is a convex optimization problem when $b$ is fixed. For a fixed $b$, one can estimate $\alpha$ and $a$ by taking logarithms and solving the following equivalent linear minimization problem:

$$\min_{\alpha, a} \sum_{l=1}^{M} \left( \ln a - \alpha \ln n_l - \ln(\bar{e}_{n_l} - b) \right)^2 \quad \text{subject to } \alpha, a \geq 0.$$

Solving this linear problem for various values of $b$, followed by a line search, gives us our estimate of $\alpha$, $a$, and $b$.
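A sketch of this estimation procedure in Python, assuming NumPy; the grid resolution for the line search over $b$ is an arbitrary choice, and this is an illustrative implementation rather than the authors' original code.

```python
import numpy as np

def fit_learning_curve(n, e_bar, n_b=200):
    """Fit e_n = a*n**(-alpha) + b. For each fixed b, ln(e_n - b) = ln(a) - alpha*ln(n)
    is a linear least-squares problem; a line search over b selects the best fit."""
    n, e_bar = np.asarray(n, float), np.asarray(e_bar, float)
    X = np.column_stack([np.ones_like(n), -np.log(n)])   # unknowns: ln(a), alpha
    best = None
    for b in np.linspace(0.0, e_bar.min() - 1e-6, n_b):  # keep e_bar - b > 0
        (ln_a, alpha), *_ = np.linalg.lstsq(X, np.log(e_bar - b), rcond=None)
        if alpha < 0:                                    # enforce alpha, a, b >= 0
            continue
        a = np.exp(ln_a)
        resid = np.sum((a * n ** (-alpha) + b - e_bar) ** 2)
        if best is None or resid < best[0]:
            best = (resid, a, alpha, b)
    return best[1:]                                      # (a, alpha, b)
```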

As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

$$H_0 : p(y = 1 \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\}) = p(y = -1 \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\});$$

the null hypothesis states that, given a particular training set, the conditional probability of a label being $1$ or $-1$ is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping $f: x \to y$ when the $y$ values are random, $y \in \{1, -1\}$. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate $e_{n,i}$ was computed, we construct $T_2$ randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test set. This results in a set of error rates $e_{n,i,j}$ for training set size $n$. From these error rates, we construct an empirical distribution function for the random classifier:

$$P_n^{ran}(x) = \frac{1}{T_1 \times T_2} \sum_{i=1}^{T_1} \sum_{j=1}^{T_2} \theta(x - e_{n,i,j}),$$

where $\theta(z) = 1$ if $z \geq 0$ and $0$ otherwise. The significance of the classifier is $P_n^{ran}(\bar{e}_n)$, which is the percentage of random classifiers with error rate smaller than $\bar{e}_n$. The procedure is illustrated with the following two examples for a fictitious dataset.

Example 1. $\bar{e}_{15} = 0.37$ (error rate of the classifier with 15 samples); $e_{15,i,j} = \{0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495\}$ (error rates of the random classifiers with 15 samples); $P_{15}^{ran}(\bar{e}_{15}) = 0.50$. Since the p-value is greater than 0.05, the error rate of the classifier is not statistically significant (see Fig. 2a).

Example 2. $\bar{e}_{45} = 0.1$ (error rate of the classifier with 45 samples); $e_{45,i,j} = \{0.205, 0.270, 0.333, 0.337, 0.370, 0.392, 0.399, 0.406, 0.425, 0.499\}$ (error rates of the random classifiers with 45 samples); $P_{45}^{ran}(\bar{e}_{45}) = 0$. Since the p-value is less than 0.05, the error rate of the classifier is statistically significant (see Fig. 2b).
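The significance calculation reduces to evaluating the empirical distribution function at the observed error rate. A minimal Python sketch, reproducing Example 1 (the function name is ours, not the paper's):

```python
import numpy as np

def significance(e_n, e_random):
    """Empirical p-value P_n^ran(e_n): the fraction of label-permuted (random)
    classifiers whose error rate is at or below the actual classifier's."""
    return np.mean(np.asarray(e_random, float) <= e_n)

# Example 1: e_15 = 0.37 against the ten permuted error rates from the text
perm = [0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495]
print(significance(0.37, perm))   # 0.5 > 0.05, so not significant
```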

The detailed description of the entire methodology for a two-class problem is as follows.

1) Subsampling and significance permutation test
   A. Subsampling procedure
      i. Given $\ell_{c1}$ samples from class 1 and $\ell_{c2}$ samples from class 2, the total number of samples is $\ell = \ell_{c1} + \ell_{c2}$, where $\ell_{c1}, \ell_{c2} \geq 10$.
      ii. Select 10 training set sizes $n_1, \ldots, n_j, \ldots, n_{10}$ over the interval $[10, \ell - 10]$.


FIG. 1. Dataset size estimation statistical methodology.

      1. For each training set size $n_j$, run the following subsampling procedure $T_1 = 50$ times, indexed by $i = 1, \ldots, T_1$:
         a. Randomly split the dataset into a training set with $n_j$ samples and a test set with $\ell - n_j$ samples, subject to the requirement that $n_{c2}/n_{c1} \approx \ell_{c2}/\ell_{c1}$, where $n_{c1}$ and $n_{c2}$ are the numbers of training samples from class 1 and class 2. Call the two datasets generated $S_{n,i}$.
         b. Train a classifier on each of the training sets and measure its error rate on the corresponding test set; call each of these error rates $e_{n,i}$.
   B. Permutation test
      i. For each subsampled train/test split $S_{n,i}$, run the following permutation procedure $T_2 > 50$ times, indexed by $j = 1, \ldots, T_2$:
         1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated $S_{n,i,j}^{ran}$.
         2. Train a classifier on the permuted training set and measure its error on the test set; call this error rate $e_{n,i,j}^{ran}$.
   C. Significance calculation
      1. For each training set size $n$, construct an empirical distribution function from the error rates of the permuted datasets: $P_n^{ran}(x) = \frac{1}{T_1 \times T_2}\sum_{i=1}^{T_1}\sum_{j=1}^{T_2} \theta(x - e_{n,i,j}^{ran})$, where $\theta(z) = 1$ if $z \geq 0$ and $0$ otherwise.


      2. Given the above empirical distribution function, compute for each $\bar{e}_n$ the value $t_n = P_n^{ran}(\bar{e}_n)$; statistical significance with respect to an $\alpha$-value of $p$ is achieved for $n_0$, the smallest $n$ for which $t_n < p$.

2) Learning curves and training set size estimation
   A. Assume the subsampling procedure was run for $M$ different sample sizes $n$, indexed by $l = 1, \ldots, M$. Take the sequence of error rates and compute the following quantities for each training set size $n > n_0$ for which the classifier passed the significance test ($t_n < p$): the mean error rate $\bar{e}_n = \frac{1}{T_1}\sum_{i=1}^{T_1} e_{n,i}$ and the 25th and 75th quantiles of the vector of error rates $\{e_{n,1}, \ldots, e_{n,T_1}\}$.
   B. Use the above quantities to fit the following learning curves:
      i. Given training set sizes $n_l$ and mean error rates $\bar{e}_{n_l}$, compute $\alpha$, $a$, and $b$ via the minimization procedure $\min_{\alpha,a,b} \sum_{l=1}^{M} (a n_l^{-\alpha} + b - \bar{e}_{n_l})^2$ subject to $\alpha, a, b \geq 0$; designate the resulting values as $\alpha_m$, $a_m$, $b_m$. The resulting curve estimates the error rate as a function of training set size:
         $$L_m(e_n) = a_m n^{-\alpha_m} + b_m.$$
      ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates $\{e_{n,1}, \ldots, e_{n,T_1}\}$.
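As an illustration of steps 1A–1C above, here is a Python sketch of the subsampling and permutation procedure using a linear-kernel SVM; it assumes scikit-learn and NumPy, and the function and variable names are ours rather than the paper's.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit

def subsample_and_permute(X, y, n_train, T1=50, T2=50, seed=0):
    """Steps 1A-1B: T1 stratified train/test splits at training size n_train,
    each paired with T2 runs on label-permuted training sets."""
    rng = np.random.RandomState(seed)
    splits = StratifiedShuffleSplit(n_splits=T1, train_size=n_train,
                                    random_state=seed)
    e, e_ran = [], []
    for train, test in splits.split(X, y):
        clf = SVC(kernel="linear").fit(X[train], y[train])
        e.append(np.mean(clf.predict(X[test]) != y[test]))
        for _ in range(T2):
            y_perm = rng.permutation(y[train])        # permute training labels only
            clf_ran = SVC(kernel="linear").fit(X[train], y_perm)
            e_ran.append(np.mean(clf_ran.predict(X[test]) != y[test]))
    return np.array(e), np.array(e_ran)

# Step 1C/2: t_n = P_n^ran(mean error); size n is significant when t_n < 0.05.
# e, e_ran = subsample_and_permute(X, y, n_train=30)
# t_n = np.mean(e_ran <= e.mean())
```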

4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of complexity of classification. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcomes. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets, we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10 and 20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40 and 60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50 and 70). For these more difficult problems, we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case, it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees statistically significant results. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.

Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissues of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed.

Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

[For each dataset and classification problem, Table 1 lists the fitted learning-curve parameters of the 25th-quantile, mean, and 75th-quantile curves (learning rate $a$, decay rate $\alpha$, and asymptotic Bayes error $b$), together with the predicted error rates at 400 training samples. The mean-curve parameters are repeated in the per-dataset formulas below.]

Table 2. Summary of Results for Datasets Included in the Study

[For each dataset and classification problem, Table 2 lists the maximum dataset size, the training set size needed to achieve statistical significance (p-value < 0.05), whether the dataset is large enough for fitting the learning curve, the training set sizes used in the fit, the extrapolated versus actual classification error, and conclusions. The dataset size is large enough for extrapolation for the five morphology problems; for the three treatment outcome problems, the fit is inaccurate but suggests error rates that could be achieved with more than 400 samples.]


The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, and 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.

To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is

$$\text{error}(n) = 1.42\,n^{-0.52} + 0.0098.$$

The error rates for the 25th and 75th quantiles are

$$\text{error}_{25}(n) = 1.89\,n^{-0.63} + 0.0032,$$

$$\text{error}_{75}(n) = 1.17\,n^{-0.43} + 0.00.$$

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that very low errors can indeed be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn.

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.


FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

The size of the 25th and 75th quantile envelope is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

$$\text{error}(400) = 1.42 \cdot 400^{-0.52} + 0.0098 = 7.3\%,$$

$$\text{error}_{25}(400) = 1.89 \cdot 400^{-0.63} + 0.0032 = 4.7\%,$$

$$\text{error}_{75}(400) = 1.17 \cdot 400^{-0.43} + 0.00 = 8.9\%.$$

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6% and perhaps as low as 4.7% (25th quantile envelope).
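These extrapolations can be checked numerically with a few lines of Python (a sketch; the helper name is ours):

```python
def err(n, a, alpha, b):
    """Evaluate the fitted inverse power law error(n) = a*n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Mean, 25th-quantile, and 75th-quantile curves for tumor versus normal
for a, alpha, b in [(1.42, 0.52, 0.0098), (1.89, 0.63, 0.0032), (1.17, 0.43, 0.0)]:
    print(round(err(400, a, alpha, b), 3))    # 0.073, 0.047, 0.089
```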

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, and 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 0.7706\,n^{-0.65} + 0.009.$$

In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.


FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, and 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 0.4798\,n^{-0.2797}.$$

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, and 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 0.7362\,n^{-0.6864}.$$

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000).


FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


The dimensionality of the dataset was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, and 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 0.57\,n^{-0.7073} + 0.0006.$$

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, and 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 1.115\,n^{-0.3295} + 0.006.$$

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large-cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, and 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 0.9431\,n^{-0.2957} + 0.01.$$

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van 't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, and 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

$$\text{error}(n) = 0.4852\,n^{-0.0733} + 0.01.$$


FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic and thus, in general, "simpler" problems than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.


In this context, the methodology can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements are typically in the 10–20 samples range for morphological classifications and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size $\ell$, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{\binom{\ell}{p}} \sum_{(p)} Q(z^p; f_{z^{\ell-p}}),$$

where $z = (x, y)$, $f_{z^{\ell-p}}$ is the classifier constructed with the p samples left out, $Q(z^p; f_{z^{\ell-p}})$ is the error of this classifier on the p samples left out, and the sum runs over the possible choices of the p samples left out. This procedure is unbiased:

$$E[L_p(z_1, \ldots, z_\ell)] = E[Q(z; f_{z^{\ell-p}})],$$

which means that the expected error when the classifier is trained with $\ell - p$ samples is the same as the expected error of the procedure $L_p$.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969) by the following series of transformations:

$$E[L_1(z_1, \ldots, z_\ell)] = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_1) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E[Q(z_i; f_{z^{\ell-1}})] = E[Q(z; f_{z^{\ell-1}})].$$

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with $\ell - p$ samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

$$V[L_p(z_1, \ldots, z_\ell)] \leq V[Q(z; f_{z^{\ell-p}})],$$

which means that the expected variance when the classifier is trained with $\ell - p$ samples is greater than or equal to the variance of the procedure $L_p$.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size $\ell - 1$ is

$$V[Q(z; f_{z^{\ell-1}})] = E[Q(z; f_{z^{\ell-1}})^2] - \left[E[Q(z; f_{z^{\ell-1}})]\right]^2.$$

The variance of the estimator is

$$V[L_1(z_1, \ldots, z_\ell)] = V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right],$$

where $t_i = Q(z_i; f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell))$ indicates whether or not an error is made when the $i$th point is left out. We can rewrite this as

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] - \left(E\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right]\right)^{2}.$$

From Statement 1, we know

$$E\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E[Q(z; f_{z^{\ell-1}})],$$

so

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] - \left[E[Q(z; f_{z^{\ell-1}})]\right]^2.$$

It therefore suffices to show that

$$E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] \leq E[Q(z; f_{z^{\ell-1}})^2].$$

One can write

$$E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] = E\left[\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \neq j} t_i t_j\right].$$

If the random variables $t_i, t_j$ were identical and independent, the above could be rewritten as

$$E\left[\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\right] = E[Q(z; f_{z^{\ell-1}})^2].$$

However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\left[\frac{1}{\ell^2}\sum_{i \neq j} t_i t_j\right] \leq E\left[\frac{\ell^2 - \ell}{\ell^2}\, t^2\right],$$

and the required inequality follows.

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z; f_{z^{\ell-1}})$ is $P_{D^{\ell-1}}\{t < \xi\}$. The cumulative distribution function of the random variable $r = L_1(z_1, \ldots, z_\ell)$ is $P_{D^\ell}\{r < \xi\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is

$$P(t < \xi) = \int_{-\infty}^{\xi} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\xi - t)\, p(t)\, dt.$$

The distribution function for the random variable $r = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$P(r < \xi) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\xi - r_i)\, p(r_i)\, dr_i,$$

which is the same as $P(t < \xi)$. We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\varepsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \leq 20$, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
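To make the Kolmogorov–Smirnov bound concrete, the following small Python sketch (our own illustration, not part of the paper) evaluates the series above; ε ≈ 1.36 recovers the familiar 95% point of the Kolmogorov distribution.

```python
import math

def ks_prob(eps, kmax=100):
    """P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps } for large l, via the
    Kolmogorov series 1 - 2*sum_k (-1)**(k-1) * exp(-2*k**2*eps**2)."""
    return 1 - 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
                       for k in range(1, kmax + 1))

print(ks_prob(1.36))   # ~0.95: the classical 95% critical value
```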


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix, we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between $e_n$ and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \text{sign}(w \cdot z),$$

where $z$ is an input vector, $w$ is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is $t$ and the weight vector for the student perceptron is $w$. Then the number of generalization errors made by the student perceptron on a set of $p$ new samples will be

$$\sum_{i=1}^{p} \theta[-(t \cdot z_i)(w \cdot z_i)],$$

where microx D 0 if x lt 0 and 1 otherwise This general approach leads to a set of equations for determiningthe generalization error via a replica approach (Hertz 1991) involving integrations in the weight space wThe generalization error is given as a function of the ratio macr D n=m (where m is the number of attributes)However with m xed we can assume the same functional dependence on n as for macr From this analysiswe nd that the generalization error depends on the algorithm used and generally assumes a power law

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high β limit, where the dependence of generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50n^(−1) (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass, of version space, the space of all hypotheses consistent with the data) the generalization error scales as 0.44n^(−1) (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24n^(−1/2), and for the Hebb rule as 0.40n^(−1/2) (see Watkin [1993] for a review). The dependence on n is thus approximately n^(−α) with α near 1 for the more efficient rules, such as the optimal perceptron and Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term to the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates use of e_n = a·n^(−α) + b for modeling the generalization error.
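The scaling behavior just described is easy to reproduce numerically. The following toy simulation is a sketch only (the sizes and the Gaussian input assumption are illustrative choices, not from the text): it trains a Hebb-rule student against a random teacher perceptron and estimates the generalization error on fresh patterns, which decays roughly as n^(−1/2), as quoted above for the Hebb rule.

import numpy as np

rng = np.random.default_rng(0)
m = 100                                       # number of attributes
t = rng.normal(size=m)
t /= np.linalg.norm(t)                        # teacher weight vector
X_test = rng.normal(size=(20000, m))          # fresh patterns for estimating the error
y_test = np.sign(X_test @ t)

for n in [25, 50, 100, 200, 400, 800]:
    errors = []
    for _ in range(20):                       # average over random training sets
        Z = rng.normal(size=(n, m))
        y = np.sign(Z @ t)                    # labels supplied by the teacher
        w = (y[:, None] * Z).sum(axis=0)      # Hebb-rule student: w = sum_i y_i z_i
        errors.append(np.mean(np.sign(X_test @ w) != y_test))
    print(n, round(float(np.mean(errors)), 3))  # error decays roughly like n**(-1/2)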

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function


that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^(−α)),

where α > 0. Worst case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^(−α)) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−yf(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.

REFERENCES

Adcock, C.J. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics 21(suppl.).
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(suppl. 1), S316–S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu



it is desirable to be able to estimate the performance of a classifier for yet unseen, larger dataset sizes. In this situation one has to address two sets of questions:

1. For a given number of samples, how significant is the performance of a classifier, i.e., are the results better than what one would expect by chance?

2. If we know the answers to (1) for a range of dataset sizes, can we predict the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?

These two questions arise in other classification tasks with high dimensional data and few samples, such as classifying functional MRI images of patients with neural dysfunction (Golland et al., 2002). In this paper we develop a methodology for assessing the significance of a classifier's performance via a permutation test. We then fit an inverse power law model to construct a learning curve with error rates estimated from an existing dataset and use this learning curve to extrapolate error statistics for larger datasets. Power calculations (Adcock, 1997) are a standard approach to estimate the number of data samples required. However, these approaches do not address our dataset size estimation problem for two reasons. First, the assumptions that the underlying data comes from a Gaussian distribution and that variables are independent do not hold. Second, the question addressed by power calculations is: given a particular dataset size, how confident are we of our empirical error estimate? This is very different from asking how the error rate might decrease given more data.

A nontrivial classifier changes its structure as more training data become available, and therefore determining how the error rate might decrease becomes a problem of function extrapolation rather than convergence estimation. In this regard it is important not to confuse this problem with the more standard problem of estimating the confidence of an error estimate as a function of training set size, i.e., estimating the variance in an observed quantity, the error estimate, as a function of the number of measurements. In general, this latter problem is addressed using power calculations or deviation bounds (Adcock, 1997; Guyon et al., 1998). These methods compute bounds or estimates of a given quantity's deviation from its expected value as a function of the number of observations, or in this case samples. Other methods study the variation produced by technical factors that can be addressed by experimental design or by replicating sample preparation or array hybridization (Cheng and Wong, 2001; Tseng et al., 2001; Kerr and Churchill, 2001a, 2001b). There are also methods to model differential expression across experiments (Lee and Whitmore, 2002) that assess the effect of replication and sample size in increasing the statistical power of ANOVA models. In the context of our problem, these approaches can only help to find bounds on the deviation between the misclassification error rate and its expected value as a function of the number of measurements, i.e., the realizations of the classifier for a given fixed classification dataset size. These standard error estimation methods are therefore not particularly useful in estimating the future performance of a classifier as a function of increasing dataset size with yet unseen additional data.

We test our methodology on eight data sets which represent a range of difficulty or complexity of classification. In some cases the distinction is quite dramatic, while in others it is more subtle. The examples are drawn from existing cancer classification data sets, where discriminating the morphology of a sample (five sets) represents the "easier" end of the range and predicting treatment outcome (three sets) lies at the other extreme. This hierarchy of difficulty will be reflected by the increase in the data set size requirements we estimate for these prediction problems. Our results give an indication of the minimal number of samples that are needed to obtain significant performance data and to extrapolate the improvement one might get by building the classifier on a larger dataset.

In the next section we will give some background on general approaches addressing the problem of estimating classifier performance and learning rates. In Section 3 we describe our methodology in more detail. The results of applying our methodology to molecular classification problems are contained in Section 4. Section 5 summarizes the results of our tests. The proofs and technical details have been collected in the appendices.

2. BACKGROUND AND GENERAL APPROACH

The problem of estimating performance of a classifier for larger, yet unseen sets of examples is a difficult analytical problem. It amounts to developing a model to compute how fast a given classifier "learns," or improves its "fitting" to the data, as a function of dataset size.


In machine learning, a natural way to study classification accuracy as a function of training set size is by building empirical scaling models called learning curves (Cortes et al., 1994). Learning curves estimate the empirical error rate as a function of training set size for a given classifier and dataset. The advantage of this approach is that one avoids making assumptions about the distribution generating the dataset or the distribution of the classification errors. These learning curves are usually well characterized by inverse power laws:

e_n = a·n^(−α) + b.    (1)

The variables are the expected error rate e_n given n training samples, the learning rate a, the decay rate α, and the Bayes error b, which is the minimum error rate achievable (Devroye et al., 1997; Duda et al., 2000). Notice that the values of the constants a, b, and α will change according to the classifier and dataset being studied. Based on this scaling model, as the size of the dataset increases, the misclassification error of a classifier will asymptotically approach b. This inverse-power-law "learning" behavior appears to be universal and is observed for many classifiers and types of datasets (Cortes et al., 1994; Shrager et al., 1988). It is in fact observed not only in machine learning but also in human and animal learning (Anderson, 1983). It is common to find empirical α values around or less than 1. Besides its empirical prevalence, the power-law model can be motivated analytically and in some cases derived within the statistical mechanics approach to learning. The basic idea behind this approach is to formulate the average error as a set of equations which are then solved via a statistical mechanics replica approach (Hertz et al., 1991) involving integration over the parameters of the classifier. This approach has been applied to various classification algorithms, such as Support Vector Machines (Dietrich et al., 2000), Large Margin Perceptrons (Opper et al., 1995), and Adaline and other classifiers based upon Hebbian rules (Opper et al., 1990). The resulting analysis of the classification errors for all of the above algorithms results in inverse power laws of the form (1).

Using this power-law scaling model as a basis, one can use the empirical error rates of a classifier over a range of training set sizes drawn from a dataset to fit an inverse-power-law model and then use this model to extrapolate the error rate to larger datasets. In order to make this a practical approach, one also needs a statistical test for classifier significance as a function of training set size. The reason for this is that the inverse-power-law model usually breaks down for small training set sizes, where the model lacks enough data to give accurate predictions. In this case the error rates are large and not significant. If a given classifier's results are not significant, then it is better to exclude them when fitting the learning curve. To directly address this problem, we have included a permutation test for the significance of a classifier as part of our methodology. This test compares the performance of the actual classifier with the performance of random classifiers trained to predict data whose target labels are permuted (randomized). A classifier that is able to "find" structure in the data and produce significant results should outperform its random counterparts most of the time. By fixing a significance level (0.05), we can produce an effective test to eliminate classifiers that are not significant from the fitting of the learning curve. Since the classifier performance usually improves with increasing training set size, this significance test also allows us to find the minimum number of samples that produced significant classifiers.

3. METHODOLOGY FOR ESTIMATING ERROR RATES AS A FUNCTION OF DATASET SIZE

Given an arbitrary input dataset and classification algorithm, the methodology we will describe provides the following:

1) A measure of the statistical significance of the classifier built at each training set size. Using this, one can find the minimum training set size for which the classification performance of the classifiers is statistically significant.

2) An analytic expression (power law) of the error rate as a function of the increasing dataset size, as well as similar expressions for the 25th and 75th error rate quantiles. These provide a means to extrapolate the error bar "envelope" for the error rate for larger, yet unseen data sets.


As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.

We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile error of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and 25th and 75th error quantiles for a range of training set sizes. For a dataset of size ℓ, with fixed training set size n and test set size ℓ − n, T_1 train/test realizations were constructed by subsampling the dataset. For each of these T_1 realizations an error rate e_{n,i} is computed, and the average of the sequence {e_{n,1}, ..., e_{n,T_1}},

ē_n = (1/T_1) Σ_{i=1}^{T_1} e_{n,i},

is used as an estimate of the true average error e_n. The average error rate is an unbiased estimator of the true error rate of a classifier trained with n samples,

E_{D^ℓ} [ (1/T_1) Σ_{i=1}^{T_1} e_{n,i} ] = E_{D^n} [e_n],

where E_{D^n}[e_n] is the probability of classifying a new sample incorrectly when the classifier was trained with n samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T_1}} were also fitted to an inverse power law so that we could also estimate the variation in error rates as a function of dataset size. The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T_1}} are good approximations of the 25th and 75th quantiles of error rates of classifiers trained with n samples (see Appendix 1c for proof and details). We did not use the variance of the error rates e_{n,i} because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with n samples and tested on a new sample. Indeed, one can prove that the variance of this statistic is in general optimistic: the variance of the error rates e_{n,i} is less than the variance of classifiers trained with n samples and tested on a new sample (see Appendix 1b for proof and details).

As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.

FIG. 2. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers (randomization of labels), and the gray circle is the average error rate for the actual classifier.


Fitting the parameters of the learning curve by minimizing

min_{α,a,b} Σ_{l=1}^{M} (a·n_l^(−α) + b − e_{n_l})²   subject to α, a, b ≥ 0

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

min_{α,a} Σ_{l=1}^{M} (ln a − α ln n_l − ln(e_{n_l} − b))²   subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by line search, gives us our estimate of α, a, and b.
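A minimal sketch of this scheme in Python follows; the function name and the illustrative error rates are ours, not from the text. For each candidate b on a grid, the log-transformed problem is solved by ordinary least squares, and the b with the smallest residual is kept.

import numpy as np

def fit_learning_curve(ns, errs, n_grid=200):
    """Fit e(n) = a*n**(-alpha) + b by line search over b.

    For each fixed b, ln(e - b) = ln a - alpha*ln n is linear in
    (ln a, alpha), so it is solved by ordinary least squares; the b
    minimizing the squared residual of the original curve is kept.
    """
    ns, errs = np.asarray(ns, float), np.asarray(errs, float)
    X = np.column_stack([np.ones_like(ns), -np.log(ns)])
    best = None
    for b in np.linspace(0.0, errs.min() - 1e-6, n_grid):
        (ln_a, alpha), *_ = np.linalg.lstsq(X, np.log(errs - b), rcond=None)
        if alpha < 0:                      # enforce alpha, a, b >= 0
            continue
        resid = np.sum((np.exp(ln_a) * ns ** -alpha + b - errs) ** 2)
        if best is None or resid < best[0]:
            best = (resid, np.exp(ln_a), alpha, b)
    return best[1:]                        # a, alpha, b

# illustrative mean error rates at nine subsampled training set sizes
a, alpha, b = fit_learning_curve(
    [30, 40, 50, 60, 80, 90, 130, 170, 210],
    [0.26, 0.24, 0.22, 0.21, 0.19, 0.18, 0.16, 0.145, 0.135])
print(f"e(n) ~ {a:.2f} * n**(-{alpha:.2f}) + {b:.3f}")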

As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

H_0: p(y = 1 | x, {(x_1, y_1), ..., (x_n, y_n)}) = p(y = −1 | x, {(x_1, y_1), ..., (x_n, y_n)});

the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or −1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y ∈ {1, −1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_{n,i} was computed, we construct T_2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test set. This results in a set of error rates e_{n,i,j} for training set size n. From these error rates we construct an empirical distribution function for the random classifier:

P_n^ran(x) = (1/(T_1 × T_2)) Σ_{i=1}^{T_1} Σ_{j=1}^{T_2} θ(x − e_{n,i,j}),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P_n^ran(e_n), which is the percentage of random classifiers with error rate smaller than e_n. The procedure is illustrated with the following two examples for a fictitious dataset.

Example 1. e_15 = 37% (error rate of the classifier with 15 samples); e_{15,i,j} = {21.5, 26.0, 29.0, 32.0, 36.6, 38.8, 39.5, 40.8, 42.0, 49.5}% (error rates of the random classifiers with 15 samples); P_15^ran(e_15) = 50%. Since the p-value is greater than .05, the error rate of the classifier is not statistically significant (see Fig. 2a).

Example 2. e_45 = 1% (error rate of the classifier with 45 samples); e_{45,i,j} = {20.5, 27.0, 33.3, 33.7, 37.0, 39.2, 39.9, 40.6, 42.5, 49.9}% (error rates of the random classifiers with 45 samples); P_45^ran(e_45) = 0%. Since the p-value is less than .05, the error rate of the classifier is statistically significant (see Fig. 2b).
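In code, the significance test reduces to comparing the classifier's average error with the empirical distribution of permuted-label errors. A sketch (the function name is ours) reproduces Example 1:

import numpy as np

def significance(e_n, e_random):
    """P_n^ran(e_n): fraction of permuted-label error rates <= e_n.

    e_n      : average error of the classifier trained on the true labels
    e_random : the error rates e_{n,i,j} from the label-permuted runs
    The error rate is deemed significant when the returned value is
    below the chosen level, e.g., 0.05.
    """
    return np.mean(np.asarray(e_random, float) <= e_n)  # theta(x - e) = 1 when e <= x

# Example 1 from the text: p-value 0.50, not significant
print(significance(0.37, [0.215, 0.260, 0.290, 0.320, 0.366,
                          0.388, 0.395, 0.408, 0.420, 0.495]))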

The detailed description of the entire methodology for a two-class problem is as follows:

1) Subsampling and significance permutation test
A. Subsampling procedure:

i. Given ℓ_c1 samples from class 1 and ℓ_c2 samples from class 2, the total number of samples is ℓ = ℓ_c1 + ℓ_c2, where ℓ ≥ 10.

ii. Select 10 training set sizes n_1, ..., n_j, ..., n_10 over the interval [10, ℓ − 10].


FIG. 1. Dataset size estimation statistical methodology.

1. For each training set size n_j, run the following subsampling procedure T_1 = 50 times, indexed by i = 1, ..., T_1:

a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_c2/n_c1 ≈ ℓ_c2/ℓ_c1, where n_c1 and n_c2 are the number of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{n,i}.

b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_{n,i}.

B. Permutation test:

i. For each subsampled train/test split S_{n,i}, run the following permutation procedure T_2 > 50 times, indexed by j = 1, ..., T_2:

1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S_{n,i,j}^ran.

2. Train a classifier on the training set and measure its error on the test set; call this error rate e_{n,i,j}^ran.

C. Significance calculation:

1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P_n^ran(x) = (1/(T_1 × T_2)) Σ_{i=1}^{T_1} Σ_{j=1}^{T_2} θ(x − e_{n,i,j}^ran), where θ(z) = 1 if z ≥ 0 and 0 otherwise.


2. Given the above empirical distribution function, compute for each ē_n the value t_n = P_n^ran(ē_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.

2) Learning curves and training set size estimation
A. Assume the subsampling procedure was run for M different sample sizes n, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T_1) Σ_{i=1}^{T_1} e_{n,i}, and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T_1}}.

B. Use the above quantities to fit the following learning curves:

i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the minimization procedure min_{α,a,b} Σ_{l=1}^{M} (a·n_l^(−α) + b − ē_{n_l})², subject to α, a, b ≥ 0; designate the resulting values of α, a, b as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

L_m(n) = a_m·n^(−α_m) + b_m.

ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T_1}}. (A compact code sketch of the subsampling and permutation steps follows below.)
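The sketch below covers steps 1A–1C; it assumes a samples-by-genes matrix X with labels y in {−1, +1} and a scikit-learn-style classifier, and the helper name is ours. Curve fitting and the significance rule then proceed as sketched earlier.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def subsampled_error_rates(clf, X, y, sizes, T1=50, T2=50, seed=0):
    """For each n: mean error, 25th/75th quantiles, and permutation p-value t_n."""
    rng = np.random.RandomState(seed)
    results = {}
    for n in sizes:
        e, e_ran = [], []
        for _ in range(T1):
            # stratified split keeps n_c2/n_c1 close to l_c2/l_c1
            Xtr, Xte, ytr, yte = train_test_split(
                X, y, train_size=n, stratify=y, random_state=rng)
            e.append(np.mean(clone(clf).fit(Xtr, ytr).predict(Xte) != yte))
            for _ in range(T2):
                yperm = rng.permutation(ytr)  # permute training labels only
                e_ran.append(np.mean(clone(clf).fit(Xtr, yperm).predict(Xte) != yte))
        t_n = np.mean(np.asarray(e_ran) <= np.mean(e))
        results[n] = (np.mean(e), np.percentile(e, [25, 75]), t_n)
    return results

# e.g.: subsampled_error_rates(SVC(kernel="linear"), X, y, [10, 15, 20, 25, 30])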

4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of complexity of classification. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcome. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10–20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40–60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50–70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees statistically significant results. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the data sets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.

Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)


Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Dataset (problem)           Curve           Learning   Decay    Bayes    Predicted error rate
                                            rate a     rate α   error b  at 400 samples (%)
Multiple tissues            25th quantile   1.89       0.63     0.003     4.7
(cancer vs. normal)         mean            1.42       0.52     0.010     7.3
                            75th quantile   1.17       0.43     0.000     8.9
Leukemia                    25th quantile   3.98       2.3      0.006     0.6
(AML vs. ALL)               mean            0.771      0.65     0.009     2.5
                            75th quantile   0.884      0.60     0.010     3.4
Colon cancer                25th quantile   0.439      0.37     0.000     4.8
(cancer vs. normal)         mean            0.480      0.28     0.000     9.0
                            75th quantile   0.500      0.23     0.000    12.4
Ovarian cancer              25th quantile   1.07       1.76     0.000     0.3
(cancer vs. normal)         mean            0.736      0.69     0.000     1.2
                            75th quantile   0.995      0.67     0.000     1.8
Lymphoma                    25th quantile   8.99       2.48     0.000     0.0
(DLBC vs. follicular)       mean            0.57       0.70     0.001     0.9
                            75th quantile   0.671      0.64     0.000     1.5
Brain cancer                25th quantile   1.06       0.37     0.004    14.5
(treatment outcome)         mean            1.12       0.33     0.006    16.1
                            75th quantile   0.980      0.24     0.000    23.8
Lymphoma                    25th quantile   1.23       0.41     0.008    11.2
(treatment outcome)         mean            0.943      0.30     0.01     17.0
                            75th quantile   0.872      0.21     0.00     24.3
Breast cancer               25th quantile   0.532      0.14     0.01     24.3
(treatment outcome)         mean            0.485      0.07     0.01     32.3
                            75th quantile   0.429      0.01     0.00     40.4


Table 2. Summary of Results for Datasets Included in the Study

For each dataset: the classification problem; the maximum dataset size; the size at which statistical significance (p-value < 0.05) is achieved; whether a learning curve could be fitted with significant data; the maximum training set size used in fitting the learning curve; the classification error, actual versus extrapolated from the learning curve; and conclusions.

Multiple tissues (cancer vs. normal): maximum size 280; significance at 15; significant learning curve: yes; maximum training set size used in fitting: 210; error 7.0% actual vs. 8.6% extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~10%). Even lower error rates (<7%) are achievable with >400 samples.

Leukemia (AML vs. ALL): maximum size 73; significance at 10; significant learning curve: yes; maximum training set size used in fitting: 35; error 0.0% actual vs. 5.6% extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~5%). Additional samples may lower error to ~0.

Colon cancer (cancer vs. normal): maximum size 62; significance at 10; significant learning curve: yes; maximum training set size used in fitting: 50; error 16.3% actual vs. 15.2% extrapolated. Dataset size is large enough for extrapolation and to achieve an error rate of ~16%. Lower error rates (<9%) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): maximum size 54; significance at 10; significant learning curve: yes; maximum training set size used in fitting: 40; error 5.6% actual vs. 4.8% extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~5%). Additional samples may lower error rate to ~0.

Lymphoma (DLBC vs. follicular): maximum size 53; significance at 5; significant learning curve: yes; maximum training set size used in fitting: 40; error 4.0% actual vs. 4.7% extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~3.5%). Additional samples may lower error rate to ~0.

Brain cancer (treatment outcome): maximum size 60; significance at 45; significant learning curve: no; maximum training set size used in fitting: 40; error 22% actual vs. 29.6% extrapolated. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <14% with >400 samples.

Lymphoma (treatment outcome): maximum size 58; significance at 50; significant learning curve: no; maximum training set size used in fitting: 40; error 23% actual vs. 29.5% extrapolated. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <17% with >400 samples.

Breast cancer (treatment outcome): maximum size 78; significance at 65; significant learning curve: no; maximum training set size used in fitting: 70; error 30% actual vs. 36.3% extrapolated. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <32% with >400 samples.


with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
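For reference, the leave-one-out validation step looks like the sketch below; the data here are a synthetic stand-in, and with the real dataset one would substitute the 280 × 16,063 expression matrix and its tumor/normal labels.

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))              # stand-in for the 280 x 16,063 matrix
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # synthetic tumor/normal labels
acc = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut())
print("leave-one-out error:", 1.0 - acc.mean())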

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attain the same or better error rates, and therefore a p-value of .05 is achieved. To fit the learning curves we will use only data points obtained from training sets of size greater than or equal to 15.

To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is

error(n) = 1.42·n^(−0.52) + 0.0098.

The error rates for the 25th and 75th quantiles are

error25(n) = 1.89·n^(−0.63) + 0.0032,
error75(n) = 1.17·n^(−0.43) + 0.00.

Based on this model one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantiles envelope

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.


FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

is about +/−2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

error(400) = 1.42·(400)^(−0.52) + 0.0098 = 7.3%,
error25(400) = 1.89·(400)^(−0.63) + 0.0032 = 4.7%,
error75(400) = 1.17·(400)^(−0.43) + 0.00 = 8.9%.

The achievable error rate using 400 samples according to the model is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7706·n^(−0.65) + 0.009.

In this case the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than the previous dataset.


FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4798·n^(−0.2797).

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7362·n^(−0.6864).

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset


FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.57·n^(−0.7073) + 0.0006.

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of .05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

error(n) = 1.115·n^(−0.3295) + 0.006.

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of .05) is achieved at about 50 samples. In Fig. 11 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.9431·n^(−0.2957) + 0.01.

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset is 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of .05) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4852·n^(−0.0733) + 0.01.


FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimations of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context it can serve as a valuable

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.


tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 samples range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, ..., z_ℓ) = (1/p) Σ_{(p)} Q(z^p; f_{z^{ℓ−p}}),

where z = (x, y), f_{z^{ℓ−p}} is the classifier constructed with the p samples left out, and Q(z^p; f_{z^{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased,

E L_p(z_1, ..., z_ℓ) = E Q(z; f_{z^{ℓ−p}}),

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969), by the following series of transformations:

E L_1(z_1, ..., z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z^{ℓ−1}}) dP(z_1) ⋯ dP(z_ℓ)
  = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z^{ℓ−1}}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)
  = (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i; f_{z^{ℓ−1}})
  = E Q(z; f_{z^{ℓ−1}}).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
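Statement 1 is easy to check empirically. The Monte Carlo sketch below (a synthetic two-Gaussian class distribution, chosen only for illustration) compares the leave-one-out estimate with the error of the same classifier on an independent test point; the two averages agree up to simulation noise.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
l, trials = 20, 2000
base_y = np.repeat([-1, 1], l // 2)      # balanced labels, so both classes survive deletion
loo, fresh = [], []
for _ in range(trials):
    y = rng.permutation(base_y)
    X = rng.normal(0.8 * y[:, None], 1.0, (l, 2))   # two Gaussian classes
    i = int(rng.integers(l))             # the sample to leave out
    keep = np.delete(np.arange(l), i)
    clf = LinearDiscriminantAnalysis().fit(X[keep], y[keep])
    loo.append(clf.predict(X[i:i + 1])[0] != y[i])
    yn = int(rng.choice([-1, 1]))        # an independent test point
    xn = rng.normal(0.8 * yn, 1.0, (1, 2))
    fresh.append(clf.predict(xn)[0] != yn)
print(np.mean(loo), np.mean(fresh))      # both estimate E Q(z, f_{l-1})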

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

V(L_p(z_1, ..., z_ℓ)) ≤ V(Q(z; f_{z^{ℓ−p}})),

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

V(Q(z; f_{z^{ℓ−1}})) = E[Q(z; f_{z^{ℓ−1}})²] − (E[Q(z; f_{z^{ℓ−1}})])².

The variance of the estimator is

V(L_1(z_1, ..., z_ℓ)) = V((1/ℓ) Σ_{i=1}^{ℓ} t_i),

where t_i indicates whether or not an error is made when the i-th point is left out:

t_i = Q(z_i; f_{(x_1, ..., x_{i−1}, x_{i+1}, ..., x_ℓ)}).

We can rewrite this as

V((1/ℓ) Σ_{i=1}^{ℓ} t_i) = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E Q(z; f_{z^{ℓ−1}}),

so

V((1/ℓ) Σ_{i=1}^{ℓ} t_i) = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[Q(z; f_{z^{ℓ−1}})])².

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z; f_{z^{ℓ−1}})²].

One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²/ℓ + ((ℓ² − ℓ)/ℓ²) t²] = E[Q(z; f_{z^{ℓ−1}})]².


However, t_1, ..., t_ℓ are not independent and are correlated, so

E[Σ_{i≠j} t_i t_j] ≤ E[((ℓ² − ℓ)/ℓ²) t²].

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f_{z^{ℓ−1}}) is P_{D^{ℓ−1}}{t < η}. The cumulative distribution function of the random variable r = L_1(z_1, ..., z_ℓ) is P_{D^ℓ}{r < η}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < η) = ∫_{−∞}^{η} p(t) dt = ∫_{−∞}^{∞} θ(η − t) p(t) dt.

The distribution function for the random variable r D 1`

P`iD1 ri can be written as follows by a similar

sequence of transformations as used in the proof of statement 1

P r lt raquo D 1`

X

iD1

Z 1

iexcl1rimicroraquo iexcl ripridri

which is the same as P(t < τ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a finite sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from that sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

P{√ℓ sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
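As a practical illustration of this result, the sketch below (our own construction; the function name and stand-in data are hypothetical) computes the 25th and 75th empirical quantiles from a set of subsampled error rates and attaches a distribution-free accuracy bound on the empirical CDF. For simplicity it keeps only the k = 1 term of the series above, which gives the familiar bound P{sup_x |F(x) − F_ℓ(x)| > ε} ≤ 2 e^{−2ℓε²} and hence ε = sqrt(ln(2/δ)/(2ℓ)).

    import numpy as np

    def quantile_band(error_rates, delta=0.05):
        """25th/75th empirical quantiles of subsampled error rates, plus a
        distribution-free bound on how far the empirical CDF can sit from
        the true one with probability 1 - delta (first Kolmogorov-Smirnov term)."""
        e = np.asarray(error_rates, dtype=float)
        ell = len(e)
        q25, q75 = np.quantile(e, [0.25, 0.75])
        # P(sup_x |F(x) - F_ell(x)| > eps) <= 2 exp(-2 ell eps^2)  =>
        eps = np.sqrt(np.log(2.0 / delta) / (2.0 * ell))
        return q25, q75, eps

    # Example: 50 subsampled error rates at one training-set size.
    rng = np.random.default_rng(1)
    rates = rng.beta(4, 16, size=50)      # stand-in for observed error rates e_{n,i}
    q25, q75, eps = quantile_band(rates)
    print(f"25th={q25:.3f}  75th={q75:.3f}  CDF within +/-{eps:.3f} w.p. 0.95")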


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, i.e., the space of all hypotheses consistent with the data) the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
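The teacher-student picture can be reproduced numerically. The toy sketch below is our own and is not the paper's SVM experiment: it trains a Hebb-rule student on random Gaussian patterns labeled by a fixed teacher perceptron. Under the Gaussian-pattern assumption the generalization error of a student w has the closed form arccos(overlap with t)/π, so no test set is needed, and the measured error should decay roughly as n^{−1/2}, the Hebb-rule scaling quoted above (the fitted exponent approaches 1/2 from below as n/m grows).

    import numpy as np

    # Toy teacher-student simulation; all parameter choices are assumptions.
    rng = np.random.default_rng(2)
    m = 25                                        # number of attributes
    t = rng.normal(size=m)
    t /= np.linalg.norm(t)                        # teacher weight vector, |t| = 1

    def hebb_generalization_error(n, reps=200):
        """Mean generalization error of a Hebb-rule student trained on n random
        Gaussian patterns labeled by the teacher perceptron."""
        errs = []
        for _ in range(reps):
            Z = rng.normal(size=(n, m))
            y = np.sign(Z @ t)
            w = (y[:, None] * Z).sum(axis=0)      # Hebbian learning rule
            overlap = (w @ t) / np.linalg.norm(w)
            errs.append(np.arccos(np.clip(overlap, -1.0, 1.0)) / np.pi)
        return float(np.mean(errs))

    ns = np.array([100, 200, 400, 800, 1600])
    es = np.array([hebb_generalization_error(n) for n in ns])
    slope, _ = np.polyfit(np.log(ns), np.log(es), 1)
    print(dict(zip(ns.tolist(), np.round(es, 3))))
    print("fitted decay exponent (tends to 1/2 for the Hebb rule):", -slope)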

2b A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function

that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{−α}),

where α > 0. Worst case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.
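Both motivations above argue for fitting e(n) = a n^{−α} + b to the observed error rates. The following minimal sketch of the fitting procedure described in Section 3 (fix b, solve the resulting log-linear least-squares problem, and line-search over b) is our own implementation; the function name and noise level are assumptions, and the synthetic check simply re-fits points drawn from the tumor-versus-normal curve reported earlier in the paper.

    import numpy as np

    def fit_inverse_power_law(n, e, num_b=200):
        """Fit e(n) = a * n**(-alpha) + b by a line search over b: for fixed b,
        log(e - b) is linear in log(n), so a and alpha come from least squares.
        The best (a, alpha, b) is chosen by residuals in the original scale."""
        n, e = np.asarray(n, float), np.asarray(e, float)
        best = None
        for b in np.linspace(0.0, e.min() - 1e-6, num_b):
            slope, intercept = np.polyfit(np.log(n), np.log(e - b), 1)
            a, alpha = np.exp(intercept), max(-slope, 0.0)
            sse = float(((a * n ** (-alpha) + b - e) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, a, alpha, b)
        return best[1:]

    # Synthetic check: error rates drawn from the paper's tumor-vs-normal curve,
    # e(n) = 1.42 n^{-0.52} + 0.0098, perturbed by a little assumed noise.
    rng = np.random.default_rng(3)
    ns = np.array([30, 40, 50, 60, 80, 90, 130, 170, 210])
    es = 1.42 * ns ** -0.52 + 0.0098 + rng.normal(0.0, 0.003, ns.size)
    a, alpha, b = fit_inverse_power_law(ns, es)
    print(f"a = {a:.2f}, alpha = {alpha:.2f}, b = {b:.4f}")
    print("extrapolated error at n = 400:", a * 400.0 ** (-alpha) + b)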

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.

Address correspondence to:

Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu



(see Watkin [1993] for a review) The dependence on n is thus approximately niexclreg with reg near 1 forthe more ef cient rules such as the optimal perceptron and Bayes optimal classi er The SM approach togeneralization has also been used to quantify the effects of input noise output noise and noise affectingthe parameters in the model (eg the weights w) Thus for example white noise added to examples inthe training set appears as an additive constant term to the generalization error (justifying the b term inEquation (1)) In summary then this approach strongly motivates use of en D aniexclreg C b for modelingthe generalization error

2b A motivation from an approximation theory point of view

Another justi cation for a power law for regression or classi cation comes from approximation theory(Niyogi et al 1996) In the approximation theory framework the classi cation functions come from somerestricted function class f 2 H and the optimal classi cation function fT is a more complicated function

ESTIMATING DATASET SIZE REQUIREMENTS 141

that is not included in the function class H For a wide variety of algorithms the distance between theoptimal function in the function class fo 2 H and fT is characterized as

dfo fT D Oniexclreg

where reg gt 0 Worst case analyses measure the distance between the two functions as the value of thepoint of greatest deviation between these functions

For function classes used in most algorithms the worst case analysis yields reg D 5 In general anempirical result should have quicker convergence since the worst case assumptions need not be madeWhen the loss function V cent cent is smooth then the difference in error measured using the loss functionbetween the functions fo and fT is V fox y iexcl V fT x y D Oniexclreg for all x 2 X and y 2 Y By asmooth loss function we mean loss functions that are `P with p cedil 1 or Lipschitz over a bounded domainNote that the classi cation loss V f x y D microiexclyf x is not Lipschitz and when the classi er outputsf x D sect1 the loss function is `0 However for most algorithms the loss function optimized to set theparameters of the classi er is Lipschitz (for computational reasons the `0 loss is not used) For examplein Support Vector Machines for classi cation the loss function is Lipschitz For this reason this analysisis still appropriate

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group WhiteheadMIT Center for GenomeResearch and the Golub Laboratory Dana-Farber Cancer Institute for many valuable discussions Thiswork is supported in part by grants from Affymetrix Inc Millennium Pharmaceuticals Inc Bristol-MyersSquibb Company and the Of ce of Science (BER) US Department of Energy Grant No DE-FG02-01ER63185 Sayan Mukherjee was supported by a SLOANDOE postdoctoral fellowship

REFERENCES

Adcock C 1997 Sample size determination a review Statistician 46 261ndash283Alizadeh AA Eisen MB Davis RE Ma C Lossos IS Rosenwald A Boldrick JC Sabet H Tran T Yu

X Powell JI Yang L Marti GE Moore T Hudson J Jr Lu L Lewis DB Tibshirani R Sherlock GChan WC Greiner TC Weisenburger DD Armitage JO Warnke R and Staudt LM 2000 Distinct typesof diffuse large B-cell lymphoma identi ed by gene expression pro ling Nature 403 503ndash511

Anderson J 1983 Architecture of Cognition Harvard University Press Cambridge MABhattacharjeeA Richards WG Staunton J Li C Monti S Vasa P Ladd C Beheshti J Bueno R Gillette M

Loda M Weber G Mark EJ Lander ES Wong W Johnson BE Golub TR Sugarbaker DJ MeyersonM 2001 Classi cation of human lung carcinomas by mRNA expression pro ling reveals distinct adenocarcinomasubclasses Proc Natl Acad Sci USA 98 13790ndash13795

Chapelle O Vapnik V Bousquet O Mukherjee S 2001 Choosing multiple parameters for support vector machinesMachine Learning 46 131ndash159

Chipping Forecast 1999 Nature Genetics vol 21 supplementCortes C Jackel L Solla S Vapnik V and Denker S 1994 Asymptotic values and rates of convergence In

Advances in Neural Information Processing Systems VI Morgan Kaufmann San Francisco CADevoye L Gyor L and Lugosi G 1997 A Probabilistic Theory of Pattern Recognition Springer-Verlag New

YorkDietrich R Opper M and Sompolinsky H 1999 Statistical mechanics of support vector networks Phys Rev Letters

82 2975ndash2978Dietrich R Opper M and Sompolinsky H 2000 Support vectors and statistical mechanics In Smola AJ and

Bartlett PJ eds Advances in Large Margin Classi ers 359ndash368 MIT Press Cambridge MAEngel A and Van den Broeck C 2001 Statistical Mechanics of Machine Learning Cambridge University PressGolub TR Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh ML Downing

JR Caligiuri MA Bloom eld CD Lander ES 1999 Molecular classi cation of cancer Class discovery andclass prediction by gene expression monitoring Science 286 531ndash537

Golland P Mukherjee S Tamayo P and Fischl B 2002 Permutation tests for classi cation Submitted to Advancesin Neural Information Proceedings

142 MUKHERJEE ET AL

Good P 1994 Permutation Tests A Practical Guide to Resampling Methods for Testing Hypothesis Springer-VerlagNew York

Guyon I Makhoul J Schwartz R and Vapnik V 1998 What size test set gives good error estimates IEEETransactions on Pattern Analysis and Machine Intelligence 20 52ndash64

Hertz J Krogh A and Palmer R 1991 Introduction to the Theory of Neural Computation Addison-Wesley BostonMA

Kerr A and Churchill G 2001a Experimental design for gene expression microarrays Biostatistics 2 183ndash201Kerr A and Churchill G 2001b Statistical design and the analysis of gene expression microarrays Genetic Research

77 123ndash128Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays Model validation design issues and

standard error application Genome Biol 2(8) research 00321-003211Luntz A and Brailovsky V 1969 On estimation of characters obtained in statistical procedure of recognition

Technicheskaya Kibernetica 3 (in Russian)Niyogi P and Girosi F 1996 On the relationship between generalization error hypothesis complexity and sample

complexity for radial basis functions Neural Computation 8 819ndash842Noterman D Alon U Sierk A Levine A 2001 Transcriptional gene expression pro les for colorectal adenoma

adenocarcinoma and normal tissue examined by oligonucleotide array Cancer Res 61 3124ndash3130Opper M and Haussler D 1991 Generalization performance of Bayes optimal classi cation algorithms for learning

a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer

Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics

23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black

P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442

Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154

Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan

B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385

Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74

Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416

Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York

Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557

Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov

JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322

Address correspondence toSayan Mukherjee

Room 201 Building E-2545 Carleton Street

Cambridge MA 02139

E-mail sayangenomewimitedu

Page 4: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

122 MUKHERJEE ET AL

As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.

We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile error of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and the 25th and 75th error quantiles for a range of training set sizes. For a dataset of size ℓ, with fixed training set size n and test set size ℓ − n, T_1 train/test realizations were constructed by subsampling the dataset. For each of these T_1 realizations an error rate e_{n,i} is computed, and the average of the sequence {e_{n,1}, ..., e_{n,T_1}},

    ē_n = (1/T_1) Σ_{i=1}^{T_1} e_{n,i},

is used as an estimate of the true average error e(n). The average error rate is an unbiased estimator of the true error rate of a classifier trained with n samples:

    E_{D_ℓ} [ (1/T_1) Σ_{i=1}^{T_1} e_{n,i} ] = E_{D_n}(e_n),

where E_{D_n}(e_n) is the probability of classifying a new sample incorrectly when the classifier was trained with n samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T_1}} were also fitted to an inverse power law, so that we could also estimate the variation in error rates as a function of dataset size. The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T_1}} are good approximations of the 25th and 75th quantiles of error rates of classifiers trained with n samples (see Appendix 1c for proof and details). We did not use the variance of the error rates e_{n,i}, because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with n samples and tested on a new sample. Indeed, one can prove that the variance of this statistic is in general optimistic: the variance of the error rates e_{n,i} is less than the variance of classifiers trained with n samples and tested on a new sample (see Appendix 1b for proof and details).

As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.

FIG. 2. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the classifier with randomization of labels.

Fitting the parameters of the learning curve by minimizing

    min_{α,a,b} Σ_{l=1}^{M} (a n_l^{−α} + b − e_{n_l})², subject to α, a, b ≥ 0,

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

    min_{α,a} Σ_{l=1}^{M} (ln a − α ln n_l − ln(e_{n_l} − b))², subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by a line search, gives us our estimate of α, a, and b.
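A minimal sketch of this two-level fit, assuming NumPy and illustrative (not paper) data; the grid over b and the variable names are hypothetical choices, not the authors' implementation:

    import numpy as np

    def fit_learning_curve(n, e):
        """Fit e(n) ~ a * n**(-alpha) + b (a, alpha, b >= 0) by a line search
        over b combined with a log-linear least-squares fit for a and alpha.
        Assumes all observed error rates are positive."""
        n, e = np.asarray(n, float), np.asarray(e, float)
        best = None
        for b in np.linspace(0.0, 0.9 * e.min(), 100):   # candidate asymptotic errors
            y = np.log(e - b)                            # ln(e - b) = ln(a) - alpha * ln(n)
            X = np.column_stack([np.ones_like(n), -np.log(n)])
            (ln_a, alpha), *_ = np.linalg.lstsq(X, y, rcond=None)
            if alpha < 0:
                continue                                 # enforce alpha >= 0
            a = np.exp(ln_a)
            resid = np.sum((a * n ** -alpha + b - e) ** 2)  # objective of the original problem
            if best is None or resid < best[0]:
                best = (resid, a, alpha, b)
        return best[1:]

    # illustrative numbers only (not from the paper):
    sizes  = [30, 40, 50, 60, 80, 90, 130, 170, 210]
    errors = [0.23, 0.21, 0.20, 0.18, 0.16, 0.155, 0.135, 0.125, 0.115]
    a, alpha, b = fit_learning_curve(sizes, errors)
    print(f"error(n) ~ {a:.2f} n^(-{alpha:.2f}) + {b:.3f};",
          "extrapolated at n = 400:", round(a * 400 ** -alpha + b, 3))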

As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

    H_0: p(y = 1 | x, {(x_1, y_1), ..., (x_n, y_n)}) = p(y = −1 | x, {(x_1, y_1), ..., (x_n, y_n)});

the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or −1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y = {1, −1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_{n,i} was computed, we construct T_2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test set. This results in a set of error rates e_{n,i,j} for training set size n. From these error rates we construct an empirical distribution function for the random classifier:

    P^{ran}_n(x) = (1/(T_1 × T_2)) Σ_{i=1}^{T_1} Σ_{j=1}^{T_2} θ(x − e_{n,i,j}),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P^{ran}_n(ē_n), which is the percentage of random classifiers with an error rate smaller than ē_n. The procedure is illustrated with the following two examples for a fictitious dataset.

Example 1. ē_15 = 37% (error rate of the classifier with 15 samples); e^{ran}_{15,i,j} = {21.5%, 26.0%, 29.0%, 32.0%, 36.6%, 38.8%, 39.5%, 40.8%, 42.0%, 49.5%} (error rates of the random classifiers with 15 samples); P^{ran}_{15}(ē_15) = 0.50. Since the p-value is greater than .05, the error rate of the classifier is not statistically significant (see Fig. 2a).

Example 2. ē_45 = 1% (error rate of the classifier with 45 samples); e^{ran}_{45,i,j} = {20.5%, 27.0%, 33.3%, 33.7%, 37.0%, 39.2%, 39.9%, 40.6%, 42.5%, 49.9%} (error rates of the random classifiers with 45 samples); P^{ran}_{45}(ē_45) = 0. Since the p-value is less than .05, the error rate of the classifier is statistically significant (see Fig. 2b).
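In code, the significance calculation is a one-line empirical CDF evaluation; the two fictitious examples above can be checked directly with this minimal illustration:

    def significance(e_bar, e_random):
        # P_n^ran(e_bar): fraction of random-classifier error rates <= observed mean error
        return sum(e <= e_bar for e in e_random) / len(e_random)

    e15_random = [21.5, 26.0, 29.0, 32.0, 36.6, 38.8, 39.5, 40.8, 42.0, 49.5]
    e45_random = [20.5, 27.0, 33.3, 33.7, 37.0, 39.2, 39.9, 40.6, 42.5, 49.9]
    print(significance(37.0, e15_random))   # 0.5 -> not significant at the .05 level
    print(significance(1.0, e45_random))    # 0.0 -> significant at the .05 level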

FIG. 1. Dataset size estimation statistical methodology.

The detailed description of the entire methodology for a two-class problem is as follows (a code sketch follows the outline):

1) Subsampling and significance permutation test
   A. Subsampling procedure
      i. Given ℓ_{c1} samples from class 1 and ℓ_{c2} samples from class 2, the total number of samples is ℓ = ℓ_{c1} + ℓ_{c2}, where ℓ_{c1}, ℓ_{c2} ≥ 10.
      ii. Select 10 training set sizes n_1, ..., n_j, ..., n_10 over the interval [10, ℓ − 10].
          1. For each training set size n_j, run the following subsampling procedure T_1 = 50 times, indexed by i = 1, ..., T_1:
             a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_{c2}/n_{c1} ≈ ℓ_{c2}/ℓ_{c1}, where n_{c1} and n_{c2} are the number of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{n,i}.
             b. Train a classifier on each of the training sets and measure its error rate on the corresponding test set; call each of these error rates e_{n,i}.
   B. Permutation test
      i. For each subsampled train/test split S_{n,i}, run the following permutation procedure T_2 > 50 times, indexed by j = 1, ..., T_2:
         1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S^{ran}_{n,i,j}.
         2. Train a classifier on the permuted training set and measure its error on the test set; call this error rate e^{ran}_{n,i,j}.
   C. Significance calculation
      1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P^{ran}_n(x) = (1/(T_1 × T_2)) Σ_{i=1}^{T_1} Σ_{j=1}^{T_2} θ(x − e^{ran}_{n,i,j}), where θ(z) = 1 if z ≥ 0 and 0 otherwise.
      2. Given the above empirical distribution function, compute for each ē_n the value t_n = P^{ran}_n(ē_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.

2) Learning curves and training set size estimation
   A. Assume the subsampling procedure was run for M different sample sizes n, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T_1) Σ_{i=1}^{T_1} e_{n,i}, and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T_1}}.
   B. Use the above quantities to fit the following learning curves:
      i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the minimization procedure min_{α,a,b} Σ_{l=1}^{M} (a n_l^{−α} + b − ē_{n_l})², subject to α, a, b ≥ 0; designate the resulting values as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

         L_m(n) = a_m n^{−α_m} + b_m.

      ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T_1}}.
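A compact sketch of steps 1A–1C under stated assumptions: scikit-learn's LinearSVC stands in for the SVM, the class-ratio matching of step 1A is simplified to a plain random split, and T2 = 60 satisfies T_2 > 50. This is an illustrative rendering, not the authors' original implementation:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def subsample_significance(X, y, train_sizes, T1=50, T2=60, p=0.05):
        """For each training set size: mean and quantile error rates from T1
        random train/test splits, plus a permutation-based significance t_n."""
        out = {}
        for n in train_sizes:
            errs, perm_errs = [], []
            for _ in range(T1):
                idx = rng.permutation(len(y))            # class-ratio matching omitted
                tr, te = idx[:n], idx[n:]
                clf = LinearSVC().fit(X[tr], y[tr])
                errs.append(np.mean(clf.predict(X[te]) != y[te]))
                for _ in range(T2):                      # permute training labels only
                    pclf = LinearSVC().fit(X[tr], rng.permutation(y[tr]))
                    perm_errs.append(np.mean(pclf.predict(X[te]) != y[te]))
            e_bar = float(np.mean(errs))
            t_n = float(np.mean(np.asarray(perm_errs) <= e_bar))
            out[n] = dict(mean=e_bar,
                          q25=float(np.percentile(errs, 25)),
                          q75=float(np.percentile(errs, 75)),
                          p_value=t_n,
                          significant=t_n < p)
        return out   # step 2 then fits the power law to the significant sizes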

4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of complexity of classification. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcome. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10–20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40–60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50–70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.

Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)

Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Dataset (problem)            Curve            Learning   Decay     Asymptotic   Predicted error
                                              rate a     rate α    Bayes        rate at 400
                                                                   error b      samples
Multiple tissues             25th quantile    1.89       0.63      0.003         4.7%
(cancer vs. normal)          mean             1.42       0.52      0.010         7.3%
                             75th quantile    1.17       0.43      0.000         8.9%
Leukemia                     25th quantile    3.98       2.3       0.006         0.6%
(AML vs. ALL)                mean             0.77       0.65      0.009         2.5%
                             75th quantile    0.88       0.60      0.010         3.4%
Colon cancer                 25th quantile    0.44       0.37      0.000         4.8%
(cancer vs. normal)          mean             0.48       0.28      0.000         9.0%
                             75th quantile    0.50       0.23      0.000        12.4%
Ovarian cancer               25th quantile    1.07       1.76      0.003         0.3%
(cancer vs. normal)          mean             0.74       0.69      0.000         1.2%
                             75th quantile    1.00       0.67      0.000         1.8%
Lymphoma                     25th quantile    8.99       2.48      0.000         0.0%
(DLBC vs. follicular)        mean             0.57       0.70      0.001         0.9%
                             75th quantile    0.67       0.64      0.000         1.5%
Brain cancer                 25th quantile    1.06       0.37      0.004        14.5%
(treatment outcome)          mean             1.12       0.33      0.006        16.1%
                             75th quantile    0.98       0.24      0.000        23.8%
Lymphoma                     25th quantile    1.23       0.41      0.008        11.2%
(treatment outcome)          mean             0.94       0.30      0.010        17.0%
                             75th quantile    0.87       0.21      0.000        24.3%
Breast cancer                25th quantile    0.53       0.14      0.001        24.3%
(treatment outcome)          mean             0.49       0.07      0.010        32.3%
                             75th quantile    0.43       0.01      0.000        40.4%

Table 2. Summary of Results for Datasets Included in the Study

Multiple tissues (cancer vs. normal). Dataset size 280; statistical significance (p-value < .05) achieved at 15 samples; learning curve fitted with significant data: yes; maximum training set size used in fitting: 210; classification error, actual vs. extrapolated: 7.0% vs. 8.6%. Conclusions: dataset size is large enough for extrapolation and to achieve a low error rate (~10%). Even lower error rates (<7%) are achievable with >400 samples.

Leukemia (AML vs. ALL). Dataset size 73; significance achieved at 10 samples; fitted with significant data: yes; maximum training set size 35; error, actual vs. extrapolated: ~0% vs. 5.6%. Conclusions: dataset size is large enough for extrapolation and to achieve a low error rate (~5%). Additional samples may lower error to ~0.

Colon cancer (cancer vs. normal). Dataset size 62; significance achieved at 10 samples; fitted with significant data: yes; maximum training set size 50; error, actual vs. extrapolated: 16.3% vs. 15.2%. Conclusions: dataset size is large enough for extrapolation and to achieve an error rate of ~16%. Lower error rates (<9%) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal). Dataset size 54; significance achieved at 10 samples; fitted with significant data: yes; maximum training set size 40; error, actual vs. extrapolated: 5.6% vs. 4.8%. Conclusions: dataset size is large enough for extrapolation and to achieve a low error rate (~5%). Additional samples may lower the error rate to ~0.

Lymphoma (DLBC vs. follicular). Dataset size 53; significance achieved at 5 samples; fitted with significant data: yes; maximum training set size 40; error, actual vs. extrapolated: 4.0% vs. 4.7%. Conclusions: dataset size is large enough for extrapolation and to achieve a low error rate (~3.5%). Additional samples may lower the error rate to ~0.

Brain cancer (treatment outcome). Dataset size 60; significance achieved at 45 samples; fitted with significant data: no; maximum training set size 40; error, actual vs. extrapolated: 22% vs. 29.6%. Conclusions: dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <14% with >400 samples.

Lymphoma (treatment outcome). Dataset size 58; significance achieved at 50 samples; fitted with significant data: no; maximum training set size 40; error, actual vs. extrapolated: 23% vs. 29.5%. Conclusions: dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <17% with >400 samples.

Breast cancer (treatment outcome). Dataset size 78; significance achieved at 65 samples; fitted with significant data: no; maximum training set size 70; error, actual vs. extrapolated: 30% vs. 36.3%. Conclusions: dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <32% with >400 samples.

with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rate, and therefore a p-value of .05 is achieved. To fit the learning curves we will use only data points obtained from training sets of size greater than or equal to 15.

To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to (a) all training sample sizes stated above (up to 210) and (b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n estimated using training set sizes up to 210 is

    error(n) = 1.42n^{−0.52} + 0.0098.

The error rates for the 25th and 75th quantiles are

    error_25(n) = 1.89n^{−0.63} + 0.0032,
    error_75(n) = 1.17n^{−0.43} + 0.00.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope is about +/−2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

    error(400) = 1.42(400^{−0.52}) + 0.0098 = 7.3%,
    error_25(400) = 1.89(400^{−0.63}) + 0.0032 = 4.7%,
    error_75(400) = 1.17(400^{−0.43}) + 0.00 = 8.9%.

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.

FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.7706n^{−0.65} + 0.009.

In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than the previous dataset.

FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.4798n^{−0.2797}.

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.7362n^{−0.6864}.

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.57n^{−0.7073} + 0.0006.

FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of .05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

    error(n) = 1.115n^{−0.3295} + 0.006.

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of .05) is achieved at about 50 samples. In Fig. 11 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.9431n^{−0.2957} + 0.01.

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of .05) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.4852n^{−0.0733} + 0.01.

FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler," problems than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 samples range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

    L_p(z_1, ..., z_ℓ) = (1/p) Σ_p Q(z_p; f(z_{ℓ−p})),

where z = (x, y), f(z_{ℓ−p}) is the classifier constructed with p samples left out, and Q(z_p; f(z_{ℓ−p})) is the error of this classifier on the p samples left out. This procedure is unbiased:

    E L_p(z_1, ..., z_ℓ) = E Q(z; f(z_{ℓ−p})),

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969), by the following series of transformations:

    E L_1(z_1, ..., z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f(z_{ℓ−1})) dP(z_1) ... dP(z_ℓ)
                         = (1/ℓ) Σ_{i=1}^{ℓ} ∫ Q(z_i; f(z_{ℓ−1})) dP(z_i) dP(z_1) ... dP(z_{i−1}) dP(z_{i+1}) ... dP(z_ℓ)
                         = (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i; f(z_{ℓ−1}))
                         = E Q(z; f(z_{ℓ−1})).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance of the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

    V[L_p(z_1, ..., z_ℓ)] ≤ V[Q(z; f(z_{ℓ−p}))],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

    V[Q(z; f(z_{ℓ−1}))] = E[Q(z; f(z_{ℓ−1}))²] − [E Q(z; f(z_{ℓ−1}))]².

The variance of the estimator is

    V[(1/ℓ) L_1(z_1, ..., z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether an error is made or not when the ith point is left out:

    t_i = Q(z_i; f(x_1, ..., x_{i−1}, x_{i+1}, ..., x_ℓ)).

We can rewrite this as

    V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From statement 1 we know

    E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E Q(z; f(z_{ℓ−1})).

So

    V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E Q(z; f(z_{ℓ−1}))]².

One can show that

    E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z; f(z_{ℓ−1}))²].

One can write

    E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

    E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²/ℓ + ((ℓ² − ℓ)/ℓ²) t²] = E[Q(z; f(z_{ℓ−1}))²].

However, t_1, ..., t_ℓ are not independent but correlated, so

    E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ E[((ℓ² − ℓ)/ℓ²) t²].

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
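As a quick numerical illustration of statements 1 and 2, consider the following toy Monte Carlo sketch (hypothetical setup, not from the paper): with a simple nearest-mean rule on Gaussian data, the average of the leave-one-out estimator matches the expected error of a classifier trained on ℓ − 1 samples, while its variance across datasets is smaller than that of a single held-out test evaluation.

    import numpy as np

    rng = np.random.default_rng(1)
    ell, trials = 20, 2000

    def draw(n):
        # balanced +/-1 labels, two Gaussian classes in 5 dimensions
        y = rng.permutation(np.resize([1, -1], n))
        X = rng.normal(0.0, 1.0, (n, 5)) + 0.6 * y[:, None]
        return X, y

    def nearest_mean_error(Xtr, ytr, Xte, yte):
        mu_p, mu_n = Xtr[ytr == 1].mean(0), Xtr[ytr == -1].mean(0)
        d_p = ((Xte - mu_p) ** 2).sum(1)
        d_n = ((Xte - mu_n) ** 2).sum(1)
        return (np.where(d_p < d_n, 1, -1) != yte).mean()

    loo_avgs, one_shot = [], []
    for _ in range(trials):
        X, y = draw(ell)
        # leave-one-out estimator on this dataset
        loo = [nearest_mean_error(np.delete(X, i, 0), np.delete(y, i),
                                  X[i:i + 1], y[i:i + 1]) for i in range(ell)]
        loo_avgs.append(np.mean(loo))
        # classifier trained on ell - 1 samples, tested on one new sample
        yn = rng.choice([1, -1], 1)
        Xn = rng.normal(0.0, 1.0, (1, 5)) + 0.6 * yn[:, None]
        one_shot.append(nearest_mean_error(X[1:], y[1:], Xn, yn))

    print("means:", np.mean(loo_avgs), np.mean(one_shot))    # nearly equal (statement 1)
    print("variances:", np.var(loo_avgs), np.var(one_shot))  # LOO variance smaller (statement 2)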

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f(z_{ℓ−1})) is P_{D_{ℓ−1}}{t < τ}. The cumulative distribution function of the random variable r = (1/ℓ) L_1(z_1, ..., z_ℓ) is P_{D_ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

    P(t < τ) = ∫_{−∞}^{τ} p(t) dt = ∫_{−∞}^{∞} θ(τ − t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of statement 1:

    P(r < τ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} θ(τ − r_i) p(r_i) dr_i,

which is the same as P(t < τ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

    P{√ℓ sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
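Numerically, the Kolmogorov–Smirnov band above is easy to evaluate; a small illustrative sketch (assuming SciPy is available, and using scipy.special.kolmogi, the inverse of the asymptotic Kolmogorov distribution) gives the sup-norm deviation bound that holds with probability 1 − δ:

    import numpy as np
    from scipy.special import kolmogi

    def ks_band(ell, delta=0.05):
        """Half-width of the sup-norm band: with probability ~ 1 - delta (ell > 20),
        sup_x |F(x) - F_ell(x)| < kolmogi(delta) / sqrt(ell)."""
        return kolmogi(delta) / np.sqrt(ell)

    # e.g., with 50 subsampled error rates, every empirical quantile level is
    # within this distance of the true CDF with ~95% probability:
    print(ks_band(50))   # about 0.19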

APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

    y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

    Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point or center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as 0.44n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24n^{−1/2}, and for the Hebb rule as 0.40n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term to the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = an^{−α} + b for modeling the generalization error.
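The teacher-student setup above is straightforward to simulate. The toy sketch below is illustrative only (it uses a plain perceptron update rather than the optimal perceptron of the analyses cited above): it counts disagreements with the θ[−(t · z_i)(w · z_i)] rule and shows the error falling roughly as a power of n.

    import numpy as np

    rng = np.random.default_rng(2)
    m = 50                                        # number of attributes

    t = rng.normal(size=m)                        # teacher weight vector

    def generalization_error(w, trials=20000):
        Z = rng.normal(size=(trials, m))          # new random patterns
        return np.mean((Z @ t) * (Z @ w) < 0)     # theta[-(t.z)(w.z)], averaged

    for n in [25, 50, 100, 200, 400, 800]:
        Z = rng.normal(size=(n, m))
        y = np.sign(Z @ t)                        # labels from the teacher rule
        w = np.zeros(m)
        for _ in range(100):                      # perceptron training sweeps
            for zi, yi in zip(Z, y):
                if yi * (w @ zi) <= 0:
                    w += yi * zi
        print(n, generalization_error(w))         # decays roughly like n^(-alpha)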

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

    d(f_o, f_T) = O(n^{−α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are L_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−yf(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is L_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the L_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group WhiteheadMIT Center for GenomeResearch and the Golub Laboratory Dana-Farber Cancer Institute for many valuable discussions Thiswork is supported in part by grants from Affymetrix Inc Millennium Pharmaceuticals Inc Bristol-MyersSquibb Company and the Of ce of Science (BER) US Department of Energy Grant No DE-FG02-01ER63185 Sayan Mukherjee was supported by a SLOANDOE postdoctoral fellowship

REFERENCES

Adcock C 1997 Sample size determination a review Statistician 46 261ndash283Alizadeh AA Eisen MB Davis RE Ma C Lossos IS Rosenwald A Boldrick JC Sabet H Tran T Yu

X Powell JI Yang L Marti GE Moore T Hudson J Jr Lu L Lewis DB Tibshirani R Sherlock GChan WC Greiner TC Weisenburger DD Armitage JO Warnke R and Staudt LM 2000 Distinct typesof diffuse large B-cell lymphoma identi ed by gene expression pro ling Nature 403 503ndash511

Anderson J 1983 Architecture of Cognition Harvard University Press Cambridge MABhattacharjeeA Richards WG Staunton J Li C Monti S Vasa P Ladd C Beheshti J Bueno R Gillette M

Loda M Weber G Mark EJ Lander ES Wong W Johnson BE Golub TR Sugarbaker DJ MeyersonM 2001 Classi cation of human lung carcinomas by mRNA expression pro ling reveals distinct adenocarcinomasubclasses Proc Natl Acad Sci USA 98 13790ndash13795

Chapelle O Vapnik V Bousquet O Mukherjee S 2001 Choosing multiple parameters for support vector machinesMachine Learning 46 131ndash159

Chipping Forecast 1999 Nature Genetics vol 21 supplementCortes C Jackel L Solla S Vapnik V and Denker S 1994 Asymptotic values and rates of convergence In

Advances in Neural Information Processing Systems VI Morgan Kaufmann San Francisco CADevoye L Gyor L and Lugosi G 1997 A Probabilistic Theory of Pattern Recognition Springer-Verlag New

YorkDietrich R Opper M and Sompolinsky H 1999 Statistical mechanics of support vector networks Phys Rev Letters

82 2975ndash2978Dietrich R Opper M and Sompolinsky H 2000 Support vectors and statistical mechanics In Smola AJ and

Bartlett PJ eds Advances in Large Margin Classi ers 359ndash368 MIT Press Cambridge MAEngel A and Van den Broeck C 2001 Statistical Mechanics of Machine Learning Cambridge University PressGolub TR Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh ML Downing

JR Caligiuri MA Bloom eld CD Lander ES 1999 Molecular classi cation of cancer Class discovery andclass prediction by gene expression monitoring Science 286 531ndash537

Golland P Mukherjee S Tamayo P and Fischl B 2002 Permutation tests for classi cation Submitted to Advancesin Neural Information Proceedings

142 MUKHERJEE ET AL

Good P 1994 Permutation Tests A Practical Guide to Resampling Methods for Testing Hypothesis Springer-VerlagNew York

Guyon I Makhoul J Schwartz R and Vapnik V 1998 What size test set gives good error estimates IEEETransactions on Pattern Analysis and Machine Intelligence 20 52ndash64

Hertz J Krogh A and Palmer R 1991 Introduction to the Theory of Neural Computation Addison-Wesley BostonMA

Kerr A and Churchill G 2001a Experimental design for gene expression microarrays Biostatistics 2 183ndash201Kerr A and Churchill G 2001b Statistical design and the analysis of gene expression microarrays Genetic Research

77 123ndash128Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays Model validation design issues and

standard error application Genome Biol 2(8) research 00321-003211Luntz A and Brailovsky V 1969 On estimation of characters obtained in statistical procedure of recognition

Technicheskaya Kibernetica 3 (in Russian)Niyogi P and Girosi F 1996 On the relationship between generalization error hypothesis complexity and sample

complexity for radial basis functions Neural Computation 8 819ndash842Noterman D Alon U Sierk A Levine A 2001 Transcriptional gene expression pro les for colorectal adenoma

adenocarcinoma and normal tissue examined by oligonucleotide array Cancer Res 61 3124ndash3130Opper M and Haussler D 1991 Generalization performance of Bayes optimal classi cation algorithms for learning

a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer

Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics

23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black

P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442

Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154

Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan

B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385

Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74

Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416

Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York

Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557

Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov

JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322

Address correspondence toSayan Mukherjee

Room 201 Building E-2545 Carleton Street

Cambridge MA 02139

E-mail sayangenomewimitedu

Page 5: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 123

Fitting the parameters of the learning curve by minimizing

    min_{α,a,b} Σ_{l=1}^{M} (a·n_l^(−α) + b − e_{n_l})²   subject to α, a, b ≥ 0

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

    min_{α,a} Σ_{l=1}^{M} (ln a − α ln n_l − ln(e_{n_l} − b))²   subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by a line search over b, gives us our estimate of α, a, and b.
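This two-step fit is easy to implement; the following is a minimal sketch in Python (NumPy assumed; the function name, the grid of candidate b values, and the crude nonnegativity projection are illustrative choices, not part of the original procedure):

```python
import numpy as np

def fit_learning_curve(n, e, num_b=200):
    """Fit e(n) ~ a * n**(-alpha) + b by a line search over b.

    For fixed b, ln(e - b) = ln(a) - alpha * ln(n) is linear in ln(a)
    and alpha, so an ordinary least-squares line fit recovers them; the
    b giving the smallest squared error on the original scale wins.
    """
    n, e = np.asarray(n, float), np.asarray(e, float)
    best = None
    # b must stay below the smallest observed error so ln(e - b) exists
    for b in np.linspace(0.0, e.min() - 1e-6, num_b):
        A = np.vstack([np.ones_like(n), -np.log(n)]).T
        (ln_a, alpha), *_ = np.linalg.lstsq(A, np.log(e - b), rcond=None)
        a, alpha = np.exp(ln_a), max(alpha, 0.0)  # crude alpha >= 0 projection
        resid = np.sum((a * n ** (-alpha) + b - e) ** 2)
        if best is None or resid < best[0]:
            best = (resid, a, alpha, b)
    return best[1:]  # a, alpha, b
```

A nonnegative least-squares solver could be substituted for the clipping step; for the small number of training set sizes used here the difference is minor.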

As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

    H0: p(y = 1 | x, {(x_1, y_1), …, (x_n, y_n)}) = p(y = −1 | x, {(x_1, y_1), …, (x_n, y_n)});

the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or −1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y ∈ {1, −1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_ni was computed, we construct T2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test set. This results in a set of error rates e_nij for training set size n. From these error rates, we construct an empirical distribution function for the random classifier:

    P_ran_n(x) = (1/(T1 × T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e_nij),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P_ran_n(e_n), which is the percentage of random classifiers with an error rate smaller than e_n. The procedure is illustrated with the following two examples for a fictitious dataset.

Example 1. e_15 = 37% (error rate of the classifier with 15 samples); e_15,ij = {21.5%, 26.0%, 29.0%, 32.0%, 36.6%, 38.8%, 39.5%, 40.8%, 42.0%, 49.5%} (error rates of the random classifiers with 15 samples); P_ran_15(e_15) = 0.50. Since the p-value is greater than 0.05, the error rate of the classifier is not statistically significant (see Fig. 2a).

Example 2. e_45 = 1% (error rate of the classifier with 45 samples); e_45,ij = {20.5%, 27.0%, 33.3%, 33.7%, 37.0%, 39.2%, 39.9%, 40.6%, 42.5%, 49.9%} (error rates of the random classifiers with 45 samples); P_ran_45(e_45) = 0. Since the p-value is less than 0.05, the error rate of the classifier is statistically significant (see Fig. 2b).
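The p-value computation in these examples reduces to one line; a small sketch (the function name is illustrative):

```python
import numpy as np

def permutation_p_value(e_n, e_random):
    """P_ran_n(e_n): fraction of permuted-label error rates at or
    below the observed error rate e_n (theta(z) = 1 for z >= 0)."""
    return float(np.mean(np.asarray(e_random) <= e_n))

# Example 1 above: five of ten random classifiers do as well or better, p = 0.50
p1 = permutation_p_value(0.37, [.215, .260, .290, .320, .366,
                                .388, .395, .408, .420, .495])
# Example 2 above: none do, p = 0.0
p2 = permutation_p_value(0.01, [.205, .270, .333, .337, .370,
                                .392, .399, .406, .425, .499])
```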

The detailed description of the entire methodology for a two-class problem is as follows.

1) Subsampling and significance permutation test
A. Subsampling procedure:
   i. Given ℓ_c1 samples from class 1 and ℓ_c2 samples from class 2, the total number of samples is ℓ = ℓ_c1 + ℓ_c2, where ℓ ≥ 10.
   ii. Select 10 training set sizes n_1, …, n_j, …, n_10 over the interval [10, ℓ − 10].


FIG. 2. Dataset size estimation statistical methodology.

   1. For each training set size n_j, run the following subsampling procedure T1 = 50 times, indexed by i = 1, …, T1:
      a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_c2/n_c1 ≈ ℓ_c2/ℓ_c1, where n_c1 and n_c2 are the number of samples from class 1 and class 2 in the training set. Call the two datasets generated S_ni.
      b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_ni.
B. Permutation test:
   i. For each subsampled train/test split S_ni, run the following permutation procedure T2 > 50 times, indexed by j = 1, …, T2:
      1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S_ran_nij.
      2. Train a classifier on the training set and measure its error on the test set; call this error rate e_ran_nij.
C. Significance calculation:
   1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P_ran_n(x) = (1/(T1 × T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e_ran_nij), where θ(z) = 1 if z ≥ 0 and 0 otherwise.
   2. Given the above empirical distribution function, compute for each e_n the value t_n = P_ran_n(e_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.

2) Learning curves and training set size estimation
A. Assume the subsampling procedure was run for M different sample sizes n_l, indexed by l = 1, …, M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate e_n = (1/T1) Σ_{i=1}^{T1} e_ni, and the 25th and 75th quantiles of the vector of error rates {e_n1, …, e_nT1}.
B. Use the above quantities to fit the following learning curves:
   i. Given training set sizes n_l and mean error rates e_{n_l}, compute α, a, b via the minimization procedure min_{α,a,b} Σ_{l=1}^{M} (a·n_l^(−α) + b − e_{n_l})², subject to α, a, b ≥ 0; designate the resulting values as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

       e(n) = a_m·n^(−α_m) + b_m.

   ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_n1, …, e_nT1}.
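As a rough illustration of the whole procedure, the subsampling and permutation loops of steps 1A and 1B might be sketched as follows (this assumes scikit-learn, which is not used in the paper; a linear SVM stands in for the classifier of Section 4, and the helper name is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def subsample_errors(X, y, n_train, T1=50, T2=50, seed=0):
    """Steps 1A and 1B for one training set size: T1 stratified
    train/test splits, plus T2 label permutations per split."""
    rng = np.random.default_rng(seed)
    e, e_ran = [], []
    for _ in range(T1):
        # a stratified split keeps the class ratio of the full dataset (step 1A.1a)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=n_train, stratify=y,
            random_state=int(rng.integers(2**31 - 1)))
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        e.append(np.mean(clf.predict(X_te) != y_te))
        for _ in range(T2):
            # permute the training labels only (step 1B.i.1)
            clf_r = SVC(kernel="linear").fit(X_tr, rng.permutation(y_tr))
            e_ran.append(np.mean(clf_r.predict(X_te) != y_te))
    return np.asarray(e), np.asarray(e_ran)

# step 1C for one size n, then steps 2A-2B on the significant sizes:
# e, e_ran = subsample_errors(X, y, n_train=30)
# t_n = np.mean(e_ran <= e.mean())   # P_ran_n(e_n); keep n only if t_n < p
# a, alpha, b = fit_learning_curve(sig_sizes, sig_mean_errors)  # earlier sketch
```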

4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcomes. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10 and 20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40 and 60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50 and 70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.

Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissues of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed.

Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

The fitted model is e(n) = a·n^(−α) + b, where a is the learning rate, α the decay rate, and b the asymptotic Bayes error rate; predicted error rates at 400 training samples are given in percent.

Dataset (problem)                      Curve            a       α      b       Error at 400 (%)
Multiple tissues (cancer vs. normal)   25th quantile    1.89    0.63   0.003    4.7
                                       mean             1.42    0.52   0.010    7.3
                                       75th quantile    1.17    0.43   0.000    8.9
Leukemia (AML vs. ALL)                 25th quantile    3.98    2.3    0.006    0.6
                                       mean             0.771   0.65   0.009    2.5
                                       75th quantile    0.884   0.60   0.010    3.4
Colon cancer (cancer vs. normal)       25th quantile    0.439   0.37   0.000    4.8
                                       mean             0.480   0.28   0.000    9.0
                                       75th quantile    0.500   0.23   0.000   12.4
Ovarian cancer (cancer vs. normal)     25th quantile   10.7     1.76   0.000    0.03
                                       mean             0.736   0.69   0.000    1.2
                                       75th quantile    0.995   0.67   0.000    1.8
Lymphoma (DLBCL vs. follicular)        25th quantile    8.99    2.48   0.000    0.0
                                       mean             0.57    0.70   0.001    0.9
                                       75th quantile    0.671   0.64   0.000    1.5
Brain cancer (treatment outcome)       25th quantile    1.06    0.37   0.004   14.5
                                       mean             1.12    0.33   0.006   16.1
                                       75th quantile    0.980   0.24   0.000   23.8
Lymphoma (treatment outcome)           25th quantile    1.23    0.41   0.008   11.2
                                       mean             0.943   0.30   0.010   17.0
                                       75th quantile    0.872   0.21   0.000   24.3
Breast cancer (treatment outcome)      25th quantile    0.532   0.14   0.010   24.3
                                       mean             0.485   0.07   0.010   32.3
                                       75th quantile    0.429   0.01   0.000   40.4

Table 2. Summary of Results for Datasets Included in the Study

Error rates are in percent; "Actual" is the classifier's leave-one-out error on the full dataset, and "Extrap." is the learning curve estimate at that size.

Dataset (problem)                      Dataset   Size to achieve statistical     Learning curve fitted      Max. training set size   Actual   Extrap.
                                       size      significance (p-value < 0.05)   with significant data      used in fitting
Multiple tissues (cancer vs. normal)   280       15                              yes                        210                       7.0      8.6
Leukemia (AML vs. ALL)                  73       10                              yes                         35                       0.0      5.6
Colon cancer (cancer vs. normal)        62       10                              yes                         50                      16.3     15.2
Ovarian cancer (cancer vs. normal)      54       10                              yes                         40                       5.6      4.8
Lymphoma (DLBCL vs. follicular)         53        5                              yes                         40                       4.0      4.7
Brain cancer (treatment outcome)        60       45                              no                          40                      22.0     29.6
Lymphoma (treatment outcome)            58       50                              no                          40                      23.0     29.5
Breast cancer (treatment outcome)       78       65                              no                          70                      30.0     36.3

Conclusions:
- Multiple tissues: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.10); even lower error rates (<0.07) are achievable with >400 samples.
- Leukemia: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error to ≈0.
- Colon cancer: dataset size is large enough for extrapolation and to achieve an error rate of ≈0.16; lower error rates (<0.09) are achievable with >400 samples.
- Ovarian cancer: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error rate to ≈0.
- Lymphoma (morphology): dataset size is large enough for extrapolation and to achieve a low error rate (≈0.035); additional samples may lower the error rate to ≈0.
- Brain cancer (outcome): dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.
- Lymphoma (outcome): dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.
- Breast cancer (outcome): dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.

The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, and 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.

To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n estimated using all training set sizes (up to 210) is

    error(n) = 1.42n^(−0.52) + 0.0098.

The error rates for the 25th and 75th quantiles are

    error_25(n) = 1.89n^(−0.63) + 0.0032,
    error_75(n) = 1.17n^(−0.43) + 0.00.

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.

FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope is about +/−2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

    error(400) = 1.42(400)^(−0.52) + 0.0098 = 7.3%,
    error_25(400) = 1.89(400)^(−0.63) + 0.0032 = 4.7%,
    error_75(400) = 1.17(400)^(−0.43) + 0.00 = 8.9%.

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
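A trivial helper makes such extrapolations repeatable (the parameter values below are the fitted ones quoted above):

```python
def predicted_error(a, alpha, b, n):
    """Learning curve prediction a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

mean_400 = predicted_error(1.42, 0.52, 0.0098, 400)   # ~0.073
q25_400  = predicted_error(1.89, 0.63, 0.0032, 400)   # ~0.047
q75_400  = predicted_error(1.17, 0.43, 0.000,  400)   # ~0.089
```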

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, and 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.7706n^(−0.65) + 0.009.

In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.

FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, and 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.4798n^(−0.2797).

FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, and 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.7362n^(−0.6864).

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, and 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.57n^(−0.7073) + 0.0006.

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients who had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, and 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

    error(n) = 1.115n^(−0.3295) + 0.006.

FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients who had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, and 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.9431n^(−0.2957) + 0.01.

FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients who were disease free for at least five years (van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, and 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.4852n^(−0.0733) + 0.01.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate, but still indicative, extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements are typically in the 10-20 sample range for morphological classifications and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

    L_p(z_1, …, z_ℓ) = (1/p) Σ_p Q(z_p, f_{z_{ℓ−p}}),

where z = (x, y), f_{z_{ℓ−p}} is the classifier constructed with the p samples left out, and Q(z_p, f_{z_{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

    E[L_p(z_1, …, z_ℓ)] = E[Q(z, f_{z_{ℓ−p}})],

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969), via the following series of transformations:

    E[L_1(z_1, …, z_ℓ)] = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z_{ℓ−1}}) dP(z_1) ⋯ dP(z_ℓ)
        = (1/ℓ) Σ_{i=1}^{ℓ} ∫ Q(z_i, f_{z_{ℓ−1}}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)
        = (1/ℓ) Σ_{i=1}^{ℓ} E[Q(z_i, f_{z_{ℓ−1}})]
        = E[Q(z, f_{z_{ℓ−1}})].

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
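A toy Monte Carlo experiment can illustrate this claim (a sketch under the assumption of a synthetic Gaussian two-class problem and a linear SVM via scikit-learn; none of the specifics below come from the paper):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def draw_balanced(n_neg, n_pos, d, rng):
    """Toy two-class Gaussian sample z = (x, y); any P(z) would do."""
    y = rng.permutation(np.r_[np.full(n_neg, -1), np.full(n_pos, 1)])
    X = rng.normal(size=(len(y), d)) + 0.5 * y[:, None]
    return X, y

loo_est, true_err = [], []
for _ in range(200):                       # average over training sets
    X, y = draw_balanced(10, 11, 10, rng)  # ell = 21 samples
    clf = SVC(kernel="linear")
    # leave-one-out estimate L_1 on all 21 samples
    loo_est.append(1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
    # error of a classifier trained on 20 samples, on one fresh point
    y_new = rng.choice([-1, 1])
    x_new = rng.normal(size=(1, 10)) + 0.5 * y_new
    clf.fit(X[:-1], y[:-1])
    true_err.append(float(clf.predict(x_new)[0] != y_new))
print(np.mean(loo_est), np.mean(true_err))  # the two means agree in expectation
```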

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance estimate given by the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

    V[L_p(z_1, …, z_ℓ)] ≤ V[Q(z, f_{z_{ℓ−p}})],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

    V[Q(z, f_{z_{ℓ−1}})] = E[Q(z, f_{z_{ℓ−1}})²] − (E[Q(z, f_{z_{ℓ−1}})])².

The variance of the estimator is

    V[L_1(z_1, …, z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the i-th point is left out,

    t_i = Q(z_i, f(x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ)).

We can rewrite this as

    V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know that

    E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[Q(z, f_{z_{ℓ−1}})].

So

    V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[Q(z, f_{z_{ℓ−1}})])².

One can show that

    E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f_{z_{ℓ−1}})²].

One can write

    E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above expression could be rewritten as

    E(t²/ℓ + ((ℓ² − ℓ)/ℓ²) t²) = (E[Q(z, f_{z_{ℓ−1}})])².

However, t_1, …, t_ℓ are not independent but correlated, so

    E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ E[((ℓ² − ℓ)/ℓ²) t²].

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z, f_{z_{ℓ−1}}) is P_{D_{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i given by the leave-one-out procedure L_1(z_1, …, z_ℓ) is P_{D_ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

    P(t < ξ) = ∫_{−∞}^{ξ} p(t) dt = ∫_{−∞}^{∞} θ(ξ − t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

    P(r < ξ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} θ(ξ − r_i) p(r_i) dr_i,

which is the same as P(t < ξ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

    P{√ℓ · sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^(k−1) e^(−2k²ε²),

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.

APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e_n and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

    y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

    Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50n^(−1) (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44n^(−1) (Opper and Haussler, 1991). For the Adaline learning rule the error scales as 0.24n^(−1/2), and for the Hebb rule as 0.40n^(−1/2) (see Watkin et al. [1993] for a review). The dependence on n is thus approximately n^(−α), with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e_n = a·n^(−α) + b for modeling the generalization error.
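The teacher-student setup is straightforward to simulate empirically. The sketch below (an illustration, not taken from the paper) trains a student with the Hebb rule, the simplest of the rules quoted above, on random uncorrelated patterns; its generalization error decays roughly as n^(−1/2), consistent with the 0.40n^(−1/2) scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                                    # number of attributes, held fixed
t = rng.normal(size=m)                    # teacher weight vector
for n in (10, 20, 40, 80, 160, 320):
    errs = []
    for _ in range(100):                  # average over random training sets
        X = rng.normal(size=(n, m))       # uncorrelated random input patterns
        y = np.sign(X @ t)                # labels supplied by the teacher
        w = (y[:, None] * X).sum(axis=0)  # Hebb-rule student weight vector
        Z = rng.normal(size=(2000, m))    # fresh patterns for testing
        errs.append(np.mean(np.sign(Z @ t) != np.sign(Z @ w)))
    print(n, round(float(np.mean(errs)), 3))  # decays roughly as n**-0.5
```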

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

    d(f_o, f_T) = O(n^(−α)),

where α > 0. Worst case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^(−α)) for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are ℓ_p with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y·f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1, the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Gyorfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1-0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu

Page 6: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

124 MUKHERJEE ET AL

FIG. 2. Dataset size estimation: statistical methodology.

1) Subsampling and permutation test
A. For each training set size n_j, run the following subsampling procedure T1 = 50 times, indexed by i = 1, ..., T1:
   a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_{c2}/n_{c1} ≈ ℓ_{c2}/ℓ_{c1}, where n_{c1} and n_{c2} are the numbers of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{ni}.
   b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_{ni}.
B. Permutation test
   i. For each subsampled train/test split S_{ni}, run the following permutation procedure T2 > 50 times, indexed by j = 1, ..., T2:
      1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S^{ran}_{nij}.
      2. Train a classifier on the permuted training set and measure its error on the test set; call this error rate e^{ran}_{nij}.
C. Significance calculation
   1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets:

      P^{ran}_n(x) = (1 / (T1 · T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e^{ran}_{nij}),

      where θ(z) = 1 if z ≥ 0 and 0 otherwise.
   2. Given the above empirical distribution function, compute for each ē_n the value t_n = P^{ran}_n(ē_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.

2) Learning curves and training set size estimation
A. Assume the subsampling procedure was run for M different sample sizes n_l, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T1) Σ_{i=1}^{T1} e_{ni}, and the 25th and 75th quantiles of the vector of error rates {e_{n1}, ..., e_{nT1}}.
B. Use the above quantities to fit the following learning curves:
   i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the minimization min_{α,a,b} Σ_{l=1}^{M} (a · n_l^{−α} + b − ē_{n_l})², subject to α, a, b ≥ 0; designate the resulting values as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size: e(n) = a_m · n^{−α_m} + b_m.
   ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n1}, ..., e_{nT1}}.
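As a concrete illustration of steps 1A through 1C, the following is a minimal Python sketch of the subsampling and permutation procedure. It is our reconstruction, not the authors' implementation: scikit-learn's linear-kernel SVC stands in for the SVM used in the paper, the function names are ours, and X and y are assumed to be a NumPy expression matrix and label vector.

    import numpy as np
    from sklearn.svm import SVC

    def stratified_split(y, n_train, rng):
        """Indices of a random train/test split whose training set keeps the
        class proportions of the full dataset (n_c2/n_c1 ~ l_c2/l_c1)."""
        train_idx = []
        for c in np.unique(y):
            idx_c = rng.permutation(np.flatnonzero(y == c))
            k = int(round(n_train * len(idx_c) / len(y)))  # proportional slots
            train_idx.extend(idx_c[:k])
        train_idx = np.asarray(train_idx)
        test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
        return train_idx, test_idx

    def errors_at_size(X, y, n_train, T1=50, T2=50, seed=0):
        """Subsampled error rates e_ni and permutation error rates e_ran_nij."""
        rng = np.random.default_rng(seed)
        e, e_ran = [], []
        for _ in range(T1):
            tr, te = stratified_split(y, n_train, rng)
            clf = SVC(kernel="linear").fit(X[tr], y[tr])
            e.append(np.mean(clf.predict(X[te]) != y[te]))
            for _ in range(T2):
                # permute training labels only; the test set is untouched
                clf_p = SVC(kernel="linear").fit(X[tr], rng.permutation(y[tr]))
                e_ran.append(np.mean(clf_p.predict(X[te]) != y[te]))
        return np.array(e), np.array(e_ran)

    def t_statistic(e, e_ran):
        """t_n = P_n^ran(e_n): the empirical CDF of the permutation errors
        evaluated at the observed mean error; significance if t_n < p."""
        return np.mean(e_ran <= e.mean())

Running t_statistic at increasing values of n_train and recording the smallest size with t_n < 0.05 reproduces the significance threshold n_0 described above.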

4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcomes. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training samples (e.g., between 10 and 20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40 and 60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50 and 70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size. In this case it is difficult to draw any conclusion, but it is possible either that adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or that the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case; the final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.

Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissues of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)

Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Learning curve parameters for the fitted model e(n) = a · n^{−α} + b (a: learning rate; α: decay rate; b: asymptotic Bayes error), with the predicted error rate at 400 training samples.

Dataset (problem)                       Curve            a       α      b       Predicted error at 400 samples (%)
Multiple tissues (cancer vs. normal)    25th quantile    1.89    0.63   0.003    4.7
                                        mean             1.42    0.52   0.010    7.3
                                        75th quantile    1.17    0.43   0.000    8.9
Leukemia (AML vs. ALL)                  25th quantile    39.8    2.3    0.006    0.6
                                        mean             0.771   0.65   0.009    2.5
                                        75th quantile    0.884   0.60   0.010    3.4
Colon cancer (cancer vs. normal)        25th quantile    0.439   0.37   0.000    4.8
                                        mean             0.480   0.28   0.000    9.0
                                        75th quantile    0.500   0.23   0.000   12.4
Ovarian cancer (cancer vs. normal)      25th quantile    10.7    1.76   0.000    0.03
                                        mean             0.736   0.69   0.000    1.2
                                        75th quantile    0.995   0.67   0.000    1.8
Lymphoma (DLBCL vs. follicular)         25th quantile    8.99    2.48   0.000    0.0
                                        mean             0.57    0.70   0.001    0.9
                                        75th quantile    0.671   0.64   0.000    1.5
Brain cancer (treatment outcome)        25th quantile    1.06    0.37   0.004   14.5
                                        mean             1.12    0.33   0.006   16.1
                                        75th quantile    0.980   0.24   0.000   23.8
Lymphoma (treatment outcome)            25th quantile    1.23    0.41   0.008   11.2
                                        mean             0.943   0.30   0.010   17.0
                                        75th quantile    0.872   0.21   0.000   24.3
Breast cancer (treatment outcome)       25th quantile    0.532   0.14   0.001   24.3
                                        mean             0.485   0.07   0.010   32.3
                                        75th quantile    0.429   0.01   0.000   40.4

Table 2. Summary of Results for Datasets Included in the Study

Columns: maximum dataset size; training set size at which the classifier achieves statistical significance (p-value < 0.05); whether the learning curve was fitted with statistically significant data; largest training set size used within the fitting; classification error (actual vs. learning-curve extrapolation); conclusions.

Multiple tissues (cancer vs. normal): maximum size 280; significance at 15; fitted with significant data: yes; largest size in fit 210; error 7.0% actual vs. 8.6% extrapolated. Conclusion: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.10); even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL): maximum size 73; significance at 10; yes; 35; error 0.0% actual vs. 5.6% extrapolated. Conclusion: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error to ≈0.

Colon cancer (cancer vs. normal): maximum size 62; significance at 10; yes; 50; error 16.3% actual vs. 15.2% extrapolated. Conclusion: dataset size is large enough for extrapolation and to achieve an error rate of ≈0.16; lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): maximum size 54; significance at 10; yes; 40; error 5.6% actual vs. 4.8% extrapolated. Conclusion: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error rate to ≈0.

Lymphoma (DLBCL vs. follicular): maximum size 53; significance at 5; yes; 40; error 4.4% actual vs. 7.0% extrapolated. Conclusion: dataset size is large enough for extrapolation and to achieve a low error rate (≈0.035); additional samples may lower the error rate to ≈0.

Brain cancer (treatment outcome): maximum size 60; significance at 45; fitted with significant data: no; 40; error 22.0% actual vs. 29.6% extrapolated. Conclusion: dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome): maximum size 58; significance at 50; no; 40; error 23.0% actual vs. 29.5% extrapolated. Conclusion: dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome): maximum size 78; significance at 65; no; 70; error 30.0% actual vs. 36.3% extrapolated. Conclusion: dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.


with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
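The leave-one-out validation used here is straightforward to sketch in code. The following is an illustrative version under the same assumptions as the earlier sketch (scikit-learn's linear SVC as the classifier, a NumPy matrix X and labels y), not the authors' original script:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    def loo_error(X, y):
        """Leave-one-out error rate of a linear SVM (each fit uses
        len(y) - 1 training samples, e.g., n = 279 for 280 samples)."""
        mistakes = 0
        for tr, te in LeaveOneOut().split(X):
            clf = SVC(kernel="linear").fit(X[tr], y[tr])
            mistakes += int(clf.predict(X[te])[0] != y[te][0])
        return mistakes / len(y)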

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or a better error rate, and therefore a p-value of 0.05 is achieved. To fit the learning curves we will use only data points obtained from training sets of size greater than or equal to 15.

To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training set sizes stated above (up to 210) and b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is

error(n) = 1.42 n^{−0.52} + 0.0098.

The error rates for the 25th and 75th quantiles are

error_{25}(n) = 1.89 n^{−0.63} + 0.0032,

error_{75}(n) = 1.17 n^{−0.43} + 0.00.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that very low errors can indeed be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.


FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

error(400) = 1.42 · 400^{−0.52} + 0.0098 = 7.3%,

error_{25}(400) = 1.89 · 400^{−0.63} + 0.0032 = 4.7%,

error_{75}(400) = 1.17 · 400^{−0.43} + 0.00 = 8.9%.

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
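The fit and the extrapolation above are easy to reproduce. The sketch below is our own, with SciPy's bounded least-squares minimizer standing in for whichever optimizer the authors used; the constants in the usage example are the fitted values reported in the text.

    import numpy as np
    from scipy.optimize import minimize

    def fit_learning_curve(n_sizes, mean_errors):
        """Least-squares fit of e(n) = a*n**(-alpha) + b, alpha, a, b >= 0."""
        n = np.asarray(n_sizes, dtype=float)
        e = np.asarray(mean_errors, dtype=float)
        sse = lambda p: np.sum((p[1] * n ** (-p[0]) + p[2] - e) ** 2)
        res = minimize(sse, x0=[0.5, 1.0, 0.05],
                       bounds=[(0.0, None)] * 3)  # L-BFGS-B under the hood
        return res.x  # (alpha_m, a_m, b_m)

    # Extrapolating the reported tumor-versus-normal fit to 400 samples:
    alpha, a, b = 0.52, 1.42, 0.0098
    print(a * 400 ** (-alpha) + b)  # ~0.073, i.e., the 7.3% quoted above

The same routine applied to the 25th and 75th quantile error vectors yields the envelope curves.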

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7706 n^{−0.65} + 0.009.

In this case the learning and decay rates are such that the model clearly learns more quickly, as a function of training set size, than in the previous example. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.


FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4798 n^{−0.2797}.

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7362 n^{−0.6864}.

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset


FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.57 n^{−0.7073} + 0.0006.

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients who had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

error(n) = 1.115 n^{−0.3295} + 0.006.

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients who had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.9431 n^{−0.2957} + 0.01.

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients who were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset is 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4852 n^{−0.0733} + 0.01.


FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics for larger datasets that include yet-unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size, and the 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context it can serve as a valuable

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.


tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the range of 10-20 samples, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, ..., z_ℓ) = (1/p) Σ_p Q(z_p, f_{z_{ℓ−p}}),

where z = (x, y), f_{z_{ℓ−p}} is the classifier constructed with p samples left out, and Q(z_p, f_{z_{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

E[L_p(z_1, ..., z_ℓ)] = E[Q(z, f_{z_{ℓ−p}})],

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969) via the following series of transformations:

E[L_1(z_1, ..., z_ℓ)] = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z_{ℓ−1}}) dP(z_1) ... dP(z_ℓ)

= (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z_{ℓ−1}}) dP(z_i) dP(z_1) ... dP(z_{i−1}) dP(z_{i+1}) ... dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} E[Q(z_i, f_{z_{ℓ−1}})]

= E[Q(z, f_{z_{ℓ−1}})].

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out


procedure is not necessarily unbiased and in general will be optimistic:

V[L_p(z_1, ..., z_ℓ)] ≤ V[Q(z, f_{z_{ℓ−p}})],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size ℓ − 1 is

V[Q(z, f_{z_{ℓ−1}})] = E[Q(z, f_{z_{ℓ−1}})²] − (E[Q(z, f_{z_{ℓ−1}})])².

The variance of the estimator is

V[L_1(z_1, ..., z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the ith point is left out:

t_i = Q(z_i, f(z_1, ..., z_{i−1}, z_{i+1}, ..., z_ℓ)).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know that

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[Q(z, f_{z_{ℓ−1}})].

So

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[Q(z, f_{z_{ℓ−1}})])².

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f_{z_{ℓ−1}})²].

One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = (1/ℓ) E[t²] + ((ℓ² − ℓ)/ℓ²)(E[t])² ≤ E[Q(z, f_{z_{ℓ−1}})²].


However, t_1, ..., t_ℓ are not independent but correlated, so by bounding each cross term (e.g., via the Cauchy-Schwarz inequality, E[t_i t_j] ≤ E[t²]),

(1/ℓ²) E[Σ_{i≠j} t_i t_j] ≤ ((ℓ² − ℓ)/ℓ²) E[t²],

and the bound E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f_{z_{ℓ−1}})²] still holds.

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
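Statements 1 and 2 are simple to check by simulation. The toy Monte Carlo below is our construction, not from the paper: Gaussian classes and a k-nearest-neighbor learner stand in for a generic classifier. It shows the leave-one-out mean agreeing with the held-out error in expectation while its variance is visibly smaller.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)

    def draw(n):
        """Two partially overlapping Gaussian classes in 5 dimensions."""
        y = rng.integers(0, 2, n)
        X = rng.standard_normal((n, 5)) + 1.5 * y[:, None]
        return X, y

    ell, reps = 30, 500
    loo_means, holdout_errs = [], []
    for _ in range(reps):
        X, y = draw(ell)
        # leave-one-out estimate L_1 on this dataset
        errs = []
        for i in range(ell):
            clf = KNeighborsClassifier(3).fit(np.delete(X, i, axis=0),
                                              np.delete(y, i))
            errs.append(int(clf.predict(X[i:i + 1])[0] != y[i]))
        loo_means.append(np.mean(errs))
        # a classifier trained on ell - 1 samples, tested on one fresh point
        Xt, yt = draw(1)
        clf = KNeighborsClassifier(3).fit(X[:-1], y[:-1])
        holdout_errs.append(int(clf.predict(Xt)[0] != yt[0]))

    print(np.mean(loo_means), np.mean(holdout_errs))  # close: mean is unbiased
    print(np.var(loo_means), np.var(holdout_errs))    # LOO variance is smaller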

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z, f_{z_{ℓ−1}}) is P_{D_{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = (1/ℓ) L_{ℓ−1}(z_1, ..., z_ℓ) is P_{D_ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < ξ) = ∫_{−∞}^{ξ} p(t) dt = ∫_{−∞}^{+∞} θ(ξ − t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

P(r < ξ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{+∞} θ(ξ − r_i) p(r_i) dr_i,

which is the same as P(t < ξ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from this sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{ √ℓ · sup_x |F(x) − F_ℓ(x)| < ε } ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us increasingly accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
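Numerically, the asymptotic series above can be inverted to bound how far the empirical distribution can sit from the true one. The helper below is a hypothetical utility of ours, not from the paper; it scans for the smallest ε meeting a target confidence and rescales it into a bound on the CDF deviation.

    import numpy as np

    def ks_deviation_bound(ell, delta, n_terms=100):
        """Smallest bound d such that sup_x |F(x) - F_ell(x)| < d holds with
        probability >= 1 - delta, per the asymptotic KS series (ell > 20)."""
        k = np.arange(1, n_terms + 1)
        for eps in np.linspace(0.01, 3.0, 3000):
            tail = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * (k * eps) ** 2))
            if tail <= delta:
                return eps / np.sqrt(ell)
        return None

    # e.g., with T1 = 50 error rates per training size and delta = 0.05, the
    # empirical quantiles sit within about this distance of the true CDF:
    print(ks_deviation_bound(50, 0.05))  # about 1.36 / sqrt(50), i.e., ~0.19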


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, the center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
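The quoted scalings are asymptotic replica results, but the Hebb-rule case is easy to probe numerically. The following toy Monte Carlo is our construction (random Gaussian teacher and patterns, not the replica calculation itself); it shows the teacher-student disagreement decaying roughly as n^{−1/2} once n well exceeds m.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 50                                  # number of attributes
    t = rng.standard_normal(m)              # teacher weight vector

    def hebb_generalization_error(n, trials=100, n_test=2000):
        """Monte-Carlo generalization error of the Hebb-rule student
        w = sum_i y_i z_i on random patterns; an error is counted whenever
        theta[-(t.z)(w.z)] = 1, i.e., teacher and student disagree."""
        errs = []
        for _ in range(trials):
            Z = rng.standard_normal((n, m))
            y = np.sign(Z @ t)
            w = Z.T @ y                      # Hebb rule: w = sum_i y_i z_i
            Zt = rng.standard_normal((n_test, m))
            errs.append(np.mean(np.sign(Zt @ t) != np.sign(Zt @ w)))
        return float(np.mean(errs))

    for n in (100, 400, 1600):
        print(n, hebb_generalization_error(n))  # roughly halves as n quadruples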

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function


that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{−α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.
Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.
Noterman, D., Alon, U., Sierk, A., Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu

Page 7: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 125

2 Given the above empirical distribution function compute for each en the value tn D P rann en

statistical signi cance with respect to an reg-value of p is achieved for n0 the smallest n forwhich tn lt p

2) Learning curves and training set size estimationA Assume the subsampling procedure was run for M different sample sizes n indexed by l iexcl1 M

take the sequence of error rates and compute the following quantities for each training set size n gt n0

for which the classi er passed the signi cance test tn lt p the mean error rate en D 1T1

PT1iD1 eni

the 25th and 75th quantiles of the vector of error ratescopyen1 enT1

ordf

B Use the above quantities to t the following learning curvesi Given training set sizes nl and mean error rates enl compute reg a b via the following mini-

mization procedure minregab

PMlD1aniexclreg

l Cbiexclenl2 subject to reg a b cedil 0 designate the values

as reg a b as regm am bm The resulting curve estimates the error rate as a function of trainingset size

Lm en D amniexclregm C bm

ii Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates fen1

enT1 g

4 METHODOLOGY APPLIED TO SEVERAL CANCERCLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancerclassi cation problems representing a broad range of level of complexity of classi cation The classi -cation problems are to discriminate between tumor morphologies (including disease versus normal andsubmorphologies) or treatment outcome A more detailed analysis of the methodology will be given forthe largest dataset (cancer versus normal tissue classi cation) For the seven other datasets we will presentonly the nal results

The set of examples falls into two cases The rst case consists of classi cation problems that arerelatively easy and where statistical signi cance for the classi ers is achieved at a low number of trainingset samples (eg between 10ndash20 samples) and where the dataset is suf ciently large (eg 40 samples) toaccurately t a learning curve The second case consists of classi cation problems that are more dif cultand where statistical signi cance is achieved at between 40ndash60 training samples while the total number ofsamples in the dataset is barely larger (eg between 50ndash70) For these more dif cult problems we cannotstrictly follow the methodology since we do not have enough training set sizes at which signi cance isreached to make an accurate t of the learning curve However we can still t the curves and use theresults as indicative and exploratory A possible third case is when signi cance is never reached for anyavailable training set size for a dataset In this case it is dif cult to draw any conclusion but it is possiblethat either adding more samples will not help (eg because there is not suf cient molecular information toclassify this dataset) or the problem is very hard and substantial numbers of samples are needed before onesees a statistically signi cant results Our rst ve morphology datasets are examples of the rst case The nal three treatment outcome datasets are examples of the second case Table 1 summarizes the learningcurve parameters and extrapolated error rate estimates at 400 training samples for all the data sets Table 2summarizes and comments on the results of running the methodology on the various datasets Generalconclusions and interpretation of the results will be presented in the next section

Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors(breast prostate lung etc) and 100 normal samples from the corresponding tissue of origin (again breastprostate lung etc) (Ramaswamy et al 2001b) The dimensionality of the dataset is 16063 (throughoutthis section by dimensionality we mean the number of gene expression values recorded for a sample)No preprocessing was performed The classi er used was a Support Vector Machine (Vapnik 1998)

126 MUKHERJEE ET AL

Tabl

e1

Sum

ma

ryof

Lea

rnin

gC

urv

ePa

ram

eter

sa

nd

Extr

apo

late

dE

rro

rR

ates

Lea

rnin

gcu

rve

para

met

ers

b

areg

asym

ple

arni

ngde

cay

Bay

esD

atas

etp

robl

emra

tera

teer

ror

Pre

dict

eder

ror

rate

at40

0sa

mpl

es

25th

quar

tile

Mea

n75

thqu

arti

le

Mul

tipl

etis

sues

47

73

89

25th

quan

tile

189

iexcl0

630

003

Can

cer

vs

norm

alm

ean

142

iexcl0

520

010

75th

quan

tile

117

iexcl0

430

000

Leu

kem

ia0

62

53

425

thqu

anti

le39

8iexcl

23

000

6A

ML

vs

AL

Lm

ean

771

iexcl0

650

009

75th

quan

tile

884

iexcl0

60

010

Col

onca

ncer

48

90

124

25

thqu

anti

le4

39iexcl

037

000

Can

cer

vs

norm

alm

ean

480

iexcl0

280

0075

thqu

anti

le5

00iexcl

023

000

Ova

rian

canc

er0

31

21

825

thqu

anti

le10

7iexcl

176

000

Can

cer

vs

norm

alm

ean

736

iexcl0

690

0075

thqu

anti

le9

95iexcl

067

000

Lym

phom

a0

09

15

25th

quan

tile

899

iexcl2

480

00D

LB

Cvs

Fo

llicu

lar

mea

n5

7iexcl

070

000

175

thqu

anti

le6

71iexcl

064

000

Bra

inca

ncer

145

16

1

238

25

thqu

antil

e1

06iexcl

037

000

4T

reat

men

tou

tcom

em

ean

112

iexcl0

330

006

75th

quan

tile

980

iexcl0

240

00Ly

mph

oma

112

17

24

3

25th

quan

tile

123

iexcl0

410

008

Tre

atm

ent

outc

ome

mea

n9

43iexcl

030

001

75th

quan

tile

872

iexcl0

210

00B

reas

tca

ncer

243

32

3

404

25

thqu

anti

le5

32iexcl

014

001

Tre

atm

ent

outc

ome

mea

n4

85iexcl

007

001

75th

quan

tile

429

iexcl0

010

00

ESTIMATING DATASET SIZE REQUIREMENTS 127

Tabl

e2

Su

mm

ary

of

Res

ult

sfo

rD

atase

tsIn

clu

ded

inth

eSt

ud

y

Max

imum

Size

toac

hiev

eLe

arni

ngtr

aini

ngcl

assi

errsquo

scu

rve

set

size

used

stat

istic

alt

ted

with

int

ting

Dat

aset

sign

ica

nce

sign

ica

ntle

arni

ngD

atas

etp

robl

emsi

ze(p

-val

lt0

05)

data

curv

eC

oncl

usio

ns

Cla

ssi

catio

ner

ror

Lear

ning

curv

eA

ctua

l(e

xtra

pola

ted)

E

rror

Mul

tiple

tissu

esC

ance

rvs

no

rmal

280

15ye

s21

07

08

62

6D

atas

etsi

zeis

larg

een

ough

for

extr

apol

atio

nan

dto

achi

eve

alo

wer

ror

rate

raquo0

10

Eve

nlo

wer

erro

rra

tes

lt0

07ar

eac

hiev

able

wit

hgt

400

sam

ples

L

euke

mia

AM

Lvs

A

LL

7310

yes

350

05

65

6D

atas

etsi

zeis

larg

een

ough

for

extr

apol

atio

nan

dto

achi

eve

alo

wer

ror

rate

raquo0

05

Add

itio

nal

sam

ples

may

low

erer

ror

toraquo

0C

olon

canc

erC

ance

rvs

no

rmal

6210

yes

5016

3

152

1

1D

atas

etsi

zeis

larg

een

ough

for

extr

apol

atio

nan

dto

achi

eve

aner

ror

rate

ofraquo

016

L

ower

erro

rra

tes

lt0

09)

are

achi

evab

lew

ithgt

400

sam

ples

O

vari

anca

ncer

Can

cer

vs

norm

al54

10ye

s40

56

48

08

Dat

aset

size

isla

rge

enou

ghfo

rex

trap

olat

ion

and

toac

hiev

ea

low

erro

rra

teraquo

005

)A

dditi

onal

sam

ples

may

low

erer

ror

rate

toraquo

0L

ymph

oma

DL

BC

vs

Foll

icul

ar53

5ye

s40

44

70

7D

atas

etsi

zeis

larg

een

ough

for

extr

apol

atio

nan

dto

achi

eve

alo

wer

ror

rate

raquo0

035)

A

ddit

iona

lsa

mpl

esm

aylo

wer

erro

rra

teto

raquo0

Bra

inca

ncer

Tre

atm

ent

outc

ome

6045

no40

22

296

7

6D

atas

etis

not

larg

een

ough

for

extr

apol

atio

nFi

ttin

gof

lear

ning

curv

eis

inac

cura

tebu

tsu

gges

tser

ror

rate

coul

dbe

lt0

14w

ith

gt40

0sa

mpl

es

Lym

phom

aT

reat

men

tou

tcom

e58

50no

4023

29

5

65

Dat

aset

isno

tla

rge

enou

ghfo

rex

trap

olat

ion

Fitt

ing

ofle

arni

ngcu

rve

isin

accu

rate

but

sugg

ests

erro

rra

teco

uld

belt

017

wit

hgt

400

sam

ples

B

reas

tca

ncer

Tre

atm

ent

outc

ome

7865

no70

30

363

6

3D

atas

etis

not

larg

een

ough

for

extr

apol

atio

nFi

ttin

gof

lear

ning

curv

eis

inac

cura

tebu

tsu

gges

tser

ror

rate

coul

dbe

lt0

32w

ith

gt40

0sa

mpl

es

128 MUKHERJEE ET AL

with a linear kernel and no feature selection Error rates were estimated for training set sizes of n D30 40 50 60 80 90 130 170 210 A leave-one-out model built with all the available samples (280)was used to validate the method and to compare the scaling model to the error rate achieved when usingalmost the entire dataset this corresponds to a training set size of n D 279

Figure 3 illustrates the results of the signi cance permutation test for this dataset ie the statisticalsigni cance of classi ers with training sets of 15 and 30 samples As can be seen in Fig 3b with 30samples most of the random classi ers attain larger error rates than the actual classi er For the case using15 samples about one in 20 of the random classi ers attain the same or better error rates and therefore ap-value of 5 is achieved To t the learning curves we will use only data points obtained from trainingsets of size greater than or equal to 15

To study the improvement of the learning curve estimates as a function of the number of training setsizes used to t the learning curves we constructed four learning curves using 1) the error rates for alltraining set sizes (up to 210) 2) the error rates for the rst 8 training set sizes 3) the error rates for the rst 6 training set sizes 4) the error rates for the rst 4 training set sizes The plots of these learning curvesalong with the leave-one-out error for 280 samples is given in Fig 4 As expected the model improvesas more and larger training set sizes are used in the t The actual leave-one-out error rate achieved with280 samples is only about 2 less than the error rate estimated for 279 training samples by extrapolatingthe learning curve model Figure 5 shows the curve for the power law that results from applying themethodology to a) all training samples sizes stated above (up to 210) and b) using the rst 6 training setsizes (up to 90) along with the leave-one-out error for the entire dataset (280 samples) The expressionfor the error rate as a function of n estimated using training sets sizes (up to 210) is

errorn D 142niexcl052 C 0098

The error rates for the 25th and 75th quantiles are

error25n D 189niexcl063 C 0032

error75n D 117niexcl043 C 000

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.

FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that very low errors could indeed be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th-to-75th quantile envelope is about +/-2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

error(400) = 1.42 · 400^{-0.52} + 0.0098 = 7.3%

error_25(400) = 1.89 · 400^{-0.63} + 0.0032 = 4.7%

error_75(400) = 1.17 · 400^{-0.43} + 0.00 = 8.9%

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
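The extrapolation above is just a direct evaluation of the three fitted curves at n = 400; a minimal check, using the parameter values quoted in the text:

```python
curves = {"25th quantile": (1.89, 0.63, 0.0032),
          "mean":          (1.42, 0.52, 0.0098),
          "75th quantile": (1.17, 0.43, 0.0000)}
for name, (a, alpha, b) in curves.items():
    print(name, round(100 * (a * 400 ** (-alpha) + b), 1), "%")  # 4.7, 7.3, 8.9
```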

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7706 n^{-0.65} + 0.009.

In this case the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.


FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.
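The two simple preprocessing transforms used in this section (here, and in the ovarian experiment below) are easy to express directly; a sketch, assuming expression matrices with strictly positive entries:

```python
import numpy as np

def preprocess_colon(X):
    # natural logarithm followed by a hyperbolic-tangent squashing
    return np.tanh(np.log(X))

def preprocess_ovarian(X):
    # add 1, then take the natural logarithm
    return np.log(X + 1.0)
```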

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4798 n^{-0.2797}.

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7362 n^{-0.6864}.

Lymphoma morphology

FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.57 n^{-0.7073} + 0.0006.

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients who had successful treatment outcomes (alive two years after treatment) and 21 samples of patients with poor treatment outcomes. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

error(n) = 1.115 n^{-0.3295} + 0.006.

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients who had successful treatment outcomes (alive two years after treatment) and 26 samples of patients with poor treatment outcomes. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.9431 n^{-0.2957} + 0.01.

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients who were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4852 n^{-0.0733} + 0.01.


FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test, and for constructing a learning curve to extrapolate error statistics for larger datasets that include yet-unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training set sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, ..., z_ℓ) = (1/p) Σ_{(p)} Q(z_p; f_{z_{ℓ-p}}),

where z = (x, y), f_{z_{ℓ-p}} is the classifier constructed with p samples left out, and Q(z_p; f_{z_{ℓ-p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

E[L_p(z_1, ..., z_ℓ)] = E[Q(z; f_{z_{ℓ-p}})],

which means that the expected error when the classifier is trained with ℓ - p samples is the same as the expected error of the procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969), by the following series of transformations:

E[L_1(z_1, ..., z_ℓ)] = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z_{ℓ-1}}) dP(z_1) ... dP(z_ℓ)
  = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z_{ℓ-1}}) dP(z_i) dP(z_1) ... dP(z_{i-1}) dP(z_{i+1}) ... dP(z_ℓ)
  = (1/ℓ) Σ_{i=1}^{ℓ} E[Q(z_i; f_{z_{ℓ-1}})]
  = E[Q(z; f_{z_{ℓ-1}})].

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ - p samples as the number of subsamples increases.

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ - p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

V[L_p(z_1, ..., z_ℓ)] ≤ V[Q(z; f_{z_{ℓ-p}})],

which means that the expected variance when the classifier is trained with ℓ - p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ - 1 is

V[Q(z; f_{z_{ℓ-1}})] = E[Q(z; f_{z_{ℓ-1}})²] - (E[Q(z; f_{z_{ℓ-1}})])².

The variance of the estimator is

V[L_1(z_1, ..., z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the ith point is left out:

t_i = Q(z_i; f(x_1, ..., x_{i-1}, x_{i+1}, ..., x_ℓ)).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] - (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[Q(z; f_{z_{ℓ-1}})].

So

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] - (E[Q(z; f_{z_{ℓ-1}})])².

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z; f_{z_{ℓ-1}})²].

One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²/ℓ + ((ℓ² - ℓ)/ℓ²) t²] = (E[Q(z; f_{z_{ℓ-1}})])².

However, t_1, ..., t_ℓ are not independent and are correlated, so

E[Σ_{i≠j} t_i t_j] ≤ E[((ℓ² - ℓ)/ℓ²) t²].

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ - p samples; in general, the variance of the subsampling procedure will be smaller.
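A small Monte Carlo experiment can make Statements 1 and 2 concrete. The sketch below uses assumed synthetic Gaussian data and scikit-learn's linear SVC (neither comes from the paper) to compare the leave-one-out estimator L_1 with the error Q of a classifier trained on ℓ - 1 samples and tested on a single independent sample: the means come out approximately equal, while the variance of L_1 is markedly smaller.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def draw_dataset(n, d=10):
    # synthetic stand-in for P(z): two Gaussian classes shifted by 0.7 per axis
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, d)) + 0.7 * y[:, None]
    return X, y

def loo_estimate(X, y):
    # L_1(z_1, ..., z_l): average leave-one-out error
    errs = []
    for i in range(len(y)):
        tr = np.delete(np.arange(len(y)), i)
        clf = SVC(kernel="linear").fit(X[tr], y[tr])
        errs.append(float(clf.predict(X[i:i + 1])[0] != y[i]))
    return np.mean(errs)

def single_sample_error(X, y):
    # Q(z; f): train on l-1 samples, test on one fresh sample (a 0/1 outcome)
    clf = SVC(kernel="linear").fit(X[:-1], y[:-1])
    Xt, yt = draw_dataset(1)
    return float(clf.predict(Xt)[0] != yt[0])

loo, single = [], []
for _ in range(200):
    X, y = draw_dataset(30)
    loo.append(loo_estimate(X, y))
    single.append(single_sample_error(X, y))

print("means (approximately equal, Statement 1):", np.mean(loo), np.mean(single))
print("variances (LOO smaller, Statement 2):", np.var(loo), np.var(single))
```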

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ - p samples and tested on an independent sample.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f_{z_{ℓ-1}}) is P_{D_{ℓ-1}}{t < τ}. The cumulative distribution function of the random variable r = L_1(z_1, ..., z_ℓ) is P_{D_ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < τ) = ∫_{-∞}^{τ} p(t) dt = ∫_{-∞}^{∞} θ(τ - t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations to that used in the proof of Statement 1:

P(r < τ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{-∞}^{∞} θ(τ - r_i) p(r_i) dr_i,

which is the same as P(t < τ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ - p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from that sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{√ℓ sup_x |F(x) - F_ℓ(x)| < ε} ≈ 1 - 2 Σ_{k=1}^{∞} (-1)^{k-1} e^{-2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 - δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.
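The Kolmogorov-Smirnov series above is straightforward to evaluate numerically; a small sketch (the deviation 0.136 and ℓ = 100 are illustrative numbers, not values from the paper):

```python
import numpy as np

def ks_prob(eps, terms=100):
    # P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps } ≈ 1 - 2 * sum_k (-1)^(k-1) * exp(-2 k^2 eps^2)
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * eps**2))

ell, deviation = 100, 0.136
# probability that the empirical CDF of 100 subsampled error rates stays
# within 0.136 of the true CDF everywhere: comes out at roughly 0.95
print(ks_prob(deviation * np.sqrt(ell)))
```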

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ - p samples as the number of subsamples increases.


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[-(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0, and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^{-1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{-1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{-1/2}, and for the Hebb rule as 0.40 n^{-1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{-α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, this approach strongly motivates the use of e(n) = a n^{-α} + b for modeling the generalization error.
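The teacher-student setup described above is simple to simulate. The sketch below is entirely an illustration (random Gaussian patterns, a Hebb-rule student, and parameter choices that are not from the paper); it shows the generalization error falling off with n at fixed m, roughly as the n^{-1/2} scaling quoted for the Hebb rule:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                                 # number of attributes
teacher = rng.normal(size=m)           # teacher weight vector t

def gen_error(n, trials=20, n_test=2000):
    # average error of a Hebb-rule student, w = mean_i(y_i * z_i),
    # trained on n random patterns labeled by the teacher
    errs = []
    for _ in range(trials):
        Z = rng.normal(size=(n, m))
        y = np.sign(Z @ teacher)
        w = (y[:, None] * Z).mean(axis=0)
        Zt = rng.normal(size=(n_test, m))
        errs.append(np.mean(np.sign(Zt @ w) != np.sign(Zt @ teacher)))
    return np.mean(errs)

for n in (25, 50, 100, 200, 400):
    print(n, round(gen_error(n), 3))   # error decays roughly like n**-0.5
```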

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{-α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between them.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) - V(f_T(x), y) = O(n^{-α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(-y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devoye, L., Gyorfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Proceedings.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB), ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Van't Veer, L.J., Dai, H., van de Vijver, M.J., et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530-536.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu


The implication of this statement is that the subsampling procedure proposed gives us more accurateestimates of the quantiles of the true error of a classi er trained with ` iexcl p samples as the number ofsubsamples increases

140 MUKHERJEE ET AL

APPENDIX 2 MOTIVATION OF THE INVERSE POWER LAW FOR THEERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach togeneralization (Engel 2001 Watkin 1993) which strongly motivates the use of Equation (1) In thisapproach the average generalization error can be calculated as a function of n In order to derive analyticexpressions the data is assumed to consist of randomly constructed and uncorrelated input patterns Thisassumption is unrealistic for practical datasets of course but we can assume that the functional relationderived between en and n largely holds for real-life data In the SM approach a teacher (the rule to bediscovered) and student (the learner) are used with the extent of correlation between teacher and studentquantifying generalization ability To be more speci c let us consider a simple perceptron (Rosenblatt1962) rule with a decision function

y D signw cent z

where z is an input vector w is the weight vector for the perceptron (which weights the relevance ofparticular inputs or attributes) and y D sect1 is the output Suppose the weight vector for the teacherperceptron is t and the weight vector for the student perceptron is w Then the number of generalizationerrors made by the student perceptron on a set of p new samples will be

pX

iD1

micro[iexclt cent ziw cent zi]

where microx D 0 if x lt 0 and 1 otherwise This general approach leads to a set of equations for determiningthe generalization error via a replica approach (Hertz 1991) involving integrations in the weight space wThe generalization error is given as a function of the ratio macr D n=m (where m is the number of attributes)However with m xed we can assume the same functional dependence on n as for macr From this analysiswe nd that the generalization error depends on the algorithm used and generally assumes a power law

It can be argued (Dietrich et al 1999) that a Support Vector Machine with linear kernel used in ournumerical simulations has the same generalization error dependence as the optimal perceptron (Opperet al 1990) We have solved the system of equations (Opper et al 1990) in the low macr limit and nd avery close t to Equation (1) With few samples and a large number of measured attributes the low macr limitis most appropriate when considering microarray data However some further insights can also be gainedby considering the high macr limit where the dependence of generalization error on macr (or equivalently n) canbe extracted explicitly Thus for the optimal perceptron the generalization error scales as 50niexcl1 (Opperet al 1995) Similarly for other rules this scaling can be extracted explicitly (Engel 2001) For examplefor the Bayes optimal classi er (derived from the Bayes point or center of mass of version spacemdashthespace of all hypotheses consistent with the data) the generalization error scales as 44niexcl1 (Opper et al1991) For the Adaline learning rule the error scales as 24niexcl1=2 and for the Hebb rule as 40niexcl1=2

(see Watkin [1993] for a review) The dependence on n is thus approximately niexclreg with reg near 1 forthe more ef cient rules such as the optimal perceptron and Bayes optimal classi er The SM approach togeneralization has also been used to quantify the effects of input noise output noise and noise affectingthe parameters in the model (eg the weights w) Thus for example white noise added to examples inthe training set appears as an additive constant term to the generalization error (justifying the b term inEquation (1)) In summary then this approach strongly motivates use of en D aniexclreg C b for modelingthe generalization error

2b A motivation from an approximation theory point of view

Another justi cation for a power law for regression or classi cation comes from approximation theory(Niyogi et al 1996) In the approximation theory framework the classi cation functions come from somerestricted function class f 2 H and the optimal classi cation function fT is a more complicated function

ESTIMATING DATASET SIZE REQUIREMENTS 141

that is not included in the function class H For a wide variety of algorithms the distance between theoptimal function in the function class fo 2 H and fT is characterized as

dfo fT D Oniexclreg

where reg gt 0 Worst case analyses measure the distance between the two functions as the value of thepoint of greatest deviation between these functions

For function classes used in most algorithms the worst case analysis yields reg D 5 In general anempirical result should have quicker convergence since the worst case assumptions need not be madeWhen the loss function V cent cent is smooth then the difference in error measured using the loss functionbetween the functions fo and fT is V fox y iexcl V fT x y D Oniexclreg for all x 2 X and y 2 Y By asmooth loss function we mean loss functions that are `P with p cedil 1 or Lipschitz over a bounded domainNote that the classi cation loss V f x y D microiexclyf x is not Lipschitz and when the classi er outputsf x D sect1 the loss function is `0 However for most algorithms the loss function optimized to set theparameters of the classi er is Lipschitz (for computational reasons the `0 loss is not used) For examplein Support Vector Machines for classi cation the loss function is Lipschitz For this reason this analysisis still appropriate

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group WhiteheadMIT Center for GenomeResearch and the Golub Laboratory Dana-Farber Cancer Institute for many valuable discussions Thiswork is supported in part by grants from Affymetrix Inc Millennium Pharmaceuticals Inc Bristol-MyersSquibb Company and the Of ce of Science (BER) US Department of Energy Grant No DE-FG02-01ER63185 Sayan Mukherjee was supported by a SLOANDOE postdoctoral fellowship

REFERENCES

Adcock C 1997 Sample size determination a review Statistician 46 261ndash283Alizadeh AA Eisen MB Davis RE Ma C Lossos IS Rosenwald A Boldrick JC Sabet H Tran T Yu

X Powell JI Yang L Marti GE Moore T Hudson J Jr Lu L Lewis DB Tibshirani R Sherlock GChan WC Greiner TC Weisenburger DD Armitage JO Warnke R and Staudt LM 2000 Distinct typesof diffuse large B-cell lymphoma identi ed by gene expression pro ling Nature 403 503ndash511

Anderson J 1983 Architecture of Cognition Harvard University Press Cambridge MABhattacharjeeA Richards WG Staunton J Li C Monti S Vasa P Ladd C Beheshti J Bueno R Gillette M

Loda M Weber G Mark EJ Lander ES Wong W Johnson BE Golub TR Sugarbaker DJ MeyersonM 2001 Classi cation of human lung carcinomas by mRNA expression pro ling reveals distinct adenocarcinomasubclasses Proc Natl Acad Sci USA 98 13790ndash13795

Chapelle O Vapnik V Bousquet O Mukherjee S 2001 Choosing multiple parameters for support vector machinesMachine Learning 46 131ndash159

Chipping Forecast 1999 Nature Genetics vol 21 supplementCortes C Jackel L Solla S Vapnik V and Denker S 1994 Asymptotic values and rates of convergence In

Advances in Neural Information Processing Systems VI Morgan Kaufmann San Francisco CADevoye L Gyor L and Lugosi G 1997 A Probabilistic Theory of Pattern Recognition Springer-Verlag New

YorkDietrich R Opper M and Sompolinsky H 1999 Statistical mechanics of support vector networks Phys Rev Letters

82 2975ndash2978Dietrich R Opper M and Sompolinsky H 2000 Support vectors and statistical mechanics In Smola AJ and

Bartlett PJ eds Advances in Large Margin Classi ers 359ndash368 MIT Press Cambridge MAEngel A and Van den Broeck C 2001 Statistical Mechanics of Machine Learning Cambridge University PressGolub TR Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh ML Downing

JR Caligiuri MA Bloom eld CD Lander ES 1999 Molecular classi cation of cancer Class discovery andclass prediction by gene expression monitoring Science 286 531ndash537

Golland P Mukherjee S Tamayo P and Fischl B 2002 Permutation tests for classi cation Submitted to Advancesin Neural Information Proceedings

142 MUKHERJEE ET AL

Good P 1994 Permutation Tests A Practical Guide to Resampling Methods for Testing Hypothesis Springer-VerlagNew York

Guyon I Makhoul J Schwartz R and Vapnik V 1998 What size test set gives good error estimates IEEETransactions on Pattern Analysis and Machine Intelligence 20 52ndash64

Hertz J Krogh A and Palmer R 1991 Introduction to the Theory of Neural Computation Addison-Wesley BostonMA

Kerr A and Churchill G 2001a Experimental design for gene expression microarrays Biostatistics 2 183ndash201Kerr A and Churchill G 2001b Statistical design and the analysis of gene expression microarrays Genetic Research

77 123ndash128Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays Model validation design issues and

standard error application Genome Biol 2(8) research 00321-003211Luntz A and Brailovsky V 1969 On estimation of characters obtained in statistical procedure of recognition

Technicheskaya Kibernetica 3 (in Russian)Niyogi P and Girosi F 1996 On the relationship between generalization error hypothesis complexity and sample

complexity for radial basis functions Neural Computation 8 819ndash842Noterman D Alon U Sierk A Levine A 2001 Transcriptional gene expression pro les for colorectal adenoma

adenocarcinoma and normal tissue examined by oligonucleotide array Cancer Res 61 3124ndash3130Opper M and Haussler D 1991 Generalization performance of Bayes optimal classi cation algorithms for learning

a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer

Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics

23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black

P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442

Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154

Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan

B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385

Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74

Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416

Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York

Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557

Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov

JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322

Address correspondence toSayan Mukherjee

Room 201 Building E-2545 Carleton Street

Cambridge MA 02139

E-mail sayangenomewimitedu

Page 9: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 127

TABLE 2. Summary of Results for Datasets Included in the Study

For each dataset, the table lists the classification problem, the maximum dataset size, the size at which significant learning (p-value < 0.05) is achieved, the largest training set size used in fitting the learning curve, and conclusions.

Multiple tissues, cancer vs. normal: maximum size 280; significant learning at 15 samples; fitted with training set sizes up to 210. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10). Even lower error rates (<0.07) are achievable with >400 samples.

Leukemia, AML vs. ALL: maximum size 73; significant learning at about 5 samples; fitted with training set sizes up to 35. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error to ~0.

Colon cancer, cancer vs. normal: maximum size 62; significant learning at about 10 samples; fitted with training set sizes up to 50. Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16. Lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer, cancer vs. normal: maximum size 54; significant learning at about 10 samples; fitted with training set sizes up to 40. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error rate to ~0.

Lymphoma, DLBC vs. follicular: maximum size 36; significant learning at about 5 samples; fitted with training set sizes up to 40. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035). Additional samples may lower the error rate to ~0.

Brain cancer, treatment outcome: maximum size 60; significant learning requires about 45 samples (not achieved at the fitted sizes); fitted with training set sizes up to 40. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma, treatment outcome: maximum size 58; significant learning requires about 50 samples (not achieved at the fitted sizes); fitted with training set sizes up to 40. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer, treatment outcome: maximum size 78; significant learning requires about 65 samples (not achieved at the fitted sizes); fitted with training set sizes up to 70. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.

The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attain the same or better error rates, and therefore a p-value of 5% is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
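As an illustration, the permutation procedure can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; train_fn and error_fn are hypothetical placeholders for the user's training routine and error estimate.

import numpy as np

def permutation_p_value(train_fn, error_fn, X, y, n_perm=1000, seed=0):
    # train_fn(X, y) -> classifier; error_fn(clf, X, y) -> error rate (assumed interfaces)
    rng = np.random.default_rng(seed)
    actual_error = error_fn(train_fn(X, y), X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)                 # destroy any label structure
        perm_error = error_fn(train_fn(X, y_perm), X, y_perm)
        if perm_error <= actual_error:              # a random classifier does as well or better
            hits += 1
    return (hits + 1) / (n_perm + 1)                # add-one correction for finite permutations

A p-value of about 0.05 then corresponds to roughly one in 20 label-permuted classifiers matching the actual classifier, as in the text.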

To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to (a) all training set sizes stated above (up to 210) and (b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is

error(n) = 1.42 n^{−0.52} + 0.0098.

The error rates for the 25th and 75th quantiles are

error_{25}(n) = 1.89 n^{−0.63} + 0.0032,

error_{75}(n) = 1.17 n^{−0.43} + 0.000.
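The fit itself is an ordinary nonlinear least-squares problem. The sketch below assumes SciPy is available and uses synthetic error rates generated from the fitted curve above (with added noise) as a stand-in for the subsampled error estimates; in practice the same fit is repeated for the mean and for the 25th and 75th quantile sequences.

import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, alpha, b):
    # e(n) = a * n^(-alpha) + b
    return a * n ** (-alpha) + b

sizes = np.array([30, 40, 50, 60, 80, 90, 130, 170, 210], dtype=float)
# Synthetic mean error rates, for illustration only.
rng = np.random.default_rng(0)
errors = inverse_power_law(sizes, 1.42, 0.52, 0.0098) + rng.normal(0, 0.005, sizes.size)

(a, alpha, b), _ = curve_fit(inverse_power_law, sizes, errors,
                             p0=(1.0, 0.5, 0.05),
                             bounds=([0, 0, 0], [np.inf, 2.0, 1.0]))
print(a, alpha, b)   # should recover values close to 1.42, 0.52, 0.0098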

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that very low errors can indeed be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.


FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

error(400) = 1.42 · 400^{−0.52} + 0.0098 = 7.3%,

error_{25}(400) = 1.89 · 400^{−0.63} + 0.0032 = 4.7%,

error_{75}(400) = 1.17 · 400^{−0.43} + 0.000 = 8.9%.

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
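Extrapolation is then a direct evaluation of the fitted model; a small sketch reproducing the 400-sample figures above:

def error_curve(n, a, alpha, b):
    return a * n ** (-alpha) + b

# Fitted tumor versus nontumor curves from the text
mean = error_curve(400, 1.42, 0.52, 0.0098)   # ~0.073
q25  = error_curve(400, 1.89, 0.63, 0.0032)   # ~0.047
q75  = error_curve(400, 1.17, 0.43, 0.000)    # ~0.089
print(f"error(400) = {mean:.3f} (+{q75 - mean:.3f} / -{mean - q25:.3f})")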

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 5% is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7706 n^{−0.65} + 0.009.

In this case the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
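For intuition, the two fitted curves can be compared directly. Under the coefficients reconstructed above, the leukemia curve sits well below the tumor versus nontumor curve at every training set size, reflecting its faster decay:

tumor    = lambda n: 1.42   * n ** -0.52 + 0.0098
leukemia = lambda n: 0.7706 * n ** -0.65 + 0.009
for n in (35, 73, 210):
    # prints roughly (35, 0.233, 0.085), (73, 0.162, 0.057), (210, 0.098, 0.033)
    print(n, round(tumor(n), 3), round(leukemia(n), 3))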


FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of 5% is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4798 n^{−0.2797}.

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 5% is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7362 n^{−0.6864}.

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset


FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


was 18432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 5% is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.57 n^{−0.7073} + 0.0006.

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 5%) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

error(n) = 1.115 n^{−0.3295} + 0.006.

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 5%) is achieved at about 50 samples. In Fig. 11 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.9431 n^{−0.2957} + 0.01.

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset is 24624. No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 5%) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4852 n^{−0.0733} + 0.01.


FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimations of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.

APPENDIX 1: BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, …, z_ℓ) = (1/p) Σ_{(p)} Q(z^{(p)}, f_{z^{ℓ−p}}),

where z = (x, y), f_{z^{ℓ−p}} is the classifier constructed with p samples left out, and Q(z^{(p)}, f_{z^{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

E[L_p(z_1, …, z_ℓ)] = E[Q(z, f_{z^{ℓ−p}})],

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969), via the following series of transformations:

E[L_1(z_1, …, z_ℓ)] = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^{ℓ−1}}) dP(z_1) ⋯ dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} ∫ Q(z_i, f_{z^{ℓ−1}}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} E[Q(z_i, f_{z^{ℓ−1}})]

= E[Q(z, f_{z^{ℓ−1}})].

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
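A minimal sketch of the subsampling procedure analyzed here, assuming a generic fit/predict classifier interface (the experiments in the text used a linear-kernel SVM; train_fn is an illustrative placeholder):

import numpy as np

def leave_p_out_errors(X, y, p, n_splits, train_fn, seed=0):
    # Repeatedly hold out p random samples, train on the remaining l - p,
    # and record the error rate on the held-out samples.
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        held = rng.choice(len(y), size=p, replace=False)
        mask = np.ones(len(y), dtype=bool)
        mask[held] = False
        clf = train_fn(X[mask], y[mask])        # classifier trained on l - p samples
        errors.append(np.mean(clf.predict(X[held]) != y[held]))
    return np.asarray(errors)

The mean of the returned array estimates the true error at training size ℓ − p (Statement 1), and np.quantile(errors, [0.25, 0.75]) gives the quantile envelope used in the text (Statement 3).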

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

V[L_p(z_1, …, z_ℓ)] ≤ V[Q(z, f_{z^{ℓ−p}})],

which means that the variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size ℓ − 1 is

V[Q(z, f_{z^{ℓ−1}})] = E[Q(z, f_{z^{ℓ−1}})²] − [E Q(z, f_{z^{ℓ−1}})]².

The variance of the estimator is

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the ith point is left out:

t_i = Q(z_i, f_{x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ}).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From statement 1 we know

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[Q(z, f_{z^{ℓ−1}})].

So

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E Q(z, f_{z^{ℓ−1}})]².

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f_{z^{ℓ−1}})²].

One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identically distributed and independent, then the above equation could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²]/ℓ + ((ℓ² − ℓ)/ℓ²) (E[t])² = [E Q(z, f_{z^{ℓ−1}})]².

However, t_1, …, t_ℓ are not independent but correlated, so

E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ ((ℓ² − ℓ)/ℓ²) (E[t])².

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
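This optimism is easy to check numerically. The sketch below is an illustrative Monte Carlo on synthetic Gaussian data with a nearest-centroid classifier (a simple stand-in, not the paper's setup); it compares the variance of the leave-one-out estimate with the variance of a single independent test point's error, across many datasets:

import numpy as np

def nearest_centroid_error(Xtr, ytr, Xte, yte):
    # Assign each test point to the nearest class centroid.
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float(np.mean(pred != yte))

rng = np.random.default_rng(0)
l, d = 20, 5
y = np.arange(l + 1) % 2                           # fixed, balanced labels
loo_vals, holdout_vals = [], []
for _ in range(2000):
    X = rng.normal(size=(l + 1, d)) + y[:, None]   # class-1 mean shifted by +1
    # leave-one-out estimate over the first l samples
    loo = [nearest_centroid_error(np.delete(X[:l], i, axis=0), np.delete(y[:l], i),
                                  X[i:i + 1], y[i:i + 1]) for i in range(l)]
    loo_vals.append(np.mean(loo))
    # a classifier trained on l - 1 samples, tested on one independent sample
    holdout_vals.append(nearest_centroid_error(X[:l - 1], y[:l - 1], X[l:], y[l:]))
print(np.var(loo_vals), np.var(holdout_vals))      # the leave-one-out variance is smaller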

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z, f_{z^{ℓ−1}}) is P_{D^{ℓ−1}}{t < τ}. The cumulative distribution function of the random variable r = (1/ℓ) L_{ℓ−1}(z_1, …, z_ℓ) is P_{D^ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < τ) = ∫_{−∞}^{τ} p(t) dt = ∫_{−∞}^{∞} θ(τ − t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

P(r < τ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} θ(τ − r_i) p(r_i) dr_i,

which is the same as P(t < τ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{ √ℓ · sup_x |F(x) − F_ℓ(x)| < ε } ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
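Numerically, the bound above can be evaluated with the asymptotic Kolmogorov distribution; a small sketch assuming SciPy's kstwobign, which is the limiting distribution of √ℓ · sup_x |F(x) − F_ℓ(x)|:

from scipy.stats import kstwobign

def quantile_band_halfwidth(ell, delta=0.05):
    # eps such that sup_x |F(x) - F_ell(x)| < eps with probability ~ 1 - delta
    # (asymptotic Kolmogorov-Smirnov approximation, valid for ell > 20).
    return kstwobign.ppf(1.0 - delta) / ell ** 0.5

print(quantile_band_halfwidth(100))   # ~0.136 for 100 subsampled error rates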


APPENDIX 2: MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a support vector machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, i.e., the space of all hypotheses consistent with the data) the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2}

(see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
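The power-law behavior is easy to reproduce in simulation. The sketch below is a toy illustration of the SM teacher-student setting (not the authors' experiment), using scikit-learn's Perceptron as the student for convenience; the measured generalization error decays roughly as a power of n:

import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
m = 50                                    # number of attributes
t = rng.normal(size=m)                    # teacher weight vector
for n in (50, 100, 200, 400, 800):
    errs = []
    for _ in range(20):                   # average over random datasets
        X = rng.normal(size=(n, m))
        y = np.sign(X @ t)                # teacher labels: y = sign(t . z)
        Xte = rng.normal(size=(2000, m))
        yte = np.sign(Xte @ t)
        student = Perceptron(max_iter=100, tol=None).fit(X, y)
        errs.append(1.0 - student.score(Xte, yte))
    print(n, np.mean(errs))               # error falls off roughly as a power law in n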

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{−α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in support vector machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1-0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu

Page 10: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

128 MUKHERJEE ET AL

with a linear kernel and no feature selection Error rates were estimated for training set sizes of n D30 40 50 60 80 90 130 170 210 A leave-one-out model built with all the available samples (280)was used to validate the method and to compare the scaling model to the error rate achieved when usingalmost the entire dataset this corresponds to a training set size of n D 279

Figure 3 illustrates the results of the signi cance permutation test for this dataset ie the statisticalsigni cance of classi ers with training sets of 15 and 30 samples As can be seen in Fig 3b with 30samples most of the random classi ers attain larger error rates than the actual classi er For the case using15 samples about one in 20 of the random classi ers attain the same or better error rates and therefore ap-value of 5 is achieved To t the learning curves we will use only data points obtained from trainingsets of size greater than or equal to 15

To study the improvement of the learning curve estimates as a function of the number of training setsizes used to t the learning curves we constructed four learning curves using 1) the error rates for alltraining set sizes (up to 210) 2) the error rates for the rst 8 training set sizes 3) the error rates for the rst 6 training set sizes 4) the error rates for the rst 4 training set sizes The plots of these learning curvesalong with the leave-one-out error for 280 samples is given in Fig 4 As expected the model improvesas more and larger training set sizes are used in the t The actual leave-one-out error rate achieved with280 samples is only about 2 less than the error rate estimated for 279 training samples by extrapolatingthe learning curve model Figure 5 shows the curve for the power law that results from applying themethodology to a) all training samples sizes stated above (up to 210) and b) using the rst 6 training setsizes (up to 90) along with the leave-one-out error for the entire dataset (280 samples) The expressionfor the error rate as a function of n estimated using training sets sizes (up to 210) is

$$\mathrm{error}(n) = 1.42\,n^{-0.52} + 0.0098.$$

The error rates for the 25th and 75th quantiles are

$$\mathrm{error}_{25}(n) = 1.89\,n^{-0.63} + 0.0032$$

$$\mathrm{error}_{75}(n) = 1.17\,n^{-0.43} + 0.00.$$

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that very low errors can indeed be achieved if a large dataset is used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the envelope between the 25th and 75th quantiles is about ±2%, and it indicates that the model is relatively accurate.

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.


FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

$$\mathrm{error}(400) = 1.42\,(400)^{-0.52} + 0.0098 = 7.3\%$$

$$\mathrm{error}_{25}(400) = 1.89\,(400)^{-0.63} + 0.0032 = 4.7\%$$

$$\mathrm{error}_{75}(400) = 1.17\,(400)^{-0.43} + 0.00 = 8.9\%$$

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
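The extrapolation above is easy to reproduce numerically. The following sketch fits the inverse power-law model error(n) = a·n^(−α) + b to a set of measured error rates by nonlinear least squares and evaluates the fitted curve at a future dataset size. The empirical error values listed are hypothetical stand-ins for the subsampled estimates, and the use of scipy.optimize.curve_fit is an illustrative choice rather than the authors' original fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, b):
    """Inverse power-law learning curve: error(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

def fit_learning_curve(sizes, error_rates):
    """Fit (a, alpha, b), constraining all parameters to be non-negative."""
    params, _ = curve_fit(
        power_law,
        np.asarray(sizes, dtype=float),
        np.asarray(error_rates, dtype=float),
        p0=(1.0, 0.5, 0.01),
        bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]),
    )
    return params

# Hypothetical mean error rates at the training set sizes used above.
sizes = [30, 40, 50, 60, 80, 90, 130, 170, 210]
errors = [0.26, 0.23, 0.21, 0.19, 0.165, 0.155, 0.13, 0.115, 0.105]

a, alpha, b = fit_learning_curve(sizes, errors)
print(f"error(n) ~ {a:.2f} * n^(-{alpha:.2f}) + {b:.4f}")
print(f"extrapolated error at n = 400: {power_law(400, a, alpha, b):.3f}")
```

The same routine, run on the 25th and 75th quantiles of the subsampled error distributions instead of the means, yields the quantile envelope used above.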

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.7706\,n^{-0.65} + 0.009.$$

In this case, the learning and decay rates are such that the model clearly learns more quickly, as a function of training set size, than in the previous example. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.


FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.4798\,n^{-0.2797}.$$

Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.7362\,n^{-0.6864}.$$

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset was 18,432.


FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.57\,n^{-0.7073} + 0.0006.$$

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients who had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 1.115\,n^{-0.3295} + 0.06.$$

FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients who had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.9431\,n^{-0.2957} + 0.1.$$

Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients who were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 4.852\,n^{-0.733} + 0.1.$$


FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.


5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler," problems than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.

The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.


Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

APPENDIX 1: BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1,\ldots,z_\ell) = \frac{1}{p}\sum_{p} Q\!\left(z_p,\, f_{z_{\ell-p}}\right),$$

where z = (x, y), f_{z_{ℓ−p}} is the classifier constructed with p samples left out, and Q(z_p, f_{z_{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

$$E\,L_p(z_1,\ldots,z_\ell) = E\,Q\!\left(z,\, f_{z_{\ell-p}}\right),$$

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) via the following series of transformations:

$$\begin{aligned}
E\,L_1(z_1,\ldots,z_\ell) &= \frac{1}{\ell}\int \sum_{i=1}^{\ell} Q\!\left(z_i,\, f_{z_{\ell-1}}\right) dP(z_1)\cdots dP(z_\ell) \\
&= \frac{1}{\ell}\int \sum_{i=1}^{\ell} Q\!\left(z_i,\, f_{z_{\ell-1}}\right) dP(z_i)\, dP(z_1)\cdots dP(z_{i-1})\, dP(z_{i+1})\cdots dP(z_\ell) \\
&= \frac{1}{\ell}\sum_{i=1}^{\ell} E\,Q\!\left(z_i,\, f_{z_{\ell-1}}\right) \\
&= E\,Q\!\left(z,\, f_{z_{\ell-1}}\right).
\end{aligned}$$

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
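A small Monte Carlo experiment makes this concrete: repeatedly hold out p randomly chosen samples, train on the remaining ℓ − p, and average the held-out errors. The sketch below does this for a synthetic two-class Gaussian dataset; the data generator, the choice of scikit-learn's LinearSVC, and the subsample count are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def leave_p_out_error(X, y, p, n_rounds=200, seed=0):
    """Monte Carlo leave-p-out: average held-out error over random p-subsets."""
    rng = np.random.default_rng(seed)
    ell = len(y)
    errors = []
    for _ in range(n_rounds):
        idx = rng.permutation(ell)
        held_out, train = idx[:p], idx[p:]        # classifier trained on ell - p samples
        clf = LinearSVC().fit(X[train], y[train])
        errors.append(np.mean(clf.predict(X[held_out]) != y[held_out]))
    return float(np.mean(errors))

# Synthetic two-class Gaussian data standing in for a microarray dataset.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)), rng.normal(0.5, 1.0, (30, 50))])
y = np.array([0] * 30 + [1] * 30)
print(leave_p_out_error(X, y, p=5))   # estimates the expected error at 55 training samples
```

By Statement 1, the printed average is an unbiased estimate of the expected error of a classifier trained on ℓ − p = 55 samples, and it sharpens as the number of subsampling rounds grows.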

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance estimate given by the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

$$V\,L_p(z_1,\ldots,z_\ell) \le V\,Q\!\left(z,\, f_{z_{\ell-p}}\right),$$

which means that the variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

$$V\,Q\!\left(z, f_{z_{\ell-1}}\right) = E\!\left[Q\!\left(z, f_{z_{\ell-1}}\right)^2\right] - \left[E\,Q\!\left(z, f_{z_{\ell-1}}\right)\right]^2.$$

The variance of the estimator is

$$V\,L_1(z_1,\ldots,z_\ell) = V\!\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right],$$

where t_i indicates whether or not an error is made when the i-th point is left out:

$$t_i = Q\big(z_i,\; f(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_\ell)\big).$$

We can rewrite this as

$$V\!\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{\!2}\,\right] - \left(E\!\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right]\right)^{\!2}.$$

From Statement 1, we know that

$$E\!\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\,Q\!\left(z, f_{z_{\ell-1}}\right).$$

So

$$V\!\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{\!2}\,\right] - \left[E\,Q\!\left(z, f_{z_{\ell-1}}\right)\right]^2.$$

One can show that

$$E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{\!2}\,\right] \le E\!\left[Q\!\left(z, f_{z_{\ell-1}}\right)^2\right].$$

One can write

$$E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{\!2}\,\right] = E\!\left[\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i\neq j} t_i t_j\right].$$

If the random variables t_i, t_j were identical and independent, then the above expression could be rewritten as

$$E\!\left[\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i\neq j} t_i t_j\right] = E\!\left(\frac{t^2}{\ell} + \frac{\ell^2-\ell}{\ell^2}\,t^2\right) = E\!\left[Q\!\left(z, f_{z_{\ell-1}}\right)^2\right].$$


However, t_1, …, t_ℓ are not independent but correlated, so

$$E\!\left(\frac{1}{\ell^2}\sum_{i\neq j} t_i t_j\right) \le E\!\left(\frac{\ell^2-\ell}{\ell^2}\, t^2\right),$$

from which the inequality above follows.

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
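The optimism is also easy to observe numerically: draw many independent datasets, compute the leave-one-out average on each, and compare the spread of those averages with the spread of single held-out errors of classifiers trained on ℓ − 1 samples. The sketch below does this under illustrative assumptions (synthetic Gaussian data, scikit-learn's LinearSVC).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
ell, dim = 40, 20

def draw_dataset(n):
    """Two Gaussian classes separated by a mean shift."""
    X = np.vstack([rng.normal(0.0, 1.0, (n // 2, dim)),
                   rng.normal(0.7, 1.0, (n // 2, dim))])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

loo_means, holdout_errors = [], []
for _ in range(100):
    X, y = draw_dataset(ell)
    # Leave-one-out average on this dataset.
    errs = []
    for i in range(ell):
        mask = np.arange(ell) != i
        clf = LinearSVC().fit(X[mask], y[mask])
        errs.append(float(clf.predict(X[i:i + 1])[0] != y[i]))
    loo_means.append(np.mean(errs))
    # A single classifier trained on ell - 1 samples, tested on one fresh point.
    clf = LinearSVC().fit(X[1:], y[1:])
    X_new, y_new = draw_dataset(2)
    holdout_errors.append(float(clf.predict(X_new[:1])[0] != y_new[0]))

print("variance of leave-one-out estimator:", np.var(loo_means))
print("variance of single held-out error:  ", np.var(holdout_errors))
```

The second variance is that of a Bernoulli error indicator and is substantially larger, in line with the inequality above.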

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z, f_{z_{ℓ−1}}) is P_{D_{ℓ−1}}{t < τ}. The cumulative distribution function of the leave-one-out estimate r = (1/ℓ) Σ_{i=1}^{ℓ} r_i, where r_i is the error made when the i-th point is left out, is P_{D_ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\tau - t)\,p(t)\,dt.$$

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a sequence of transformations similar to those used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell}\sum_{i=1}^{\ell}\int_{-\infty}^{\infty} \theta(\tau - r_i)\,p(r_i)\,dr_i,$$

which is the same as P(t < τ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

$$P\!\left\{\sqrt{\ell}\,\sup_x \left|F(x) - F_\ell(x)\right| < \epsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\epsilon^2},$$

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
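To give a feel for the quantile accuracy bound, the sketch below evaluates the Kolmogorov–Smirnov series numerically: given ℓ subsampled error rates and a maximum CDF deviation of interest, it returns the approximate probability that the empirical distribution function stays within that deviation everywhere. The chosen numbers are illustrative.

```python
import numpy as np

def ks_probability(epsilon, k_max=100):
    """Approximate P{ sqrt(l) * sup_x |F(x) - F_l(x)| < epsilon }
    via 1 - 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * epsilon**2)."""
    k = np.arange(1, k_max + 1)
    return float(1 - 2 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * epsilon**2)))

# With l = 50 error rates, a uniform CDF deviation of 0.15 corresponds
# to epsilon = sqrt(50) * 0.15.
ell, deviation = 50, 0.15
print(ks_probability(np.sqrt(ell) * deviation))
```

Inverting this relation for a target confidence 1 − δ gives the bound, mentioned above, on the difference between an empirical quantile and the true quantile.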


APPENDIX 2: MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix, we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \operatorname{sign}(w \cdot z),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta\!\left[-(t \cdot z_i)(w \cdot z_i)\right],$$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis, we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insight can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule, the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review).

The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules, such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to the examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
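This power-law behavior can be observed directly in a teacher-student simulation. The sketch below labels random Gaussian input patterns with a fixed random teacher perceptron, trains a linear SVM student at several training set sizes, and fits the decay exponent on a log-log scale; the dimensions, sizes, and use of scikit-learn's LinearSVC are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
m = 100                                   # number of attributes
teacher = rng.normal(size=m)              # teacher weight vector t

def generalization_error(n_train, n_test=5000):
    """Error of a linear SVM student on fresh patterns labeled by the teacher rule."""
    X = rng.normal(size=(n_train + n_test, m))
    y = np.sign(X @ teacher)              # y = sign(t . z)
    clf = LinearSVC().fit(X[:n_train], y[:n_train])
    return np.mean(clf.predict(X[n_train:]) != y[n_train:])

sizes = np.array([25, 50, 100, 200, 400])
errors = np.array([generalization_error(n) for n in sizes])
slope, _ = np.polyfit(np.log(sizes), np.log(errors), 1)
print(f"fitted decay rate alpha ~ {-slope:.2f}")   # near 1 for an efficient rule
```

Since the teacher here is noise free, b ≈ 0 and a plain log-log fit suffices; with label noise, a nonlinear fit of a·n^(−α) + b, as in the main text, is needed.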

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function


that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O\!\left(n^{-\alpha}\right),$$

where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between them.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1, the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalization. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530-536.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(Suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu

Page 11: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 129

FIG 4 Error rate as a function of sample size The diamond is the leave-one-out error using 279 samples The topcurve is the learning curve using the rst four sample sizes The bottom curve is the learning curve using all samplesizes The middle two curves (which basically overlap) are the learning curves using the rst six (top) and rst eight(bottom) sample sizes

is about C=iexcl 2 and it indicates that the model is relatively accurate If we were going to collect 400training samples this model can be used to extrapolate the error rate as follows

error400 D 142400iexcl052 C 0098 D 73

error25400 D 189400iexcl063 C 0032 D 47

error75400 D 117400iexcl043 C 000 D 89

The achievable error rate using 400 samples according to the model is 73 sect 26 and perhaps as low as47 (25th quantile envelope)

Leukemia morphology

The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and25 samples of acute myeloid leukemia (AML) (Golub et al 1999) The dimensionality of the datasetis 7129 No preprocessing was performed The classi er used was a Support Vector Machine (Vapnik1998) with a linear kernel and no feature selection Error rates were estimated for training set sizes ofn D 10 15 20 25 30 35 In Fig 6 a plot of the learning curve and its 25th and 75th quantiles is givenalong with the leave-one-out error of the 73 samples A p-value of 5 is achieved at about 5 samples Thelearning curve estimate of the error rate as a function of n is

errorn D 7706niexcl65 C 009

In this case the learning and decay rates are such that the model clearly learns more quickly than in theprevious example as a function of training set size It achieves practically a zero error rate at 73 samples(consistent with the 25th quantile envelope) The envelope is wider in this case because we t the modelusing a narrower range of dataset sizes over which the empirical error rates display more variation thanthe previous dataset

130 MUKHERJEE ET AL

FIG 5 Learning curves in the tumor versus nontumor classi cation constructed using (a) all sample sizes statedabove and (b) using the rst six sample sizes stated above The middle line is the learning curve for the mean errorThe bottom line is for the 25th quantile The top line is for the 75th quantile The diamond is the leave-one-out errorand the circles are the measured average error rates

ESTIMATING DATASET SIZE REQUIREMENTS 131

FIG 6 Learning curves in the AML versus ALL classi cation constructed using sample sizes stated above Themiddle line is the learning curve for the mean error The bottom line is for the 25th quantile The top line is for the75th quantile The diamond is the leave-one-out error and the circles are the measured average error rates

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malig-nant tissue (Noterman et al 2001) The dimensionality of the dataset is 2000 The data was preprocessedby taking the natural logarithm of all input values and then applying a hyperbolic-tangent function Theclassi er used was a Support Vector Machine (Vapnik 1998) with a linear kernel and no feature selectionError rates were estimated for training set sizes of n D 10 15 20 25 30 35 40 45 50 In Fig 7 a plotof the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the62 samples

A p-value of 5 is achieved at about 10 samples The learning curve estimate of the error rate as afunction of n is

errorn D 4798niexcl02797

Ovarian cancer

The dataset consists expression levels for 24 samples of normal ovarian tissue and 30 samples ofmalignant tissue (Schummer et al 1999) The dimensionality of the dataset is 1536 The data waspreprocessed by adding 1 and taking the natural logarithm of all input values The classi er used was aSupport Vector Machine (Vapnik 1998) with a linear kernel and no feature selection Error rates wereestimated for training set sizes of n D 10 15 20 25 30 35 40 In Fig 8 a plot of the learning curveand its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples A p-valueof 5 is achieved at about 10 samples The learning curve estimate of the error rate as a function of n is

errorn D 7362niexcl06864

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samplesof follicular lymphoma and chronic lymphocyptic (Alizadeh et al 2000) The dimensionality of the dataset

132 MUKHERJEE ET AL

FIG 7 Learning curves in the malignant versus normal colon tissue classi cation constructed using sample sizesstated above The middle line is the learning curve for the mean error The bottom line is for the 25th quantile Thetop line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured averageerror rates

FIG 8 Learning curves in the cancerous versus normal ovarian tissue classi cation constructed using sample sizesstated above The middle line is the learning curve for the mean error The bottom line is for the 25th quantile Thetop line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured averageerror rates

ESTIMATING DATASET SIZE REQUIREMENTS 133

was 18432 The data was preprocessed by taking the base 2 logarithm of all input values The classi erused was a Support Vector Machine (Vapnik 1998) with a linear kernel and no feature selection Errorrates were estimated for training sizes of n D 5 10 15 20 25 30 35 40 In Fig 9 a plot of the learningcurve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples Ap-value of 5 is achieved at about 5 samples The learning curve estimate of the error rate as a functionof n is

errorn D 57niexcl07073 C 0006

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 21 samples of patients with poor treatment outcome All patients had childhoodmedulloblastomas (Pomeroy et al 2002) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40

Statistical signi cance on this dataset (a p-value of 5) is achieved at about 45 samples which is largerthan any of the training set sizes for which error rates were estimated so strictly speaking we cannot applythe methodology

However we can examine how accurately a learning curve t to the error rates for the above trainingset sizes can extrapolate In Fig 10 a plot of the learning curve and its 25th and 75th quantiles is givenalong with the leave-one-out error of the 60 samples As expected this model is not very accurate andoverestimates the error rate at 59 samples by more than 7 The learning curve estimate of the error rateas a function of n is

errorn D 1115niexcl3295 C 006

FIG 9 Learning curves in the diffuse large B-cell versus follicular morphology classi cation constructed usingsample sizes stated above The middle line is the learning curve for the mean error The bottom line is for the 25thquantile The top line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measuredaverage error rates

134 MUKHERJEE ET AL

FIG 10 Learning curves in the brain tumor treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 26 samples of patients with poor treatment outcome All patients had diffuse largecell lymphoma (DLCL) (Shipp et al 2001) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40 Statistical signi cance on this dataset (a p-value of 5)is achieved at about 50 samples In Fig 11 a plot of the learning curve and its 25th and 75th quantiles isgiven along with the leave-one-out error of the 58 samples As expected this model is not very accurateand overestimates the error rate at 57 samples by more than 9 The learning curve estimate of the errorrate as a function of n is

errorn D 9431niexcl2957 C 01

Breast cancer treatment outcome

The dataset consists expression levels of 34 samples from patients with breast tumors that metastasizedwithin ve years of disease onset and 44 samples from patients that were disease free for at least ve years(Vanrsquot Veer et al 2002) The dimensionality of the dataset was 24624 No preprocessing was performedThe classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernel without featureselection Error rates were estimated for training set sizes of n D 10 20 30 40 50 60 70 Statisticalsigni cance on this dataset (a p-value of 5) is achieved at about 65 samples In Fig 12 a plot of thelearning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 Asexpected this model is not very accurate and overestimates the error rate at 77 samples by more than 6The learning curve estimate of the error rate as a function of n is

errorn D 4852niexcl0733 C 01

ESTIMATING DATASET SIZE REQUIREMENTS 135

FIG 11 Learning curves in the lymphoma treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

FIG 12 Learning curves in the breast cancer treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

136 MUKHERJEE ET AL

5 CONCLUSIONS

We have described a methodology for assessing the signi cance of a classi errsquos performance via apermutation test and constructing a learning curve to extrapolate error statistics for larger data sets thatinclude yet unseen samples We applied the method to eight cancer classi cation problems of varyinglevels of dif culty Based on the results of the previous section one can see that the inverse power-lawscaling model proposed ts the empirical error rates reasonably well The classi er we used was an SVMbut the methodology is applicable to other algorithms (eg weighted voting k-nearest neighbors logisticregression etc) For the morphology classi cation problems the extrapolation is quite accurate For thetreatment outcome classi cation problems the combination of the increased complexity of the problemsand the limited dataset sizes yield a less accurate but still indicative extrapolation As expected the modelimproves as larger training samples sizes are used in the learning curve t (see Figs 4 and 5) The learningcurves bear out the empirical observation that morphological distinctions are more dramatic and thusin general ldquosimplerrdquo problems than the more subtle distinctions that must be determined for treatmentoutcome prediction Signi cance on morphology problems is achieved with 10ndash20 training samples andldquoreasonably accuraterdquo extrapolation requires 30ndash40 training samples In contrast for treatment outcomesigni cance is achieved with 45ndash60 training samples and ldquoreasonably accuraterdquo extrapolation requireson the order of 75ndash100 training samples For morphological distinctions the learning curve prediction isreasonably close to the actual leave-one-out error measured at a larger size The 25th and 75th quantilemodels provide useful error bar envelopes that enclose the observed error rates for those problems Fortreatment outcome prediction due to the large training set size required to achieve signi cance and smallavailable dataset sizes we do not have enough signi cant classi ers with which to construct an accuratelearning curve Consequently we get less accurate estimates of the leave-one-error on the entire datasetfor the outcome treatment examples with differences of 7 for brain tumor outcome 9 for lymphomatreatment outcome and 8 for breast tumor metastasis

The estimations of the asymptotic Bayes error b the learning rate a and decay rate reg can also be useddirectly to characterize the dif culty of a problem and the complexity of a model They can provide abasis for comparing and contrasting models and problems To illustrate we show in Fig 13 the values ofthese parameters for the examples discussed in the paper The morphology and treatment outcome datasetscluster with respect to reg and b We have not elaborated on this aspect of the analysis but it is certainly aninteresting direction to pursue in the future

In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable

FIG 13 A plot of 1=reg versus macr for the eight datasets The diamonds correspond to treatment outcome problemsand the squares correspond to morphology prediction problems

ESTIMATING DATASET SIZE REQUIREMENTS 137

tool when designing future experiments either for evaluating whether it is worthwhile to collect additionalsamples or for obtaining a deeper insight into the complexity of a given classi cation problem basedon preliminary data Table 1 shows a summary of the results for the examples described in this paperThe results of applying this method to those examples suggest that minimum dataset size requirementsfor morphological classi cations are typically in the 10ndash20 samples range and upwards of 50 samples fortreatment outcome classi cation These results can be used to provide general rule-of-thumb guidelinesbut the exact numbers for a given problem are dataset and classi er dependent This method can also beapplied to other domains where a prospective estimation of the number of samples is relevant as is thecase in many problems using molecular features to classify biological samples eg classi cation basedon proteomic mass spectroscopy data chemosensitivity prediction survival analysis and putative classdiscovery using clustering

APPENDIX 1 BIAS PROPERTIES OF THE MEAN VARIANCE ANDQUANTILES OF LEAVE-P-OUT ESTIMATORS

1a The mean of the leave-p-out estimator is unbiased

Statement 1 The procedure of excluding p samples from a dataset of size ` constructing a classi erand then testing on the p samples left out is designated as follows

Lpz1 z` D 1p

X

p

Qzp f z`iexclp

where z D x y f z`iexclp is the classi er constructed with p samples left out and Qzp f z`iexclp is theerror of this classi er on the p samples left out This procedure is unbiased

ELpz1 z` D EQz f z`iexclp

which means that the expected error when the classi er is trained with ` iexcl p samples is the same as theexpected error of procedure Lp

Proof The proof is a straightforward extension of the leave-one-out case which was derived by Luntzand Brailovsky (Luntz and Brailovsky 1969) by the following series of transformations

EL1z1 z` D 1`

Z X

iD1

Qzi f z`iexcl1dPz1 dPz`

D 1`

Z X

iD1

Qzi f z`iexcl1dPzidPz1 dPziiexcl1dPziC1 dPz`

D 1`

X

iD1

EQzi f z`iexcl1

D EQz f z`iexcl1

The implication of this statement is that the subsampling procedure proposed is unbiased and in expectationgives us more accurate estimates of the true error of a classi er trained with ` iexcl p samples as the numberof subsamples increases

1b The variance of the leave-p-out estimator is biased and optimistic

Statement 2 The variance of the leave-p-out estimator is less than or equal to the variance of aclassi er trained with `iexclp samples and tested on an independent sample so the variance of the leave-p-out

138 MUKHERJEE ET AL

procedure is not necessarily unbiased This procedure is not necessarily unbiased and in general will beoptimistic

V Lpz1 z` middot VQz f z`iexclp

which means that the expected variance when the classi er is trained with ` iexcl p samples is greater thanor equal to the variance of the procedure Lp

Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension

The variance of training sets of size ` iexcl 1 is

VQz f z`iexcl1 D EbQz f z`iexcl12 iexcl [EQz f z`iexcl1]2c

The variance of the estimator

V1`

L1z1 z` D V

1`

X

iD1

ti

where ti is whether an error is made or not when the ith point is left out

Qzi f x1 xiiexcl1 xiC1 x`

We can rewrite this as

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl

Aacute

E1`

X

iD1

ti

2

From statement 1 we know

E1`

X

iD1

ti D EQz f z`iexcl1

So

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl [EQz f z`iexcl1]2

One can show that

E

Aacute1`

X

iD1

ti

2

middot E[Qz f z`iexcl12]

One can write

E

Aacute1`

X

iD1

ti

2

D E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A

If the random variables ti tj are identical and independent then the above equation can be rewritten

E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A D E

sup3t2

`C

`2 iexcl `

`2t2

acuteD E[Qz f z`iexcl1]2

ESTIMATING DATASET SIZE REQUIREMENTS 139

However t1 t` are not independent and are correlated so

E

0

X

i 6Dj

ti tj

1

A middot E

sup3`2 iexcl `

`2t2

acute

The implication of this statement is that the variance of the subsampling procedure proposed is biased anddoes not give an accurate estimate of the variance of a classi er trained with ` iexcl p samples and in generalthe variance of the subsampling procedure will be smaller

1c Quantiles of the leave-p-out estimator are unbiased

Statement 3 Quantiles of the leave-p-out estimator estimated give an accurate estimate of quantilesof a classi er trained with ` iexcl p samples and tested on an independent sample

Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension

The cumulative distribution function of the random variable is t D Qz f z`iexcl1 is PD`iexcl1 ft lt raquog Thecumulative distribution function of the random variable r D 1

`L`iexcl1z1 z` is PD`

fr lt raquog If we showthat these distribution functions are equal then the quantiles of the leave-p-out estimator are unbiased Thedistribution function of the random variable t is

P t lt raquo DZ raquo

iexcl1tptdt D

Z 1

iexcl1tmicroraquo iexcl tptdt

The distribution function for the random variable r D 1`

P`iD1 ri can be written as follows by a similar

sequence of transformations as used in the proof of statement 1

P r lt raquo D 1`

X

iD1

Z 1

iexcl1rimicroraquo iexcl ripridri

which is the same as P t lt raquo We have now shown that the cumulative distribution function of the error measured of the leave-p-out

procedure is equivalent to the cumulative distribution of a classi er trained on ` iexcl p samplesHowever we do not have this distribution when we run the leave-p-out procedure We have a sequence of

p error rates and we take the 25th and 75th quantiles of the empirical distribution function constructed fromthe sequence We can use the KolmogorovndashSmirnov or Smirnov distributions to show that the empiricalquantiles values are close to those for the true underlying distribution For ` large enough ` gt 20 theKolmogorovndashSmirnov distribution gives us

P fp

` supx

jF x iexcl F`xj lt g frac14 1 iexcl 21X

kD1

iexcl1kiexcl1eiexcl2k22

where F x is the distribution function of the error rates and F`x is the empirical distribution functionconstructed from a sequence of ` error rates We can use this result to state that with probability 1 iexcl plusmn

the difference between the estimate of a quantile and the true quantile value will be bounded For the casewhere ` middot 20 instead of using the KolmogorovndashSmirnov distribution one can use tabulated values of theSmirnov distribution

The implication of this statement is that the subsampling procedure proposed gives us more accurateestimates of the quantiles of the true error of a classi er trained with ` iexcl p samples as the number ofsubsamples increases

140 MUKHERJEE ET AL

APPENDIX 2 MOTIVATION OF THE INVERSE POWER LAW FOR THEERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach togeneralization (Engel 2001 Watkin 1993) which strongly motivates the use of Equation (1) In thisapproach the average generalization error can be calculated as a function of n In order to derive analyticexpressions the data is assumed to consist of randomly constructed and uncorrelated input patterns Thisassumption is unrealistic for practical datasets of course but we can assume that the functional relationderived between en and n largely holds for real-life data In the SM approach a teacher (the rule to bediscovered) and student (the learner) are used with the extent of correlation between teacher and studentquantifying generalization ability To be more speci c let us consider a simple perceptron (Rosenblatt1962) rule with a decision function

y D signw cent z

where z is an input vector w is the weight vector for the perceptron (which weights the relevance ofparticular inputs or attributes) and y D sect1 is the output Suppose the weight vector for the teacherperceptron is t and the weight vector for the student perceptron is w Then the number of generalizationerrors made by the student perceptron on a set of p new samples will be

pX

iD1

micro[iexclt cent ziw cent zi]

where microx D 0 if x lt 0 and 1 otherwise This general approach leads to a set of equations for determiningthe generalization error via a replica approach (Hertz 1991) involving integrations in the weight space wThe generalization error is given as a function of the ratio macr D n=m (where m is the number of attributes)However with m xed we can assume the same functional dependence on n as for macr From this analysiswe nd that the generalization error depends on the algorithm used and generally assumes a power law

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low $\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low $\beta$ limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high $\beta$ limit, where the dependence of the generalization error on $\beta$ (or, equivalently, $n$) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as $0.50\,n^{-1}$ (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as $0.44\,n^{-1}$ (Opper et al., 1991). For the Adaline learning rule, the error scales as $0.24\,n^{-1/2}$, and for the Hebb rule, as $0.40\,n^{-1/2}$ (see Watkin [1993] for a review). The dependence on $n$ is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights $\mathbf{w}$). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the $b$ term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = a\,n^{-\alpha} + b$ for modeling the generalization error.
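The resulting functional form is straightforward to fit in practice. Below is a minimal sketch, assuming SciPy is available and using made-up error rates rather than any dataset from this paper.

import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    """Inverse power law e(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Hypothetical mean error rates measured at several training set sizes.
n_sizes = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
errors = np.array([0.31, 0.26, 0.22, 0.20, 0.185, 0.175, 0.17])

(a, alpha, b), _ = curve_fit(
    learning_curve, n_sizes, errors,
    p0=(1.0, 0.5, 0.05),                           # rough starting guesses
    bounds=([0.0, 0.0, 0.0], [np.inf, 2.0, 1.0]),  # keep a, alpha, b nonnegative
)
print(f"error(n) ~ {a:.3f} * n**(-{alpha:.3f}) + {b:.3f}")
print("extrapolated error at n = 100:", learning_curve(100.0, a, alpha, b))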

2b. A motivation from an approximation theory point of view

Another justification of a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, $f \in H$, and the optimal classification function $f_T$ is a more complicated function that is not included in the function class $H$. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between them.

For the function classes used in most algorithms, the worst-case analysis yields $\alpha = 0.5$. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not hold. When the loss function $V(\cdot, \cdot)$ is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(\mathbf{x}), y) - V(f_T(\mathbf{x}), y) = O(n^{-\alpha})$ for all $\mathbf{x} \in X$ and $y \in Y$. By a smooth loss function, we mean a loss function that is $\ell_p$ with $p \ge 1$, or Lipschitz, over a bounded domain. Note that the classification loss $V(f(\mathbf{x}), y) = \theta(-y f(\mathbf{x}))$ is not Lipschitz, and when the classifier outputs $f(\mathbf{x}) = \pm 1$, the loss function is $\ell_0$. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the $\ell_0$ loss is not used); for example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
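To make the smoothness distinction concrete, the short sketch below (ours, not the paper's) contrasts the 0-1 classification loss, which jumps at margin zero, with the hinge loss optimized by Support Vector Machines, which is Lipschitz with constant 1 in the margin.

import numpy as np

def zero_one_loss(margin):
    """Classification loss theta(-y f(x)): not Lipschitz (discontinuous at 0)."""
    return (margin < 0).astype(float)

def hinge_loss(margin):
    """SVM hinge loss max(0, 1 - y f(x)): Lipschitz with constant 1."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2.0, 2.0, 9)   # margin = y * f(x)
print(zero_one_loss(margins))         # [1. 1. 1. 1. 0. 0. 0. 0. 0.], jump at 0
print(hinge_loss(margins))            # piecewise linear, no jumps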

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues, and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(suppl. 1), S316–S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu


Page 13: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 131

FIG 6 Learning curves in the AML versus ALL classi cation constructed using sample sizes stated above Themiddle line is the learning curve for the mean error The bottom line is for the 25th quantile The top line is for the75th quantile The diamond is the leave-one-out error and the circles are the measured average error rates

Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malig-nant tissue (Noterman et al 2001) The dimensionality of the dataset is 2000 The data was preprocessedby taking the natural logarithm of all input values and then applying a hyperbolic-tangent function Theclassi er used was a Support Vector Machine (Vapnik 1998) with a linear kernel and no feature selectionError rates were estimated for training set sizes of n D 10 15 20 25 30 35 40 45 50 In Fig 7 a plotof the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the62 samples

A p-value of 5 is achieved at about 10 samples The learning curve estimate of the error rate as afunction of n is

errorn D 4798niexcl02797

Ovarian cancer

The dataset consists expression levels for 24 samples of normal ovarian tissue and 30 samples ofmalignant tissue (Schummer et al 1999) The dimensionality of the dataset is 1536 The data waspreprocessed by adding 1 and taking the natural logarithm of all input values The classi er used was aSupport Vector Machine (Vapnik 1998) with a linear kernel and no feature selection Error rates wereestimated for training set sizes of n D 10 15 20 25 30 35 40 In Fig 8 a plot of the learning curveand its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples A p-valueof 5 is achieved at about 10 samples The learning curve estimate of the error rate as a function of n is

errorn D 7362niexcl06864

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samplesof follicular lymphoma and chronic lymphocyptic (Alizadeh et al 2000) The dimensionality of the dataset

132 MUKHERJEE ET AL

FIG 7 Learning curves in the malignant versus normal colon tissue classi cation constructed using sample sizesstated above The middle line is the learning curve for the mean error The bottom line is for the 25th quantile Thetop line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured averageerror rates

FIG 8 Learning curves in the cancerous versus normal ovarian tissue classi cation constructed using sample sizesstated above The middle line is the learning curve for the mean error The bottom line is for the 25th quantile Thetop line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured averageerror rates

ESTIMATING DATASET SIZE REQUIREMENTS 133

was 18432 The data was preprocessed by taking the base 2 logarithm of all input values The classi erused was a Support Vector Machine (Vapnik 1998) with a linear kernel and no feature selection Errorrates were estimated for training sizes of n D 5 10 15 20 25 30 35 40 In Fig 9 a plot of the learningcurve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples Ap-value of 5 is achieved at about 5 samples The learning curve estimate of the error rate as a functionof n is

errorn D 57niexcl07073 C 0006

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 21 samples of patients with poor treatment outcome All patients had childhoodmedulloblastomas (Pomeroy et al 2002) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40

Statistical signi cance on this dataset (a p-value of 5) is achieved at about 45 samples which is largerthan any of the training set sizes for which error rates were estimated so strictly speaking we cannot applythe methodology

However we can examine how accurately a learning curve t to the error rates for the above trainingset sizes can extrapolate In Fig 10 a plot of the learning curve and its 25th and 75th quantiles is givenalong with the leave-one-out error of the 60 samples As expected this model is not very accurate andoverestimates the error rate at 59 samples by more than 7 The learning curve estimate of the error rateas a function of n is

errorn D 1115niexcl3295 C 006

FIG 9 Learning curves in the diffuse large B-cell versus follicular morphology classi cation constructed usingsample sizes stated above The middle line is the learning curve for the mean error The bottom line is for the 25thquantile The top line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measuredaverage error rates

134 MUKHERJEE ET AL

FIG 10 Learning curves in the brain tumor treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 26 samples of patients with poor treatment outcome All patients had diffuse largecell lymphoma (DLCL) (Shipp et al 2001) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40 Statistical signi cance on this dataset (a p-value of 5)is achieved at about 50 samples In Fig 11 a plot of the learning curve and its 25th and 75th quantiles isgiven along with the leave-one-out error of the 58 samples As expected this model is not very accurateand overestimates the error rate at 57 samples by more than 9 The learning curve estimate of the errorrate as a function of n is

errorn D 9431niexcl2957 C 01

Breast cancer treatment outcome

The dataset consists expression levels of 34 samples from patients with breast tumors that metastasizedwithin ve years of disease onset and 44 samples from patients that were disease free for at least ve years(Vanrsquot Veer et al 2002) The dimensionality of the dataset was 24624 No preprocessing was performedThe classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernel without featureselection Error rates were estimated for training set sizes of n D 10 20 30 40 50 60 70 Statisticalsigni cance on this dataset (a p-value of 5) is achieved at about 65 samples In Fig 12 a plot of thelearning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 Asexpected this model is not very accurate and overestimates the error rate at 77 samples by more than 6The learning curve estimate of the error rate as a function of n is

errorn D 4852niexcl0733 C 01

ESTIMATING DATASET SIZE REQUIREMENTS 135

FIG 11 Learning curves in the lymphoma treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

FIG 12 Learning curves in the breast cancer treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

136 MUKHERJEE ET AL

5 CONCLUSIONS

We have described a methodology for assessing the signi cance of a classi errsquos performance via apermutation test and constructing a learning curve to extrapolate error statistics for larger data sets thatinclude yet unseen samples We applied the method to eight cancer classi cation problems of varyinglevels of dif culty Based on the results of the previous section one can see that the inverse power-lawscaling model proposed ts the empirical error rates reasonably well The classi er we used was an SVMbut the methodology is applicable to other algorithms (eg weighted voting k-nearest neighbors logisticregression etc) For the morphology classi cation problems the extrapolation is quite accurate For thetreatment outcome classi cation problems the combination of the increased complexity of the problemsand the limited dataset sizes yield a less accurate but still indicative extrapolation As expected the modelimproves as larger training samples sizes are used in the learning curve t (see Figs 4 and 5) The learningcurves bear out the empirical observation that morphological distinctions are more dramatic and thusin general ldquosimplerrdquo problems than the more subtle distinctions that must be determined for treatmentoutcome prediction Signi cance on morphology problems is achieved with 10ndash20 training samples andldquoreasonably accuraterdquo extrapolation requires 30ndash40 training samples In contrast for treatment outcomesigni cance is achieved with 45ndash60 training samples and ldquoreasonably accuraterdquo extrapolation requireson the order of 75ndash100 training samples For morphological distinctions the learning curve prediction isreasonably close to the actual leave-one-out error measured at a larger size The 25th and 75th quantilemodels provide useful error bar envelopes that enclose the observed error rates for those problems Fortreatment outcome prediction due to the large training set size required to achieve signi cance and smallavailable dataset sizes we do not have enough signi cant classi ers with which to construct an accuratelearning curve Consequently we get less accurate estimates of the leave-one-error on the entire datasetfor the outcome treatment examples with differences of 7 for brain tumor outcome 9 for lymphomatreatment outcome and 8 for breast tumor metastasis

The estimations of the asymptotic Bayes error b the learning rate a and decay rate reg can also be useddirectly to characterize the dif culty of a problem and the complexity of a model They can provide abasis for comparing and contrasting models and problems To illustrate we show in Fig 13 the values ofthese parameters for the examples discussed in the paper The morphology and treatment outcome datasetscluster with respect to reg and b We have not elaborated on this aspect of the analysis but it is certainly aninteresting direction to pursue in the future

In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable

FIG 13 A plot of 1=reg versus macr for the eight datasets The diamonds correspond to treatment outcome problemsand the squares correspond to morphology prediction problems

ESTIMATING DATASET SIZE REQUIREMENTS 137

tool when designing future experiments either for evaluating whether it is worthwhile to collect additionalsamples or for obtaining a deeper insight into the complexity of a given classi cation problem basedon preliminary data Table 1 shows a summary of the results for the examples described in this paperThe results of applying this method to those examples suggest that minimum dataset size requirementsfor morphological classi cations are typically in the 10ndash20 samples range and upwards of 50 samples fortreatment outcome classi cation These results can be used to provide general rule-of-thumb guidelinesbut the exact numbers for a given problem are dataset and classi er dependent This method can also beapplied to other domains where a prospective estimation of the number of samples is relevant as is thecase in many problems using molecular features to classify biological samples eg classi cation basedon proteomic mass spectroscopy data chemosensitivity prediction survival analysis and putative classdiscovery using clustering

APPENDIX 1 BIAS PROPERTIES OF THE MEAN VARIANCE ANDQUANTILES OF LEAVE-P-OUT ESTIMATORS

1a The mean of the leave-p-out estimator is unbiased

Statement 1 The procedure of excluding p samples from a dataset of size ` constructing a classi erand then testing on the p samples left out is designated as follows

Lpz1 z` D 1p

X

p

Qzp f z`iexclp

where z D x y f z`iexclp is the classi er constructed with p samples left out and Qzp f z`iexclp is theerror of this classi er on the p samples left out This procedure is unbiased

ELpz1 z` D EQz f z`iexclp

which means that the expected error when the classi er is trained with ` iexcl p samples is the same as theexpected error of procedure Lp

Proof The proof is a straightforward extension of the leave-one-out case which was derived by Luntzand Brailovsky (Luntz and Brailovsky 1969) by the following series of transformations

EL1z1 z` D 1`

Z X

iD1

Qzi f z`iexcl1dPz1 dPz`

D 1`

Z X

iD1

Qzi f z`iexcl1dPzidPz1 dPziiexcl1dPziC1 dPz`

D 1`

X

iD1

EQzi f z`iexcl1

D EQz f z`iexcl1

The implication of this statement is that the subsampling procedure proposed is unbiased and in expectationgives us more accurate estimates of the true error of a classi er trained with ` iexcl p samples as the numberof subsamples increases

1b The variance of the leave-p-out estimator is biased and optimistic

Statement 2 The variance of the leave-p-out estimator is less than or equal to the variance of aclassi er trained with `iexclp samples and tested on an independent sample so the variance of the leave-p-out

138 MUKHERJEE ET AL

procedure is not necessarily unbiased This procedure is not necessarily unbiased and in general will beoptimistic

V Lpz1 z` middot VQz f z`iexclp

which means that the expected variance when the classi er is trained with ` iexcl p samples is greater thanor equal to the variance of the procedure Lp

Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension

The variance of training sets of size ` iexcl 1 is

VQz f z`iexcl1 D EbQz f z`iexcl12 iexcl [EQz f z`iexcl1]2c

The variance of the estimator

V1`

L1z1 z` D V

1`

X

iD1

ti

where ti is whether an error is made or not when the ith point is left out

Qzi f x1 xiiexcl1 xiC1 x`

We can rewrite this as

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl

Aacute

E1`

X

iD1

ti

2

From statement 1 we know

E1`

X

iD1

ti D EQz f z`iexcl1

So

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl [EQz f z`iexcl1]2

One can show that

E

Aacute1`

X

iD1

ti

2

middot E[Qz f z`iexcl12]

One can write

E

Aacute1`

X

iD1

ti

2

D E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A

If the random variables ti tj are identical and independent then the above equation can be rewritten

E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A D E

sup3t2

`C

`2 iexcl `

`2t2

acuteD E[Qz f z`iexcl1]2

ESTIMATING DATASET SIZE REQUIREMENTS 139

However t1 t` are not independent and are correlated so

E

0

X

i 6Dj

ti tj

1

A middot E

sup3`2 iexcl `

`2t2

acute

The implication of this statement is that the variance of the subsampling procedure proposed is biased anddoes not give an accurate estimate of the variance of a classi er trained with ` iexcl p samples and in generalthe variance of the subsampling procedure will be smaller

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f_{z_{ℓ−1}}) is P_{D_{ℓ−1}}{t < τ}. The cumulative distribution function of the random variable r = L_1(z_1, ..., z_ℓ) = (1/ℓ) Σ_{i=1}^{ℓ} t_i is P_{D_ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\tau - t)\, p(t)\, dt.$$

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell}\sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\, p(r_i)\, dr_i,$$

which is the same as P(t < τ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, one per subsample of p held-out points, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon \right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\varepsilon^2},$$

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
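In practice the bound can be evaluated directly; the sketch below (ours, not the authors') computes the 25th and 75th quantiles of a sequence of subsampled error rates together with the asymptotic Kolmogorov–Smirnov half-width, using scipy's kstwobign, the limiting distribution of √ℓ sup|F − F_ℓ|. The stand-in error rates are hypothetical; in a real analysis they would come from the leave-p-out subsampling above.

```python
import numpy as np
from scipy.stats import kstwobign

def quantile_band(error_rates, q=(0.25, 0.75), delta=0.05):
    """Empirical quantiles of subsampled error rates, plus the KS half-width
    eps such that P(sup_x |F(x) - F_l(x)| < eps) is about 1 - delta."""
    e = np.sort(np.asarray(error_rates))
    n = len(e)
    eps = kstwobign.ppf(1 - delta) / np.sqrt(n)   # asymptotic, fine for n > 20
    lo, hi = np.quantile(e, q)
    return lo, hi, eps

# stand-in error rates (hypothetical); replace with subsampled rates in practice
errs = np.random.default_rng(2).beta(2, 8, size=100)
lo, hi, eps = quantile_band(errs)
print(f"25th/75th quantiles: {lo:.3f}/{hi:.3f}; CDF uncertainty +/- {eps:.3f}")
```

With probability about 1 − δ the empirical CDF lies within eps of the true CDF everywhere, which bounds how far each empirical quantile can sit from the true quantile.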


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta\left[-(t \cdot z_i)(w \cdot z_i)\right],$$

where θ(x) = 0 if x < 0, and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper and Haussler, 1991). For the Adaline learning rule, the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin et al. [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
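The teacher and student setup is easy to reproduce numerically. The sketch below (ours, written under the appendix's assumption of random, uncorrelated Gaussian inputs) labels inputs with a random teacher perceptron, trains a linear SVM student at several training set sizes, and fits the inverse power law e(n) = a n^{−α} + b of Equation (1) by nonlinear least squares. The dimensions, sizes, and constants are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
m = 50                                    # number of attributes
teacher = rng.normal(size=m)              # the rule to be discovered

def gen(n):
    """Random uncorrelated inputs labeled by the teacher perceptron."""
    Z = rng.normal(size=(n, m))
    return Z, np.sign(Z @ teacher)

Z_test, y_test = gen(5000)                # large test set for the generalization error
sizes = np.array([10, 20, 40, 80, 160, 320])
errors = []
for n in sizes:
    e = []
    for _ in range(10):                   # average over training-set draws
        Z, y = gen(n)
        while len(np.unique(y)) < 2:      # guard: resample degenerate tiny draws
            Z, y = gen(n)
        clf = LinearSVC(C=100.0).fit(Z, y)    # hard-margin-like linear student
        e.append(np.mean(clf.predict(Z_test) != y_test))
    errors.append(np.mean(e))

power_law = lambda n, a, alpha, b: a * n ** (-alpha) + b   # Equation (1)
(a, alpha, b), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.0),
                             bounds=([0, 0, 0], [np.inf, 2, 1]))
print(f"fit: e(n) = {a:.2f} n^(-{alpha:.2f}) + {b:.3f}")
```

For a noiseless teacher the fitted b should sit near zero and alpha near 1, in line with the SM predictions quoted above; adding label noise pushes b up, mirroring the additive constant term.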

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where α > 0. Worst case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
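The Lipschitz distinction is easy to check numerically. This short sketch (ours) contrasts the 0-1 classification loss θ(−y f(x)), whose difference quotient blows up at the jump, with the SVM hinge loss max(0, 1 − y f(x)), whose slope is bounded by 1; u stands for the margin y f(x).

```python
import numpy as np

zero_one = lambda u: (u <= 0).astype(float)     # theta(-y f(x)): jump at u = 0
hinge    = lambda u: np.maximum(0.0, 1.0 - u)   # SVM loss: Lipschitz, constant 1

u = np.linspace(-2, 2, 100001)
du = u[1] - u[0]
for name, loss in [("0-1", zero_one), ("hinge", hinge)]:
    slopes = np.abs(np.diff(loss(u))) / du      # finite-difference quotients
    print(f"{name:5s} max |difference quotient| ~ {slopes.max():.1f}")
# hinge stays at 1.0; the 0-1 quotient grows without bound as the grid refines
```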

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. The Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics A 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu


APPENDIX 1 BIAS PROPERTIES OF THE MEAN VARIANCE ANDQUANTILES OF LEAVE-P-OUT ESTIMATORS

1a The mean of the leave-p-out estimator is unbiased

Statement 1 The procedure of excluding p samples from a dataset of size ` constructing a classi erand then testing on the p samples left out is designated as follows

Lpz1 z` D 1p

X

p

Qzp f z`iexclp

where z D x y f z`iexclp is the classi er constructed with p samples left out and Qzp f z`iexclp is theerror of this classi er on the p samples left out This procedure is unbiased

ELpz1 z` D EQz f z`iexclp

which means that the expected error when the classi er is trained with ` iexcl p samples is the same as theexpected error of procedure Lp

Proof The proof is a straightforward extension of the leave-one-out case which was derived by Luntzand Brailovsky (Luntz and Brailovsky 1969) by the following series of transformations

EL1z1 z` D 1`

Z X

iD1

Qzi f z`iexcl1dPz1 dPz`

D 1`

Z X

iD1

Qzi f z`iexcl1dPzidPz1 dPziiexcl1dPziC1 dPz`

D 1`

X

iD1

EQzi f z`iexcl1

D EQz f z`iexcl1

The implication of this statement is that the subsampling procedure proposed is unbiased and in expectationgives us more accurate estimates of the true error of a classi er trained with ` iexcl p samples as the numberof subsamples increases

1b The variance of the leave-p-out estimator is biased and optimistic

Statement 2 The variance of the leave-p-out estimator is less than or equal to the variance of aclassi er trained with `iexclp samples and tested on an independent sample so the variance of the leave-p-out

138 MUKHERJEE ET AL

procedure is not necessarily unbiased This procedure is not necessarily unbiased and in general will beoptimistic

V Lpz1 z` middot VQz f z`iexclp

which means that the expected variance when the classi er is trained with ` iexcl p samples is greater thanor equal to the variance of the procedure Lp

Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension

The variance of training sets of size ` iexcl 1 is

VQz f z`iexcl1 D EbQz f z`iexcl12 iexcl [EQz f z`iexcl1]2c

The variance of the estimator

V1`

L1z1 z` D V

1`

X

iD1

ti

where ti is whether an error is made or not when the ith point is left out

Qzi f x1 xiiexcl1 xiC1 x`

We can rewrite this as

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl

Aacute

E1`

X

iD1

ti

2

From statement 1 we know

E1`

X

iD1

ti D EQz f z`iexcl1

So

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl [EQz f z`iexcl1]2

One can show that

E

Aacute1`

X

iD1

ti

2

middot E[Qz f z`iexcl12]

One can write

E

Aacute1`

X

iD1

ti

2

D E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A

If the random variables ti tj are identical and independent then the above equation can be rewritten

E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A D E

sup3t2

`C

`2 iexcl `

`2t2

acuteD E[Qz f z`iexcl1]2

ESTIMATING DATASET SIZE REQUIREMENTS 139

However t1 t` are not independent and are correlated so

E

0

X

i 6Dj

ti tj

1

A middot E

sup3`2 iexcl `

`2t2

acute

The implication of this statement is that the variance of the subsampling procedure proposed is biased anddoes not give an accurate estimate of the variance of a classi er trained with ` iexcl p samples and in generalthe variance of the subsampling procedure will be smaller

1c Quantiles of the leave-p-out estimator are unbiased

Statement 3 Quantiles of the leave-p-out estimator estimated give an accurate estimate of quantilesof a classi er trained with ` iexcl p samples and tested on an independent sample

Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension

The cumulative distribution function of the random variable is t D Qz f z`iexcl1 is PD`iexcl1 ft lt raquog Thecumulative distribution function of the random variable r D 1

`L`iexcl1z1 z` is PD`

fr lt raquog If we showthat these distribution functions are equal then the quantiles of the leave-p-out estimator are unbiased Thedistribution function of the random variable t is

P t lt raquo DZ raquo

iexcl1tptdt D

Z 1

iexcl1tmicroraquo iexcl tptdt

The distribution function for the random variable r D 1`

P`iD1 ri can be written as follows by a similar

sequence of transformations as used in the proof of statement 1

P r lt raquo D 1`

X

iD1

Z 1

iexcl1rimicroraquo iexcl ripridri

which is the same as P t lt raquo We have now shown that the cumulative distribution function of the error measured of the leave-p-out

procedure is equivalent to the cumulative distribution of a classi er trained on ` iexcl p samplesHowever we do not have this distribution when we run the leave-p-out procedure We have a sequence of

p error rates and we take the 25th and 75th quantiles of the empirical distribution function constructed fromthe sequence We can use the KolmogorovndashSmirnov or Smirnov distributions to show that the empiricalquantiles values are close to those for the true underlying distribution For ` large enough ` gt 20 theKolmogorovndashSmirnov distribution gives us

P fp

` supx

jF x iexcl F`xj lt g frac14 1 iexcl 21X

kD1

iexcl1kiexcl1eiexcl2k22

where F x is the distribution function of the error rates and F`x is the empirical distribution functionconstructed from a sequence of ` error rates We can use this result to state that with probability 1 iexcl plusmn

the difference between the estimate of a quantile and the true quantile value will be bounded For the casewhere ` middot 20 instead of using the KolmogorovndashSmirnov distribution one can use tabulated values of theSmirnov distribution

The implication of this statement is that the subsampling procedure proposed gives us more accurateestimates of the quantiles of the true error of a classi er trained with ` iexcl p samples as the number ofsubsamples increases

140 MUKHERJEE ET AL

APPENDIX 2 MOTIVATION OF THE INVERSE POWER LAW FOR THEERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach togeneralization (Engel 2001 Watkin 1993) which strongly motivates the use of Equation (1) In thisapproach the average generalization error can be calculated as a function of n In order to derive analyticexpressions the data is assumed to consist of randomly constructed and uncorrelated input patterns Thisassumption is unrealistic for practical datasets of course but we can assume that the functional relationderived between en and n largely holds for real-life data In the SM approach a teacher (the rule to bediscovered) and student (the learner) are used with the extent of correlation between teacher and studentquantifying generalization ability To be more speci c let us consider a simple perceptron (Rosenblatt1962) rule with a decision function

y D signw cent z

where z is an input vector w is the weight vector for the perceptron (which weights the relevance ofparticular inputs or attributes) and y D sect1 is the output Suppose the weight vector for the teacherperceptron is t and the weight vector for the student perceptron is w Then the number of generalizationerrors made by the student perceptron on a set of p new samples will be

pX

iD1

micro[iexclt cent ziw cent zi]

where microx D 0 if x lt 0 and 1 otherwise This general approach leads to a set of equations for determiningthe generalization error via a replica approach (Hertz 1991) involving integrations in the weight space wThe generalization error is given as a function of the ratio macr D n=m (where m is the number of attributes)However with m xed we can assume the same functional dependence on n as for macr From this analysiswe nd that the generalization error depends on the algorithm used and generally assumes a power law

It can be argued (Dietrich et al 1999) that a Support Vector Machine with linear kernel used in ournumerical simulations has the same generalization error dependence as the optimal perceptron (Opperet al 1990) We have solved the system of equations (Opper et al 1990) in the low macr limit and nd avery close t to Equation (1) With few samples and a large number of measured attributes the low macr limitis most appropriate when considering microarray data However some further insights can also be gainedby considering the high macr limit where the dependence of generalization error on macr (or equivalently n) canbe extracted explicitly Thus for the optimal perceptron the generalization error scales as 50niexcl1 (Opperet al 1995) Similarly for other rules this scaling can be extracted explicitly (Engel 2001) For examplefor the Bayes optimal classi er (derived from the Bayes point or center of mass of version spacemdashthespace of all hypotheses consistent with the data) the generalization error scales as 44niexcl1 (Opper et al1991) For the Adaline learning rule the error scales as 24niexcl1=2 and for the Hebb rule as 40niexcl1=2

(see Watkin [1993] for a review) The dependence on n is thus approximately niexclreg with reg near 1 forthe more ef cient rules such as the optimal perceptron and Bayes optimal classi er The SM approach togeneralization has also been used to quantify the effects of input noise output noise and noise affectingthe parameters in the model (eg the weights w) Thus for example white noise added to examples inthe training set appears as an additive constant term to the generalization error (justifying the b term inEquation (1)) In summary then this approach strongly motivates use of en D aniexclreg C b for modelingthe generalization error

2b A motivation from an approximation theory point of view

Another justi cation for a power law for regression or classi cation comes from approximation theory(Niyogi et al 1996) In the approximation theory framework the classi cation functions come from somerestricted function class f 2 H and the optimal classi cation function fT is a more complicated function

ESTIMATING DATASET SIZE REQUIREMENTS 141

that is not included in the function class H For a wide variety of algorithms the distance between theoptimal function in the function class fo 2 H and fT is characterized as

dfo fT D Oniexclreg

where reg gt 0 Worst case analyses measure the distance between the two functions as the value of thepoint of greatest deviation between these functions

For function classes used in most algorithms the worst case analysis yields reg D 5 In general anempirical result should have quicker convergence since the worst case assumptions need not be madeWhen the loss function V cent cent is smooth then the difference in error measured using the loss functionbetween the functions fo and fT is V fox y iexcl V fT x y D Oniexclreg for all x 2 X and y 2 Y By asmooth loss function we mean loss functions that are `P with p cedil 1 or Lipschitz over a bounded domainNote that the classi cation loss V f x y D microiexclyf x is not Lipschitz and when the classi er outputsf x D sect1 the loss function is `0 However for most algorithms the loss function optimized to set theparameters of the classi er is Lipschitz (for computational reasons the `0 loss is not used) For examplein Support Vector Machines for classi cation the loss function is Lipschitz For this reason this analysisis still appropriate

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group WhiteheadMIT Center for GenomeResearch and the Golub Laboratory Dana-Farber Cancer Institute for many valuable discussions Thiswork is supported in part by grants from Affymetrix Inc Millennium Pharmaceuticals Inc Bristol-MyersSquibb Company and the Of ce of Science (BER) US Department of Energy Grant No DE-FG02-01ER63185 Sayan Mukherjee was supported by a SLOANDOE postdoctoral fellowship

REFERENCES

Adcock C 1997 Sample size determination a review Statistician 46 261ndash283Alizadeh AA Eisen MB Davis RE Ma C Lossos IS Rosenwald A Boldrick JC Sabet H Tran T Yu

X Powell JI Yang L Marti GE Moore T Hudson J Jr Lu L Lewis DB Tibshirani R Sherlock GChan WC Greiner TC Weisenburger DD Armitage JO Warnke R and Staudt LM 2000 Distinct typesof diffuse large B-cell lymphoma identi ed by gene expression pro ling Nature 403 503ndash511

Anderson J 1983 Architecture of Cognition Harvard University Press Cambridge MABhattacharjeeA Richards WG Staunton J Li C Monti S Vasa P Ladd C Beheshti J Bueno R Gillette M

Loda M Weber G Mark EJ Lander ES Wong W Johnson BE Golub TR Sugarbaker DJ MeyersonM 2001 Classi cation of human lung carcinomas by mRNA expression pro ling reveals distinct adenocarcinomasubclasses Proc Natl Acad Sci USA 98 13790ndash13795

Chapelle O Vapnik V Bousquet O Mukherjee S 2001 Choosing multiple parameters for support vector machinesMachine Learning 46 131ndash159

Chipping Forecast 1999 Nature Genetics vol 21 supplementCortes C Jackel L Solla S Vapnik V and Denker S 1994 Asymptotic values and rates of convergence In

Advances in Neural Information Processing Systems VI Morgan Kaufmann San Francisco CADevoye L Gyor L and Lugosi G 1997 A Probabilistic Theory of Pattern Recognition Springer-Verlag New

YorkDietrich R Opper M and Sompolinsky H 1999 Statistical mechanics of support vector networks Phys Rev Letters

82 2975ndash2978Dietrich R Opper M and Sompolinsky H 2000 Support vectors and statistical mechanics In Smola AJ and

Bartlett PJ eds Advances in Large Margin Classi ers 359ndash368 MIT Press Cambridge MAEngel A and Van den Broeck C 2001 Statistical Mechanics of Machine Learning Cambridge University PressGolub TR Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh ML Downing

JR Caligiuri MA Bloom eld CD Lander ES 1999 Molecular classi cation of cancer Class discovery andclass prediction by gene expression monitoring Science 286 531ndash537

Golland P Mukherjee S Tamayo P and Fischl B 2002 Permutation tests for classi cation Submitted to Advancesin Neural Information Proceedings

142 MUKHERJEE ET AL

Good P 1994 Permutation Tests A Practical Guide to Resampling Methods for Testing Hypothesis Springer-VerlagNew York

Guyon I Makhoul J Schwartz R and Vapnik V 1998 What size test set gives good error estimates IEEETransactions on Pattern Analysis and Machine Intelligence 20 52ndash64

Hertz J Krogh A and Palmer R 1991 Introduction to the Theory of Neural Computation Addison-Wesley BostonMA

Kerr A and Churchill G 2001a Experimental design for gene expression microarrays Biostatistics 2 183ndash201Kerr A and Churchill G 2001b Statistical design and the analysis of gene expression microarrays Genetic Research

77 123ndash128Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays Model validation design issues and

standard error application Genome Biol 2(8) research 00321-003211Luntz A and Brailovsky V 1969 On estimation of characters obtained in statistical procedure of recognition

Technicheskaya Kibernetica 3 (in Russian)Niyogi P and Girosi F 1996 On the relationship between generalization error hypothesis complexity and sample

complexity for radial basis functions Neural Computation 8 819ndash842Noterman D Alon U Sierk A Levine A 2001 Transcriptional gene expression pro les for colorectal adenoma

adenocarcinoma and normal tissue examined by oligonucleotide array Cancer Res 61 3124ndash3130Opper M and Haussler D 1991 Generalization performance of Bayes optimal classi cation algorithms for learning

a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer

Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics

23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black

P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442

Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154

Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan

B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385

Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74

Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416

Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York

Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557

Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov

JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322

Address correspondence toSayan Mukherjee

Room 201 Building E-2545 Carleton Street

Cambridge MA 02139

E-mail sayangenomewimitedu

Page 15: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 133

was 18432 The data was preprocessed by taking the base 2 logarithm of all input values The classi erused was a Support Vector Machine (Vapnik 1998) with a linear kernel and no feature selection Errorrates were estimated for training sizes of n D 5 10 15 20 25 30 35 40 In Fig 9 a plot of the learningcurve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples Ap-value of 5 is achieved at about 5 samples The learning curve estimate of the error rate as a functionof n is

errorn D 57niexcl07073 C 0006

Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 21 samples of patients with poor treatment outcome All patients had childhoodmedulloblastomas (Pomeroy et al 2002) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40

Statistical signi cance on this dataset (a p-value of 5) is achieved at about 45 samples which is largerthan any of the training set sizes for which error rates were estimated so strictly speaking we cannot applythe methodology

However we can examine how accurately a learning curve t to the error rates for the above trainingset sizes can extrapolate In Fig 10 a plot of the learning curve and its 25th and 75th quantiles is givenalong with the leave-one-out error of the 60 samples As expected this model is not very accurate andoverestimates the error rate at 59 samples by more than 7 The learning curve estimate of the error rateas a function of n is

errorn D 1115niexcl3295 C 006

FIG 9 Learning curves in the diffuse large B-cell versus follicular morphology classi cation constructed usingsample sizes stated above The middle line is the learning curve for the mean error The bottom line is for the 25thquantile The top line is for the 75th quantile The diamond is the leave-one-out error and the circles are the measuredaverage error rates

134 MUKHERJEE ET AL

FIG 10 Learning curves in the brain tumor treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

Lymphoma treatment outcome

The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 26 samples of patients with poor treatment outcome All patients had diffuse largecell lymphoma (DLCL) (Shipp et al 2001) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40 Statistical signi cance on this dataset (a p-value of 5)is achieved at about 50 samples In Fig 11 a plot of the learning curve and its 25th and 75th quantiles isgiven along with the leave-one-out error of the 58 samples As expected this model is not very accurateand overestimates the error rate at 57 samples by more than 9 The learning curve estimate of the errorrate as a function of n is

errorn D 9431niexcl2957 C 01

Breast cancer treatment outcome

The dataset consists expression levels of 34 samples from patients with breast tumors that metastasizedwithin ve years of disease onset and 44 samples from patients that were disease free for at least ve years(Vanrsquot Veer et al 2002) The dimensionality of the dataset was 24624 No preprocessing was performedThe classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernel without featureselection Error rates were estimated for training set sizes of n D 10 20 30 40 50 60 70 Statisticalsigni cance on this dataset (a p-value of 5) is achieved at about 65 samples In Fig 12 a plot of thelearning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 Asexpected this model is not very accurate and overestimates the error rate at 77 samples by more than 6The learning curve estimate of the error rate as a function of n is

errorn D 4852niexcl0733 C 01

ESTIMATING DATASET SIZE REQUIREMENTS 135

FIG 11 Learning curves in the lymphoma treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

FIG 12 Learning curves in the breast cancer treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates

136 MUKHERJEE ET AL

5 CONCLUSIONS

We have described a methodology for assessing the signi cance of a classi errsquos performance via apermutation test and constructing a learning curve to extrapolate error statistics for larger data sets thatinclude yet unseen samples We applied the method to eight cancer classi cation problems of varyinglevels of dif culty Based on the results of the previous section one can see that the inverse power-lawscaling model proposed ts the empirical error rates reasonably well The classi er we used was an SVMbut the methodology is applicable to other algorithms (eg weighted voting k-nearest neighbors logisticregression etc) For the morphology classi cation problems the extrapolation is quite accurate For thetreatment outcome classi cation problems the combination of the increased complexity of the problemsand the limited dataset sizes yield a less accurate but still indicative extrapolation As expected the modelimproves as larger training samples sizes are used in the learning curve t (see Figs 4 and 5) The learningcurves bear out the empirical observation that morphological distinctions are more dramatic and thusin general ldquosimplerrdquo problems than the more subtle distinctions that must be determined for treatmentoutcome prediction Signi cance on morphology problems is achieved with 10ndash20 training samples andldquoreasonably accuraterdquo extrapolation requires 30ndash40 training samples In contrast for treatment outcomesigni cance is achieved with 45ndash60 training samples and ldquoreasonably accuraterdquo extrapolation requireson the order of 75ndash100 training samples For morphological distinctions the learning curve prediction isreasonably close to the actual leave-one-out error measured at a larger size The 25th and 75th quantilemodels provide useful error bar envelopes that enclose the observed error rates for those problems Fortreatment outcome prediction due to the large training set size required to achieve signi cance and smallavailable dataset sizes we do not have enough signi cant classi ers with which to construct an accuratelearning curve Consequently we get less accurate estimates of the leave-one-error on the entire datasetfor the outcome treatment examples with differences of 7 for brain tumor outcome 9 for lymphomatreatment outcome and 8 for breast tumor metastasis

The estimations of the asymptotic Bayes error b the learning rate a and decay rate reg can also be useddirectly to characterize the dif culty of a problem and the complexity of a model They can provide abasis for comparing and contrasting models and problems To illustrate we show in Fig 13 the values ofthese parameters for the examples discussed in the paper The morphology and treatment outcome datasetscluster with respect to reg and b We have not elaborated on this aspect of the analysis but it is certainly aninteresting direction to pursue in the future

In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable

FIG 13 A plot of 1=reg versus macr for the eight datasets The diamonds correspond to treatment outcome problemsand the squares correspond to morphology prediction problems

ESTIMATING DATASET SIZE REQUIREMENTS 137

tool when designing future experiments either for evaluating whether it is worthwhile to collect additionalsamples or for obtaining a deeper insight into the complexity of a given classi cation problem basedon preliminary data Table 1 shows a summary of the results for the examples described in this paperThe results of applying this method to those examples suggest that minimum dataset size requirementsfor morphological classi cations are typically in the 10ndash20 samples range and upwards of 50 samples fortreatment outcome classi cation These results can be used to provide general rule-of-thumb guidelinesbut the exact numbers for a given problem are dataset and classi er dependent This method can also beapplied to other domains where a prospective estimation of the number of samples is relevant as is thecase in many problems using molecular features to classify biological samples eg classi cation basedon proteomic mass spectroscopy data chemosensitivity prediction survival analysis and putative classdiscovery using clustering

APPENDIX 1 BIAS PROPERTIES OF THE MEAN VARIANCE ANDQUANTILES OF LEAVE-P-OUT ESTIMATORS

1a The mean of the leave-p-out estimator is unbiased

Statement 1 The procedure of excluding p samples from a dataset of size ` constructing a classi erand then testing on the p samples left out is designated as follows

Lpz1 z` D 1p

X

p

Qzp f z`iexclp

where z D x y f z`iexclp is the classi er constructed with p samples left out and Qzp f z`iexclp is theerror of this classi er on the p samples left out This procedure is unbiased

ELpz1 z` D EQz f z`iexclp

which means that the expected error when the classi er is trained with ` iexcl p samples is the same as theexpected error of procedure Lp

Proof The proof is a straightforward extension of the leave-one-out case which was derived by Luntzand Brailovsky (Luntz and Brailovsky 1969) by the following series of transformations

EL1z1 z` D 1`

Z X

iD1

Qzi f z`iexcl1dPz1 dPz`

D 1`

Z X

iD1

Qzi f z`iexcl1dPzidPz1 dPziiexcl1dPziC1 dPz`

D 1`

X

iD1

EQzi f z`iexcl1

D EQz f z`iexcl1

The implication of this statement is that the subsampling procedure proposed is unbiased and in expectationgives us more accurate estimates of the true error of a classi er trained with ` iexcl p samples as the numberof subsamples increases

1b The variance of the leave-p-out estimator is biased and optimistic

Statement 2 The variance of the leave-p-out estimator is less than or equal to the variance of aclassi er trained with `iexclp samples and tested on an independent sample so the variance of the leave-p-out

138 MUKHERJEE ET AL

procedure is not necessarily unbiased This procedure is not necessarily unbiased and in general will beoptimistic

V Lpz1 z` middot VQz f z`iexclp

which means that the expected variance when the classi er is trained with ` iexcl p samples is greater thanor equal to the variance of the procedure Lp

Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension

The variance of training sets of size ` iexcl 1 is

VQz f z`iexcl1 D EbQz f z`iexcl12 iexcl [EQz f z`iexcl1]2c

The variance of the estimator

V1`

L1z1 z` D V

1`

X

iD1

ti

where ti is whether an error is made or not when the ith point is left out

Qzi f x1 xiiexcl1 xiC1 x`

We can rewrite this as

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl

Aacute

E1`

X

iD1

ti

2

From statement 1 we know

E1`

X

iD1

ti D EQz f z`iexcl1

So

V

1`

X

iD1

ti

D E

Aacute1`

X

iD1

ti

2

iexcl [EQz f z`iexcl1]2

One can show that

E

Aacute1`

X

iD1

ti

2

middot E[Qz f z`iexcl12]

One can write

E

Aacute1`

X

iD1

ti

2

D E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A

If the random variables ti tj are identical and independent then the above equation can be rewritten

E

0

1`2

X

iD1

t2i C 1

`2

X

i 6Dj

ti tj

1

A D E

sup3t2

`C

`2 iexcl `

`2t2

acuteD E[Qz f z`iexcl1]2

ESTIMATING DATASET SIZE REQUIREMENTS 139

However t1 t` are not independent and are correlated so

E

0

X

i 6Dj

ti tj

1

A middot E

sup3`2 iexcl `

`2t2

acute

The implication of this statement is that the variance of the subsampling procedure proposed is biased anddoes not give an accurate estimate of the variance of a classi er trained with ` iexcl p samples and in generalthe variance of the subsampling procedure will be smaller

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z; f_{z_{\ell-1}})$ is $P_{D_{\ell-1}}\{t < \xi\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_{\ell-1}(z_1, \ldots, z_\ell)$ is $P_{D_\ell}\{r < \xi\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \xi) = \int_{-\infty}^{\xi} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\xi - t)\, p(t)\, dt.$$

The distribution function for the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

$$P(r < \xi) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\xi - r_i)\, p(r_i)\, dr_i,$$

which is the same as P(t < ξ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

$$P\left\{ \sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon \right\} \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^{2} \varepsilon^{2}},$$

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
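As a practical note, the empirical quartiles and a confidence half-width follow directly from the array of subsampled error rates produced by the earlier `leave_p_out_error` sketch. Truncating the Kolmogorov–Smirnov series above at k = 1 and solving for ε gives the familiar bound sup|F − F_n| < sqrt(ln(2/δ)/(2n)); the sketch below, with our own function names and placeholder data, computes both:

```python
import numpy as np

def quantile_band(error_rates, delta=0.05):
    """Empirical 25th/75th quantiles of subsampled error rates, plus the
    one-term Kolmogorov-Smirnov half-width: with probability ~1 - delta,
    sup_x |F(x) - F_n(x)| < sqrt(ln(2/delta) / (2n))."""
    e = np.asarray(error_rates)
    q25, q75 = np.quantile(e, [0.25, 0.75])
    half_width = np.sqrt(np.log(2.0 / delta) / (2.0 * len(e)))
    return q25, q75, half_width

# example with a placeholder array of subsampled error rates (n = 21 > 20)
rates = np.array([0.10, 0.15, 0.12, 0.20, 0.08, 0.14, 0.18, 0.11,
                  0.16, 0.13, 0.09, 0.17, 0.15, 0.12, 0.19, 0.10,
                  0.14, 0.13, 0.11, 0.16, 0.12])
print(quantile_band(rates))
```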


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta\left[ -(t \cdot z_i)(w \cdot z_i) \right],$$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
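The teacher–student picture is easy to simulate. The following sketch is our own construction (a Hebb-rule student, with made-up sizes), counting disagreements θ[−(t·z_i)(w·z_i)] between teacher and student on fresh random patterns:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50                                  # number of attributes (held fixed)
t = rng.normal(size=m)                  # teacher weight vector

def hebb_student_error(n, n_test=20000):
    # Hebb rule: w is the sum of training patterns signed by the teacher's labels
    Z = rng.normal(size=(n, m))
    y = np.sign(Z @ t)
    w = (y[:, None] * Z).sum(0)
    # generalization error: fraction of new patterns with theta[-(t.z)(w.z)] = 1
    Zt = rng.normal(size=(n_test, m))
    return np.mean(np.sign(Zt @ t) != np.sign(Zt @ w))

for n in (25, 100, 400, 1600):
    print(n, hebb_student_error(n))     # error shrinks roughly like n**-0.5
```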

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule, the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2}

(see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = a\, n^{-\alpha} + b$ for modeling the generalization error.
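Fitting this model to measured error rates is a small nonlinear least-squares problem. A sketch, assuming SciPy is available and using illustrative error rates in place of real subsampling output:

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    # inverse power-law model e(n) = a * n**(-alpha) + b
    return a * n ** (-alpha) + b

# training-set sizes and mean subsampled error rates (illustrative values)
n = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
err = np.array([0.38, 0.34, 0.31, 0.29, 0.27])

# bounds keep a, alpha, b nonnegative; p0 is a rough initial guess
(a, alpha, b), _ = curve_fit(learning_curve, n, err,
                             p0=(1.0, 0.5, 0.1),
                             bounds=(0.0, [np.inf, 5.0, 1.0]))
print(f"e(n) ~= {a:.2f} * n**-{alpha:.2f} + {b:.2f}")
print("extrapolated error at n = 100:", learning_curve(100.0, a, alpha, b))
```

The fitted b plays the role of the asymptotic error, and the 25th and 75th quantile curves can be fit the same way by replacing the mean error rates with the corresponding quantiles.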

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function


that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
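To illustrate the last point, a small sketch (our own example, not from the paper) contrasting the discontinuous 0/1 classification loss with the Lipschitz hinge loss that SVM training actually optimizes:

```python
import numpy as np

# margin values y * f(x); the 0/1 loss steps discontinuously at 0,
# while the hinge loss max(0, 1 - y f(x)) is Lipschitz with constant 1
margins = np.linspace(-2.0, 2.0, 9)
zero_one_loss = (margins <= 0).astype(float)   # theta(-y f(x))
hinge_loss = np.maximum(0.0, 1.0 - margins)
for m, z, h in zip(margins, zero_one_loss, hinge_loss):
    print(f"margin {m:+.1f}  0/1 loss {z:.0f}  hinge loss {h:.2f}")
```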

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu




a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer

Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics

23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black

P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442

Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154

Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan

B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385

Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74

Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416

Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York

Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557

Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov

JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322

Address correspondence toSayan Mukherjee

Room 201 Building E-2545 Carleton Street

Cambridge MA 02139

E-mail sayangenomewimitedu

Page 18: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

136 MUKHERJEE ET AL

5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5).

The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
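As a concrete illustration of the learning-curve fit described above, the sketch below fits the inverse power-law model $e(n) = an^{-\alpha} + b$ to a set of error-rate estimates by nonlinear least squares. It is only an illustrative sketch: the sample sizes and error rates are invented, and SciPy's curve_fit is one of several reasonable fitting routines; quantile learning curves can be fit the same way by replacing the mean error rates with the 25th or 75th quantile estimates.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    """Inverse power-law model: e(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Invented error-rate estimates at several training-set sizes; in practice
# these would come from the subsampling procedure.
n_obs = np.array([10.0, 20.0, 40.0, 60.0, 80.0, 100.0])
e_obs = np.array([0.35, 0.27, 0.21, 0.18, 0.165, 0.155])

# Fit a, alpha, b, constrained to be non-negative.
(a, alpha, b), _ = curve_fit(
    learning_curve, n_obs, e_obs,
    p0=[1.0, 0.5, 0.1],
    bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]),
)

print(f"a = {a:.3f}, alpha = {alpha:.3f}, asymptotic error b = {b:.3f}")
print(f"extrapolated error at n = 200: {learning_curve(200.0, a, alpha, b):.3f}")
```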

The estimations of the asymptotic Bayes error $b$, the learning rate $a$, and the decay rate $\alpha$ can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to $\alpha$ and $b$. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 samples range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

FIG. 13. A plot of $1/\alpha$ versus $b$ for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.

APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding $p$ samples from a dataset of size $\ell$, constructing a classifier, and then testing on the $p$ samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{p} \sum_{(p)} Q(z^{(p)}, f_{z^{(\ell-p)}}),$$

where $z = (x, y)$, $f_{z^{(\ell-p)}}$ is the classifier constructed with the $p$ samples left out, and $Q(z^{(p)}, f_{z^{(\ell-p)}})$ is the error of this classifier on the $p$ samples left out. This procedure is unbiased:

$$E\,L_p(z_1, \ldots, z_\ell) = E\,Q(z, f_{z^{(\ell-p)}}),$$

which means that the expected error when the classifier is trained with $\ell - p$ samples is the same as the expected error of procedure $L_p$.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) by the following series of transformations:

$$E\,L_1(z_1, \ldots, z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i, f_{z^{(\ell-1)}})\, dP(z_1) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i, f_{z^{(\ell-1)}})\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\,Q(z_i, f_{z^{(\ell-1)}}) = E\,Q(z, f_{z^{(\ell-1)}}).$$

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
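Statement 1 is what licenses the random subsampling used in the main text. Below is a minimal sketch of that procedure; the interface (a feature matrix X and label vector y) and the choice of scikit-learn's LinearSVC as the base classifier are illustrative assumptions rather than the authors' exact implementation, and the random subsamples approximate the full collection of held-out subsets of size p.

```python
import numpy as np
from sklearn.svm import LinearSVC

def leave_p_out_errors(X, y, p, n_subsamples=200, seed=0):
    """Repeatedly hold out p samples, train on the rest, and record the
    error rate on the held-out samples (random subsampling)."""
    rng = np.random.default_rng(seed)
    ell = len(y)
    errors = []
    for _ in range(n_subsamples):
        held_out = rng.choice(ell, size=p, replace=False)
        train = np.setdiff1d(np.arange(ell), held_out)
        clf = LinearSVC().fit(X[train], y[train])
        errors.append(np.mean(clf.predict(X[held_out]) != y[held_out]))
    return np.array(errors)

# By Statement 1, errors.mean() is an unbiased estimate of the expected
# error of a classifier trained on ell - p samples; by Statement 3 below,
# its empirical quantiles estimate the corresponding error quantiles.
```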

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with $\ell - p$ samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

$$V(L_p(z_1, \ldots, z_\ell)) \leq V(Q(z, f_{z^{(\ell-p)}})),$$

which means that the expected variance when the classifier is trained with $\ell - p$ samples is greater than or equal to the variance of the procedure $L_p$.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size $\ell - 1$ is

$$V(Q(z, f_{z^{(\ell-1)}})) = E\left\{Q(z, f_{z^{(\ell-1)}})^2\right\} - \left[E\,Q(z, f_{z^{(\ell-1)}})\right]^2.$$

The variance of the estimator is

$$V(L_1(z_1, \ldots, z_\ell)) = V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right),$$

where $t_i$ indicates whether or not an error is made when the $i$-th point is left out:

$$t_i = Q(z_i, f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell)).$$

We can rewrite this as

$$V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^2 - \left(E\,\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^2.$$

From Statement 1 we know

$$E\,\frac{1}{\ell} \sum_{i=1}^{\ell} t_i = E\,Q(z, f_{z^{(\ell-1)}}),$$

so

$$V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^2 - \left[E\,Q(z, f_{z^{(\ell-1)}})\right]^2.$$

One can show that

$$E\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^2 \leq E\left[Q(z, f_{z^{(\ell-1)}})^2\right].$$

One can write

$$E\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^2 = E\left(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \neq j} t_i t_j\right).$$

If the random variables $t_i, t_j$ were identical and independent, then the above equation could be rewritten as

$$E\left(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \neq j} t_i t_j\right) = E\left(\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\,t^2\right) = E\left[Q(z, f_{z^{(\ell-1)}})\right]^2.$$

However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\left(\frac{1}{\ell^2} \sum_{i \neq j} t_i t_j\right) \leq E\left(\frac{\ell^2 - \ell}{\ell^2}\,t^2\right).$$

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.
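Statement 2 can be checked empirically. The Monte Carlo sketch below (synthetic Gaussian data and a nearest-centroid rule standing in for the SVM, both arbitrary illustrative choices) compares the variance across independent datasets of the leave-one-out estimate with the variance of a single independent test error; the former comes out smaller, as the statement predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Two balanced Gaussian classes in 5 dimensions (synthetic stand-in)."""
    y = rng.permutation(np.arange(n) % 2)
    X = rng.normal(loc=y[:, None].astype(float), scale=1.0, size=(n, 5))
    return X, y

def error(X_tr, y_tr, X_te, y_te):
    """0/1 error of a nearest-centroid classifier."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_te - c1, axis=1) <
            np.linalg.norm(X_te - c0, axis=1)).astype(int)
    return float(np.mean(pred != y_te))

loo_means, single_tests = [], []
for _ in range(300):
    X, y = sample(21)                       # ell = 21: train on 20 samples
    loo = [error(np.delete(X, i, 0), np.delete(y, i), X[i:i + 1], y[i:i + 1])
           for i in range(21)]
    loo_means.append(np.mean(loo))          # leave-one-out estimate
    X2, y2 = sample(21)                     # fresh dataset: train on 20,
    single_tests.append(error(X2[:20], y2[:20], X2[20:], y2[20:]))  # test on 1

print(np.var(loo_means), np.var(single_tests))   # V(L_1) <= V(Q)
```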

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z, f_{z^{(\ell-1)}})$ is $P_{D^{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_{\ell-1}(z_1, \ldots, z_\ell)$ is $P_{D^{\ell}}\{r < \tau\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\tau - t)\,p(t)\,dt.$$

The distribution function for the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$ can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\,p(r_i)\,dr_i,$$

which is the same as $P(t < \tau)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, one per subsample, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov-Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \epsilon\right\} \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\epsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \leq 20$, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
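To connect Statement 3 to practice, the helper below summarizes a vector of subsampled error rates (as produced, e.g., by a procedure like leave_p_out_errors above) by its 25th and 75th quantiles, and inverts the leading term of the Kolmogorov-Smirnov series above to bound the deviation of the empirical distribution function; the default confidence level is an arbitrary illustrative choice.

```python
import numpy as np

def quantile_envelope(errors, delta=0.05):
    """25th/75th quantiles of subsampled error rates, plus the epsilon for
    which sup_x |F(x) - F_ell(x)| < epsilon holds with probability ~1 - delta
    (leading term of the asymptotic Kolmogorov-Smirnov bound, ell > 20)."""
    ell = len(errors)
    q25, q75 = np.quantile(errors, [0.25, 0.75])
    # Solve 2 * exp(-2 * ell * eps**2) = delta for eps.
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * ell))
    return q25, q75, eps
```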


APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix, we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of $n$. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between $e(n)$ and $n$ largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where $z$ is an input vector, $w$ is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is $t$ and the weight vector for the student perceptron is $w$. Then the number of generalization errors made by the student perceptron on a set of $p$ new samples will be

$$\sum_{i=1}^{p} \theta\left[-(t \cdot z_i)(w \cdot z_i)\right],$$

where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space $w$. The generalization error is given as a function of the ratio $\beta = n/m$ (where $m$ is the number of attributes). However, with $m$ fixed, we can assume the same functional dependence on $n$ as on $\beta$. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
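This teacher-student picture is easy to simulate. The toy sketch below trains a student perceptron with the Hebb rule on random patterns labeled by a random teacher and measures the disagreement rate on fresh patterns; the dimension m, the sample sizes, and the choice of the Hebb rule (rather than the optimal perceptron) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                   # number of attributes
t = rng.normal(size=m)                    # teacher weight vector

def disagreement(w, n_test=5000):
    """Fraction of fresh random patterns on which student and teacher disagree."""
    Z = rng.normal(size=(n_test, m))
    return np.mean(np.sign(Z @ w) != np.sign(Z @ t))

for n in [25, 100, 400, 1600]:
    Z = rng.normal(size=(n, m))
    y = np.sign(Z @ t)                    # labels provided by the teacher
    w = (y[:, None] * Z).sum(axis=0)      # Hebb rule: w = sum_i y_i z_i
    print(n, disagreement(w))             # error shrinks roughly like n**-0.5
```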

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low $\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low $\beta$ limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high $\beta$ limit, where the dependence of the generalization error on $\beta$ (or, equivalently, $n$) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as $0.50n^{-1}$ (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as $0.44n^{-1}$ (Opper et al., 1991). For the Adaline learning rule, the error scales as $0.24n^{-1/2}$, and for the Hebb rule as $0.40n^{-1/2}$ (see Watkin [1993] for a review). The dependence on $n$ is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights $w$). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the $b$ term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = an^{-\alpha} + b$ for modeling the generalization error.

2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class $f \in H$, and the optimal classification function $f_T$ is a more complicated function that is not included in the function class $H$. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst case analysis yields $\alpha = 0.5$. In general, an empirical result should have quicker convergence, since the worst case assumptions need not be made. When the loss function $V(\cdot, \cdot)$ is smooth, the difference in error, measured using the loss function, between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all $x \in X$ and $y \in Y$. By a smooth loss function, we mean loss functions that are $\ell_p$ with $p \geq 1$ or Lipschitz over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-yf(x))$ is not Lipschitz, and when the classifier outputs $f(x) = \pm 1$, the loss function is $\ell_0$. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the $\ell_0$ loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
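To make the smoothness distinction concrete, the short sketch below contrasts the non-Lipschitz 0-1 classification loss $\theta(-yf(x))$ with the Lipschitz hinge loss $\max(0, 1 - yf(x))$ that Support Vector Machine training actually optimizes; the margin values are made up for illustration.

```python
import numpy as np

def zero_one_loss(y, f_x):
    """Classification loss theta(-y * f(x)): 1 on a mistake, 0 otherwise (not Lipschitz)."""
    return (y * f_x <= 0).astype(float)

def hinge_loss(y, f_x):
    """SVM hinge loss max(0, 1 - y * f(x)): Lipschitz with constant 1 in f(x)."""
    return np.maximum(0.0, 1.0 - y * f_x)

# Illustrative outputs f(x) near the decision boundary.
y = np.array([1.0, 1.0, -1.0, -1.0])
f_x = np.array([0.01, -0.01, 0.5, -2.0])

print(zero_one_loss(y, f_x))  # jumps from 0 to 1 across the boundary
print(hinge_loss(y, f_x))     # changes smoothly with the margin
```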

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Proceedings.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu


82 2975ndash2978Dietrich R Opper M and Sompolinsky H 2000 Support vectors and statistical mechanics In Smola AJ and

Bartlett PJ eds Advances in Large Margin Classi ers 359ndash368 MIT Press Cambridge MAEngel A and Van den Broeck C 2001 Statistical Mechanics of Machine Learning Cambridge University PressGolub TR Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh ML Downing

JR Caligiuri MA Bloom eld CD Lander ES 1999 Molecular classi cation of cancer Class discovery andclass prediction by gene expression monitoring Science 286 531ndash537

Golland P Mukherjee S Tamayo P and Fischl B 2002 Permutation tests for classi cation Submitted to Advancesin Neural Information Proceedings

142 MUKHERJEE ET AL

Good P 1994 Permutation Tests A Practical Guide to Resampling Methods for Testing Hypothesis Springer-VerlagNew York

Guyon I Makhoul J Schwartz R and Vapnik V 1998 What size test set gives good error estimates IEEETransactions on Pattern Analysis and Machine Intelligence 20 52ndash64

Hertz J Krogh A and Palmer R 1991 Introduction to the Theory of Neural Computation Addison-Wesley BostonMA

Kerr A and Churchill G 2001a Experimental design for gene expression microarrays Biostatistics 2 183ndash201Kerr A and Churchill G 2001b Statistical design and the analysis of gene expression microarrays Genetic Research

77 123ndash128Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays Model validation design issues and

standard error application Genome Biol 2(8) research 00321-003211Luntz A and Brailovsky V 1969 On estimation of characters obtained in statistical procedure of recognition

Technicheskaya Kibernetica 3 (in Russian)Niyogi P and Girosi F 1996 On the relationship between generalization error hypothesis complexity and sample

complexity for radial basis functions Neural Computation 8 819ndash842Noterman D Alon U Sierk A Levine A 2001 Transcriptional gene expression pro les for colorectal adenoma

adenocarcinoma and normal tissue examined by oligonucleotide array Cancer Res 61 3124ndash3130Opper M and Haussler D 1991 Generalization performance of Bayes optimal classi cation algorithms for learning

a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer

Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics

23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black

P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442

Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154

Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan

B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385

Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74

Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416

Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York

Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557

Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov

JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322

Address correspondence toSayan Mukherjee

Room 201 Building E-2545 Carleton Street

Cambridge MA 02139

E-mail sayangenomewimitedu

Page 21: Estimating Dataset Size Requirements for Classifying DNA ...enicgc/pubs/2003/datasetsize.pdfclassi” cation approaches based on machine learning algorithms applied to DNA microarray

ESTIMATING DATASET SIZE REQUIREMENTS 139

However, $t_1, \ldots, t_\ell$ are not independent and are correlated, so

$$E\left(\sum_{i \neq j} t_i t_j\right) \;\le\; E\left(\frac{\ell^2 - \ell}{\ell^2}\, t^2\right).$$

The implication of this statement is that the variance estimate produced by the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.
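To make the correlation between the $t_i$ concrete, the following minimal Monte Carlo sketch estimates the off-diagonal correlation of leave-one-out errors across independent datasets. The toy Gaussian class-conditional data and the nearest-centroid stand-in classifier are illustrative assumptions, not the paper's datasets or algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
l, dim, runs = 20, 5, 500

def loo_errors():
    # one dataset: l/2 samples per class from two shifted Gaussians
    x = np.vstack([rng.normal(-0.4, 1.0, (l // 2, dim)),
                   rng.normal(+0.4, 1.0, (l // 2, dim))])
    y = np.r_[np.zeros(l // 2), np.ones(l // 2)]
    t = np.empty(l)
    for i in range(l):
        tr = np.arange(l) != i                  # train on the other l - 1 points
        m0 = x[tr & (y == 0)].mean(axis=0)
        m1 = x[tr & (y == 1)].mean(axis=0)
        pred = np.linalg.norm(x[i] - m1) < np.linalg.norm(x[i] - m0)
        t[i] = float(pred != y[i])
    return t

T = np.vstack([loo_errors() for _ in range(runs)])  # runs x l error indicators
C = np.corrcoef(T, rowvar=False)                    # l x l correlation matrix
print("mean off-diagonal corr(t_i, t_j):", C[~np.eye(l, dtype=bool)].mean())
```

Because the training sets behind $t_i$ and $t_j$ share $\ell - 2$ points, the printed mean correlation is nonzero, which is the source of the bias discussed above.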

1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.

Proof. Again, we prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z, f_{\ell-1})$ is $P_{D^{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_{\ell-1}(z_1, \ldots, z_\ell)$ is $P_{D^{\ell}}\{r < \tau\}$. If we show that these two distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\tau - t)\, p(t)\, dt.$$

The distribution function for the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to the one used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\, p(r_i)\, dr_i,$$

which is the same as $P(t < \tau)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates from the subsampled splits, and we take the 25th and 75th quantiles of the empirical distribution function constructed from this sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 \varepsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value is bounded. For the case where $\ell \le 20$, instead of the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
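As a quick numerical check, the asymptotic series above is straightforward to evaluate. The following sketch implements the standard Kolmogorov–Smirnov asymptotics; the choice of $\ell = 25$ and a CDF deviation of 0.2 is purely illustrative:

```python
import math

def ks_prob(eps, terms=100):
    """P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps }, asymptotic in l."""
    return 1.0 - 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * eps * eps)
                           for k in range(1, terms + 1))

# With l = 25 error rates, a uniform CDF deviation of 0.2 corresponds to
# eps = sqrt(25) * 0.2 = 1.0:
print(ks_prob(1.0))   # ~0.73
```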

The implication of this statement is that the proposed subsampling procedure gives increasingly accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
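For concreteness, here is a minimal sketch of the subsampling procedure whose quantiles Statement 3 concerns: train on $\ell - p$ samples, test on the held-out $p$, repeat, and report quantiles of the split error rates. The toy Gaussian data and nearest-centroid stand-in classifier are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
l, p, dim, splits = 40, 8, 10, 200

x = np.vstack([rng.normal(-0.4, 1.0, (l // 2, dim)),
               rng.normal(+0.4, 1.0, (l // 2, dim))])
y = np.r_[np.zeros(l // 2), np.ones(l // 2)]

errs = []
for _ in range(splits):
    idx = rng.permutation(l)
    te, tr = idx[:p], idx[p:]                   # hold out p, train on l - p
    m0 = x[tr][y[tr] == 0].mean(axis=0)
    m1 = x[tr][y[tr] == 1].mean(axis=0)
    pred = (np.linalg.norm(x[te] - m1, axis=1) <
            np.linalg.norm(x[te] - m0, axis=1))
    errs.append(np.mean(pred != y[te]))

q25, q50, q75 = np.quantile(errs, [0.25, 0.50, 0.75])
print(f"leave-{p}-out error: 25% {q25:.2f}  median {q50:.2f}  75% {q75:.2f}")
```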

APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of $n$. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is of course unrealistic for practical datasets, but we can assume that the functional relation derived between $e(n)$ and $n$ largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with decision function

$$y = \mathrm{sign}(w \cdot z),$$

where $z$ is an input vector, $w$ is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is $t$ and the weight vector for the student perceptron is $w$. Then the number of generalization errors made by the student perceptron on a set of $p$ new samples will be

$$\sum_{i=1}^{p} \theta\left[-(t \cdot z_i)(w \cdot z_i)\right],$$

where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991), involving integrations in the weight space $w$. The generalization error is given as a function of the ratio $\beta = n/m$ (where $m$ is the number of attributes). However, with $m$ fixed, we can assume the same functional dependence on $n$ as on $\beta$. From this analysis we find that the generalization error depends on the algorithm used and generally follows a power law.
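A small simulation makes the teacher-student picture tangible. For spherically symmetric inputs, the generalization error of a student $w$ against a teacher $t$ equals $\arccos\!\big(t \cdot w / (\|t\|\,\|w\|)\big)/\pi$. The sketch below is not the paper's simulation; the Hebb-rule student, Gaussian inputs, and $m = 200$ attributes are illustrative assumptions chosen to show the power-law fall-off with $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200                                    # number of attributes
t = rng.normal(size=m)
t /= np.linalg.norm(t)                     # teacher weight vector

for n in [50, 100, 200, 400, 800, 1600]:   # training set sizes
    z = rng.normal(size=(n, m))            # random, uncorrelated inputs
    y = np.sign(z @ t)                     # teacher labels
    w = (y[:, None] * z).sum(axis=0)       # Hebb-rule student: w = sum_i y_i z_i
    cos = t @ w / np.linalg.norm(w)
    eps = np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    print(f"n = {n:5d}   generalization error ~ {eps:.3f}")
```

The printed errors decay roughly as $n^{-1/2}$, the Hebb-rule scaling quoted below.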

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-$\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-$\beta$ limit is the most appropriate when considering microarray data. However, some further insight can be gained by considering the high-$\beta$ limit, where the dependence of the generalization error on $\beta$ (or, equivalently, $n$) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as $0.50\, n^{-1}$ (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as $0.44\, n^{-1}$ (Opper and Haussler, 1991). For the Adaline learning rule the error scales as $0.24\, n^{-1/2}$, and for the Hebb rule as $0.40\, n^{-1/2}$ (see Watkin et al. [1993] for a review). The dependence on $n$ is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights $w$). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error, justifying the $b$ term in Equation (1). In summary, then, this approach strongly motivates the use of

$$e(n) = a\, n^{-\alpha} + b$$

for modeling the generalization error.
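As a practical note, fitting Equation (1) to an observed learning curve is a small nonlinear least-squares problem. The sketch below uses hypothetical error rates (not results from the paper) and scipy's curve_fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    # Equation (1): inverse power law with asymptotic (Bayes) error b
    return a * n ** (-alpha) + b

n = np.array([10.0, 20.0, 40.0, 60.0, 80.0, 100.0])   # training set sizes
e = np.array([0.31, 0.22, 0.15, 0.12, 0.10, 0.09])    # hypothetical error rates

(a, alpha, b), _ = curve_fit(learning_curve, n, e,
                             p0=(1.0, 1.0, 0.05),
                             bounds=([0.0, 0.0, 0.0], [np.inf, 2.0, 1.0]))
print(f"e(n) ~ {a:.2f} * n^(-{alpha:.2f}) + {b:.3f}")
print("extrapolated error at n = 200:", learning_curve(200.0, a, alpha, b))
```

Once the three parameters are fitted, extrapolating the curve to larger $n$ gives the dataset size estimate that is the subject of the main text.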

2b. A motivation from an approximation theory point of view

Another justification for a power law, for regression or classification, comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class, $f \in H$, and the optimal classification function $f_T$ is a more complicated function that is not included in the function class $H$. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields $\alpha = 0.5$. In general, an empirical result should converge more quickly, since the worst-case assumptions need not be made. When the loss function $V(\cdot, \cdot)$ is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all $x \in X$ and $y \in Y$. By a smooth loss function we mean a loss function that is $\ell_p$ with $p \ge 1$, or Lipschitz, over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs $f(x) = \pm 1$ the loss function is $\ell_0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the $\ell_0$ loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
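To illustrate the distinction, here is a small sketch (illustrative, not from the paper) contrasting the 0-1 classification loss, which jumps at the decision boundary, with the hinge loss commonly optimized by Support Vector Machines, whose difference quotients stay bounded by 1:

```python
def zero_one(margin):
    # 0-1 classification loss theta(-y f(x)) as a function of the margin y f(x)
    return 1.0 if margin < 0 else 0.0

def hinge(margin):
    # hinge loss max(0, 1 - y f(x)); Lipschitz with constant 1
    return max(0.0, 1.0 - margin)

# Difference quotients on intervals straddling the decision boundary: the 0-1
# quotient grows without bound as the interval shrinks; the hinge quotient
# never exceeds 1.
for h in [0.1, 0.01, 0.001]:
    a, b = -h / 2, h / 2
    print(f"h = {h:5.3f}   0-1: {abs(zero_one(b) - zero_one(a)) / h:8.1f}"
          f"   hinge: {abs(hinge(b) - hinge(a)) / h:.2f}")
```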

ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.

REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
