JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 10, Number 2, 2003
© Mary Ann Liebert, Inc.
Pp. 119–142
Estimating Dataset Size Requirements for Classifying DNA Microarray Data
SAYAN MUKHERJEE,1,3 PABLO TAMAYO,1 SIMON ROGERS,4 RYAN RIFKIN,1,3 ANNA ENGLE,3 COLIN CAMPBELL,4 TODD R. GOLUB,1,2 and JILL P. MESIROV1
ABSTRACT
A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced. The goal is to use existing classification results to estimate dataset size requirements for future classification experiments and to evaluate the gain in accuracy and significance of classifiers built with additional data. The method is based on fitting inverse power-law models to construct empirical learning curves. It also includes a permutation test procedure to assess the statistical significance of classification performance for a given dataset size. This procedure is applied to several molecular classification problems representing a broad spectrum of levels of complexity.
Key words: gene expression profiling, molecular pattern recognition, DNA microarrays, microarray analysis, sample size estimation.
1. INTRODUCTION
Over the last few years, the routine use of DNA microarrays has made possible the creation of large datasets of molecular information characterizing complex biological systems. Molecular classification approaches based on machine learning algorithms applied to DNA microarray data have been shown to have statistical and clinical relevance for a variety of tumor types: leukemia (Golub et al., 1999), lymphoma (Shipp et al., 2001), brain cancer (Pomeroy et al., 2002), lung cancer (Bhattacharjee et al., 2001), and the classification of multiple primary tumors (Ramaswamy et al., 2001, 2002; Yeang et al., 2001). In this context, after having obtained initial or preliminary classification results for a given biological system, one is often left pondering the possibility of embarking on a larger and more systematic study using additional samples. This is usually the case when one tries to improve the accuracy of the original classifier or to provide a more rigorous statistical validation of the existing prediction results. As the process of obtaining additional biological samples is often expensive, involved, and time consuming,
1Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139.
2Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA 02115.
3McGovern Institute and CBCL, Massachusetts Institute of Technology, Cambridge, MA 02139.
4Department of Engineering Mathematics, University of Bristol, UK.
it is desirable to be able to estimate the performance of a classifier for yet unseen, larger dataset sizes. In this situation one has to address two sets of questions:
1. For a given number of samples, how significant is the performance of a classifier, i.e., are the results better than what one would expect by chance?
2. If we know the answers to (1) for a range of dataset sizes, can we predict the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?
These two questions arise in other classification tasks with high dimensional data and few samples, such as classifying functional MRI images of patients with neural dysfunction (Golland et al., 2002). In this paper, we develop a methodology for assessing the significance of a classifier's performance via a permutation test. We then fit an inverse power law model to construct a learning curve with error rates estimated from an existing dataset and use this learning curve to extrapolate error statistics for larger datasets. Power calculations (Adcock, 1997) are a standard approach to estimate the number of data samples required. However, these approaches do not address our dataset size estimation problem for two reasons. First, the assumptions that the underlying data comes from a Gaussian distribution and that variables are independent do not hold. Second, the question addressed by power calculations is: given a particular dataset size, how confident are we of our empirical error estimate? This is very different from asking how the error rate might decrease given more data.
A nontrivial classifier changes its structure as more training data become available, and therefore determining how the error rate might decrease becomes a problem of function extrapolation rather than convergence estimation. In this regard, it is important not to confuse this problem with the more standard problem of estimating the confidence of an error estimate as a function of training set size, i.e., estimating the variance in an observed quantity, the error estimate, as a function of the number of measurements. In general, this latter problem is addressed using power calculations or deviation bounds (Adcock, 1997; Guyon et al., 1998). These methods compute bounds or estimates of a given quantity's deviation from its expected value as a function of the number of observations, or in this case samples. Other methods study the variation produced by technical factors that can be addressed by experimental design or by replicating sample preparation or array hybridization (Cheng and Wong, 2001; Tseng et al., 2001; Kerr and Churchill, 2001a, b). There are also methods to model differential expression across experiments (Lee and Whitmore, 2002) that assess the effect of replication and sample size in increasing the statistical power of ANOVA models. In the context of our problem, these approaches can only help to find bounds on the deviation between the misclassification error rate and its expected value as a function of the number of measurements, i.e., the realizations of the classifier for a given fixed classification dataset size. These standard error estimation methods are therefore not particularly useful in estimating the future performance of a classifier as a function of increasing dataset size with yet unseen additional data. We test our methodology on eight data sets which represent a range of difficulty or complexity of classification. In some cases the distinction is quite dramatic, while in others it is more subtle. The examples are drawn from existing cancer classification data sets, where discriminating the morphology of a sample (five sets) represents the "easier" end of the range and predicting treatment outcome (three sets) lies at the other extreme. This hierarchy of difficulty will be reflected by the increase in the data set size requirements we estimate for these prediction problems. Our results give an indication of the minimal number of samples that are needed to obtain significant performance data and to extrapolate the improvement one might get by building the classifier on a larger dataset.
In the next section, we will give some background on general approaches addressing the problem of estimating classifier performance and learning rates. In Section 3, we describe our methodology in more detail. The results of applying our methodology to molecular classification problems are contained in Section 4. Section 5 summarizes the results of our tests. The proofs and technical details have been collected in the appendices.
2. BACKGROUND AND GENERAL APPROACH
The problem of estimating performance of a classifier for larger, yet unseen sets of examples is a difficult analytical problem. It amounts to developing a model to compute how fast a given classifier "learns," or improves its "fitting" to the data, as a function of dataset size.
In machine learning, a natural way to study classification accuracy as a function of training set size is by building empirical scaling models called learning curves (Cortes et al., 1994). Learning curves estimate the empirical error rate as a function of training set size for a given classifier and dataset. The advantage of this approach is that one avoids making assumptions about the distribution generating the dataset or the distribution of the classification errors. These learning curves are usually well characterized by inverse power-laws:
e_n = a n^(−α) + b    (1)
The variables are the expected error rate e_n given n training samples, the learning rate a, the decay rate α, and the Bayes error b, which is the minimum error rate achievable (Devroye et al., 1997; Duda et al., 2000). Notice that the values of the constants a, b, and α will change according to the classifier and dataset being studied. Based on this scaling model, as the size of the dataset increases, the misclassification error of a classifier will asymptotically approach b. This inverse-power-law "learning" behavior appears to be universal and is observed for many classifiers and types of datasets (Cortes et al., 1994; Shrager et al., 1988). It is in fact observed not only in machine learning but also in human and animal learning (Anderson et al., 1983). It is common to find empirical α values around or less than 1. Besides its empirical prevalence, the power-law model can be motivated analytically and in some cases derived within the statistical mechanics approach to learning. The basic idea behind this approach is to formulate the average error as a set of equations which are then solved via a statistical mechanics replica approach (Hertz et al., 1991) involving integration over the parameters of the classifier. This approach has been applied to various classification algorithms such as Support Vector Machines (Dietrich et al., 2000), Large Margin Perceptrons (Opper et al., 1995), and Adaline and other classifiers based upon Hebbian rules (Opper et al., 1990). The resulting analysis of the classification errors for all of the above algorithms results in inverse power laws of the form (1).
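The scaling behavior in Eq. (1) can be made concrete with a small sketch; the parameter values below are hypothetical, chosen only for illustration:

```python
# Evaluate the inverse power-law learning curve of Eq. (1):
# e_n = a * n**(-alpha) + b, where b is the Bayes (asymptotic) error.
def learning_curve(n, a, alpha, b):
    """Expected error rate for a classifier trained on n samples."""
    return a * n ** (-alpha) + b

# Hypothetical parameters a = 1.0, alpha = 0.5, b = 0.05:
# the error decays toward the Bayes error b as n grows.
for n in (10, 100, 1000):
    print(n, round(learning_curve(n, 1.0, 0.5, 0.05), 4))
```

Note how, with α around 0.5, an order-of-magnitude increase in n is needed to roughly halve the distance to the Bayes error.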
Using this power-law scaling model as a basis, one can use the empirical error rates of a classifier over a range of training set sizes drawn from a dataset to fit an inverse-power-law model and then use this model to extrapolate the error rate to larger datasets. In order to make this a practical approach, one also needs a statistical test for classifier significance as a function of training set size. The reason for this is that the inverse-power-law model usually breaks down for small training set sizes, where the model lacks enough data to give accurate predictions. In this case the error rates are large and not significant. If a given classifier's results are not significant, then it is better to exclude them when fitting the learning curve. To directly address this problem, we have included a permutation test for the significance of a classifier as part of our methodology. This test compares the performance of the actual classifier with the performance of random classifiers trained to predict data whose target labels are permuted (randomized). A classifier that is able to "find" structure in the data and produce significant results should outperform its random counterparts most of the time. By fixing a significance level (0.05), we can produce an effective test to eliminate classifiers that are not significant from the fitting of the learning curve. Since the classifier performance usually improves with increasing training set size, this significance test also allows us to find the minimum number of samples that produced significant classifiers.
3. METHODOLOGY FOR ESTIMATING ERROR RATES AS A FUNCTION OF DATASET SIZE
Given an arbitrary input dataset and classification algorithm, the methodology we will describe provides the following:

1) A measure of the statistical significance of the classifier built at each training set size. Using this, one can find the minimum training set size for which the classification performance of the classifiers is statistically significant.
2) An analytic expression (power law) of the error rate as a function of the increasing dataset size, as well as similar expressions for the 25th and 75th error rate quantiles. These provide a means to extrapolate the error bar "envelope" for the error rate for larger, yet unseen data sets.
As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.
We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile error of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and 25th and 75th error quantiles for a range of training set sizes. For a dataset of size ℓ, with fixed training set size n and test set size ℓ − n, T1 train/test realizations were constructed by subsampling the dataset. For each of these T1 realizations an error rate e_{n,i} is computed, and the average of the sequence {e_{n,1}, ..., e_{n,T1}},

ē_n = (1/T1) Σ_{i=1..T1} e_{n,i},

is used as an estimate of the true average error e_n. The average error rate is an unbiased estimator of the true error rate of a classifier trained with n samples,

E_{D_ℓ} [ (1/T1) Σ_{i=1..T1} e_{n,i} ] = E_{D_n} e_n,

where E_{D_n} e_n is the probability of classifying a new sample incorrectly when the classifier was trained with n samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T1}} were also fitted to an inverse power law so that we could also estimate the variation in error rates as a function of dataset size. The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T1}} are good approximations of the 25th and 75th quantiles of error rates of classifiers trained with n samples (see Appendix 1c for proof and details). We did not use the variance of the error rates e_{n,i} because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with n samples and tested on a new sample. Indeed, one can prove that the variance of this statistic is in general optimistic: the variance of the error rates e_{n,i} is less than the variance of classifiers trained with n samples and tested on a new sample (see Appendix 1b for proof and details).
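As a minimal sketch (not the authors' code), the summary statistics used for the fit can be computed as follows; the nearest-rank percentile convention is an assumption made for illustration:

```python
def summarize_errors(errors):
    """Mean, 25th, and 75th percentile of the T1 subsampled error rates e_{n,i}.

    The mean is the unbiased estimate of the true error rate; the quantiles,
    rather than the (biased, optimistic) variance, summarize the spread.
    """
    s = sorted(errors)
    t1 = len(s)
    mean = sum(s) / t1
    q25 = s[int(0.25 * (t1 - 1))]  # nearest-rank convention, for illustration
    q75 = s[int(0.75 * (t1 - 1))]
    return mean, q25, q75

# Five fictitious error rates from T1 = 5 train/test splits at one size n:
print(summarize_errors([0.30, 0.20, 0.25, 0.35, 0.40]))
```

The mean here is roughly 0.3, with 25th and 75th percentiles 0.25 and 0.35.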
As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.
FIG. 2. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
Fitting the parameters of the learning curve by minimizing

min_{α,a,b} Σ_{l=1..M} (a n_l^(−α) + b − e_{n_l})²,  subject to α, a, b ≥ 0,

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

min_{α,a} Σ_{l=1..M} (ln a − α ln n_l − ln(e_{n_l} − b))²,  subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by a line search, gives us our estimates of α, a, and b.
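The fixed-b linearization plus line search can be sketched as follows. This is an illustrative implementation of the strategy described above, not the authors' code; the grid of candidate b values is an assumption:

```python
import math

def fit_learning_curve(ns, errors, b_grid):
    """Fit e_n = a * n**(-alpha) + b by linearization.

    For each candidate b, ln(e_n - b) = ln(a) - alpha * ln(n) is an ordinary
    least-squares line in ln(n); a line search over b keeps the best fit.
    """
    best = None
    for b in b_grid:
        if any(e <= b for e in errors):
            continue  # logarithm undefined for this candidate b
        xs = [math.log(n) for n in ns]
        ys = [math.log(e - b) for e in errors]
        m = len(xs)
        xbar, ybar = sum(xs) / m, sum(ys) / m
        sxx = sum((x - xbar) ** 2 for x in xs)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        alpha, a = -slope, math.exp(ybar - slope * xbar)
        if alpha < 0 or a < 0:
            continue  # enforce the constraints alpha, a, b >= 0
        # Compare fits by the residual of the original (non-log) objective.
        resid = sum((a * n ** (-alpha) + b - e) ** 2 for n, e in zip(ns, errors))
        if best is None or resid < best[0]:
            best = (resid, a, alpha, b)
    return best[1], best[2], best[3]  # a, alpha, b

# Recover known parameters from noiseless synthetic error rates:
ns = [10, 20, 40, 80, 160]
errors = [2.0 * n ** -0.5 + 0.1 for n in ns]
a, alpha, b = fit_learning_curve(ns, errors, [i / 100 for i in range(20)])
print(round(a, 3), round(alpha, 3), round(b, 3))
```

On this noiseless example the procedure recovers a = 2, α = 0.5, b = 0.1; with real, noisy error rates the line search simply picks the b with the smallest residual.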
As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:
H0: p(y = 1 | x, {(x_1, y_1), ..., (x_n, y_n)}) = p(y = −1 | x, {(x_1, y_1), ..., (x_n, y_n)});
the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or −1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y = {1, −1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_{n,i} was computed, we construct T2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test set. This results in a set of error rates e_{n,i,j} for training set size n. From these error rates we construct an empirical distribution function for the random classifier:
P^ran_n(x) = (1/(T1 × T2)) Σ_{i=1..T1} Σ_{j=1..T2} θ(x − e_{n,i,j}),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P^ran_n(ē_n), which is the percentage of random classifiers with an error rate smaller than ē_n. The procedure is illustrated with the following two examples for a fictitious dataset.
Example 1. ē_15 = 0.37 (error rate of the classifier with 15 samples); e_{15,i,j} = {0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495} (error rates of the random classifiers with 15 samples); P^ran_15(ē_15) = 0.50. Since the p-value is greater than 0.05, the error rate of the classifier is not statistically significant (see Fig. 2a).
Example 2. ē_45 = 0.1 (error rate of the classifier with 45 samples); e_{45,i,j} = {0.205, 0.270, 0.333, 0.337, 0.370, 0.392, 0.399, 0.406, 0.425, 0.499} (error rates of the random classifiers with 45 samples); P^ran_45(ē_45) = 0. Since the p-value is less than 0.05, the error rate of the classifier is statistically significant (see Fig. 2b).
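The two worked examples can be reproduced directly from the definition of P^ran_n (a sketch; the random-classifier error-rate lists are the ones given above):

```python
def significance(e_bar, random_errors):
    """P^ran_n(e_bar): fraction of random-classifier error rates <= e_bar."""
    return sum(1 for e in random_errors if e_bar - e >= 0) / len(random_errors)

rand_15 = [0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495]
rand_45 = [0.205, 0.270, 0.333, 0.337, 0.370, 0.392, 0.399, 0.406, 0.425, 0.499]

print(significance(0.37, rand_15))  # 0.5: not significant at the 0.05 level
print(significance(0.10, rand_45))  # 0.0: significant
```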
The detailed description of the entire methodology for a two-class problem is as follows:

1) Subsampling and significance permutation test
   A. Subsampling procedure
      i. Given ℓ_c1 samples from class 1 and ℓ_c2 samples from class 2, the total number of samples is ℓ = ℓ_c1 + ℓ_c2, where ℓ ≥ 10.
      ii. Select 10 training set sizes n_1, ..., n_j, ..., n_10 over the interval [10, ℓ − 10].
FIG. 1. Dataset size estimation statistical methodology.
         1. For each training set size n_j, run the following subsampling procedure T1 = 50 times, indexed by i = 1, ..., T1:
            a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_c2/n_c1 ≈ ℓ_c2/ℓ_c1, where n_c1 and n_c2 are the numbers of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{n,i}.
            b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_{n,i}.
   B. Permutation test
      i. For each subsampled train/test split S_{n,i}, run the following permutation procedure T2 > 50 times, indexed by j = 1, ..., T2:
         1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S^ran_{n,i,j}.
         2. Train a classifier on the training set and measure its error on the test set; call this error rate e^ran_{n,i,j}.
   C. Significance calculation
      1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P^ran_n(x) = (1/(T1 × T2)) Σ_{i=1..T1} Σ_{j=1..T2} θ(x − e^ran_{n,i,j}), where θ(z) = 1 if z ≥ 0 and 0 otherwise.
      2. Given the above empirical distribution function, compute for each ē_n the value t_n = P^ran_n(ē_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.
2) Learning curves and training set size estimation
   A. Assume the subsampling procedure was run for M different sample sizes n, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T1) Σ_{i=1..T1} e_{n,i}, and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T1}}.
   B. Use the above quantities to fit the following learning curves:
      i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the minimization procedure min_{α,a,b} Σ_{l=1..M} (a n_l^(−α) + b − ē_{n_l})², subject to α, a, b ≥ 0; designate the resulting values of α, a, b as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

L_m(ē_n) = a_m n^(−α_m) + b_m.

      ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T1}}.
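Steps 1A through 1C can be sketched end to end as follows. This is an illustrative toy, not the authors' implementation: it uses synthetic 20-dimensional Gaussian data, a simple nearest-centroid classifier in place of the paper's classifiers, reduced T1 and T2 (the paper uses 50 and >50), and omits the stratified splitting of step 1A.1a for brevity:

```python
import random

rng = random.Random(0)
DIM = 20  # dimensions; the class signal lives only in the first coordinate

def make_data(n_per_class):
    """Two Gaussian classes separated along the first coordinate."""
    data = []
    for label, shift in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
            x[0] += shift
            data.append((x, label))
    return data

def centroid_error(train, test):
    """Error rate of a nearest-centroid classifier on the test split."""
    def mean(points):
        return [sum(p[d] for p in points) / len(points) for d in range(DIM)]
    cp = mean([x for x, y in train if y == 1])
    cn = mean([x for x, y in train if y == -1])
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    wrong = sum(1 for x, y in test
                if (1 if dist2(x, cp) < dist2(x, cn) else -1) != y)
    return wrong / len(test)

def subsample_and_test(data, n, t1=10, t2=10):
    """Steps 1A-1C: mean error e_bar(n) and permutation p-value t_n."""
    errors, rand_errors = [], []
    for _ in range(t1):
        pool = data[:]
        rng.shuffle(pool)
        train, test = pool[:n], pool[n:]
        if len({y for _, y in train}) < 2:
            continue  # need both classes represented
        errors.append(centroid_error(train, test))
        for _ in range(t2):
            labels = [y for _, y in train]
            rng.shuffle(labels)  # permute training labels only
            permuted = [(x, y) for (x, _), y in zip(train, labels)]
            rand_errors.append(centroid_error(permuted, test))
    e_bar = sum(errors) / len(errors)
    p_value = sum(1 for e in rand_errors if e <= e_bar) / len(rand_errors)
    return e_bar, p_value

e_bar, p = subsample_and_test(make_data(30), n=20)
print(round(e_bar, 3), p < 0.05)
```

With well-separated classes the actual classifier's mean error is low while the permuted-label classifiers hover near chance, so the p-value falls below 0.05; on unstructured data the two distributions overlap and significance is not reached.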
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS
The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of complexity of classification. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcome. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.
The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10–20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40–60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50–70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the data sets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.
Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

For each dataset, the three rows give the fitted 25th-quantile, mean, and 75th-quantile learning curves. The learning curve parameters are a (learning rate), the decay rate (the exponent −α), and b (the asymptotic Bayes error); the last column is the predicted error rate at 400 samples.

Multiple tissues (cancer vs. normal):
  25th quantile:  a = 1.89,  decay = −0.63,  b = 0.003;  predicted error at 400 samples = 0.047
  mean:           a = 1.42,  decay = −0.52,  b = 0.010;  predicted error at 400 samples = 0.073
  75th quantile:  a = 1.17,  decay = −0.43,  b = 0.000;  predicted error at 400 samples = 0.089

Leukemia (AML vs. ALL):
  25th quantile:  a = 3.98,  decay = −2.3,   b = 0.006;  predicted error at 400 samples = 0.006
  mean:           a = 0.771, decay = −0.65,  b = 0.009;  predicted error at 400 samples = 0.025
  75th quantile:  a = 0.884, decay = −0.6,   b = 0.010;  predicted error at 400 samples = 0.034

Colon cancer (cancer vs. normal):
  25th quantile:  a = 0.439, decay = −0.37,  b = 0.000;  predicted error at 400 samples = 0.048
  mean:           a = 0.480, decay = −0.28,  b = 0.000;  predicted error at 400 samples = 0.090
  75th quantile:  a = 0.500, decay = −0.23,  b = 0.000;  predicted error at 400 samples = 0.124

Ovarian cancer (cancer vs. normal):
  25th quantile:  a = 1.07,  decay = −1.76,  b = 0.000;  predicted error at 400 samples = 0.003
  mean:           a = 0.736, decay = −0.69,  b = 0.000;  predicted error at 400 samples = 0.012
  75th quantile:  a = 0.995, decay = −0.67,  b = 0.000;  predicted error at 400 samples = 0.018

Lymphoma (DLBCL vs. follicular):
  25th quantile:  a = 8.99,  decay = −2.48,  b = 0.000;  predicted error at 400 samples = 0.000
  mean:           a = 0.57,  decay = −0.70,  b = 0.001;  predicted error at 400 samples = 0.009
  75th quantile:  a = 0.671, decay = −0.64,  b = 0.000;  predicted error at 400 samples = 0.015

Brain cancer (treatment outcome):
  25th quantile:  a = 1.06,  decay = −0.37,  b = 0.004;  predicted error at 400 samples = 0.145
  mean:           a = 1.12,  decay = −0.33,  b = 0.006;  predicted error at 400 samples = 0.161
  75th quantile:  a = 0.980, decay = −0.24,  b = 0.000;  predicted error at 400 samples = 0.238

Lymphoma (treatment outcome):
  25th quantile:  a = 1.23,  decay = −0.41,  b = 0.008;  predicted error at 400 samples = 0.112
  mean:           a = 0.943, decay = −0.30,  b = 0.001;  predicted error at 400 samples = 0.17
  75th quantile:  a = 0.872, decay = −0.21,  b = 0.000;  predicted error at 400 samples = 0.243

Breast cancer (treatment outcome):
  25th quantile:  a = 0.532, decay = −0.14,  b = 0.001;  predicted error at 400 samples = 0.243
  mean:           a = 0.485, decay = −0.07,  b = 0.001;  predicted error at 400 samples = 0.323
  75th quantile:  a = 0.429, decay = −0.01,  b = 0.000;  predicted error at 400 samples = 0.404
Table 2. Summary of Results for Datasets Included in the Study

For each dataset the entries are: total dataset size; training set size at which statistical significance (p-value < 0.05) is achieved; whether the learning curve was fitted with significant data; the maximum training set size used in fitting; and the classification error, actual / learning-curve (extrapolated).

Multiple tissues (cancer vs. normal): 280; 15; yes; 210; error 0.07 / 0.086.
  Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10). Even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL): 73; 10; yes; 35; error 0.0 / 0.056.
  Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error to ~0.

Colon cancer (cancer vs. normal): 62; 10; yes; 50; error 0.163 / 0.152.
  Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16. Lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): 54; 10; yes; 40; error 0.056 / 0.048.
  Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error rate to ~0.

Lymphoma (DLBCL vs. follicular): 53; 5; yes; 40; error 0.04 / 0.047.
  Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035). Additional samples may lower the error rate to ~0.

Brain cancer (treatment outcome): 60; 45; no; 40; error 0.22 / 0.296.
  Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome): 58; 50; no; 40; error 0.23 / 0.295.
  Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome): 78; 65; no; 70; error 0.30 / 0.363.
  Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attain the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) using the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n estimated using training set sizes (up to 210) is
error(n) = 1.42 n^(−0.52) + 0.0098.
The error rates for the 25th and 75th quantiles are

error_25(n) = 1.89 n^(−0.63) + 0.0032,
error_75(n) = 1.17 n^(−0.43) + 0.00.
Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantiles envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about +/− 2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
error(400) = 1.42 · 400^(−0.52) + 0.0098 = 7.3%,
error_25(400) = 1.89 · 400^(−0.63) + 0.0032 = 4.7%,
error_75(400) = 1.17 · 400^(−0.43) + 0.00 = 8.9%.
The achievable error rate using 400 samples according to the model is 7.3% ± 2.6% and perhaps as low as 4.7% (25th quantile envelope).
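These extrapolations can be checked numerically; a short sketch using the fitted coefficients above (error rates are fractions, printed as percentages):

```python
def curve(n, a, alpha, b):
    """Evaluate the fitted inverse power law error(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

mean_400 = curve(400, 1.42, 0.52, 0.0098)   # mean learning curve
q25_400 = curve(400, 1.89, 0.63, 0.0032)    # 25th quantile curve
q75_400 = curve(400, 1.17, 0.43, 0.00)      # 75th quantile curve
print(round(100 * mean_400, 1), round(100 * q25_400, 1), round(100 * q75_400, 1))
# -> 7.3 4.7 8.9 (percent)
```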
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7706 n^(−0.65) + 0.009.

In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
ESTIMATING DATASET SIZE REQUIREMENTS 131
FIG. 6. Learning curves in the AML versus ALL classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.
A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4798 · n^{−0.2797}
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7362 · n^{−0.6864}
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.57 · n^{−0.7073} + 0.006
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
error(n) = 1.115 · n^{−0.3295} + 0.06
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.9431 · n^{−0.2957} + 0.1
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4852 · n^{−0.0733} + 0.1
FIG. 11. Learning curves in the lymphoma treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5. CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th- and 75th-quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
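The permutation-test component of the methodology can be sketched in a simplified form. The sketch below (not the authors' code) holds a fixed set of predictions and permutes the labels; the procedure used in this paper is stronger in that it retrains the classifier on each set of permuted labels, but the p-value computation is the same in outline:

```python
import random

# Simplified label-permutation significance test: how often do randomly
# permuted labels agree with the predictions at least as well as the
# true labels do?
def permutation_p_value(preds, labels, n_perm=1000, seed=0):
    rng = random.Random(seed)

    def error(ls):
        return sum(p != y for p, y in zip(preds, ls)) / len(ls)

    observed = error(labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if error(shuffled) <= observed:  # permuted labels do at least as well
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# 20 samples, 2 mistakes: a 10% error rate on a balanced two-class problem.
labels = [0] * 10 + [1] * 10
preds = list(labels)
preds[0], preds[10] = 1, 0  # two misclassifications
obs_err, p_value = permutation_p_value(preds, labels)
```

A 10% error rate on this balanced 20-sample problem is very unlikely under permuted labels, so the returned p-value is small; on harder problems, many permutations match or beat the observed error and the p-value is large.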
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L^p(z_1, …, z_ℓ) = (1/p) Σ_p Q(z^p, f_{z^{ℓ−p}}),

where z = (x, y), f_{z^{ℓ−p}} is the classifier constructed with the p samples left out, and Q(z^p, f_{z^{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

E L^p(z_1, …, z_ℓ) = E Q(z, f_{z^{ℓ−p}}),

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L^p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) by the following series of transformations:

E L^1(z_1, …, z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^{ℓ−1}}) dP(z_1) ⋯ dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} ∫ Q(z_i, f_{z^{ℓ−1}}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i, f_{z^{ℓ−1}})

= E Q(z, f_{z^{ℓ−1}}).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
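Statement 1 justifies estimating the mean (and, by Statement 3 below, the quantiles) of the error by random subsampling. A minimal sketch of that procedure, with an illustrative toy "classifier" (the function names are ours, not the paper's):

```python
import random

# Leave-p-out subsampling: repeatedly hold p samples out, train on the
# rest, and record the error on the held-out samples. The mean of these
# errors estimates the expected error of a classifier trained on (l - p)
# samples; their 25th/75th quantiles give the envelopes in the figures.
def leave_p_out_errors(data, p, train_and_test, n_subsamples=100, seed=0):
    rng = random.Random(seed)
    errors = []
    for _ in range(n_subsamples):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        held_out = [data[i] for i in idx[:p]]
        training = [data[i] for i in idx[p:]]
        errors.append(train_and_test(training, held_out))
    return sorted(errors)

# Toy illustration with a "classifier" that always predicts label 0,
# so its error on a held-out set is just the fraction of 1-labels there.
data = [0] * 8 + [1] * 2
always_zero_error = lambda train, test: sum(test) / len(test)
errs = leave_p_out_errors(data, p=2, train_and_test=always_zero_error)
mean_err = sum(errs) / len(errs)
q25, q75 = errs[len(errs) // 4], errs[3 * len(errs) // 4]
```

In the paper's setting, `train_and_test` would train the SVM on the retained samples and report its misclassification rate on the p held-out samples.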
1b. The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance of the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

V L^p(z_1, …, z_ℓ) ≤ V Q(z, f_{z^{ℓ−p}}),

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L^p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The variance for training sets of size ℓ − 1 is

V Q(z, f_{z^{ℓ−1}}) = E[Q(z, f_{z^{ℓ−1}})²] − [E Q(z, f_{z^{ℓ−1}})]².
The variance of the estimator is

V L^1(z_1, …, z_ℓ) = V((1/ℓ) Σ_{i=1}^{ℓ} t_i),

where t_i indicates whether or not an error is made when the i-th point is left out:

t_i = Q(z_i, f(x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ)).
We can rewrite this as

V((1/ℓ) Σ_{i=1}^{ℓ} t_i) = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E (1/ℓ) Σ_{i=1}^{ℓ} t_i]².
From Statement 1 we know that

E (1/ℓ) Σ_{i=1}^{ℓ} t_i = E Q(z, f_{z^{ℓ−1}}).
So

V((1/ℓ) Σ_{i=1}^{ℓ} t_i) = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E Q(z, f_{z^{ℓ−1}})]².
One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f_{z^{ℓ−1}})²].
One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].
If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²/ℓ + ((ℓ² − ℓ)/ℓ²) t²] = E[Q(z, f_{z^{ℓ−1}})]².
However, t_1, …, t_ℓ are not independent but correlated, so

E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ E[((ℓ² − ℓ)/ℓ²) t²].
The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
1c. Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The cumulative distribution function of the random variable t = Q(z, f_{z^{ℓ−1}}) is P_{D^{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = L^1(z_1, …, z_ℓ) is P_{D^ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < ξ) = ∫_{−∞}^{ξ} p(t) dt = ∫_{−∞}^{+∞} θ(ξ − t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

P(r < ξ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{+∞} θ(ξ − r_i) p(r_i) dr_i,
which is the same as P(t < ξ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.
However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from this sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{√ℓ sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case ℓ ≤ 20, instead of the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
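The truncated Kolmogorov-Smirnov series above is easy to evaluate numerically; a small sketch (the function name is ours):

```python
import math

# Numeric evaluation of the approximation quoted above:
#   P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps }
#     ~ 1 - 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * eps**2),
# truncating the series after a fixed number of terms (it converges fast).
def ks_confidence(eps, terms=100):
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
            for k in range(1, terms + 1))
    return 1 - 2 * s

# eps ~ 1.36 is the familiar 95% point of the Kolmogorov distribution:
# with probability ~0.95 the empirical CDF built from l error rates stays
# within 1.36 / sqrt(l) of the true CDF, bounding the quantile estimates.
conf_95 = ks_confidence(1.36)
```

Inverting this relation for a desired confidence 1 − δ gives the bound on the deviation between empirical and true quantiles referred to above.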
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a. A motivation from the statistical mechanics approach to learning
In this appendix, we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
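The teacher-student setting of this appendix can also be simulated directly. A Monte-Carlo sketch using the Hebb rule (chosen here because the student has a closed form; the analytic results quoted above come from replica calculations, not from simulation, and all names below are ours):

```python
import math
import random

# Teacher-student perceptron with the Hebb rule: w = sum_i y_i * z_i.
# For Gaussian inputs, the generalization error is the angle between the
# student w and the teacher t divided by pi; for the Hebb rule it is
# known to decay roughly like n**-0.5 at fixed input dimension m.
def hebb_generalization_error(n, m, n_trials=50, seed=0):
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        t = [rng.gauss(0.0, 1.0) for _ in range(m)]  # teacher weights
        w = [0.0] * m                                # student weights
        for _ in range(n):
            z = [rng.gauss(0.0, 1.0) for _ in range(m)]
            y = 1.0 if sum(ti * zi for ti, zi in zip(t, z)) > 0 else -1.0
            w = [wi + y * zi for wi, zi in zip(w, z)]  # Hebb update
        cos = sum(wi * ti for wi, ti in zip(w, t)) / (
            math.sqrt(sum(wi * wi for wi in w))
            * math.sqrt(sum(ti * ti for ti in t)))
        errors.append(math.acos(max(-1.0, min(1.0, cos))) / math.pi)
    return sum(errors) / len(errors)

err_small = hebb_generalization_error(n=100, m=20)
err_large = hebb_generalization_error(n=400, m=20)
```

If the error decays like n^{−1/2}, quadrupling the number of training examples should roughly halve the average generalization error, which the simulation bears out.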
2b. A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{−α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.
For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are ℓ_p with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
it is desirable to be able to estimate the performance of a classifier for yet unseen, larger dataset sizes. In this situation, one has to address two sets of questions:
1. For a given number of samples, how significant is the performance of a classifier, i.e., are the results better than what one would expect by chance?
2. If we know the answers to (1) for a range of dataset sizes, can we predict the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?
These two questions arise in other classification tasks with high-dimensional data and few samples, such as classifying functional MRI images of patients with neural dysfunction (Golland et al., 2002). In this paper, we develop a methodology for assessing the significance of a classifier's performance via a permutation test. We then fit an inverse power-law model to construct a learning curve with error rates estimated from an existing dataset and use this learning curve to extrapolate error statistics for larger datasets. Power calculations (Adcock, 1997) are a standard approach to estimating the number of data samples required. However, these approaches do not address our dataset size estimation problem, for two reasons. First, the assumptions that the underlying data come from a Gaussian distribution and that variables are independent do not hold. Second, the question addressed by power calculations is: given a particular dataset size, how confident are we of our empirical error estimate? This is very different from asking how the error rate might decrease given more data.
A nontrivial classifier changes its structure as more training data become available, and therefore determining how the error rate might decrease becomes a problem of function extrapolation rather than convergence estimation. In this regard, it is important not to confuse this problem with the more standard problem of estimating the confidence of an error estimate as a function of training set size, i.e., estimating the variance of an observed quantity (the error estimate) as a function of the number of measurements. In general, this latter problem is addressed using power calculations or deviation bounds (Adcock, 1997; Guyon et al., 1998). These methods compute bounds or estimates of a given quantity's deviation from its expected value as a function of the number of observations, or in this case, samples. Other methods study the variation produced by technical factors that can be addressed by experimental design or by replicating sample preparation or array hybridization (Cheng and Wong, 2001; Tseng et al., 2001; Kerr and Churchill, 2001a, b). There are also methods to model differential expression across experiments (Lee and Whitmore, 2002) that assess the effect of replication and sample size in increasing the statistical power of ANOVA models. In the context of our problem, these approaches can only help to find bounds on the deviation between the misclassification error rate and its expected value as a function of the number of measurements, i.e., the realizations of the classifier for a given fixed classification dataset size. These standard error estimation methods are therefore not particularly useful in estimating the future performance of a classifier as a function of increasing dataset size, with yet unseen additional data. We test our methodology on eight datasets which represent a range of difficulty, or complexity, of classification. In some cases the distinction is quite dramatic, while in others it is more subtle. The examples are drawn from existing cancer classification datasets, where discriminating the morphology of a sample (five sets) represents the "easier" end of the range and predicting treatment outcome (three sets) lies at the other extreme. This hierarchy of difficulty will be reflected in the increase in the dataset size requirements we estimate for these prediction problems. Our results give an indication of the minimal number of samples needed to obtain significant performance data and to extrapolate the improvement one might get by building the classifier on a larger dataset.
In the next section, we give some background on general approaches to the problem of estimating classifier performance and learning rates. In Section 3, we describe our methodology in more detail. The results of applying our methodology to molecular classification problems are contained in Section 4. Section 5 summarizes the results of our tests. The proofs and technical details have been collected in the appendices.
2. BACKGROUND AND GENERAL APPROACH
The problem of estimating the performance of a classifier for larger, yet unseen sets of examples is a difficult analytical problem. It amounts to developing a model to compute how fast a given classifier "learns," or improves its "fitting" to the data, as a function of dataset size.
ESTIMATING DATASET SIZE REQUIREMENTS 121
In machine learning, a natural way to study classification accuracy as a function of training set size is by building empirical scaling models called learning curves (Cortes et al., 1994). Learning curves estimate the empirical error rate as a function of training set size for a given classifier and dataset. The advantage of this approach is that one avoids making assumptions about the distribution generating the dataset or the distribution of the classification errors. These learning curves are usually well characterized by inverse power laws:

    e_n = a n^(−α) + b.    (1)

The variables are the expected error rate e_n given n training samples, the learning rate a, the decay rate α, and the Bayes error b, which is the minimum error rate achievable (Devroye et al., 1997; Duda et al., 2000). Notice that the values of the constants a, b, and α will change according to the classifier and dataset being studied. Based on this scaling model, as the size of the dataset increases, the misclassification error of a classifier will asymptotically approach b. This inverse-power-law "learning" behavior appears to be universal and is observed for many classifiers and types of datasets (Cortes et al., 1994; Shrager et al., 1988). It is in fact observed not only in machine learning but also in human and animal learning (Anderson et al., 1983). It is common to find empirical α values around or less than 1. Besides its empirical prevalence, the power-law model can be motivated analytically and in some cases derived within the statistical mechanics approach to learning. The basic idea behind this approach is to formulate the average error as a set of equations which are then solved via a statistical mechanics replica approach (Hertz et al., 1991) involving integration over the parameters of the classifier. This approach has been applied to various classification algorithms, such as support vector machines (Dietrich et al., 2000), large margin perceptrons (Opper et al., 1995), and Adaline and other classifiers based upon Hebbian rules (Opper et al., 1990). The resulting analysis of the classification errors for all of the above algorithms results in inverse power laws of the form (1).
Using this power-law scaling model as a basis, one can use the empirical error rates of a classifier over a range of training set sizes drawn from a dataset to fit an inverse-power-law model, and then use this model to extrapolate the error rate to larger datasets. To make this a practical approach, one also needs a statistical test for classifier significance as a function of training set size. The reason for this is that the inverse-power-law model usually breaks down for small training set sizes, where the model lacks enough data to give accurate predictions. In this case the error rates are large and not significant. If a given classifier's results are not significant, then it is better to exclude them when fitting the learning curve. To directly address this problem, we have included a permutation test for the significance of a classifier as part of our methodology. This test compares the performance of the actual classifier with the performance of random classifiers trained to predict data whose target labels are permuted (randomized). A classifier that is able to "find" structure in the data and produce significant results should outperform its random counterparts most of the time. By fixing a significance level (0.05), we can produce an effective test to eliminate classifiers that are not significant from the fitting of the learning curve. Since classifier performance usually improves with increasing training set size, this significance test also allows us to find the minimum number of samples that produces significant classifiers.
3. METHODOLOGY FOR ESTIMATING ERROR RATES AS A FUNCTION OF DATASET SIZE
Given an arbitrary input dataset and classification algorithm, the methodology we will describe provides the following:

1) A measure of the statistical significance of the classifier built at each training set size. Using this, one can find the minimum training set size for which the classification performance of the classifiers is statistically significant.

2) An analytic expression (power law) for the error rate as a function of increasing dataset size, as well as similar expressions for the 25th and 75th error rate quantiles. These provide a means to extrapolate the error bar "envelope" for the error rate for larger, yet unseen datasets.
As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.
We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile errors of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and the 25th and 75th error quantiles for a range of training set sizes. For a dataset of size ℓ, with fixed training set size n and test set size ℓ − n, T1 train/test realizations were constructed by subsampling the dataset. For each of these T1 realizations an error rate e_{n,i} is computed, and the average of the sequence {e_{n,1}, ..., e_{n,T1}},

    ē_n = (1/T1) Σ_{i=1}^{T1} e_{n,i},

is used as an estimate of the true average error e_n. The average error rate is an unbiased estimator of the true error rate of a classifier trained with n samples:

    E_{D_ℓ} [ (1/T1) Σ_{i=1}^{T1} e_{n,i} ] = E_{D_n} [e_n],

where E_{D_n}[e_n] is the probability of classifying a new sample incorrectly when the classifier was trained with n samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence {e_{n,1}, ..., e_{n,T1}} were also fitted to an inverse power law, so that we could also estimate the variation in error rates as a function of dataset size. These percentiles are good approximations of the 25th and 75th quantiles of the error rates of classifiers trained with n samples (see Appendix 1c for proof and details). We did not use the variance of the error rates e_{n,i}, because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with n samples and tested on a new sample. Indeed, one can prove that this statistic is in general optimistic: the variance of the error rates e_{n,i} is less than the variance of the error of classifiers trained with n samples and tested on a new sample (see Appendix 1b for proof and details).
As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.
FIG. 1. Dataset size estimation statistical methodology.
Fitting the parameters of the learning curve by minimizing

    min_{α,a,b} Σ_{l=1}^{M} ( a n_l^(−α) + b − ē_{n_l} )²,   subject to α, a, b ≥ 0,

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

    min_{α,a} Σ_{l=1}^{M} ( ln a − α ln n_l − ln(ē_{n_l} − b) )²,   subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by a line search, gives us our estimates of α, a, and b.
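A minimal sketch of this fit (an illustration of the linearization above, not the authors' implementation): for each candidate b on a grid, the log-space problem is solved by ordinary least squares, and the b giving the smallest squared error on the original scale is kept. The nonnegativity constraints are not enforced explicitly here, since they hold automatically for decreasing error curves.

```python
import math

def fit_learning_curve(ns, errors, b_grid=None):
    """Fit e(n) = a * n**(-alpha) + b by a line search over b combined with
    an ordinary-least-squares fit of ln(a) and -alpha in log space."""
    if b_grid is None:
        # b must stay strictly below every observed error so the log is defined
        b_grid = [i * min(errors) / 50.0 for i in range(50)]
    best = None
    for b in b_grid:
        xs = [math.log(n) for n in ns]
        ys = [math.log(e - b) for e in errors]
        m = len(xs)
        xbar, ybar = sum(xs) / m, sum(ys) / m
        slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
                 / sum((x - xbar) ** 2 for x in xs))      # equals -alpha
        a, alpha = math.exp(ybar - slope * xbar), -slope  # intercept is ln(a)
        # score the candidate on the original (not log) scale
        sse = sum((a * n ** (-alpha) + b - e) ** 2 for n, e in zip(ns, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]  # a, alpha, b
```

On synthetic error rates generated from a known inverse power law, the routine recovers the generating parameters up to the grid resolution on b.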
As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:
    H0 : p(y = 1 | x, {(x_1, y_1), ..., (x_n, y_n)}) = p(y = −1 | x, {(x_1, y_1), ..., (x_n, y_n)});
the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or −1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y ∈ {1, −1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_{n,i} was computed, we construct T2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test sets. This results in a set of error rates e_{n,i,j} for training set size n. From these error rates we construct an empirical distribution function for the random classifier,

    P_n^ran(x) = (1/(T1 × T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e_{n,i,j}),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P_n^ran(e_n), which is the percentage of random classifiers with an error rate smaller than e_n. The procedure is illustrated with the following two examples for a fictitious dataset.
Example 1. e_15 = 37% (error rate of the classifier with 15 samples); e_{15,i,j} = {21.5, 26.0, 29.0, 32.0, 36.6, 38.8, 39.5, 40.8, 42.0, 49.5}% (error rates of the random classifiers with 15 samples); P_15^ran(e_15) = 50%. Since the p-value is greater than 0.05, the error rate of the classifier is not statistically significant (see Fig. 2a).
Example 2. e_45 = 1% (error rate of the classifier with 45 samples); e_{45,i,j} = {20.5, 27.0, 33.3, 33.7, 37.0, 39.2, 39.9, 40.6, 42.5, 49.9}% (error rates of the random classifiers with 45 samples); P_45^ran(e_45) = 0%. Since the p-value is less than 0.05, the error rate of the classifier is statistically significant (see Fig. 2b).
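The significance value in these examples is just the empirical distribution function evaluated at the observed error rate. A small sketch using the numbers above, with error rates written as fractions:

```python
def significance(e_n, random_errors):
    # P_n^ran(e_n): fraction of label-permuted ("random") classifiers whose
    # error rate is at most the observed error e_n (theta(x - e) = 1 iff e <= x)
    return sum(1 for e in random_errors if e <= e_n) / len(random_errors)

# Example 1: 15 samples, observed error 37%
rand15 = [0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495]
p15 = significance(0.37, rand15)  # 5 of 10 random classifiers do as well -> 0.5

# Example 2: 45 samples, observed error 1%
rand45 = [0.205, 0.270, 0.333, 0.337, 0.370, 0.392, 0.399, 0.406, 0.425, 0.499]
p45 = significance(0.01, rand45)  # no random classifier does as well -> 0.0
```

With the 0.05 threshold, the first classifier is rejected as not significant and the second is accepted, exactly as in the worked examples.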
The detailed description of the entire methodology for a two-class problem is as follows.
1) Subsampling and significance permutation test

A. Subsampling procedure
   i. Given ℓ_c1 samples from class 1 and ℓ_c2 samples from class 2, the total number of samples is ℓ = ℓ_c1 + ℓ_c2, where ℓ_c1, ℓ_c2 ≥ 10.
   ii. Select 10 training set sizes n_1, ..., n_j, ..., n_10 over the interval [10, ℓ − 10].
FIG. 2. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
   1. For each training set size n_j, run the following subsampling procedure T1 = 50 times, indexed by i = 1, ..., T1:
      a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_c2/n_c1 ≈ ℓ_c2/ℓ_c1, where n_c1 and n_c2 are the numbers of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{n,i}.
      b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_{n,i}.

B. Permutation test
   i. For each subsampled train/test split S_{n,i}, run the following permutation procedure T2 > 50 times, indexed by j = 1, ..., T2:
      1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S^ran_{n,i,j}.
      2. Train a classifier on the training set and measure its error on the test set; call this error rate e^ran_{n,i,j}.

C. Significance calculation
   1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P_n^ran(x) = (1/(T1 × T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e^ran_{n,i,j}), where θ(z) = 1 if z ≥ 0 and 0 otherwise.
   2. Given the above empirical distribution function, compute for each ē_n the value t_n = P_n^ran(ē_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.
2) Learning curves and training set size estimation

A. Assume the subsampling procedure was run for M different sample sizes n, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T1) Σ_{i=1}^{T1} e_{n,i} and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T1}}.

B. Use the above quantities to fit the following learning curves:
   i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the minimization min_{α,a,b} Σ_{l=1}^{M} (a n_l^(−α) + b − ē_{n_l})², subject to α, a, b ≥ 0; designate the resulting values as α_m, a_m, b_m. The resulting curve estimates the mean error rate as a function of training set size:

       L_m(n) = a_m n^(−α_m) + b_m.

   ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T1}}.
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS
The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcomes. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets, we present only the final results.
The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training samples (e.g., between 10 and 20) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40 and 60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50 and 70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify the dataset) or the problem is very hard and substantial numbers of samples are needed before one sees statistically significant results. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results are presented in the next section.
Tumor versus normal tissue
This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissues of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998)
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

The model is e(n) = a n^(−α) + b, with learning rate a, decay rate α, and asymptotic (Bayes) error b, fitted separately to the mean error and to the 25th and 75th error quantiles. Predicted error rates at 400 training samples are given in percent.

Dataset / problem            Curve            a        α       b        Predicted error at
                                                                        400 samples (%)
Multiple tissues,            25th quantile    1.89     0.63    0.003     4.7
cancer vs. normal            mean             1.42     0.52    0.010     7.3
                             75th quantile    1.17     0.43    0.000     8.9
Leukemia,                    25th quantile   39.8      2.3     0.006     0.6
AML vs. ALL                  mean             0.771    0.65    0.009     2.5
                             75th quantile    0.884    0.60    0.010     3.4
Colon cancer,                25th quantile    0.439    0.37    0.000     4.8
cancer vs. normal            mean             0.480    0.28    0.000     9.0
                             75th quantile    0.500    0.23    0.000    12.4
Ovarian cancer,              25th quantile    1.07     1.76    0.000     0.3
cancer vs. normal            mean             0.736    0.69    0.000     1.2
                             75th quantile    0.995    0.67    0.000     1.8
Lymphoma,                    25th quantile    8.99     2.48    0.000     0.0
DLBCL vs. follicular         mean             0.57     0.70    0.001     0.9
                             75th quantile    0.671    0.64    0.000     1.5
Brain cancer,                25th quantile    1.06     0.37    0.004    14.5
treatment outcome            mean             1.12     0.33    0.006    16.1
                             75th quantile    0.980    0.24    0.000    23.8
Lymphoma,                    25th quantile    1.23     0.41    0.008    11.2
treatment outcome            mean             0.943    0.30    0.010    17.2
                             75th quantile    0.872    0.21    0.000    24.3
Breast cancer,               25th quantile    0.532    0.14    0.010    24.3
treatment outcome            mean             0.485    0.07    0.010    32.3
                             75th quantile    0.429    0.01    0.000    40.4
Table 2. Summary of Results for Datasets Included in the Study

Classification errors are given in percent: "actual" is the leave-one-out error on the full dataset, "extrapolated" is the learning-curve estimate at that size, and the parenthesized number is their absolute difference.

Dataset / problem                   Dataset   Size to achieve     Learning curve   Maximum training   Classification error (%):
                                    size      statistical         fitted with      set size used      actual / extrapolated (diff.)
                                              significance        significant      in fitting
                                              (p-value < 0.05)    data?
Multiple tissues,                   280       15                  yes              210                 6.0 / 8.6 (2.6)
cancer vs. normal
Leukemia, AML vs. ALL               73        10                  yes              35                  0.0 / 5.6 (5.6)
Colon cancer, cancer vs. normal     62        10                  yes              50                 16.3 / 15.2 (1.1)
Ovarian cancer, cancer vs. normal   54        10                  yes              40                  5.6 / 4.8 (0.8)
Lymphoma, DLBCL vs. follicular      53        5                   yes              40                  4.0 / 4.7 (0.7)
Brain cancer, treatment outcome     60        45                  no               40                 22.0 / 29.6 (7.6)
Lymphoma, treatment outcome         58        50                  no               40                 23.0 / 29.5 (6.5)
Breast cancer, treatment outcome    78        65                  no               70                 30.0 / 36.3 (6.3)

Conclusions:
Multiple tissues: Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10). Even lower error rates (<0.07) are achievable with >400 samples.
Leukemia: Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error to ~0.
Colon cancer: Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16. Lower error rates (<0.09) are achievable with >400 samples.
Ovarian cancer: Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error rate to ~0.
Lymphoma (morphology): Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035). Additional samples may lower the error rate to ~0.
Brain cancer: Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.
Lymphoma (treatment outcome): Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.
Breast cancer: Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, and 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to (a) all training sample sizes stated above (up to 210) and (b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is
    error(n) = 1.42 n^(−0.52) + 0.0098.

The error rates for the 25th and 75th quantiles are

    error_25(n) = 1.89 n^(−0.63) + 0.0032,
    error_75(n) = 1.17 n^(−0.43) + 0.000.
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.

FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th–75th quantile envelope is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

    error(400) = 1.42 · 400^(−0.52) + 0.0098 ≈ 7.3%,
    error_25(400) = 1.89 · 400^(−0.63) + 0.0032 ≈ 4.7%,
    error_75(400) = 1.17 · 400^(−0.43) + 0.000 ≈ 8.9%.

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
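As a quick check of this arithmetic, the fitted curves can be evaluated directly; the coefficients below are the fitted values quoted above:

```python
def learning_curve(n, a, alpha, b):
    # fitted inverse power law e(n) = a * n**(-alpha) + b
    return a * n ** (-alpha) + b

e_mean = learning_curve(400, 1.42, 0.52, 0.0098)  # ~0.073, i.e., 7.3%
e_q25 = learning_curve(400, 1.89, 0.63, 0.0032)   # ~0.047, i.e., 4.7%
e_q75 = learning_curve(400, 1.17, 0.43, 0.0)      # ~0.089, i.e., 8.9%
```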
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, and 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.7706 n^(−0.65) + 0.009.

In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2,000. The data were preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, and 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.4798 n^(−0.2797).
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data were preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, and 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.7362 n^(−0.6864).
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset was 18,432. The data were preprocessed by taking the base-2 logarithm of all input values. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, and 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 0.57 n^(−0.7073) + 0.0006.
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients with successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a support vector machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, and 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

    error(n) = 1.115 n^(−0.3295) + 0.006.
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
134 MUKHERJEE ET AL
FIG. 10. Learning curves in the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.9431\, n^{-0.2957} + 0.1.$$
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.4852\, n^{-0.0733} + 0.1.$$
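All of the fitted curves above share the inverse power-law form error(n) = a n^(−α) + b, so extrapolating to a larger dataset amounts to evaluating the fitted curve at a larger n. A minimal sketch (the parameter values below are illustrative stand-ins resembling an outcome-prediction fit, not the exact fits reported above):

```python
def learning_curve_error(n, a, alpha, b):
    """Inverse power-law learning curve: error(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Illustrative parameters: slow decay (small alpha) and a high
# asymptotic Bayes error b, as is typical for outcome prediction.
a, alpha, b = 0.94, 0.30, 0.10
for n in (20, 40, 80, 160):
    print(n, round(learning_curve_error(n, a, alpha, b), 3))
```

Because the curve approaches the Bayes error b from above, the fitted b also serves as an estimated ceiling on achievable accuracy for the problem.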
FIG 11 Learning curves in the lymphoma treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates
FIG 12 Learning curves in the breast cancer treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates
5. CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics to larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{p} \sum_{(p)} Q(z^{(p)}; f_{z^{\ell-p}}),$$

where z = (x, y), f_{z^{ℓ−p}} is the classifier constructed with the p samples left out, and Q(z^{(p)}; f_{z^{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

$$E\, L_p(z_1, \ldots, z_\ell) = E\, Q(z; f_{z^{\ell-p}}),$$

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969) by the following series of transformations:

$$E\, L_1(z_1, \ldots, z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_1) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\, Q(z_i; f_{z^{\ell-1}}) = E\, Q(z; f_{z^{\ell-1}}).$$

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
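The leave-p-out estimator L_p of Statement 1 can be computed directly by averaging the held-out error over all C(ℓ, p) ways of excluding p samples. A small sketch with a toy midpoint-threshold classifier (the data, the classifier, and the helper names are made up for illustration; the paper's experiments use an SVM):

```python
from itertools import combinations

def leave_p_out_error(data, p, train_fn):
    """L_p estimate: average held-out error over all C(len(data), p)
    ways of excluding p samples, as in Statement 1."""
    n = len(data)
    total, splits = 0.0, 0
    for held in combinations(range(n), p):
        held_set = set(held)
        train = [data[i] for i in range(n) if i not in held_set]
        f = train_fn(train)
        total += sum(f(data[i][0]) != data[i][1] for i in held) / p
        splits += 1
    return total / splits

def train_fn(train):
    """Toy classifier: threshold at the midpoint of the two class means."""
    neg = [x for x, y in train if y < 0] or [-1.0]
    pos = [x for x, y in train if y > 0] or [1.0]
    cut = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2
    return lambda x: -1 if x < cut else 1

data = [(-1.3, -1), (-0.9, -1), (-0.2, -1), (0.1, 1), (0.8, 1), (1.4, 1)]
print(leave_p_out_error(data, 1, train_fn))  # p = 1: leave-one-out error
print(leave_p_out_error(data, 2, train_fn))
```

In practice the full enumeration is only feasible for small ℓ or small p; the subsampling procedure of Section 3 plays the same role by Monte Carlo.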
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

$$V\, L_p(z_1, \ldots, z_\ell) \le V\, Q(z; f_{z^{\ell-p}}),$$

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

$$V\, Q(z; f_{z^{\ell-1}}) = E\left[ Q(z; f_{z^{\ell-1}})^2 \right] - \left[ E\, Q(z; f_{z^{\ell-1}}) \right]^2.$$

The variance of the estimator is

$$V\left[\frac{1}{\ell} L_1(z_1, \ldots, z_\ell)\right] = V\left[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right],$$

where t_i indicates whether or not an error is made when the ith point is left out:

$$t_i = Q(z_i; f_{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell}).$$

We can rewrite this as

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 - \left(E\,\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2.$$

From Statement 1 we know

$$E\,\frac{1}{\ell}\sum_{i=1}^{\ell} t_i = E\, Q(z; f_{z^{\ell-1}}).$$

So

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 - \left[E\, Q(z; f_{z^{\ell-1}})\right]^2.$$

One can show that

$$E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 \le E\left[ Q(z; f_{z^{\ell-1}})^2 \right].$$

One can write

$$E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 = E\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \neq j} t_i t_j\right).$$

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

$$E\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \neq j} t_i t_j\right) = E\left(\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\right) = \left[E\, Q(z; f_{z^{\ell-1}})\right]^2.$$

However, t_1, …, t_ℓ are not independent but correlated, so

$$E\left(\sum_{i \neq j} t_i t_j\right) \le E\left(\frac{\ell^2 - \ell}{\ell^2}\, t^2\right).$$

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
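Statements 1 and 2 are easy to observe numerically: the leave-one-out indicators t_i are positively correlated, so the variance of their average falls well below the variance of a single held-out test error, while the means agree. A Monte Carlo sketch under assumed synthetic Gaussian data, with a nearest-centroid stand-in classifier (all names and counts here are illustrative):

```python
import random
random.seed(1)

def draw(n):
    """Synthetic data: y in {-1, +1}, x ~ Gaussian centered at y."""
    out = []
    for _ in range(n):
        y = random.choice((-1, 1))
        out.append((random.gauss(y, 1.0), y))
    return out

def fit(train):
    """Nearest-centroid stand-in classifier."""
    neg = [x for x, y in train if y < 0] or [-1.0]
    pos = [x for x, y in train if y > 0] or [1.0]
    m_neg, m_pos = sum(neg) / len(neg), sum(pos) / len(pos)
    return lambda x: -1 if abs(x - m_neg) < abs(x - m_pos) else 1

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

ell, trials = 10, 2000

# Leave-one-out estimator L_1 on fresh datasets of size ell.
loo_vals = []
for _ in range(trials):
    data = draw(ell)
    errs = sum(fit(data[:i] + data[i + 1:])(data[i][0]) != data[i][1]
               for i in range(ell))
    loo_vals.append(errs / ell)

# Single-sample test error Q of a classifier trained on ell - 1 samples.
q_vals = []
for _ in range(trials):
    f = fit(draw(ell - 1))
    x, y = draw(1)[0]
    q_vals.append(float(f(x) != y))

# Means agree (Statement 1); the L_1 variance is optimistic (Statement 2).
print(sum(loo_vals) / trials, sum(q_vals) / trials)
print(var(loo_vals), var(q_vals))
```

This is exactly why the paper fits the 25th/75th quantiles rather than the variance of the subsampled error rates.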
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f_{z^{ℓ−1}}) is P_{D^{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = (1/ℓ) L_1(z_1, …, z_ℓ) is P_{D^ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \xi) = \int_{-\infty}^{\xi} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\xi - t)\, p(t)\, dt.$$

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$P(r < \xi) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\xi - r_i)\, p(r_i)\, dr_i,$$

which is the same as P(t < ξ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \epsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\epsilon^2},$$

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
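The Kolmogorov-Smirnov bound above can be turned into a concrete half-width for the quantile estimates. A sketch that evaluates the asymptotic KS series and inverts it numerically (the series cutoff and bisection tolerance are arbitrary choices):

```python
import math

def ks_tail(eps, terms=100):
    """Asymptotic Kolmogorov-Smirnov tail probability
    P{sqrt(l) * sup_x |F(x) - F_l(x)| > eps} = 2 * sum (-1)^(k-1) exp(-2 k^2 eps^2)."""
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
                   for k in range(1, terms + 1))

def ks_eps(delta):
    """Smallest eps whose asymptotic tail probability is <= delta (bisection)."""
    lo, hi = 1e-6, 5.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if ks_tail(mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

# With probability 1 - 0.05, the empirical CDF of l subsampled error
# rates is uniformly within eps / sqrt(l) of the true one.
eps = ks_eps(0.05)
l = 50  # e.g., 50 subsampled error rates
print(eps, eps / math.sqrt(l))
```

For example, with ℓ = 50 error rates the empirical distribution is within about 1.36/√50 ≈ 0.19 of the true one in sup-norm at the 95% level, which is why fairly many subsamples are needed for tight quantile envelopes.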
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning
In this appendix, we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993) which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{z}),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta\left[ -(\mathbf{t} \cdot \mathbf{z}_i)(\mathbf{w} \cdot \mathbf{z}_i) \right],$$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
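The teacher-student setup above is simple to simulate. The sketch below trains a Hebb-rule student (w = Σ y_i z_i) against a random teacher perceptron and estimates the generalization error at several n; the dimension m, trial counts, and seed are arbitrary choices, and the measured decay should roughly follow the n^{−1/2} scaling quoted above:

```python
import math, random
random.seed(2)

m = 50  # number of attributes (input dimension)

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gen_error(teacher, student, tests=400):
    """Fraction of fresh random patterns where student and teacher disagree."""
    errs = 0
    for _ in range(tests):
        z = rand_vec()
        errs += (dot(teacher, z) * dot(student, z) <= 0)
    return errs / tests

def hebb_error(n, trials=20):
    """Average generalization error of a Hebb-rule student trained on
    n teacher-labeled random patterns."""
    total = 0.0
    for _ in range(trials):
        t = rand_vec()
        w = [0.0] * m
        for _ in range(n):
            z = rand_vec()
            y = 1 if dot(t, z) > 0 else -1   # teacher's label
            for j in range(m):
                w[j] += y * z[j]             # Hebb update: w += y * z
        total += gen_error(t, w)
    return total / trials

results = {n: hebb_error(n) for n in (25, 100, 400)}
print(results)  # error falls with n, roughly like n**-0.5 at large n
```

Swapping the Hebb update for a margin-maximizing rule would trace out the faster n^{−1} decay of the optimal perceptron, at the cost of a more involved training loop.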
2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between them.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1, the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
In machine learning, a natural way to study classification accuracy as a function of training set size is by building empirical scaling models called learning curves (Cortes et al., 1994). Learning curves estimate the empirical error rate as a function of training set size for a given classifier and dataset. The advantage of this approach is that one avoids making assumptions about the distribution generating the dataset or the distribution of the classification errors. These learning curves are usually well characterized by inverse power laws:

$$e(n) = a\, n^{-\alpha} + b \qquad (1)$$

The variables are the expected error rate e(n) given n training samples, the learning rate a, the decay rate α, and the Bayes error b, which is the minimum error rate achievable (Devroye et al., 1997; Duda et al., 2000). Notice that the values of the constants a, b, and α will change according to the classifier and dataset being studied. Based on this scaling model, as the size of the dataset increases, the misclassification error of a classifier will asymptotically approach b. This inverse power-law "learning" behavior appears to be universal and is observed for many classifiers and types of datasets (Cortes et al., 1994; Shrager et al., 1988). It is in fact observed not only in machine learning but also in human and animal learning (Anderson et al., 1983). It is common to find empirical α values around or less than 1. Besides its empirical prevalence, the power-law model can be motivated analytically and, in some cases, derived within the statistical mechanics approach to learning. The basic idea behind this approach is to formulate the average error as a set of equations which are then solved via a statistical mechanics replica approach (Hertz et al., 1991) involving integration over the parameters of the classifier. This approach has been applied to various classification algorithms such as Support Vector Machines (Dietrich et al., 2000), Large Margin Perceptrons (Opper et al., 1995), and Adaline and other classifiers based upon Hebbian rules (Opper et al., 1990). The resulting analysis of the classification errors for all of the above algorithms results in inverse power laws of the form (1).

Using this power-law scaling model as a basis, one can use the empirical error rates of a classifier over a range of training set sizes drawn from a dataset to fit an inverse power-law model, and then use this model to extrapolate the error rate to larger datasets. In order to make this a practical approach, one also needs a statistical test for classifier significance as a function of training set size. The reason for this is that the inverse power-law model usually breaks down for small training set sizes, where the model lacks enough data to give accurate predictions. In this case the error rates are large and not significant. If a given classifier's results are not significant, then it is better to exclude them when fitting the learning curve. To directly address this problem, we have included a permutation test for the significance of a classifier as part of our methodology. This test compares the performance of the actual classifier with the performance of random classifiers trained to predict data whose target labels are permuted (randomized). A classifier that is able to "find" structure in the data and produce significant results should outperform its random counterparts most of the time. By fixing a significance level (0.05), we can produce an effective test to eliminate classifiers that are not significant from the fitting of the learning curve. Since classifier performance usually improves with increasing training set size, this significance test also allows us to find the minimum number of samples that produced significant classifiers.
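The permutation test just described can be sketched as follows: estimate the error of the actual classifier, re-estimate it many times after randomly permuting the labels, and take the p-value to be the fraction of label-permuted runs that do at least as well. Everything below (the synthetic one-dimensional data, the midpoint-threshold classifier standing in for the SVM, and the split/permutation counts) is illustrative:

```python
import random
random.seed(3)

def mean_test_error(xs, ys, n_train, splits=50):
    """Average test error of a midpoint-threshold classifier (a stand-in
    for the SVM) over random train/test partitions."""
    idx = list(range(len(xs)))
    total = 0.0
    for _ in range(splits):
        random.shuffle(idx)
        tr, te = idx[:n_train], idx[n_train:]
        neg = [xs[i] for i in tr if ys[i] < 0] or [-1.0]
        pos = [xs[i] for i in tr if ys[i] > 0] or [1.0]
        cut = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2
        total += sum((xs[i] > cut) != (ys[i] > 0) for i in te) / len(te)
    return total / splits

# Synthetic dataset with genuine class structure.
ys = [random.choice((-1, 1)) for _ in range(30)]
xs = [random.gauss(y, 1.0) for y in ys]
observed = mean_test_error(xs, ys, n_train=15)

# Null distribution: retrain after randomly permuting the labels.
null_errors = []
for _ in range(100):
    ys_perm = ys[:]
    random.shuffle(ys_perm)
    null_errors.append(mean_test_error(xs, ys_perm, n_train=15))

# p-value: fraction of random classifiers doing at least as well.
p_value = sum(e <= observed for e in null_errors) / len(null_errors)
print(observed, p_value)
```

Repeating this at each training set size n, and keeping only the sizes with p-value below 0.05, yields the significant error rates that enter the learning-curve fit.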
3. METHODOLOGY FOR ESTIMATING ERROR RATES AS A FUNCTION OF DATASET SIZE

Given an arbitrary input dataset and classification algorithm, the methodology we will describe provides the following:

1) A measure of the statistical significance of the classifier built at each training set size. Using this, one can find the minimum training set size for which the classification performance of the classifiers is statistically significant.

2) An analytic expression (power law) for the error rate as a function of increasing dataset size, as well as similar expressions for the 25th and 75th error rate quantiles. These provide a means to extrapolate the error bar "envelope" for the error rate to larger, yet unseen, datasets.
As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.
We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile error of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and 25th and 75th error quantiles for a range of training set sizes. For a dataset of size ℓ, with fixed training set size n and test set size ℓ − n, T_1 train/test realizations were constructed by subsampling the dataset. For each of these T_1 realizations an error rate e_n^i is computed, and the average of the sequence {e_n^1, ..., e_n^{T_1}}, ē_n = (1/T_1) Σ_{i=1}^{T_1} e_n^i, is used as an estimate of the true average error e_n. The average error rate is an unbiased estimator of the true error rate of a classifier trained with n samples:

   E_{D_ℓ} [ (1/T_1) Σ_{i=1}^{T_1} e_n^i ] = E_{D_n}[e_n],

where E_{D_n}[e_n] is the probability of classifying a new sample incorrectly when the classifier was trained with n samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence {e_n^1, ..., e_n^{T_1}} were also fitted to an inverse power law, so that we could also estimate the variation in error rates as a function of dataset size. The 25th and 75th percentiles of the sequence {e_n^1, ..., e_n^{T_1}} are good approximations of the 25th and 75th quantiles of error rates of classifiers trained with n samples (see Appendix 1c for proof and details). We did not use the variance of the error rates e_n^i, because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with n samples and tested on a new sample. Indeed, one can prove that the variance of this statistic is in general optimistic: the variance of the error rates e_n^i is less than the variance of classifiers trained with n samples and tested on a new sample (see Appendix 1b for proof and details).
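The subsampling step just described can be sketched in a few lines. The following illustrative Python is ours, not the paper's code: a toy nearest-class-mean classifier on one-dimensional data stands in for the actual learning algorithm, and the class-stratified splitting used in the paper is omitted for brevity.

```python
import random
import statistics

def centroid_classifier(data, labels, train, test):
    """Toy stand-in for the paper's SVM: nearest-class-mean classifier on 1-D data."""
    means = {}
    for c in (0, 1):
        pts = [data[i] for i in train if labels[i] == c]
        means[c] = sum(pts) / len(pts)
    wrong = sum(1 for i in test
                if (abs(data[i] - means[1]) < abs(data[i] - means[0])) != (labels[i] == 1))
    return wrong / len(test)

def subsample_error_stats(data, labels, n, T1, error_fn, rng):
    """For a fixed training set size n, draw T1 random train/test splits and
    return the mean error and the 25th/75th quantiles of the T1 error rates."""
    errors = []
    idx = list(range(len(data)))
    for _ in range(T1):
        rng.shuffle(idx)
        errors.append(error_fn(data, labels, idx[:n], idx[n:]))
    errors.sort()
    return (statistics.mean(errors),
            errors[int(0.25 * (T1 - 1))],   # 25th quantile of the T1 error rates
            errors[int(0.75 * (T1 - 1))])   # 75th quantile of the T1 error rates
```

The three returned statistics are exactly the quantities to which the inverse power laws are fitted below.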
As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.
FIG. 1. Dataset size estimation statistical methodology.
ESTIMATING DATASET SIZE REQUIREMENTS 123
Fitting the parameters of the learning curve by minimizing

   min_{α,a,b} Σ_{l=1}^{M} (a n_l^{-α} + b - ē_{n_l})², subject to α, a, b ≥ 0,

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

   min_{α,a} Σ_{l=1}^{M} (ln a - α ln n_l - ln(ē_{n_l} - b))², subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by a line search, gives us our estimate of α, a, and b.
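A minimal sketch of this fit (ours, not the paper's implementation): for each fixed b the model is linear in (ln a, α) after taking logarithms, so ordinary least squares on (ln n, ln(err − b)) recovers the slope and intercept; a coarse grid over b stands in for the line search described above.

```python
import math

def fit_learning_curve(ns, errs, b_grid=None):
    """Fit err(n) ~ a * n**(-alpha) + b by least squares.

    For each candidate b, solve OLS on (ln n, ln(err - b)); keep the b with
    the smallest squared error on the original scale.  a = exp(intercept) is
    positive by construction; alpha is positive whenever the errors decrease."""
    if b_grid is None:
        b_grid = [i * min(errs) / 50.0 for i in range(50)]  # 0 <= b < min(errs)
    xs = [math.log(n) for n in ns]
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for b in b_grid:
        ys = [math.log(e - b) for e in errs]
        my = sum(ys) / len(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        slope = sxy / sxx               # = -alpha
        a = math.exp(my - slope * mx)   # intercept = ln a
        alpha = -slope
        sse = sum((a * n ** (-alpha) + b - e) ** 2 for n, e in zip(ns, errs))
        if best is None or sse < best[0]:
            best = (sse, alpha, a, b)
    return best[1], best[2], best[3]
```

On noise-free data generated from a known power law, the routine recovers the parameters up to the resolution of the b grid.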
As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

   H_0: p(y = 1 | x, {(x_1, y_1), ..., (x_n, y_n)}) = p(y = -1 | x, {(x_1, y_1), ..., (x_n, y_n)});

the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or -1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y ∈ {1, -1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_n^i was computed, we construct T_2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test set. This results in a set of error rates e_n^{i,j} for training set size n. From these error rates we construct an empirical distribution function for the random classifier,

   P_n^{ran}(x) = (1/(T_1 × T_2)) Σ_{i=1}^{T_1} Σ_{j=1}^{T_2} θ(x - e_n^{i,j}),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P_n^{ran}(ē_n), which is the percentage of random classifiers with error rate smaller than ē_n. The procedure is illustrated with the following two examples for a fictitious dataset.
Example 1. ē_15 = 37% (error rate of the classifier with 15 samples); e_15^{i,j} = {21.5%, 26.0%, 29.0%, 32.0%, 36.6%, 38.8%, 39.5%, 40.8%, 42.0%, 49.5%} (error rates of the random classifiers with 15 samples); P_15^{ran}(ē_15) = 50%. Since the p-value is greater than .05, the error rate of the classifier is not statistically significant (see Fig. 2a).

Example 2. ē_45 = 1% (error rate of the classifier with 45 samples); e_45^{i,j} = {20.5%, 27.0%, 33.3%, 33.7%, 37.0%, 39.2%, 39.9%, 40.6%, 42.5%, 49.9%} (error rates of the random classifiers with 45 samples); P_45^{ran}(ē_45) = 0%. Since the p-value is less than .05, the error rate of the classifier is statistically significant (see Fig. 2b).
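The two fictitious examples can be checked directly from the definition of P_n^ran above. This short sketch is ours (the helper name is not from the paper); it returns the significance as a fraction rather than a percentage.

```python
def significance(e_bar, random_errors):
    """P_n^ran evaluated at the observed average error: the fraction of
    label-permuted classifiers whose error is at most e_bar
    (theta(x - e) = 1 when x - e >= 0)."""
    return sum(1 for e in random_errors if e_bar - e >= 0) / len(random_errors)

# Example 1: 15 samples, observed mean error 37%.
rand15 = [0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495]
p15 = significance(0.37, rand15)   # 5 of the 10 random error rates are <= 0.37

# Example 2: 45 samples, observed mean error 1%.
rand45 = [0.205, 0.270, 0.333, 0.337, 0.370, 0.392, 0.399, 0.406, 0.425, 0.499]
p45 = significance(0.01, rand45)   # none of the random error rates is <= 0.01
```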
The detailed description of the entire methodology for a two-class problem is as follows:
1) Subsampling and significance permutation test
   A. Subsampling procedure
      i. Given ℓ_c1 samples from class 1 and ℓ_c2 samples from class 2, the total number of samples is ℓ = ℓ_c1 + ℓ_c2, where ℓ ≥ 10.
      ii. Select 10 training set sizes n_1, ..., n_j, ..., n_10 over the interval [10, ℓ - 10].
FIG. 2. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the classifier with randomization of labels.
      1. For each training set size n_j, run the following subsampling procedure T_1 = 50 times, indexed by i = 1, ..., T_1:
         a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ - n_j samples, subject to the requirement that n_c2/n_c1 ≈ ℓ_c2/ℓ_c1, where n_c1 and n_c2 are the number of samples from class 1 and class 2 in the training set. Call the two datasets generated S_n^i.
         b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_n^i.
   B. Permutation test
      i. For each subsampled train/test split S_n^i, run the following permutation procedure T_2 > 50 times, indexed by j = 1, ..., T_2:
         1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S_n^{ran,i,j}.
         2. Train a classifier on the training set and measure its error on the test set; call this error rate e_n^{ran,i,j}.
   C. Significance calculation
      1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P_n^{ran}(x) = (1/(T_1 × T_2)) Σ_{i=1}^{T_1} Σ_{j=1}^{T_2} θ(x - e_n^{ran,i,j}), where θ(z) = 1 if z ≥ 0 and 0 otherwise.
      2. Given the above empirical distribution function, compute for each ē_n the value t_n = P_n^{ran}(ē_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.
2) Learning curves and training set size estimation
   A. Assume the subsampling procedure was run for M different sample sizes n, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T_1) Σ_{i=1}^{T_1} e_n^i, and the 25th and 75th quantiles of the vector of error rates {e_n^1, ..., e_n^{T_1}}.
   B. Use the above quantities to fit the following learning curves:
      i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the following minimization procedure: min_{α,a,b} Σ_{l=1}^{M} (a n_l^{-α} + b - ē_{n_l})², subject to α, a, b ≥ 0; designate the values of α, a, b as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

         L_m(n) = a_m n^{-α_m} + b_m.

      ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_n^1, ..., e_n^{T_1}}.
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS
The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of complexity of classification. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcome. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets, we will present only the final results.
The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10–20 samples), and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40–60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50–70). For these more difficult problems, we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case, it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the data sets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.
Tumor versus normal tissue
This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Dataset (problem) | Quantile | Learning rate a | Decay rate α | Asymptotic Bayes error rate b | Predicted error rate at 400 samples (%)
Multiple tissues (cancer vs. normal) | 25th quantile | 1.89 | -0.63 | 0.003 | 4.7
Multiple tissues (cancer vs. normal) | mean | 1.42 | -0.52 | 0.010 | 7.3
Multiple tissues (cancer vs. normal) | 75th quantile | 1.17 | -0.43 | 0.000 | 8.9
Leukemia (AML vs. ALL) | 25th quantile | 3.98 | -2.3 | 0.006 | 0.6
Leukemia (AML vs. ALL) | mean | 0.771 | -0.65 | 0.009 | 2.5
Leukemia (AML vs. ALL) | 75th quantile | 0.884 | -0.60 | 0.010 | 3.4
Colon cancer (cancer vs. normal) | 25th quantile | 0.439 | -0.37 | 0.000 | 4.8
Colon cancer (cancer vs. normal) | mean | 0.480 | -0.28 | 0.000 | 9.0
Colon cancer (cancer vs. normal) | 75th quantile | 0.500 | -0.23 | 0.000 | 12.4
Ovarian cancer (cancer vs. normal) | 25th quantile | 107 | -1.76 | 0.000 | 0.3
Ovarian cancer (cancer vs. normal) | mean | 0.736 | -0.69 | 0.000 | 1.2
Ovarian cancer (cancer vs. normal) | 75th quantile | 0.995 | -0.67 | 0.000 | 1.8
Lymphoma (DLBC vs. follicular) | 25th quantile | 8.99 | -2.48 | 0.000 | 0.0
Lymphoma (DLBC vs. follicular) | mean | 0.57 | -0.70 | 0.001 | 0.9
Lymphoma (DLBC vs. follicular) | 75th quantile | 0.671 | -0.64 | 0.000 | 1.5
Brain cancer (treatment outcome) | 25th quantile | 1.06 | -0.37 | 0.004 | 14.5
Brain cancer (treatment outcome) | mean | 1.12 | -0.33 | 0.006 | 16.1
Brain cancer (treatment outcome) | 75th quantile | 0.980 | -0.24 | 0.000 | 23.8
Lymphoma (treatment outcome) | 25th quantile | 1.23 | -0.41 | 0.008 | 11.2
Lymphoma (treatment outcome) | mean | 0.943 | -0.30 | 0.010 | 17.2
Lymphoma (treatment outcome) | 75th quantile | 0.872 | -0.21 | 0.000 | 24.3
Breast cancer (treatment outcome) | 25th quantile | 0.532 | -0.14 | 0.010 | 24.3
Breast cancer (treatment outcome) | mean | 0.485 | -0.07 | 0.010 | 32.3
Breast cancer (treatment outcome) | 75th quantile | 0.429 | -0.01 | 0.000 | 40.4
Table 2. Summary of Results for Datasets Included in the Study

Dataset (problem) | Dataset size | Size to achieve statistical significance (p-value < 0.05) | Learning curve fitted with significant data | Maximum training set size used in fitting | Classification error, actual (extrapolated), % | Learning curve error, % | Conclusions
Multiple tissues (cancer vs. normal) | 280 | 15 | yes | 210 | 7.0 (8.6) | 2.6 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10). Even lower error rates (<0.07) are achievable with >400 samples.
Leukemia (AML vs. ALL) | 73 | 10 | yes | 35 | 0.0 (5.6) | 5.6 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error to ~0.
Colon cancer (cancer vs. normal) | 62 | 10 | yes | 50 | 16.3 (15.2) | 1.1 | Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16. Lower error rates (<0.09) are achievable with >400 samples.
Ovarian cancer (cancer vs. normal) | 54 | 10 | yes | 40 | 5.6 (4.8) | 0.8 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error rate to ~0.
Lymphoma (DLBC vs. follicular) | 36 | 5 | yes | 35 | 4.0 (4.7) | 0.7 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035). Additional samples may lower the error rate to ~0.
Brain cancer (treatment outcome) | 60 | 45 | no | 40 | 22.0 (29.6) | 7.6 | Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.
Lymphoma (treatment outcome) | 58 | 50 | no | 40 | 23.0 (29.5) | 6.5 | Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.
Breast cancer (treatment outcome) | 78 | 65 | no | 70 | 30.0 (36.3) | 6.3 | Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rates, and therefore a p-value of 5% is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) using the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n estimated using training set sizes (up to 210) is
   error(n) = 1.42 n^(-0.52) + 0.0098.
The error rates for the 25th and 75th quantiles are
   error_25(n) = 1.89 n^(-0.63) + 0.0032,
   error_75(n) = 1.17 n^(-0.43) + 0.00.
Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantiles envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about +/- 2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

   error(400) = 1.42 × 400^(-0.52) + 0.0098 = 7.3%,
   error_25(400) = 1.89 × 400^(-0.63) + 0.0032 = 4.7%,
   error_75(400) = 1.17 × 400^(-0.43) + 0.00 = 8.9%.

The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
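The extrapolation above is simply an evaluation of the fitted power law. A quick check (ours; parameter values taken from the fitted curves, error rates as fractions):

```python
def power_law(n, a, alpha, b):
    """Learning-curve model: err(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Parameters fitted above for the tumor vs. normal dataset.
mean_400 = power_law(400, 1.42, 0.52, 0.0098)   # ~0.073, i.e., 7.3%
q25_400 = power_law(400, 1.89, 0.63, 0.0032)    # ~0.047, i.e., 4.7%
q75_400 = power_law(400, 1.17, 0.43, 0.0)       # ~0.089, i.e., 8.9%
```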
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 5% is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
   error(n) = 0.7706 n^(-0.65) + 0.009.
In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.
A p-value of 5% is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

   error(n) = 0.4798 n^(-0.2797).
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 5% is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

   error(n) = 0.7362 n^(-0.6864).
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 5% is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

   error(n) = 0.57 n^(-0.7073) + 0.0006.
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 5%) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

   error(n) = 1.115 n^(-0.3295) + 0.006.
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 5%) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

   error(n) = 0.9431 n^(-0.2957) + 0.01.
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 5%) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

   error(n) = 0.4852 n^(-0.0733) + 0.01.
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5. CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimations of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 samples range and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1: BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

   L_p(z_1, ..., z_ℓ) = (1/p) Σ_p Q(z^p, f_{z^{ℓ-p}}),

where z = (x, y), f_{z^{ℓ-p}} is the classifier constructed with p samples left out, and Q(z^p, f_{z^{ℓ-p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

   E L_p(z_1, ..., z_ℓ) = E Q(z, f_{z^{ℓ-p}}),

which means that the expected error when the classifier is trained with ℓ - p samples is the same as the expected error of procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969), by the following series of transformations:

   E L_1(z_1, ..., z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^{ℓ-1}}) dP(z_1) ... dP(z_ℓ)
     = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^{ℓ-1}}) dP(z_i) dP(z_1) ... dP(z_{i-1}) dP(z_{i+1}) ... dP(z_ℓ)
     = (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i, f_{z^{ℓ-1}})
     = E Q(z, f_{z^{ℓ-1}}).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ - p samples as the number of subsamples increases.
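A quick Monte Carlo illustration of Statement 1 (ours, not from the paper): with a toy one-dimensional threshold classifier standing in for a generic learning algorithm, the average of the leave-one-out indicators should match, in expectation, the error of a classifier trained on ℓ - 1 samples and tested on a fresh point.

```python
import random

def threshold(pts):
    """Toy classifier: threshold halfway between the two class means."""
    neg = [x for x, y in pts if y == 0]
    pos = [x for x, y in pts if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2.0

def wrong(thr, x, y):
    return int((x > thr) != (y == 1))

rng = random.Random(1)

def dataset(k=10):
    """Balanced 1-D two-class sample, so every leave-one-out split keeps both classes."""
    return ([(rng.gauss(-2.0, 1.0), 0) for _ in range(k)] +
            [(rng.gauss(2.0, 1.0), 1) for _ in range(k)])

trials = 400
loo_mean = fresh_mean = 0.0
for _ in range(trials):
    pts = dataset()
    # leave-one-out estimate L_1 on this dataset (average of the indicators t_i)
    loo_mean += sum(wrong(threshold(pts[:i] + pts[i + 1:]), *pts[i])
                    for i in range(len(pts))) / len(pts)
    # error of a classifier trained on ell - 1 samples, tested on one fresh point
    y = rng.randrange(2)
    x = rng.gauss(2.0 if y else -2.0, 1.0)
    fresh_mean += wrong(threshold(pts[:-1]), x, y)
loo_mean /= trials
fresh_mean /= trials
```

Both averages converge to the same expected error as the number of trials grows, as the statement asserts.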
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with $\ell - p$ samples and tested on an independent sample, so the variance of the leave-p-out
138 MUKHERJEE ET AL
procedure is not necessarily unbiased and in general will be optimistic:

$$V[L_p(z_1, \ldots, z_\ell)] \le V[Q(z; f(z_{\ell-p}))],$$

which means that the expected variance when the classifier is trained with $\ell - p$ samples is greater than or equal to the variance of the procedure $L_p$.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size $\ell - 1$ is

$$V[Q(z; f(z_{\ell-1}))] = E\big[Q(z; f(z_{\ell-1}))^2\big] - \big[E\, Q(z; f(z_{\ell-1}))\big]^2.$$
The variance of the estimator is

$$V\left[\frac{1}{\ell} L_1(z_1, \ldots, z_\ell)\right] = V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right],$$

where $t_i$ indicates whether or not an error is made when the ith point is left out:

$$t_i = Q(z_i; f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell)).$$
We can rewrite this as

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 - \left(E\,\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2.$$
From Statement 1 we know

$$E\,\frac{1}{\ell}\sum_{i=1}^{\ell} t_i = E\, Q(z; f(z_{\ell-1})).$$
So

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 - \big[E\, Q(z; f(z_{\ell-1}))\big]^2.$$
One can show that

$$E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 \le E\big[Q(z; f(z_{\ell-1}))^2\big].$$

One can write

$$E\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2 = E\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \ne j} t_i t_j\right).$$

If the random variables $t_i, t_j$ were identical and independent, then the above equation could be rewritten as

$$E\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \ne j} t_i t_j\right) = E\left(\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\right) = \big[E\, Q(z; f(z_{\ell-1}))\big]^2.$$
ESTIMATING DATASET SIZE REQUIREMENTS 139
However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\left(\frac{1}{\ell^2}\sum_{i \ne j} t_i t_j\right) \le E\left(\frac{\ell^2 - \ell}{\ell^2}\, t^2\right).$$
The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.
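The optimism of the variance can also be seen empirically. In this sketch (again using an illustrative 1-D task and 1-nearest-neighbour classifier, not the paper's setup), the variance of the leave-one-out estimate across datasets is compared with the variance of a single-sample test outcome $Q(z; f(z_{\ell-1}))$, which for a 0/1 loss is Bernoulli.

```python
import random
from statistics import pvariance

random.seed(1)

def draw(n):
    # synthetic 1-D task: y = 1 if x > 0.5, with 10% label noise
    pts = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:
            y = 1 - y
        pts.append((x, y))
    return pts

def predict_1nn(train, x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

ell, reps = 20, 300
loo_vals, single_vals = [], []
for _ in range(reps):
    data = draw(ell)
    # leave-one-out estimate L_1(z_1, ..., z_ell)
    errs = sum(predict_1nn(data[:i] + data[i + 1:], x) != y
               for i, (x, y) in enumerate(data))
    loo_vals.append(errs / ell)
    # single Bernoulli outcome Q(z; f(z_{ell-1})) on one fresh test point
    xt, yt = draw(1)[0]
    single_vals.append(float(predict_1nn(data[:-1], xt) != yt))

print(pvariance(loo_vals) < pvariance(single_vals))  # True: L_1 varies less
```

The leave-one-out estimate averages $\ell$ correlated 0/1 outcomes, so its variance is well below the $e(1-e)$ variance of a single test outcome, in line with Statement 2.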
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z; f(z_{\ell-1}))$ is $P_{D_{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_1(z_1, \ldots, z_\ell)$ is $P_{D_\ell}\{r < \tau\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\tau - t)\, p(t)\, dt.$$

The distribution function for the random variable $r = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\, p(r_i)\, dr_i,$$
which is the same as $P(t < \tau)$. We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from this sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us
$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\varepsilon^2},$$
where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \le 20$, instead of using the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the subsampling procedure proposed gives us increasingly accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
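The Kolmogorov–Smirnov series above is easy to evaluate numerically; truncating it after a few terms gives the confidence that the empirical CDF of the subsampled error rates stays within $\varepsilon/\sqrt{\ell}$ of the truth. The function name and truncation depth below are our own choices.

```python
import math

def ks_confidence(eps, terms=100):
    """P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps }: Kolmogorov's limiting series,
    truncated after `terms` terms (it converges extremely fast)."""
    return 1 - 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
                       for k in range(1, terms + 1))

# e.g. with T1 = 50 subsampled error rates, the sup-deviation of the empirical
# CDF is below eps / sqrt(50) with confidence ks_confidence(eps)
print(round(ks_confidence(1.36), 3))
```

The value 1.36 is the familiar critical point of the Kolmogorov distribution, giving roughly 95% confidence.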
140 MUKHERJEE ET AL
APPENDIX 2: MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning
In this appendix, we describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where $z$ is an input vector, $w$ is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is $t$ and the weight vector for the student perceptron is $w$. Then the number of generalization errors made by the student perceptron on a set of p new samples will be
$$\sum_{i=1}^{p} \theta\big[-(t \cdot z_i)(w \cdot z_i)\big],$$

where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991), involving integrations in the weight space $w$. The generalization error is given as a function of the ratio $\beta = n/m$ (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for $\beta$. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low $\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low $\beta$ limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high $\beta$ limit, where the dependence of the generalization error on $\beta$ (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as $0.50\, n^{-1}$ (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, the center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as $0.44\, n^{-1}$ (Opper and Haussler, 1991). For the Adaline learning rule the error scales as $0.24\, n^{-1/2}$, and for the Hebb rule as $0.40\, n^{-1/2}$ (see Watkin et al. [1993] for a review). The dependence on n is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights $w$). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = a n^{-\alpha} + b$ for modeling the generalization error.
2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class $f \in H$, and the optimal classification function $f_T$ is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between them.

For the function classes used in most algorithms, the worst-case analysis yields $\alpha = 0.5$. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function $V(\cdot, \cdot)$ is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all $x \in X$ and $y \in Y$. By a smooth loss function, we mean loss functions that are $\ell_p$ with $p \ge 1$ or Lipschitz over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs $f(x) = \pm 1$ the loss function is $\ell_0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the $\ell_0$ loss is not used); for example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
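The Lipschitz distinction drawn above can be made concrete with a two-line numerical check. This is an illustration of the general point, not of any specific algorithm in the paper: the 0-1 loss jumps by a full unit across the decision boundary, while the SVM-style hinge loss changes by no more than the change in the classifier output.

```python
# 0-1 classification loss theta(-y f(x)) jumps at the decision boundary;
# the hinge loss max(0, 1 - y f(x)) is 1-Lipschitz in f(x)
def zero_one(f, y):
    return 1.0 if y * f < 0 else 0.0

def hinge(f, y):
    return max(0.0, 1.0 - y * f)

f1, f2, y = 0.001, -0.001, 1      # two nearby classifier outputs straddling 0
d01 = abs(zero_one(f1, y) - zero_one(f2, y))   # 1.0: a full jump
dh = abs(hinge(f1, y) - hinge(f2, y))          # 0.002, i.e. |f1 - f2|
print(d01, dh)
```

However close the two outputs are, the 0-1 losses differ by 1, so no Lipschitz constant exists; the hinge-loss difference is bounded by $|f_1 - f_2|$.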
ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Gyorfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
As described in Section 2, a significance test is needed to know at which training set size error rates are reliable enough to accurately extrapolate the error rate as a function of dataset size. The 25th and 75th quantiles are used to compute the analog of error bars for the estimated error rates as a function of dataset size. Figure 1 shows a pictorial summary of the method. The procedure can be broken down into two main computational tasks: the first involving random subsampling (train/test) and a significance permutation test to evaluate the classifiers, and the second consisting of fitting learning curves to the error rates that passed the significance test.
We first describe the rationale behind fitting the learning curves. We want to fit the inverse power curves to the true average error and the true 25th and 75th quantile error of a classifier trained with various training set sizes. Our first step is to estimate the true average error rate and 25th and 75th error quantiles for a range of training set sizes. For a dataset of size $\ell$, with fixed training set size n and test set size $\ell - n$, $T_1$ train/test realizations were constructed by subsampling the dataset. For each of these $T_1$ realizations an error rate $e_n^i$ is computed, and the average of the sequence $\{e_n^1, \ldots, e_n^{T_1}\}$, $\bar{e}_n = \frac{1}{T_1}\sum_{i=1}^{T_1} e_n^i$, is used as an estimate of the true average error $e_n$. The average error rate is an unbiased estimator of the true error rate of a classifier trained with n samples,

$$E_{D_\ell}\left[\frac{1}{T_1}\sum_{i=1}^{T_1} e_n^i\right] = E_{D_n}[e_n],$$

where $E_{D_n}[e_n]$ is the probability of classifying a new sample incorrectly when the classifier was trained with n samples (see Appendix 1a for proof and details). The 25th and 75th percentiles of the sequence $\{e_n^1, \ldots, e_n^{T_1}\}$ were also fitted to an inverse power law so that we could also estimate the variation in error rates as a function of dataset size. The 25th and 75th percentiles of the sequence $\{e_n^1, \ldots, e_n^{T_1}\}$ are good approximations of the 25th and 75th quantiles of error rates of classifiers trained with n samples (see Appendix 1c for proof and details). We did not use the variance of the error rates $e_n^i$ because this statistic is not an unbiased estimator of the variance of the error rate of classifiers trained with n samples and tested on a new sample. Indeed, one can prove that the variance of this statistic is in general optimistic: the variance of the error rates $e_n^i$ is less than the variance of classifiers trained with n samples and tested on a new sample (see Appendix 1b for proof and details).
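The statistics used per training set size, the mean and the 25th/75th percentiles of the subsampled error rates, can be computed directly. A small stdlib sketch (the interpolation convention and the example error rates are our own illustrative choices):

```python
def quantile(values, q):
    # empirical quantile by linear interpolation over the sorted sample
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

# hypothetical error rates e_n^i from T1 = 8 train/test splits at one size n
errors = [0.21, 0.25, 0.18, 0.30, 0.27, 0.22, 0.19, 0.24]
e_bar = sum(errors) / len(errors)
q25, q75 = quantile(errors, 0.25), quantile(errors, 0.75)
print(e_bar, q25, q75)
```

The mean feeds the central learning curve, and the two quantiles feed the curves that play the role of error bars.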
As described in the introduction, theoretical justifications for the use of inverse power laws can be made using analyses of classification accuracy based upon techniques from statistical mechanics (Opper et al., 1990, 1995) and approximation theory (Niyogi et al., 1996), as described in more detail in Appendix 2.
FIG. 1. The statistical significance for the fictitious dataset example with (a) 15 samples and (b) 45 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the classifier with randomization of labels.
Fitting the parameters of the learning curve by minimizing

$$\min_{\alpha, a, b} \sum_{l=1}^{M} \left(a n_l^{-\alpha} + b - e_{n_l}\right)^2 \quad \text{subject to } \alpha, a, b \ge 0$$

is a convex optimization problem when b is fixed. For a fixed b, one can estimate $\alpha$ and a by taking logarithms and solving the following equivalent linear minimization problem:

$$\min_{\alpha, a} \sum_{l=1}^{M} \left(\ln a - \alpha \ln n_l - \ln(e_{n_l} - b)\right)^2 \quad \text{subject to } \alpha, a \ge 0.$$

Solving this linear problem for various values of b, followed by a line search over b, gives us our estimate of $\alpha$, a, and b.
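The two-stage scheme just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: the line search over b is done on a simple grid, the inner problem is solved by closed-form simple linear regression in log space, and the synthetic data are generated from a known power law so the recovered parameters can be checked.

```python
import math

def fit_power_law(sizes, errors, b_grid=None):
    """Fit e(n) ~ a * n**(-alpha) + b: for each fixed b, ln(e - b) =
    ln a - alpha * ln n is linear least squares; grid line-search over b."""
    if b_grid is None:
        # 0 <= b < min(e), coarse 0.001 grid (an illustrative choice)
        b_grid = [i * 0.001 for i in range(int(min(errors) * 1000))]
    xs = [math.log(n) for n in sizes]
    best = None
    for b in b_grid:
        ys = [math.log(e - b) for e in errors]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        alpha = max(0.0, -sxy / sxx)          # enforce alpha >= 0
        a = math.exp(my + alpha * mx)         # a >= 0 automatically
        sse = sum((a * n ** -alpha + b - e) ** 2 for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, alpha, a, b)
    return best[1], best[2], best[3]

# synthetic check: data generated from 1.4 * n**-0.5 + 0.01
sizes = [20, 30, 40, 60, 80, 120]
errors = [1.4 * n ** -0.5 + 0.01 for n in sizes]
alpha, a, b = fit_power_law(sizes, errors)
print(round(alpha, 2), round(a, 2), round(b, 3))  # -> 0.5 1.4 0.01
```

Because the problem is linear once b is fixed, the grid over b is the only search needed; a finer line search around the best grid point would refine the estimate.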
As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

$$H_0: \; p\big(y = 1 \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\}\big) = p\big(y = -1 \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\}\big);$$
the null hypothesis states that given a particular training set the conditional probability of a label being 1 oriexcl1 is equal A random classi er is built from the same input data with the class labels of the data randomlypermuted This addresses the question of how well the classi er can learn the mapping f x y whenthe y values are random y D f1 iexcl1g In essence we ask how well a classi er trained on randomlylabeled data can classify correctly labeled data The permutation procedure outlined above helps to answerthis question For each traintest realization for which an error rate eni was computed we construct T2
randomized realizations where the labels of the training set are randomly permuted We build classi erson these randomized training sets and test on the corresponding test set This results in a set of error ratesenij for training set size n From these error rates we construct an empirical distribution function for therandom classi er
P rann x D 1
T1 pound T2
T1X
iD1
T2X
jD1
microx iexcl enij
where $\theta(z) = 1$ if $z \ge 0$ and 0 otherwise. The significance of the classifier is $P_n^{ran}(\bar{e}_n)$, which is the percentage of random classifiers with error rate smaller than $\bar{e}_n$. The procedure is illustrated with the following two examples for a fictitious dataset.
Example 1. $\bar{e}_{15} = 0.37$ (error rate of the classifier with 15 samples); $e_{15}^{ij} = \{0.215, 0.260, 0.290, 0.320, 0.366, 0.388, 0.395, 0.408, 0.420, 0.495\}$ (error rates of the random classifiers with 15 samples); $P_{15}^{ran}(\bar{e}_{15}) = 0.50$. Since the p-value is greater than 0.05, the error rate of the classifier is not statistically significant (see Fig. 2a).
Example 2. $\bar{e}_{45} = 0.1$ (error rate of the classifier with 45 samples); $e_{45}^{ij} = \{0.205, 0.270, 0.333, 0.337, 0.370, 0.392, 0.399, 0.406, 0.425, 0.499\}$ (error rates of the random classifiers with 45 samples); $P_{45}^{ran}(\bar{e}_{45}) = 0.0$. Since the p-value is less than 0.05, the error rate of the classifier is statistically significant (see Fig. 2b).
The detailed description of the entire methodology for a two-class problem is as follows.

1) Subsampling and significance permutation test
A. Subsampling procedure
i. Given $\ell_{c1}$ samples from class 1 and $\ell_{c2}$ samples from class 2, the total number of samples is $\ell = \ell_{c1} + \ell_{c2}$, where $\ell_{c1}, \ell_{c2} \ge 10$.
ii. Select 10 training set sizes $n_1, \ldots, n_j, \ldots, n_{10}$ over the interval $[10, \ell - 10]$.
FIG. 2. Dataset size estimation statistical methodology.
1. For each training set size $n_j$, run the following subsampling procedure $T_1 = 50$ times, indexed by $i = 1, \ldots, T_1$:
a. Randomly split the dataset into a training set with $n_j$ samples and a test set with $\ell - n_j$ samples, subject to the requirement that $\frac{n_{c2}}{n_{c1}} \approx \frac{\ell_{c2}}{\ell_{c1}}$, where $n_{c1}$ and $n_{c2}$ are the numbers of samples from class 1 and class 2 in the training set. Call the two datasets generated $S_n^i$.
b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates $e_n^i$.
B. Permutation test
i. For each subsampled train/test split $S_n^i$, run the following permutation procedure $T_2 > 50$ times, indexed by $j = 1, \ldots, T_2$:
1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated $S_n^{ran,ij}$.
2. Train a classifier on the training set and measure its error on the test set; call this error rate $e_n^{ran,ij}$.
C. Significance calculation
1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: $P_n^{ran}(x) = \frac{1}{T_1 \times T_2}\sum_{i=1}^{T_1}\sum_{j=1}^{T_2} \theta(x - e_n^{ran,ij})$, where $\theta(z) = 1$ if $z \ge 0$ and 0 otherwise.
2. Given the above empirical distribution function, compute for each $\bar{e}_n$ the value $t_n = P_n^{ran}(\bar{e}_n)$; statistical significance with respect to an $\alpha$-value of p is achieved for $n_0$, the smallest n for which $t_n < p$.

2) Learning curves and training set size estimation
A. Assume the subsampling procedure was run for M different sample sizes n, indexed by $l = 1, \ldots, M$. Take the sequence of error rates and compute the following quantities for each training set size $n > n_0$ for which the classifier passed the significance test ($t_n < p$): the mean error rate $\bar{e}_n = \frac{1}{T_1}\sum_{i=1}^{T_1} e_n^i$ and the 25th and 75th quantiles of the vector of error rates $\{e_n^1, \ldots, e_n^{T_1}\}$.
B. Use the above quantities to fit the following learning curves:
i. Given training set sizes $n_l$ and mean error rates $\bar{e}_{n_l}$, compute $\alpha, a, b$ via the minimization procedure $\min_{\alpha,a,b} \sum_{l=1}^{M} (a n_l^{-\alpha} + b - \bar{e}_{n_l})^2$ subject to $\alpha, a, b \ge 0$; designate the resulting values as $\alpha_m, a_m, b_m$. The resulting curve estimates the mean error rate as a function of training set size:

$$L_m(n) = a_m n^{-\alpha_m} + b_m.$$

ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates $\{e_n^1, \ldots, e_n^{T_1}\}$.
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcome. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10–20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40–60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50–70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees statistically significant results. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.
Tumor versus normal tissue

This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Learning curve parameters: a (learning rate), exponent (decay rate, shown as the fitted n-exponent), and b (asymptotic Bayes error rate). Predicted error rates at 400 samples are given in percent for the 25th quantile, mean, and 75th quantile curves.

Dataset (problem)                        Curve           a        Exponent   b        Predicted error at 400 samples (%)
Multiple tissues (cancer vs. normal)     25th quantile   1.89     -0.63      0.003     4.7
                                         mean            1.42     -0.52      0.010     7.3
                                         75th quantile   1.17     -0.43      0.000     8.9
Leukemia (AML vs. ALL)                   25th quantile   39.8     -2.3       0.006     0.6
                                         mean            0.771    -0.65      0.009     2.5
                                         75th quantile   0.884    -0.60      0.010     3.4
Colon cancer (cancer vs. normal)         25th quantile   0.439    -0.37      0.000     4.8
                                         mean            0.480    -0.28      0.000     9.0
                                         75th quantile   0.500    -0.23      0.000    12.4
Ovarian cancer (cancer vs. normal)       25th quantile   10.7     -1.76      0.000     0.3
                                         mean            0.736    -0.69      0.000     1.2
                                         75th quantile   0.995    -0.67      0.000     1.8
Lymphoma (DLBCL vs. follicular)          25th quantile   8.99     -2.48      0.000     0.0
                                         mean            0.57     -0.70      0.001     0.9
                                         75th quantile   0.671    -0.64      0.000     1.5
Brain cancer (treatment outcome)         25th quantile   1.06     -0.37      0.004    14.5
                                         mean            1.12     -0.33      0.006    16.1
                                         75th quantile   0.980    -0.24      0.000    23.8
Lymphoma (treatment outcome)             25th quantile   1.23     -0.41      0.008    11.2
                                         mean            0.943    -0.30      0.010    17.0
                                         75th quantile   0.872    -0.21      0.000    24.3
Breast cancer (treatment outcome)        25th quantile   0.532    -0.14      0.010    24.3
                                         mean            0.485    -0.07      0.010    32.3
                                         75th quantile   0.429    -0.01      0.000    40.4
Table 2. Summary of Results for Datasets Included in the Study

For each dataset, the entries are: dataset size; training set size needed to achieve statistical significance (p-value < 0.05); whether the learning curve was fitted with statistically significant data; the maximum training set size used within the fitting; the classification error (%), actual versus learning-curve (extrapolated) estimate; and conclusions.

Multiple tissues (cancer vs. normal): size 280; significance at 15; fitted with significant data: yes; maximum size used 210; error 7.0 actual vs. 8.6 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10). Even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL): size 73; significance at 10; fitted with significant data: yes; maximum size used 35; error 6.5 actual vs. 5.6 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error to ~0.

Colon cancer (cancer vs. normal): size 62; significance at 10; fitted with significant data: yes; maximum size used 50; error 16.3 actual vs. 15.2 extrapolated. Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16. Lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): size 54; significance at 10; fitted with significant data: yes; maximum size used 40; error 5.6 actual vs. 4.8 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower the error rate to ~0.

Lymphoma (DLBCL vs. follicular): size 53; significance at 15; fitted with significant data: yes; maximum size used 40; error 4.4 actual vs. 3.7 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035). Additional samples may lower the error rate to ~0.

Brain cancer (treatment outcome): size 60; significance at 45; fitted with significant data: no; maximum size used 40; error 22.2 actual vs. 29.6 extrapolated. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome): size 58; significance at 50; fitted with significant data: no; maximum size used 40; error 23.2 actual vs. 29.5 extrapolated. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome): size 78; significance at 65; fitted with significant data: no; maximum size used 70; error 30.3 actual vs. 36.3 extrapolated. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection Error rates were estimated for training set sizes of n D30 40 50 60 80 90 130 170 210 A leave-one-out model built with all the available samples (280)was used to validate the method and to compare the scaling model to the error rate achieved when usingalmost the entire dataset this corresponds to a training set size of n D 279
Figure 3 illustrates the results of the signi cance permutation test for this dataset ie the statisticalsigni cance of classi ers with training sets of 15 and 30 samples As can be seen in Fig 3b with 30samples most of the random classi ers attain larger error rates than the actual classi er For the case using15 samples about one in 20 of the random classi ers attain the same or better error rates and therefore ap-value of 5 is achieved To t the learning curves we will use only data points obtained from trainingsets of size greater than or equal to 15
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using (1) the error rates for all training set sizes (up to 210), (2) the error rates for the first 8 training set sizes, (3) the error rates for the first 6 training set sizes, and (4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to (a) all training set sizes stated above (up to 210) and (b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n estimated using training set sizes (up to 210) is
error(n) = 1.42 n^(−0.52) + 0.0098

The error rates for the 25th and 75th quantiles are

error_25(n) = 1.89 n^(−0.63) + 0.0032
error_75(n) = 1.17 n^(−0.43) + 0.000
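Fits of this form, e(n) = a n^(−α) + b, can be obtained with standard nonlinear least squares. A minimal sketch (the error rates below are synthetic placeholders generated from a known curve, not the paper's measured values):

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    # inverse power law: e(n) = a * n**(-alpha) + b
    return a * n**(-alpha) + b

# illustrative (n, mean error) pairs, e.g. from repeated subsampling
n_obs = np.array([30, 40, 50, 60, 80, 90, 130, 170, 210], dtype=float)
e_obs = 1.42 * n_obs**-0.52 + 0.0098                            # known curve
e_obs += np.random.default_rng(0).normal(0, 0.005, n_obs.size)  # noise

(a, alpha, b), _ = curve_fit(learning_curve, n_obs, e_obs,
                             p0=(1.0, 0.5, 0.05),
                             bounds=([0, 0, 0], [np.inf, 5, 1]))
print(a, alpha, b)  # should land near (1.42, 0.52, 0.0098)
```

The same fit applied to the empirical 25th and 75th quantile error rates yields the quantile envelope curves.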
Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
ESTIMATING DATASET SIZE REQUIREMENTS 129
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
error(400) = 1.42 · 400^(−0.52) + 0.0098 = 7.3%

error_25(400) = 1.89 · 400^(−0.63) + 0.0032 = 4.7%

error_75(400) = 1.17 · 400^(−0.43) + 0.000 = 8.9%
The achievable error rate using 400 samples according to the model is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
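The extrapolation is simply an evaluation of the three fitted curves at n = 400:

```python
def power_law(n, a, alpha, b):
    # e(n) = a * n**(-alpha) + b, error expressed as a fraction
    return a * n**(-alpha) + b

n = 400
mean_err = power_law(n, 1.42, 0.52, 0.0098)  # mean learning curve
q25_err = power_law(n, 1.89, 0.63, 0.0032)   # 25th-quantile curve
q75_err = power_law(n, 1.17, 0.43, 0.000)    # 75th-quantile curve
print(round(100 * mean_err, 1))  # 7.3 (%)
print(round(100 * q25_err, 1))   # 4.7 (%)
print(round(100 * q75_err, 1))   # 8.9 (%)
```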
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 77.06 n^(−0.65) + 0.09
In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.
A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 47.98 n^(−0.2797)
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 73.62 n^(−0.6864)
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 57 n^(−0.7073) + 0.006
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
error(n) = 111.5 n^(−0.3295) + 0.06
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
error(n) = 94.31 n^(−0.2957) + 0.1
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
error(n) = 48.52 n^(−0.0733) + 0.1
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and small available dataset
sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:
L_p(z_1, …, z_ℓ) = (1/p) Σ_p Q(z_p, f_{z^(ℓ−p)}),
where z = (x, y), f_{z^(ℓ−p)} is the classifier constructed with p samples left out, and Q(z_p, f_{z^(ℓ−p)}) is the error of this classifier on the p samples left out. This procedure is unbiased:
E L_p(z_1, …, z_ℓ) = E Q(z, f_{z^(ℓ−p)}),
which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969), by the following series of transformations:
E L_1(z_1, …, z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^(ℓ−1)}) dP(z_1) ⋯ dP(z_ℓ)

= (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^(ℓ−1)}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i, f_{z^(ℓ−1)})

= E Q(z, f_{z^(ℓ−1)}).
The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
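The subsampling procedure itself is simple to state in code. A schematic version (the classifier interface is a placeholder):

```python
import numpy as np

def leave_p_out_errors(X, y, p, n_subsamples, train_fn, error_fn, seed=0):
    """Repeatedly hold out p samples at random, train on the remaining
    l - p samples, and record the error on the held-out set."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_subsamples):
        held_out = rng.choice(len(y), size=p, replace=False)
        train = np.setdiff1d(np.arange(len(y)), held_out)
        model = train_fn(X[train], y[train])
        errors.append(error_fn(model, X[held_out], y[held_out]))
    return errors
```

By Statement 1, the mean of the returned error rates is an unbiased estimate of the expected error of a classifier trained on ℓ − p samples; the 25th and 75th percentiles of the list give the quantile envelope.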
1b The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out
procedure is not necessarily unbiased and in general will be optimistic:
V L_p(z_1, …, z_ℓ) ≤ V Q(z, f_{z^(ℓ−p)}),
which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The variance for training sets of size ℓ − 1 is

V Q(z, f_{z^(ℓ−1)}) = E[Q(z, f_{z^(ℓ−1)})²] − [E Q(z, f_{z^(ℓ−1)})]².
The variance of the estimator is

V(L_1(z_1, …, z_ℓ)) = V( (1/ℓ) Σ_{i=1}^{ℓ} t_i ),

where t_i indicates whether or not an error is made when the i-th point is left out:

t_i = Q(z_i, f_{(x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ)}).
We can rewrite this as

V( (1/ℓ) Σ_{i=1}^{ℓ} t_i ) = E( (1/ℓ) Σ_{i=1}^{ℓ} t_i )² − ( E (1/ℓ) Σ_{i=1}^{ℓ} t_i )².
From Statement 1, we know

E (1/ℓ) Σ_{i=1}^{ℓ} t_i = E Q(z, f_{z^(ℓ−1)}).
So

V( (1/ℓ) Σ_{i=1}^{ℓ} t_i ) = E( (1/ℓ) Σ_{i=1}^{ℓ} t_i )² − [E Q(z, f_{z^(ℓ−1)})]².
One can show that

E( (1/ℓ) Σ_{i=1}^{ℓ} t_i )² ≤ E[Q(z, f_{z^(ℓ−1)})²].
One can write

E( (1/ℓ) Σ_{i=1}^{ℓ} t_i )² = E( (1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j ).
If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E( (1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j ) = E( t²/ℓ + ((ℓ² − ℓ)/ℓ²) t² ) = E[Q(z, f_{z^(ℓ−1)})]².
However, t_1, …, t_ℓ are not independent and are correlated, so

E( (1/ℓ²) Σ_{i≠j} t_i t_j ) ≤ E( ((ℓ² − ℓ)/ℓ²) t² ).
The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
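This optimism is easy to observe in a toy simulation with correlated 0/1 error indicators (the shared "difficulty" variable below is an arbitrary illustrative choice, not a model from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
l, trials = 10, 200_000
# exchangeable, correlated error indicators t_1..t_l: each simulated dataset
# has a shared difficulty, and each indicator is Bernoulli(difficulty)
difficulty = rng.random(trials)
t = (rng.random((trials, l)) < difficulty[:, None]).astype(float)

var_single = t[:, 0].var()             # Var Q: one held-out error indicator
var_subsampled = t.mean(axis=1).var()  # variance of the averaged estimator
print(var_subsampled < var_single)     # True
```

Because the t_i share a common source of variation, averaging them shrinks the noise but not to zero, and the averaged estimator's variance understates the variance of a single held-out error.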
1c Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The cumulative distribution function of the random variable t = Q(z, f_{z^(ℓ−1)}) is P_{D^(ℓ−1)}{t < τ}. The cumulative distribution function of the random variable r = (1/ℓ) L_{ℓ−1}(z_1, …, z_ℓ) is P_{D^ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is
P(t < τ) = ∫_{−∞}^{τ} t p(t) dt = ∫_{−∞}^{∞} t θ(τ − t) p(t) dt.
The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

P(r < τ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} r_i θ(τ − r_i) p(r_i) dr_i,
which is the same as P(t < τ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us
P{ √ℓ sup_x |F(x) − F_ℓ(x)| < ε } ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^(k−1) e^(−2k²ε²),
where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
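The Kolmogorov-Smirnov approximation above is easy to evaluate numerically (truncating the series at 100 terms, an arbitrary but more than sufficient cutoff):

```python
import math

def ks_confidence(eps, n_terms=100):
    """P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps } for large l, via the
    Kolmogorov approximation 1 - 2 * sum (-1)^(k-1) * exp(-2 k^2 eps^2)."""
    return 1 - 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
                       for k in range(1, n_terms + 1))

# with l = 100 subsampled error rates, a uniform CDF deviation larger than
# eps / sqrt(l) = 1.36 / 10 = 0.136 occurs only about 5% of the time
print(round(ks_confidence(1.36), 2))  # 0.95
```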
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix, we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
y = sign(w · z),
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be
Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],
where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^(−1) (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^(−1) (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^(−1/2), and for the Hebb rule as 0.40 n^(−1/2)
(see Watkin [1993] for a review). The dependence on n is thus approximately n^(−α), with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term to the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^(−α) + b for modeling the generalization error.
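The teacher-student setup can be simulated directly to see this decay. A toy sketch using random Gaussian patterns and a Hebb-rule student (the dimensions and sample sizes are arbitrary choices; the SM analysis above concerns the optimal perceptron, not the Hebb rule):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                        # number of attributes
teacher = rng.normal(size=m)  # teacher weight vector t

def gen_error(w, t, n_test=20000):
    # fraction of fresh random patterns where student and teacher disagree
    z = rng.normal(size=(n_test, m))
    return float(np.mean(np.sign(z @ w) != np.sign(z @ t)))

errors = {}
for n in [10, 40, 160, 640]:
    z = rng.normal(size=(n, m))       # random training patterns
    y = np.sign(z @ teacher)          # labels provided by the teacher
    w = (y[:, None] * z).sum(axis=0)  # Hebb rule: w = sum_i y_i z_i
    errors[n] = gen_error(w, teacher)
print(errors)  # error shrinks roughly like 0.40 * (n/m)**(-1/2)
```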
2b A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as
d(f_o, f_T) = O(n^(−α)),
where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.
For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^(−α)) for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Notterman, D., Alon, U., Sierk, A., Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB), ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
ESTIMATING DATASET SIZE REQUIREMENTS 123
Fitting the parameters of the learning curve by minimizing

    min_{α,a,b} Σ_{l=1}^{M} (a n_l^(−α) + b − e_{n_l})²,  subject to α, a, b ≥ 0,

is a convex optimization problem when b is fixed. For a fixed b, one can estimate α and a by taking logarithms and solving the following equivalent linear minimization problem:

    min_{α,a} Σ_{l=1}^{M} (ln a − α ln n_l − ln(e_{n_l} − b))²,  subject to α, a ≥ 0.

Solving this linear problem for various values of b, followed by a line search, gives us our estimates of α, a, and b.
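This fit is straightforward to implement. The sketch below (Python with numpy; the function name is ours, not from the paper) grid-searches b and solves the log-linear least-squares problem for each candidate, keeping the b whose refit has the smallest residual in the original scale:

```python
import numpy as np

def fit_learning_curve(n, e, grid=200):
    """Fit e(n) ~ a * n**(-alpha) + b with alpha, a, b >= 0.

    For a fixed b, ln(e_n - b) = ln a - alpha * ln n is linear in
    (ln a, alpha), so each candidate b needs only an ordinary
    least-squares solve; the best b is chosen by the residual of
    the refit in the original (non-log) scale.
    """
    n, e = np.asarray(n, float), np.asarray(e, float)
    best = None
    for b in np.linspace(0.0, e.min() - 1e-6, grid):
        y = np.log(e - b)                            # ln(e_n - b)
        X = np.column_stack([np.ones_like(n), -np.log(n)])
        (ln_a, alpha), *_ = np.linalg.lstsq(X, y, rcond=None)
        alpha = max(alpha, 0.0)                      # enforce alpha >= 0
        resid = np.sum((np.exp(ln_a) * n ** -alpha + b - e) ** 2)
        if best is None or resid < best[0]:
            best = (resid, np.exp(ln_a), alpha, b)
    return best[1], best[2], best[3]                 # a, alpha, b
```

Given error rates measured at a handful of training set sizes, the returned (a, α, b) define the extrapolation curve e(n) = a n^(−α) + b.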
As described in Section 2, the fitted learning curve does not extrapolate accurately when error rates are large and not statistically significant. This motivates a procedure to determine at what training set size the error rate is statistically significant when compared to the null hypothesis of the error rate of a random classifier:

    H0: p(y = 1 | x, {(x1, y1), …, (xn, yn)}) = p(y = −1 | x, {(x1, y1), …, (xn, yn)});

the null hypothesis states that, given a particular training set, the conditional probability of a label being 1 or −1 is equal. A random classifier is built from the same input data with the class labels of the data randomly permuted. This addresses the question of how well the classifier can learn the mapping f: x → y when the y values are random, y ∈ {1, −1}. In essence, we ask how well a classifier trained on randomly labeled data can classify correctly labeled data. The permutation procedure outlined above helps to answer this question. For each train/test realization for which an error rate e_{n,i} was computed, we construct T2 randomized realizations where the labels of the training set are randomly permuted. We build classifiers on these randomized training sets and test on the corresponding test sets. This results in a set of error rates e_{n,ij} for training set size n. From these error rates, we construct an empirical distribution function for the random classifier:

    P^ran_n(x) = (1 / (T1 × T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e_{n,ij}),

where θ(z) = 1 if z ≥ 0 and 0 otherwise. The significance of the classifier is P^ran_n(e_n), which is the percentage of random classifiers with an error rate smaller than e_n. The procedure is illustrated with the following two examples for a fictitious dataset.
Example 1. e_15 = 37% (error rate of the classifier with 15 samples); e_{15,ij} = {21.5%, 26.0%, 29.0%, 32.0%, 36.6%, 38.8%, 39.5%, 40.8%, 42.0%, 49.5%} (error rates of the random classifiers with 15 samples); P^ran_15(e_15) = .50. Since the p-value is greater than .05, the error rate of the classifier is not statistically significant (see Fig. 2a).

Example 2. e_45 = 1% (error rate of the classifier with 45 samples); e_{45,ij} = {20.5%, 27.0%, 33.3%, 33.7%, 37.0%, 39.2%, 39.9%, 40.6%, 42.5%, 49.9%} (error rates of the random classifiers with 45 samples); P^ran_45(e_45) = 0. Since the p-value is less than .05, the error rate of the classifier is statistically significant (see Fig. 2b).
The detailed description of the entire methodology for a two-class problem is as follows.

1) Subsampling and significance permutation test
   A. Subsampling procedure
      i. Given ℓ_c1 samples from class 1 and ℓ_c2 samples from class 2, the total number of samples is ℓ = ℓ_c1 + ℓ_c2, where ℓ ≥ 10.
      ii. Select 10 training set sizes n_1, …, n_j, …, n_10 over the interval [10, ℓ − 10].
FIG. 2. Dataset size estimation statistical methodology.
      1. For each training set size n_j, run the following subsampling procedure T1 = 50 times, indexed by i = 1, …, T1.
         a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_c2/n_c1 ≈ ℓ_c2/ℓ_c1, where n_c1 and n_c2 are the numbers of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{n,i}.
         b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_{n,i}.
   B. Permutation test
      i. For each subsampled train/test split S_{n,i}, run the following permutation procedure T2 > 50 times, indexed by j = 1, …, T2.
         1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S^ran_{n,ij}.
         2. Train a classifier on the training set and measure its error on the test set; call this error rate e^ran_{n,ij}.
   C. Significance calculation
      1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets: P^ran_n(x) = (1/(T1 × T2)) Σ_{i=1}^{T1} Σ_{j=1}^{T2} θ(x − e^ran_{n,ij}), where θ(z) = 1 if z ≥ 0 and 0 otherwise.
      2. Given the above empirical distribution function, compute for each e_n the value t_n = P^ran_n(e_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.
2) Learning curves and training set size estimation
   A. Assume the subsampling procedure was run for M different sample sizes n, indexed by l = 1, …, M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate e_n = (1/T1) Σ_{i=1}^{T1} e_{n,i}, and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, …, e_{n,T1}}.
   B. Use the above quantities to fit the following learning curves.
      i. Given training set sizes n_l and mean error rates e_{n_l}, compute α, a, b via the minimization procedure min_{α,a,b} Σ_{l=1}^{M} (a n_l^(−α) + b − e_{n_l})², subject to α, a, b ≥ 0; designate the resulting values as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

         L_m(n) = a_m n^(−α_m) + b_m.

      ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, …, e_{n,T1}}.
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS
The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcomes. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification); for the seven other datasets we will present only the final results.
The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10 and 20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40 and 60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50 and 70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case. The final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.
Tumor versus normal tissue
This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Learning curve parameters (learning rate a, decay rate α, asymptotic Bayes error rate b) fitted for the 25th quantile, mean, and 75th quantile curves of each dataset, together with the predicted error rate (%) at 400 training samples.

Multiple tissues (cancer vs. normal):
  25th quantile: a = 1.89, α = 0.63, b = 0.003; predicted error at 400 samples: 4.7
  Mean: a = 1.42, α = 0.52, b = 0.010; 7.3
  75th quantile: a = 1.17, α = 0.43, b = 0.000; 8.9
Leukemia (AML vs. ALL):
  25th quantile: a = 39.8, α = 2.3, b = 0.006; 0.6
  Mean: a = 77.1, α = 0.65, b = 0.090; 2.5
  75th quantile: a = 88.4, α = 0.60, b = 0.010; 3.4
Colon cancer (cancer vs. normal):
  25th quantile: a = 43.9, α = 0.37, b = 0.000; 4.8
  Mean: a = 48.0, α = 0.28, b = 0.000; 9.0
  75th quantile: a = 50.0, α = 0.23, b = 0.000; 12.4
Ovarian cancer (cancer vs. normal):
  25th quantile: a = 10.7, α = 1.76, b = 0.000; 0.3
  Mean: a = 73.6, α = 0.69, b = 0.000; 1.2
  75th quantile: a = 99.5, α = 0.67, b = 0.000; 1.8
Lymphoma (DLBC vs. follicular):
  25th quantile: a = 8.99, α = 2.48, b = 0.000; 0.0
  Mean: a = 5.7, α = 0.70, b = 0.006; 0.9
  75th quantile: a = 6.71, α = 0.64, b = 0.000; 1.5
Brain cancer (treatment outcome):
  25th quantile: a = 10.6, α = 0.37, b = 0.004; 14.5
  Mean: a = 11.2, α = 0.33, b = 0.060; 16.1
  75th quantile: a = 9.80, α = 0.24, b = 0.000; 23.8
Lymphoma (treatment outcome):
  25th quantile: a = 12.3, α = 0.41, b = 0.008; 11.2
  Mean: a = 94.3, α = 0.30, b = 0.010; 17.2
  75th quantile: a = 87.2, α = 0.21, b = 0.000; 24.3
Breast cancer (treatment outcome):
  25th quantile: a = 53.2, α = 0.14, b = 0.010; 24.3
  Mean: a = 48.5, α = 0.07, b = 0.010; 32.3
  75th quantile: a = 42.9, α = 0.01, b = 0.000; 40.4
Table 2. Summary of Results for Datasets Included in the Study

For each dataset: the problem, dataset size, training set size needed to achieve statistical significance (p-value < 0.05), whether the learning curve was fitted with statistically significant data, the maximum training set size used within the fit, the actual and extrapolated classification errors (%), and conclusions.

Multiple tissues (cancer vs. normal): dataset size 280; significance at 15; fitted with significant data: yes; maximum size used 210; actual error 7.0, extrapolated 8.6. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10); even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL): dataset size 73; significance at 10; fitted with significant data: yes; maximum size used 35; actual error 0.5, extrapolated 6.5. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05); additional samples may lower the error to ~0.

Colon cancer (cancer vs. normal): dataset size 62; significance at 10; fitted with significant data: yes; maximum size used 50; actual error 16.3, extrapolated 15.2. Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16; lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): dataset size 54; significance at 10; fitted with significant data: yes; maximum size used 40; actual error 5.6, extrapolated 4.8. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05); additional samples may lower the error rate to ~0.

Lymphoma (DLBC vs. follicular): dataset size 53; significance at 5; fitted with significant data: yes; maximum size used 40; actual error 4.4, extrapolated 7.0. Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035); additional samples may lower the error rate to ~0.

Brain cancer (treatment outcome): dataset size 60; significance at 45; fitted with significant data: no; maximum size used 40; actual error 22.0, extrapolated 29.6. Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome): dataset size 58; significance at 50; fitted with significant data: no; maximum size used 40; actual error 23.0, extrapolated 29.5. Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome): dataset size 78; significance at 65; fitted with significant data: no; maximum size used 70; actual error 30.0, extrapolated 36.3. Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or a better error rate, and therefore a p-value of .05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to (a) all training sample sizes stated above (up to 210) and (b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is

    error(n) = 1.42 n^(−0.52) + 0.0098.

The error rates for the 25th and 75th quantiles are

    error_25(n) = 1.89 n^(−0.63) + 0.0032,
    error_75(n) = 1.17 n^(−0.43) + 0.00.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantiles envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

    error(400) = 1.42 · 400^(−0.52) + 0.0098 = 7.3%,
    error_25(400) = 1.89 · 400^(−0.63) + 0.0032 = 4.7%,
    error_75(400) = 1.17 · 400^(−0.43) + 0.00 = 8.9%.
The achievable error rate using 400 samples according to the model is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
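As a quick arithmetic check, the three extrapolations above can be reproduced directly from the fitted parameters (Python; the helper name is ours):

```python
def curve(n, a, alpha, b):
    """Learning-curve model: error(n) = a * n**(-alpha) + b."""
    return a * n ** -alpha + b

# Tumor vs. normal fits, evaluated at n = 400 (error rates in %)
print(round(100 * curve(400, 1.42, 0.52, 0.0098), 1))  # mean: 7.3
print(round(100 * curve(400, 1.89, 0.63, 0.0032), 1))  # 25th quantile: 4.7
print(round(100 * curve(400, 1.17, 0.43, 0.0), 1))     # 75th quantile: 8.9
```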
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 77.06 n^(−0.65) + 0.09.

In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 47.98 n^(−0.2797).
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 73.62 n^(−0.6864).
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

    error(n) = 5.7 n^(−0.7073) + 0.006.
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome; all patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of .05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

    error(n) = 11.15 n^(−0.3295) + 0.06.
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome; all patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of .05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

    error(n) = 94.31 n^(−0.2957) + 0.1.
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset is 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of .05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

    error(n) = 48.52 n^(−0.0733) + 0.1.
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5. CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics for larger datasets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1,\dots,z_\ell) = \frac{1}{p} \sum_{p} Q(z^{p}; f(z^{\ell-p})),$$

where z = (x, y), f(z^{ℓ−p}) is the classifier constructed with p samples left out, the sum runs over the p samples left out, and Q(z^{p}; f(z^{ℓ−p})) is the error of this classifier on the p samples left out. This procedure is unbiased:

$$E\,L_p(z_1,\dots,z_\ell) = E\,Q(z; f(z^{\ell-p})),$$

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969), by the following series of transformations:

$$E\,L_1(z_1,\dots,z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f(z^{\ell-1}))\, dP(z_1)\cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f(z^{\ell-1}))\, dP(z_i)\, dP(z_1)\cdots dP(z_{i-1})\, dP(z_{i+1})\cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\,Q(z_i; f(z^{\ell-1})) = E\,Q(z; f(z^{\ell-1})).$$

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
1b The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance estimate given by the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

$$V[L_p(z_1,\dots,z_\ell)] \le V[Q(z; f(z^{\ell-p}))],$$

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

$$V[Q(z; f(z^{\ell-1}))] = E\left[Q(z; f(z^{\ell-1}))^2\right] - \left[E\,Q(z; f(z^{\ell-1}))\right]^2.$$

The variance of the estimator is

$$V[L_1(z_1,\dots,z_\ell)] = V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right],$$

where t_i indicates whether an error is made or not when the i-th point is left out:

$$t_i = Q(z_i; f(x_1,\dots,x_{i-1}, x_{i+1},\dots,x_\ell)).$$

We can rewrite this as

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2\right] - \left(E\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right]\right)^2.$$

From Statement 1 we know

$$E\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\,Q(z; f(z^{\ell-1})).$$

So

$$V\left[\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right] = E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2\right] - \left[E\,Q(z; f(z^{\ell-1}))\right]^2.$$

One can show that

$$E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2\right] \le E\left[Q(z; f(z^{\ell-1}))^2\right].$$

One can write

$$E\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^2\right] = E\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \ne j} t_i t_j\right).$$

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

$$E\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i \ne j} t_i t_j\right) = E\left(\frac{t^2}{\ell} + \frac{\ell^2-\ell}{\ell^2}\,t^2\right) = E\left[Q(z; f(z^{\ell-1}))^2\right].$$

However, t_1, ..., t_ℓ are not independent but correlated, so

$$E\left(\frac{1}{\ell^2}\sum_{i \ne j} t_i t_j\right) \le E\left(\frac{\ell^2-\ell}{\ell^2}\,t^2\right).$$

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
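The optimism of the variance can also be seen in simulation: across many replications, the variance of the averaged, correlated leave-one-out errors falls well below the variance of a single independent-test-point error. The sketch below is our own illustrative construction, using a trivial nearest-class-mean rule on synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    """Two Gaussian classes in 1-D with labels -1/+1."""
    y = rng.choice([-1, 1], size=n)
    x = y * 1.0 + rng.normal(size=n)
    return x, y

def predict(x_tr, y_tr, x):
    """Classify by the nearer class mean."""
    m_neg = x_tr[y_tr == -1].mean()
    m_pos = x_tr[y_tr == 1].mean()
    return np.where(np.abs(x - m_pos) < np.abs(x - m_neg), 1, -1)

ell, trials = 20, 2000
loo, holdout = [], []
for _ in range(trials):
    x, y = draw(ell)
    if min((y == -1).sum(), (y == 1).sum()) < 2:
        continue
    # averaged, correlated t_i: the leave-one-out estimate
    errs = [predict(np.delete(x, i), np.delete(y, i), x[i:i+1])[0] != y[i]
            for i in range(ell)]
    loo.append(np.mean(errs))
    # a single test-point error for a classifier trained on ell - 1 samples
    xt, yt = draw(1)
    holdout.append(predict(x[:-1], y[:-1], xt)[0] != yt[0])

print(np.var(loo), np.var(holdout))  # leave-one-out variance is much smaller
```

The single-point error is a Bernoulli variable with variance near p(1 − p), while the leave-one-out average smooths over ℓ correlated terms, which is exactly the bias described above.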
1c Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f(z^{ℓ−1})) is P_{D^{ℓ−1}}{t < τ}. The cumulative distribution function of the random variable r = L_1(z_1,…,z_ℓ) is P_{D^ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\tau - t)\,p(t)\,dt.$$

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\,p(r_i)\,dr_i,$$

which is the same as P(t < τ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\,\sup_x |F(x) - F_\ell(x)| < \epsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\epsilon^2},$$

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
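The Kolmogorov-Smirnov tail probability above is easy to evaluate numerically. The sketch below (our own, with hypothetical values of ℓ and δ) finds the smallest ε for which the sup-norm deviation bound holds with confidence 1 − δ:

```python
import math

def ks_prob(eps, terms=100):
    """Approximate P{sqrt(l) * sup_x |F(x) - F_l(x)| < eps}
    via the series 1 - 2 * sum_k (-1)^(k-1) * exp(-2 k^2 eps^2)."""
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
            for k in range(1, terms + 1))
    return 1 - 2 * s

# Example: bound the deviation of empirical quantiles built from
# ell = 50 subsampled error rates, with confidence 1 - delta = 0.95.
ell, delta = 50, 0.05
eps = 0.01
while ks_prob(eps) < 1 - delta:   # scan upward for the critical value
    eps += 0.01
print(eps, eps / math.sqrt(ell))  # sup-norm bound on |F - F_l| with prob. 0.95
```

The scan recovers the familiar 95% Kolmogorov-Smirnov critical value near 1.36, so with 50 subsamples the empirical distribution function (and hence its quantiles) is within roughly 1.36/√50 ≈ 0.19 of the truth in sup norm.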
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta[-(t \cdot z_i)(w \cdot z_i)],$$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
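The three-parameter model e(n) = a n^{−α} + b can be fit with a simple scan over α, since for fixed α the problem is linear least squares in (a, b). The following is a minimal sketch of such a fit (our own construction, not the paper's exact optimizer; the function name is ours):

```python
import numpy as np

def fit_power_law(n, e, alphas=np.linspace(0.01, 3.0, 300)):
    """Fit e(n) ~ a * n**-alpha + b with a, b >= 0 by scanning alpha;
    for each fixed alpha, solve linear least squares in (a, b)."""
    best = None
    for alpha in alphas:
        A = np.column_stack([n ** -alpha, np.ones_like(n, dtype=float)])
        coef, *_ = np.linalg.lstsq(A, e, rcond=None)
        a, b = np.maximum(coef, 0.0)  # crude nonnegativity clamp
        resid = ((a * n ** -alpha + b - e) ** 2).sum()
        if best is None or resid < best[0]:
            best = (resid, a, alpha, b)
    return best[1:]  # a, alpha, b

# Recover known parameters from noiseless synthetic error rates:
n = np.array([15.0, 20, 30, 40, 60, 80, 120])
e = 1.4 * n ** -0.5 + 0.01
a, alpha, b = fit_power_law(n, e)
print(round(a, 2), round(alpha, 2), round(b, 2))
```

On noiseless data the scan recovers the generating parameters; on real subsampled error rates the same scheme gives the constrained fit of Equation (1).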
2b A motivation from an approximation theory point of view
Another justification for a power law, for regression or classification, comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·,·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: a review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
FIG. 2. Dataset size estimation statistical methodology.
1. For each training set size n_j, run the following subsampling procedure T1 = 50 times, indexed by i = 1, ..., T1:
   a. Randomly split the dataset into a training set with n_j samples and a test set with ℓ − n_j samples, subject to the requirement that n_{c2}/n_{c1} ≈ ℓ_{c2}/ℓ_{c1}, where n_{c1} and n_{c2} are the numbers of samples from class 1 and class 2 in the training set. Call the two datasets generated S_{n,i}.
   b. Train a classifier on each of the training sets and measure its error rate on its corresponding test set; call each of these error rates e_{n,i}.

B. Permutation test
   i. For each subsampled train/test split S_{n,i}, run the following permutation procedure T2 > 50 times, indexed by j = 1, ..., T2:
      1. Randomly permute the labels of the samples in the training set (leave the test set alone); call the dataset generated S^{ran}_{n,i,j}.
      2. Train a classifier on the training set and measure its error on the test set; call this error rate e^{ran}_{n,i,j}.

C. Significance calculation
   1. For each training set size n, construct an empirical distribution function from the error rates of the permuted datasets,
      $$P^{ran}_n(x) = \frac{1}{T1 \times T2} \sum_{i=1}^{T1} \sum_{j=1}^{T2} \theta(x - e^{ran}_{n,i,j}),$$
      where θ(z) = 1 if z ≥ 0 and 0 otherwise.
   2. Given the above empirical distribution function, compute for each mean error rate ē_n the value t_n = P^{ran}_n(ē_n); statistical significance with respect to an α-value of p is achieved for n₀, the smallest n for which t_n < p.

2) Learning curves and training set size estimation

A. Assume the subsampling procedure was run for M different sample sizes n_l, indexed by l = 1, ..., M. Take the sequence of error rates and compute the following quantities for each training set size n > n₀ for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T1) Σ_{i=1}^{T1} e_{n,i}, and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T1}}.

B. Use the above quantities to fit the following learning curves:
   i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the minimization
      $$\min_{\alpha,a,b} \sum_{l=1}^{M} \left(a\,n_l^{-\alpha} + b - \bar e_{n_l}\right)^2 \quad \text{subject to } \alpha, a, b \ge 0;$$
      designate the resulting values α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size,
      $$l_m(n) = a_m\,n^{-\alpha_m} + b_m.$$
   ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, ..., e_{n,T1}}.
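The subsampling and permutation steps above can be sketched in code. The following minimal Python illustration is our own construction (all function names are ours, and a nearest-centroid rule stands in for the SVM used in the paper); it runs the procedure for a single training set size on synthetic two-class data:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_fit_predict(Xtr, ytr, Xte):
    """Minimal stand-in classifier: assign the label of the nearer class centroid."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def stratified_split(y, n_train):
    """Split so the training class ratio mirrors the full dataset (step 1.a)."""
    idx = []
    for c in (0, 1):
        pool = np.flatnonzero(y == c)
        k = max(1, round(n_train * pool.size / y.size))
        idx.extend(rng.choice(pool, size=k, replace=False))
    train = np.array(idx)
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test

def significance_at_size(X, y, n, T1=50, T2=50):
    """Steps 1-C: mean error e_n and permutation significance t_n at size n."""
    errs, perm_errs = [], []
    for _ in range(T1):
        tr, te = stratified_split(y, n)
        errs.append(np.mean(nearest_centroid_fit_predict(X[tr], y[tr], X[te]) != y[te]))
        for _ in range(T2):
            y_perm = rng.permutation(y[tr])  # permute training labels only
            perm_errs.append(np.mean(
                nearest_centroid_fit_predict(X[tr], y_perm, X[te]) != y[te]))
    e_n = np.mean(errs)
    t_n = np.mean(np.array(perm_errs) <= e_n)  # empirical P_ran(e_n)
    return e_n, t_n

# Synthetic two-class data: 60 samples, 50 "genes", mean shift in 10 genes.
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(60, 50))
X[y == 1, :10] += 1.5

e_n, t_n = significance_at_size(X, y, n=20)
print(e_n, t_n)  # low error, significant at an alpha-value of 0.05
```

Here t_n plays the role of the permutation p-value: significance at training size n = 20 is declared when t_n falls below the chosen α-value.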
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS

The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcome. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets, we will present only the final results.

The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training set samples (e.g., between 10 and 20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40 and 60 training samples, while the total number of samples in the dataset is barely larger (e.g., between 50 and 70). For these more difficult problems, we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size for a dataset. In this case, it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify this dataset) or the problem is very hard and substantial numbers of samples are needed before one sees a statistically significant result. Our first five morphology datasets are examples of the first case; the final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.
Tumor versus normal tissue
This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16,063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

The fitted learning curve is e(n) = a n^{−α} + b, where a is the learning rate, α the decay rate, and b the asymptotic Bayes error. Predicted error rates at 400 training samples are percentages.

Dataset (problem)                        Quantile    a       α      b       Error (%) at 400
Multiple tissues (cancer vs. normal)     25th        1.89    0.63   0.003    4.7
                                         mean        1.42    0.52   0.010    7.3
                                         75th        1.17    0.43   0.000    8.9
Leukemia (AML vs. ALL)                   25th        3.98    2.3    0.006    0.6
                                         mean        0.771   0.65   0.009    2.5
                                         75th        0.884   0.60   0.010    3.4
Colon cancer (cancer vs. normal)         25th        0.439   0.37   0.000    4.8
                                         mean        0.480   0.28   0.000    9.0
                                         75th        0.500   0.23   0.000   12.4
Ovarian cancer (cancer vs. normal)       25th        1.07    1.76   0.000    0.3
                                         mean        0.736   0.69   0.000    1.2
                                         75th        0.995   0.67   0.000    1.8
Lymphoma (DLBC vs. follicular)           25th        8.99    2.48   0.000    0.0
                                         mean        0.57    0.70   0.001    0.9
                                         75th        0.671   0.64   0.000    1.5
Brain cancer (treatment outcome)         25th        1.06    0.37   0.004   14.5
                                         mean        1.12    0.33   0.006   16.1
                                         75th        0.980   0.24   0.000   23.8
Lymphoma (treatment outcome)             25th        1.23    0.41   0.008   11.2
                                         mean        0.943   0.30   0.010   17
                                         75th        0.872   0.21   0.000   24.3
Breast cancer (treatment outcome)        25th        0.532   0.14   0.010   24.3
                                         mean        0.485   0.07   0.010   32.3
                                         75th        0.429   0.01   0.000   40.4
Table 2. Summary of Results for Datasets Included in the Study

For each dataset we list: the maximum dataset size; the training set size at which statistical significance (p-value < 0.05) is achieved; whether the learning curve was fitted with statistically significant data; the maximum training set size used in fitting the learning curve; the classification error, actual and learning curve (extrapolated); and conclusions.

Multiple tissues (cancer vs. normal): size 280; significance at 15; curve fitted with significant data: yes; maximum size used in fit: 210; error: actual 7.0%, extrapolated 8.6%. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.10). Even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL): size 73; significance at 10; fitted with significant data: yes; maximum size used in fit: 35; error: actual 0.0%, extrapolated 5.6%. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05). Additional samples may lower the error to ≈0.

Colon cancer (cancer vs. normal): size 62; significance at 10; fitted with significant data: yes; maximum size used in fit: 50; error: actual 16.3%, extrapolated 15.2%. Dataset size is large enough for extrapolation and to achieve an error rate of ≈0.16. Lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): size 54; significance at 10; fitted with significant data: yes; maximum size used in fit: 40; error: actual 5.6%, extrapolated 4.8%. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05). Additional samples may lower the error rate to ≈0.

Lymphoma (DLBC vs. follicular): size 53; significance at 5; fitted with significant data: yes; maximum size used in fit: 40; error: actual 4.4%, extrapolated 7.0%. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.035). Additional samples may lower the error rate to ≈0.

Brain cancer (treatment outcome): size 60; significance at 45; fitted with significant data: no; maximum size used in fit: 40; error: actual 22.0%, extrapolated 29.6%. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome): size 58; significance at 50; fitted with significant data: no; maximum size used in fit: 40; error: actual 23.0%, extrapolated 29.5%. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome): size 78; significance at 65; fitted with significant data: no; maximum size used in fit: 70; error: actual 30.0%, extrapolated 36.3%. Dataset is not large enough for extrapolation. Fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.

Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rate, and therefore a p-value of 5% is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is

$$\mathrm{error}(n) = 1.42\,n^{-0.52} + 0.0098.$$

The error rates for the 25th and 75th quantiles are

$$\mathrm{error}_{25}(n) = 1.89\,n^{-0.63} + 0.0032,$$
$$\mathrm{error}_{75}(n) = 1.17\,n^{-0.43} + 0.00.$$
Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope

FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:

$$\mathrm{error}(400) = 1.42 \cdot 400^{-0.52} + 0.0098 = 7.3\%,$$
$$\mathrm{error}_{25}(400) = 1.89 \cdot 400^{-0.63} + 0.0032 = 4.7\%,$$
$$\mathrm{error}_{75}(400) = 1.17 \cdot 400^{-0.43} + 0.00 = 8.9\%.$$

The achievable error rate using 400 samples according to the model is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
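These extrapolations are direct evaluations of the fitted power law; plugging the parameters quoted above into a small helper (the function name is ours) reproduces the quoted percentages:

```python
def learning_curve(n, a, alpha, b):
    """Fitted inverse power law e(n) = a * n**-alpha + b."""
    return a * n ** -alpha + b

# Mean and quantile curves fitted for the tumor vs. normal dataset:
mean_400 = learning_curve(400, 1.42, 0.52, 0.0098)
q25_400 = learning_curve(400, 1.89, 0.63, 0.0032)
q75_400 = learning_curve(400, 1.17, 0.43, 0.0)
print(round(mean_400, 3), round(q25_400, 3), round(q75_400, 3))  # 0.073 0.047 0.089
```

The same helper evaluated at n = 279 gives about 0.086, matching the comparison against the leave-one-out error discussed above.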
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 5% is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.7706\,n^{-0.65} + 0.009.$$

In this case, the learning and decay rates are such that the model clearly learns more quickly, as a function of training set size, than in the previous example. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.
A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
  error(n) = 0.4798 n^{−0.2797}.
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of .05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
  error(n) = 0.7362 n^{−0.6864}.
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of .05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
  error(n) = 0.57 n^{−0.7073} + 0.0006.
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of .05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
  error(n) = 1.115 n^{−0.3295} + 0.006.
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of .05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
  error(n) = 0.9431 n^{−0.2957} + 0.01.
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset is 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of .05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
  error(n) = 0.4852 n^{−0.0733} + 0.01.
FIG. 11. Learning curves in the lymphoma treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5. CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics to larger datasets that include yet-unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

  L_p(z_1, …, z_ℓ) = (1/p) Σ_p Q(z^p; f_{z^{ℓ−p}}),

where z = (x, y), f_{z^{ℓ−p}} is the classifier constructed with the p samples left out, and Q(z^p; f_{z^{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

  E L_p(z_1, …, z_ℓ) = E Q(z; f_{z^{ℓ−p}}),

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969), via the following series of transformations:

  E L_1(z_1, …, z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z^{ℓ−1}}) dP(z_1) ⋯ dP(z_ℓ)
    = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z^{ℓ−1}}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)
    = (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i; f_{z^{ℓ−1}})
    = E Q(z; f_{z^{ℓ−1}}).

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
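Statement 1 is easy to probe numerically. The following sketch is a toy check under stated assumptions: a nearest-class-mean classifier on one-dimensional Gaussian data (not the SVMs used in the paper), comparing the average leave-one-out error over many simulated datasets of size ℓ = 20 with the average error of a classifier trained on ℓ − 1 samples and tested on independent data:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, rng):
    # balanced labels; class means at 0 and 2, unit-variance noise
    y = rng.permutation(np.arange(n) % 2)
    return rng.normal(2.0 * y, 1.0), y

def nearest_mean_error(tr_x, tr_y, te_x, te_y):
    m0, m1 = tr_x[tr_y == 0].mean(), tr_x[tr_y == 1].mean()
    pred = (np.abs(te_x - m1) < np.abs(te_x - m0)).astype(int)
    return np.mean(pred != te_y)

def loo_error(x, y):
    # leave each point out in turn, train on the rest, test on it
    return np.mean([
        nearest_mean_error(np.delete(x, i), np.delete(y, i), x[i:i+1], y[i:i+1])
        for i in range(len(x))
    ])

ell, reps = 20, 300
loo = np.mean([loo_error(*make_data(ell, rng)) for _ in range(reps)])
true = np.mean([
    nearest_mean_error(*make_data(ell - 1, rng), *make_data(2000, rng))
    for _ in range(reps)
])
print(loo, true)  # both estimate the error of a classifier trained on ell - 1 samples
```

Up to Monte Carlo noise the two averages coincide, as the statement predicts.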
1b. The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance of the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

  V[L_p(z_1, …, z_ℓ)] ≤ V[Q(z; f_{z^{ℓ−p}})],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The variance over training sets of size ℓ − 1 is

  V[Q(z; f_{z^{ℓ−1}})] = E[Q(z; f_{z^{ℓ−1}})²] − [E Q(z; f_{z^{ℓ−1}})]².

The variance of the estimator is

  V[L_1(z_1, …, z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the i-th point is left out:

  t_i = Q(z_i; f(x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ)).
We can rewrite this as

  V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know that

  E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E Q(z; f_{z^{ℓ−1}}),

so

  V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E Q(z; f_{z^{ℓ−1}})]².
One can show that

  E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z; f_{z^{ℓ−1}})²].

One can write

  E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].
If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

  E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²/ℓ + ((ℓ² − ℓ)/ℓ²) t²] = E[Q(z; f_{z^{ℓ−1}})²].
However, t_1, …, t_ℓ are not independent but correlated, so

  E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ E[((ℓ² − ℓ)/ℓ²) t²].

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
1c. Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again, we prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The cumulative distribution function of the random variable t = Q(z; f_{z^{ℓ−1}}) is P_{D^{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = L_1(z_1, …, z_ℓ) = (1/ℓ) Σ_{i=1}^{ℓ} r_i is P_{D^ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

  P(t < ξ) = ∫_{−∞}^{ξ} p(t) dt = ∫_{−∞}^{∞} θ(ξ − t) p(t) dt.

The distribution function for the random variable r can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

  P(r < ξ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} θ(ξ − r_i) p(r_i) dr_i,
which is the same as P(t < ξ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from that sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

  P{√ℓ sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value is bounded. For the case ℓ ≤ 20, instead of the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the proposed subsampling procedure gives more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
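The Kolmogorov–Smirnov series above is simple to evaluate numerically; the helper below (an illustrative utility, not from the paper) sums the series and recovers the classical 5% critical value near ε = 1.36:

```python
import math

def ks_prob(eps, terms=100):
    # P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps }
    #   ~ 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 eps^2)
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * eps * eps)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

print(ks_prob(1.36))  # with ~95% probability the empirical CDF stays
                      # within 1.36 / sqrt(l) of the true one
```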
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a. A motivation from the statistical mechanics approach to learning
In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
  y = sign(w · z),
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be
  Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],
where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass, of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule, the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2}
(see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
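The teacher–student setup is also easy to simulate directly. The sketch below uses the Hebb rule (chosen because it needs no optimizer; the dimension and sample sizes are arbitrary choices) and shows the student's generalization error falling as the training set grows, consistent with the power-law picture:

```python
import numpy as np

rng = np.random.default_rng(2)

def hebb_generalization_error(n, m=50, n_test=4000):
    t = rng.normal(size=m)                # teacher weight vector
    X = rng.normal(size=(n, m))           # random, uncorrelated input patterns
    y = np.sign(X @ t)                    # teacher labels
    w = y @ X                             # Hebb rule: w = sum_i y_i * z_i
    Xt = rng.normal(size=(n_test, m))     # fresh patterns to estimate the error
    return np.mean(np.sign(Xt @ t) != np.sign(Xt @ w))

errors = {n: np.mean([hebb_generalization_error(n) for _ in range(20)])
          for n in (25, 100, 400)}
print(errors)  # generalization error decreases with training set size
```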
2b. A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as
  d(f_o, f_T) = O(n^{−α}),
where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.
For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1, the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
2. Given the above empirical distribution function, compute for each e_n the value t_n = P_ran(e_n); statistical significance with respect to an α-value of p is achieved for n_0, the smallest n for which t_n < p.
2) Learning curves and training set size estimation.
A. Assume the subsampling procedure was run for M different sample sizes n_l, indexed by l = 1, …, M. Take the sequence of error rates and compute the following quantities for each training set size n > n_0 for which the classifier passed the significance test (t_n < p): the mean error rate ē_n = (1/T_1) Σ_{i=1}^{T_1} e_{n,i}, and the 25th and 75th quantiles of the vector of error rates {e_{n,1}, …, e_{n,T_1}}.
B. Use the above quantities to fit the following learning curves:
   i. Given training set sizes n_l and mean error rates ē_{n_l}, compute α, a, b via the following minimization procedure: min_{α,a,b} Σ_{l=1}^{M} (a n_l^{−α} + b − ē_{n_l})², subject to α, a, b ≥ 0; designate the resulting values as α_m, a_m, b_m. The resulting curve estimates the error rate as a function of training set size:

      L_m(ē_n) = a_m n^{−α_m} + b_m.

   ii. Repeat the above procedure for the 25th and 75th quantiles of the vector of error rates {e_{n,1}, …, e_{n,T_1}}.
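Step B.i can be sketched with an off-the-shelf bounded least-squares optimizer. The example below assumes SciPy is available and substitutes synthetic mean error rates, generated from a known inverse power law, for the subsampled ones:

```python
import numpy as np
from scipy.optimize import minimize

# Training set sizes and mean error rates from the subsampling step
# (synthetic values generated from a known curve, for illustration only)
n_l = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
e_l = 0.77 * n_l ** -0.65 + 0.01

def sq_loss(params):
    a, alpha, b = params
    return np.sum((a * n_l ** -alpha + b - e_l) ** 2)

# min over (a, alpha, b) of sum_l (a n_l^-alpha + b - e_l)^2, all constrained >= 0
fit = minimize(sq_loss, x0=[1.0, 0.5, 0.05], bounds=[(0.0, None)] * 3)
a_m, alpha_m, b_m = fit.x
print(a_m, alpha_m, b_m)            # fitted learning-curve parameters
print(a_m * 50 ** -alpha_m + b_m)   # extrapolated mean error at n = 50
```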
4. METHODOLOGY APPLIED TO SEVERAL CANCER CLASSIFICATION PROBLEMS
The procedure outlined in the previous section has been applied to eight binary DNA microarray cancer classification problems representing a broad range of levels of classification complexity. The classification problems are to discriminate between tumor morphologies (including disease versus normal and submorphologies) or treatment outcomes. A more detailed analysis of the methodology will be given for the largest dataset (cancer versus normal tissue classification). For the seven other datasets we will present only the final results.
The set of examples falls into two cases. The first case consists of classification problems that are relatively easy, where statistical significance for the classifiers is achieved at a low number of training samples (e.g., between 10–20 samples) and where the dataset is sufficiently large (e.g., 40 samples) to accurately fit a learning curve. The second case consists of classification problems that are more difficult, where statistical significance is achieved at between 40–60 training samples while the total number of samples in the dataset is barely larger (e.g., between 50–70). For these more difficult problems we cannot strictly follow the methodology, since we do not have enough training set sizes at which significance is reached to make an accurate fit of the learning curve. However, we can still fit the curves and use the results as indicative and exploratory. A possible third case is when significance is never reached for any available training set size. In this case it is difficult to draw any conclusion, but it is possible that either adding more samples will not help (e.g., because there is not sufficient molecular information to classify the dataset) or the problem is very hard and a substantial number of samples is needed before one sees statistically significant results. Our first five morphology datasets are examples of the first case; the final three treatment outcome datasets are examples of the second case. Table 1 summarizes the learning curve parameters and extrapolated error rate estimates at 400 training samples for all the datasets. Table 2 summarizes and comments on the results of running the methodology on the various datasets. General conclusions and interpretation of the results will be presented in the next section.
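The subsampling scheme used to produce the error rates in this section can be sketched as follows; the example substitutes a synthetic expression matrix and scikit-learn's linear-kernel SVC for the real datasets (the sizes and the planted signal are arbitrary assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Toy stand-in for an expression matrix: 280 samples x 200 genes, two classes,
# with a weak signal planted in the first ten genes
X = rng.normal(size=(280, 200))
y = np.repeat([0, 1], 140)
X[y == 1, :10] += 1.0

def subsampled_errors(X, y, n, trials=30, rng=rng):
    """Train on n randomly drawn samples; test on the held-out remainder."""
    errs = []
    for _ in range(trials):
        idx = rng.permutation(len(y))
        tr, te = idx[:n], idx[n:]
        clf = SVC(kernel="linear").fit(X[tr], y[tr])
        errs.append(np.mean(clf.predict(X[te]) != y[te]))
    return np.array(errs)

results = {n: subsampled_errors(X, y, n) for n in (20, 40, 80)}
for n, e in results.items():
    print(n, e.mean(), np.quantile(e, 0.25), np.quantile(e, 0.75))
```

The mean and the 25th/75th quantiles of each row are exactly the quantities to which the learning curves are fit.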
Tumor versus normal tissue
This dataset consists of expression levels for 180 samples of a variety of different primary tumors (breast, prostate, lung, etc.) and 100 normal samples from the corresponding tissue of origin (again breast, prostate, lung, etc.) (Ramaswamy et al., 2001b). The dimensionality of the dataset is 16063 (throughout this section, by dimensionality we mean the number of gene expression values recorded for a sample). No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998)
126 MUKHERJEE ET AL
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

The learning curve model is error(n) = a n^{-α} + b, with learning rate a, decay rate α, and asymptotic Bayes error b, fitted for the 25th quantile, mean, and 75th quantile curves. The last line of each entry gives the predicted error rate at 400 training samples (25th quantile / mean / 75th quantile, in percent).

Multiple tissues (cancer vs. normal):
  25th quantile a = 1.89, α = 0.63, b = 0.003; mean a = 1.42, α = 0.52, b = 0.010; 75th quantile a = 1.17, α = 0.43, b = 0.000.
  Predicted error at 400 samples: 4.7 / 7.3 / 8.9.

Leukemia (AML vs. ALL):
  25th quantile a = 3.98, α = 2.3, b = 0.006; mean a = 0.771, α = 0.65, b = 0.009; 75th quantile a = 0.884, α = 0.60, b = 0.010.
  Predicted error at 400 samples: 0.6 / 2.5 / 3.4.

Colon cancer (cancer vs. normal):
  25th quantile a = 0.439, α = 0.37, b = 0.000; mean a = 0.480, α = 0.28, b = 0.000; 75th quantile a = 0.500, α = 0.23, b = 0.000.
  Predicted error at 400 samples: 4.8 / 9.0 / 12.4.

Ovarian cancer (cancer vs. normal):
  25th quantile a = 1.07, α = 1.76, b = 0.000; mean a = 0.736, α = 0.69, b = 0.000; 75th quantile a = 0.995, α = 0.67, b = 0.000.
  Predicted error at 400 samples: 0.3 / 1.2 / 1.8.

Lymphoma (DLBCL vs. follicular):
  25th quantile a = 8.99, α = 2.48, b = 0.000; mean a = 0.57, α = 0.70, b = 0.001; 75th quantile a = 0.671, α = 0.64, b = 0.000.
  Predicted error at 400 samples: 0.0 / 0.9 / 1.5.

Brain cancer (treatment outcome):
  25th quantile a = 1.06, α = 0.37, b = 0.004; mean a = 1.12, α = 0.33, b = 0.006; 75th quantile a = 0.980, α = 0.24, b = 0.000.
  Predicted error at 400 samples: 14.5 / 16.1 / 23.8.

Lymphoma (treatment outcome):
  25th quantile a = 1.23, α = 0.41, b = 0.008; mean a = 0.943, α = 0.30, b = 0.010; 75th quantile a = 0.872, α = 0.21, b = 0.000.
  Predicted error at 400 samples: 11.2 / 17.0 / 24.3.

Breast cancer (treatment outcome):
  25th quantile a = 0.532, α = 0.14, b = 0.010; mean a = 0.485, α = 0.07, b = 0.010; 75th quantile a = 0.429, α = 0.01, b = 0.000.
  Predicted error at 400 samples: 24.3 / 32.3 / 40.4.
ESTIMATING DATASET SIZE REQUIREMENTS 127
Table 2. Summary of Results for Datasets Included in the Study

For each dataset: total dataset size; the training set size needed to achieve statistical significance (p-value < 0.05); whether the learning curve was fitted entirely with statistically significant training set sizes; the maximum training set size used in fitting the learning curve; and the classification error (%) at the full dataset size, actual (leave-one-out) versus extrapolated from the learning curve.

Multiple tissues (cancer vs. normal). Size 280; significance at 15 samples; fitted with significant data: yes; maximum fitting size 210; error 7.0 actual vs. 8.6 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.10); even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL). Size 73; significance at 10; yes; maximum fitting size 35; error 0.0 actual vs. 5.6 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error to ≈0.

Colon cancer (cancer vs. normal). Size 62; significance at 10; yes; maximum fitting size 50; error 16.3 actual vs. 15.2 extrapolated. Dataset size is large enough for extrapolation and to achieve an error rate of ≈0.16; lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal). Size 54; significance at 10; yes; maximum fitting size 40; error 5.6 actual vs. 4.8 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error rate to ≈0.

Lymphoma (DLBCL vs. follicular). Size 36; significance at 5; yes; maximum fitting size 40; error 4.0 actual vs. 4.7 extrapolated. Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.035); additional samples may lower the error rate to ≈0.

Brain cancer (treatment outcome). Size 60; significance at 45; fitted with significant data: no; maximum fitting size 40; error 22.0 actual vs. 29.6 extrapolated. Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome). Size 58; significance at 50; no; maximum fitting size 40; error 23.0 actual vs. 29.5 extrapolated. Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome). Size 78; significance at 65; no; maximum fitting size 70; error 30.0 actual vs. 36.3 extrapolated. Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
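The subsampling scheme behind these error-rate estimates can be sketched as follows. This is a minimal illustration, not the authors' code: it substitutes a nearest-centroid classifier for the linear SVM, a synthetic matrix for real expression data, and uses stratified random splits (an assumption made here for numerical stability):

```python
import numpy as np

def centroid_error(Xtr, ytr, Xte, yte):
    # Nearest-centroid classifier: a simple stand-in for the paper's linear SVM.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float(np.mean(pred != yte))

def mean_error_at_size(X, y, n_train, n_splits=100, seed=0):
    """Mean test error at a given training set size, averaged over random
    stratified subsamples (train on n_train samples, test on the rest)."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    errs = []
    for _ in range(n_splits):
        tr = np.concatenate([rng.choice(idx0, n_train // 2, replace=False),
                             rng.choice(idx1, n_train - n_train // 2, replace=False)])
        te = np.setdiff1d(np.arange(len(y)), tr)
        errs.append(centroid_error(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(errs))

# Synthetic stand-in for an expression matrix: 120 samples x 500 "genes".
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))
y = np.repeat([0, 1], 60)
X[y == 1, :20] += 1.0                      # 20 weakly informative features

sizes = (10, 20, 40, 80)
curve = {n: mean_error_at_size(X, y, n) for n in sizes}
```

The resulting dictionary of mean error rates at each training set size is exactly the kind of data to which the learning curves below are fitted.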
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attain the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves we will use only data points obtained from training sets of size greater than or equal to 15.
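The permutation test itself can be sketched in a few lines. Again this is a hedged stand-in (nearest-centroid classifier, synthetic data) rather than the paper's implementation; the p-value is the fraction of label-permuted classifiers that do at least as well as the classifier trained on the true labels:

```python
import numpy as np

def centroid_error(Xtr, ytr, Xte, yte):
    # Nearest-centroid classifier: a simple stand-in for the SVM used in the paper.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float(np.mean(pred != yte))

def permutation_p_value(X, y, n_train, n_perm=200, seed=0):
    """Empirical p-value: fraction of label-permuted classifiers whose test
    error is at least as good as the classifier trained on the true labels."""
    rng = np.random.default_rng(seed)
    # Stratified train/test split at the given training set size.
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    tr = np.concatenate([idx0[:n_train // 2], idx1[:n_train - n_train // 2]])
    te = np.setdiff1d(np.arange(len(y)), tr)
    true_err = centroid_error(X[tr], y[tr], X[te], y[te])
    null_errs = []
    for _ in range(n_perm):
        yp = rng.permutation(y)                # destroy the label/profile link
        null_errs.append(centroid_error(X[tr], yp[tr], X[te], yp[te]))
    p = (1 + sum(e <= true_err for e in null_errs)) / (1 + n_perm)
    return p, true_err, np.array(null_errs)

# Synthetic two-class data with a clear class signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 500))
y = np.repeat([0, 1], 60)
X[y == 1, :20] += 1.5                          # 20 informative features

p_val, true_err, null_errs = permutation_p_value(X, y, n_train=30)
```

With a genuine class distinction, the permuted classifiers hover around 50% error while the true classifier does far better, so the empirical p-value is small.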
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) using the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is
error(n) = 1.42 n^{-0.52} + 0.0098
The error rates for the 25th and 75th quantiles are
error_25(n) = 1.89 n^{-0.63} + 0.0032
error_75(n) = 1.17 n^{-0.43} + 0.00
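A simple way to obtain such fits is a brute-force least-squares search over the decay rate α and asymptotic error b, with the prefactor a solved in closed form at each grid point. The paper does not spell out its fitting routine here, so the sketch below is one plausible implementation; it recovers known parameters from noiseless synthetic error rates:

```python
import numpy as np

def fit_power_law(n, err):
    """Fit err(n) ~ a * n**(-alpha) + b by brute-force search over
    (alpha, b), with the prefactor a solved by least squares."""
    best = None
    for alpha in np.arange(0.05, 3.0, 0.005):
        x = n ** (-alpha)
        for b in np.arange(0.0, 0.2, 0.002):
            a = np.dot(x, err - b) / np.dot(x, x)   # closed-form LS for a
            if a <= 0:
                continue
            resid = err - (a * x + b)
            sse = np.dot(resid, resid)
            if best is None or sse < best[0]:
                best = (sse, a, alpha, b)
    return best[1], best[2], best[3]

# Recover known parameters from noiseless synthetic "error rates".
n = np.array([15.0, 30, 50, 80, 120, 210])
true_a, true_alpha, true_b = 1.42, 0.52, 0.010
err = true_a * n ** (-true_alpha) + true_b
a, alpha, b = fit_power_law(n, err)
```

On real, noisy subsampled error rates the same routine applies unchanged; only the residuals grow.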
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.

FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.

Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th-75th quantile envelope is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
error(400) = 1.42 · 400^{-0.52} + 0.0098 = 7.3%

error_25(400) = 1.89 · 400^{-0.63} + 0.0032 = 4.7%

error_75(400) = 1.17 · 400^{-0.43} + 0.00 = 8.9%
The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
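These extrapolations are easy to reproduce by evaluating the fitted curves directly; the lines below simply plug n = 400 into the mean and quantile models quoted above:

```python
def learning_curve(n, a, alpha, b):
    """Evaluate the fitted inverse power law err(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Mean, 25th and 75th quantile curves reported for tumor vs. normal tissue.
mean_400 = learning_curve(400, 1.42, 0.52, 0.0098)
q25_400 = learning_curve(400, 1.89, 0.63, 0.0032)
q75_400 = learning_curve(400, 1.17, 0.43, 0.00)

print(round(100 * mean_400, 1))  # 7.3 (%)
print(round(100 * q25_400, 1))   # 4.7 (%)
print(round(100 * q75_400, 1))   # 8.9 (%)
```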
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7706 n^{-0.65} + 0.009
In this case the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) using the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.
A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4798 n^{-0.2797}
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7362 n^{-0.6864}
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.57 n^{-0.7073} + 0.0006
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
error(n) = 1.115 n^{-0.3295} + 0.006
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.9431 n^{-0.2957} + 0.01
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van 't Veer et al., 2002). The dimensionality of the dataset was 24624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4852 n^{-0.0733} + 0.01
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include as yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, …, z_ℓ) = (1/p) Σ_p Q(z_p; f_{z_{ℓ−p}}),

where z = (x, y), f_{z_{ℓ−p}} is the classifier constructed with p samples left out, and Q(z_p; f_{z_{ℓ−p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

E L_p(z_1, …, z_ℓ) = E Q(z; f_{z_{ℓ−p}}),

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969) via the following series of transformations:

E L_1(z_1, …, z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z_{ℓ−1}}) dP(z_1) ⋯ dP(z_ℓ)

= (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i; f_{z_{ℓ−1}}) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i; f_{z_{ℓ−1}})

= E Q(z; f_{z_{ℓ−1}}).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

V[L_p(z_1, …, z_ℓ)] ≤ V[Q(z; f_{z_{ℓ−p}})],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

V[Q(z; f_{z_{ℓ−1}})] = E[Q(z; f_{z_{ℓ−1}})²] − [E Q(z; f_{z_{ℓ−1}})]².

The variance of the estimator is

V[L_1(z_1, …, z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the ith point is left out:

t_i = Q(z_i; f_{x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ}).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know that

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E Q(z; f_{z_{ℓ−1}}),

so

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E Q(z; f_{z_{ℓ−1}})]².

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z; f_{z_{ℓ−1}})²].

Indeed, one can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, the above could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²]/ℓ + ((ℓ² − ℓ)/ℓ²) (E[t])².

However, t_1, …, t_ℓ are not independent; they are correlated, and by the Cauchy-Schwarz inequality E[t_i t_j] ≤ E[t²], so

(1/ℓ²) E[Σ_{i≠j} t_i t_j] ≤ ((ℓ² − ℓ)/ℓ²) E[t²],

and therefore

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[t²]/ℓ + ((ℓ² − ℓ)/ℓ²) E[t²] = E[t²] = E[Q(z; f_{z_{ℓ−1}})²].

Combining this inequality with the expression for V[(1/ℓ) Σ t_i] above yields V[L_1(z_1, …, z_ℓ)] ≤ V[Q(z; f_{z_{ℓ−1}})].

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
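A quick simulation illustrates this optimism. Under the same illustrative assumptions as before (nearest-centroid classifier, balanced Gaussian classes; not the paper's setup), the variance of the leave-one-out estimate across datasets sits well below the variance of a single held-out test error:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, ell, reps = 25, 16, 400
mu = np.full(dim, 0.4)                  # class means sit at -mu/2 and +mu/2

def centroid_error(Xtr, ytr, Xte, yte):
    # Nearest-centroid classifier standing in for an arbitrary learning rule.
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float(np.mean(pred != yte))

def draw(n):
    # Balanced two-class Gaussian sample (a fixed, simplified design).
    y = np.tile([0, 1], n // 2 + 1)[:n]
    X = rng.normal(size=(n, dim)) + np.where(y[:, None] == 1, mu / 2, -mu / 2)
    return X, y

loo_vals, single_vals = [], []
for _ in range(reps):
    X, y = draw(ell)                    # leave-one-out estimate on ell samples
    folds = []
    for i in range(ell):
        m = np.ones(ell, dtype=bool)
        m[i] = False
        folds.append(centroid_error(X[m], y[m], X[~m], y[~m]))
    loo_vals.append(np.mean(folds))
    Xt, yt = draw(ell - 1)              # train on ell - 1 fresh samples and test
    Xs, ys = draw(2)                    # on a single independent point
    single_vals.append(centroid_error(Xt, yt, Xs[:1], ys[:1]))

var_loo = float(np.var(loo_vals))       # variance of the subsampled estimate
var_single = float(np.var(single_vals)) # variance of a single held-out test error
```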
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f_{z_{ℓ−1}}) is P_{D_{ℓ−1}}{t < η}. The cumulative distribution function of the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i, the leave-one-out estimate, is P_{D_ℓ}{r < η}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < η) = ∫_{−∞}^{η} p(t) dt = ∫_{−∞}^{∞} θ(η − t) p(t) dt,

where θ(·) is the step function. The distribution function for the random variable r can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

P(r < η) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} θ(η − r_i) p(r_i) dr_i,

which is the same as P(t < η). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{ √ℓ sup_x |F(x) − F_ℓ(x)| < ε } ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
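The Kolmogorov-Smirnov bound above is straightforward to evaluate numerically. The helper below sums the (truncated) series and reproduces the familiar 5% critical value of about 1.36, so that with 100 subsample error rates the empirical CDF is within 1.36/√100 = 0.136 of the truth with probability ≈ 0.95:

```python
import math

def ks_prob(eps, kmax=200):
    """P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps }
    approximated by 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 eps^2)."""
    return 1.0 - 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * eps * eps)
                           for k in range(1, kmax + 1))

# The familiar 5% critical value of the KS distribution.
p = ks_prob(1.36)
# Deviation bound for l = 100 subsample error rates at confidence ~0.95.
bound_100 = 1.36 / math.sqrt(100)
```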
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
y = sign(w · z),
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be
Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],
where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high β limit, where the dependence of the generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^{-1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{-1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{-1/2}, and for the Hebb rule as 0.40 n^{-1/2}
(see Watkin [1993] for a review). The dependence on n is thus approximately n^{-α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{-α} + b for modeling the generalization error.
2b A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as
d(f_o, f_T) = O(n^{-α}),
where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{-α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
142 MUKHERJEE ET AL
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
Table 1. Summary of Learning Curve Parameters and Extrapolated Error Rates

Learning curve parameters: a (learning rate), α (decay rate), b (asymptotic Bayes error). For each dataset/problem, the 25th-quantile, mean, and 75th-quantile fits are listed with the corresponding predicted error rate at 400 samples.

Multiple tissues (cancer vs. normal):
  25th quantile: a = 1.89,  α = 0.63, b = 0.003;  predicted error at 400 samples: 4.7%
  mean:          a = 1.42,  α = 0.52, b = 0.010;  predicted error at 400 samples: 7.3%
  75th quantile: a = 1.17,  α = 0.43, b = 0.000;  predicted error at 400 samples: 8.9%
Leukemia (AML vs. ALL):
  25th quantile: a = 0.398, α = 2.30, b = 0.006;  predicted error at 400 samples: 0.6%
  mean:          a = 0.771, α = 0.65, b = 0.009;  predicted error at 400 samples: 2.5%
  75th quantile: a = 0.884, α = 0.60, b = 0.010;  predicted error at 400 samples: 3.4%
Colon cancer (cancer vs. normal):
  25th quantile: a = 0.439, α = 0.37, b = 0.000;  predicted error at 400 samples: 4.8%
  mean:          a = 0.480, α = 0.28, b = 0.000;  predicted error at 400 samples: 9.0%
  75th quantile: a = 0.500, α = 0.23, b = 0.000;  predicted error at 400 samples: 12.4%
Ovarian cancer (cancer vs. normal):
  25th quantile: a = 1.07,  α = 1.76, b = 0.000;  predicted error at 400 samples: 0.3%
  mean:          a = 0.736, α = 0.69, b = 0.000;  predicted error at 400 samples: 1.2%
  75th quantile: a = 0.995, α = 0.67, b = 0.000;  predicted error at 400 samples: 1.8%
Lymphoma (DLBC vs. follicular):
  25th quantile: a = 0.899, α = 2.48, b = 0.000;  predicted error at 400 samples: 0.0%
  mean:          a = 0.57,  α = 0.70, b = 0.001;  predicted error at 400 samples: 0.9%
  75th quantile: a = 0.671, α = 0.64, b = 0.000;  predicted error at 400 samples: 1.5%
Brain cancer (treatment outcome):
  25th quantile: a = 1.06,  α = 0.37, b = 0.004;  predicted error at 400 samples: 14.5%
  mean:          a = 1.12,  α = 0.33, b = 0.006;  predicted error at 400 samples: 16.1%
  75th quantile: a = 0.980, α = 0.24, b = 0.000;  predicted error at 400 samples: 23.8%
Lymphoma (treatment outcome):
  25th quantile: a = 1.23,  α = 0.41, b = 0.008;  predicted error at 400 samples: 11.2%
  mean:          a = 0.943, α = 0.30, b = 0.010;  predicted error at 400 samples: 17.0%
  75th quantile: a = 0.872, α = 0.21, b = 0.000;  predicted error at 400 samples: 24.3%
Breast cancer (treatment outcome):
  25th quantile: a = 0.532, α = 0.14, b = 0.010;  predicted error at 400 samples: 24.3%
  mean:          a = 0.485, α = 0.07, b = 0.010;  predicted error at 400 samples: 32.3%
  75th quantile: a = 0.429, α = 0.01, b = 0.000;  predicted error at 400 samples: 40.4%
Table 2. Summary of Results for Datasets Included in the Study

For each dataset: problem; dataset size; training-set size needed to achieve statistical significance (p-value < 0.05); whether a significant learning curve could be fitted with the data; maximum training-set size used in fitting the learning curve; classification error, actual (extrapolated), in %; and conclusions.

Multiple tissues (cancer vs. normal): size 280; significance at 15 samples; curve fitted: yes; maximum size used: 210; error 6.0 (8.6). Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.10); even lower error rates (<0.07) are achievable with >400 samples.

Leukemia (AML vs. ALL): size 73; significance at 10 samples; curve fitted: yes; maximum size used: 35; error 0.0 (5.6). Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error to ≈0.

Colon cancer (cancer vs. normal): size 62; significance at 10 samples; curve fitted: yes; maximum size used: 50; error 16.3 (15.2). Dataset size is large enough for extrapolation and to achieve an error rate of ≈0.16; lower error rates (<0.09) are achievable with >400 samples.

Ovarian cancer (cancer vs. normal): size 54; significance at 10 samples; curve fitted: yes; maximum size used: 40; error 5.6 (4.8). Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.05); additional samples may lower the error rate to ≈0.

Lymphoma (DLBC vs. follicular): size 36; significance at 5 samples; curve fitted: yes; maximum size used: 40; error 4.0 (4.7). Dataset size is large enough for extrapolation and to achieve a low error rate (≈0.035); additional samples may lower the error rate to ≈0.

Brain cancer (treatment outcome): size 60; significance at 45 samples; curve fitted: no; maximum size used: 40; error 22.0 (29.6). Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.14 with >400 samples.

Lymphoma (treatment outcome): size 58; significance at 50 samples; curve fitted: no; maximum size used: 40; error 23.0 (29.5). Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.17 with >400 samples.

Breast cancer (treatment outcome): size 78; significance at 65 samples; curve fitted: no; maximum size used: 70; error 30.0 (36.3). Dataset is not large enough for extrapolation; fitting of the learning curve is inaccurate but suggests the error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attain the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
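A minimal sketch of this permutation-test procedure (our own illustration, not the paper's code): the class labels are randomly permuted many times, the average subsampled error is recomputed for each relabeling, and the p-value is the fraction of random classifiers that do at least as well as the actual one. A nearest-centroid rule stands in for the linear-kernel SVM, and the data below are synthetic.

```python
import numpy as np

def nearest_centroid(Xtr, ytr, Xte):
    """Stand-in linear classifier (the paper uses a linear-kernel SVM)."""
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return (((Xte - m1) ** 2).sum(1) < ((Xte - m0) ** 2).sum(1)).astype(int)

def subsample_error(X, y, n_train, n_rep, rng):
    """Average test error over random splits with n_train training samples."""
    errs = []
    for _ in range(n_rep):
        idx = rng.permutation(len(y))
        tr, te = idx[:n_train], idx[n_train:]
        errs.append(np.mean(nearest_centroid(X[tr], y[tr], X[te]) != y[te]))
    return float(np.mean(errs))

def permutation_test(X, y, n_train, n_perm=50, n_rep=10, seed=0):
    """p-value: fraction of label-permuted datasets whose classifiers
    do at least as well as the classifier built on the true labels."""
    rng = np.random.default_rng(seed)
    actual = subsample_error(X, y, n_train, n_rep, rng)
    null = [subsample_error(X, rng.permutation(y), n_train, n_rep, rng)
            for _ in range(n_perm)]
    return actual, float(np.mean(np.array(null) <= actual))

# synthetic two-class data with a clear expression signature
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
X[y == 1] += 1.5
actual_err, pval = permutation_test(X, y, n_train=20)
```

With well-separated synthetic classes, the true-label error falls below essentially every permuted-label error, so the estimated p-value is near zero.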
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) using the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n estimated using training set sizes up to 210 is
    error(n) = 1.42 n^(−0.52) + 0.0098.
The error rates for the 25th and 75th quantiles are
    error_25(n) = 1.89 n^(−0.63) + 0.0032,
    error_75(n) = 1.17 n^(−0.43) + 0.000.
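Fitting an inverse power law e(n) = a·n^(−α) + b to measured error rates can be sketched as follows. This is our own illustration, not the authors' routine (the paper does not specify its fitting procedure): a brute-force grid search over (α, b) with the least-squares coefficient a solved in closed form at each grid point, applied to synthetic error rates generated from the reported tumor-versus-normal curve rather than to the actual measurements.

```python
import numpy as np

def fit_power_law(n, e, alphas=np.arange(0.05, 2.0, 0.005),
                  bs=np.arange(0.0, 0.2, 0.002)):
    """Least-squares fit of e(n) = a*n^(-alpha) + b: grid over (alpha, b),
    with the optimal a computed in closed form at each grid point."""
    best, best_params = np.inf, None
    for alpha in alphas:
        x = n ** (-alpha)
        for b in bs:
            a = np.dot(x, e - b) / np.dot(x, x)    # least-squares coefficient
            resid = np.sum((a * x + b - e) ** 2)
            if resid < best:
                best, best_params = resid, (a, alpha, b)
    return best_params

# training-set sizes used for the tumor vs. normal problem, with error
# rates generated from the reported curve 1.42*n^-0.52 + 0.0098
n = np.array([30, 40, 50, 60, 80, 90, 130, 170, 210], dtype=float)
e = 1.42 * n ** -0.52 + 0.0098
a, alpha, b = fit_power_law(n, e)
```

On these noise-free synthetic points, the search recovers the generating parameters; on real subsampled error rates the fit would carry the empirical scatter discussed in the text.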
Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantiles envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
    error(400) = 1.42 (400)^(−0.52) + 0.0098 = 7.3%,
    error_25(400) = 1.89 (400)^(−0.63) + 0.0032 = 4.7%,
    error_75(400) = 1.17 (400)^(−0.43) + 0.000 = 8.9%.
The achievable error rate using 400 samples according to the model is 7.3% ± 2.6% and perhaps as low as 4.7% (25th quantile envelope).
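Reading the printed coefficients as a = 1.42, α = 0.52, b = 0.0098 for the mean curve (and 1.89, 0.63, 0.0032 and 1.17, 0.43, 0.000 for the quantile envelopes), the extrapolations can be reproduced directly:

```python
def error_mean(n):  # mean learning curve, tumor vs. normal
    return 1.42 * n ** -0.52 + 0.0098

def error_q25(n):   # 25th-quantile envelope
    return 1.89 * n ** -0.63 + 0.0032

def error_q75(n):   # 75th-quantile envelope
    return 1.17 * n ** -0.43

for f in (error_q25, error_mean, error_q75):
    print(f.__name__, round(100 * f(400), 1))   # 4.7, 7.3, 8.9 (%)
```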
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
    error(n) = 0.7706 n^(−0.65) + 0.009.
In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) using the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.
A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
    error(n) = 0.4798 n^(−0.2797).
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
    error(n) = 0.7362 n^(−0.6864).
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
    error(n) = 0.57 n^(−0.7073) + 0.0006.
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so, strictly speaking, we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
    error(n) = 1.115 n^(−0.3295) + 0.006.
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
    error(n) = 0.9431 n^(−0.2957) + 0.01.
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
    error(n) = 0.4852 n^(−0.0733) + 0.01.
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and small available dataset
sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the outcome treatment examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimations of the asymptotic Bayes error b, the learning rate a, and decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 samples range and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

    L_p(z_1, ..., z_ℓ) = (1/p) Σ_p Q(z_p, f(z^(ℓ−p))),

where z = (x, y), f(z^(ℓ−p)) is the classifier constructed with the p samples left out, and Q(z_p, f(z^(ℓ−p))) is the error of this classifier on the p samples left out. This procedure is unbiased,

    E L_p(z_1, ..., z_ℓ) = E Q(z, f(z^(ℓ−p))),

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) via the following series of transformations:

    E L_1(z_1, ..., z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f(z^(ℓ−1))) dP(z_1) ... dP(z_ℓ)
        = (1/ℓ) Σ_{i=1}^{ℓ} ∫ Q(z_i, f(z^(ℓ−1))) dP(z_i) dP(z_1) ... dP(z_{i−1}) dP(z_{i+1}) ... dP(z_ℓ)
        = (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i, f(z^(ℓ−1)))
        = E Q(z, f(z^(ℓ−1))).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
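Statement 1 can be checked numerically. The sketch below is our own construction, not the paper's setup: a toy one-dimensional midpoint-threshold classifier on Gaussian class-conditional data. Averaged over many simulated datasets of size ℓ = 12, the leave-one-out estimate matches the error of a classifier trained on ℓ − 1 samples and tested on independent data.

```python
import numpy as np

rng = np.random.default_rng(7)
ell, trials, n_test = 12, 3000, 50

def threshold(X, y):
    """Midpoint between the two class means: a toy 1-D classifier."""
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())

def err(t, X, y):
    """0/1 error of the rule 'predict class 1 if x > t'."""
    return np.mean((X > t).astype(int) != y)

loo, direct = [], []
for _ in range(trials):
    y = rng.integers(0, 2, ell)
    if y.sum() < 2 or y.sum() > ell - 2:
        continue                          # keep both classes in every split
    X = rng.normal(loc=y.astype(float))   # class means 0 and 1, unit variance
    # leave-one-out estimator L_1 on this dataset
    e = [err(threshold(np.delete(X, i), np.delete(y, i)), X[i:i+1], y[i:i+1])
         for i in range(ell)]
    loo.append(np.mean(e))
    # a classifier trained on ell-1 samples, tested on independent data
    t = threshold(X[:-1], y[:-1])
    y_new = rng.integers(0, 2, n_test)
    X_new = rng.normal(loc=y_new.astype(float))
    direct.append(err(t, X_new, y_new))

gap = abs(np.mean(loo) - np.mean(direct))  # close to 0: the mean is unbiased
```

The two averages agree to within Monte Carlo noise, which is exactly the content of the unbiasedness statement.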
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

    V[L_p(z_1, ..., z_ℓ)] ≤ V[Q(z, f(z^(ℓ−p)))],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

    V[Q(z, f(z^(ℓ−1)))] = E[Q(z, f(z^(ℓ−1)))²] − [E Q(z, f(z^(ℓ−1)))]².

The variance of the estimator is

    V[(1/ℓ) L_1(z_1, ..., z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i records whether or not an error is made when the ith point is left out:

    t_i = Q(z_i, f(x_1, ..., x_{i−1}, x_{i+1}, ..., x_ℓ)).

We can rewrite this as

    V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know

    E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E Q(z, f(z^(ℓ−1))).

So

    V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − [E Q(z, f(z^(ℓ−1)))]².

One can show that

    E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f(z^(ℓ−1)))²].

One can write

    E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

    E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²]/ℓ + ((ℓ² − ℓ)/ℓ²) [E Q(z, f(z^(ℓ−1)))]².

However, t_1, ..., t_ℓ are not independent but correlated, so

    E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ ((ℓ² − ℓ)/ℓ²) E[t²].

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
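The optimism of the variance can likewise be seen numerically with the same toy midpoint-threshold classifier (again our own construction, not the paper's setup): across simulated datasets, the variance of the leave-one-out average L_1 sits far below the variance of the single-sample test error Q(z, f(z^(ℓ−1))).

```python
import numpy as np

rng = np.random.default_rng(3)
ell, trials = 12, 3000

def threshold(X, y):
    """Midpoint between the two class means: a toy 1-D classifier."""
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())

loo_vals, single_vals = [], []
for _ in range(trials):
    y = rng.integers(0, 2, ell)
    if y.sum() < 2 or y.sum() > ell - 2:
        continue                          # keep both classes in every split
    X = rng.normal(loc=y.astype(float))
    # L_1: the mean of ell correlated 0/1 leave-one-out errors
    e = [int((X[i] > threshold(np.delete(X, i), np.delete(y, i))) != y[i])
         for i in range(ell)]
    loo_vals.append(np.mean(e))
    # Q: one classifier trained on ell-1 samples, one independent test point
    t = threshold(X[:-1], y[:-1])
    y1 = int(rng.integers(0, 2))
    x1 = rng.normal(loc=float(y1))
    single_vals.append(int((x1 > t) != y1))

v_loo, v_single = np.var(loo_vals), np.var(single_vals)  # v_loo << v_single
```

The single-sample error is Bernoulli, so its variance is near p(1 − p); the leave-one-out average, being a mean of correlated indicators, has a much smaller variance, as Statement 2 asserts.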
1c Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension
The cumulative distribution function of the random variable $t = Q(z, f_{z^{\ell-1}})$ is $P_{D^{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_1(z_1, \ldots, z_\ell)$ is $P_{D^{\ell}}\{r < \tau\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\tau - t)\,p(t)\,dt.$$

The distribution function for the random variable $r = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to that used in the proof of statement 1:

$$P(r < \tau) = \frac{1}{\ell}\sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\,p(r_i)\,dr_i,$$

which is the same as $P(t < \tau)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples. However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us
$$P\left\{\sqrt{\ell}\,\sup_x |F(x) - F_\ell(x)| < \varepsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\varepsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \le 20$, instead of using the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the proposed subsampling procedure gives increasingly accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
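The Kolmogorov–Smirnov bound above can be evaluated numerically. The sketch below (illustrative Python, not part of the original paper) inverts the limiting distribution by bisection to find the $\varepsilon$ corresponding to a confidence level $1 - \delta$:

```python
import math

def ks_cdf(x, terms=100):
    # Kolmogorov limiting distribution:
    # P{ sqrt(l) * sup_x |F(x) - F_l(x)| <= x } ~ 1 - 2*sum (-1)^(k-1) e^(-2 k^2 x^2)
    if x <= 0:
        return 0.0
    return 1.0 - 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                           for k in range(1, terms + 1))

def critical_eps(delta, lo=0.0, hi=5.0):
    # smallest eps with ks_cdf(eps) >= 1 - delta, found by bisection
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ks_cdf(mid) < 1.0 - delta:
            lo = mid
        else:
            hi = mid
    return hi

eps = critical_eps(0.05)          # ~1.358, the classic 95% critical value
# with probability 1 - delta: sup_x |F(x) - F_l(x)| <= eps / sqrt(l)
bound_40 = eps / math.sqrt(40)    # e.g., l = 40 error rates -> ~0.21
```

For $\delta = 0.05$ this gives $\varepsilon \approx 1.358$, so with 40 subsample error rates the empirical distribution function is, with 95% probability, uniformly within about 0.21 of the true one.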
140 MUKHERJEE ET AL
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
$$y = \mathrm{sign}(w \cdot z),$$
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta\big[-(t \cdot z_i)(w \cdot z_i)\big],$$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{-1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{-1} (Opper et al., 1991). For the Adaline learning rule, the error scales as 0.24 n^{-1/2}, and for the Hebb rule as 0.40 n^{-1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{-α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{-α} + b for modeling the generalization error.
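The quoted scalings can be tabulated directly from e(n) = a n^{-α} + b. A minimal sketch (illustrative Python; the coefficients are the ones cited in the text, and b = 0 corresponds to the noise-free case):

```python
def gen_error(n, a, alpha, b=0.0):
    # inverse power-law learning curve e(n) = a * n^(-alpha) + b
    return a * n ** -alpha + b

# high-beta scalings quoted in the text for different learning rules
rules = {
    "optimal perceptron": (0.50, 1.0),
    "Bayes optimal":      (0.44, 1.0),
    "Adaline":            (0.24, 0.5),
    "Hebb":               (0.40, 0.5),
}
curve = {name: [gen_error(n, a, alpha) for n in (10, 100, 1000)]
         for name, (a, alpha) in rules.items()}
# e.g., the optimal perceptron reaches 0.005 at n = 100,
# while the Hebb rule is still at 0.04 there
```

The n^{-1} rules visibly outpace the n^{-1/2} rules as the training set grows, which is the sense in which they are "more efficient".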
2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{-α}) for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
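The Lipschitz distinction can be checked numerically. The sketch below (illustrative Python, not from the paper) compares the largest finite-difference slope of the hinge loss used by Support Vector Machines with that of the 0-1 classification loss; the former stays bounded near 1, while the latter grows like 1/h as the grid is refined, reflecting the jump at f(x) = 0:

```python
def hinge(f, y):
    # hinge loss max(0, 1 - y*f): Lipschitz in f with constant 1
    return max(0.0, 1.0 - y * f)

def zero_one(f, y):
    # 0-1 classification loss theta(-y*f): jumps at f = 0
    return 1.0 if y * f <= 0 else 0.0

def max_slope(loss, y=1, lo=-2.0, hi=2.0, steps=4000):
    # largest finite-difference slope over a grid: a numeric stand-in
    # for the Lipschitz constant of the loss as a function of f
    h = (hi - lo) / steps
    pts = [lo + i * h for i in range(steps + 1)]
    return max(abs(loss(pts[i + 1], y) - loss(pts[i], y)) / h
               for i in range(steps))

slope_hinge = max_slope(hinge)   # ~1.0: bounded, so the loss is Lipschitz
slope_01 = max_slope(zero_one)   # ~1/h: unbounded as the grid is refined
```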
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics 21(suppl.).
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
Table 2. Summary of Results for Datasets Included in the Study

Dataset | Classification problem | Dataset size | Size to achieve statistical significance (p-val < 0.05) | Learning curve fitted with significant data | Maximum training set size used | Classification error: actual (extrapolated) % | Error % | Conclusions
Multiple tissues | Cancer vs. normal | 280 | 15 | yes | 210 | 7.0 (8.6) | 2.6 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.10). Even lower error rates (<0.07) are achievable with >400 samples.
Leukemia | AML vs. ALL | 73 | 10 | yes | 35 | 0.0 (5.6) | 5.6 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower error to ~0.
Colon cancer | Cancer vs. normal | 62 | 10 | yes | 50 | 16.3 (15.2) | 1.1 | Dataset size is large enough for extrapolation and to achieve an error rate of ~0.16. Lower error rates (<0.09) are achievable with >400 samples.
Ovarian cancer | Cancer vs. normal | 54 | 10 | yes | 40 | 5.6 (4.8) | 0.8 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.05). Additional samples may lower error rate to ~0.
Lymphoma | DLBC vs. follicular | 53 | 5 | yes | 40 | 4.4 (7.0) | 0.7 | Dataset size is large enough for extrapolation and to achieve a low error rate (~0.035). Additional samples may lower error rate to ~0.
Brain cancer | Treatment outcome | 60 | 45 | no | 40 | 22 (29.6) | 7.6 | Dataset is not large enough for extrapolation. Fitting of learning curve is inaccurate but suggests error rate could be <0.14 with >400 samples.
Lymphoma | Treatment outcome | 58 | 50 | no | 40 | 23 (29.5) | 6.5 | Dataset is not large enough for extrapolation. Fitting of learning curve is inaccurate but suggests error rate could be <0.17 with >400 samples.
Breast cancer | Treatment outcome | 78 | 65 | no | 70 | 30 (36.3) | 6.3 | Dataset is not large enough for extrapolation. Fitting of learning curve is inaccurate but suggests error rate could be <0.32 with >400 samples.
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or better error rates, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
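The permutation test can be sketched as follows (illustrative Python; a nearest-centroid classifier on synthetic multivariate data stands in for the SVM on expression profiles, and the error of the actual classifier is compared against classifiers trained on label-permuted data):

```python
import random

random.seed(2)

DIM, SHIFT = 10, 0.6

def sample(y):
    # class 0 centered at the origin; class 1 shifted by SHIFT on every axis
    return [random.gauss(SHIFT * y, 1.0) for _ in range(DIM)]

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(xs, ys):
    # nearest-centroid classifier: a toy stand-in for the paper's SVM
    c0 = centroid([x for x, y in zip(xs, ys) if y == 0])
    c1 = centroid([x for x, y in zip(xs, ys) if y == 1])
    def predict(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return int(d1 < d0)
    return predict

def error(f, xs, ys):
    return sum(f(x) != y for x, y in zip(xs, ys)) / len(xs)

n_train, n_test = 30, 200
ys = [i % 2 for i in range(n_train + n_test)]
xs = [sample(y) for y in ys]
tr_x, tr_y = xs[:n_train], ys[:n_train]
te_x, te_y = xs[n_train:], ys[n_train:]

observed = error(train(tr_x, tr_y), te_x, te_y)

# permutation null: retrain after shuffling the training labels
null = []
for _ in range(200):
    py = tr_y[:]
    random.shuffle(py)
    null.append(error(train(tr_x, py), te_x, te_y))

p_value = sum(e <= observed for e in null) / len(null)
```

The p-value is the fraction of random (permuted-label) classifiers that do at least as well as the actual classifier; a value at or below 0.05 marks the observed performance as statistically significant.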
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using 1) the error rates for all training set sizes (up to 210), 2) the error rates for the first 8 training set sizes, 3) the error rates for the first 6 training set sizes, and 4) the error rates for the first 4 training set sizes. The plots of these learning curves along with the leave-one-out error for 280 samples are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to a) all training sample sizes stated above (up to 210) and b) using the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is
error(n) = 1.42 n^{-0.52} + 0.0098
The error rates for the 25th and 75th quantiles are
error_{25}(n) = 1.89 n^{-0.63} + 0.0032
error_{75}(n) = 1.17 n^{-0.43} + 0.00
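Fitting the inverse power law to measured error rates can be done, for example, with a grid search over the asymptotic term b combined with a linear least-squares fit of log(e − b) against log n. The sketch below (illustrative Python, not the paper's actual fitting code) recovers the tumor/nontumor coefficients from noiseless synthetic points:

```python
import math

def fit_power_law(ns, errs):
    # fit e(n) = a * n^(-alpha) + b: grid-search over b, then a linear
    # least-squares fit of log(e - b) on log(n)
    best = None
    for b in [i * 0.001 for i in range(200)]:
        if any(e <= b for e in errs):
            break
        X = [math.log(n) for n in ns]
        Y = [math.log(e - b) for e in errs]
        mx, my = sum(X) / len(X), sum(Y) / len(Y)
        sl = (sum((x - mx) * (y - my) for x, y in zip(X, Y))
              / sum((x - mx) ** 2 for x in X))
        a, alpha = math.exp(my - sl * mx), -sl
        resid = sum((a * n ** -alpha + b - e) ** 2 for n, e in zip(ns, errs))
        if best is None or resid < best[0]:
            best = (resid, a, alpha, b)
    return best[1:]

ns = [20, 30, 40, 50, 60, 80, 90]
errs = [1.42 * n ** -0.52 + 0.0098 for n in ns]   # noiseless synthetic data
a, alpha, b = fit_power_law(ns, errs)
```

With real subsampled error rates, the same fit would be applied to the mean errors and, separately, to the 25th and 75th quantile errors to obtain the envelope curves.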
Based on this model, one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate b is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate α is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about +/-2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
error(400) = 1.42 (400)^{-0.52} + 0.0098 = 7.3%
error_{25}(400) = 1.89 (400)^{-0.63} + 0.0032 = 4.7%
error_{75}(400) = 1.17 (400)^{-0.43} + 0.00 = 8.9%
The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6% and perhaps as low as 4.7% (25th quantile envelope).
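Evaluating the fitted curves at n = 400 reproduces these numbers. A minimal sketch (illustrative Python, using the coefficients quoted above: a = 1.42, α = 0.52, b = 0.0098 for the mean, with the corresponding quantile parameters):

```python
def curve(n, a, alpha, b):
    # evaluate the fitted inverse power law e(n) = a * n^(-alpha) + b
    return a * n ** -alpha + b

# parameters of the fitted mean curve and its quantile envelope
params = {
    "mean": (1.42, 0.52, 0.0098),
    "q25":  (1.89, 0.63, 0.0032),
    "q75":  (1.17, 0.43, 0.0),
}
at_400 = {name: curve(400, *p) for name, p in params.items()}
# at_400["mean"] ~ 0.073, at_400["q25"] ~ 0.047, at_400["q75"] ~ 0.089
```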
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7706 n^{-0.65} + 0.009
In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) using the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4798 n^{-0.2797}
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7362 n^{-0.6864}
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.57 n^{-0.7073} + 0.006
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
error(n) = 1.115 n^{-0.3295} + 0.006
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.9431 n^{-0.2957} + 0.01
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4852 n^{-0.0733} + 0.01
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
The estimations of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable
FIG. 13. A plot of $1/\alpha$ versus $b$ for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
ESTIMATING DATASET SIZE REQUIREMENTS 137
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
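As a design aid, the fitted curve can be inverted to ask how many samples are needed to reach a target error rate. A small sketch (the values $a = 1.42$, $\alpha = 0.52$, $b = 0.0098$ are the tumor versus nontumor fit reported in this paper, used here purely as an example; the function name is ours):

```python
import math

def samples_needed(target_error, a, alpha, b):
    """Smallest n with a * n**(-alpha) + b <= target_error.

    Returns None if target_error <= b, i.e., below the asymptotic
    Bayes error and therefore unreachable at any dataset size."""
    if target_error <= b:
        return None
    return math.ceil((a / (target_error - b)) ** (1.0 / alpha))

# Example: the mean learning-curve fit for the tumor vs. nontumor problem.
a, alpha, b = 1.42, 0.52, 0.0098
for target in (0.15, 0.10, 0.05):
    print(f"target {target:.0%}: n >= {samples_needed(target, a, alpha, b)}")
```

Because the curve decays as a power law, each halving of the residual error $e(n) - b$ requires roughly a $2^{1/\alpha}$-fold increase in sample size, which makes small decay rates $\alpha$ expensive.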
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a. The mean of the leave-p-out estimator is unbiased

Statement 1. The procedure of excluding $p$ samples from a dataset of size $\ell$, constructing a classifier, and then testing on the $p$ samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{p} \sum_{p} Q(\tilde z_p, f(z^{\ell-p})),$$

where $z = (x, y)$, $f(z^{\ell-p})$ is the classifier constructed with $p$ samples left out, and $Q(\tilde z_p, f(z^{\ell-p}))$ is the error of this classifier on the $p$ samples left out. This procedure is unbiased:

$$E\, L_p(z_1, \ldots, z_\ell) = E\, Q(z, f(z^{\ell-p})),$$

which means that the expected error when the classifier is trained with $\ell - p$ samples is the same as the expected error of the procedure $L_p$.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) by the following series of transformations:

$$E\, L_1(z_1, \ldots, z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i, f(z^{\ell-1}))\, dP(z_1) \cdots dP(z_\ell)$$
$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i, f(z^{\ell-1}))\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$
$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\, Q(z_i, f(z^{\ell-1}))$$
$$= E\, Q(z, f(z^{\ell-1})).$$

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
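Statement 1 can be checked numerically. The toy Monte Carlo sketch below (our construction, not the paper's experiment) uses 1-D Gaussian classes and a midpoint-of-class-means threshold classifier standing in for the SVM; it compares the average leave-one-out error on $\ell$ samples against the error of a classifier trained on $\ell - 1$ samples and tested on an independent sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y):
    # midpoint-of-class-means threshold for 1-D, two-class data
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())

def err(theta, X, y):
    return float(np.mean((X > theta).astype(int) != y))

def sample(n):
    # class means at 0 and 2; resample until each class has >= 2 points
    while True:
        y = rng.integers(0, 2, n)
        if 1 < y.sum() < n - 1:
            return rng.normal(2.0 * y, 1.0), y

ell, reps = 12, 3000
loo_mean = holdout_mean = 0.0
for _ in range(reps):
    X, y = sample(ell)                    # leave-one-out on ell samples
    loo = [err(fit(np.delete(X, i), np.delete(y, i)), X[i:i+1], y[i:i+1])
           for i in range(ell)]
    loo_mean += np.mean(loo) / reps
    Xt, yt = sample(ell - 1)              # train on ell - 1 samples ...
    yf = rng.integers(0, 2, 1)            # ... test on one fresh sample
    Xf = rng.normal(2.0 * yf, 1.0)
    holdout_mean += err(fit(Xt, yt), Xf, yf) / reps

print(loo_mean, holdout_mean)  # the two averages agree, as statement 1 asserts
```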
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with $\ell - p$ samples and tested on an independent sample, so the variance of the leave-p-out
138 MUKHERJEE ET AL
procedure is not necessarily unbiased and in general will be optimistic:
$$V(L_p(z_1, \ldots, z_\ell)) \le V(Q(z, f(z^{\ell-p}))),$$
which means that the variance of the error when the classifier is trained with $\ell - p$ samples is greater than or equal to the variance of the procedure $L_p$.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The variance over training sets of size $\ell - 1$ is

$$V(Q(z, f(z^{\ell-1}))) = E\{[Q(z, f(z^{\ell-1}))]^2\} - [E\, Q(z, f(z^{\ell-1}))]^2.$$
The variance of the estimator
V1`
L1z1 z` D V
1`
X
iD1
ti
where ti is whether an error is made or not when the ith point is left out
Qzi f x1 xiiexcl1 xiC1 x`
We can rewrite this as

$$V\!\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\!\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{\!2} - \left(E\, \frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{\!2}.$$

From statement 1 we know

$$E\, \frac{1}{\ell} \sum_{i=1}^{\ell} t_i = E\, Q(z, f(z^{\ell-1})).$$
So

$$V\!\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\!\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{\!2} - [E\, Q(z, f(z^{\ell-1}))]^2.$$
One can show that

$$E\!\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{\!2} \le E\{[Q(z, f(z^{\ell-1}))]^2\}.$$
One can write

$$E\!\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{\!2} = E\!\left(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right).$$
If the random variables $t_i, t_j$ were identical and independent, then the above expression could be rewritten as

$$E\!\left(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right) = E\!\left(\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\right) = E\{[Q(z, f(z^{\ell-1}))]^2\}.$$
However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\!\left(\frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right) \le E\!\left(\frac{\ell^2 - \ell}{\ell^2}\, t^2\right).$$
The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.
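This optimism is easy to see in simulation: across many datasets, the spread of the leave-one-out average is much smaller than the spread of a single independent test error. The sketch below reuses a toy 1-D Gaussian setup with a midpoint-of-means threshold classifier (our illustration, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit(X, y):
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())

def err(theta, X, y):
    return float(np.mean((X > theta).astype(int) != y))

ell, reps = 12, 2000
loo_vals, single_vals = [], []
for _ in range(reps):
    while True:                            # >= 2 samples per class
        y = rng.integers(0, 2, ell)
        if 1 < y.sum() < ell - 1:
            break
    X = rng.normal(2.0 * y, 1.0)
    loo_vals.append(np.mean([err(fit(np.delete(X, i), np.delete(y, i)),
                                 X[i:i+1], y[i:i+1]) for i in range(ell)]))
    yt = rng.integers(0, 2, 1)             # one independent test sample for a
    Xt = rng.normal(2.0 * yt, 1.0)         # classifier trained on ell-1 samples
    single_vals.append(err(fit(X[1:], y[1:]), Xt, yt))

# The LOO average varies far less than a single held-out test error,
# which is exactly the optimism described in statement 2.
print(np.var(loo_vals), np.var(single_vals))
```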
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z, f(z^{\ell-1}))$ is $P_{D^{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the leave-one-out estimate $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$, where $r_i$ is the error made when the $i$th sample is left out, is $P_{D^{\ell}}\{r < \tau\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is
$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\tau - t)\, p(t)\, dt.$$
The distribution function for the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to that used in the proof of statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\, p(r_i)\, dr_i,$$
which is the same as $P(t < \tau)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples. However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of
$p$ error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon\right\} \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \varepsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \le 20$, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the proposed subsampling procedure gives us increasingly accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
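The asymptotic Kolmogorov–Smirnov probability above is straightforward to evaluate by truncating the series; the sketch below checks it against SciPy's implementation of the same limiting distribution:

```python
import math
from scipy.stats import kstwobign

def ks_prob(eps, terms=100):
    """P{sqrt(l) * sup_x |F(x) - F_l(x)| < eps} via the series
    1 - 2 * sum_k (-1)**(k-1) * exp(-2 * k**2 * eps**2)."""
    return 1 - 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * eps * eps)
                       for k in range(1, terms + 1))

print(ks_prob(1.36))          # ~0.95: the classical 5% critical value
print(kstwobign.cdf(1.36))    # SciPy's Kolmogorov limiting distribution agrees
```

In practice one inverts this relation: picking the $\varepsilon$ at which the probability reaches $1 - \delta$ bounds how far an empirical quantile can sit from the true one.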
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a. A motivation from the statistical mechanics approach to learning
In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach the average generalization error can be calculated as a function of $n$. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between $\varepsilon(n)$ and $n$ largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where $z$ is an input vector, $w$ is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is $t$ and the weight vector for the student perceptron is $w$. Then the number of generalization errors made by the student perceptron on a set of $p$ new samples will be
$$\sum_{i=1}^{p} \theta\left[-(t \cdot z_i)(w \cdot z_i)\right],$$
where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991) involving integrations in the weight space $w$. The generalization error is given as a function of the ratio $\beta = n/m$ (where $m$ is the number of attributes). However, with $m$ fixed, we can assume the same functional dependence on $n$ as on $\beta$. From this analysis we find that the generalization error depends on the algorithm used and generally follows a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-$\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-$\beta$ limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-$\beta$ limit, where the dependence of the generalization error on $\beta$ (or equivalently $n$) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as $0.50\,n^{-1}$ (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass, of version space, the space of all hypotheses consistent with the data) the generalization error scales as $0.44\,n^{-1}$ (Opper and Haussler, 1991). For the Adaline learning rule the error scales as $0.24\,n^{-1/2}$, and for the Hebb rule as $0.40\,n^{-1/2}$ (see Watkin et al. [1993] for a review). The dependence on $n$ is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights $w$). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the $b$ term in Equation (1)). In summary, this approach strongly motivates the use of $\varepsilon(n) = a\,n^{-\alpha} + b$ for modeling the generalization error.
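These scaling laws are easy to reproduce in a small simulation of the teacher-student setup. The sketch below (our construction) uses the Hebb rule, quoted above as scaling like $n^{-1/2}$, because its student has a closed form; for Gaussian inputs the generalization error of a student $w$ against a teacher $t$ is the angle between them divided by $\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                                    # number of attributes
teacher = rng.normal(size=m)
teacher /= np.linalg.norm(teacher)

def gen_error(w):
    # P(sign(w.z) != sign(t.z)) for Gaussian z is angle(w, t) / pi
    c = w @ teacher / np.linalg.norm(w)
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

def hebb_student(n):
    Z = rng.normal(size=(n, m))           # random uncorrelated input patterns
    y = np.sign(Z @ teacher)              # labels provided by the teacher
    return (y[:, None] * Z).sum(axis=0)   # Hebb rule: w = sum_i y_i z_i

sizes = [100, 400, 1600, 6400]
errs = [np.mean([gen_error(hebb_student(n)) for _ in range(30)]) for n in sizes]
slope = np.log(errs[0] / errs[-1]) / np.log(sizes[-1] / sizes[0])
print([round(e, 3) for e in errs], f"fitted decay ~ n^-{slope:.2f}")
```

With $m$ fixed, the measured decay exponent comes out close to the theoretical $1/2$ for the Hebb rule, illustrating the $n^{-\alpha}$ behavior assumed in Equation (1).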
2b. A motivation from an approximation theory point of view

Another justification for a power law, for regression or classification, comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class $f \in H$, and the optimal classification function $f_T$ is a more complicated function
that is not included in the function class $H$. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between them.

For function classes used in most algorithms, the worst-case analysis yields $\alpha = 0.5$. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function $V(\cdot, \cdot)$ is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all $x \in X$ and $y \in Y$. By a smooth loss function we mean loss functions that are $\ell_p$ with $p \ge 1$ or Lipschitz over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs $f(x) = \pm 1$ the loss function is $\ell_0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the $\ell_0$ loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.
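The Lipschitz point can be made concrete: the 0-1 loss $\theta(-y f(x))$ has a unit jump at the decision boundary, while the SVM's hinge loss is a 1-Lipschitz upper bound on it. A small numerical illustration (ours, not from the paper):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 81)          # values of y * f(x)
zero_one = (margins < 0).astype(float)        # theta(-y f(x)): jumps at 0
hinge = np.maximum(0.0, 1.0 - margins)        # SVM hinge loss: 1-Lipschitz

# The hinge loss upper-bounds the 0-1 loss everywhere ...
assert np.all(hinge >= zero_one)
# ... and its increments shrink with the grid step (slope at most 1),
# while the 0-1 loss keeps a unit jump no matter how fine the grid.
step = margins[1] - margins[0]
print(np.abs(np.diff(hinge)).max() <= step + 1e-12,
      np.abs(np.diff(zero_one)).max())
```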
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. The Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics 21 (supplement).
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalization. In Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.
Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 30, 40, 50, 60, 80, 90, 130, 170, 210. A leave-one-out model built with all the available samples (280) was used to validate the method and to compare the scaling model to the error rate achieved when using almost the entire dataset; this corresponds to a training set size of n = 279.
Figure 3 illustrates the results of the significance permutation test for this dataset, i.e., the statistical significance of classifiers with training sets of 15 and 30 samples. As can be seen in Fig. 3b, with 30 samples most of the random classifiers attain larger error rates than the actual classifier. For the case using 15 samples, about one in 20 of the random classifiers attains the same or a better error rate, and therefore a p-value of 0.05 is achieved. To fit the learning curves, we will use only data points obtained from training sets of size greater than or equal to 15.
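The permutation test can be sketched on synthetic data. In the illustration below (entirely made-up data and sizes), a nearest-centroid classifier stands in for the SVM; labels are permuted to build the null distribution of error rates, and the p-value is the fraction of label-permuted classifiers doing at least as well as the actual one:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_test_error(X, y, n_train, n_splits=20):
    """Average test error of a nearest-centroid classifier over random splits."""
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        tr, te = idx[:n_train], idx[n_train:]
        c0 = X[tr][y[tr] == 0].mean(axis=0)
        c1 = X[tr][y[tr] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[te] - c1, axis=1) <
                np.linalg.norm(X[te] - c0, axis=1)).astype(int)
        errs.append(np.mean(pred != y[te]))
    return float(np.mean(errs))

# Synthetic "expression" data: 60 samples, 200 genes, 5 of them informative.
X = rng.normal(size=(60, 200))
y = rng.integers(0, 2, 60)
X[:, :5] += 1.5 * y[:, None]

n_train = 30
actual = mean_test_error(X, y, n_train)
null = [mean_test_error(X, rng.permutation(y), n_train) for _ in range(100)]
p_value = float(np.mean([e <= actual for e in null]))
print(f"error {actual:.2f}, permutation p-value {p_value:.2f}")
```

Here the informative genes carry real signal, so the actual error falls well below the permuted errors (which hover around chance) and the p-value is small.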
To study the improvement of the learning curve estimates as a function of the number of training set sizes used to fit the learning curves, we constructed four learning curves using: 1) the error rates for all training set sizes (up to 210); 2) the error rates for the first 8 training set sizes; 3) the error rates for the first 6 training set sizes; and 4) the error rates for the first 4 training set sizes. The plots of these learning curves, along with the leave-one-out error for 280 samples, are given in Fig. 4. As expected, the model improves as more and larger training set sizes are used in the fit. The actual leave-one-out error rate achieved with 280 samples is only about 2% less than the error rate estimated for 279 training samples by extrapolating the learning curve model. Figure 5 shows the curve for the power law that results from applying the methodology to (a) all training sample sizes stated above (up to 210) and (b) the first 6 training set sizes (up to 90), along with the leave-one-out error for the entire dataset (280 samples). The expression for the error rate as a function of n, estimated using training set sizes up to 210, is
$$\mathrm{error}(n) = 1.42\,n^{-0.52} + 0.0098.$$

The error rates for the 25th and 75th quantiles are

$$\mathrm{error}_{25}(n) = 1.89\,n^{-0.63} + 0.0032,$$
$$\mathrm{error}_{75}(n) = 1.17\,n^{-0.43} + 0.00.$$
Based on this model one can see clearly how fast the error rate decreases with increasing dataset size. The asymptotic Bayes error rate $b$ is very small, indicating that indeed very low errors can be achieved if a large dataset were used to train the classifier. The decay rate $\alpha$ is about 0.5, indicating that in scaling terms this is a rather difficult problem for the model to learn. The size of the 25th and 75th quantile envelope
FIG. 3. The statistical significance in the tumor versus nontumor classification for (a) 15 samples and (b) 30 samples. The line is the empirical distribution function for the random classifiers, and the gray circle is the average error rate for the actual classifier.
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
$$\mathrm{error}(400) = 1.42\,(400)^{-0.52} + 0.0098 = 7.3\%,$$
$$\mathrm{error}_{25}(400) = 1.89\,(400)^{-0.63} + 0.0032 = 4.7\%,$$
$$\mathrm{error}_{75}(400) = 1.17\,(400)^{-0.43} + 0.00 = 8.9\%.$$
The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
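The extrapolation arithmetic above can be verified directly from the three fitted curves:

```python
def power_law(n, a, alpha, b):
    """Fitted learning-curve model error(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

models = {
    "mean":          (1.42, 0.52, 0.0098),
    "25th quantile": (1.89, 0.63, 0.0032),
    "75th quantile": (1.17, 0.43, 0.0),
}
for name, (a, alpha, b) in models.items():
    print(f"{name}: error(400) = {100 * power_law(400, a, alpha, b):.1f}%")
# prints 7.3%, 4.7%, and 8.9%, matching the values quoted in the text
```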
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
$$\mathrm{error}(n) = 0.7706\,n^{-0.65} + 0.009.$$
In this case the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification, constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.
A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.4798\,n^{-0.2797}.$$
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.7362\,n^{-0.6864}.$$
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
$$\mathrm{error}(n) = 0.57\,n^{-0.7073} + 0.006.$$
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcomes (alive two years after treatment) and 21 samples of patients with poor treatment outcomes. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
$$\mathrm{error}(n) = 1.115\,n^{-0.3295} + 0.06.$$
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcomes (alive two years after treatment) and 26 samples of patients with poor treatment outcomes. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is
$$\mathrm{error}(n) = 0.9431\,n^{-0.2957} + 0.1.$$
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van 't Veer et al., 2002). The dimensionality of the dataset was 24624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4852 n^{-0.0733} + 0.1
ESTIMATING DATASET SIZE REQUIREMENTS 135
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
136 MUKHERJEE ET AL
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and constructing a learning curve to extrapolate error statistics for larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training set sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
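The permutation-test step of the methodology can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a nearest-centroid classifier on synthetic Gaussian data stands in for the SVM on microarray profiles, and a two-fold cross-validated error is the test statistic.

```python
import numpy as np

def error_rate(Xtr, ytr, Xte, yte):
    """Nearest-centroid classifier error; an illustrative stand-in for an SVM."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float(np.mean(pred != yte))

def cv_error(X, y):
    """Two-fold cross-validated error."""
    h = len(y) // 2
    return 0.5 * (error_rate(X[:h], y[:h], X[h:], y[h:]) +
                  error_rate(X[h:], y[h:], X[:h], y[:h]))

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Fraction of label permutations that classify at least as well as the
    true labels (with the usual +1 correction for the observed statistic)."""
    rng = np.random.default_rng(seed)
    observed = cv_error(X, y)
    hits = sum(cv_error(X, rng.permutation(y)) <= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Synthetic two-class dataset with a clear mean shift between classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)),
               rng.normal(1.0, 1.0, (30, 50))])
y = np.array([0] * 30 + [1] * 30)
order = rng.permutation(60)           # interleave classes across the folds
X, y = X[order], y[order]
p_value = permutation_p_value(X, y)   # small p-value: the signal is real
```

With a genuine class structure the observed error beats essentially every relabeled dataset, so the p-value approaches 1/(n_perm + 1); with random labels it hovers near 0.5.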
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 samples range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, ..., z_ℓ) = (1/p) Σ_p Q(z^p, f_{z^{ℓ-p}}),

where z = (x, y), f_{z^{ℓ-p}} is the classifier constructed with p samples left out, and Q(z^p, f_{z^{ℓ-p}}) is the error of this classifier on the p samples left out. This procedure is unbiased:

E L_p(z_1, ..., z_ℓ) = E Q(z, f_{z^{ℓ-p}}),

which means that the expected error when the classifier is trained with ℓ - p samples is the same as the expected error of procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969) by the following series of transformations:

E L_1(z_1, ..., z_ℓ) = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^{ℓ-1}}) dP(z_1) ... dP(z_ℓ)
  = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f_{z^{ℓ-1}}) dP(z_i) dP(z_1) ... dP(z_{i-1}) dP(z_{i+1}) ... dP(z_ℓ)
  = (1/ℓ) Σ_{i=1}^{ℓ} E Q(z_i, f_{z^{ℓ-1}})
  = E Q(z, f_{z^{ℓ-1}}).

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ - p samples as the number of subsamples increases.
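Statement 1 can be checked numerically. The sketch below is an illustration under simplifying assumptions (a toy one-dimensional midpoint-threshold classifier, not the paper's SVM): over many random datasets, the average leave-one-out error should match the average error of a classifier trained on ℓ - 1 samples and tested on one independent point.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_threshold(x, y):
    """Toy classifier: threshold halfway between the two class means."""
    return 0.5 * (x[y == 0].mean() + x[y == 1].mean())

def draw_dataset(size):
    """Two 1-D Gaussian classes (means 0 and 1), both labels present."""
    y = rng.integers(0, 2, size)
    while y.min() == y.max():
        y = rng.integers(0, 2, size)
    return rng.normal(loc=y.astype(float)), y

ell, trials = 12, 4000
loo, holdout = [], []
for _ in range(trials):
    x, y = draw_dataset(ell)
    # leave-one-out estimate L_1 on this dataset
    errs = []
    for i in range(ell):
        m = np.arange(ell) != i
        if y[m].min() == y[m].max():
            continue                      # degenerate fold, skip
        errs.append((x[i] > fit_threshold(x[m], y[m])) != bool(y[i]))
    loo.append(np.mean(errs))
    # classifier trained on ell - 1 samples, tested on one fresh point
    xt, yt = draw_dataset(ell - 1)
    lab = int(rng.integers(0, 2))
    pt = rng.normal(loc=float(lab))
    holdout.append((pt > fit_threshold(xt, yt)) != bool(lab))

mean_loo, mean_holdout = float(np.mean(loo)), float(np.mean(holdout))
```

Up to Monte Carlo noise, the two averages agree, as the statement asserts.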
1b. The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ - p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and in general will be optimistic:

V[L_p(z_1, ..., z_ℓ)] ≤ V[Q(z, f_{z^{ℓ-p}})],

which means that the expected variance when the classifier is trained with ℓ - p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ - 1 is

V[Q(z, f_{z^{ℓ-1}})] = E[Q(z, f_{z^{ℓ-1}})^2] - [E Q(z, f_{z^{ℓ-1}})]^2.

The variance of the estimator is

V[(1/ℓ) L_1(z_1, ..., z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether an error is made or not when the i-th point is left out:

t_i = Q(z_i, f(x_1, ..., x_{i-1}, x_{i+1}, ..., x_ℓ)).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)^2] - (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])^2.

From Statement 1 we know

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E Q(z, f_{z^{ℓ-1}}).

So

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)^2] - [E Q(z, f_{z^{ℓ-1}})]^2.

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)^2] ≤ E[Q(z, f_{z^{ℓ-1}})^2].

One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)^2] = E[(1/ℓ^2) Σ_{i=1}^{ℓ} t_i^2 + (1/ℓ^2) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E[(1/ℓ^2) Σ_{i=1}^{ℓ} t_i^2 + (1/ℓ^2) Σ_{i≠j} t_i t_j] = E[t^2/ℓ + ((ℓ^2 - ℓ)/ℓ^2) t^2] = E[Q(z, f_{z^{ℓ-1}})^2].

However, t_1, ..., t_ℓ are not independent and are correlated, so

E[(1/ℓ^2) Σ_{i≠j} t_i t_j] ≤ E[((ℓ^2 - ℓ)/ℓ^2) t^2].

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ - p samples; in general, the variance of the subsampling procedure will be smaller.
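A quick simulation illustrates Statement 2: across many datasets, the variance of the averaged leave-one-out estimate falls well below the variance of a single held-out test error, so error bars computed this way are optimistic. The toy threshold classifier is an assumption of this sketch, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_threshold(x, y):
    """Toy 1-D classifier: threshold halfway between the class means."""
    return 0.5 * (x[y == 0].mean() + x[y == 1].mean())

def draw(size):
    """Two 1-D Gaussian classes (means 0 and 1), both labels present."""
    y = rng.integers(0, 2, size)
    while y.min() == y.max():
        y = rng.integers(0, 2, size)
    return rng.normal(loc=y.astype(float)), y

ell, trials = 12, 3000
loo_estimates, single_tests = [], []
for _ in range(trials):
    x, y = draw(ell)
    errs = []
    for i in range(ell):
        m = np.arange(ell) != i
        if y[m].min() == y[m].max():
            continue                      # degenerate fold, skip
        errs.append((x[i] > fit_threshold(x[m], y[m])) != bool(y[i]))
    loo_estimates.append(np.mean(errs))
    # the quantity whose variance V[Q(z, f)] appears on the right-hand side
    xt, yt = draw(ell - 1)
    lab = int(rng.integers(0, 2))
    pt = rng.normal(loc=float(lab))
    single_tests.append(float((pt > fit_threshold(xt, yt)) != bool(lab)))

v_loo = float(np.var(loo_estimates))    # variance of the averaged estimator
v_single = float(np.var(single_tests))  # ~ p(1-p) for a Bernoulli error
```

The single-point error is Bernoulli, so its variance is near p(1 - p); the averaged leave-one-out estimate is far less variable, exactly the optimism the statement describes.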
1c. Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ - p samples and tested on an independent sample.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z, f_{z^{ℓ-1}}) is P_{D^{ℓ-1}}{t < ξ}. The cumulative distribution function of the random variable r = (1/ℓ) L_1(z_1, ..., z_ℓ) is P_{D^ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < ξ) = ∫_{-∞}^{ξ} p(t) dt = ∫_{-∞}^{∞} θ(ξ - t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

P(r < ξ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{-∞}^{∞} θ(ξ - r_i) p(r_i) dr_i,

which is the same as P(t < ξ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ - p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{√ℓ sup_x |F(x) - F_ℓ(x)| < ε} ≈ 1 - 2 Σ_{k=1}^{∞} (-1)^{k-1} e^{-2 k^2 ε^2},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 - δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ - p samples as the number of subsamples increases.
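The Kolmogorov-Smirnov bound above is easy to evaluate numerically. The sketch below sums the series and inverts it by bisection to get the half-width ε such that, with probability 1 - δ, the empirical distribution of ℓ subsampled error rates lies uniformly within ε of the truth (the function names are mine, not the paper's).

```python
import math

def ks_cdf(lam, terms=100):
    """Kolmogorov limit law: P{sqrt(l) * sup_x |F(x) - F_l(x)| < lam}
    = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    return 1.0 - 2.0 * sum((-1.0) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                           for k in range(1, terms + 1))

def quantile_half_width(ell, delta=0.05):
    """Smallest eps with P{sup_x |F(x) - F_l(x)| < eps} >= 1 - delta,
    via bisection on lam = sqrt(ell) * eps (asymptotic, so ell > 20 or so)."""
    lo, hi = 0.3, 3.0              # ks_cdf(0.3) ~ 0 and ks_cdf(3.0) ~ 1
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ks_cdf(mid) < 1.0 - delta:
            lo = mid
        else:
            hi = mid
    return hi / math.sqrt(ell)

eps_50 = quantile_half_width(50)   # ~ 1.358 / sqrt(50), about 0.19
```

The familiar 95% critical value λ ≈ 1.358 drops out of the same series, which is a useful sanity check on the implementation.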
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a. A motivation from the statistical mechanics approach to learning
In this appendix, we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[-(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
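The teacher-student disagreement count in the display above translates directly into code. A small illustrative sketch (the function name and toy vectors are mine; ties at exactly zero are ignored here):

```python
import numpy as np

def student_errors(t, w, Z):
    """Count student-teacher disagreements over the rows z_i of Z:
    sum_i theta[-(t . z_i)(w . z_i)], i.e. how often the two
    perceptrons assign opposite signs to the same input."""
    return int(np.sum((Z @ t) * (Z @ w) < 0))

t = np.array([1.0, 0.0])         # teacher weight vector
w = np.array([1.0, 1.0])         # student weight vector
Z = np.array([[2.0, -1.0],       # teacher +, student +  -> agree
              [0.5, -1.0],       # teacher +, student -  -> disagree
              [-1.0, 0.2]])      # teacher -, student -  -> agree
n_err = student_errors(t, w, Z)  # -> 1
```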
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{-1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{-1} (Opper et al., 1991). For the Adaline learning rule, the error scales as 0.24 n^{-1/2}, and for the Hebb rule as 0.40 n^{-1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{-α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{-α} + b for modeling the generalization error.
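Fitting e(n) = a n^{-α} + b to measured error rates is straightforward in practice. One common trick, shown below as an illustration (not necessarily the paper's fitting procedure), is to grid-search the asymptotic error b and solve for (a, α) by linear least squares in log space, since log(e - b) = log a - α log n.

```python
import numpy as np

def fit_power_law(n, err, b_grid):
    """Fit err ~= a * n**(-alpha) + b: for each candidate b, regress
    log(err - b) on log(n); keep the b with the smallest residual."""
    n = np.asarray(n, dtype=float)
    err = np.asarray(err, dtype=float)
    best = None
    for b in b_grid:
        shifted = err - b
        if shifted.min() <= 0:
            continue                          # log undefined for this b
        A = np.column_stack([np.ones_like(n), -np.log(n)])
        coef, res, *_ = np.linalg.lstsq(A, np.log(shifted), rcond=None)
        rss = float(res[0]) if res.size else 0.0
        if best is None or rss < best[0]:
            best = (rss, float(np.exp(coef[0])), float(coef[1]), float(b))
    _, a, alpha, b = best
    return a, alpha, b

# Recover known parameters from noise-free synthetic learning-curve data
n = np.array([10, 15, 20, 25, 30, 35, 40])
err = 0.8 * n ** (-0.5) + 0.05
a, alpha, b = fit_power_law(n, err, np.arange(0.0, 0.10, 0.0005))
```

On real subsampled error rates the fit would be applied to the mean errors and, separately, to the 25th and 75th quantiles to obtain the envelope curves.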
2b. A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{-α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, then the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) - V(f_T(x), y) = O(n^{-α}) for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(-y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1, the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics 21 (supplement).

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation, in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
FIG. 4. Error rate as a function of sample size. The diamond is the leave-one-out error using 279 samples. The top curve is the learning curve using the first four sample sizes. The bottom curve is the learning curve using all sample sizes. The middle two curves (which basically overlap) are the learning curves using the first six (top) and first eight (bottom) sample sizes.
is about ±2%, and it indicates that the model is relatively accurate. If we were going to collect 400 training samples, this model can be used to extrapolate the error rate as follows:
error(400) = 1.42 · 400^{-0.52} + 0.0098 = 7.3%
error_25(400) = 1.89 · 400^{-0.63} + 0.0032 = 4.7%
error_75(400) = 1.17 · 400^{-0.43} + 0.00 = 8.9%
The achievable error rate using 400 samples, according to the model, is 7.3% ± 2.6%, and perhaps as low as 4.7% (25th quantile envelope).
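The extrapolations above are simple to verify numerically; the coefficients below are as read from the fitted mean, 25th-quantile, and 75th-quantile curves:

```python
def power_law(n, a, alpha, b):
    """Learning-curve model error(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

mean_400 = power_law(400, 1.42, 0.52, 0.0098)   # ~ 0.073, i.e. 7.3%
q25_400 = power_law(400, 1.89, 0.63, 0.0032)    # ~ 0.047, i.e. 4.7%
q75_400 = power_law(400, 1.17, 0.43, 0.0)       # ~ 0.089, i.e. 8.9%
```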
Leukemia morphology
The dataset consists of expression levels for 48 samples of acute lymphoblastic leukemia (ALL) and 25 samples of acute myeloid leukemia (AML) (Golub et al., 1999). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35. In Fig. 6, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 73 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7706 n^{-0.65} + 0.009
In this case, the learning and decay rates are such that the model clearly learns more quickly than in the previous example as a function of training set size. It achieves practically a zero error rate at 73 samples (consistent with the 25th quantile envelope). The envelope is wider in this case because we fit the model using a narrower range of dataset sizes, over which the empirical error rates display more variation than in the previous dataset.
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.4798 n^{-0.2797}
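The preprocessing used for this dataset (natural log followed by a hyperbolic tangent) is a one-liner; a sketch, with a hypothetical function name:

```python
import numpy as np

def preprocess_colon(expr):
    """Natural log then tanh squashing, as described for the colon data;
    `expr` must be a (samples x genes) array of positive expression values."""
    return np.tanh(np.log(expr))

X = np.array([[1.0, np.e, 7.4]])
Xp = preprocess_colon(X)   # log(1) = 0 maps to 0; large values saturate near 1
```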
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is
error(n) = 0.7362 n^{-0.6864}
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.57 n^{-0.7073} + 0.006
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.
Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.
However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is
error(n) = 1.115 n^{-0.3295} + 0.06
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG 10 Learning curves in the brain tumor treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 26 samples of patients with poor treatment outcome All patients had diffuse largecell lymphoma (DLCL) (Shipp et al 2001) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40 Statistical signi cance on this dataset (a p-value of 5)is achieved at about 50 samples In Fig 11 a plot of the learning curve and its 25th and 75th quantiles isgiven along with the leave-one-out error of the 58 samples As expected this model is not very accurateand overestimates the error rate at 57 samples by more than 9 The learning curve estimate of the errorrate as a function of n is
errorn D 9431niexcl2957 C 01
Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12 a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

error(n) = 4.852 n^{−0.733} + 0.1
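The fits quoted in these sections are nonlinear least-squares fits of the model error(n) = a·n^(−α) + b to the subsampled error rates. As a sketch of how such a fit and extrapolation can be done (the error rates below are made-up placeholders, not the values measured in the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    """Inverse power-law model: error(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Hypothetical mean error rates at the subsampled training-set sizes
# (placeholders, not the values measured in the paper).
n = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
err = np.array([0.42, 0.38, 0.36, 0.34, 0.33, 0.325, 0.32])

# Fit with all parameters nonnegative and b (asymptotic Bayes error) in [0, 1].
(a, alpha, b), _ = curve_fit(learning_curve, n, err,
                             p0=[1.0, 0.5, 0.1],
                             bounds=([0, 0, 0], [np.inf, np.inf, 1]))

# Extrapolate the expected error rate to a larger dataset size.
print(learning_curve(100.0, a, alpha, b))
```

Fitting the 25th and 75th quantile error rates with the same model yields the envelope curves shown in the figures.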
ESTIMATING DATASET SIZE REQUIREMENTS 135
FIG. 11. Learning curves in the lymphoma treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves in the breast cancer treatment outcome classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
136 MUKHERJEE ET AL
5. CONCLUSIONS

We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics to larger data sets that include yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
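The permutation-test step of the methodology can be sketched as follows. The synthetic data, the nearest-centroid classifier standing in for the SVM, and the 100-permutation budget are all illustrative assumptions, not the paper's actual protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (a simple
    stand-in for the SVM used in the paper)."""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        centroids = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        errors += pred != y[i]
    return errors / len(y)

# Synthetic two-class "expression" data (placeholder for microarray data).
X = rng.normal(size=(40, 50))
y = np.repeat([0, 1], 20)
X[y == 1] += 0.6            # shift that separates the two classes

observed = loo_error(X, y)

# Permutation test: refit under random label permutations and ask how
# often the permuted error is as small as the observed one.
perm_errors = [loo_error(X, rng.permutation(y)) for _ in range(100)]
p_value = np.mean([e <= observed for e in perm_errors])
print(observed, p_value)
```

A small p-value indicates that the classifier's error is significantly better than what label-shuffled data produce at the same training-set size.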
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 samples range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, ..., z_ℓ) = (1/p) Σ_p Q(z_p; f(z_{ℓ−p})),

where z = (x, y), f(z_{ℓ−p}) is the classifier constructed with p samples left out, and Q(z_p; f(z_{ℓ−p})) is the error of this classifier on the p samples left out. This procedure is unbiased:

E[L_p(z_1, ..., z_ℓ)] = E[Q(z; f(z_{ℓ−p}))],

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (Luntz and Brailovsky, 1969), by the following series of transformations:

E[L_1(z_1, ..., z_ℓ)] = (1/ℓ) ∫ Σ_{i=1}^ℓ Q(z_i; f(z_{ℓ−1})) dP(z_1) ... dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^ℓ ∫ Q(z_i; f(z_{ℓ−1})) dP(z_i) dP(z_1) ... dP(z_{i−1}) dP(z_{i+1}) ... dP(z_ℓ)

= (1/ℓ) Σ_{i=1}^ℓ E[Q(z_i; f(z_{ℓ−1}))]

= E[Q(z; f(z_{ℓ−1}))].

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
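Statement 1 can be checked numerically: averaged over many datasets, the leave-one-out estimate should match the expected error of a classifier trained on ℓ − 1 samples and tested on a fresh independent sample. The sketch below uses a toy 1-D threshold classifier of our own construction, not the SVM from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_threshold(x, y):
    """Midpoint-of-class-means threshold on 1-D data (a toy classifier)."""
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

def draw_dataset(n_per_class):
    """Two 1-D Gaussian classes, class 1 shifted by 1.5."""
    y = np.repeat([0, 1], n_per_class)
    x = rng.normal(loc=1.5 * y)
    return x, y

def loo_estimate(x, y):
    """Leave-one-out estimator L_1 on one dataset."""
    n = len(y)
    errs = 0
    for i in range(n):
        mask = np.arange(n) != i
        thr = fit_threshold(x[mask], y[mask])
        errs += int((x[i] > thr) != y[i])
    return errs / n

def holdout_error():
    """Train on ell - 1 samples, test on one fresh independent sample."""
    x, y = draw_dataset(6)                  # ell = 12
    thr = fit_threshold(x[1:], y[1:])       # train on ell - 1 = 11 samples
    y_new = rng.integers(0, 2)
    x_new = rng.normal(loc=1.5 * y_new)
    return int((x_new > thr) != y_new)

reps = 4000
loo_mean = np.mean([loo_estimate(*draw_dataset(6)) for _ in range(reps)])
test_mean = np.mean([holdout_error() for _ in range(reps)])
print(loo_mean, test_mean)   # the two averages should nearly coincide
```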
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; thus the variance of the leave-p-out procedure is not necessarily unbiased, and in general it will be optimistic:

V[L_p(z_1, ..., z_ℓ)] ≤ V[Q(z; f(z_{ℓ−p}))],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size ℓ − 1 is

V[Q(z; f(z_{ℓ−1}))] = E[Q(z; f(z_{ℓ−1}))^2] − [E Q(z; f(z_{ℓ−1}))]^2.

The variance of the estimator is

V[L_1(z_1, ..., z_ℓ)] = V[(1/ℓ) Σ_{i=1}^ℓ t_i],

where t_i indicates whether or not an error is made when the i-th point is left out:

t_i = Q(z_i; f(x_1, ..., x_{i−1}, x_{i+1}, ..., x_ℓ)).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^ℓ t_i] = E[((1/ℓ) Σ_{i=1}^ℓ t_i)^2] − (E[(1/ℓ) Σ_{i=1}^ℓ t_i])^2.

From Statement 1 we know that

E[(1/ℓ) Σ_{i=1}^ℓ t_i] = E[Q(z; f(z_{ℓ−1}))].

So

V[(1/ℓ) Σ_{i=1}^ℓ t_i] = E[((1/ℓ) Σ_{i=1}^ℓ t_i)^2] − [E Q(z; f(z_{ℓ−1}))]^2.

One can show that

E[((1/ℓ) Σ_{i=1}^ℓ t_i)^2] ≤ E[Q(z; f(z_{ℓ−1}))^2].

One can write

E[((1/ℓ) Σ_{i=1}^ℓ t_i)^2] = E[(1/ℓ^2) Σ_{i=1}^ℓ t_i^2 + (1/ℓ^2) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above expression could be rewritten as

E[(1/ℓ^2) Σ_{i=1}^ℓ t_i^2 + (1/ℓ^2) Σ_{i≠j} t_i t_j] = E(t^2)/ℓ + ((ℓ^2 − ℓ)/ℓ^2) (E t)^2,

which for large ℓ approaches [E Q(z; f(z_{ℓ−1}))]^2. However, t_1, ..., t_ℓ are not independent but correlated, so

E[(1/ℓ^2) Σ_{i≠j} t_i t_j] ≤ ((ℓ^2 − ℓ)/ℓ^2) E(t^2).

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
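The optimism can be illustrated with a stylized simulation: model the leave-one-out error indicators t_i as correlated Bernoulli variables (correlated because every round trains on nearly the same data) and compare the variance of their average with the variance of a single train-and-test error. The latent-variable construction below is purely illustrative and is not derived from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 0/1 error indicators t_1..t_ell: a shared latent variable models
# the fact that all leave-one-out rounds train on nearly the same data.
ell, reps = 20, 20000
shared = rng.normal(size=(reps, 1))
private = rng.normal(size=(reps, ell))
t = (0.8 * shared + 0.6 * private > 0.5).astype(float)

var_single = t[:, 0].var()        # stands for V[Q(z; f(z_{ell-1}))]
var_loo = t.mean(axis=1).var()    # stands for V[L_1(z_1, ..., z_ell)]
print(var_single, var_loo)
```

The averaged estimator has smaller variance than a single test error, but because of the correlations it shrinks far less than the 1/ℓ rate that independent errors would give.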
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f(z_{ℓ−1})) is P_{D_{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = L_1(z_1, ..., z_ℓ) = (1/ℓ) Σ_{i=1}^ℓ r_i is P_{D_ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < ξ) = ∫_{−∞}^{ξ} p(t) dt = ∫_{−∞}^{∞} θ(ξ − t) p(t) dt.

The distribution function for the random variable r can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

P(r < ξ) = (1/ℓ) Σ_{i=1}^ℓ ∫_{−∞}^{∞} θ(ξ − r_i) p(r_i) dr_i,

which is the same as P(t < ξ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those for the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

P{√ℓ sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
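A sketch of how the quantile envelope and the Kolmogorov–Smirnov bound are used in practice. The tail formula is the one quoted above, rewritten here for the unscaled deviation sup|F − F_ℓ| (so the exponent carries the factor ℓ); the error-rate sample is a placeholder, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# A sequence of error rates from hypothetical subsampled train/test splits
# (placeholder values; in practice these come from the leave-p-out runs).
errors = rng.beta(2.0, 6.0, size=100)

q25, q75 = np.quantile(errors, [0.25, 0.75])   # empirical quantile envelope

def ks_tail(eps, ell, terms=50):
    """P{sup_x |F(x) - F_ell(x)| >= eps} ~ 2 * sum (-1)^(k-1) exp(-2 k^2 eps^2 ell),
    the Kolmogorov-Smirnov tail used in the text (valid for ell > 20)."""
    k = np.arange(1, terms + 1)
    return 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * eps**2 * ell))

print(q25, q75, ks_tail(0.1, len(errors)))
```

With 100 error rates, for example, the empirical CDF (and hence the quantile estimates) is within 0.1 of the truth except with the probability given by ks_tail(0.1, 100).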
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
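The power-law behavior is easy to observe in a small teacher–student simulation. The sketch below, our own construction, uses a least-squares (Adaline-style) student rather than the optimal perceptron, so by the scalings quoted above its error should fall roughly as n^{−1/2}:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 20                              # number of attributes
teacher = rng.normal(size=m)        # the rule to be discovered

def gen_error(n, trials=60, n_test=2000):
    """Average generalization error of a least-squares (Adaline-style)
    student trained on n random patterns labeled by the teacher perceptron."""
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, m))
        y = np.sign(X @ teacher)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares student
        Xt = rng.normal(size=(n_test, m))
        errs.append(np.mean(np.sign(Xt @ w) != np.sign(Xt @ teacher)))
    return float(np.mean(errs))

sizes = [40, 80, 160, 320]
errors = [gen_error(n) for n in sizes]
print(errors)    # error falls with n, roughly as a power law
```

Fitting a n^{−α} + b to such a sequence, as in the main text, recovers a decay rate α characteristic of the learning rule.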
2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^{−α}),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS

We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Gyorfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Noterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
FIG. 5. Learning curves in the tumor versus nontumor classification constructed using (a) all the sample sizes stated above and (b) the first six sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 6. Learning curves in the AML versus ALL classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer

The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Noterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7 a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4798 n^{−0.2797}
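A fitted curve can also be inverted to answer the planning question directly: solving error(n) = a·n^(−α) + b for n gives n = (a/(target − b))^(1/α). A minimal sketch, using the colon cancer coefficients above for illustration:

```python
def required_n(target_error, a, alpha, b=0.0):
    """Invert error(n) = a * n**(-alpha) + b for n (valid when target > b)."""
    if target_error <= b:
        raise ValueError("target error must exceed the asymptotic error b")
    return (a / (target_error - b)) ** (1.0 / alpha)

# Using the colon cancer fit from the text: error(n) = 0.4798 * n**(-0.2797).
for target in (0.15, 0.10, 0.05):
    print(target, round(required_n(target, a=0.4798, alpha=0.2797)))
```

Because α is well below 1 here, halving the target error multiplies the required sample size by roughly 2^(1/α), about a factor of 12, which is why problems with small decay rates demand so much more data.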
Ovarian cancer

The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8 a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.7362 n^{−0.6864}
Lymphoma morphology

The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset was 18,432. The data was preprocessed by taking the base 2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9 a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

error(n) = 0.57 n^{−0.7073} + 0.006

FIG. 7. Learning curves in the malignant versus normal colon tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Brain cancer treatment outcome

The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10 a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

error(n) = 1.115 n^{−0.3295} + 0.06
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG 10 Learning curves in the brain tumor treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive twoyears after treatment) and 26 samples of patients with poor treatment outcome All patients had diffuse largecell lymphoma (DLCL) (Shipp et al 2001) The dimensionality of the dataset is 7129 No preprocessingwas performed The classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernelselecting 150 features using the radius-margin criteria (Chapelle et al 2001) Error rates were estimatedfor training set sizes of n D 20 25 30 35 40 Statistical signi cance on this dataset (a p-value of 5)is achieved at about 50 samples In Fig 11 a plot of the learning curve and its 25th and 75th quantiles isgiven along with the leave-one-out error of the 58 samples As expected this model is not very accurateand overestimates the error rate at 57 samples by more than 9 The learning curve estimate of the errorrate as a function of n is
errorn D 9431niexcl2957 C 01
Breast cancer treatment outcome
The dataset consists expression levels of 34 samples from patients with breast tumors that metastasizedwithin ve years of disease onset and 44 samples from patients that were disease free for at least ve years(Vanrsquot Veer et al 2002) The dimensionality of the dataset was 24624 No preprocessing was performedThe classi er used was a Support Vector Machine (Vapnik 1998) with a linear kernel without featureselection Error rates were estimated for training set sizes of n D 10 20 30 40 50 60 70 Statisticalsigni cance on this dataset (a p-value of 5) is achieved at about 65 samples In Fig 12 a plot of thelearning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 Asexpected this model is not very accurate and overestimates the error rate at 77 samples by more than 6The learning curve estimate of the error rate as a function of n is
errorn D 4852niexcl0733 C 01
ESTIMATING DATASET SIZE REQUIREMENTS 135
FIG 11 Learning curves in the lymphoma treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates
FIG 12 Learning curves in the breast cancer treatment outcome classi cation constructed using sample sizes statedabove The middle line is the learning curve for the mean error The bottom line is for the 25th quantile The topline is for the 75th quantile The diamond is the leave-one-out error and the circles are the measured average errorrates
136 MUKHERJEE ET AL
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test, and for constructing a learning curve to extrapolate error statistics to larger datasets that include yet-unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
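The permutation-test idea can be sketched in a few lines: shuffle the class labels many times, recompute a cross-validated error rate for each shuffled dataset, and report the fraction of permutations that match or beat the error obtained with the true labels. The sketch below is a minimal illustration, not the paper's exact protocol: a nearest-centroid classifier stands in for the SVM, leave-one-out error for the subsampled estimates, and the data are synthetic.

```python
import numpy as np

def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        c0 = Xtr[ytr == 0].mean(axis=0)
        c1 = Xtr[ytr == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        errors += int(pred != y[i])
    return errors / n

def permutation_p_value(X, y, n_perm=100, seed=0):
    """Fraction of label permutations whose error matches or beats the observed one."""
    rng = np.random.default_rng(seed)
    observed = loo_error(X, y)
    hits = sum(loo_error(X, rng.permutation(y)) <= observed for _ in range(n_perm))
    return observed, hits / n_perm

# Synthetic two-class data with a real signal: the p-value should be small.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 30)), rng.normal(1.5, 1.0, (20, 30))])
y = np.array([0] * 20 + [1] * 20)
observed, pval = permutation_p_value(X, y)
```

With well-separated classes, virtually no permutation reaches the observed error, so the estimated p-value is near zero; with shuffled or uninformative data it approaches 1.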
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
ESTIMATING DATASET SIZE REQUIREMENTS 137
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
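The learning-curve fit itself is a small computation: given mean error rates measured at several training-set sizes, fit e(n) = a n^(−α) + b. One minimal way to do this (an illustration on synthetic "measurements", not the paper's fitting procedure) is a grid search over the decay rate α, solving for a and b by linear least squares at each candidate α:

```python
import numpy as np

def fit_power_law(n, e, alphas=np.linspace(0.05, 2.0, 400)):
    """Fit e(n) = a * n**(-alpha) + b. For fixed alpha the model is linear
    in (a, b), so each candidate alpha is solved by least squares and the
    best-fitting triple is returned."""
    best = None
    for alpha in alphas:
        A = np.column_stack([n ** (-alpha), np.ones_like(n)])
        (a, b), *_ = np.linalg.lstsq(A, e, rcond=None)
        sse = np.sum((A @ np.array([a, b]) - e) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]

# Hypothetical mean error rates at several training-set sizes.
n_train = np.array([10., 15., 20., 25., 30., 35., 40.])
errors = 0.55 * n_train ** (-0.6) + 0.05      # synthetic "measured" rates
a, alpha, b = fit_power_law(n_train, errors)
extrapolated = a * 100.0 ** (-alpha) + b      # projected error at n = 100
```

On noiseless synthetic data the true parameters are recovered to grid resolution; with real, noisy error rates the fit to the 25th and 75th quantile sequences gives the envelope curves.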
APPENDIX 1: BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS

1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{p} \sum_{p} Q(z^p; f(z^{\ell-p})),$$

where z = (x, y), f(z^{ℓ−p}) is the classifier constructed with the p samples left out, and Q(z^p; f(z^{ℓ−p})) is the error of this classifier on the p samples left out. This procedure is unbiased:

$$E\,L_p(z_1, \ldots, z_\ell) = E\,Q(z; f(z^{\ell-p})),$$

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) through the following series of transformations:

$$E\,L_1(z_1, \ldots, z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f(z^{\ell-1}))\, dP(z_1) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f(z^{\ell-1}))\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\,Q(z_i; f(z^{\ell-1})) = E\,Q(z; f(z^{\ell-1})).$$

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
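The leave-p-out procedure L_p can be sketched as repeated random subsampling: hold p samples out, train on the remaining ℓ − p, and record the error on the held-out set; the mean over subsamples estimates the expected error at training size ℓ − p. The sketch below uses a nearest-centroid classifier on synthetic data as a stand-in for the SVM; all names are illustrative.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def nearest_centroid_err(centroids, X, y):
    c0, c1 = centroids
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float(np.mean(pred != y))

def leave_p_out(X, y, p, n_subsamples=200, seed=0):
    """The L_p estimator: average held-out error over random (ell-p)/p splits."""
    rng = np.random.default_rng(seed)
    ell = len(y)
    errs = []
    for _ in range(n_subsamples):
        idx = rng.permutation(ell)
        test, train = idx[:p], idx[p:]
        # skip degenerate splits where a class is missing from the training set
        if len(set(y[train])) < 2:
            continue
        centroids = nearest_centroid_fit(X[train], y[train])
        errs.append(nearest_centroid_err(centroids, X[test], y[test]))
    return np.array(errs)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (25, 10)), rng.normal(1.0, 1.0, (25, 10))])
y = np.array([0] * 25 + [1] * 25)
errs = leave_p_out(X, y, p=10)
mean_err = errs.mean()
```

By Statement 1, the mean of `errs` is an unbiased estimate of the expected error at training size ℓ − p; the sequence itself is what the quantile envelopes in the figures are built from.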
1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out
138 MUKHERJEE ET AL
procedure is not necessarily unbiased and in general will be optimistic:

$$V(L_p(z_1, \ldots, z_\ell)) \le V(Q(z; f(z^{\ell-p}))),$$

which means that the variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size ℓ − 1 is

$$V(Q(z; f(z^{\ell-1}))) = E\left[Q(z; f(z^{\ell-1}))^2\right] - \left[E\,Q(z; f(z^{\ell-1}))\right]^2.$$

The variance of the estimator is

$$V\left(\frac{1}{\ell} L_1(z_1, \ldots, z_\ell)\right) = V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right),$$

where t_i indicates whether or not an error is made when the i-th point is left out,

$$t_i = Q(z_i; f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell)).$$
We can rewrite this as

$$V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] - \left[E\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)\right]^{2}.$$
From Statement 1 we know that

$$E\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\,Q(z; f(z^{\ell-1})).$$
So

$$V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] - \left[E\,Q(z; f(z^{\ell-1}))\right]^{2}.$$
One can show that

$$E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] \le E\left[Q(z; f(z^{\ell-1}))^2\right].$$
One can write

$$E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] = E\left(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right).$$
If the random variables t_i, t_j are identical and independent, then the above equation can be rewritten as

$$E\left(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right) = E\left(\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\right) = E\left[Q(z; f(z^{\ell-1}))\right]^2.$$
ESTIMATING DATASET SIZE REQUIREMENTS 139
However, t_1, …, t_ℓ are not independent but correlated, so

$$E\left(\frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right) \le E\left(\frac{\ell^2 - \ell}{\ell^2}\, t^2\right).$$
The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
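Because the t_i share almost all of their training data, they are correlated, and the averaged leave-one-out estimate fluctuates less across datasets than a single independent-sample test does. A small simulation illustrates this optimism; the Gaussian class-conditional data and the nearest-centroid classifier (standing in for the SVM) are assumptions of the sketch, not part of the paper's analysis.

```python
import numpy as np

def loo_indicators(X, y):
    """t_i = 1 if a nearest-centroid classifier errs on point i when left out."""
    n = len(y)
    t = np.zeros(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        c0 = Xtr[ytr == 0].mean(axis=0)
        c1 = Xtr[ytr == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        t[i] = int(pred != y[i])
    return t

rng = np.random.default_rng(0)
loo_means, single_tests = [], []
for _ in range(300):                      # many independent datasets of size ell = 20
    X = np.vstack([rng.normal(0.0, 1.0, (10, 5)), rng.normal(1.0, 1.0, (10, 5))])
    y = np.array([0] * 10 + [1] * 10)
    loo_means.append(loo_indicators(X, y).mean())    # leave-one-out estimate
    # one independent test point, scored by a classifier trained on all samples
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    label = int(rng.integers(0, 2))
    x_new = rng.normal(float(label), 1.0, 5)
    pred = int(np.linalg.norm(x_new - c1) < np.linalg.norm(x_new - c0))
    single_tests.append(int(pred != label))

var_loo = float(np.var(loo_means))        # variance of the averaged estimator
var_single = float(np.var(single_tests))  # variance of a single held-out test
```

Across repeated datasets the variance of the averaged leave-one-out estimate comes out well below the Bernoulli variance of a single independent test, which is the sense in which the estimator's variance is "optimistic."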
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z; f(z^{ℓ−1})) is P_{D^{ℓ−1}}{t < ξ}. The cumulative distribution function of the random variable r = (1/ℓ) L_1(z_1, …, z_ℓ) is P_{D^ℓ}{r < ξ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \xi) = \int_{-\infty}^{\xi} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\xi - t)\, p(t)\, dt.$$

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

$$P(r < \xi) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\xi - r_i)\, p(r_i)\, dr_i,$$
which is the same as P(t < ξ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples. However, we do not have this distribution when we run the leave-p-out procedure: we have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from that sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \epsilon\right\} \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\epsilon^2},$$

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
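In practice one computes the 25th and 75th quantiles of the subsampled error rates and, inverting the one-term truncation of the Kolmogorov–Smirnov series above, bounds the deviation of the empirical distribution by ε ≈ √(ln(2/δ)/(2ℓ)) at confidence 1 − δ. A short sketch follows; the one-term inversion is a simplification, and the beta-distributed rates are synthetic stand-ins for measured subsample error rates.

```python
import math
import numpy as np

def ks_band(num_rates, delta=0.05):
    """Half-width eps such that sup_x |F(x) - F_hat(x)| < eps holds with
    probability about 1 - delta, from the k = 1 term of the KS series:
    1 - 2*exp(-2*eps'**2) >= 1 - delta with eps' = sqrt(num_rates)*eps."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * num_rates))

# Hypothetical sequence of subsampled error rates at one training-set size.
rng = np.random.default_rng(0)
rates = rng.beta(2, 8, size=100)
q25, q75 = np.quantile(rates, [0.25, 0.75])
eps = ks_band(len(rates))
```

With 100 subsample error rates and δ = 0.05, the empirical distribution function is within about 0.136 of the true one everywhere, so the reported quantiles are trustworthy to roughly that resolution.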
APPENDIX 2: MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
$$y = \mathrm{sign}(w \cdot z),$$
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta\left[-(t \cdot z_i)(w \cdot z_i)\right],$$
where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally follows a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insights can be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, the center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
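The teacher–student picture is easy to reproduce numerically. In the sketch below (an illustration, not a computation from the paper), a random teacher perceptron labels Gaussian patterns, the student is trained with the Hebb rule w = Σᵢ yᵢ zᵢ, and for isotropic inputs the generalization error equals arccos of the teacher–student overlap divided by π. The measured error decays with n roughly as n^(−1/2), consistent with the 0.40 n^(−1/2) Hebb-rule scaling cited above.

```python
import numpy as np

def hebb_generalization_error(n, m=200, trials=20, seed=0):
    """Average generalization error of a Hebb-trained student perceptron
    against a random teacher, for n Gaussian patterns in m dimensions.
    For isotropic inputs the error is arccos(overlap) / pi exactly."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        t = rng.normal(size=m)                    # teacher weight vector
        Z = rng.normal(size=(n, m))               # random input patterns
        y = np.sign(Z @ t)                        # teacher-assigned labels
        w = (y[:, None] * Z).sum(axis=0)          # Hebb rule: w = sum_i y_i z_i
        overlap = (t @ w) / (np.linalg.norm(t) * np.linalg.norm(w))
        errs.append(np.arccos(np.clip(overlap, -1.0, 1.0)) / np.pi)
    return float(np.mean(errs))

errs = {n: hebb_generalization_error(n) for n in (50, 200, 800)}
```

Quadrupling n roughly halves the error here, the signature of an n^(−1/2) power law; more efficient rules (optimal perceptron, Bayes point) decay closer to n^(−1).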
2b. A motivation from an approximation theory point of view

Another justification for a power law, for regression or classification, comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$
where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between them.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{−α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean a loss function that is ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group WhiteheadMIT Center for GenomeResearch and the Golub Laboratory Dana-Farber Cancer Institute for many valuable discussions Thiswork is supported in part by grants from Affymetrix Inc Millennium Pharmaceuticals Inc Bristol-MyersSquibb Company and the Of ce of Science (BER) US Department of Energy Grant No DE-FG02-01ER63185 Sayan Mukherjee was supported by a SLOANDOE postdoctoral fellowship
REFERENCES

Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
FIG. 6. Learning curves for the AML versus ALL classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Colon cancer
The dataset consists of expression levels for 22 samples of normal colon tissue and 40 samples of malignant tissue (Notterman et al., 2001). The dimensionality of the dataset is 2,000. The data was preprocessed by taking the natural logarithm of all input values and then applying a hyperbolic-tangent function. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40, 45, 50. In Fig. 7, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 62 samples.

A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.4798\, n^{-0.2797}.$$
Ovarian cancer
The dataset consists of expression levels for 24 samples of normal ovarian tissue and 30 samples of malignant tissue (Schummer et al., 1999). The dimensionality of the dataset is 1,536. The data was preprocessed by adding 1 and taking the natural logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 10, 15, 20, 25, 30, 35, 40. In Fig. 8, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 54 samples. A p-value of 0.05 is achieved at about 10 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.7362\, n^{-0.6864}.$$
Lymphoma morphology
The dataset consists of expression levels for 24 samples of diffuse large B-cell lymphoma and 12 samples of follicular lymphoma and chronic lymphocytic leukemia (Alizadeh et al., 2000). The dimensionality of the dataset
FIG. 7. Learning curves for the malignant versus normal colon tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves for the cancerous versus normal ovarian tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.57\, n^{-0.7073} + 0.006.$$
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had a successful treatment outcome (alive two years after treatment) and 21 samples of patients with a poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 1.115\, n^{-0.3295} + 0.06.$$
FIG. 9. Learning curves for the diffuse large B-cell versus follicular morphology classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves for the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had a successful treatment outcome (alive two years after treatment) and 26 samples of patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.9431\, n^{-0.2957} + 0.1.$$
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van 't Veer et al., 2002). The dimensionality of the dataset is 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.4852\, n^{-0.733} + 0.1.$$
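Each fitted curve can be evaluated directly to project the error rate at sample sizes beyond those measured. The sketch below collects the three treatment-outcome fits; the decimal points in the coefficients were restored editorially from the garbled source, so treat the exact values as assumptions rather than as the published numbers.

```python
# Reconstructed learning-curve fits, error(n) = a * n**(-alpha) + b, for the
# treatment-outcome problems (coefficients are editorial reconstructions).
fits = {
    "brain tumor outcome":   (1.115,  0.3295, 0.06),
    "lymphoma outcome":      (0.9431, 0.2957, 0.10),
    "breast cancer outcome": (0.4852, 0.7330, 0.10),
}

def predicted_error(name, n):
    """Evaluate a fitted inverse power-law learning curve at training size n."""
    a, alpha, b = fits[name]
    return a * n ** (-alpha) + b

# Project each curve at a size beyond the measured training-set sizes.
projections = {name: predicted_error(name, 100.0) for name in fits}
```

Every curve decreases monotonically toward its asymptote b, so the b term itself is the model's estimate of the irreducible (Bayes) error for that problem.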
FIG. 11. Learning curves for the lymphoma treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves for the breast cancer treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the signi cance of a classi errsquos performance via apermutation test and constructing a learning curve to extrapolate error statistics for larger data sets thatinclude yet unseen samples We applied the method to eight cancer classi cation problems of varyinglevels of dif culty Based on the results of the previous section one can see that the inverse power-lawscaling model proposed ts the empirical error rates reasonably well The classi er we used was an SVMbut the methodology is applicable to other algorithms (eg weighted voting k-nearest neighbors logisticregression etc) For the morphology classi cation problems the extrapolation is quite accurate For thetreatment outcome classi cation problems the combination of the increased complexity of the problemsand the limited dataset sizes yield a less accurate but still indicative extrapolation As expected the modelimproves as larger training samples sizes are used in the learning curve t (see Figs 4 and 5) The learningcurves bear out the empirical observation that morphological distinctions are more dramatic and thusin general ldquosimplerrdquo problems than the more subtle distinctions that must be determined for treatmentoutcome prediction Signi cance on morphology problems is achieved with 10ndash20 training samples andldquoreasonably accuraterdquo extrapolation requires 30ndash40 training samples In contrast for treatment outcomesigni cance is achieved with 45ndash60 training samples and ldquoreasonably accuraterdquo extrapolation requireson the order of 75ndash100 training samples For morphological distinctions the learning curve prediction isreasonably close to the actual leave-one-out error measured at a larger size The 25th and 75th quantilemodels provide useful error bar envelopes that enclose the observed error rates for those problems Fortreatment outcome prediction due to the large training set size required to achieve signi cance and smallavailable dataset 
sizes we do not have enough signi cant classi ers with which to construct an accuratelearning curve Consequently we get less accurate estimates of the leave-one-error on the entire datasetfor the outcome treatment examples with differences of 7 for brain tumor outcome 9 for lymphomatreatment outcome and 8 for breast tumor metastasis
The estimations of the asymptotic Bayes error b the learning rate a and decay rate reg can also be useddirectly to characterize the dif culty of a problem and the complexity of a model They can provide abasis for comparing and contrasting models and problems To illustrate we show in Fig 13 the values ofthese parameters for the examples discussed in the paper The morphology and treatment outcome datasetscluster with respect to reg and b We have not elaborated on this aspect of the analysis but it is certainly aninteresting direction to pursue in the future
In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable
FIG 13 A plot of 1=reg versus macr for the eight datasets The diamonds correspond to treatment outcome problemsand the squares correspond to morphology prediction problems
ESTIMATING DATASET SIZE REQUIREMENTS 137
tool when designing future experiments either for evaluating whether it is worthwhile to collect additionalsamples or for obtaining a deeper insight into the complexity of a given classi cation problem basedon preliminary data Table 1 shows a summary of the results for the examples described in this paperThe results of applying this method to those examples suggest that minimum dataset size requirementsfor morphological classi cations are typically in the 10ndash20 samples range and upwards of 50 samples fortreatment outcome classi cation These results can be used to provide general rule-of-thumb guidelinesbut the exact numbers for a given problem are dataset and classi er dependent This method can also beapplied to other domains where a prospective estimation of the number of samples is relevant as is thecase in many problems using molecular features to classify biological samples eg classi cation basedon proteomic mass spectroscopy data chemosensitivity prediction survival analysis and putative classdiscovery using clustering
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-p-OUT ESTIMATORS
1a. The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size $\ell$, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{p} \sum_{(p)} Q(z^p; f_{z^{\ell-p}}),$$

where $z = (x, y)$, $f_{z^{\ell-p}}$ is the classifier constructed with p samples left out, and $Q(z^p; f_{z^{\ell-p}})$ is the error of this classifier on the p samples left out. This procedure is unbiased:

$$E\,L_p(z_1, \ldots, z_\ell) = E\,Q(z; f_{z^{\ell-p}}),$$

which means that the expected error when the classifier is trained with $\ell - p$ samples is the same as the expected error of procedure $L_p$.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) by the following series of transformations:

$$E\,L_1(z_1, \ldots, z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_1) \cdots dP(z_\ell)$$
$$= \frac{1}{\ell} \sum_{i=1}^{\ell} \int Q(z_i; f_{z^{\ell-1}})\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$
$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\,Q(z_i; f_{z^{\ell-1}}) = E\,Q(z; f_{z^{\ell-1}}).$$

The implication of this statement is that the subsampling procedure proposed is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
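The subsampling procedure of Statement 1 can be sketched directly. The following is a minimal illustration, not the authors' code: it substitutes a toy nearest-centroid rule for the SVM actually used in the paper, and all function names (`leave_p_out_error`, etc.) are ours.

```python
import random

import numpy as np


def nearest_centroid_fit(X, y):
    """Toy stand-in for the paper's SVM: one centroid per class."""
    return {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}


def nearest_centroid_predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array([classes[i] for i in dists.argmin(axis=0)])


def leave_p_out_error(X, y, p, n_subsamples=100, seed=0):
    """Monte Carlo version of L_p: average error over random choices of the
    p held-out samples, training on the remaining ell - p samples each time."""
    rng = random.Random(seed)
    n = len(y)
    errors = []
    for _ in range(n_subsamples):
        held = rng.sample(range(n), p)
        train = [i for i in range(n) if i not in held]
        model = nearest_centroid_fit(X[train], y[train])
        pred = nearest_centroid_predict(model, X[held])
        errors.append(float(np.mean(pred != y[held])))
    return float(np.mean(errors))
```

By Statement 1, averaging over many random held-out subsets estimates, in expectation, the true error of a classifier trained with $\ell - p$ samples.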
138 MUKHERJEE ET AL

1b. The variance of the leave-p-out estimator is biased and optimistic

Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with $\ell - p$ samples and tested on an independent sample; the variance of the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

$$V[L_p(z_1, \ldots, z_\ell)] \le V[Q(z; f_{z^{\ell-p}})],$$

which means that the expected variance when the classifier is trained with $\ell - p$ samples is greater than or equal to the variance of the procedure $L_p$.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size $\ell - 1$ is

$$V[Q(z; f_{z^{\ell-1}})] = E\big[Q(z; f_{z^{\ell-1}})^2\big] - \big[E\,Q(z; f_{z^{\ell-1}})\big]^2.$$

The variance of the estimator is

$$V[L_1(z_1, \ldots, z_\ell)] = V\Big[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big],$$

where $t_i$ indicates whether or not an error is made when the $i$th point is left out:

$$t_i = Q(z_i; f_{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_\ell}).$$
We can rewrite this as

$$V\Big[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big] = E\Big[\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^2\Big] - \Big(E\Big[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big]\Big)^2.$$
From Statement 1 we know

$$E\Big[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big] = E\,Q(z; f_{z^{\ell-1}}).$$

So

$$V\Big[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big] = E\Big[\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^2\Big] - \big[E\,Q(z; f_{z^{\ell-1}})\big]^2.$$
One can show that

$$E\Big[\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^2\Big] \le E\big[Q(z; f_{z^{\ell-1}})^2\big].$$
One can write

$$E\Big[\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^2\Big] = E\Big[\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \neq j} t_i t_j\Big].$$
If the random variables $t_i, t_j$ were identical and independent, then the above equation could be rewritten

$$E\Big[\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \neq j} t_i t_j\Big] = E\Big[\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\Big] = E\big[Q(z; f_{z^{\ell-1}})^2\big].$$
However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\Big[\frac{1}{\ell^2} \sum_{i \neq j} t_i t_j\Big] \le E\Big[\frac{\ell^2 - \ell}{\ell^2}\, t^2\Big].$$
The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.
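Statement 2 can be checked numerically: across many replicated datasets, the variance of the leave-one-out average stays well below the variance of a single train/test split error. A small Monte Carlo sketch with a toy one-dimensional threshold classifier (all names and settings here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)


def threshold_error(train_x, train_y, test_x, test_y):
    """Classify by thresholding at the midpoint of the two class means."""
    mid = 0.5 * (train_x[train_y == 0].mean() + train_x[train_y == 1].mean())
    pred = (test_x > mid).astype(int)
    return float(np.mean(pred != test_y))


def sample_dataset(n):
    """Balanced two-class 1-d Gaussian data, class means at -1 and +1."""
    y = rng.permutation(np.repeat([0, 1], n // 2))
    x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)
    return x, y


loo_means, single_errs = [], []
for _ in range(300):
    x, y = sample_dataset(20)
    # leave-one-out errors averaged over the same dataset (the estimator L_1)
    loo = [threshold_error(np.delete(x, i), np.delete(y, i),
                           x[i:i + 1], y[i:i + 1]) for i in range(len(y))]
    loo_means.append(float(np.mean(loo)))
    # one classifier trained on ell - 1 samples, tested on a single fresh sample
    xt, yt = sample_dataset(2)
    single_errs.append(threshold_error(x[:-1], y[:-1], xt[:1], yt[:1]))
```

Comparing `np.var(loo_means)` with `np.var(single_errs)` illustrates the optimistic bias: the subsampled variance comes out smaller.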
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z; f_{z^{\ell-1}})$ is $P_{D^{\ell-1}}\{t < \xi\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$, where $r_i$ is the error on the $i$th left-out sample, is $P_{D^{\ell}}\{r < \xi\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is
$$P(t < \xi) = \int_{-\infty}^{\xi} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\xi - t)\, p(t)\, dt.$$

The distribution function for the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$ can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$P(r < \xi) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\xi - r_i)\, p(r_i)\, dr_i,$$
which is the same as $P(t < \xi)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us
$$P\Big\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \epsilon\Big\} \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \epsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \le 20$, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the subsampling procedure proposed gives us increasingly accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
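The quantile estimation above can be sketched in a few lines. Note one substitution: the paper invokes the Kolmogorov–Smirnov series; for a closed-form bound we use the Dvoretzky–Kiefer–Wolfowitz inequality, which controls the same sup-distance $\sup_x |F(x) - F_\ell(x)|$. Function names are ours:

```python
import numpy as np


def error_quantiles(error_rates, q_low=0.25, q_high=0.75):
    """25th/75th quantiles of the empirical distribution of subsample errors."""
    e = np.asarray(error_rates, dtype=float)
    return float(np.quantile(e, q_low)), float(np.quantile(e, q_high))


def dkw_halfwidth(n, delta=0.05):
    """eps such that sup_x |F(x) - F_n(x)| <= eps with probability >= 1 - delta,
    by the Dvoretzky-Kiefer-Wolfowitz inequality (a KS-type bound)."""
    return float(np.sqrt(np.log(2.0 / delta) / (2.0 * n)))
```

With, say, 100 subsample error rates and $\delta = 0.05$, the empirical CDF (and hence each empirical quantile's level) is pinned down to within about 0.14.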
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a. A motivation from the statistical mechanics approach to learning
In this appendix we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be
$$\sum_{i=1}^{p} \theta\big[-(t \cdot z_i)(w \cdot z_i)\big],$$

where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio $\beta = n/m$ (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for $\beta$. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-$\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-$\beta$ limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-$\beta$ limit, where the dependence of generalization error on $\beta$ (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as $0.50\,n^{-1}$ (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass, of version space, the space of all hypotheses consistent with the data), the generalization error scales as $0.44\,n^{-1}$ (Opper et al., 1991). For the Adaline learning rule the error scales as $0.24\,n^{-1/2}$, and for the Hebb rule as $0.40\,n^{-1/2}$ (see Watkin [1993] for a review). The dependence on n is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = a\,n^{-\alpha} + b$ for modeling the generalization error.
2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class $f \in H$, and the optimal classification function $f_T$ is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields $\alpha = 0.5$. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function $V(\cdot, \cdot)$ is smooth, then the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all $x \in X$ and $y \in Y$. By a smooth loss function we mean loss functions that are $\ell_p$ with $p \ge 1$, or Lipschitz over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs $f(x) = \pm 1$ the loss function is $\ell_0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the $\ell_0$ loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics A 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
FIG. 7. Learning curves in the malignant versus normal colon tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 8. Learning curves in the cancerous versus normal ovarian tissue classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
was 18,432. The data was preprocessed by taking the base-2 logarithm of all input values. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel and no feature selection. Error rates were estimated for training set sizes of n = 5, 10, 15, 20, 25, 30, 35, 40. In Fig. 9, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 36 samples. A p-value of 0.05 is achieved at about 5 samples. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.57\, n^{-0.7073} + 0.006.$$
Brain cancer treatment outcome
The dataset was obtained from 39 samples of patients that had successful treatment outcome (alive two years after treatment) and 21 samples of patients with poor treatment outcome. All patients had childhood medulloblastomas (Pomeroy et al., 2002). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40.

Statistical significance on this dataset (a p-value of 0.05) is achieved at about 45 samples, which is larger than any of the training set sizes for which error rates were estimated, so strictly speaking we cannot apply the methodology.

However, we can examine how accurately a learning curve fit to the error rates for the above training set sizes can extrapolate. In Fig. 10, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 60 samples. As expected, this model is not very accurate and overestimates the error rate at 59 samples by more than 7%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 1.115\, n^{-0.3295} + 0.06.$$
FIG. 9. Learning curves in the diffuse large B-cell versus follicular morphology classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 10. Learning curves in the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome
The dataset was obtained from 32 samples of patients that had successful treatment outcome (alive two years after treatment) and 26 samples of patients with poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criteria (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 0.9431\, n^{-0.2957} + 0.1.$$
Breast cancer treatment outcome
The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients that were disease free for at least five years (van 't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given, along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

$$\mathrm{error}(n) = 4.852\, n^{-0.733} + 0.1.$$
FIG. 11. Learning curves in the lymphoma treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.

FIG. 12. Learning curves in the breast cancer treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error; the bottom line is for the 25th quantile; the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5. CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics for larger datasets that include as yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the inverse power-law scaling model proposed fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
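The significance assessment referred to above can be sketched generically: shuffle the labels many times, recompute the error statistic for each shuffling, and report the (smoothed) fraction of shufflings that do at least as well as the true labels. This is an illustrative sketch, not the authors' implementation; `stat_fn` is a placeholder that would typically be a leave-one-out or subsampled error estimate:

```python
import numpy as np


def permutation_p_value(stat_fn, X, y, n_perm=1000, seed=0):
    """p-value of an error statistic under random label permutations.

    Lower error means better performance, so we count permutations whose
    error is <= the observed one; add-one smoothing avoids p = 0.
    """
    rng = np.random.default_rng(seed)
    observed = stat_fn(X, y)
    count = sum(stat_fn(X, rng.permutation(y)) <= observed
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)
```

A classifier is then deemed significant at a given training set size when this p-value falls below the chosen threshold (0.05 in the paper's experiments).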
The estimations of the asymptotic Bayes error b the learning rate a and decay rate reg can also be useddirectly to characterize the dif culty of a problem and the complexity of a model They can provide abasis for comparing and contrasting models and problems To illustrate we show in Fig 13 the values ofthese parameters for the examples discussed in the paper The morphology and treatment outcome datasetscluster with respect to reg and b We have not elaborated on this aspect of the analysis but it is certainly aninteresting direction to pursue in the future
In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable
FIG 13 A plot of 1=reg versus macr for the eight datasets The diamonds correspond to treatment outcome problemsand the squares correspond to morphology prediction problems
ESTIMATING DATASET SIZE REQUIREMENTS 137
tool when designing future experiments either for evaluating whether it is worthwhile to collect additionalsamples or for obtaining a deeper insight into the complexity of a given classi cation problem basedon preliminary data Table 1 shows a summary of the results for the examples described in this paperThe results of applying this method to those examples suggest that minimum dataset size requirementsfor morphological classi cations are typically in the 10ndash20 samples range and upwards of 50 samples fortreatment outcome classi cation These results can be used to provide general rule-of-thumb guidelinesbut the exact numbers for a given problem are dataset and classi er dependent This method can also beapplied to other domains where a prospective estimation of the number of samples is relevant as is thecase in many problems using molecular features to classify biological samples eg classi cation basedon proteomic mass spectroscopy data chemosensitivity prediction survival analysis and putative classdiscovery using clustering
APPENDIX 1 BIAS PROPERTIES OF THE MEAN VARIANCE ANDQUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1 The procedure of excluding p samples from a dataset of size ` constructing a classi erand then testing on the p samples left out is designated as follows
Lpz1 z` D 1p
X
p
Qzp f z`iexclp
where z D x y f z`iexclp is the classi er constructed with p samples left out and Qzp f z`iexclp is theerror of this classi er on the p samples left out This procedure is unbiased
ELpz1 z` D EQz f z`iexclp
which means that the expected error when the classi er is trained with ` iexcl p samples is the same as theexpected error of procedure Lp
Proof The proof is a straightforward extension of the leave-one-out case which was derived by Luntzand Brailovsky (Luntz and Brailovsky 1969) by the following series of transformations
EL1z1 z` D 1`
Z X
iD1
Qzi f z`iexcl1dPz1 dPz`
D 1`
Z X
iD1
Qzi f z`iexcl1dPzidPz1 dPziiexcl1dPziC1 dPz`
D 1`
X
iD1
EQzi f z`iexcl1
D EQz f z`iexcl1
The implication of this statement is that the subsampling procedure proposed is unbiased and in expectationgives us more accurate estimates of the true error of a classi er trained with ` iexcl p samples as the numberof subsamples increases
1b The variance of the leave-p-out estimator is biased and optimistic
Statement 2 The variance of the leave-p-out estimator is less than or equal to the variance of aclassi er trained with `iexclp samples and tested on an independent sample so the variance of the leave-p-out
138 MUKHERJEE ET AL
procedure is not necessarily unbiased This procedure is not necessarily unbiased and in general will beoptimistic
V Lpz1 z` middot VQz f z`iexclp
which means that the expected variance when the classi er is trained with ` iexcl p samples is greater thanor equal to the variance of the procedure Lp
Proof Again we will prove the leave-one-out case and the leave-p-out case is a straightforwardextension
The variance of training sets of size ` iexcl 1 is
VQz f z`iexcl1 D EbQz f z`iexcl12 iexcl [EQz f z`iexcl1]2c
The variance of the estimator
V1`
L1z1 z` D V
1`
X
iD1
ti
where ti is whether an error is made or not when the ith point is left out
Qzi f x1 xiiexcl1 xiC1 x`
We can rewrite this as
V
1`
X
iD1
ti
D E
Aacute1`
X
iD1
ti
2
iexcl
Aacute
E1`
X
iD1
ti
2
From statement 1 we know
E1`
X
iD1
ti D EQz f z`iexcl1
So
V
1`
X
iD1
ti
D E
Aacute1`
X
iD1
ti
2
iexcl [EQz f z`iexcl1]2
One can show that
E
Aacute1`
X
iD1
ti
2
middot E[Qz f z`iexcl12]
One can write
E
Aacute1`
X
iD1
ti
2
D E
0
1`2
X
iD1
t2i C 1
`2
X
i 6Dj
ti tj
1
A
If the random variables ti tj are identical and independent then the above equation can be rewritten
E
0
1`2
X
iD1
t2i C 1
`2
X
i 6Dj
ti tj
1
A D E
sup3t2
`C
`2 iexcl `
`2t2
acuteD E[Qz f z`iexcl1]2
ESTIMATING DATASET SIZE REQUIREMENTS 139
However t1 t` are not independent and are correlated so
E
0
X
i 6Dj
ti tj
1
A middot E
sup3`2 iexcl `
`2t2
acute
The implication of this statement is that the variance of the subsampling procedure proposed is biased anddoes not give an accurate estimate of the variance of a classi er trained with ` iexcl p samples and in generalthe variance of the subsampling procedure will be smaller
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z, f(z^{\ell-1}))$ is $P_{D^{\ell-1}}\{t < \xi\}$. The cumulative distribution function of the random variable $r = L_1(z_1, \ldots, z_\ell)$ is $P_{D^{\ell}}\{r < \xi\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is
$$P(t < \xi) = \int_{-\infty}^{\xi} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\xi - t)\,p(t)\,dt.$$
The distribution function for the random variable $r = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to that used in the proof of statement 1:

$$P(r < \xi) = \frac{1}{\ell}\sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\xi - r_i)\,p(r_i)\,dr_i,$$
which is the same as $P(t < \xi)$. We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from this sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us
$$P\left\{\sqrt{\ell}\,\sup_x |F(x) - F_\ell(x)| < \epsilon\right\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^{2}\epsilon^{2}},$$
where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the proposed subsampling procedure gives us increasingly accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
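The bound above can be made operational. The sketch below (illustrative; the function names are ours, not the paper's) sums the quoted Kolmogorov–Smirnov series and inverts it by bisection to find the deviation ε between the empirical and true distribution functions that holds with probability 1 − δ.

```python
# Sketch: evaluate and invert the asymptotic Kolmogorov-Smirnov distribution.
import math

def ks_cdf(x, terms=100):
    """Asymptotic KS CDF: P{ sqrt(l) * sup|F - F_l| < x }."""
    if x <= 0:
        return 0.0
    return 1.0 - 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                           for k in range(1, terms + 1))

def deviation_bound(l, delta):
    """Smallest eps with P{ sup|F - F_l| < eps } >= 1 - delta, by bisection."""
    lo, hi = 0.0, 5.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ks_cdf(mid) < 1.0 - delta:
            lo = mid
        else:
            hi = mid
    return hi / math.sqrt(l)

# e.g., with l = 40 subsampled error rates and 95% confidence:
eps = deviation_bound(40, 0.05)
print(eps)   # roughly 0.21: the empirical CDF is within ~0.21 of the true one
```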
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
$$y = \mathrm{sign}(w \cdot z),$$
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta[-(t \cdot z_i)(w \cdot z_i)],$$
where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally follows a power law.
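The disagreement count above is easy to simulate. The following sketch (an illustration under the standard SM assumption of random uncorrelated Gaussian patterns, not code from the paper) draws a random teacher and a perturbed student and compares the empirical fraction of disagreements with the known overlap formula ε = arccos(R)/π, where R is the normalized teacher-student overlap.

```python
# Teacher-student sketch: count sign disagreements theta[-(t.z)(w.z)].
import numpy as np

rng = np.random.default_rng(1)
m, p = 50, 100_000                  # number of attributes, number of test patterns
t = rng.normal(size=m)              # teacher weight vector
w = t + 0.7 * rng.normal(size=m)    # student: a noisy copy of the teacher (illustrative)

z = rng.normal(size=(p, m))         # random uncorrelated input patterns
# fraction of patterns on which teacher and student disagree
errors = float(np.mean((z @ t) * (z @ w) < 0))

# known result for Gaussian inputs: error = (angle between t and w) / pi
R = float(t @ w / (np.linalg.norm(t) * np.linalg.norm(w)))
print(errors, np.arccos(R) / np.pi)
```

The two printed numbers agree closely, which is the sense in which the teacher-student overlap quantifies generalization ability.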
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low β limit is the most appropriate when considering microarray data. However, some further insights can also be gained by considering the high β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^{−1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass, of version space: the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{−1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{−1/2}, and for the Hebb rule as 0.40 n^{−1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{−α}, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters of the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, this approach strongly motivates the use of e(n) = a n^{−α} + b for modeling the generalization error.
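For concreteness, the model e(n) = a n^{−α} + b can be fit to a set of empirical error rates as follows. This is an illustrative sketch, not the paper's actual fitting code: it grid-searches the asymptotic error b and fits log(e − b) linearly against log n; a general nonlinear least-squares routine such as scipy.optimize.curve_fit would serve equally well.

```python
# Sketch: fit the inverse power-law learning curve e(n) = a * n**(-alpha) + b.
import numpy as np

def fit_power_law(n, e):
    """Return (a, alpha, b) minimizing the squared residual over a grid in b."""
    best = None
    # grid over the asymptotic (Bayes) error b, keeping e - b > 0
    for b in np.linspace(0.0, float(min(e)) - 1e-6, 200):
        slope, intercept = np.polyfit(np.log(n), np.log(e - b), 1)
        a, alpha = float(np.exp(intercept)), -float(slope)
        resid = float(np.sum((a * n ** (-alpha) + b - e) ** 2))
        if best is None or resid < best[0]:
            best = (resid, a, alpha, b)
    return best[1], best[2], best[3]

# synthetic check: error rates generated from a known curve are recovered
n = np.array([5., 10., 15., 20., 25., 30., 35., 40.])
e = 0.57 * n ** (-0.71) + 0.006     # parameters of the magnitude seen in our fits
a, alpha, b = fit_power_law(n, e)
print(a, alpha, b)
```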
2b. A motivation from an approximation theory point of view

Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function $f_T$ is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$
where α > 0. Worst case analyses measure the distance between the two functions as the value at the point of greatest deviation between them.

For the function classes used in most algorithms, the worst case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are $L_p$ with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is $L_0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the $L_0$ loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
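The Lipschitz distinction above can be seen numerically. In this illustrative sketch, the maximum change of each loss per unit change in the classifier output f is estimated on a grid: the 0-1 loss has a jump whose difference quotient grows without bound as the grid is refined, while the SVM hinge loss max(0, 1 − yf) never changes faster than |Δf|.

```python
# Sketch: 0-1 loss is not Lipschitz; the hinge loss is Lipschitz with constant 1.
import numpy as np

f = np.linspace(-2.0, 2.0, 100_001)           # grid of classifier outputs
y = 1.0
zero_one = (y * f < 0).astype(float)          # 0-1 classification loss theta(-y f)
hinge = np.maximum(0.0, 1.0 - y * f)          # SVM hinge loss

df = f[1] - f[0]
jump_ratio = float(np.max(np.abs(np.diff(zero_one))) / df)   # difference quotient at the jump
hinge_ratio = float(np.max(np.abs(np.diff(hinge))) / df)     # stays near 1 everywhere
print(jump_ratio, hinge_ratio)
```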
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge, UK.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group WhiteheadMIT Center for GenomeResearch and the Golub Laboratory Dana-Farber Cancer Institute for many valuable discussions Thiswork is supported in part by grants from Affymetrix Inc Millennium Pharmaceuticals Inc Bristol-MyersSquibb Company and the Of ce of Science (BER) US Department of Energy Grant No DE-FG02-01ER63185 Sayan Mukherjee was supported by a SLOANDOE postdoctoral fellowship
REFERENCES
Adcock C 1997 Sample size determination a review Statistician 46 261ndash283Alizadeh AA Eisen MB Davis RE Ma C Lossos IS Rosenwald A Boldrick JC Sabet H Tran T Yu
X Powell JI Yang L Marti GE Moore T Hudson J Jr Lu L Lewis DB Tibshirani R Sherlock GChan WC Greiner TC Weisenburger DD Armitage JO Warnke R and Staudt LM 2000 Distinct typesof diffuse large B-cell lymphoma identi ed by gene expression pro ling Nature 403 503ndash511
Anderson J 1983 Architecture of Cognition Harvard University Press Cambridge MABhattacharjeeA Richards WG Staunton J Li C Monti S Vasa P Ladd C Beheshti J Bueno R Gillette M
Loda M Weber G Mark EJ Lander ES Wong W Johnson BE Golub TR Sugarbaker DJ MeyersonM 2001 Classi cation of human lung carcinomas by mRNA expression pro ling reveals distinct adenocarcinomasubclasses Proc Natl Acad Sci USA 98 13790ndash13795
Chapelle O Vapnik V Bousquet O Mukherjee S 2001 Choosing multiple parameters for support vector machinesMachine Learning 46 131ndash159
Chipping Forecast 1999 Nature Genetics vol 21 supplementCortes C Jackel L Solla S Vapnik V and Denker S 1994 Asymptotic values and rates of convergence In
Advances in Neural Information Processing Systems VI Morgan Kaufmann San Francisco CADevoye L Gyor L and Lugosi G 1997 A Probabilistic Theory of Pattern Recognition Springer-Verlag New
YorkDietrich R Opper M and Sompolinsky H 1999 Statistical mechanics of support vector networks Phys Rev Letters
82 2975ndash2978Dietrich R Opper M and Sompolinsky H 2000 Support vectors and statistical mechanics In Smola AJ and
Bartlett PJ eds Advances in Large Margin Classi ers 359ndash368 MIT Press Cambridge MAEngel A and Van den Broeck C 2001 Statistical Mechanics of Machine Learning Cambridge University PressGolub TR Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh ML Downing
JR Caligiuri MA Bloom eld CD Lander ES 1999 Molecular classi cation of cancer Class discovery andclass prediction by gene expression monitoring Science 286 531ndash537
Golland P Mukherjee S Tamayo P and Fischl B 2002 Permutation tests for classi cation Submitted to Advancesin Neural Information Proceedings
142 MUKHERJEE ET AL
Good P 1994 Permutation Tests A Practical Guide to Resampling Methods for Testing Hypothesis Springer-VerlagNew York
Guyon I Makhoul J Schwartz R and Vapnik V 1998 What size test set gives good error estimates IEEETransactions on Pattern Analysis and Machine Intelligence 20 52ndash64
Hertz J Krogh A and Palmer R 1991 Introduction to the Theory of Neural Computation Addison-Wesley BostonMA
Kerr A and Churchill G 2001a Experimental design for gene expression microarrays Biostatistics 2 183ndash201Kerr A and Churchill G 2001b Statistical design and the analysis of gene expression microarrays Genetic Research
77 123ndash128Li C and Wong WH 2001 Model-based analysis of oligonucleotide arrays Model validation design issues and
standard error application Genome Biol 2(8) research 00321-003211Luntz A and Brailovsky V 1969 On estimation of characters obtained in statistical procedure of recognition
Technicheskaya Kibernetica 3 (in Russian)Niyogi P and Girosi F 1996 On the relationship between generalization error hypothesis complexity and sample
complexity for radial basis functions Neural Computation 8 819ndash842Noterman D Alon U Sierk A Levine A 2001 Transcriptional gene expression pro les for colorectal adenoma
adenocarcinoma and normal tissue examined by oligonucleotide array Cancer Res 61 3124ndash3130Opper M and Haussler D 1991 Generalization performance of Bayes optimal classi cation algorithms for learning
a perceptron Phys Rev Letters 66 2677ndash2680Opper M and Kinzel W 1995 Statistical Mechanics of Generalisation in Models of Neural Networks Springer
Verlag HeidelbergOpper M Kinzel W Kleinz J and Nehl R 1990 On the ability of the optimal perceptron to generalise J Physics
23 581ndash586Pomeroy S Tamayo P Gaasenbeek M Sturla L Angelo M McLaughlin M Kim J Goumnerova L Black
P Lau C Allen J Zagzag D Olson J Curran T Wetmore C Biegel J Poggio T Mukherjee S RifkinR Califano A Stolovitzky G Louis D Mesirov JP Lander E Golub T 2002 Gene expression-basedclassi cation and outcome prediction of central nervous system embryonal tumors Nature 415 436ndash442
Ramaswamy S Osteen RT and Shulman LN 2001b Metastatic cancer from an unknown primary site In LenhardRE Osteen RT and Gansler T eds Clinical Oncology 711ndash719 American Cancer Society Atlanta GA
Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C-H Angelo M Ladd C Reich M Latulippe EMesirov JP Poggio T Gerald W Loda M Lander ES Golub TR 2001a Multiclass cancer diagnosis byusing tumor gene expression signatures Proc Natl Acad Sci USA 98 15149ndash15154
Rosenblatt F 1962 Principles of Neurodynamics Spartan Books New YorkSchummer M Ng W Bumgarner R Nelson P Schummer B Bednarski D Hassell R Baldwin R Karlan
B Hood L 1999 Comparitive hybridization of an array of 21500 ovarian cDNAs for the discovery of genesoverexpressed in ovarian carcinomas Gene 238 375ndash385
Shipp M Ross K Tamayo P Weng A Kutok J Aguiar R Gaasenbeek M Angelo M Reich M PinkusG Ray T Koval M Last K Norton A Lister T Mesirov JP Neuberg D Lander E Aster J GolubT 2001 Diffuse large B-cell lymphoma outcome prediction by gene expression pro ling and supervised machinelearning Nature Medicine 8 68ndash74
Shrager J Hogg T and Huberman BA 1988 A graph-dynamic model of the power law of practice and theproblem-solving fan effect Science 242 414ndash416
Slonim DK Tamayo P Mesirov JP Golub TR Lander ES 2000 Class prediction and discovery using geneexpression data In Proc 4th Ann Int Conf Computational Molecular Biology RECOMB ACM Press New York
Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557
134 MUKHERJEE ET AL
FIG. 10. Learning curves for the brain tumor treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
Lymphoma treatment outcome

The dataset was obtained from 32 samples from patients who had a successful treatment outcome (alive two years after treatment) and 26 samples from patients with a poor treatment outcome. All patients had diffuse large cell lymphoma (DLCL) (Shipp et al., 2001). The dimensionality of the dataset is 7,129. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, selecting 150 features using the radius-margin criterion (Chapelle et al., 2001). Error rates were estimated for training set sizes of n = 20, 25, 30, 35, 40. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 50 samples. In Fig. 11, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 58 samples. As expected, this model is not very accurate and overestimates the error rate at 57 samples by more than 9%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.9431 n^(−0.2957) + 0.1
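The inverse power-law fit that produces such coefficients can be sketched with a standard nonlinear least-squares routine. The error values below are synthetic, generated from the fitted lymphoma model itself rather than from the measured rates, so the fit simply recovers the stated coefficients; with real subsampled error rates, the same call would estimate them.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    # Inverse power law of Equation (1): e(n) = a * n^(-alpha) + b
    return a * n ** (-alpha) + b

# Synthetic mean error rates at the stated training-set sizes,
# generated from the fitted lymphoma model (not the measured data).
sizes = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
errors = learning_curve(sizes, 0.9431, 0.2957, 0.1)

# Fit the learning rate a, decay rate alpha, and Bayes error b.
(a, alpha, b), _ = curve_fit(learning_curve, sizes, errors, p0=[1.0, 0.5, 0.05])

# Extrapolate the expected error rate to a larger future dataset.
predicted = learning_curve(57.0, a, alpha, b)
```

With real data, the measured average error rates at each subsampled training-set size would replace the synthetic `errors` array.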
Breast cancer treatment outcome

The dataset consists of expression levels of 34 samples from patients with breast tumors that metastasized within five years of disease onset and 44 samples from patients who were disease free for at least five years (Van't Veer et al., 2002). The dimensionality of the dataset was 24,624. No preprocessing was performed. The classifier used was a Support Vector Machine (Vapnik, 1998) with a linear kernel, without feature selection. Error rates were estimated for training set sizes of n = 10, 20, 30, 40, 50, 60, 70. Statistical significance on this dataset (a p-value of 0.05) is achieved at about 65 samples. In Fig. 12, a plot of the learning curve and its 25th and 75th quantiles is given along with the leave-one-out error of the 78 samples. As expected, this model is not very accurate and overestimates the error rate at 77 samples by more than 6%. The learning curve estimate of the error rate as a function of n is

error(n) = 0.4852 n^(−0.0733) + 0.1
FIG. 11. Learning curves for the lymphoma treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves for the breast cancer treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error, the bottom line is for the 25th quantile, and the top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics to larger datasets that include as yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate but still indicative extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10-20 training samples, and "reasonably accurate" extrapolation requires 30-40 training samples. In contrast, for treatment outcome, significance is achieved with 45-60 training samples, and "reasonably accurate" extrapolation requires on the order of 75-100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
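The permutation-test step of the methodology can be sketched as follows. Here `train_and_error` is a placeholder for any routine that trains a classifier on the given data and returns its estimated error rate (the SVM subsampling procedure of the paper, or anything else).

```python
import numpy as np

def permutation_p_value(observed_error, X, y, train_and_error,
                        n_permutations=500, seed=0):
    """Assess significance of a classifier's error rate by a label-permutation test.

    train_and_error(X, y) is a placeholder: it trains a classifier on (X, y)
    and returns its estimated error rate.
    """
    rng = np.random.default_rng(seed)
    # Null distribution: error rates obtained after destroying the label
    # structure by randomly permuting y.
    null = np.array([train_and_error(X, rng.permutation(y))
                     for _ in range(n_permutations)])
    # p-value: fraction of permuted datasets classified at least as well
    # as the observed (unpermuted) error rate.
    return float(np.mean(null <= observed_error))
```

A classifier at a given training-set size is deemed significant when this p-value falls below the chosen threshold (0.05 in the experiments above).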
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.

In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable
FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10-20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
APPENDIX 1 BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

L_p(z_1, …, z_ℓ) = (1/p) Σ_p Q(z_p, f(z_{ℓ−p})),

where z = (x, y), f(z_{ℓ−p}) is the classifier constructed with p samples left out, and Q(z_p, f(z_{ℓ−p})) is the error of this classifier on the p samples left out. This procedure is unbiased:

E[L_p(z_1, …, z_ℓ)] = E[Q(z, f(z_{ℓ−p}))],

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of procedure L_p.
Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969) through the following series of transformations:

E[L_1(z_1, …, z_ℓ)] = (1/ℓ) ∫ Σ_{i=1}^{ℓ} Q(z_i, f(z_{ℓ−1})) dP(z_1) ⋯ dP(z_ℓ)

 = (1/ℓ) Σ_{i=1}^{ℓ} ∫ Q(z_i, f(z_{ℓ−1})) dP(z_i) dP(z_1) ⋯ dP(z_{i−1}) dP(z_{i+1}) ⋯ dP(z_ℓ)

 = (1/ℓ) Σ_{i=1}^{ℓ} E[Q(z_i, f(z_{ℓ−1}))]

 = E[Q(z, f(z_{ℓ−1}))].

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
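The leave-p-out subsampling estimator analyzed in this appendix can be sketched as below. A nearest-centroid rule stands in for the paper's SVM purely to keep the example self-contained; the outputs are the mean error (Statement 1) and the 25th/75th quantiles of the subsampled error rates (Appendix 1c).

```python
import numpy as np

def nearest_centroid_fit(Xtr, ytr):
    # Toy stand-in for the paper's SVM: classify by the nearer class centroid.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    def predict(X):
        d0 = np.linalg.norm(X - c0, axis=1)
        d1 = np.linalg.norm(X - c1, axis=1)
        return (d1 < d0).astype(int)
    return predict

def leave_p_out(X, y, p, n_subsamples, fit=nearest_centroid_fit, seed=0):
    # Repeatedly leave p samples out, train on the rest, test on the p left out.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_subsamples):
        idx = rng.permutation(len(y))
        test, train = idx[:p], idx[p:]
        predict = fit(X[train], y[train])
        errs.append(np.mean(predict(X[test]) != y[test]))
    errs = np.array(errs)
    # Mean error rate and its 25th/75th quantile envelope.
    return errs.mean(), np.quantile(errs, [0.25, 0.75])
```

Running this at several training-set sizes yields the points to which the learning-curve model e(n) = a·n^(−α) + b is fit.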
1b The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample, so the variance of the leave-p-out procedure is not necessarily unbiased and, in general, will be optimistic:

V[L_p(z_1, …, z_ℓ)] ≤ V[Q(z, f(z_{ℓ−p}))],

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure L_p.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size ℓ − 1 is

V[Q(z, f(z_{ℓ−1}))] = E[Q(z, f(z_{ℓ−1}))²] − (E[Q(z, f(z_{ℓ−1}))])².

The variance of the estimator is

V[(1/ℓ) L_1(z_1, …, z_ℓ)] = V[(1/ℓ) Σ_{i=1}^{ℓ} t_i],

where t_i indicates whether or not an error is made when the i-th point is left out,

t_i = Q(z_i, f(x_1, …, x_{i−1}, x_{i+1}, …, x_ℓ)).

We can rewrite this as

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[(1/ℓ) Σ_{i=1}^{ℓ} t_i])².

From Statement 1 we know

E[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[Q(z, f(z_{ℓ−1}))].

So

V[(1/ℓ) Σ_{i=1}^{ℓ} t_i] = E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] − (E[Q(z, f(z_{ℓ−1}))])².

One can show that

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] ≤ E[Q(z, f(z_{ℓ−1}))²].

One can write

E[((1/ℓ) Σ_{i=1}^{ℓ} t_i)²] = E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j].

If the random variables t_i, t_j were identical and independent, then the above equation could be rewritten as

E[(1/ℓ²) Σ_{i=1}^{ℓ} t_i² + (1/ℓ²) Σ_{i≠j} t_i t_j] = E[t²/ℓ + ((ℓ² − ℓ)/ℓ²) t²] = E[Q(z, f(z_{ℓ−1}))]².

However, t_1, …, t_ℓ are not independent but correlated, so

E[(1/ℓ²) Σ_{i≠j} t_i t_j] ≤ E[((ℓ² − ℓ)/ℓ²) t²].

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
1c Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable t = Q(z, f(z_{ℓ−1})) is P_{D_{ℓ−1}}{t < τ}. The cumulative distribution function of the random variable r = (1/ℓ) L_1(z_1, …, z_ℓ) is P_{D_ℓ}{r < τ}. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

P(t < τ) = ∫_{−∞}^{τ} p(t) dt = ∫_{−∞}^{∞} θ(τ − t) p(t) dt.

The distribution function for the random variable r = (1/ℓ) Σ_{i=1}^{ℓ} r_i can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

P(r < τ) = (1/ℓ) Σ_{i=1}^{ℓ} ∫_{−∞}^{∞} θ(τ − r_i) p(r_i) dr_i,

which is the same as P(t < τ).

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov-Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov-Smirnov distribution gives us

P{√ℓ sup_x |F(x) − F_ℓ(x)| < ε} ≈ 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²ε²},

where F(x) is the distribution function of the error rates and F_ℓ(x) is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov-Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
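The tail series above is straightforward to evaluate numerically; a minimal sketch:

```python
import math

def ks_limit_cdf(eps, terms=100):
    # P{ sqrt(l) * sup_x |F(x) - F_l(x)| < eps }
    #   ~= 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * eps^2)
    return 1.0 - 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * eps * eps)
                           for k in range(1, terms + 1))
```

For example, the familiar 95% Kolmogorov-Smirnov critical value √ℓ · sup_x |F(x) − F_ℓ(x)| ≈ 1.36 can be read off from this function.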
APPENDIX 2 MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix, we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivates the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

y = sign(w · z),
where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis, we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50n^(−1) (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44n^(−1) (Opper et al., 1991). For the Adaline learning rule, the error scales as 0.24n^(−1/2), and for the Hebb rule as 0.40n^(−1/2) (see Watkin [1993] for a review). The dependence on n is thus approximately n^(−α), with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = an^(−α) + b for modeling the generalization error.
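A minimal teacher-student simulation of this setup, assuming Gaussian random patterns and the classic perceptron update rule (not the SVM used elsewhere in the paper), illustrates the falling generalization error as n grows at fixed m:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                  # number of attributes (kept small here)
teacher = rng.standard_normal(m)        # teacher weight vector t

def generalization_error(n, epochs=100):
    # Train a student perceptron on n teacher-labelled random patterns,
    # then estimate its generalization error on fresh random patterns.
    X = rng.standard_normal((n, m))
    y = np.sign(X @ teacher)
    w = np.zeros(m)
    for _ in range(epochs):             # classic perceptron updates
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    X_test = rng.standard_normal((5000, m))
    return float(np.mean(np.sign(X_test @ teacher) != np.sign(X_test @ w)))

err_small, err_large = generalization_error(20), generalization_error(200)
```

This only illustrates the qualitative n^(−α) behaviour; the exact prefactors quoted above come from the replica calculations, not from this simulation.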
2b A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^(−α)),

where α > 0. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^(−α)) for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz, over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms, the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261-283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790-13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131-159.

Chipping Forecast. 1999. Nature Genetics 21 (supplement).

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975-2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359-368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52-64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123-128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1-0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819-842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124-3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677-2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581-586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436-442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711-719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149-15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375-385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68-74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414-416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499-556.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316-S322.

Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov
JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322
Address correspondence toSayan Mukherjee
Room 201 Building E-2545 Carleton Street
Cambridge MA 02139
E-mail sayangenomewimitedu
ESTIMATING DATASET SIZE REQUIREMENTS 135
FIG. 11. Learning curves for the lymphoma treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
FIG. 12. Learning curves for the breast cancer treatment outcome classification, constructed using the sample sizes stated above. The middle line is the learning curve for the mean error. The bottom line is for the 25th quantile. The top line is for the 75th quantile. The diamond is the leave-one-out error, and the circles are the measured average error rates.
5 CONCLUSIONS
We have described a methodology for assessing the significance of a classifier's performance via a permutation test and for constructing a learning curve to extrapolate error statistics to larger datasets that include as yet unseen samples. We applied the method to eight cancer classification problems of varying levels of difficulty. Based on the results of the previous section, one can see that the proposed inverse power-law scaling model fits the empirical error rates reasonably well. The classifier we used was an SVM, but the methodology is applicable to other algorithms (e.g., weighted voting, k-nearest neighbors, logistic regression, etc.). For the morphology classification problems, the extrapolation is quite accurate. For the treatment outcome classification problems, the combination of the increased complexity of the problems and the limited dataset sizes yields a less accurate, but still indicative, extrapolation. As expected, the model improves as larger training sample sizes are used in the learning curve fit (see Figs. 4 and 5). The learning curves bear out the empirical observation that morphological distinctions are more dramatic, and thus in general "simpler" problems, than the more subtle distinctions that must be determined for treatment outcome prediction. Significance on morphology problems is achieved with 10–20 training samples, and "reasonably accurate" extrapolation requires 30–40 training samples. In contrast, for treatment outcome, significance is achieved with 45–60 training samples, and "reasonably accurate" extrapolation requires on the order of 75–100 training samples. For morphological distinctions, the learning curve prediction is reasonably close to the actual leave-one-out error measured at a larger size. The 25th and 75th quantile models provide useful error-bar envelopes that enclose the observed error rates for those problems. For treatment outcome prediction, due to the large training set size required to achieve significance and the small available dataset sizes, we do not have enough significant classifiers with which to construct an accurate learning curve. Consequently, we get less accurate estimates of the leave-one-out error on the entire dataset for the treatment outcome examples, with differences of 7% for brain tumor outcome, 9% for lymphoma treatment outcome, and 8% for breast tumor metastasis.
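The learning-curve extrapolation described above amounts to fitting e(n) = a n^(-α) + b to measured error rates and evaluating the fit at a larger n. A minimal sketch, assuming SciPy is available; the training sizes and error rates below are invented placeholders, not values from any of the paper's datasets:

```python
import numpy as np
from scipy.optimize import curve_fit

# Inverse power-law learning curve: e(n) = a * n**(-alpha) + b,
# where b plays the role of the asymptotic (Bayes) error.
def learning_curve(n, a, alpha, b):
    return a * n ** (-alpha) + b

# Hypothetical mean error rates at increasing training set sizes.
n_train = np.array([10, 15, 20, 30, 40, 60])
err = np.array([0.32, 0.27, 0.24, 0.20, 0.18, 0.16])

# Fit with non-negativity bounds so a, alpha, and b stay interpretable.
(a, alpha, b), _ = curve_fit(learning_curve, n_train, err,
                             p0=[1.0, 0.5, 0.1],
                             bounds=([0, 0, 0], [np.inf, 2, 1]))

# Extrapolate the expected error at a larger, not-yet-collected dataset size.
print(learning_curve(100, a, alpha, b))
```

The same fit applied separately to the 25th and 75th quantile error rates yields the envelope curves shown in the figures.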
The estimates of the asymptotic Bayes error b, the learning rate a, and the decay rate α can also be used directly to characterize the difficulty of a problem and the complexity of a model. They can provide a basis for comparing and contrasting models and problems. To illustrate, we show in Fig. 13 the values of these parameters for the examples discussed in the paper. The morphology and treatment outcome datasets cluster with respect to α and b. We have not elaborated on this aspect of the analysis, but it is certainly an interesting direction to pursue in the future.
In summary, our methodology produces reasonable, nontrivial dataset size estimates when applied to a fairly general set of molecular cancer classification problems. In this context, it can serve as a valuable tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 samples range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.

FIG. 13. A plot of 1/α versus b for the eight datasets. The diamonds correspond to treatment outcome problems, and the squares correspond to morphology prediction problems.
APPENDIX 1. BIAS PROPERTIES OF THE MEAN, VARIANCE, AND QUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1. The procedure of excluding p samples from a dataset of size ℓ, constructing a classifier, and then testing on the p samples left out is designated as follows:

$$L_p(z_1, \ldots, z_\ell) = \frac{1}{p} \sum_{p} Q(z_p; f_{z^{\ell-p}}),$$

where z = (x, y), $f_{z^{\ell-p}}$ is the classifier constructed with the p samples left out, and $Q(z_p; f_{z^{\ell-p}})$ is the error of this classifier on the p samples left out. This procedure is unbiased:

$$E\,L_p(z_1, \ldots, z_\ell) = E\,Q(z; f_{z^{\ell-p}}),$$

which means that the expected error when the classifier is trained with ℓ − p samples is the same as the expected error of the procedure $L_p$.

Proof. The proof is a straightforward extension of the leave-one-out case, which was derived by Luntz and Brailovsky (1969), via the following series of transformations:

$$E\,L_1(z_1, \ldots, z_\ell) = \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_1) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \int \sum_{i=1}^{\ell} Q(z_i; f_{z^{\ell-1}})\, dP(z_i)\, dP(z_1) \cdots dP(z_{i-1})\, dP(z_{i+1}) \cdots dP(z_\ell)$$

$$= \frac{1}{\ell} \sum_{i=1}^{\ell} E\,Q(z_i; f_{z^{\ell-1}}) = E\,Q(z; f_{z^{\ell-1}}).$$

The implication of this statement is that the proposed subsampling procedure is unbiased and, in expectation, gives us more accurate estimates of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
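Statement 1 can be checked numerically by Monte Carlo: averaging the leave-one-out error over many independently drawn datasets of size ℓ should agree with the average test error of a classifier trained on ℓ − 1 samples. The sketch below uses a toy one-dimensional two-Gaussian problem and a nearest-class-mean rule of our own choosing, not the paper's SVM or microarray data:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    # Toy 1-D two-class problem: labels are ±1, class means sit at ±1.
    y = rng.choice([-1, 1], size=n)
    x = y * 1.0 + rng.normal(size=n)
    return x, y

def nearest_mean_predict(x_tr, y_tr, x):
    # Assign each point to the class with the nearer training mean.
    mu_pos, mu_neg = x_tr[y_tr == 1].mean(), x_tr[y_tr == -1].mean()
    return np.where(np.abs(x - mu_pos) < np.abs(x - mu_neg), 1, -1)

def loo_error(x, y):
    # Leave-one-out error on a single dataset.
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        pred = nearest_mean_predict(x[mask], y[mask], x[i:i + 1])
        errs.append(pred[0] != y[i])
    return np.mean(errs)

ell, trials = 20, 300

# Average LOO error over many datasets of size ell.
loo = np.mean([loo_error(*draw(ell)) for _ in range(trials)])

# Reference: average error of a classifier trained on ell-1 samples,
# measured on fresh independent test data.
ref = []
for _ in range(trials):
    x_tr, y_tr = draw(ell - 1)
    x_te, y_te = draw(200)
    ref.append(np.mean(nearest_mean_predict(x_tr, y_tr, x_te) != y_te))
ref = np.mean(ref)

print(loo, ref)  # the two averages should agree closely
```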
1b The variance of the leave-p-out estimator is biased and optimistic
Statement 2. The variance of the leave-p-out estimator is less than or equal to the variance of a classifier trained with ℓ − p samples and tested on an independent sample; the variance of the leave-p-out procedure is therefore not necessarily unbiased and in general will be optimistic:

$$V(L_p(z_1, \ldots, z_\ell)) \le V(Q(z; f_{z^{\ell-p}})),$$

which means that the expected variance when the classifier is trained with ℓ − p samples is greater than or equal to the variance of the procedure $L_p$.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size ℓ − 1 is

$$V(Q(z; f_{z^{\ell-1}})) = E\{Q(z; f_{z^{\ell-1}})^2\} - [E\,Q(z; f_{z^{\ell-1}})]^2.$$

The variance of the estimator is

$$V\left(\frac{1}{\ell} L_1(z_1, \ldots, z_\ell)\right) = V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right),$$

where $t_i$ indicates whether or not an error is made when the ith point is left out:

$$t_i = Q(z_i; f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell)).$$

We can rewrite this as

$$V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] - \left(E\left[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right]\right)^{2}.$$

From Statement 1 we know

$$E\left[\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right] = E\,Q(z; f_{z^{\ell-1}}).$$

So

$$V\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right) = E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] - [E\,Q(z; f_{z^{\ell-1}})]^2.$$

One can show that

$$E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] \le E[Q(z; f_{z^{\ell-1}})^2].$$

One can write

$$E\left[\left(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\right)^{2}\right] = E\left[\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right].$$

If the random variables $t_i, t_j$ are identical and independent, then the above equation can be rewritten as

$$E\left[\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right] = E\left[\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\right] = E[Q(z; f_{z^{\ell-1}})^2].$$
However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\left[\frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\right] \le E\left[\frac{\ell^2 - \ell}{\ell^2}\, t^2\right].$$

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
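The optimism described in Statement 2 is easy to see in simulation: across repeated datasets, the leave-one-out average varies far less than the error of a classifier trained on ℓ − 1 samples and tested on a single independent sample. A sketch under toy assumptions of our own (a made-up 1-D problem and a nearest-class-mean rule, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    # Toy 1-D two-class problem, labels ±1, class means at ±1.
    y = rng.choice([-1, 1], size=n)
    return y + rng.normal(size=n), y

def predict(x_tr, y_tr, x):
    # Nearest-class-mean rule as a stand-in classifier.
    mu = {c: x_tr[y_tr == c].mean() for c in (-1, 1)}
    return np.where(np.abs(x - mu[1]) < np.abs(x - mu[-1]), 1, -1)

ell, trials = 20, 400
loo_vals, single_vals = [], []
for _ in range(trials):
    x, y = draw(ell)
    idx = np.arange(ell)
    # Leave-one-out average over the ell splits of this dataset.
    loo = np.mean([predict(x[idx != i], y[idx != i], x[i:i + 1])[0] != y[i]
                   for i in range(ell)])
    loo_vals.append(loo)
    # One classifier on ell-1 samples, tested on a single independent sample:
    # a Bernoulli variable with variance e(1 - e).
    x_te, y_te = draw(1)
    single_vals.append(predict(x[1:], y[1:], x_te)[0] != y_te[0])

# The LOO estimator's variance understates the single-test-sample variance.
print(np.var(loo_vals), np.var(single_vals))
```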
1c Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z; f_{z^{\ell-1}})$ is $P_{D^{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_1(z_1, \ldots, z_\ell)$ is $P_{D^{\ell}}\{r < \tau\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\, dt = \int_{-\infty}^{\infty} \theta(\tau - t)\, p(t)\, dt.$$

The distribution function for the random variable $r = \frac{1}{\ell} \sum_{i=1}^{\ell} r_i$ can be written as follows, by a sequence of transformations similar to that used in the proof of Statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\, p(r_i)\, dr_i,$$

which is the same as $P(t < \tau)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

$$P\left\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon\right\} \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \varepsilon^2},$$

where F(x) is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where ℓ ≤ 20, instead of using the Kolmogorov–Smirnov distribution, one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the proposed subsampling procedure gives us more accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
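In practice, the 25th and 75th quantiles are read off the empirical distribution of subsampled error rates, and the Kolmogorov–Smirnov series bounds how far the empirical CDF can sit from the true one. A sketch using the first term of the series; the error-rate values are invented placeholders:

```python
import numpy as np

# Error rates from, e.g., repeated subsampling runs (hypothetical values).
rates = np.array([0.15, 0.20, 0.10, 0.25, 0.20, 0.15, 0.30, 0.20,
                  0.10, 0.25, 0.15, 0.20, 0.25, 0.15, 0.20, 0.10,
                  0.20, 0.25, 0.15, 0.20, 0.30, 0.10, 0.20, 0.15])
n = len(rates)

# Empirical 25th and 75th quantiles define the error-bar envelope.
q25, q75 = np.quantile(rates, [0.25, 0.75])

# Kolmogorov-Smirnov-type bound, keeping only the k=1 term of the series:
# with probability >= 1 - delta, the empirical CDF lies uniformly within
# eps of the true CDF.
delta = 0.05
eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))

print(q25, q75, eps)
```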
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix, we will describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$y = \mathrm{sign}(w \cdot z),$$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$\sum_{i=1}^{p} \theta[-(t \cdot z_i)(w \cdot z_i)],$$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991) involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as for β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with a linear kernel, as used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as $0.50\,n^{-1}$ (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass, of version space, the space of all hypotheses consistent with the data), the generalization error scales as $0.44\,n^{-1}$ (Opper and Haussler, 1991). For the Adaline learning rule the error scales as $0.24\,n^{-1/2}$, and for the Hebb rule as $0.40\,n^{-1/2}$ (see Watkin et al. [1993] for a review). The dependence on n is thus approximately $n^{-\alpha}$, with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = a n^{-\alpha} + b$ for modeling the generalization error.
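The teacher-student picture can be simulated directly. The sketch below uses the Hebb rule, whose student weight vector has a closed form, to check the quoted $n^{-1/2}$ decay numerically; the dimension, sample sizes, and repetition count are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50  # number of attributes (input dimension)

def hebb_gen_error(n, reps=200):
    # Average generalization error of a Hebb-rule student learning a
    # random teacher perceptron from n random Gaussian patterns.
    errs = []
    for _ in range(reps):
        t = rng.normal(size=m)        # teacher weight vector
        Z = rng.normal(size=(n, m))   # random uncorrelated input patterns
        y = np.sign(Z @ t)            # teacher labels
        w = y @ Z                     # Hebb student: w = sum_i y_i z_i
        # For Gaussian inputs the generalization error is the normalized
        # angle between student and teacher weight vectors.
        overlap = (w @ t) / (np.linalg.norm(w) * np.linalg.norm(t))
        errs.append(np.arccos(overlap) / np.pi)
    return np.mean(errs)

# Error decays roughly like n**-0.5 for the Hebb rule, so quadrupling n
# should shrink the error by a factor approaching 2 at large beta = n/m.
e1, e2 = hebb_gen_error(100), hebb_gen_error(400)
print(e1, e2, e1 / e2)
```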
2b A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class f ∈ H, and the optimal classification function $f_T$ is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between these functions.

For function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should have quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all x ∈ X and y ∈ Y. By a smooth loss function, we mean loss functions that are $\ell_p$ with p ≥ 1 or Lipschitz over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is $\ell_0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the $\ell_0$ loss is not used). For example, in Support Vector Machines for classification, the loss function is Lipschitz. For this reason, this analysis is still appropriate.
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. on Computational Molecular Biology (RECOMB). ACM Press, New York.

Tsang, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
136 MUKHERJEE ET AL
5 CONCLUSIONS
We have described a methodology for assessing the signi cance of a classi errsquos performance via apermutation test and constructing a learning curve to extrapolate error statistics for larger data sets thatinclude yet unseen samples We applied the method to eight cancer classi cation problems of varyinglevels of dif culty Based on the results of the previous section one can see that the inverse power-lawscaling model proposed ts the empirical error rates reasonably well The classi er we used was an SVMbut the methodology is applicable to other algorithms (eg weighted voting k-nearest neighbors logisticregression etc) For the morphology classi cation problems the extrapolation is quite accurate For thetreatment outcome classi cation problems the combination of the increased complexity of the problemsand the limited dataset sizes yield a less accurate but still indicative extrapolation As expected the modelimproves as larger training samples sizes are used in the learning curve t (see Figs 4 and 5) The learningcurves bear out the empirical observation that morphological distinctions are more dramatic and thusin general ldquosimplerrdquo problems than the more subtle distinctions that must be determined for treatmentoutcome prediction Signi cance on morphology problems is achieved with 10ndash20 training samples andldquoreasonably accuraterdquo extrapolation requires 30ndash40 training samples In contrast for treatment outcomesigni cance is achieved with 45ndash60 training samples and ldquoreasonably accuraterdquo extrapolation requireson the order of 75ndash100 training samples For morphological distinctions the learning curve prediction isreasonably close to the actual leave-one-out error measured at a larger size The 25th and 75th quantilemodels provide useful error bar envelopes that enclose the observed error rates for those problems Fortreatment outcome prediction due to the large training set size required to achieve signi cance and smallavailable dataset 
sizes we do not have enough signi cant classi ers with which to construct an accuratelearning curve Consequently we get less accurate estimates of the leave-one-error on the entire datasetfor the outcome treatment examples with differences of 7 for brain tumor outcome 9 for lymphomatreatment outcome and 8 for breast tumor metastasis
The estimations of the asymptotic Bayes error b the learning rate a and decay rate reg can also be useddirectly to characterize the dif culty of a problem and the complexity of a model They can provide abasis for comparing and contrasting models and problems To illustrate we show in Fig 13 the values ofthese parameters for the examples discussed in the paper The morphology and treatment outcome datasetscluster with respect to reg and b We have not elaborated on this aspect of the analysis but it is certainly aninteresting direction to pursue in the future
In summary our methodology produces reasonable nontrivial dataset size estimates when applied to afairly general set of molecular cancer classi cation problems In this context it can serve as a valuable
FIG 13 A plot of 1=reg versus macr for the eight datasets The diamonds correspond to treatment outcome problemsand the squares correspond to morphology prediction problems
ESTIMATING DATASET SIZE REQUIREMENTS 137
tool when designing future experiments either for evaluating whether it is worthwhile to collect additionalsamples or for obtaining a deeper insight into the complexity of a given classi cation problem basedon preliminary data Table 1 shows a summary of the results for the examples described in this paperThe results of applying this method to those examples suggest that minimum dataset size requirementsfor morphological classi cations are typically in the 10ndash20 samples range and upwards of 50 samples fortreatment outcome classi cation These results can be used to provide general rule-of-thumb guidelinesbut the exact numbers for a given problem are dataset and classi er dependent This method can also beapplied to other domains where a prospective estimation of the number of samples is relevant as is thecase in many problems using molecular features to classify biological samples eg classi cation basedon proteomic mass spectroscopy data chemosensitivity prediction survival analysis and putative classdiscovery using clustering
APPENDIX 1 BIAS PROPERTIES OF THE MEAN VARIANCE ANDQUANTILES OF LEAVE-P-OUT ESTIMATORS
1a The mean of the leave-p-out estimator is unbiased
Statement 1 The procedure of excluding p samples from a dataset of size ` constructing a classi erand then testing on the p samples left out is designated as follows
Lpz1 z` D 1p
X
p
Qzp f z`iexclp
where z D x y f z`iexclp is the classi er constructed with p samples left out and Qzp f z`iexclp is theerror of this classi er on the p samples left out This procedure is unbiased
ELpz1 z` D EQz f z`iexclp
which means that the expected error when the classi er is trained with ` iexcl p samples is the same as theexpected error of procedure Lp
Proof The proof is a straightforward extension of the leave-one-out case which was derived by Luntzand Brailovsky (Luntz and Brailovsky 1969) by the following series of transformations
EL1z1 z` D 1`
Z X
iD1
Qzi f z`iexcl1dPz1 dPz`
D 1`
Z X
iD1
Qzi f z`iexcl1dPzidPz1 dPziiexcl1dPziC1 dPz`
D 1`
X
iD1
EQzi f z`iexcl1
D EQz f z`iexcl1
The implication of this statement is that the subsampling procedure proposed is unbiased and in expectationgives us more accurate estimates of the true error of a classi er trained with ` iexcl p samples as the numberof subsamples increases
1b The variance of the leave-p-out estimator is biased and optimistic
Statement 2 The variance of the leave-p-out estimator is less than or equal to the variance of aclassi er trained with `iexclp samples and tested on an independent sample so the variance of the leave-p-out
138 MUKHERJEE ET AL
procedure is not necessarily unbiased This procedure is not necessarily unbiased and in general will beoptimistic
V Lpz1 z` middot VQz f z`iexclp
which means that the expected variance when the classi er is trained with ` iexcl p samples is greater thanor equal to the variance of the procedure Lp
Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance over training sets of size ℓ − 1 is

$$ V\,Q(z, f_{z^{\ell-1}}) = E\big[Q(z, f_{z^{\ell-1}})^2\big] - \big[E\,Q(z, f_{z^{\ell-1}})\big]^2. $$

The variance of the estimator is

$$ V\,L_1(z_1,\ldots,z_\ell) = V\!\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right), $$

where $t_i$ indicates whether or not an error is made when the i-th point is left out:

$$ t_i = Q\big(z_i,\, f(z_1,\ldots,z_{i-1},z_{i+1},\ldots,z_\ell)\big). $$

We can rewrite this as

$$ V\!\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right) = E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] - \left[E\!\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)\right]^{2}. $$

From Statement 1 we know that

$$ E\!\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right) = E\,Q(z, f_{z^{\ell-1}}), $$

so

$$ V\!\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right) = E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] - \big[E\,Q(z, f_{z^{\ell-1}})\big]^2. $$

One can show that

$$ E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] \le E\big[Q(z, f_{z^{\ell-1}})^2\big]. $$

One can write

$$ E\!\left[\left(\frac{1}{\ell}\sum_{i=1}^{\ell} t_i\right)^{2}\right] = E\!\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i\neq j} t_i t_j\right). $$

If the random variables $t_i, t_j$ were identical and independent, the above expectation could be rewritten as

$$ E\!\left(\frac{1}{\ell^2}\sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2}\sum_{i\neq j} t_i t_j\right) = \frac{E(t^2)}{\ell} + \frac{\ell^2-\ell}{\ell^2}\,\big[E\,Q(z, f_{z^{\ell-1}})\big]^2 \le E\big[Q(z, f_{z^{\ell-1}})^2\big]. $$

However, $t_1,\ldots,t_\ell$ are not independent but correlated. Still, each cross term satisfies $E(t_i t_j) \le E(t^2)$ by the Cauchy–Schwarz inequality, so

$$ E\!\left(\frac{1}{\ell^2}\sum_{i\neq j} t_i t_j\right) \le \frac{\ell^2-\ell}{\ell^2}\,E(t^2), $$

and the bound above still holds.

The implication of this statement is that the variance of the subsampling procedure proposed is biased and does not give an accurate estimate of the variance of a classifier trained with ℓ − p samples; in general, the variance of the subsampling procedure will be smaller.
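The optimism of the variance can also be seen numerically. The sketch below reuses a toy nearest-centroid classifier on one-dimensional Gaussian data (an illustration, not the paper's experiments): across many datasets, the variance of the leave-one-out average is compared with the variance of a single held-out test error, and the former is markedly smaller, as Statement 2 predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
ELL, TRIALS = 12, 2000

def draw(n):
    """Two 1-D Gaussian classes with labels +/-1 (at least 2 samples per class)."""
    while True:
        y = rng.choice([-1, 1], size=n)
        if (y == 1).sum() >= 2 and (y == -1).sum() >= 2:
            return y + rng.normal(0.0, 1.5, size=n), y

def predict(x_train, y_train, t):
    """Nearest-centroid decision rule."""
    m_pos, m_neg = x_train[y_train == 1].mean(), x_train[y_train == -1].mean()
    return 1 if abs(t - m_pos) < abs(t - m_neg) else -1

loo, single = [], []
for _ in range(TRIALS):
    # leave-one-out average on a dataset of size ELL (the estimator L_1)
    x, y = draw(ELL)
    loo.append(np.mean([predict(np.delete(x, i), np.delete(y, i), x[i]) != y[i]
                        for i in range(ELL)]))
    # a single 0/1 test error of a classifier trained on ELL-1 samples
    xt, yt = draw(ELL - 1)
    ys = rng.choice([-1, 1])
    single.append(predict(xt, yt, ys + rng.normal(0.0, 1.5)) != ys)

print(np.var(loo), np.var(single))  # LOO variance is markedly smaller
```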
1c. Quantiles of the leave-p-out estimator are unbiased

Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with ℓ − p samples and tested on an independent sample.

Proof. Again we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The cumulative distribution function of the random variable $t = Q(z, f_{z^{\ell-1}})$ is $P_{D^{\ell-1}}\{t < \xi\}$. The cumulative distribution function of the random variable $r = L_1(z_1,\ldots,z_\ell) = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ is $P_{D^{\ell}}\{r < \xi\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable t is

$$ P(t < \xi) = \int_{-\infty}^{\xi} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\xi - t)\,p(t)\,dt. $$

The distribution function for the random variable $r = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ can be written as follows, by a similar sequence of transformations as used in the proof of Statement 1:

$$ P(r < \xi) = \frac{1}{\ell}\sum_{i=1}^{\ell}\int_{-\infty}^{\infty} \theta(\xi - r_i)\,p(r_i)\,dr_i, $$

which is the same as P(t < ξ). We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on ℓ − p samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of p error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For ℓ large enough (ℓ > 20), the Kolmogorov–Smirnov distribution gives us

$$ P\Big\{\sqrt{\ell}\,\sup_x |F(x) - F_\ell(x)| < \epsilon\Big\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\epsilon^2}, $$

where F(x) is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of ℓ error rates. We can use this result to state that, with probability 1 − δ, the difference between the estimate of a quantile and the true quantile value is bounded. For the case where ℓ ≤ 20, instead of the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.

The implication of this statement is that the subsampling procedure proposed gives increasingly accurate estimates of the quantiles of the true error of a classifier trained with ℓ − p samples as the number of subsamples increases.
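The series above is straightforward to evaluate numerically. As an illustration (not from the paper), the snippet below truncates the sum and recovers the standard 95% point of the limiting Kolmogorov–Smirnov distribution by bisection; it translates into the quantile-accuracy statement in the text: with ℓ error rates, $\sup_x |F(x) - F_\ell(x)| < 1.358/\sqrt{\ell}$ holds with probability about 0.95.

```python
import math

def ks_cdf(x, terms=100):
    """Limiting Kolmogorov-Smirnov CDF: P{sqrt(l) * sup|F - F_l| < x}."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

# bisection for the 95% point of the limiting distribution
lo, hi = 0.5, 3.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if ks_cdf(mid) < 0.95:
        lo = mid
    else:
        hi = mid
crit = 0.5 * (lo + hi)
print(round(crit, 3))  # ~1.358; so sup|F - F_l| < 1.358/sqrt(l) with prob. ~0.95
```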
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE

2a. A motivation from the statistical mechanics approach to learning

In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel, 2001; Watkin, 1993), which strongly motivate the use of Equation (1). In this approach the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data is assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function

$$ y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{z}), $$

where z is an input vector, w is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and y = ±1 is the output. Suppose the weight vector for the teacher perceptron is t and the weight vector for the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

$$ \sum_{i=1}^{p} \theta\big[-(\mathbf{t} \cdot \mathbf{z}_i)(\mathbf{w} \cdot \mathbf{z}_i)\big], $$

where θ(x) = 0 if x < 0 and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz, 1991), involving integrations in the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.

It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-β limit, where the dependence of generalization error on β (or equivalently n) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as 0.50 n^{-1} (Opper et al., 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^{-1} (Opper et al., 1991). For the Adaline learning rule the error scales as 0.24 n^{-1/2}, and for the Hebb rule as 0.40 n^{-1/2} (see Watkin [1993] for a review). The dependence on n is thus approximately n^{-α}, with α near 1 for the more efficient rules such as the optimal perceptron and Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights w). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the b term in Equation (1)). In summary, then, this approach strongly motivates the use of e(n) = a n^{-α} + b for modeling the generalization error.
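The teacher–student picture is easy to simulate. The sketch below is illustrative only, not from the paper: a student perceptron is trained by the Hebb rule (whose error the text says scales as 0.40 n^{-1/2}) on data labeled by a random teacher perceptron, and the decay exponent is estimated from a log-log fit; the dimension m, the sample sizes, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 50          # number of attributes (arbitrary)
TEST = 20000    # fresh samples used to estimate the generalization error

teacher = rng.normal(size=M)
teacher /= np.linalg.norm(teacher)

def gen_error(n):
    """Test error of a Hebb-rule student trained on n teacher-labeled samples."""
    X = rng.normal(size=(n, M))
    y = np.sign(X @ teacher)
    w = (y[:, None] * X).sum(axis=0)   # Hebb rule: w = sum_i y_i z_i
    Xt = rng.normal(size=(TEST, M))
    yt = np.sign(Xt @ teacher)
    return np.mean(np.sign(Xt @ w) != yt)

ns = np.array([200, 800, 3200])
errs = np.array([gen_error(n) for n in ns])
slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(errs, slope)  # error decays roughly like n**(-1/2)
```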
2b. A motivation from an approximation theory point of view

Another justification of a power law for regression or classification comes from approximation theory (Niyogi et al., 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

$$ d(f_o, f_T) = O(n^{-\alpha}), $$

where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between these functions.

For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function V(·, ·) is smooth, the difference in error measured using the loss function between the functions f_o and f_T is V(f_o(x), y) − V(f_T(x), y) = O(n^{-α}) for all x ∈ X and y ∈ Y. By a smooth loss function we mean loss functions that are ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss function is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the ℓ_0 loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.
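In practice the inverse power law e(n) = a n^{-α} + b can be fitted with a one-dimensional search, since for a fixed α the model is linear in (a, b). The sketch below demonstrates this on synthetic data; it is an illustration, not the paper's fitting code, and the grid range, sample sizes, and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic learning curve drawn from a known inverse power law e(n) = a*n**-alpha + b
a_true, alpha_true, b_true = 0.8, 0.75, 0.05
ns = np.array([10, 20, 40, 80, 160, 320, 640], dtype=float)
errs = a_true * ns ** -alpha_true + b_true + rng.normal(0, 0.005, size=ns.size)

best = (np.inf, None, None, None)
for alpha in np.arange(0.10, 1.51, 0.01):
    # for fixed alpha the model is linear in (a, b)
    A = np.column_stack([ns ** -alpha, np.ones_like(ns)])
    (a, b), *_ = np.linalg.lstsq(A, errs, rcond=None)
    sse = np.sum((A @ np.array([a, b]) - errs) ** 2)
    if sse < best[0]:
        best = (sse, alpha, a, b)

sse, alpha_hat, a_hat, b_hat = best
print(round(alpha_hat, 2), round(a_hat, 2), round(b_hat, 3))
```

With low noise the grid search recovers the generating exponent and the asymptotic error b closely; on real error rates the fit is of course only as good as the power-law assumption itself.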
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.
Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.
Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.
Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.
Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.
Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.
Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.
Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.
Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).
Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.
Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.
Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.
Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.
Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.
Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.
Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation, and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.
Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.
Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17(suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
tool when designing future experiments, either for evaluating whether it is worthwhile to collect additional samples or for obtaining a deeper insight into the complexity of a given classification problem based on preliminary data. Table 1 shows a summary of the results for the examples described in this paper. The results of applying this method to those examples suggest that minimum dataset size requirements for morphological classifications are typically in the 10–20 sample range, and upwards of 50 samples for treatment outcome classification. These results can be used to provide general rule-of-thumb guidelines, but the exact numbers for a given problem are dataset and classifier dependent. This method can also be applied to other domains where a prospective estimation of the number of samples is relevant, as is the case in many problems using molecular features to classify biological samples, e.g., classification based on proteomic mass spectroscopy data, chemosensitivity prediction, survival analysis, and putative class discovery using clustering.
Tsang GC Oh M-K Rohlin L Liao JC and Wong WH 2001 Issues in cDNA microarray analysis Quality ltering channel normalization models of variation and assessment of gene effects Nucl Acids Res 29 2549ndash2557
Watkin T Rau A and Biehl M 1993 The statistical mechanics of learning a rule Rev Modern Physics 65 499ndash556Vapnik VN 1998 Statistical Learning Theory John Wiley New YorkYeang CH Ramaswamy S Tamayo P Mukherjee S Rifkin RM Angelo M Reich M Lander E Mesirov
JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322
Address correspondence toSayan Mukherjee
Room 201 Building E-2545 Carleton Street
Cambridge MA 02139
E-mail sayangenomewimitedu
138 MUKHERJEE ET AL
This procedure is not necessarily unbiased and in general will be optimistic:
$$V(L_p(z_1, \ldots, z_\ell)) \le V\big(Q(z, f_{z^{\ell-p}})\big),$$

which means that the expected variance when the classifier is trained with $\ell - p$ samples is greater than or equal to the variance of the procedure $L_p$.
Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.

The variance for training sets of size $\ell - 1$ is

$$V\big(Q(z, f_{z^{\ell-1}})\big) = E\big[Q(z, f_{z^{\ell-1}})^2\big] - \big[E\,Q(z, f_{z^{\ell-1}})\big]^2.$$

The variance of the estimator is

$$V\Big(\frac{1}{\ell}\, L_1(z_1, \ldots, z_\ell)\Big) = V\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big),$$

where $t_i$ indicates whether an error is made when the $i$th point is left out:

$$t_i = Q\big(z_i, f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_\ell)\big).$$

We can rewrite this as

$$V\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big) = E\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^{2} - \Big(E\,\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^{2}.$$

From statement 1 we know

$$E\,\frac{1}{\ell} \sum_{i=1}^{\ell} t_i = E\,Q(z, f_{z^{\ell-1}}).$$

So

$$V\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big) = E\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^{2} - \big[E\,Q(z, f_{z^{\ell-1}})\big]^2.$$

One can show that

$$E\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^{2} \le E\big[Q(z, f_{z^{\ell-1}})^2\big].$$

One can write

$$E\Big(\frac{1}{\ell} \sum_{i=1}^{\ell} t_i\Big)^{2} = E\Big(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\Big).$$

If the random variables $t_i, t_j$ were identical and independent, then the above equation could be rewritten

$$E\Big(\frac{1}{\ell^2} \sum_{i=1}^{\ell} t_i^2 + \frac{1}{\ell^2} \sum_{i \ne j} t_i t_j\Big) = E\Big(\frac{t^2}{\ell} + \frac{\ell^2 - \ell}{\ell^2}\, t^2\Big) = E\big[Q(z, f_{z^{\ell-1}})\big]^2.$$
However, $t_1, \ldots, t_\ell$ are not independent but correlated, so

$$E\Big(\sum_{i \ne j} t_i t_j\Big) \le E\Big(\frac{\ell^2 - \ell}{\ell^2}\, t^2\Big).$$

The implication of this statement is that the variance of the proposed subsampling procedure is biased and does not give an accurate estimate of the variance of a classifier trained with $\ell - p$ samples; in general, the variance of the subsampling procedure will be smaller.
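This implication can be illustrated numerically. The sketch below is our own illustration, not the paper's code: the nearest-mean classifier, the synthetic Gaussian data, and all names are assumptions made for the example. It compares, across replicate datasets, the variance of the leave-one-out estimate $r = \frac{1}{\ell}\sum_i t_i$ with the variance of the single error indicator $Q(z, f_{z^{\ell-1}})$.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, PER_CLASS = 5, 10

def sample_dataset():
    # Two Gaussian classes whose means differ by 1 in every coordinate.
    x0 = rng.normal(0.0, 1.0, (PER_CLASS, DIM))
    x1 = rng.normal(1.0, 1.0, (PER_CLASS, DIM))
    return np.vstack([x0, x1]), np.array([0] * PER_CLASS + [1] * PER_CLASS)

def nearest_mean_error(tr_x, tr_y, te_x, te_y):
    # Classify by the closer class mean; return the test error rate.
    mu0, mu1 = tr_x[tr_y == 0].mean(0), tr_x[tr_y == 1].mean(0)
    pred = (np.linalg.norm(te_x - mu1, axis=1)
            < np.linalg.norm(te_x - mu0, axis=1)).astype(int)
    return float((pred != te_y).mean())

loo_means, single_indicators = [], []
for _ in range(200):
    x, y = sample_dataset()
    l = len(y)
    # Leave-one-out estimator r = (1/l) sum_i t_i for this dataset.
    t = [nearest_mean_error(np.delete(x, i, 0), np.delete(y, i),
                            x[i:i + 1], y[i:i + 1]) for i in range(l)]
    loo_means.append(np.mean(t))
    # Single indicator Q(z, f_{z^{l-1}}): train on l-1 points, test one fresh z.
    zx, zy = sample_dataset()
    single_indicators.append(nearest_mean_error(x[1:], y[1:], zx[:1], zy[:1]))

v_loo, v_q = np.var(loo_means), np.var(single_indicators)
print(v_loo, v_q)  # the averaged estimator's variance is expected to be much smaller
```

The 0/1 indicator has Bernoulli variance $e(1-e)$, while averaging $\ell$ correlated indicators shrinks the variance, matching the bias described above.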
1c Quantiles of the leave-p-out estimator are unbiased
Statement 3. Quantiles of the leave-p-out estimator give an accurate estimate of the quantiles of a classifier trained with $\ell - p$ samples and tested on an independent sample.

Proof. Again, we will prove the leave-one-out case; the leave-p-out case is a straightforward extension.
The cumulative distribution function of the random variable $t = Q(z, f_{z^{\ell-1}})$ is $P_{D^{\ell-1}}\{t < \tau\}$. The cumulative distribution function of the random variable $r = \frac{1}{\ell} L_{\ell-1}(z_1, \ldots, z_\ell)$ is $P_{D^{\ell}}\{r < \tau\}$. If we show that these distribution functions are equal, then the quantiles of the leave-p-out estimator are unbiased. The distribution function of the random variable $t$ is

$$P(t < \tau) = \int_{-\infty}^{\tau} p(t)\,dt = \int_{-\infty}^{\infty} \theta(\tau - t)\, p(t)\,dt.$$

The distribution function for the random variable $r = \frac{1}{\ell}\sum_{i=1}^{\ell} r_i$ can be written as follows, by a similar sequence of transformations as used in the proof of statement 1:

$$P(r < \tau) = \frac{1}{\ell} \sum_{i=1}^{\ell} \int_{-\infty}^{\infty} \theta(\tau - r_i)\, p(r_i)\,dr_i,$$

which is the same as $P(t < \tau)$.

We have now shown that the cumulative distribution function of the error measured by the leave-p-out procedure is equivalent to the cumulative distribution of a classifier trained on $\ell - p$ samples.

However, we do not have this distribution when we run the leave-p-out procedure. We have a sequence of error rates, and we take the 25th and 75th quantiles of the empirical distribution function constructed from the sequence. We can use the Kolmogorov–Smirnov or Smirnov distributions to show that the empirical quantile values are close to those of the true underlying distribution. For $\ell$ large enough ($\ell > 20$), the Kolmogorov–Smirnov distribution gives us

$$P\Big\{\sqrt{\ell}\, \sup_x |F(x) - F_\ell(x)| < \varepsilon\Big\} \approx 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\varepsilon^2},$$

where $F(x)$ is the distribution function of the error rates and $F_\ell(x)$ is the empirical distribution function constructed from a sequence of $\ell$ error rates. We can use this result to state that, with probability $1 - \delta$, the difference between the estimate of a quantile and the true quantile value will be bounded. For the case where $\ell \le 20$, instead of using the Kolmogorov–Smirnov distribution one can use tabulated values of the Smirnov distribution.
The implication of this statement is that the proposed subsampling procedure gives increasingly accurate estimates of the quantiles of the true error of a classifier trained with $\ell - p$ samples as the number of subsamples increases.
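As a concrete sketch (ours, not the paper's: the function name is an assumption, and we use the finite-sample Dvoretzky–Kiefer–Wolfowitz form of the bound rather than the asymptotic Kolmogorov–Smirnov series), one can extract the 25th and 75th empirical quantiles from a sequence of subsampled error rates and attach a distribution-free band of width $\varepsilon$ that holds with probability at least $1 - \delta$:

```python
import math

def quantile_band(error_rates, delta=0.05):
    """25th/75th quantiles of the empirical distribution of error rates,
    plus eps such that sup_x |F(x) - F_l(x)| < eps with probability
    >= 1 - delta (Dvoretzky-Kiefer-Wolfowitz with Massart's constant)."""
    s = sorted(error_rates)
    l = len(s)
    q25 = s[int(0.25 * (l - 1))]
    q75 = s[int(0.75 * (l - 1))]
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * l))
    return q25, q75, eps

rates = [i / 100 for i in range(100)]   # 100 subsampled error rates
print(quantile_band(rates))             # -> (0.24, 0.74, 0.135...)
```

As $\ell$ (the number of subsamples) grows, $\varepsilon$ shrinks like $\ell^{-1/2}$, which is exactly the sense in which the quantile estimates improve with more subsamples.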
APPENDIX 2. MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivate the use of Equation (1). In this approach the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is unrealistic for practical datasets, of course, but we can assume that the functional relation derived between e(n) and n largely holds for real-life data. In the SM approach a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
$$y = \mathrm{sign}(w \cdot z),$$
where $z$ is an input vector, $w$ is the weight vector for the perceptron (which weights the relevance of particular inputs or attributes), and $y = \pm 1$ is the output. Suppose the weight vector for the teacher perceptron is $t$ and the weight vector for the student perceptron is $w$. Then the number of generalization errors made by the student perceptron on a set of $p$ new samples will be
$$\sum_{i=1}^{p} \theta\big[-(t \cdot z_i)(w \cdot z_i)\big],$$
where $\theta(x) = 0$ if $x < 0$ and $1$ otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991) involving integrations in the weight space $w$. The generalization error is given as a function of the ratio $\beta = n/m$ (where $m$ is the number of attributes). However, with $m$ fixed, we can assume the same functional dependence on $n$ as on $\beta$. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
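This teacher–student picture is easy to simulate. The sketch below is our own illustration, not part of the paper's analysis: it uses the Hebb rule (discussed later in this appendix) only because its student has a closed form, and all names and parameter values are assumptions. It estimates the student's generalization error at two training-set sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 50                                  # number of attributes m
teacher = rng.standard_normal(M)        # teacher weight vector t

def hebb_error(n, trials=20, n_test=2000):
    """Average generalization error of the Hebb-rule student w = mean(y_i z_i),
    estimated on fresh test patterns and averaged over several training sets."""
    errs = []
    for _ in range(trials):
        z = rng.standard_normal((n, M))
        y = np.sign(z @ teacher)                  # labels supplied by the teacher
        w = (y[:, None] * z).mean(axis=0)         # Hebb-rule student
        z_new = rng.standard_normal((n_test, M))  # fresh test patterns
        errs.append(np.mean(np.sign(z_new @ w) != np.sign(z_new @ teacher)))
    return float(np.mean(errs))

print(hebb_error(10), hebb_error(640))  # error shrinks as n grows
```

With $m$ fixed, plotting `hebb_error(n)` against $n$ traces out an empirical learning curve of the kind the power-law model is meant to fit.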
It can be argued (Dietrich et al., 1999) that a Support Vector Machine with linear kernel, used in our numerical simulations, has the same generalization error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-$\beta$ limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-$\beta$ limit is most appropriate when considering microarray data. However, some further insights can also be gained by considering the high-$\beta$ limit, where the dependence of generalization error on $\beta$ (or, equivalently, $n$) can be extracted explicitly. Thus, for the optimal perceptron the generalization error scales as $0.50\,n^{-1}$ (Opper and Kinzel, 1995). Similarly, for other rules this scaling can be extracted explicitly (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, or center of mass of version space, the space of all hypotheses consistent with the data), the generalization error scales as $0.44\,n^{-1}$ (Opper and Haussler, 1991). For the Adaline learning rule the error scales as $0.24\,n^{-1/2}$, and for the Hebb rule as $0.40\,n^{-1/2}$ (see Watkin et al. [1993] for a review). The dependence on $n$ is thus approximately $n^{-\alpha}$, with $\alpha$ near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters in the model (e.g., the weights $w$). Thus, for example, white noise added to examples in the training set appears as an additive constant term in the generalization error (justifying the $b$ term in Equation (1)). In summary, then, this approach strongly motivates the use of $e(n) = a n^{-\alpha} + b$ for modeling the generalization error.
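The model $e(n) = a n^{-\alpha} + b$ can be fit to an observed learning curve by least squares. The sketch below is our own illustration, not the paper's fitting code: it grid-searches $b$ and solves for $(a, \alpha)$ by a log-linear fit at each candidate $b$; all names are assumptions, and the data are synthetic.

```python
import numpy as np

def fit_inverse_power_law(n, e, n_b=400):
    """Fit e(n) ~ a * n**(-alpha) + b: grid-search b, then log-linear
    least squares for (a, alpha), keeping the lowest-residual triple."""
    n, e = np.asarray(n, float), np.asarray(e, float)
    best = (np.inf, None)
    for b in np.linspace(0.0, e.min() - 1e-9, n_b):
        # log(e - b) = log(a) - alpha * log(n) is linear in log(n).
        slope, intercept = np.polyfit(np.log(n), np.log(e - b), 1)
        a, alpha = np.exp(intercept), -slope
        resid = np.sum((a * n ** (-alpha) + b - e) ** 2)
        if resid < best[0]:
            best = (resid, (a, alpha, b))
    return best[1]

ns = np.array([5, 10, 20, 40, 80, 160])
errs = 0.5 * ns ** (-0.7) + 0.1          # synthetic learning curve
a, alpha, b = fit_inverse_power_law(ns, errs)
print(round(a, 2), round(alpha, 2), round(b, 2))
```

The grid over $b$ sidesteps the nonlinearity of the three-parameter model; a nonlinear least-squares routine would serve equally well.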
2b A motivation from an approximation theory point of view
Another justification for a power law for regression or classification comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class, $f \in H$, and the optimal classification function $f_T$ is a more complicated function
that is not included in the function class $H$. For a wide variety of algorithms, the distance between the optimal function in the function class, $f_o \in H$, and $f_T$ is characterized as

$$d(f_o, f_T) = O(n^{-\alpha}),$$

where $\alpha > 0$. Worst-case analyses measure the distance between the two functions as the value of the point of greatest deviation between these functions.
For function classes used in most algorithms, the worst-case analysis yields $\alpha = 0.5$. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not be made. When the loss function $V(\cdot, \cdot)$ is smooth, the difference in error measured using the loss function between the functions $f_o$ and $f_T$ is $V(f_o(x), y) - V(f_T(x), y) = O(n^{-\alpha})$ for all $x \in X$ and $y \in Y$. By a smooth loss function we mean loss functions that are $\ell^p$ with $p \ge 1$, or Lipschitz, over a bounded domain. Note that the classification loss $V(f(x), y) = \theta(-y f(x))$ is not Lipschitz, and when the classifier outputs $f(x) = \pm 1$ the loss function is $\ell^0$. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons the $\ell^0$ loss is not used). For example, in Support Vector Machines for classification the loss function is Lipschitz. For this reason this analysis is still appropriate.
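The distinction can be made concrete with a small sketch of ours (the function names and margin values are arbitrary choices): the hinge loss used by SVM training changes by at most the change in the classifier output, while the 0–1 classification loss jumps discontinuously.

```python
def zero_one_loss(f, y):
    # Classification loss theta(-y * f(x)): 1 on a mistake, else 0.
    return 1.0 if y * f <= 0 else 0.0

def hinge_loss(f, y):
    # SVM training loss max(0, 1 - y * f(x)): Lipschitz in f with constant 1.
    return max(0.0, 1.0 - y * f)

# An arbitrarily small change in f flips the 0-1 loss by a full unit:
print(zero_one_loss(-1e-9, 1), zero_one_loss(1e-9, 1))   # -> 1.0 0.0
# The hinge loss moves by no more than the change in f (here 0.5 vs. 1.5):
print(abs(hinge_loss(0.5, 1) - hinge_loss(2.0, 1)))      # -> 0.5
```

It is this bounded sensitivity of the optimized loss that makes the $O(n^{-\alpha})$ analysis applicable to the classifiers trained in practice.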
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a Sloan/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI, Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368, MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Machine Learning. Cambridge University Press.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical Mechanics of Generalisation in Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719, American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB), ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu
JP Golub T 2001 Molecular classi cation of multiple tumor types Bioinformatics 17 (suppl 1) S316ndashS322
Address correspondence toSayan Mukherjee
Room 201 Building E-2545 Carleton Street
Cambridge MA 02139
E-mail sayangenomewimitedu
140 MUKHERJEE ET AL
APPENDIX 2: MOTIVATION OF THE INVERSE POWER LAW FOR THE ERROR RATE AS A FUNCTION OF TRAINING SET SIZE
2a A motivation from the statistical mechanics approach to learning
In this appendix we describe results derived within the statistical mechanics (SM) approach to generalization (Engel and Van den Broeck, 2001; Watkin et al., 1993), which strongly motivate the use of Equation (1). In this approach, the average generalization error can be calculated as a function of n. In order to derive analytic expressions, the data are assumed to consist of randomly constructed and uncorrelated input patterns. This assumption is of course unrealistic for practical datasets, but we can assume that the functional relation derived between ε(n) and n largely holds for real-life data. In the SM approach, a teacher (the rule to be discovered) and a student (the learner) are used, with the extent of correlation between teacher and student quantifying generalization ability. To be more specific, let us consider a simple perceptron (Rosenblatt, 1962) rule with a decision function
y = sign(w · z),

where z is an input vector, w is the weight vector of the perceptron (which weights the relevance of particular inputs, or attributes), and y = ±1 is the output. Suppose the weight vector of the teacher perceptron is t and the weight vector of the student perceptron is w. Then the number of generalization errors made by the student perceptron on a set of p new samples will be

Σ_{i=1}^{p} θ[−(t · z_i)(w · z_i)],
where θ(x) = 0 if x < 0, and 1 otherwise. This general approach leads to a set of equations for determining the generalization error via a replica approach (Hertz et al., 1991), involving integrations over the weight space w. The generalization error is given as a function of the ratio β = n/m (where m is the number of attributes). However, with m fixed, we can assume the same functional dependence on n as on β. From this analysis we find that the generalization error depends on the algorithm used and generally assumes a power law.
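This teacher-student picture is easy to simulate. The sketch below is our own illustration, not code from the study: it labels random patterns with a fixed teacher perceptron, trains students (here using the Hebb rule, one of the learning rules discussed below) on increasingly many samples, and estimates the generalization error by counting sign disagreements θ[−(t · z)(w · z)] on fresh patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                      # number of attributes (input dimension)
t = rng.standard_normal(m)  # teacher weight vector: the rule to be discovered

def gen_error(w, n_test=5000):
    """Fraction of fresh random patterns where sign(w.z) != sign(t.z)."""
    z = rng.standard_normal((n_test, m))
    return float(np.mean(np.sign(z @ w) != np.sign(z @ t)))

def hebb_student(n):
    """Student trained with the Hebb rule on n teacher-labelled patterns."""
    z = rng.standard_normal((n, m))
    y = np.sign(z @ t)      # teacher labels y = +/-1
    return y @ z            # w = sum_i y_i * z_i

# estimated generalization error at increasing training-set sizes
errs = {n: gen_error(hebb_student(n)) for n in (10, 100, 1000)}
```

With m held fixed, the estimated error falls as n grows, consistent with the inverse power-law behaviour derived analytically below.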
It can be argued (Dietrich et al., 1999) that a support vector machine with a linear kernel, as used in our numerical simulations, has the same generalization-error dependence as the optimal perceptron (Opper et al., 1990). We have solved the system of equations (Opper et al., 1990) in the low-β limit and find a very close fit to Equation (1). With few samples and a large number of measured attributes, the low-β limit is the most appropriate one when considering microarray data. However, some further insight can be gained by considering the high-β limit, where the dependence of the generalization error on β (or, equivalently, on n) can be extracted explicitly. Thus, for the optimal perceptron, the generalization error scales as 0.50 n^(−1) (Opper and Kinzel, 1995). Similarly, this scaling can be extracted explicitly for other rules (Engel and Van den Broeck, 2001). For example, for the Bayes optimal classifier (derived from the Bayes point, the center of mass of version space, i.e., the space of all hypotheses consistent with the data), the generalization error scales as 0.44 n^(−1) (Opper and Haussler, 1991). For the Adaline learning rule the error scales as 0.24 n^(−1/2), and for the Hebb rule as 0.40 n^(−1/2)
(see Watkin et al. [1993] for a review). The dependence on n is thus approximately n^(−α), with α near 1 for the more efficient rules such as the optimal perceptron and the Bayes optimal classifier. The SM approach to generalization has also been used to quantify the effects of input noise, output noise, and noise affecting the parameters of the model (e.g., the weights w). For example, white noise added to the examples in the training set appears as an additive constant term in the generalization error, justifying the b term in Equation (1). In summary, this approach strongly motivates the use of ε(n) = a n^(−α) + b for modeling the generalization error.
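To make the fitting step concrete, the sketch below fits the inverse power law ε(n) = a n^(−α) + b to a learning curve and extrapolates it to a larger dataset size. The error rates are made up for illustration, and the grid-search fitter is our own simple stand-in for whatever nonlinear optimizer one prefers; neither comes from the paper.

```python
import numpy as np

# hypothetical error rates observed at increasing training-set sizes
n_obs = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
e_obs = np.array([0.42, 0.31, 0.24, 0.19, 0.16])

def fit_power_law(n, e, alphas=np.linspace(0.05, 1.5, 146)):
    """Fit e(n) = a * n**(-alpha) + b by grid search over alpha;
    for each fixed alpha the model is linear in (a, b), so we solve
    that subproblem by least squares and keep the best residual."""
    best = None
    for alpha in alphas:
        X = np.column_stack([n ** (-alpha), np.ones_like(n)])
        coef, *_ = np.linalg.lstsq(X, e, rcond=None)
        resid = float(np.sum((X @ coef - e) ** 2))
        if best is None or resid < best[0]:
            best = (resid, coef[0], alpha, coef[1])
    return best[1], best[2], best[3]   # a, alpha, b

a, alpha, b = fit_power_law(n_obs, e_obs)
# extrapolate: expected error if the dataset were extended to 500 samples
e_500 = a * 500 ** (-alpha) + b
```

The fitted b plays the role of the noise floor discussed above: the error the classifier would approach even with unlimited data.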
2b A motivation from an approximation theory point of view
Another justification for a power law, for regression or classification, comes from approximation theory (Niyogi and Girosi, 1996). In the approximation theory framework, the classification functions come from some restricted function class, f ∈ H, and the optimal classification function f_T is a more complicated function
that is not included in the function class H. For a wide variety of algorithms, the distance between the optimal function in the function class, f_o ∈ H, and f_T is characterized as

d(f_o, f_T) = O(n^(−α)),
where α > 0. Worst-case analyses measure the distance between the two functions as the value at the point of greatest deviation between them.
For the function classes used in most algorithms, the worst-case analysis yields α = 0.5. In general, an empirical result should show quicker convergence, since the worst-case assumptions need not hold. When the loss function V(·, ·) is smooth, the difference in error between the functions f_o and f_T, measured using the loss function, is V(f_o(x), y) − V(f_T(x), y) = O(n^(−α)) for all x ∈ X and y ∈ Y. By a smooth loss function we mean one that is ℓ_p with p ≥ 1, or Lipschitz over a bounded domain. Note that the classification loss V(f(x), y) = θ(−y f(x)) is not Lipschitz, and when the classifier outputs f(x) = ±1 the loss is ℓ_0. However, for most algorithms the loss function optimized to set the parameters of the classifier is Lipschitz (for computational reasons, the ℓ_0 loss is not used). For example, in support vector machines for classification the loss function is Lipschitz. For this reason, this analysis is still appropriate.
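The contrast drawn here, between the non-Lipschitz 0-1 classification loss θ(−y f(x)) and the Lipschitz surrogate actually optimized (such as the SVM hinge loss), can be checked numerically. The snippet below is a small illustration of that contrast, not code from the study; both loss functions are written as functions of the margin y f(x).

```python
import numpy as np

def zero_one_loss(margin):
    """Classification loss theta(-y f(x)): 1 on a mistake, 0 otherwise."""
    return (np.asarray(margin) < 0).astype(float)

def hinge_loss(margin):
    """SVM hinge loss max(0, 1 - y f(x)); Lipschitz with constant 1."""
    return np.maximum(0.0, 1.0 - np.asarray(margin))

# two margins an arbitrarily small distance apart, straddling zero
u, v = -1e-6, 1e-6
jump_01 = abs(float(zero_one_loss(u)) - float(zero_one_loss(v)))  # jumps by 1
jump_hinge = abs(float(hinge_loss(u)) - float(hinge_loss(v)))     # <= |u - v|
```

The 0-1 loss changes by a full unit across an arbitrarily small step in the margin, so no Lipschitz constant exists for it, while the hinge loss changes by no more than the step itself.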
ACKNOWLEDGEMENTS
We are indebted to members of the Cancer Genomics Group, Whitehead/MIT Center for Genome Research, and the Golub Laboratory, Dana-Farber Cancer Institute, for many valuable discussions. This work is supported in part by grants from Affymetrix, Inc., Millennium Pharmaceuticals, Inc., Bristol-Myers Squibb Company, and the Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG02-01ER63185. Sayan Mukherjee was supported by a SLOAN/DOE postdoctoral fellowship.
REFERENCES
Adcock, C. 1997. Sample size determination: A review. Statistician 46, 261–283.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., and Staudt, L.M. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Anderson, J. 1983. Architecture of Cognition. Harvard University Press, Cambridge, MA.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., and Meyerson, M. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. 2001. Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.

Chipping Forecast. 1999. Nature Genetics, vol. 21, supplement.

Cortes, C., Jackel, L., Solla, S., Vapnik, V., and Denker, S. 1994. Asymptotic values and rates of convergence. In Advances in Neural Information Processing Systems VI. Morgan Kaufmann, San Francisco, CA.

Devroye, L., Györfi, L., and Lugosi, G. 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Dietrich, R., Opper, M., and Sompolinsky, H. 1999. Statistical mechanics of support vector networks. Phys. Rev. Letters 82, 2975–2978.

Dietrich, R., Opper, M., and Sompolinsky, H. 2000. Support vectors and statistical mechanics. In Smola, A.J., and Bartlett, P.J., eds., Advances in Large Margin Classifiers, 359–368. MIT Press, Cambridge, MA.

Engel, A., and Van den Broeck, C. 2001. Statistical Mechanics of Learning. Cambridge University Press, Cambridge.

Golland, P., Mukherjee, S., Tamayo, P., and Fischl, B. 2002. Permutation tests for classification. Submitted to Advances in Neural Information Processing Systems.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York.

Guyon, I., Makhoul, J., Schwartz, R., and Vapnik, V. 1998. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 52–64.

Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Boston, MA.

Kerr, A., and Churchill, G. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

Kerr, A., and Churchill, G. 2001b. Statistical design and the analysis of gene expression microarrays. Genetic Research 77, 123–128.

Li, C., and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2(8), research 0032.1–0032.11.

Luntz, A., and Brailovsky, V. 1969. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3 (in Russian).

Niyogi, P., and Girosi, F. 1996. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation 8, 819–842.

Notterman, D., Alon, U., Sierk, A., and Levine, A. 2001. Transcriptional gene expression profiles for colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide array. Cancer Res. 61, 3124–3130.

Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Phys. Rev. Letters 66, 2677–2680.

Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalisation. In Models of Neural Networks. Springer-Verlag, Heidelberg.

Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Physics 23, 581–586.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J.P., Lander, E., and Golub, T. 2002. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415, 436–442.

Ramaswamy, S., Osteen, R.T., and Shulman, L.N. 2001b. Metastatic cancer from an unknown primary site. In Lenhard, R.E., Osteen, R.T., and Gansler, T., eds., Clinical Oncology, 711–719. American Cancer Society, Atlanta, GA.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. 2001a. Multiclass cancer diagnosis by using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Bednarski, D., Hassell, R., Baldwin, R., Karlan, B., and Hood, L. 1999. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene 238, 375–385.

Shipp, M., Ross, K., Tamayo, P., Weng, A., Kutok, J., Aguiar, R., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G., Ray, T., Koval, M., Last, K., Norton, A., Lister, T., Mesirov, J.P., Neuberg, D., Lander, E., Aster, J., and Golub, T. 2001. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 8, 68–74.

Shrager, J., Hogg, T., and Huberman, B.A. 1988. A graph-dynamic model of the power law of practice and the problem-solving fan effect. Science 242, 414–416.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proc. 4th Ann. Int. Conf. Computational Molecular Biology (RECOMB). ACM Press, New York.

Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variation and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557.

Vapnik, V.N. 1998. Statistical Learning Theory. John Wiley, New York.

Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Physics 65, 499–556.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E., Mesirov, J.P., and Golub, T. 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (suppl. 1), S316–S322.
Address correspondence to:
Sayan Mukherjee
Room 201, Building E-25
45 Carleton Street
Cambridge, MA 02139

E-mail: sayan@genome.wi.mit.edu