
Survey and Evaluation of Methods for Tissue Classification Using Gene Expression Data

Per-Olof Fjällström
Affibody
P.O. Box 20137, SE-161 02 Bromma, SWEDEN

Abstract

Microarray experiments allow us to simultaneously monitor the expression levels in cells of thousands of genes. This may lead both to a better understanding of biological mechanisms and to more accurate diagnosis methods. For example, for diagnostic purposes it would be very valuable if we could develop class prediction methods that, given a collection of labeled gene expression profiles, accurately predict the label of an unlabeled profile. Many general-purpose class prediction methods are known from the statistical learning literature, and it seems reasonable to first evaluate these methods before starting to develop gene-expression-specific methods. In this report, we describe and compare a number of well-known class prediction methods, as well as a number of methods for identifying so-called informative genes, that is, genes whose expression levels show a strong correlation with given phenotypes. To evaluate the methods, we have implemented them in MATLAB, and then applied them to both real and simulated gene expression data sets. The results of our evaluation suggest that simple, well-known classification methods, such as nearest neighbor classification and linear discriminant analysis, are very competitive. Not only are they easy to implement and fast to execute, but for most data sets they are also as accurate as (or more accurate than) more advanced general- or special-purpose methods. (These results agree with the results presented in Dudoit, Fridlyand and Speed (2002), and Ben-Dor et al (2000).) Our evaluation of methods for finding informative genes indicates that they produce similar results. Interestingly, we also found that constructing predictors using only the most informative genes may sometimes lead to worse prediction accuracy than if all genes are used.

1 Introduction

Suppose that we have $m$ mRNA tissue samples. Each sample $X_j$, $j = 1, 2, \ldots, m$, consists of mRNA expression levels $X_{ij}$, $i = 1, 2, \ldots, n$, measured from (the same) $n$ genes. With each sample is associated a class label $l_j$ such that $l_j \in \{1, 2, \ldots, K\}$. The class labels have usually not been determined by examining the mRNA samples, but by examining morphological and clinical data about the patient from which the sample was taken. Often $K = 2$, in which case the class labels may correspond to sick or healthy, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), etc. Is it possible to use the given mRNA samples to quickly and correctly classify unlabeled samples? More specifically, can we devise a class prediction function $\hat{f}(Y)$ by which we can accurately label an unknown sample $Y$?


One might ask why this is an interesting problem. After all, if one can determine whether some patients have ALL or AML without using gene expression data, one should be able to do so in all cases. According to Golub et al (1999), current clinical methods for distinguishing between ALL and AML are complicated and still not perfect. Unfortunately, distinguishing ALL from AML is crucial for successful treatment. Therefore, one reason for studying the above problem is to find more accurate diagnosis methods. Another reason, also mentioned by Golub et al, is that the classification rules may provide a better understanding of the underlying biological mechanisms.

The problem of predicting the class of an unlabeled entity, given a set of labeled entities, occurs in many applications, and many methods for so-called supervised learning (or discriminant analysis) have been proposed. (See e.g. Mitchell (1997).) The large number of such methods is itself a problem. Which of the available well-known methods are most appropriate for classifying mRNA samples? Are they at all appropriate? If not, how can we improve them?

Recently, a number of papers on methods for tissue classification using gene expression data have been published. Some of them compare well-known methods (e.g. Ben-Dor et al (2000), and Dudoit et al (2002)), while others appear to propose new methods (e.g. Golub et al (1999)). The purpose of this report is similar to that of the former category of papers.

1.1 A framework for classifying gene expression profiles

Radmacher et al (2001) propose a general framework for class prediction using gene expression profiles. Their framework consists of four steps:

1. Evaluation of the appropriateness of the given data for class prediction. To be appropriate, each tissue sample must have a class label, and the class labels should not be based on the gene expression profiles of the samples; e.g., the class labels should not be derived by clustering the gene expression profiles.

2. Selection of classification (and gene selection) method. This step entails selecting one or more methods that should be accurate and simple to implement and use.

3. Cross-validated class prediction. The accuracy of the methods selected in the previous step has to be evaluated. Since the number of samples usually is relatively small, Radmacher et al recommend using leave-one-out cross-validation, that is, each sample is left out one at a time and its label is predicted based on the remaining samples. The smaller the error rate, the better the classification method.

4. Assessing the significance of the cross-validation results. According to Radmacher et al, small error rates can be achieved even when there is no systematic difference in expression profiles between classes. They recommend a permutation test to assess the significance of an observed cross-validation error rate. More specifically, for each permutation of the class labels, perform cross-validation as described above, and record the error rate. The proportion of the error rates that are smaller than or equal to the observed error rate serves as the significance level of the observed error rate. If the significance level is smaller than 0.05, Radmacher et al reject the hypothesis that there is no systematic difference in expression profiles between classes. (In practice, it is too time-consuming to examine every permutation. Instead, Radmacher et al estimate the significance level by examining 2000 randomly selected permutations.)

In this report, we will follow the recommendations of Radmacher et al.

2 Data preprocessing

The mRNA expression levels have been measured using either cDNA microchips or high-density oligonucleotide chips. Due to outliers, missing data points, etc., the "raw" mRNA data must usually be preprocessed in various ways:

1. Thresholding, i.e. if necessary increase or decrease expression levels such that all levels lie between specified lower and upper thresholds.

2. Filtering, e.g. removal of genes with too many missing data points or genes that vary too little between samples.

3. Log-transformation: it seems to be standard to log-transform either the ratio of red to green intensities (cDNA microchips) or the difference between the average PM (perfect match) and MM (mismatch) intensities (oligonucleotide chips).

4. Standardization (of columns or rows).

5. Missing value imputation, e.g. using the k-NN approach described by Troyanskaya et al (2001).

As an example, according to Dudoit et al (2002), the following data preprocessing steps were applied by Golub et al (1999) to the leukemia data set (downloadable from http://waldo.wi.mit.edu/MPR/data_set_ALL_AML.html):

1. Thresholding: floor of 100 and ceiling of 16,000.

2. Filtering: exclusion of genes with max/min ≤ 5 or (max − min) ≤ 500, where max and min refer to the maximum and minimum intensities for a particular gene across all the mRNA samples.

3. Base 10 logarithm transformation.
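
For concreteness, these three steps can be sketched in MATLAB as follows (a minimal sketch; the variable name X for the genes-by-samples intensity matrix is illustrative and not necessarily that of the actual implementation):

    % X is an n-by-m matrix of raw intensities (rows = genes, columns = samples).
    X = max(X, 100);                          % thresholding: floor of 100
    X = min(X, 16000);                        % thresholding: ceiling of 16,000
    mx = max(X, [], 2);                       % per-gene maximum across samples
    mn = min(X, [], 2);                       % per-gene minimum across samples
    X = X((mx ./ mn > 5) & (mx - mn > 500), :);  % keep genes passing the filtering rule
    X = log10(X);                             % base 10 logarithm transformation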

3 Gene selection methods

Class prediction using gene expression samples differs from many other applications in that $m$, the number of labeled entities, is much smaller than $n$, the number of features. Usually, the number of mRNA samples is less than one hundred, while there can be tens of thousands of genes.

In this section, brief descriptions of a number of gene selection (or ranking) methods are given. The purpose of these methods is to identify the genes whose expression levels are informative with respect to class membership. For example, a gene that is strongly down-regulated in all AML samples and strongly up-regulated in all ALL samples is clearly informative, whereas a gene that is weakly up-regulated in all samples can hardly qualify as informative.


The methods presented below all begin by computing, for each gene, some kind of gene-correlation score, which is intended to measure how informative a gene is. For example, the higher the score, the more informative the gene. The next step is (or at least should be) to assess the significance of the scores. That is, we must decide which (if any) genes actually can be regarded as informative. Ideally, only informative genes should be used to construct the class prediction function.

The expression levels are stored in an $n \times m$ matrix $X$, which we refer to as the gene expression matrix. That is, the genes correspond to the rows of the matrix, and the columns correspond to the labeled samples. The class labels are stored in an array $L = (l_1, l_2, \ldots, l_m)$. There are no missing data in $X$, and the data has been properly thresholded, filtered and log-transformed. In the descriptions below, the following definitions are used:

• $\mu_i$ and $\sigma_i$ denote the sample mean and standard deviation, respectively, of the elements $X_{ij}$, $j = 1, 2, \ldots, m$.

• $\mu_i^c$ and $\sigma_i^c$, $c \in \{1, 2, \ldots, K\}$, are the sample mean and standard deviation, respectively, of the elements $X_{ij}$, $j = 1, 2, \ldots, m$, for which $l_j = c$.

• $m_c = |\{j \in \{1, 2, \ldots, m\} : l_j = c\}|$, $c \in \{1, 2, \ldots, K\}$.

3.1 Two-sample t-tests

The t-test score is:

$$T(i, L) = \frac{\mu_i^1 - \mu_i^2}{\sqrt{\dfrac{(\sigma_i^1)^2}{m_1} + \dfrac{(\sigma_i^2)^2}{m_2}}}.$$

The larger the absolute value of a t-test score, the more informative the gene is. To assess the significance of $T(i, L)$, we can estimate its p-value by computing:

$$I_{\frac{v}{v + T(i, L)^2}}\!\left(\frac{v}{2}, \frac{1}{2}\right),$$

where $I_x(a, b)$ denotes the incomplete Beta function (an algorithm for computing this function is given in Press et al (1988)) and

$$v = \frac{\left(\dfrac{(\sigma_i^1)^2}{m_1} + \dfrac{(\sigma_i^2)^2}{m_2}\right)^2}{\dfrac{1}{m_1 - 1}\left(\dfrac{(\sigma_i^1)^2}{m_1}\right)^2 + \dfrac{1}{m_2 - 1}\left(\dfrac{(\sigma_i^2)^2}{m_2}\right)^2}.$$

Note that the estimate of the p-value is reliable only if the preprocessed gene expression data can be assumed to be (approximately) normally distributed. Furthermore, since there may be thousands of genes, the p-values should also be adjusted for multiple testing, e.g. using the Bonferroni procedure.
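
To make the computation concrete, the score and the p-value estimate can be written in MATLAB roughly as follows (a sketch; it assumes a preprocessed genes-by-samples matrix X and a label array L with classes 1 and 2, and uses MATLAB's built-in incomplete beta function betainc):

    function [T, p] = ttest_score(X, L)
    % Welch t-test score and approximate two-sided p-value for each gene (row of X).
    m1 = sum(L == 1);  m2 = sum(L == 2);
    mu1 = mean(X(:, L == 1), 2);    mu2 = mean(X(:, L == 2), 2);
    s1  = std(X(:, L == 1), 0, 2);  s2  = std(X(:, L == 2), 0, 2);
    a = s1.^2 / m1;  b = s2.^2 / m2;
    T = (mu1 - mu2) ./ sqrt(a + b);
    v = (a + b).^2 ./ (a.^2 / (m1 - 1) + b.^2 / (m2 - 1));  % degrees of freedom as above
    p = betainc(v ./ (v + T.^2), v / 2, 0.5);               % p-value as defined above
    end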


3.2 The method of Golub et al

Golub et al (1999) normalize each row of $X$ by first subtracting the row mean, and then dividing by the row standard deviation. That is, the normalized entry is:

$$\tilde{X}_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i}.$$

Let $\tilde{\mu}_i^c$ and $\tilde{\sigma}_i^c$, $c \in \{1, 2\}$, be the mean and standard deviation, respectively, of the elements $\tilde{X}_{ij}$, $j = 1, 2, \ldots, m$, such that $l_j = c$. The gene-class correlation score for the $i$th gene is then computed as

$$P(i, L) = \frac{\tilde{\mu}_i^1 - \tilde{\mu}_i^2}{\tilde{\sigma}_i^1 + \tilde{\sigma}_i^2}.$$

Clearly, if a gene has high (low) expression levels for class 1 and low (high) expression levels for class 2 (and the standard deviations are not too large), the corresponding score will be a relatively large positive (negative) number. On the other hand, if a gene has similar expression levels for both classes, the score will be close to zero.

To assess the significance of the gene-class correlation scores, Golub et al perform a neighborhood analysis. This is done as follows. Let $N_1(L, r) = |\{i : P(i, L) \geq r\}|$ and $N_2(L, r) = |\{i : P(i, L) \leq -r\}|$. Suppose, for example, that $N_1(L, r) = 10$ for $r = 1.3$, that is, there are ten genes with score larger than or equal to 1.3. To decide if this is unusual, Golub et al compute

$$\frac{|\{j : N_1(L_j, r) \geq N_1(L, r)\}|}{B},$$

where $L_j$, $j = 1, 2, \ldots, B$, are random permutations of $L$. (Golub et al use $B = 400$.) If this ratio is small, let's say not larger than 0.05, then the gene-class correlations for the ten genes are regarded as significant at the 5% level.

If the neighborhood analysis shows that there are genes with significant class correlation, the next step is to select a subset of particularly informative genes to use as a prediction set. Golub et al select the $n'/2$ genes with smallest (i.e. most negative) correlation score, and the $n'/2$ genes with largest (i.e. most positive) correlation score, where $n'$ is a free parameter (not to be confused with the total number of genes $n$). Golub et al used $n' = 50$. Alternative methods for selecting the prediction set are discussed in Slonim et al (2000).
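
As an illustration, the score (often called the signal-to-noise ratio) and the neighborhood analysis might be sketched in MATLAB as follows; the function names are illustrative, and the label permutations are drawn with randperm:

    function [P, signif] = golub_select(X, L, r, B)
    % P(i): gene-class correlation score of gene i; signif: estimated significance
    % of the observed count N1(L, r), based on B random label permutations.
    Xn = (X - mean(X, 2)) ./ std(X, 0, 2);       % normalize each row (gene)
    P  = snr(Xn, L);
    N1 = sum(P >= r);
    count = 0;
    for b = 1:B
        Lp = L(randperm(numel(L)));              % random permutation of the labels
        count = count + (sum(snr(Xn, Lp) >= r) >= N1);
    end
    signif = count / B;
    end

    function P = snr(Xn, L)
    mu1 = mean(Xn(:, L == 1), 2);    mu2 = mean(Xn(:, L == 2), 2);
    s1  = std(Xn(:, L == 1), 0, 2);  s2  = std(Xn(:, L == 2), 0, 2);
    P = (mu1 - mu2) ./ (s1 + s2);
    end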

3.3 The method of Dudoit, Fridlyand and Speed

Dudoit et al (2002) rank genes using the following score:

$$R(i, L) = \frac{\sum_{j=1}^{m} \sum_{c=1}^{K} I(l_j = c)\,(\mu_i^c - \mu_i)^2}{\sum_{j=1}^{m} \sum_{c=1}^{K} I(l_j = c)\,(X_{ij} - \mu_i^c)^2},$$

where $I(\text{condition})$ is 1 if condition is true; otherwise 0. The numerator and denominator are the between-class and within-class sums of squares, respectively, for the $i$th gene. They select the $p$ genes with the largest ratio (they use $p$ ranging from 30 to 50). Dudoit et al briefly discuss the usefulness of p-values for $R(i, L)$, but do not describe how to compute them.
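
In MATLAB, the ratio can be computed for all genes at once; the following is a sketch with illustrative names, assuming X is the genes-by-samples matrix and L the label array:

    function R = bss_wss(X, L)
    % Between-class to within-class sum-of-squares ratio for each gene (row of X).
    mu  = mean(X, 2);                        % overall mean of each gene
    bss = zeros(size(X, 1), 1);
    wss = zeros(size(X, 1), 1);
    for c = unique(L(:))'                    % loop over the classes
        Xc  = X(:, L == c);
        muc = mean(Xc, 2);
        bss = bss + size(Xc, 2) * (muc - mu).^2;
        wss = wss + sum((Xc - muc).^2, 2);
    end
    R = bss ./ wss;
    end

Genes are then ranked by sorting R in descending order and keeping the top p genes.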

3.4 The TNoM method

Ben-Dor et al (2000) propose the threshold number of misclassifications (TNoM) score. The rank vector $v_i = (v_{i1}, v_{i2}, \ldots, v_{im})$ for the $i$th gene is defined as follows. If the sample corresponding to the $k$th smallest member of $\{X_{ij} : j = 1, 2, \ldots, m\}$ belongs to class 1, then $v_{ik} = +$; otherwise $v_{ik} = -$. $TNoM(v)$ measures to what extent it is possible to divide $v$ into two homogeneous parts. More specifically,

$$TNoM(v) = \min_{x \,\&\, y = v} \left( MC(x) + MC(y) \right),$$

where $x \,\&\, y$ denotes the concatenation of the vectors $x$ and $y$, and $MC(x)$ is the cardinality of the minority element in the vector $x$, i.e. $MC(x) = \min(\#+ \text{ in } x, \#- \text{ in } x)$. For example, if the $i$th gene is down-regulated in all AML samples but up-regulated in all ALL samples (or vice versa), then $TNoM(v_i)$ is zero. Note that $TNoM(v_i)$ cannot exceed $m/2$. Therefore, if $n$ is large, many genes will have the same TNoM score. Ben-Dor et al describe an exact procedure for computing p-values for TNoM scores.
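
A straightforward (if not particularly efficient) way to compute the score of a single gene is sketched below; x is the gene's expression vector, L the label array, and the names are illustrative:

    function s = tnom_score(x, L)
    % TNoM score of one gene: the smallest total minority count over all ways of
    % splitting the rank vector into a prefix and a suffix.
    [~, order] = sort(x);
    v = (L(order) == 1);               % rank vector: true for class 1, false for class 2
    m = numel(v);
    s = m;
    for k = 0:m                        % split after position k
        pre  = v(1:k);  post = v(k+1:m);
        mc   = min(sum(pre), sum(~pre)) + min(sum(post), sum(~post));
        s    = min(s, mc);
    end
    end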

3.5 The method of Park, Pagano and Bonetti

Park et al (2001) propose the following score:

$$S(L, i) = \sum_{j : l_j = 2} \; \sum_{k : l_k = 1} h(X_{ik} - X_{ij}),$$

where $h(x)$ is 1 if $x > 0$, and zero otherwise. That is, the score for the $i$th gene is computed by first, for each sample belonging to class 2, counting the number of samples belonging to class 1 that have higher expression levels, and then summing these numbers. Note that if $S(L, i) = 0$, then the $i$th gene has consistently higher expression levels for class 2 than for class 1. Conversely, if $S(L, i) = m_1 m_2$, the expression levels for class 2 are consistently lower than for class 1. P-values are computed by permuting class labels in a manner similar to the neighborhood analysis proposed by Golub et al (1999).
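
The score is essentially a Mann-Whitney-type count and is easy to compute for all genes at once; a sketch with illustrative names:

    function S = ppb_score(X, L)
    % For each gene i, count the pairs (j in class 2, k in class 1) with X(i,k) > X(i,j).
    X1 = X(:, L == 1);  X2 = X(:, L == 2);
    S = zeros(size(X, 1), 1);
    for j = 1:size(X2, 2)
        S = S + sum(X1 > X2(:, j), 2);   % class-1 samples exceeding this class-2 sample
    end
    end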

4 Classification methods

In this section, we describe a number of methods for constructing and using class prediction functions. The primary input for the construction of a prediction function consists of a training data set: $T = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in \mathbb{R}^n, l_i \in \{1, 2, \ldots, K\}\}$.


The process of constructing a class prediction function based on the training data is called the training phase. The training phase may also require the user to specify various parameters. In general, the more "sophisticated" the method, the more parameters it requires. The simplest methods have essentially no training phase, while for other methods, the training phase may be quite time consuming. Once the prediction function is constructed, we can use it to predict the label of an unknown sample $Y \in \mathbb{R}^n$.

4.1 Nearest neighbor classification

Nearest-neighbors (NN) methods are based on some distance measure $d(x, y)$ (e.g. one minus the Pearson correlation) for pairs of samples. To classify a sample $Y$, we first find the $k$ samples in the training set that are closest to $Y$. Then we can use either the majority rule or the distance-weighted rule to decide which class $Y$ should be assigned to. If $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ are the $k$ closest samples, the majority rule simply assigns $Y$ to the class to which most of them belong, i.e.

$$\hat{f}(Y) = \arg\max_{c \in \{1, 2, \ldots, K\}} \sum_{j=1}^{k} I(l(X_{i_j}) = c),$$

where $l(X_{i_j})$ denotes the class label of $X_{i_j}$.

How do we find an appropriate value for $k$? One possibility, described in Dudoit et al (2002), is to examine several values for $k$, and choose the value giving the smallest cross-validation error.
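
A compact MATLAB sketch of the majority rule with the "one minus Pearson correlation" distance is given below (corr requires the Statistics Toolbox; the function name is illustrative):

    function c = knn_predict(Xtrain, Ltrain, Y, k)
    % Classify the column vector Y by majority vote among its k nearest training samples.
    d = 1 - corr(Xtrain, Y);           % one minus Pearson correlation to each training sample
    [~, idx] = sort(d);                % ascending: smallest distance first
    c = mode(Ltrain(idx(1:k)));        % majority class among the k nearest neighbors
    end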

4.2 Quadratic and linear discriminant analysis

The quadratic discriminant rule is

$$\hat{f}(Y) = \arg\min_{c \in \{1, 2, \ldots, K\}} \delta_c(Y),$$

where

$$\delta_c(Y) = \frac{1}{2}\log|\Sigma_c| + \frac{1}{2}(Y - \mu_c)^T \Sigma_c^{-1} (Y - \mu_c) - \log \pi_c$$

are called the quadratic discriminant functions, and $\Sigma_c$, $\mu_c$ and $\pi_c$ are the population covariance matrix, population mean vector, and prior probability, respectively, of class $c$. (See e.g. Hastie et al (2001).) If we assume that all classes have a common population covariance matrix, $\Sigma$, the discriminant functions simplify to the linear discriminant functions:

$$\delta_c(Y) = \frac{1}{2}(Y - \mu_c)^T \Sigma^{-1} (Y - \mu_c) - \log \pi_c.$$

In practice, we do not know the population covariance, mean, and prior probability, but they can be estimated as follows:


Class mean: $\hat{\mu}_c = \dfrac{1}{m_c} \displaystyle\sum_{j : l_j = c} X_j$,

Class covariance matrix: $\hat{\Sigma}_c = \dfrac{1}{m_c - 1} \displaystyle\sum_{j : l_j = c} (X_j - \hat{\mu}_c)(X_j - \hat{\mu}_c)^T$,

Common covariance matrix: $\hat{\Sigma} = \dfrac{1}{m - K} \displaystyle\sum_{c=1}^{K} \sum_{j : l_j = c} (X_j - \hat{\mu}_c)(X_j - \hat{\mu}_c)^T = \dfrac{1}{m - K} \displaystyle\sum_{c=1}^{K} (m_c - 1)\,\hat{\Sigma}_c$, and

Class prior probability: $\hat{\pi}_c = m_c / m$.

Dudoit et al (2002) reported surprisingly good results with a simplified version of the linear discriminant function. In this version, only the diagonal elements of the common covariance matrix are used. More specifically,

$$\delta_c(Y) = \sum_{i=1}^{n} \frac{(Y_i - \mu_i^c)^2}{\sum_{j=1}^{m} \sum_{c'=1}^{K} I(l_j = c')\,(X_{ij} - \mu_i^{c'})^2} = \sum_{i=1}^{n} w_i \,(Y_i - \mu_i^c)^2.$$

We can interpret $\delta_c(Y)$ as the (squared) "weighted" Euclidean distance between $Y$ and the class mean $\mu_c$. More weight is given to genes whose expression values are close to the class means. This means that even if $Y$ is closer to $\mu_1$ than to $\mu_2$ (according to the ordinary Euclidean distance measure), $Y$ may still be predicted to belong to class 2 if the class 1 training samples exhibit less variation than the class 2 training samples. Dudoit et al refer to class prediction using this function as diagonal linear discriminant analysis (DLDA).
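
DLDA amounts to a nearest-centroid rule with per-gene weights and takes only a few lines of MATLAB; the sketch below (illustrative names) assumes X is the genes-by-samples training matrix, L the label array, and Y a test sample:

    function c = dlda_predict(X, L, Y)
    % Diagonal linear discriminant analysis: weighted distance to each class centroid.
    classes = unique(L(:))';
    wss = zeros(size(X, 1), 1);                % pooled within-class sum of squares per gene
    mu  = zeros(size(X, 1), numel(classes));
    for k = 1:numel(classes)
        Xc = X(:, L == classes(k));
        mu(:, k) = mean(Xc, 2);
        wss = wss + sum((Xc - mu(:, k)).^2, 2);
    end
    delta = sum((Y - mu).^2 ./ wss, 1);        % weighted squared distance to each centroid
    [~, best] = min(delta);
    c = classes(best);
    end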

4.3 Weighted gene voting

In section 3.2 it was described how Golub et al (1999) propose to rank genes, and how to select a subset, the so-called prediction set $S$, consisting of the most relevant genes. To predict the class membership of a sample $Y$, each gene $i \in S$ casts a weighted vote:

$$V(i, Y) = P(i, L)(\tilde{y}_i - b),$$

where

$$\tilde{y}_i = \frac{\log y_i - \mu_i}{\sigma_i}, \qquad b = \frac{\tilde{\mu}_i^1 + \tilde{\mu}_i^2}{2},$$

and $y_i$ is the expression level (before log-transformation) of the $i$th gene in sample $Y$. The $\tilde{\mu}_i^1$, $\tilde{\mu}_i^2$, $\mu_i$ and $\sigma_i$ are defined in section 3.

Note that $V(i, Y) < 0$ either if $P(i, L) < 0$ and $\tilde{y}_i > b$, or if $P(i, L) > 0$ and $\tilde{y}_i < b$. In the first case, since $\tilde{\mu}_i^2 > \tilde{\mu}_i^1$, we see that $\tilde{y}_i$ is closer to $\tilde{\mu}_i^2$ than to $\tilde{\mu}_i^1$. In the second case, since $\tilde{\mu}_i^1 > \tilde{\mu}_i^2$, we see again that $\tilde{y}_i$ is closer to $\tilde{\mu}_i^2$ than to $\tilde{\mu}_i^1$. Golub et al interpret this as a vote for class 2. By similar reasoning, $V(i, Y) > 0$ can be interpreted as a vote for class 1. Next, we compute $V_1$ and $V_2$, the total (unsigned) votes for each class, and the prediction strength $PS$:

$$V_1 = \sum_{i : V(i, Y) > 0} V(i, Y), \qquad V_2 = -\sum_{i : V(i, Y) < 0} V(i, Y), \qquad PS = \frac{\max(V_1, V_2) - \min(V_1, V_2)}{V_1 + V_2}.$$

The sample $Y$ is predicted to belong to the class with the largest total vote, unless $PS$ is too small. (Golub et al used 0.3 as the threshold value; that is, the difference in votes between the "winner" and the "loser" must be at least 30% of the total vote.)
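
A sketch of the voting rule in MATLAB is given below; the function signature is illustrative, the per-gene statistics are assumed to have been computed from the training data as in section 3.2, and a base-10 logarithm is assumed to match the preprocessing of section 2:

    function [c, PS] = weighted_vote(P, mu1, mu2, mu, sigma, y, S)
    % P, mu1, mu2: P(i,L) and the normalized class means for each gene (training data).
    % mu, sigma: per-gene mean and standard deviation used for the normalization.
    % y: raw (not yet log-transformed) expression levels of the new sample; S: prediction set.
    yt = (log10(y(S)) - mu(S)) ./ sigma(S);     % normalize the new sample, genes in S only
    b  = (mu1(S) + mu2(S)) / 2;                 % per-gene decision boundary
    V  = P(S) .* (yt - b);                      % weighted vote of each gene in S
    V1 = sum(V(V > 0));  V2 = -sum(V(V < 0));   % total votes for class 1 and class 2
    PS = (max(V1, V2) - min(V1, V2)) / (V1 + V2);
    if V1 >= V2, c = 1; else c = 2; end         % accept the call only if PS exceeds e.g. 0.3
    end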

4.4 The Arc-fs boosting method

To construct a prediction function, boosting methods repeatedly apply a weak classification method (i.e. a method that is guaranteed to perform only slightly better than random guessing) to modified versions of the training data, producing a sequence $f_i(Y)$, $i = 1, 2, \ldots, M$, of prediction functions. The resulting prediction function is a weighted vote:

$$\hat{f}(Y) = \mathrm{sign}\left(\sum_{i=1}^{M} \alpha_i f_i(Y)\right),$$

where the $\alpha_i$, $i = 1, 2, \ldots, M$, are computed by the boosting method. (Here we assume that there are only two classes, and that they are labeled with $\pm 1$.)

Several variations on boosting have been proposed. Breiman (1998) presented the following variation (referred to as Arc-fs):

1. Initialize the sampling probabilities: $p_j = 1/m$, $j = 1, 2, \ldots, m$.

2. For $i = 1, 2, \ldots, M$ do:

   a. Using the current sampling probabilities, draw $m$ samples (with replacement) from the original training set $S$ to create a new training data set $\hat{S}$.

   b. Fit a prediction function $f_i(Y)$ to $\hat{S}$.

   c. Compute $\varepsilon_i = \sum_{j=1}^{m} p_j I(f_i(X_j) \neq l_j)$. (If $\varepsilon_i = 0$ or $\varepsilon_i \geq 0.5$, Breiman recommends starting all over at Step 1.)

   d. Compute $\beta_i = (1 - \varepsilon_i)/\varepsilon_i$.

   e. Set
   $$p_j \leftarrow \frac{p_j \,\beta_i^{I(f_i(X_j) \neq l_j)}}{\sum_{j'=1}^{m} p_{j'} \,\beta_i^{I(f_i(X_{j'}) \neq l_{j'})}}, \qquad j = 1, 2, \ldots, m.$$

3. Output $\hat{f}(Y) = \mathrm{sign}\left(\sum_{i=1}^{M} \log(\beta_i) f_i(Y)\right)$.

As we can see, misclassified samples get their sampling probabilities increased, and thus become more likely to be selected in Step 2a. The idea is that this will force successive prediction functions to improve their accuracy on these samples. As the weak classification method, one can use a very simple "classification tree" (a decision stump):

$$f(Y) = \begin{cases} d & \text{if } Y_i > t, \\ -d & \text{if } Y_i < t, \end{cases}$$

where the parameters $d = \pm 1$, $i$ and $t$ are determined such that $f(Y)$, when applied to the training data, gives the smallest number of errors.
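
The procedure can be sketched in MATLAB as follows (illustrative names; randsample requires the Statistics Toolbox, and a cap on restarts is added because the restart rule of Step 2c would loop forever on data that a single gene separates perfectly):

    function model = arcfs_train(X, L, M)
    % Arc-fs boosting with decision stumps as the weak classifier; L must contain +1/-1.
    m = size(X, 2);  L = L(:);
    p = ones(m, 1) / m;                              % Step 1: uniform sampling probabilities
    model = struct('gene', {}, 't', {}, 'd', {}, 'beta', {});
    k = 1;  tries = 0;
    while k <= M && tries < 10 * M                   % cap on restarts
        tries = tries + 1;
        idx = randsample(m, m, true, p);             % Step 2a: resample with replacement
        [gi, t, d] = fit_stump(X(:, idx), L(idx));   % Step 2b: fit the weak classifier
        f = d * (2 * (X(gi, :)' > t) - 1);           % predictions on the original samples
        err = sum(p .* (f ~= L));                    % Step 2c: weighted error
        if err == 0 || err >= 0.5                    % restart, as Breiman recommends
            p = ones(m, 1) / m;  model = model([]);  k = 1;  continue
        end
        beta = (1 - err) / err;                      % Step 2d
        p = p .* beta.^(f ~= L);  p = p / sum(p);    % Step 2e: re-weight and renormalize
        model(k) = struct('gene', gi, 't', t, 'd', d, 'beta', beta);
        k = k + 1;
    end
    end

    function [gene, t, d] = fit_stump(X, L)
    % Exhaustive search for the single-gene threshold rule with the fewest training errors.
    best = inf;  L = L(:)';
    for i = 1:size(X, 1)
        for thr = sort(X(i, :))                      % candidate thresholds
            for dir = [-1 1]
                err = sum(dir * (2 * (X(i, :) > thr) - 1) ~= L);
                if err < best, best = err; gene = i; t = thr; d = dir; end
            end
        end
    end
    end

    function c = arcfs_predict(model, Y)
    % Step 3: weighted vote of the stumps, with weights log(beta_i).
    v = 0;
    for k = 1:numel(model)
        v = v + log(model(k).beta) * model(k).d * (2 * (Y(model(k).gene) > model(k).t) - 1);
    end
    c = sign(v);
    end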

4.5 Support vector machine classification

Support vector machines (SVM) (Vapnik (2000)) have become popular classifiers within the machine learning community. SVMs have been applied to gene expression data in several publications, e.g. Mukherjee et al (1999), Furey et al (2000), and Ben-Dor et al (2000).

The training phase of an SVM consists in solving the following optimization problem. Given the training data $S = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in \mathbb{R}^n, l_i \in \{-1, +1\}\}$, a kernel function $K(x, y)$, and a positive real number $C$, find $\alpha_1, \ldots, \alpha_m$ that

maximize $\displaystyle\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j l_i l_j K(X_i, X_j)$

subject to $\displaystyle\sum_{i=1}^{m} \alpha_i l_i = 0$ and $C \geq \alpha_i \geq 0$, $i = 1, \ldots, m$.

The prediction function is $f(Y) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i^* l_i K(X_i, Y) + b^*\right)$, where $\alpha_1^*, \ldots, \alpha_m^*$ are the solutions to the optimization problem, and $b^*$ is chosen such that $l_i f(X_i) = 1$ for any $i$ with $C > \alpha_i^* > 0$. Examples of kernel functions are:

• Gaussian Radial Basis Function (RBF): $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$.

• Polynomials of degree $d$: $K(x, y) = (x \cdot y + 1)^d$.

• Multi-layer perceptron function: $K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$ (for some values of $\kappa$ and $\delta$).
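
Since the experiments in section 5 use the OSU SVM toolbox, the following is only an illustrative sketch of how, for small data sets, the dual problem could be handed directly to MATLAB's quadprog (Optimization Toolbox), hard-wired to the linear kernel:

    function [alpha, b] = svm_train(X, L, C)
    % Solve the dual SVM problem with a linear kernel. X is n-by-m, L contains +1/-1.
    m = size(X, 2);  L = L(:);
    K = X' * X;                                    % linear kernel matrix K(X_i, X_j)
    H = (L * L') .* K;                             % Hessian of the dual objective
    alpha = quadprog(H, -ones(m, 1), [], [], L', 0, zeros(m, 1), C * ones(m, 1));
    sv = find(alpha > 1e-6 & alpha < C - 1e-6);    % support vectors strictly inside (0, C)
    b = L(sv(1)) - sum(alpha .* L .* K(:, sv(1))); % assumes at least one such vector exists
    end

    % Prediction for a new sample Y (an n-by-1 column vector):
    %   label = sign(sum(alpha .* L .* (X' * Y)) + b);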


5 Evaluation

We have implemented all of the gene selection and class prediction methods described above in MATLAB (except the SVM method, where we used the implementation available in the OSU Support Vector Machines (SVMs) Toolbox). In this section, we present the results of applying these methods to various data sets.

5.1 Data sets

Unfortunately, it is not so easy to find appropriate collections of classified mRNA samples. Dudoit et al (2002) use only three such data sets: lymphoma, leukemia, and NCI 60. Ben-Dor et al (2000) use, in addition to the leukemia data, colon and ovarian data. Here, we use only one such collection, the leukemia data; in place of additional real data sets, we use computer-generated data sets. Of course, these data sets cannot substitute for real data sets, but they allow systematic studies of e.g. sensitivity to "noise" that would be hard to do using only a few real data sets.

5.1.1 Leukemia data

For the leukemia data, gene expression levels were measured using Affymetrix high-density oligonucleotide arrays containing 6,817 human genes. There is a total of 72 samples, of which 47 have been classified as ALL, and the remaining as AML. After preprocessing as described in section 2, 3,517 genes remain.

5.1.2 Simulation of gene expression data

We developed a MATLAB function for generating "mRNA samples". The input parameters are:

• Total number of genes;
• Number of informative genes;
• Total number of samples;
• Number of samples with class label "1" (the remaining samples are labeled "2").

The expression levels of a non-informative gene are normally distributed with standard deviation equal to 1500 and a gene-specific mean. The latter is determined by drawing a random number between 3000 and 5000. For an informative gene, it is first decided (with equal probability) if it is to be high for class 1 and low for class 2, or vice versa. In the first case (the second case is treated analogously), the expression values for class 1 are normally distributed (again with standard deviation 1500) with mean 5000, while the class 2 expression values are normally distributed (again with standard deviation 1500) with mean 3000. The output from the function consists of the gene expression matrix and the class labels array.

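A sketch of such a generator (with illustrative function and variable names, and with the convention that the first rows of the output matrix are the informative genes) is:

    function [X, L] = simulate_expression(nGenes, nInform, nSamples, nClass1)
    % Generate a synthetic gene expression matrix as described above.
    L  = [ones(1, nClass1), 2 * ones(1, nSamples - nClass1)];
    mu = 3000 + 2000 * rand(nGenes, 1);           % gene-specific means, uniform on [3000, 5000]
    X  = mu + 1500 * randn(nGenes, nSamples);     % non-informative genes
    highForClass1 = rand(nInform, 1) < 0.5;       % direction of each informative gene
    for i = 1:nInform                             % the first nInform rows are informative
        m1 = 5000;  m2 = 3000;
        if ~highForClass1(i), m1 = 3000; m2 = 5000; end
        X(i, L == 1) = m1 + 1500 * randn(1, nClass1);
        X(i, L == 2) = m2 + 1500 * randn(1, nSamples - nClass1);
    end
    end
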
5.2 Evaluation of gene selection methods

In section 3, we described five gene selection methods: the t-test, the method of Golub et al, the method of Dudoit, Fridlyand and Speed, the TNoM method, and the method of Park, Pagano and Bonetti. In this section, we investigate to what extent these methods agree in their ranking of the "informativeness" of the genes, and whether any method can be regarded as more or less correct than the others.


For the sake of brevity, in the following we refer to the methods of:

1. Golub et al as the "GOLUB method",
2. Dudoit, Fridlyand and Speed as the "DFS method", and
3. Park, Pagano and Bonetti as the "PPB method".

5.2.1 Leukemia data

Each method first ranked the genes and then selected the 1%, 5%, 10%, 15%, and 20% most informative genes. The sets of selected genes were then compared with each other. The overall agreement, that is, the proportion of genes selected by all methods, varied around 60%. For example, when each method selected the 1% (i.e. 36) most informative genes, 21 genes were selected by all methods.

The methods were also compared pairwise. The two methods that disagree most are the t-test and the TNoM methods. Their overlap (that is, the extent to which they select the same genes) varies around 66%. (However, the TNoM method tends to disagree also with the other methods.) At the other end of the spectrum, the DFS and GOLUB methods have an overlap that ranges between 78% (when the 1% most informative genes were selected) and 93%.

5.2.2 Computer-generated data

The data consists of 50 samples (with class labels evenly divided over two classes) with expression levels for 1000 genes, of which 10% are informative. Each method selected the 1%, 5%, 10%, 15%, and 20% most informative genes, and the sets were compared in the same manner as for the leukemia data. This process was repeated fifty times, and the average overlaps were computed.

The overall agreement varied between 52% (when the 1% most informative genes were selected) and 90% (when the 10% most informative genes were selected). The latter result indicates that all methods successfully identified the informative genes. The pairwise comparisons showed a very high degree of agreement (96% to 100%) between the t-test and DFS methods. Again, the TNoM method disagreed most with the other methods. In particular, it disagreed with the GOLUB method, where the overlap varied between 58% and 92%.

We also investigated how successful the methods were in identifying the genes that had been created to be informative. This was done by determining how many of the 10% most informative genes selected by the methods had also been created to be informative. On average, 96 of the 100 most informative genes selected by the t-test and DFS methods had been created to be informative. The TNoM and PPB methods were slightly less successful; on average, they selected 93.5% of the informative genes.

5.2.3 Conclusions

The above results indicate that the methods essentially agree on which genes are most informative. Therefore, in the following we only use the DFS gene selection method (except that we use the GOLUB method together with the class prediction method of Golub et al).


5.3 Evaluation of class prediction methods

In evaluating the accuracy of the class prediction methods, we follow the framework proposed by Radmacher et al (as described in section 1). That is, we use leave-one-out cross-validation (LOOCV). More specifically, to evaluate the accuracy of a method, we perform the following procedure:

#Errors = 0;
For each $(X_j, l_j)$ in $T = \{(X_1, l_1), \ldots, (X_m, l_m) : X_i \in \mathbb{R}^n, l_i \in \{1, 2, \ldots, K\}\}$ do:
    Form the training set $\hat{T} = T - \{(X_j, l_j)\}$;
    If required, rank the genes in $\hat{T}$ and select the most informative (let $S$ denote the selected genes);
    Construct the class prediction function $\hat{f}(Y)$ using $\hat{T}_S$ ($\hat{T}_S$ denotes the restriction of $\hat{T}$ to the genes in $S$; if no gene selection was done, $\hat{T}_S = \hat{T}$);
    If $\hat{f}(X_j^S) \neq l_j$, then #Errors = #Errors + 1.
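
Wired together with the illustrative functions sketched earlier (bss_wss for the DFS score and dlda_predict for the classifier), the procedure might look as follows in MATLAB; nKeep = 0 means that no gene selection is performed:

    function nErrors = loocv(X, L, nKeep)
    % Leave-one-out cross-validation with optional gene selection.
    m = size(X, 2);  nErrors = 0;
    for j = 1:m
        train = setdiff(1:m, j);
        Xt = X(:, train);  Lt = L(train);
        if nKeep > 0
            [~, order] = sort(bss_wss(Xt, Lt), 'descend');  % rank genes on the training set only
            S = order(1:nKeep);
        else
            S = 1:size(X, 1);
        end
        if dlda_predict(Xt(S, :), Lt, X(S, j)) ~= L(j)
            nErrors = nErrors + 1;
        end
    end
    end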

In the following, whenever we refer to the number of errors (or error ratio), we mean the number of errors computed by the above procedure. We refer to the procedure itself as the LOOCV procedure.

5.3.1 Prediction without gene selection

We begin by evaluating the accuracy of the methods when no gene selection is performed. That is, we construct the prediction functions using training sets that may contain many irrelevant genes. Observe, however, that in the case of the leukemia data, many irrelevant genes were removed by the preprocessing procedure. It is therefore possible that the proportion of informative genes is rather high. For the computer-generated data, on the other hand, we know that, except for the genes that we created to be informative, all other genes are irrelevant.

5.3.1.1 Leukemia data

All of the methods performed quite well on the leukemia data: after some parameter "fine-tuning", no method made more than three errors. Moreover, two samples (numbers 66 and 67) were misclassified by almost all methods, and it is possible that these samples have been incorrectly labeled. What distinguishes the methods is how fast they run. The following table gives the execution time (in seconds) of the LOOCV procedure for each method ("GOLUB" refers here to the class prediction method proposed by Golub et al (1999)):

Method:    Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM (linear kernel)
Time (s):  4156               13     99      13             40

We end this section with some brief comments on some of the methods.


Arc-fs: As can be seen from the table, the training phase is very time consuming, at least compared to the other methods. (An experiment described in Breiman (1998) requires 4 hours of CPU time.) Therefore, only the following (rather low) M-values were tried: 1, 5, 10, 30, 50, and 100. The number of errors decreased from 7 to 1 as the M-value increased. Note that, since Arc-fs uses random sampling, the number of errors may vary even if the same training data and M-value are used. This is particularly likely if the M-value is small.

k-NN: Various k-values were tried out. For k less than 30, the number of errors varied between 1 and 3. For higher k-values, the number of errors increased dramatically.

SVM: With the linear polynomial kernel, there were two errors, while the radial basis kernel resulted in 25 errors! From now on, we use only the linear kernel SVM.

5.3.1.2 Computer-generated data

In the following experiments, the class prediction methods were applied to computer-generated data sets containing 50 samples (equally divided between class 1 and 2) with expression levels for 1000 genes. For each experiment, 25 data sets were generated and the average numbers of errors were recorded. In the first set of experiments, we varied the number of informative genes. The results are as follows:

#Informative genes   Arc-fs (M = 10)   DLDA   GOLUB   k-NN (k = 10)   SVM
100                  4.5               0      0       0               0
50                   4                 0.04   0.08    0.2             0.04
20                   5.8               3.2    5.3     7.2             3.2

All methods perform worse as the number of informative genes decreases. For the k-NN method, the accuracy decreases more significantly than for the other methods. The Arc-fs method seems to be rather insensitive to the number of informative genes. The above results indicate that the methods are quite robust. Even with 90% irrelevant genes, they (except for the Arc-fs method) have zero error rates. The next table shows what happens when the 100 informative genes are made less informative by increasing the standard deviation of their expression levels from 1500 to 2000 and 2500.

Standard deviation   Arc-fs (M = 10)   DLDA   GOLUB   k-NN (k = 5)   SVM
1500                 4.5               0      0       0              0
2000                 7.48              0      0.04    0.2            0
2500                 11.48             0.32   0.4     2.28           0.36

Not surprisingly, the error ratios increase as the standard deviation increases, but the DLDA, GOLUB and SVM methods still have error ratios close to zero.


5.3.2 Prediction with gene selection

Several researchers claim that by removing irrelevant (or less informative) genes from the training data, the corresponding class prediction function becomes more accurate. In this section, we try to experimentally confirm this claim.

5.3.2.1 Leukemia data

We performed class prediction using the 1%, 5%, 10%, 15% and 20% most informative genes. The numbers of prediction errors are summarized in the following table:

% Selected genes   Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM
1                  2                  2      3       3              6
5                  2                  2      3       2              3
10                 1                  2      2       3              3
15                 2                  2      2       1              3
20                 3                  2      2       1              3
All genes          3                  1      1       1              2

The main conclusion that we can draw from these results is that gene selection does not necessarily lead to improved accuracy. In fact, for the leukemia data it is better to use all genes than the most "informative"! The only method that seems to benefit from gene selection is the Arc-fs method: with fewer genes, we can increase M and still get reasonable execution times.

5.3.2.2 Computer-generated data

The results in section 5.3.1.2 show that without gene selection and with 95% irrelevant genes, the error rates are very close to zero. The following experiments were performed in the same way as in section 5.3.1.2, except that 98% of the genes were irrelevant:

% Selected genes   Arc-fs (M = 100)   DLDA   GOLUB   k-NN (k = 5)   SVM
1                  2.32               1.12   1.24    1.64           1.84
5                  2                  0.84   1.16    0.72           1.08
10                 2.48               1.36   1.68    1.80           1.72
15                 2.48               1.76   2.32    2.84           2.32
20                 3.16               2.36   3.2     3.44           2.28
All genes          2.04               2.76   3.96    6.44           2.64

These results show that when a data set contains an extremely high proportion of "irrelevant" genes, gene selection may actually improve accuracy. All methods (except the Arc-fs method) achieved substantially better error rates when the 5% most informative genes were selected compared with when all genes are used. This is particularly noticeable for the k-NN method, for which the accuracy differs by almost an order of magnitude between optimal gene selection and no gene selection.


5.3.3 Permutation tests

As described in section 1, Radmacher et al recommend assessing cross-validation results by repeatedly permuting the class labels and recording the number of errors computed by the LOOCV procedure for each permutation. The histogram below shows the error distribution for a computer-generated data set consisting of 50 samples (equally divided between class 1 and 2) with 1000 genes, of which 20 were informative. The labels were permuted 1000 times.

[Figure: histogram of the number of leave-one-out cross-validation errors over the 1000 label permutations (x-axis: number of LOOCV errors, roughly 10 to 45).]

As we can see, the error rates for permuted labels are significantly higher than for non-permuted labels. The corresponding histogram for the leukemia data is almost identical to this one. The error rates that we have observed above are thus highly significant.

5.3.4 Conclusions

The results presented in sections 5.3.1 and 5.3.2 indicate that all of the evaluated class prediction methods perform quite well (although the Arc-fs method seems to be mainly of theoretical interest). When applied to the leukemia data, the methods are essentially equally accurate. Only by increasing the proportion of irrelevant genes to 95% or higher (or by making the informative genes less informative) in the computer-generated data could we discover any differences between the methods.

If we must declare any method the "winner", it must be the DLDA method. It is fast, easy to implement, and accurate. The k-NN method is also fast and easy to implement, but is more sensitive to noise than the DLDA method. However, it is unclear if this higher sensitivity makes any difference for real mRNA data. The SVM method is as accurate as DLDA, but not as easy to implement. The method proposed by Golub et al also performs well, but since there are well-known "general purpose" methods that perform equally well or better, it is not clear if their method has contributed anything to the state of the art of cancer classification.

The notion that using only the most informative genes in the training data results in more accurate class prediction functions makes a lot of sense. However, as we have seen, it can actually lead to decreased accuracy. Since the class prediction methods appear to be rather insensitive to noise, it may be better to allow a limited proportion of irrelevant genes than to risk removing too many informative genes.

6 References

A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. Proc. Fourth Annual Int. Conference on Computational Molecular Biology (RECOMB), 2000.

L. Breiman. Arcing classifiers. The Annals of Statistics, 26, 801-824, 1998.

M.P.S. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. National Academy of Sciences, 97:262-267, 2000.

J. Deutsch. Algorithm for finding optimal gene sets in microarray prediction. ?

S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. American Statistical Association, Vol. 97, No. 457, March 2002.

T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), p. 906-914, 2000.

T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, p. 531-537, 1999.

G. Getz, E. Levine, and E. Domany. Coupled two-way analysis of gene microarray data. Proc. National Academy of Sciences, 97:12079-12084, 2000.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Submitted to Machine Learning?

T. Hastie, R. Tibshirani, M. Eisen, P. Brown, D. Ross, U. Scherf, J. Weinstein, A. Alizadeh, and L. Staudt. Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University, 2000.

J. Khan, M. Ringner, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, and P. Meltzer. Classification and diagnostic prediction of cancers using expression profiling and artificial neural networks. Nature Medicine, 7(6), p. 673-679, 2001.

A. Keller, M. Schummer, L. Hood, and W. Ruzzo. Bayesian classification of DNA array expression data. Technical report UW-CSE-2000-08-01, Univ. Washington, 2000.

W. Li and Y. Yang. How many genes are needed for a discriminant microarray data analysis? LANL preprint archive, arXiv:physics/0104029 v1, 2001.

T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.

S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machines classification of microarray data. Technical report, MIT, 1999.

P.J. Park, M. Pagano, and M. Bonetti. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pacific Symposium on Biocomputing, 2001.

P. Pavlidis, J. Weston, J. Cai, and W. Grundy. Gene functional classification from heterogeneous data. Proc. Fifth International Conf. Computational Molecular Biology, 2001.

M.D. Radmacher, L.M. McShane, and R. Simon. A paradigm for class prediction using gene expression profiles. Technical Report 001, National Cancer Institute, July 2001.

D. Slonim, P. Tamayo, J. Mesirov, T. Golub, and E. Lander. Class prediction and discovery using gene expression data. Proc. Fourth Annual Int. Conference on Computational Molecular Biology (RECOMB), 2000.

O. Troyanskaya et al. Missing value estimation methods for DNA microarrays. Bioinformatics, p. 520-525, 2001.

V.N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 2000.