Recursive Cluster Elimination (RCE) for Classification and Feature Selection From Gene Expression Data



BioMed Central

Page 1 of 12 (page number not for citation purposes)

BMC Bioinformatics

Open Access Research article

Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

Malik Yousef, Segun Jung, Louise C Showe and Michael K Showe*

Address: The Wistar Institute, Systems Biology Division, Philadelphia, PA 19104, USA

Email: Malik Yousef - [email protected]; Segun Jung - [email protected]; Louise C Showe - [email protected]; Michael K Showe* - [email protected]

    * Corresponding author

    Abstract

Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes by recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE.

Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) is used to remove genes based on their individual discriminant weights.

Conclusion: SVM-RCE provides improved classification accuracy with complex microarray data sets when compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups.

Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.

Published: 2 May 2007

BMC Bioinformatics 2007, 8:144 doi:10.1186/1471-2105-8-144

Received: 11 September 2006
Accepted: 2 May 2007

This article is available from: http://www.biomedcentral.com/1471-2105/8/144

© 2007 Yousef et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Background

The Matlab version of SVM-RCE can be downloaded from [1] under the "Tools->SVM-RCE" tab.

Classification of samples from gene expression datasets usually involves small numbers of samples and tens of thousands of genes. The problem of selecting those genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high-dimensional data analysis. A variety of methods to address these types of problems have been implemented [2-8]. These methods can be divided into two main categories: those that rely on filtering methods and those that are model-based, the so-called wrapper approaches [2,4].

W. Pan [8] has reported a comparison of different filtering methods, highlighting similarities and differences between three main methods. The filtering methods, although faster than the wrapper approaches, are not particularly appropriate for establishing rankings among significant genes, as each gene is examined individually and correlations among the genes are not taken into account. Although wrapper methods appear to be more accurate, filtering methods are presently more frequently applied to data analysis than wrapper methods [4].

Recently, Li and Yang [9] compared the performance of Support Vector Machine (SVM) algorithms and Ridge Regression (RR) for classifying gene expression datasets and also examined the contribution of recursive procedures to the classification accuracy. Their study explicitly shows that the way in which the classifier penalizes redundant features in the recursive process has a strong influence on its success. They concluded that RR performed best in this comparison, and they further demonstrate the advantages of the wrapper method over filtering methods in these types of studies.

Guyon et al. [10] compared the usefulness of RFE (for SVM) against the "naïve" ranking on a subset of genes. The naïve ranking is just the first iteration of RFE, used to obtain ranks for each gene. They found that SVM-RFE is superior to SVM without RFE and also to other multivariate linear discriminant methods, such as Linear Discriminant Analysis (LDA) and Mean-Squared-Error (MSE) with recursive feature elimination.

In this study, we describe a new method for gene selection and classification, which is comparable to or better than some methods currently applied. Our method (SVM-RCE) combines the K-means algorithm for gene clustering and the machine learning algorithm, SVMs [11], for classification and gene cluster ranking. The SVM-RCE method differs from related classification methods in that it first groups genes into correlated gene clusters by K-means and then evaluates the contributions of each of those clusters to the classification task by SVM. One can think of this approach as a search for those significant clusters of genes which have the most pronounced effect on enhancing the performance of the classifier. While we have used K-means and SVM to approach this problem, other combinations of clustering and classification methods could be used in approaching similar data analysis problems. Yu and Liu [12] have discussed the related issues of the redundancy and the relevance of features.

Using SVM-RCE, the initial assessment of the performance of each individual gene cluster, as a separate feature, allows for the identification of those clusters that contribute the least to the classification. These are removed from the analysis, while those clusters which exhibit relatively better classification performance are retained. We allow re-clustering of genes after each elimination step to permit the formation of new, potentially more informative clusters. The most informative gene clusters are retained for additional rounds of assessment until the clusters of genes with the best classification accuracy are identified (see the Methods section). Our results show that the classification accuracy with SVM-RCE is superior to SVM-RFE and PDA-RFE, which eliminate genes without explicit regard to their correlation with other genes.

Several recent studies [7,13,14] have also combined the K-means clustering algorithm and SVM, but for very different purposes. In a previous study, K-means was used to cluster the samples, rather than the features (genes). The sample clusters, represented as centroids, were then used as input to the SVM. In this case the sample clustering speeds the SVM learning by introducing fewer samples for training. Li et al. [15] also used K-means in combination with SVM, but in this case K-means was used to cluster unlabelled sample data and SVM was used to develop the classifier among the clusters. However, none of the previous studies used K-means to cluster features, and none are concerned with feature reduction, the principal aim of our method. Tang et al. [16] proposed partitioning the genes into clusters using the Fuzzy C-Means clustering algorithm. However, that study ranks each gene, in each individual cluster, by SVM coefficient weights rather than ranking each cluster as a unit; the size of the clusters, rather than the number of clusters, is reduced. A similar approach has recently been described by Ma and Huang [17], who propose a new method that takes into account the cluster structure, as described by correlation metrics, to perform gene selection at the cluster level and at the within-cluster gene level.

The following sections describe the individual components of the SVM-RCE algorithm. We present data showing the classification performance of SVM-RCE on complex data sets. We compare SVM-RCE with the performance of SVM-RFE and PDA-RFE and demonstrate the superior performance of SVM-RCE as measured by improved classification accuracy [18-20].

Results

Data used for assessment of classification accuracy

We tested the SVM-RCE method, described below, with several datasets. The preprocessed datasets for Leukemia and Prostate cancer were downloaded from the website [21] and were used by the study [22]. The following is a brief description of these datasets.

    Leukemia

The leukemia dataset reported by Golub et al. [23] includes 72 patients to be classified into two disease types: Acute Lymphocytic Leukemia (ALL) and Acute Myeloid Leukemia (AML). 47 of the samples were from ALL patients (38 B-cell ALL and 9 T-cell ALL). An additional 25 cases were from patients with AML. Gene expression data was generated using Affymetrix oligonucleotide microarrays with probe sets for 6,817 human genes. Data for 3,571 genes remained after preprocessing following the protocol described by Dudoit et al. [24]. For simplicity we will refer to this data set as Leukemia (I). To properly compare the SVM-RCE performance with previous studies [9,25], we split the data into two sets, a training set of 38 samples (27 ALL and 11 AML) and a test set of 34 samples (20 ALL and 14 AML), as in the original publication, and used 7,129 genes. The data was preprocessed by subtracting the mean and dividing the result by the standard deviation [9,23,25]. For simplicity, we will refer to this data as Leukemia (II).
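The per-gene standardization described above (subtract the mean, divide by the standard deviation) can be sketched as follows; this is a minimal illustration with a hypothetical helper name, not the authors' preprocessing code:

```python
import numpy as np

def standardize_genes(expr):
    """Per-gene z-score across samples, as in the Leukemia (II)
    preprocessing: subtract each gene's mean expression and divide
    by its standard deviation. `expr` is a (genes x samples) matrix."""
    mu = expr.mean(axis=1, keepdims=True)
    sigma = expr.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0  # guard against genes with constant expression
    return (expr - mu) / sigma
```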

    Prostate

This data set consists of 52 prostate tumor samples and 50 non-tumor prostate samples. It was generated using the Affymetrix platform with 9,000 genes. Data for 6,033 genes remain after the preprocessing stage [22].

    CTCL Datasets (I) and (II)

Cutaneous T-cell lymphoma (CTCL) refers to a heterogeneous group of non-Hodgkin lymphomas of skin-homing T lymphocytes. CTCL(I) includes 18 patients and 12 controls [19], while CTCL(II) consists of 58 patients and 24 controls (Loboda et al., unpublished). For more information about the data and preprocessing refer to [18,19].

    Head & neck vs. lung tumors (I)

Gene expression profiling was performed on a panel of 18 head and neck (HN) and 10 lung cancer (LC) tumor samples using Affymetrix U133A arrays. For further information refer to [26].

    Head & neck vs. lung tumors (II)

Gene expression profiling was performed on a panel of 52 patients with either primary lung (21 samples) or primary head and neck (31 samples) carcinomas, using the Affymetrix HG_U95Av2 high-density oligonucleotide microarray. For further information refer to Talbot et al. [27].

The following two sections demonstrate the advantage of SVM-RCE over SVM-RFE and PDA-RFE for gene selection and classification accuracy.

    Performance of SVM-RCE versus SVM-RFE and PDA-RFE

The three algorithms, SVM-RCE, PDA-RFE and SVM-RFE, were used to iteratively reduce the number of genes from the starting value in each dataset, using intermediate classification accuracy as a metric.

We report the accuracy of SVM-RCE at the final 2 gene clusters and at two intermediate levels, usually 8 and 32 clusters, which correspond to 8 genes, 32 genes and 102 genes, respectively. For SVM-RFE and PDA-RFE we report accuracy for comparable numbers of genes.

The relative accuracies of SVM-RCE, SVM-RFE and PDA-RFE are shown in Table 1. With the Leukemia(I) dataset, we observed an increased accuracy using SVM-RCE of 3% and 2% with ~12 and ~32 genes, respectively, when compared to SVM-RFE. Comparable results with SVM-RFE required ~102 genes. The results obtained from the CTCL(I) analysis showed an improvement, using SVM-RCE, of about 11% and 6% with ~8 and ~32 genes, respectively, with a similar performance achieved with ~102 genes using SVM-RFE. The second CTCL data set (CTCL(II), Loboda et al., unpublished) showed an improvement using SVM-RCE of about 7%, 11% and 9% with ~8, ~34 and ~104 genes, respectively.

We also compared results for two additional datasets: Head and Neck Squamous Cell carcinoma (HNSCC) and Lung Squamous Cell carcinoma (LSCC) [26] (Head & Neck vs. Lung tumors (I)). SVM-RCE shows an increase in accuracy over SVM-RFE of 8%, 10% and 10% with ~8, ~32, and ~103 genes, respectively. A similar dataset comparing HNSCC and LSCC [27] (Head & Neck vs. Lung tumors (II)) was also subjected to both methods, and a ~2% increase was observed with SVM-RCE using ~8, ~32, and ~102 genes (100% SVM-RCE and 98% SVM-RFE). The Prostate cancer dataset yielded better accuracy using SVM-RFE with ~8 genes (an increase of about 6% over SVM-RCE), but similar performances were found at higher gene numbers. The same superiority of SVM-RCE is observed when SVM-RCE is compared with PDA-RFE.

These results are also shown in Table 1. Figures 1 and 2 (Additional Material File 1: Hierarchical clustering and multidimensional scaling (MDS) of the top genes detected by SVM-RCE and SVM-RFE) use hierarchical clustering and multidimensional scaling (MDS) [28] to help illustrate the improved classification accuracy of SVM-RCE for two of the data sets, Head & Neck (I) and CTCL(I). The genes selected by SVM-RCE clearly separate the two classes, while the genes selected by SVM-RFE place one or two samples on the wrong side of the separating margin.

    Comparison with Li and Yang study

Recently, Li and Yang [9] conducted a study comparing SVM and Ridge Regression (RR) to understand the success of RFE and to determine how the classification accuracy depends on the specific classification algorithm that is chosen. They found that RR applied to the Leukemia(II) dataset has zero errors with only 3 genes, while SVM [25] only attained the same zero errors with 8 genes. We compared these studies to our results using SVM-RCE (n = 100, m = 2, d = 0.1, n_g = 500), where 1 error was found with 3 genes (KRT16, SELENBP1 and SUMO1) and zero errors with 7 genes. The one misclassified sample is located at the margin, between the two classes.

    Tuning and parameters

We have also examined the effect of using more genes (more than 300) selected by t-test from the training set as input for SVM-RCE (see the section "Choice of parameters" for more details). While no dramatic changes are observed, there is some small degradation in the performance (1–2%) as progressively more genes are input. A similar observation has been reported when SVM-RFE is applied to proteomic datasets by Rajapakse et al. [29].
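The t-test pre-selection used to pick the initial `n_g` genes from the training set can be sketched as follows; the helper name is illustrative, not the authors' code:

```python
import numpy as np
from scipy.stats import ttest_ind

def select_top_genes(X, y, n_g=300):
    """Rank genes by a two-sample t-test between the two classes of the
    training set and keep the n_g genes with the smallest p-values.
    X is (samples x genes); y holds binary class labels (0 or 1)."""
    t_stat, p_val = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(p_val)[:n_g]
```

In this sketch the ranking uses p-values only on the training samples, so the held-out test set never influences which genes enter SVM-RCE.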

To demonstrate the convergence of the algorithm to the optimal solution and to give a more visual illustration of the SVM-RCE method, we have tracked the mean performance over all the clusters at each reduction level. Figure 1 illustrates the performance of SVM-RCE for the Head & Neck vs. Lung tumors (I) dataset over different levels of clusters. The analysis starts with 1000 genes, selected by t-test from the training set, distributed into 300 clusters (n = 300, m = 2, d = 0.3, n_g = 1000) and then recursively decreased to 2 clusters. The mean classification performance on the test set per cluster at each level of reduction (Figure 1, line AVG) is dramatically improved from ~55% to ~95% as the number of clusters decreases, supporting the suggestion that less-significant clusters are being removed while informative clusters are retained as RCE is employed.

Speed and stability

The execution time for our SVM-RCE MATLAB code is greater than for PDA-RFE or SVM-RFE, which use the C programming language. For example, applying SVM-RCE on a personal computer with a P4 Duo-core 3.0 GHz processor and 2 GB RAM to Head & Neck vs. Lung tumors (I) took approximately 9 hours for 100 iterations (10-folds repeated 10 times), while SVM-RFE (with the gist-svm package) took 4 minutes.

To estimate the stability of the results, we have re-run SVM-RCE while tracking the performance at each iteration, over each level of gene clusters. The mean accuracy and the standard deviation (stdv) are calculated at the end of the run. The Head & Neck vs. Lung tumors (I) data set with SVM-RCE has a stdv of 0.04–0.07. Surprisingly, SVM-RFE with the same data yields a stdv range of 0.20–0.23. A similar stdv range (0.17–0.21) was returned when SVM-RFE was re-employed with 1000 iterations. Therefore, SVM-RCE is more robust and more stable than SVM-RFE.

    K-means is sensitive to the choice of the seed clusters, butclustering results should converge to a local optimum onrepetition. For stability estimations, we have carried out

Table 1: Summary results for the SVM-RCE, SVM-RFE and PDA-RFE methods applied to 6 public datasets. The #c field is the number of clusters for the SVM-RCE method. The #g field is the number of genes in the associated #c clusters for SVM-RCE; for SVM-RFE and PDA-RFE it indicates the number of genes used.

          Leukemia(I)     CTCL(I)         CTCL(II)        H&N vs Lung(I)  H&N vs Lung(II)  Prostate
          #c  #g   ACC    #c  #g   ACC    #c  #g   ACC    #c  #g   ACC    #c  #g   ACC     #c  #g   ACC
SVM-RCE    2   12  99%     2    8  100%    2    8  91%     2    8  100%    2    9  100%     2    8  87%
           3   32  98%     9   32  100%    9   34  96%     8   32  100%    6   32  100%    11   36  95%
          28  100  97%    32  101  100%   28  104  96%    28  103  100%   25  103  100%    32  100  93%
SVM-RFE        11  96%          9   89%         8  84%          8   92%         8   98%         8  93%
               32  96%         32   94%        32  85%         32   90%        32   98%        36  95%
              102  97%        102  100%       102  87%        102   90%       102   98%       102  94%
PDA-RFE         8  96%          8   92%         8  83%          8   89%         8   70%         8  94%
               32  96%         32   92%        33  81%         31   96%        32   98%        32  94%
              104  96%        104   95%       108  79%        109   96%       102   98%       104  90%


SVM-RCE (on Head & Neck vs. Lung tumors (I)) with values of u of 1, 10, and 100 repetitions (see the sub-section K-means Cluster), and compared the most informative 20 genes returned from each experiment. About 80% of the genes are common to the three runs, which suggests that the SVM-RCE results are robust and stable. Moreover, similar accuracy was obtained from each experiment.

Is there an advantage, besides increased accuracy, to using SVM-RCE for gene selection?

Our results suggest that SVM-RCE can reveal important information that is not captured by methods that assess the contributions of each gene individually. Although we have limited our initial observations, for simplicity, to the top 2 clusters needed for separation of datasets with 2 known classes of samples, one can expand the analysis to capture, for example, the top 4 clusters of genes. We hypothesized that by increasing the number of clusters selected we might be able to identify sample sub-clusters which may be present in a specific dataset. The CTCL(I) dataset illustrates this possibility. Figure 2 shows the hierarchical clusters generated using the top 4 significant clusters (about 20 genes) revealed by SVM-RCE (Figure 2(b)) and, in Figure 2(a), comparable numbers of genes (20 genes) identified by SVM-RFE. The 4 clusters of genes in Figure 2(b) (two up-regulated in patients and two down-regulated) appear to identify sub-clusters of samples present in each class. For example, we see that four samples from the control class (C021.3.Th2, C020.6.Th2, CO04.1.UT and C019.6.Th2) form a sub-cluster identified by the genes TNFRSF5, GAB2, IL1R1 and ITGB4 (see the "Control sub-cluster" label in Figure 2(b)). Three

Figure 1. Classification performance of SVM-RCE on Head & Neck vs. Lung tumors (I). All of the values are an average of 100 iterations of SVM-RCE. ACC is the accuracy, TP is the sensitivity, and TN is the specificity of the remaining genes determined on the test set. Avg is the average accuracy of the individual clusters at each level of clusters determined on the test set. The average accuracy increases as low-information clusters are eliminated. The x-axis shows the average number of genes hosted by the clusters.


Kind [37] found in yeast that clustered genes do not have correlated functions, as might have been expected. One of the merits of SVM-RCE is its ability to group the genes using different metrics. In the present study, the statistical correlation metric was used. However, one could use biological metrics, such as gene interaction network information or functional annotation, for clustering genes (the Cluster step in the SVM-RCE algorithm), to be examined with SVM-RCE for their contribution to the classification task [38]. In this way, the outcome would be a set of significant genes that share biological networks or functions.

The results presented suggest that the selection of significant genes for classification using SVM-RCE is more reliable than with SVM-RFE or PDA-RFE. SVM-RFE uses the weight coefficient, which appears in the SVM formula, to indicate the contribution of each gene to the classifier. However, the exact relation between the weights and performance is not well understood. One could argue that some genes with low absolute weights are important and that their low ranking is a result of other dominant correlated genes. The success of SVM-RCE suggests that estimating the contribution of genes which share a similar profile (correlated genes) is important, and it gives each group of genes the potential to be ranked as a group. Moreover, the genes selected by SVM-RCE are guaranteed to be useful to the overall classification, since the criterion for retaining or removing genes (clusters of genes) is based on their contribution to the performance of the classifier as expressed by the Score() measurement. Similarly, Tang et al. [16] have shown that partitioning the genes into clusters, followed by estimating the ranks of each gene by SVM, generates improved results compared to the traditional SVM-RFE. Ma and Huang [17] have also shown improved results when feature selection takes account of the structure of the gene clusters. These results suggest that clustering the genes and performing an estimation of individual gene clusters is the key to enhancing the performance and improving the grouping of significant genes. The unsupervised clustering used by SVM-RCE has the additional possibility of identifying biologically or clinically important sub-clusters of samples.

Methods

The following sub-sections describe our method and its main components. SVM-RCE combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for accuracy of classification. K-means is used initially to group genes into clusters. After scoring by SVM, the lowest-scoring clusters are removed. The remaining clusters are merged, and the process is repeated.

The SVM-RCE method: scoring gene clusters

We assume a given dataset D with S genes. The data is partitioned into two parts, one for training (90% of the samples) and the other (10% of the samples) for testing.

Let X denote a two-class training dataset consisting of samples and S genes. We define a score measurement for any list S of genes as its ability to differentiate the two classes of samples by applying a linear SVM. To calculate this score we carry out a random partition of the training set X of samples into f non-overlapping subsets of equal size (f-folds). A linear SVM is trained over f-1 subsets and the remaining subset is used to calculate the performance. This procedure is repeated r times to take into account different possible partitionings. We define Score(X(S), f, r) as the average accuracy of the linear SVM over the data X represented by the S genes, computed as f-fold cross-validation repeated r times. We set f to 3 and r to 5 as default values. Moreover, if the S genes are clustered into sub-clusters of genes S1, S2, ..., Sn, then we define Score(X(Si), f, r) for each sub-cluster, where X(Si) is the data X represented by the genes of Si.
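The Score(X(S), f, r) measurement defined above can be sketched as repeated stratified f-fold cross-validation of a linear SVM; this is an illustrative re-implementation (the function name `score_genes` is ours, not the authors' MATLAB code), with the paper's defaults f = 3, r = 5:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def score_genes(X, y, gene_idx, f=3, r=5, seed=0):
    """Score(X(S), f, r): mean accuracy of a linear SVM on the samples of X
    restricted to the genes in `gene_idx`, estimated by f-fold
    cross-validation repeated r times with different random partitions.
    X is (samples x genes); y holds the two class labels."""
    cv = RepeatedStratifiedKFold(n_splits=f, n_repeats=r, random_state=seed)
    accs = cross_val_score(LinearSVC(C=1.0), X[:, gene_idx], y,
                           cv=cv, scoring="accuracy")
    return accs.mean()
```

A higher score means the gene list (or gene cluster) separates the two classes better, which is exactly the quantity RCE uses to decide which clusters survive.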

The central algorithm of the SVM-RCE method is described as a flowchart in Figure 3. It consists of three main steps applied to the training part of the data: the Cluster step for clustering the genes, the SVM scoring step for computing the Score(X(Si), f, r) of each cluster of genes, and the RCE step to remove clusters with low scores, as follows:

Algorithm SVM-RCE (input data D)

X = the training dataset

S = genes list (all the genes) or top n_g genes by t-test

n = initial number of clusters

m = final number of clusters

d = the reduction parameter

While (n > m) do

1. Cluster the given genes S into n clusters S1, S2, ..., Sn using K-means (Cluster step)

2. For each cluster i = 1..n calculate its Score(X(Si), f, r) (SVM scoring step)

3. Remove the d% of clusters with the lowest scores (RCE step)

4. Merge the surviving genes again into one pool S

5. Decrease n by d%.
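The loop above can be sketched in a few lines; this is a minimal illustration, not the authors' MATLAB implementation, and the r repeats of the scoring cross-validation are collapsed to a single f-fold pass for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def svm_rce(X, y, n=10, m=2, d=0.3, f=3):
    """Sketch of the SVM-RCE loop. X is (samples x genes), y the class
    labels; n/m/d are the initial/final cluster counts and the reduction
    parameter. Returns the indices of the surviving genes."""
    genes = np.arange(X.shape[1])
    while n > m:
        n = min(n, len(genes))  # K-means needs at least n points (genes)
        # Cluster step: group surviving genes by their expression profiles.
        km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X[:, genes].T)
        # SVM scoring step: Score(X(Si), f, r) for each gene cluster.
        scores = [cross_val_score(LinearSVC(C=1.0),
                                  X[:, genes[km.labels_ == i]], y,
                                  cv=f).mean()
                  for i in range(n)]
        # RCE step: drop the d% of clusters with the lowest scores,
        # then merge the surviving genes into one pool S.
        drop = min(n - 1, max(1, int(d * n)))
        keep = np.argsort(scores)[drop:]
        genes = np.concatenate([genes[km.labels_ == i] for i in keep])
        # Decrease n by d% (by at least one cluster, so the loop terminates).
        n = max(m, min(n - 1, int(round(n * (1 - d)))))
    return genes
```

Note that clustering operates on the transposed matrix (genes as points, samples as dimensions), so K-means groups genes with similar expression profiles, while the SVM is trained on samples restricted to one cluster's genes at a time.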


Figure 3. The description of the SVM-RCE algorithm. A flowchart of the SVM-RCE algorithm consisting of three main steps: the Cluster step for clustering the genes, the SVM scoring step for assessment of significant clusters, and the RCE step to remove clusters with low scores.



The basic approach of SVM-RCE is to first cluster the gene expression profiles into n clusters using K-means. A score, Score(X(Si), f, r), is assigned to each of the clusters by linear SVM, indicating its success at separating samples in the classification task. The d% of clusters (or d clusters) with the lowest scores are then removed from the analysis. Steps 1 to 5 are repeated until the number n of clusters is decreased to m.

Let Z denote the testing dataset. At step 4 an SVM classifier is built from the training dataset using the surviving genes S. This classifier is then tested on Z to estimate the performance (see the "Test" panel on the right side of Figure 3).

For the current version, the choices of n and m are determined by the investigator. In this implementation, the default value of m is 2, indicating that the method is required to capture the top 2 significant clusters (groups) of genes. However, accuracy is determined after each round of cluster elimination, and a higher number of clusters could be more accurate than the final two.

Various methods have been used in classification studies to find the optimal subset of genes that gives the highest accuracy [39] in distinguishing members of different sample classes. With SVM-RCE, one can think of this process as a search in the gene-cluster space for the m clusters of correlated genes that give the highest classification accuracy. In the simplest case, the search is reduced to the identification of one or two clusters that define the class differences. These might include the important up-regulated and the important down-regulated genes. While SVM-RCE and SVM-RFE are related, in that they both assess the relative contributions of the genes to the classifier, SVM-RCE assesses the contributions of groups of correlated genes instead of individual genes (SVM scoring step in Figure 3). Additionally, although both methods remove the least important genes at each step, SVM-RCE scores and removes clusters of genes, while SVM-RFE scores and removes a single gene or a small number of genes at each round of the algorithm.

    Implementation

The gist-svm [40] package was used for the implementation of SVM-RFE, with a linear kernel function (dot product) and default parameters. In gist-svm the SVM employs a two-norm soft margin with C = 1 as the penalty parameter. See [41] for more details.

SVM-RCE is coded in MATLAB, and the Bioinformatics Toolbox 2.1 release is used for the implementation of the linear SVM with a two-norm soft margin and C = 1 as the penalty parameter. The core of PDA-RFE is implemented in the C programming language in our lab (Showe Laboratory, The Wistar Institute) with a JAVA user interface (manuscript in preparation). We have not used any parameter-tuning procedure for optimization.

    Choice of parameters

In order to ensure a fair comparison and to decrease the computation time, we started with the top 300 (n_g = 300) genes selected by t-test from the training set for all methods. As was observed by Rajapakse et al. (2005) [29], using t-statistics to reduce the number of genes subjected to SVM-RFE at the onset is not only efficient, but also enhances the performance of the classifier. However, a larger initial starting set might result in biologically more informative clusters.
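This kind of t-test prefilter can be sketched as follows. This is an illustrative numpy version, not the authors' code; the toy expression matrix and labels are invented, and only gene 1 is constructed to differ between the two classes.

```python
import numpy as np

def t_scores(X, y):
    """Two-sample t-statistic per gene. Rows of X are genes, columns are
    samples, and y holds the binary class labels (0 or 1)."""
    a, b = X[:, y == 0], X[:, y == 1]
    na, nb = a.shape[1], b.shape[1]
    se = np.sqrt(a.var(axis=1, ddof=1) / na + b.var(axis=1, ddof=1) / nb)
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def top_genes(X, y, n_g=300):
    """Indices of the n_g genes with the largest |t|."""
    t = np.abs(t_scores(X, y))
    return np.argsort(t)[::-1][:n_g]

# Toy data: 3 genes x 8 samples; only gene 1 separates the classes.
X = np.array([[1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1],
              [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05],
              [2.0, 2.1, 1.9, 2.0, 2.02, 1.98, 2.05, 1.95]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
selected = top_genes(X, y, n_g=1)
```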

For all of the results presented, 10% (d = 0.1) is used for the gene cluster reduction with SVM-RCE, and 10% of the genes with SVM-RFE and PDA-RFE. For SVM-RCE, we started with 100 (n = 100) clusters and stopped when 2 (m = 2) clusters remained. 3-fold cross-validation (f = 3) repeated 5 (r = 5) times was used in the SVM-RCE method to evaluate the score of each cluster (SVM scoring step in Figure 3). Obviously, one can use more stringent evaluation parameters by increasing the number of repeated cross-validations, at the price of increased computational time. In some difficult classification cases, it is worth doing this in order to enhance the prediction accuracy.
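Starting from n = 100 clusters and removing roughly 10% per round, the number of surviving clusters shrinks geometrically. A small sketch of the resulting schedule (illustrative only; it assumes the n*d clusters removed per round are rounded down, which may differ from the authors' MATLAB code):

```python
def cluster_schedule(n=100, m=2, d=0.1):
    """Clusters surviving after each RCE round when a fraction d of the
    current clusters (at least one) is eliminated per iteration."""
    counts = [n]
    while n > m:
        n = max(m, n - max(1, int(n * d)))
        counts.append(n)
    return counts

schedule = cluster_schedule()  # starts 100, 90, 81, 73, ... and ends at 2
```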

    Evaluation

For evaluating the overall performance of SVM-RCE, SVM-RFE and PDA-RFE, 10-fold cross-validation (9 folds for training and 1 fold for testing), repeated 10 times, was employed. After each round of feature or cluster reduction, the accuracy was calculated on the hold-out test set. For each sample in the test set, a score assigned by the SVM indicates its distance from the discriminant hyperplane generated from the training samples, where a positive value indicates membership in the positive class and a negative value indicates membership in the negative class.

The class label for each test sample is determined by averaging all 10 of its SVM scores, and it is on this value that the sample's classification is based. This method of calculating the accuracy gives a more accurate measure of the performance, since it captures not only whether a specific sample is positively (+1) or negatively (-1) classified, but how well it is classified into each category, as determined by the score assigned to each individual sample. The score serves as a measure of classification confidence, and the range of scores provides a confidence interval.
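The score-averaging rule is simple to state in code. The decision values below are invented numbers for one hypothetical test sample across the 10 repeats, used only to show how the averaged score determines the label and the confidence:

```python
# Decision values assigned to one test sample across the 10 repeats of
# cross-validation (illustrative numbers, not from the paper).
scores = [0.8, 1.2, -0.1, 0.5, 0.9, 1.1, 0.3, -0.2, 0.7, 0.6]

mean_score = sum(scores) / len(scores)   # averaged SVM score for the sample
label = 1 if mean_score > 0 else -1      # sign of the mean gives the class
confidence = abs(mean_score)             # magnitude measures confidence
```

A sample whose individual scores straddle zero can still be classified, but its small averaged score flags it as a low-confidence call.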

K-means clustering

Clustering methods are unsupervised techniques in which the class labels of the samples are not used. K-means [42] is a widely used clustering algorithm. It is an iterative method that groups genes with correlated expression profiles into k mutually exclusive clusters. k is a parameter


that needs to be determined at the onset. The starting point of the K-means algorithm is to initiate k randomly generated seed clusters. Each gene profile is associated with the cluster whose 'centroid' is at minimum distance from it (different metrics could be used to define distance). The centroid of each cluster is then recomputed as the average of the profiles of all the cluster's gene members. The procedure is repeated until no changes in the centroids of the various clusters are detected. Finally, this algorithm aims at minimizing an objective function with k clusters:

F(data; k) = \sum_{j=1}^{k} \sum_{i=1}^{t_j} \| g_i^{(j)} - c_j \|^2, where t_j is the number of genes in cluster j,

and where \| \cdot \|^2 is the distance measurement between the profile of gene g_i and the cluster centroid c_j. The "correlation" distance measurement was used as the metric for the SVM-RCE approach. The correlation distance between genes g_r and g_s is defined as:

d_{rs} = 1 - \frac{(g_r - \bar{g}_r)(g_s - \bar{g}_s)'}{\sqrt{(g_r - \bar{g}_r)(g_r - \bar{g}_r)' \,(g_s - \bar{g}_s)(g_s - \bar{g}_s)'}}, where \bar{g}_r = \frac{1}{t}\sum_{j} g_{rj}, \bar{g}_s = \frac{1}{t}\sum_{j} g_{sj}, and t is the length of the profiles.

K-means is sensitive to the choice of the seed clusters (initial centroids), and different methods for choosing the seed clusters can be considered. At the K-means step (Cluster step in Figure 3) of SVM-RCE, k genes are randomly selected to form the seed clusters, and this process is repeated several times (u times) in order to reach the optimal clustering, with the lowest value of the objective function F(data; k).
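A minimal K-means over gene profiles with the correlation distance can be sketched in numpy. This is an illustration of the technique, not the MATLAB implementation the authors used; for simplicity the first k profiles seed the centroids, whereas the paper repeats random seeding u times and keeps the best run. The toy data contain two correlated pairs of profiles that are anti-correlated across pairs.

```python
import numpy as np

def correlation_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    a, b = a - a.mean(), b - b.mean()
    return 1.0 - (a @ b) / np.sqrt((a @ a) * (b @ b))

def kmeans_corr(genes, k, n_iter=50):
    """Minimal K-means over gene profiles (rows) with correlation distance,
    seeded deterministically with the first k profiles."""
    centroids = genes[:k].astype(float).copy()
    labels = np.zeros(len(genes), dtype=int)
    for _ in range(n_iter):
        # assign each gene to the centroid at minimum correlation distance
        labels = np.array([
            min(range(k), key=lambda j: correlation_distance(g, centroids[j]))
            for g in genes
        ])
        # recompute each centroid as the mean profile of its members
        new = np.array([genes[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

# Rows 0 and 2 rise together; rows 1 and 3 fall together.
genes = np.array([[1.0, 2.0, 3.0],
                  [3.0, 2.0, 1.0],
                  [2.0, 4.0, 6.0],
                  [6.0, 4.0, 2.0]])
labels = kmeans_corr(genes, 2)
```

Because the metric is correlation-based, profiles with the same shape cluster together regardless of their absolute expression levels.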

Clustering methods are widely used techniques for microarray data analysis. Gasch and Eisen [43] used a heuristically modified version of fuzzy K-means clustering to identify overlapping clusters, and a comparison with the standard K-means method was reported. Monti et al. [44] report a new methodology of class discovery based on clustering methods, and present an approach for validating clusterings and assessing the stability of the discovered clusters.

    Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are learning machines developed by Vapnik [11]. The performance of this algorithm, as compared to other algorithms, has proven particularly useful for the analysis of various classification problems, and it has recently been widely used in the bioinformatics field [45-47]. Linear SVMs are SVMs with a linear kernel. When the training data are not linearly separable, a soft-margin SVM can be applied. A linear SVM separates the two classes in the training data by producing the optimal separating hyperplane with a maximal margin between the class 1 and class 2 samples. Given a training set of labeled examples (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y_i ∈ {+1, -1}, the support vector machine finds a separating hyperplane of the form w·x + b = 0, with w ∈ R^n and b ∈ R. Here w is the normal of the hyperplane, and the constant b defines its position in the space. The following formula can be used as a predictor for a new instance: f(x) = sign(w·x + b). For more information see Vapnik [11].
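The decision rule can be written out directly. The weights below are made-up values chosen only to show the sign computation, not the output of any training run:

```python
# Hyperplane parameters; in practice w and b come from SVM training.
w = [0.5, -1.0, 2.0]
b = -0.25

def predict(x):
    """f(x) = sign(w.x + b): +1 on the positive side of the hyperplane."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if margin > 0 else -1
```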

The application of SVMs to gene expression datasets can be divided into two basic problems: gene function discovery and sample classification. As an example of the first category, Brown, Grundy et al. [48] successfully used SVMs for the "identification of biologically functionally related genes", where essentially two groups of genes are identified: one group consists of genes that have a common function, and the other consists of genes that are not members of that functional class. Comparisons among several SVMs using different similarity metrics were also conducted, and the performance of SVMs was reported to be superior to that of other supervised learning methods for functional classification. Similarly, Eisen, Spellman et al. [49] used a clustering method with Pearson correlation as a metric in order to capture genes with similar expression profiles.

As an example of the second category, Furey et al. [50] used SVM for the classification of different samples into classes and as a statistical test for gene selection (a filter approach).

    SVM Recursive Feature Elimination (SVM-RFE)

SVM-RFE [25] is an SVM-based model that removes genes recursively, based on their contribution to the discrimination between the two classes being analyzed. The lowest-scoring genes by coefficient weights are removed, the remaining genes are scored again, and the procedure is repeated until only a few genes remain. This method has been used in several studies to perform classification and gene selection tasks [9,51].
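A generic sketch of this backward-elimination loop, for contrast with the cluster-level elimination of SVM-RCE (weight_fn is a hypothetical stand-in for training the linear SVM on the surviving features and returning its weight vector; this is an illustration, not Guyon et al.'s implementation):

```python
def rfe(features, weight_fn, n_keep=2, frac=0.1):
    """Backward elimination in the style of SVM-RFE: retrain, rank the
    surviving features by |w|, and drop the lowest-ranked fraction."""
    feats = list(features)
    while len(feats) > n_keep:
        weights = weight_fn(feats)   # retrain on the survivors
        ranked = sorted(range(len(feats)), key=lambda i: abs(weights[i]))
        n_drop = min(max(1, int(len(feats) * frac)), len(feats) - n_keep)
        dropped = set(ranked[:n_drop])
        feats = [f for i, f in enumerate(feats) if i not in dropped]
    return feats

# Toy run: each feature's "weight" is its own value, so the smallest
# features are eliminated first and the two largest survive.
kept = rfe(range(10), weight_fn=lambda feats: feats, n_keep=2)
```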

Furlanello et al. [51] developed an entropy-based recursive feature elimination (E-RFE) in order to accelerate (100×) the RFE step with the SVM. However, they do not demonstrate any improvement in classification performance compared to the regular SVM-RFE approach. Several other papers, such as Kai-Bo et al. [6], propose a new technique that relies on a backward elimination procedure similar to SVM-RFE. They suggest that their method selects better subsets of genes and that its performance is enhanced compared to SVM-RFE. Huang et al. [52] explore the influence of the penalty parameter C on the performance of SVM-RFE, finding that one dataset could be better classified when C was optimized.



In general, choosing appropriate values of the algorithm parameters (penalty parameter, kernel function, etc.) can often influence performance. Recently, Zhang et al. [5] proposed R-SVM, a recursive support vector machine algorithm for selecting important features in SELDI data. R-SVM was compared to SVM-RFE and is suggested to be more robust to noise, although no improvement in classification performance was found.

Competing interests
The author(s) declare that they have no competing interests.

Availability and requirements
The Matlab version of SVM-RCE can be downloaded from [1] under the "Tools->SVM-RCE" tab.

    Authors' contributions

Malik Yousef, Louise C. Showe and Michael K. Showe contributed equally to the manuscript, while Segun Jung revised the Matlab code, which was written by Malik Yousef, to make it available over the web, obtained the PDA-RFE results, and measured the statistical significance of the method. All authors approved the manuscript.

    Additional material

Acknowledgements
This project is funded in part by the Pennsylvania Department of Health (PA DOH Commonwealth Universal Research Enhancement Program) and Tobacco Settlement grants ME01-740 and SAP 4100020718 (L.C.S.). We would like to thank Shere Billouin for preparing the manuscript and Michael Nebozhyn for his valuable comments.

References
1. Showe Laboratory [http://showelab.wistar.upenn.edu]
2. Wang Y, Tetko IV, Hall MA, Frank E, Facius A, Mayer KFX, Mewes HW: Gene selection from microarray data for cancer classification - a machine learning approach. Computational Biology and Chemistry 2005, 29(1):37.
3. Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20(15):2429-2437.

4. Inza I, Larranaga P, Blanco R, Cerrolaza AJ: Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine 2004, 31(2):91.
5. Zhang X, Lu X, Shi Q, Xu X-q, Leung H-c, Harris L, Iglehart J, Miron A, Liu J, Wong W: Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 2006, 7(1):197.
6. Kai-Bo D, Rajapakse JC, Haiying W, Azuaje F: Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Transactions on NanoBioscience 2005, 4(3):228.
7. Yang X, Lin D, Hao Z, Liang Y, Liu G, Han X: A fast SVM training algorithm based on the set segmentation and k-means clustering. Progress in Natural Science 2003, 13(10):750-755.
8. Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002, 18(4):546-554.
9. Li F, Yang Y: Analysis of recursive gene selection approaches from microarray data. Bioinformatics 2005, 21(19):3741-3747.
10. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 2002, 46(1-3):389-422.
11. Vapnik V: The Nature of Statistical Learning Theory. Springer; 1995.
12. Yu L, Liu H: Efficient Feature Selection via Analysis of Relevance and Redundancy. J Mach Learn Res 2004, 5:1205-1224.
13. Almeida MB de, Braga AP de, Braga JP: SVM-KM: speeding SVMs learning with a priori cluster selection and k-means. In Proceedings of the VI Brazilian Symposium on Neural Networks (SBRN'00). IEEE Computer Society; 2000:162.
14. Wang J, Wu X, Zhang C: Support vector machines based on K-means clustering for real-time business intelligence systems. International Journal of Business Intelligence and Data Mining 2005, 1(1):54-64.
15. Li M, Cheng Y, Zhao H: Unlabeled data classification via support vector machines and k-means clustering. In Proceedings of the International Conference on Computer Graphics, Imaging and Visualization. IEEE Computer Society; 2004:183-186.
16. Tang Y, Zhang Y-Q, Huang Z: FCM-SVM-RFE Gene Feature Selection Algorithm for Leukemia Classification from Microarray Gene Expression Data. IEEE International Conference on Fuzzy Systems: May 22-25, 2005; Reno 2005:97-101.
17. Ma S, Huang J: Clustering threshold gradient descent regularization: with applications to microarray studies. Bioinformatics 2007, 23(4):466-472.
18. Nebozhyn M, Loboda A, Kari L, Rook AH, Vonderheid EC, Lessin S, Berger C, Edelson R, Nichols C, Yousef M, et al.: Quantitative PCR on 5 genes reliably identifies CTCL patients with 5% to 99% circulating tumor cells with 90% accuracy. Blood 2006, 107(8):3189-3196.
19. Kari L, Loboda A, Nebozhyn M, Rook AH, Vonderheid EC, Nichols C, Virok D, Chang C, Horng W-H, Johnston J, et al.: Classification and Prediction of Survival in Patients with the Leukemic Phase of Cutaneous T Cell Lymphoma. J Exp Med 2003, 197(11):1477-1488.
20. Hastie T, Buja A, Tibshirani R: Penalized discriminant analysis. Annals of Statistics 1995, 23:73-102.
21. BagBoosting for Tumor Classification with Gene Expression Data [http://stat.ethz.ch/~dettling/bagboost.html]
22. Dettling M, Buhlmann P: Supervised clustering of genes. Genome Biology 2002, 3(12):research0069.
23. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286(5439):531-537.
24. Dudoit S, Fridlyand J, Speed T: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 2002, 97:77-87.
25. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector

Additional File 1
Hierarchical clustering and multidimensional scaling (MDS) of the top genes detected by SVM-RCE and SVM-RFE. Helps illustrate the improved classification accuracy of SVM-RCE for two of the datasets, Head&Neck(I) and CTCL(I).
[http://www.biomedcentral.com/content/supplementary/1471-2105-8-144-S1.pdf]

Additional File 2
Comparison of the CTCL(I) genes selected by SVM-RCE and SVM-RFE and concomitant clustering of genes and samples. Shows additional structure of the data obtained in the classifications with gene clusters obtained using SVM-RCE compared with SVM-RFE.
[http://www.biomedcentral.com/content/supplementary/1471-2105-8-144-S2.pdf]


Machines. Machine Learning 2002, 46(1-3):389-422.
26. Vachani A, Nebozhyn M, Singhal S, Alila L, Wakeam E, Muschel R, Powell CA, Gaffney P, Singh B, Brose MS, et al.: Identification of 10 Gene Classifier for Head and Neck Squamous Cell Carcinoma and Lung Squamous Cell Carcinoma: Towards a Distinction between Primary and Metastatic Squamous Cell Carcinoma of the Lung. Clinical Cancer Research 2007, accepted.
27. Talbot SG, Estilo C, Maghami E, Sarkaria IS, Pham DK, O-charoenrat P, Socci ND, Ngai I, Carlson D, Ghossein R, et al.: Gene Expression Profiling Allows Distinction between Primary and Metastatic Squamous Cell Carcinomas in the Lung. Cancer Res 2005, 65(8):3063-3071.
28. Seber GAF: Multivariate Observations. John Wiley & Sons Inc; 1984.
29. Rajapakse JC, Duan K-B, Yeo K: Proteomic cancer classification with mass spectra data. American Journal of Pharmacology 2005, 5(5):228-234.
30. Fraley C, Raftery AE: How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. The Computer Journal 1998, 41(8):578-588.
31. Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 2002, 3(7):research0036.
32. Bolshakova N, Azuaje F, Cunningham P: A knowledge-driven approach to cluster validity assessment. Bioinformatics 2005, 21(10):2546-2547.
33. Gat-Viks I, Sharan R, Shamir R: Scoring clustering solutions by their biological relevance. Bioinformatics 2003, 19(18):2381-2389.
34. Toronen P: Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics 2004, 5(1):32.
35. Gibbons FD, Roth FP: Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation. Genome Res 2002, 12(10):1574-1581.
36. Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 2006, 7(1):397.
37. Clare A, King RD: How well do we understand the clusters found in microarray data? In Silico Biol 2002, 2:511-522.
38. Pang H, Lin A, Holford M, Enerson BE, Lu B, Lawton MP, Floyd E, Zhao H: Pathway analysis using random forests classification and regression. Bioinformatics 2006, 22(16):2028-2036.
39. Kohavi R, John GH: Wrappers for feature subset selection. Artificial Intelligence 1997, 97(1-2):273.
40. Pavlidis P, Wapinski I, Noble WS: Support vector machine classification on the web. Bioinformatics 2004, 20(4):586-587.
41. gist-train-svm [http://www.bioinformatics.ubc.ca/gist/compute-weights.html]
42. MacQueen J: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1967:281-297.
43. Gasch A, Eisen M: Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 2002, 3(11):research0059.
44. Monti S, Tamayo P, Mesirov J, Golub T: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 2003, 52(1-2):91.
45. Haussler D: Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10. Santa Cruz: Baskin School of Engineering, University of California; 1999.
46. Pavlidis P, Weston J, Cai J, Grundy WN: Gene functional classification from heterogeneous data. In Proceedings of the Fifth Annual International Conference on Computational Biology: 2001; Montreal, Quebec, Canada. ACM Press; 2001:249-255.
47. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader G, Michalickova K, et al.: PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4(1):11.
48. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS 2000, 97(1):262-267.
49. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. PNAS 1998, 95(25):14863-14868.
50. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906-914.
51. Furlanello C, Serafini M, Merler S, Jurman G: Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 2003, 4(1):54.
52. Huang TM, Kecman V: Gene extraction for cancer diagnosis by support vector machines - An improvement. Artificial Intelligence in Medicine 2005, 35(1-2):185.

act&list_uids=12611631http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16945146http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16945146http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16945146http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12368250http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12368250http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15043761http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15043761http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=14668221http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=14668221http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15713738http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15713738http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15833835http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15833835http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15833835