



Pattern Recognition Letters 27 (2006) 1067–1076

Random subspace method for multivariate feature selection

Carmen Lai a,*, Marcel J.T. Reinders a, Lodewyk Wessels a,b


a Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
b The Netherlands Cancer Institute, Amsterdam, The Netherlands

Received 10 March 2005; received in revised form 1 November 2005; available online 21 February 2006

Communicated by Prof. F. Roli

Abstract

In a growing number of domains the captured data encapsulate as many features as possible. This poses a challenge to classical pattern recognition techniques, since the number of samples often remains limited with respect to the number of features. Classical pattern recognition methods suffer from the small sample size, and robust classification techniques are needed.

In order to reduce the dimensionality of the feature space, the selection of informative features becomes an essential step towards the classification task. The relevance of the features can be evaluated either individually (univariate approaches), or in a multivariate manner. Univariate approaches are simple and fast, therefore appealing. However, possible correlation and dependencies between the features are not considered. Therefore, multivariate search techniques may be helpful. Several limitations restrict the use of multivariate searches. First, they are prone to overtraining, especially in p ≫ n (many features and few samples) settings. Secondly, they can be computationally too expensive when dealing with a large feature space.

We introduce a new multivariate search technique that is less sensitive to the noise in the data and computationally feasible as well. We compare our approach with several multivariate and univariate feature selection techniques, on an artificial dataset which provides us with ground truth information, and on a real dataset. The results show the importance of multivariate search techniques and the robustness and reliability of our novel multivariate feature selection method.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Feature selection; Random subspace method; Small sample size problem

1. Introduction

In order to solve a classification task, the more features the better, since more information is present. However, addition of features beyond a certain point leads to a higher probability of error, as indicated in (Duda et al., 2001). This behavior is known in pattern recognition as the curse of dimensionality (Duda et al., 2001; Jain and Zongker, 1997; Trunk, 1979; Raudys and Jain, 1991), and it is caused by the finite number of samples.


Nowadays there is a growing number of domains that produce data with a large number of features, while the number of samples is limited. One example is the acquisition of spectral data, which gives, for a single sample, information across a large range of wavelengths. Other examples are microarray datasets, which measure the activity of thousands of genes while the number of samples is limited to several hundred, due to the high cost associated with the procedure and the sample availability. Assumptions often made in the literature are that many features are uninformative or noisy (Bo and Jonassen, 2002; Ambroise and McLachlan, 2002; Xiong et al., 2001) and that features are likely to be correlated (Bo and Jonassen, 2002; Chow et al., 2001; Dudoit et al., 2002).


Therefore, a feature selection strategy is needed to reduce the dimensionality of the feature space and to identify the relevant features to be used for a successful classification task. Feature selection algorithms can be divided into two categories: filters and wrappers (Kohavi and John, 1997). Filter approaches evaluate the relevance of features based on a criterion indicative of the capacity of a feature to separate the classes, while wrapper approaches employ the classification algorithm that will be used to build the final classifier to judge feature quality. Both approaches involve combinatorial searches through the space of possible feature subsets. Several greedy procedures can be applied, such as forward or backward elimination, or less greedy but more computationally demanding approaches such as floating searches and genetic algorithms (Duda et al., 2001; Kohavi and John, 1997; Pudil et al., 1994; Li et al., 2001).

The relevance of a feature can be evaluated either individually (univariate approaches), or in a multivariate manner. Univariate approaches are simple and fast, therefore appealing and popular (Golub et al., 1999; Ben-Dor et al., 2000; Tibshirani et al., 2002; van 't Veer et al., 2002; Khan et al., 2001; Xing et al., 2001). However, they assume that the features are independent. Multivariate approaches, on the contrary, evaluate the relevance of the features considering how they function as a group, taking into account their dependencies (Xiong et al., 2001; Bo and Jonassen, 2002; Guyon et al., 2002; Bhattacharyya et al., 2003). Several limitations, however, restrict the use of multivariate approaches. Firstly, they are prone to overtraining, especially in p ≫ n (many features and few samples) settings. Secondly, they may be computationally expensive, which prevents them from being applied to a large feature space.

The large number of features compared to the number of samples causes over-training when proper measures are not taken. In order to overcome this problem, we introduce a new multivariate approach for feature selection based on the random subspace method (RSM) proposed by Ho (1995, 1998) and studied further by Skurichina and Duin (2002). Ho introduced the RSM to avoid overfitting on the training set while preserving the maximum accuracy when training decision tree classifiers. Skurichina and Duin used the RSM in order to obtain weak classifiers to be combined in a second step of the classification process. We propose to use the RSM in order to better evaluate the informativeness of the features and, therefore, select a relevant feature subset on which to train a single classifier.

In this study, we apply a multivariate search technique on a subspace randomly selected from the original feature space. In this reduced feature space the multivariate feature selection may better handle the noise in the data and will consequently be able to retrieve the informative features. In order to take into account all the measured features of the dataset, the procedure is repeated many times. As a result several feature subsets are selected. These are combined into a final list of selected features, by ordering the features based on their relevance derived from their accuracy in the individual runs. The final classifier can then be trained by using the final list of features. Our method can be applied in combination with existing classifiers and feature selection approaches, and is computationally feasible.

We compare our algorithm with other multivariate approaches, such as forward selection (Duda et al., 2001) and base-pair selection (Bo and Jonassen, 2002), as well as univariate techniques (Golub et al., 1999; Ben-Dor et al., 2000; Xing et al., 2001). The comparison is performed both on a real dataset and on an artificial dataset which provides a controlled environment, and models the mentioned assumptions of correlation between features and the presence of a large number of uninformative features. The results show the importance of multivariate search techniques and the robustness and reliability of our new method.

The paper is organized as follows: Section 2 describes the feature selection algorithms and gives a detailed description of our random subspace method for multivariate feature selection. The datasets used are presented in Section 3. Section 4 illustrates the experimental results of several multivariate and univariate feature selection techniques. Finally, the conclusions follow in Section 5.

2. Feature selection techniques

First, in Section 2.1, we briefly describe the univariate and multivariate techniques employed in the comparison experiments. Then, in Section 2.2, the random subspace method is introduced. Although the techniques are applied to a two-class problem, they can be extended to a multi-class problem. A solution for multiclass problems could be to apply the complete technique as described here for all pairs of classes, or for one class against the others, and then use multiclass combiner strategies to create the final classifier. Examples can be found, e.g., in (Allwein et al., 2000; Dietterich and Bakiri, 1995; Tax and Duin, 2002). Similarly, the evaluation of a feature on the basis of the SNR for a multiclass problem can be approached by computing the SNR criterion for all pairs of classes and assigning the minimum/maximum value to the feature, as sketched below. Another alternative could be the comparison of the distribution of one class against the overall distribution (as adopted by Tibshirani et al. (2002)), again assigning the minimum/maximum value to the feature.
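As an illustration of the pairwise extension, the sketch below computes the two-class SNR of Eq. (1) (defined in Section 2.1.1) for every pair of classes and keeps the minimum (or maximum) value per feature. It is a minimal sketch assuming NumPy arrays; the function name and the choice of the minimum as default are ours, not part of the original method.

```python
import itertools
import numpy as np

def multiclass_snr(X, y, reduce=np.min):
    """Per-feature SNR for a multiclass problem: compute the two-class SNR for
    every pair of classes and keep the minimum (or maximum) over all pairs."""
    classes = np.unique(y)
    pair_scores = []
    for a, b in itertools.combinations(classes, 2):
        Xa, Xb = X[y == a], X[y == b]
        diff = np.abs(Xa.mean(axis=0) - Xb.mean(axis=0))
        spread = np.sqrt(Xa.std(axis=0, ddof=1) ** 2 + Xb.std(axis=0, ddof=1) ** 2)
        pair_scores.append(diff / spread)
    return reduce(np.vstack(pair_scores), axis=0)
```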

2.1. Existing feature selection techniques

2.1.1. Univariate search technique

In the univariate approach the informativeness of each feature is evaluated individually, according to a criterion such as the signal-to-noise ratio (SNR) (Golub et al., 1999; Chow et al., 2001) for a two-class problem. The signal-to-noise ratio is defined as follows:


$$\mathrm{SNR} = \frac{|m_1 - m_2|}{\sqrt{s_1^2 + s_2^2}}, \qquad (1)$$

where m1 and m2 are the estimated means of the two classes and s1 and s2 are the estimates of the respective standard deviations. The higher the SNR, the more informative the corresponding feature; the features are ranked accordingly, i.e. from the most to the least informative. This provides an ordered feature list L, and a cross-validation procedure is employed to judge how many features from the top of this list to use.
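For concreteness, a minimal sketch of this univariate ranking, assuming NumPy and a label vector with exactly two distinct values; the function names are illustrative and not part of the paper.

```python
import numpy as np

def snr_scores(X, y):
    """SNR of Eq. (1) per feature for a two-class problem.
    X: (n_samples, n_features) array; y: label vector with two distinct values."""
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    return np.abs(m1 - m2) / np.sqrt(s1 ** 2 + s2 ** 2)

def univariate_ranking(X, y):
    """Ordered feature list L, from the most to the least informative feature."""
    return np.argsort(-snr_scores(X, y))
```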

2.1.2. Base-pair selection

The base-pair selection algorithm was proposed for microarray datasets by Bo and Jonassen (2002). The relevance of features is judged by evaluating pairs of features. For each pair the data is first projected by the diagonal linear discriminant (DLD) onto a one-dimensional space. The score can then be computed by the t-statistic in this space. In our implementation we have used the Fisher discriminant and the SNR instead of the DLD and the t-statistic, respectively. This enables a better comparison with the other studied techniques. Both a full search and a less computationally demanding greedy search are investigated. The complete search evaluates all pairs and ranks them in a list, without repetition, according to their scores. The computational complexity is a serious limitation of this full search, therefore a faster greedy search is also employed. The features are first ranked according to their individual SNR. The best one is taken and the method then searches, among all remaining features, for the feature that together with this individually best one obtains the highest score. This provides the first two features of the ordered list. From the remaining p − 2 features the best individual one is again taken and matched with the feature with which it achieves the highest score. This provides the second pair of features. By iterating the process the features are added, two at a time, until all of them are ordered.
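The greedy variant can be summarized in a few lines. The sketch below assumes NumPy; `fisher_snr` implements the pair criterion used here (Fisher projection followed by the SNR), with a small ridge term added for numerical stability that is our own addition, and `individual_rank` is the univariate SNR ordering from the previous sketch.

```python
import numpy as np

def fisher_snr(Xs, y):
    """Project a small feature subset onto the Fisher direction and return the SNR
    of the resulting one-dimensional data (the pair criterion described above)."""
    c1, c2 = np.unique(y)
    X1, X2 = Xs[y == c1], Xs[y == c2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Xs.shape[1]), m1 - m2)  # small ridge for stability
    p1, p2 = X1 @ w, X2 @ w
    return abs(p1.mean() - p2.mean()) / np.sqrt(p1.std(ddof=1) ** 2 + p2.std(ddof=1) ** 2)

def greedy_base_pair(X, y, individual_rank):
    """Greedy base-pair ordering: take the individually best remaining feature and
    pair it with the partner that maximises the projected SNR, two at a time.
    `individual_rank` is the univariate SNR ordering (most informative first)."""
    remaining, ordered = list(individual_rank), []
    while len(remaining) > 1:
        first = remaining.pop(0)
        partner = max(remaining, key=lambda j: fisher_snr(X[:, [first, j]], y))
        remaining.remove(partner)
        ordered += [first, partner]
    return ordered + remaining
```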

2.1.3. Forward selection

Forward feature selection starts with the single most informative feature and iteratively adds the next most informative feature in a greedy fashion. Here, we select the features based on the criterion proposed by Bo and Jonassen (2002). The first two features are obtained as the best pair described in the base-pair approach. Each of the p − 2 remaining features is then considered as a candidate third feature to be added to the best two. The obtained three-dimensional feature space is projected onto a one-dimensional space using, again, the Fisher discriminant, and the SNR criterion is computed. The best feature triplet is the one that achieves the highest value of the SNR. By iterating the procedure, the features are added one by one, providing an ordered list of features of length L. Note that the length of the list is limited to n. This upper limit stems from the fact that the Fisher classifier cannot be solved (without taking additional measures) if L > n.
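A sketch of this greedy growth, reusing the `fisher_snr` criterion from the base-pair sketch; the function signature and the `n_max` argument (capping the list length below n) are illustrative choices, not part of the original description.

```python
import numpy as np

def forward_selection(X, y, fisher_snr, start_pair, n_max):
    """Forward selection: starting from the best base pair, repeatedly add the
    feature that maximises the SNR of the Fisher projection of the enlarged subset,
    up to n_max features (n_max < n, so the Fisher direction remains solvable).
    `fisher_snr(Xs, y)` is the projection criterion from the base-pair sketch."""
    selected = list(start_pair)
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    while remaining and len(selected) < n_max:
        best = max(remaining, key=lambda j: fisher_snr(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```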

2.1.4. Recursive feature elimination (RFE)

RFE is an iterative backward selection proposed by Guyon et al. (2002). Initially a Support Vector Machine (SVM) classifier is trained with the full feature set. The quality of a feature is characterized by the weight that the SVM optimization assigns to that feature. A portion of the features with the smallest weights is removed at each iteration of the selection process. In order to build the ordered feature list of length L, the features that are removed are added at the bottom of the list. By iterating the procedure this list grows from the least to the most relevant feature. Note that the features are not considered individually, since their assigned weights depend on all the features considered during a given iteration.
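A minimal sketch of the elimination loop, using scikit-learn's LinearSVC as the linear SVM whose weights are pruned; the step size, the regularization constant C and the function name are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_ordering(X, y, step=1):
    """RFE sketch: train a linear SVM, discard the `step` features with the smallest
    absolute weights, and repeat on the survivors. Returns the features ordered from
    most to least relevant (the longest survivors come first)."""
    remaining = list(range(X.shape[1]))
    removed = []                                   # grows from least to most relevant
    while len(remaining) > step:
        clf = LinearSVC(C=1.0, dual=False).fit(X[:, remaining], y)
        order = np.argsort(np.abs(clf.coef_).ravel())        # smallest weights first
        worst = [remaining[k] for k in order[:step][::-1]]   # batch, larger weights first
        removed = worst + removed
        remaining = [j for j in remaining if j not in worst]
    return remaining + removed
```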

2.1.5. Liknon

Recently, Bhattacharyya et al. (2003) and Grate et al. (2002) proposed a classifier called Liknon that simultaneously performs classification and relevant feature identification. Liknon is trained by optimizing a linear discriminant function with a penalty constraint via linear programming. This yields a sparse hyperplane that is parameterized by a limited set of features (those assigned non-zero weights by Liknon). By varying the influence of the penalty term the size of the selected feature set can be varied.
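Liknon itself is solved as a linear program; the sketch below only illustrates the underlying idea (an L1 penalty produces a sparse weight vector whose non-zero entries identify the selected features) using scikit-learn's L1-penalized linear SVM as a stand-in, with C playing the role of the inverse penalty strength. It is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sparse_hyperplane_features(X, y, C=0.1):
    """Liknon-style sketch: an L1 penalty drives most weights of a linear classifier
    to zero, so the non-zero weights identify the selected features. (Liknon proper
    is solved as a linear program; an L1-penalised linear SVM is used here instead.)"""
    clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=C).fit(X, y)
    w = clf.coef_.ravel()
    return np.flatnonzero(w), w      # selected feature indices and the full weight vector
```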

2.2. Random subspace method

In the case of a high-dimensional feature space, it may be difficult for a multivariate search technique to identify the relevant features. In order to lower this risk we propose a new multivariate approach for feature selection based on the random subspace method (RSM) introduced by Ho (1995, 1998). A multivariate search technique is applied on a subspace randomly selected from the original feature space. In this reduced space, the search technique can better handle the problem of dimensionality, and thus retrieve the informative features, since the number of samples per feature increases. In order to cover a large portion of the features in the dataset, we repeat the selection t times. As a result, t feature subsets are evaluated, resulting in a weight associated with each feature for each of the selections. The weight is proportional to the relevance of the feature. We combine the results of all iterations in a final list of L features, ordered according to their relevance. Since this list is built upon the results of the more reliable feature evaluations that were performed in subspaces, the combined list is of better quality than a list constructed in the original complete feature space.

Our method can be applied together with different existing feature selection techniques. In the following, two algorithms are proposed, using Liknon and RFE, respectively, as basal feature selection methods. Algorithm 1 applies, in each of the t selections, the Liknon classifier to a random feature set of size s. Only some of the s features will have


a non-zero weight, which is proportional to the relevance of the feature in the subspace considered. For each feature, the weights are averaged over the number of times the feature was selected in the t subspaces. Finally the features are sorted according to their computed average weights and the top L features are selected, where L is optimized according to a cross-validation procedure. Again, Liknon is employed as the classifier on the greedily selected set of features.

Algorithm 1 (Random subspace method with Liknon (RSM-Liknon)).

1. Input: training set X, label set y, number of selections t, size of the subspace s, a matrix of zeros $Z = 0_{t \times p}$ to store the feature scores, and a vector $c = 0_{1 \times p}$ to count the number of times each feature is selected across the t selections.
2. Repeat for i = 1 : t
3. • Generate the random permutation index vector $p^i = \mathrm{perm}(\{1, \ldots, p\})$.
4. • Generate the index vector $v^i = \{p^i_1, p^i_2, \ldots, p^i_s\}$.
5. • Extract the features indicated by $v^i$: $\widetilde{X}_{v^i} \subset X$.
6. • Train the Liknon classifier on the labeled dataset to obtain the weights: $w^i = C(\widetilde{X}_{v^i}, y)$, with C the Liknon classification rule applied on the subspace $v^i$ of the training set X.
7. • Save the weights in the score matrix Z: $Z(i, v^i_j) = w^i_j$ for all $j = 1, \ldots, s$.
8. • Update the counter c: $c_j = c_j + 1$ for all $j \in v^i$.
9. Compute the score vector z:
$$z_j = \begin{cases} \dfrac{\sum_{i=1}^{t} Z_{ij}}{c_j}, & \text{if } c_j \neq 0, \\ 0, & \text{otherwise.} \end{cases}$$
10. Sort the list of features according to the score vector z.
11. Output: ordered list of the top L features.
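As a concrete illustration of Algorithm 1, here is a minimal NumPy sketch under stated assumptions: the base method is passed in as a function returning a weight vector of length s (for instance the L1 stand-in sketched in Section 2.1.5), absolute weights are averaged, and the default values t = 130 and s = 70 are simply the settings reported later in Section 4.1. Function names are ours.

```python
import numpy as np

def rsm_feature_scores(X, y, base_weights, t=130, s=70, seed=0):
    """Sketch of Algorithm 1: apply a base feature-weighting method to t random
    subspaces of size s and average each feature's (absolute) weight over the
    subspaces in which it was drawn. `base_weights(Xs, y)` returns a weight
    vector of length s."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    Z = np.zeros((t, p))             # per-selection feature scores
    c = np.zeros(p)                  # how often each feature was drawn
    for i in range(t):
        v = rng.permutation(p)[:s]   # random subspace v^i
        Z[i, v] = np.abs(base_weights(X[:, v], y))
        c[v] += 1
    z = np.divide(Z.sum(axis=0), c, out=np.zeros(p), where=c > 0)
    return np.argsort(-z), z         # ordered feature list and score vector z

# e.g. RSM-Liknon style usage with the earlier L1 sketch:
# ranking, scores = rsm_feature_scores(X, y, lambda Xs, ys: sparse_hyperplane_features(Xs, ys)[1])
```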

Algorithm 2 applies RFE instead of Liknon on each of the t randomly selected subspaces. Here too, all features of the subspace are assigned a weight. In contrast to RSM-Liknon, however, these weights are not comparable, since each weight is computed in a different feature space. Recall that RFE successively eliminates features on the basis of their weights. To establish the relevance of a feature we have therefore used the order in which the features are removed, i.e. the features that survive the RFE pruning the longest are the best. Instead of using the rank position itself as relevance indicator, we choose to quantize it: only when feature j is eliminated during the last l iterations of the RFE scheme is it judged relevant, and its score is incremented by 1. Finally, after the t selections, the score of each feature is determined, and the features are ordered accordingly. The top L features are selected, where L is optimized according to a cross-validation procedure using the Fisher classifier.

Algorithm 2 (Random subspace method with RFE (RSM–RFE)).

1. Input: training set X, label set y, number of selections t, size of the subspace s, threshold l (if a feature is eliminated during the last l iterations of RFE it is judged relevant), and a vector $z = 0_{1 \times p}$ to store the feature scores.
2. Repeat for i = 1 : t
3. • Generate the random permutation index vector $p^i = \mathrm{perm}(\{1, \ldots, p\})$.
4. • Generate the index vector $v^i = \{p^i_1, p^i_2, \ldots, p^i_s\}$.
5. • Extract the features indicated by $v^i$: $\widetilde{X}_{v^i} \subset X$.
6. • By applying the RFE procedure, obtain the order in which the features are removed: $L^i = U(\widetilde{X}_{v^i}, y)$. Here $L^i_k$ is the kth-to-last feature to be removed and U is the RFE procedure applied to the subspace $v^i$ of the training set X.
7. • Update the score vector z: $z_j = z_j + 1$ for all $j \in \{L^i_1, \ldots, L^i_l\}$.
8. Sort the list of features according to the score vector z.
9. Output: ordered list of the top L features.
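A matching NumPy sketch of Algorithm 2, under the same assumptions as the previous one; `rfe_ordering` is any function that returns the subspace features ordered from most to least relevant (for instance the RFE sketch of Section 2.1.4), and the defaults t = 130, s = 70, l = 3 mirror the settings reported in Section 4.1.

```python
import numpy as np

def rsm_rfe_scores(X, y, rfe_ordering, t=130, s=70, l=3, seed=0):
    """Sketch of Algorithm 2: in each of t random subspaces of size s, run RFE
    (`rfe_ordering(Xs, y)` returns the subspace features ordered from most to least
    relevant) and give one vote to the l features that survive elimination longest."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    z = np.zeros(p)                                  # score vector
    for i in range(t):
        v = rng.permutation(p)[:s]                   # random subspace v^i
        local_order = rfe_ordering(X[:, v], y)       # indices into the columns of X[:, v]
        z[v[np.asarray(local_order[:l])]] += 1       # the l longest survivors get a vote
    return np.argsort(-z), z
```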

Two parameters need to be set in both RSM–RFE and RSM-Liknon: the subspace size s and the number of selections t. The smaller the subspace size s, the faster the algorithm, but at the same time, the larger the chance of missing informative features or missing dependencies between many features. Similarly, the smaller the number of selections t, the faster the algorithm, but the smaller the amount of data available for the evaluation of the feature occurrences when building the final feature set. In the case of RSM–RFE an extra choice regards the number l of features judged relevant. The smaller l, the smaller the number of features judged relevant, but if this subset is too small good features may be missed. A large l may include irrelevant features, adding noisy dimensions to the subspace. The parameters are optimized empirically. For each parameter a set of possible values is chosen. Any available knowledge of the specific dataset can guide this choice, e.g. if the number of informative features is expected to be low, a small value of the threshold l should be considered. The final choice of the best parameter combination is based on the cross-validation error on the training set, as described in Section 4.1.

3. Datasets

3.1. The artificial dataset

In order to investigate the multivariate selection algorithms we generated an artificial dataset in which there is correlation between pairs of features. In this way, a pair of features is informative if considered together, and a multivariate selection strategy is necessary to find the truly informative features.

The informative features are generated in pairs: for eachpair the samples are sampled from a Gaussian distribution


with mean $\bar{\mu}_1 = [0,\ 0]^T$ for the first class and $\bar{\mu}_2 = [\frac{2}{\sqrt{2}}d,\ 0]^T$ for the second class. The covariance matrix, equal for both classes, is

$$\Sigma_1 = \Sigma_2 = \frac{1}{2}\begin{pmatrix} v+1 & v-1 \\ v-1 & v+1 \end{pmatrix}.$$

Pairs of correlated features are added until q informative features are generated. The remaining p − q features are uninformative, i.e. for both classes they are drawn from a spherical Gaussian distribution $N(0, \frac{v}{\sqrt{2}} I)$, where I is the identity matrix.

The artificial dataset allows us to investigate the behavior of the methods in a controlled environment. The ground truth knowledge of the information present makes it possible to investigate the retrieval ability of the different techniques. In the following experiments we have set the number of samples to n = 100 and the number of features to p = 300 to simulate a small sample size problem. In order to have a class overlap we set d = 3 and $v = \sqrt{40}$. Furthermore, we investigated the role of the informativeness in the dataset by varying q as follows: q = [20, 50, 100, 150].
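A generator sketch for this dataset, assuming balanced classes of size n/2, an even q, and reading $N(0, \frac{v}{\sqrt{2}} I)$ as specifying the covariance; the exact form of the class-2 mean, $\frac{2}{\sqrt{2}}d$, is reconstructed from the text above and should be checked against the original formulas.

```python
import numpy as np

def make_artificial(n=100, p=300, q=20, d=3.0, v=np.sqrt(40), seed=0):
    """Sketch of the artificial dataset of Section 3.1: q informative features
    generated in correlated pairs (q assumed even), plus p - q uninformative
    features drawn from a spherical Gaussian with covariance (v / sqrt(2)) * I."""
    rng = np.random.default_rng(seed)
    n1 = n // 2
    y = np.r_[np.zeros(n1, dtype=int), np.ones(n - n1, dtype=int)]
    cov = 0.5 * np.array([[v + 1, v - 1], [v - 1, v + 1]])   # shared pair covariance
    mu2 = np.array([2.0 / np.sqrt(2) * d, 0.0])              # class-2 mean of each pair
    X = np.empty((n, p))
    for k in range(0, q, 2):                                 # informative, correlated pairs
        X[:n1, k:k + 2] = rng.multivariate_normal([0.0, 0.0], cov, size=n1)
        X[n1:, k:k + 2] = rng.multivariate_normal(mu2, cov, size=n - n1)
    X[:, q:] = rng.normal(0.0, np.sqrt(v / np.sqrt(2)), size=(n, p - q))  # uninformative
    return X, y
```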

3.2. The spectra dataset

The algorithms are also compared on a small sample size version of the spectra dataset. The original dataset consists of autofluorescence spectra measured at multiple locations in the oral cavity (de Veld et al., 2004). There are two classes: 96 healthy volunteers and 155 patients with lesions in the oral cavity. After preprocessing, each spectrum consists of 199 bins (wavelengths). Although multiple spectra were acquired for the same person, we adopted one location only, in order to reduce the redundancy in the data. Therefore, the number of features is 199. We study the role of the sample size by using datasets with 50, 100 and 200 patients, balanced per class.

4. Experimental results

First, the set-up of the experiments is described. Then the experimental results are presented, followed by a discussion of the effect of a change in informativeness and in the number of samples in the datasets. The experiments are implemented in a Matlab environment using the PRTools (Duin et al., 2004) and PRExp (Paclík et al., 2003) toolboxes.

4.1. Experimental set up

As a criterion to judge the feature relevance in the one-dimensional space the SNR is adopted, due to its simplicity and popularity (Golub et al., 1999; van 't Veer et al., 2002; Khan et al., 2001). We use the Fisher criterion to project the data from multiple dimensions to a single dimension, as is required in both the forward and base-pair selection methods. We have used the Fisher classifier since it can exploit feature correlation. The Nearest Mean Classifier (NMC) with the cosine correlation as distance measure is also employed, in order to compare the results with other published approaches (Ein-Dor et al., 2004; van 't Veer et al., 2002).

The Liknon classifier requires the optimization of the strength of the penalty term. This optimization is done beforehand in a 10-fold cross-validation procedure over the range of values [10^-1, ..., 10^3]. Since the average error was constant across this range, we chose 0.1 as the value of the penalty strength because it selects a small feature subset.

The RSM–RFE approach was applied in combination with the Fisher classifier to allow for a better comparison with the other studied approaches, which also employ the Fisher classifier. We chose to remove one feature per iteration. Concerning the optimization of the parameters, we varied the subspace size across s = [40, 70, 100] and s = [40, 70, 100, 130] for the artificial and real datasets, respectively. On both datasets we selected as the best subspace size the value s* = 70, which reached the lowest cross-validation error on the training set. The number of selections is set to t* = 130. Smaller values, i.e. t = 100 and t = 50, were also tested. Although the performances are not always sensitive to this parameter, the larger t the more data is available for the evaluation of the feature relevance. For RSM–RFE, several settings of the threshold l were experimentally investigated: l = [3, 5, 8, 10, 15, 20] for the artificial dataset and l = [3, 5, 10, 15] for the real dataset. The settings l = 3 and l = 5, respectively, reached the lowest cross-validation error on the training set and are used in the experimental results presented below.

When designing a classification system, two steps need to be taken. The first is the classifier training, and the second is the estimation of the classifier performance. Due to the small number of samples, a cross-validation procedure is a preferable approach to estimate the classification error. In order to have an unbiased error estimate the two steps should be performed independently (Ambroise and McLachlan, 2002). Therefore, we employ a double-loop cross-validation procedure (Wessels et al., 2005). In the inner loop, for each fold the feature selection is performed, giving the feature list L. The classifier is trained starting with the first two features of L until all features are used. The subset of L that shows the smallest classification error is selected. The selected feature sets from each fold are merged into a final subset Lw, i.e. a list, without duplication, of all features present. In the external loop the performance of the classifier is estimated. This procedure ensures that the training and evaluation of the classifier are completely decoupled, so as to prevent any bias in the performance evaluation. We chose to use 10-fold cross-validation for both the inner and the external loop. This choice is also suggested by Kohavi (1995).
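A simplified sketch of this double-loop protocol, assuming scikit-learn-style estimators; it optimizes only the list length in the inner loop and omits the merging of the per-fold selections into Lw, so it illustrates the decoupling idea rather than reproducing the paper's exact procedure. All function names and the candidate list sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def double_loop_error(X, y, rank_features, make_classifier, list_sizes=range(2, 31)):
    """Inner 10-fold loop (training data only) picks how many top-ranked features to
    use; outer 10-fold loop estimates the error of the resulting classifier, so
    feature selection and performance estimation stay decoupled.
    `rank_features(X, y)` returns an ordered feature list; `make_classifier()`
    returns an unfitted scikit-learn-style estimator."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    errors = []
    for tr, te in outer.split(X, y):
        Xtr, ytr = X[tr], y[tr]
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
        inner_err = {k: [] for k in list_sizes}
        for itr, ite in inner.split(Xtr, ytr):
            L = rank_features(Xtr[itr], ytr[itr])
            for k in list_sizes:
                clf = make_classifier().fit(Xtr[itr][:, L[:k]], ytr[itr])
                inner_err[k].append(np.mean(clf.predict(Xtr[ite][:, L[:k]]) != ytr[ite]))
        best_k = min(list_sizes, key=lambda k: np.mean(inner_err[k]))
        L = rank_features(Xtr, ytr)            # final selection on the full training part
        clf = make_classifier().fit(Xtr[:, L[:best_k]], ytr)
        errors.append(np.mean(clf.predict(X[te][:, L[:best_k]]) != y[te]))
    return np.mean(errors), np.std(errors)
```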

4.2. Results

Fig. 1 shows the behavior of the different methodologies on the artificial dataset with n = 100, p = 300 and q = 20.


Fig. 1. Artificial dataset with n = 100, p = 300 and informative features q = 20. Average classification error of the 10-fold cross-validation procedure for the different approaches.


The average classification error (computed in the outer loop) and its standard deviation are given. It is visible that the univariate approaches (univariate NMC, univariate FLD) perform the worst, while the base-pair approaches (both full and greedy search) reach the best performance. This is expected, due to the fact that pairs of features are strongly correlated. Classical Liknon does not perform well; we explore this further in Section 4.3.

Fig. 2. Relevant features of the artificial dataset retrieved by the different selection techniques.

The use of the RSM with both Liknon and RFE improves the results dramatically, such that they become comparable with the base-pair approaches. The same classification methodologies were also applied to different settings of the artificial dataset, i.e. a smaller number of informative features (q = 10) and sample sizes n = [250, 50] (data not shown). The larger



the number of informative features and/or samples, the higher the performance of all classifiers. However, the different methodologies show the same behavior: univariate selection and classical Liknon perform poorly, while base-pair selection and the RSM approaches give good results.

The knowledge of which features are informative in the artificial dataset allows us to study the retrieval capability of the different feature selection strategies.

Fig. 3. Spectra dataset with n = 100 and p = 199. Average classification error of the 10-fold cross-validation procedure for the different approaches.

Fig. 4. Average error of the inner cross-validation procedure as a function of the number of features, for the artificial and the spectra datasets and three different approaches: RSM-Liknon, RSM–RFE and base-pair.

Fig. 2 shows the number of informative features retrieved by the different methodologies as a function of the subset size. The results are the average over the 100 folds of the inner 10-fold cross-validation loop. Surprisingly, both forward search and RFE retrieve more uninformative features than the univariate approach, since these selection methods should



be capable of detecting the multivariate informative features. Apparently the small sample size hampers these methods severely. This will be tested in Section 4.3. The full-search base-pair approach recovers all 20 informative features perfectly. The RSM improves the number of informative features retrieved for both RFE and Liknon.

Fig. 3 shows the average classification error on the spectra dataset with 100 samples and 199 features. The best performing approach is RSM-Liknon, which is statistically significantly better than all other methods except Liknon. RSM–RFE does not perform as well as expected, and other methods, such as the base-pair approach, give a surprisingly high error rate. To further investigate these aspects we looked at the error obtained in the inner cross-validation step for the RSM-Liknon, RSM–RFE and base-pair approaches.

Fig. 4 displays the average error for both the artificial and the spectra datasets as a function of the first 150 features. In the techniques that employ the Fisher classifier, namely RSM–RFE and the base-pair approach, the 'peaking behavior' is visible, which occurs when the number of samples is comparable to the number of features. This phenomenon has been studied in (Skurichina, 2001; Friedman, 1989; Dai and Yuen, 2003). The peaking behavior does not affect the artificial dataset. This is because the Fisher classifier is applied to the selected subset of Lw features. In all mentioned methods the size of Lw is not in the range of the peaking behavior; e.g. for the base-pair approach, the median value of Lw over the 10 folds is 10 features. Therefore, no further action needs to be taken. However, in the case of the spectra dataset the peaking effect is not negligible, since now the median value of Lw is in the peaking

Fig. 5. Classification error of three methodologies, namely univariate selection with NMC, RSM-Liknon and classical Liknon, for different artificial datasets of the same size (n = 100, p = 300) but a varying number of informative features, i.e. q = [20, 50, 100, 150].

region; e.g. in the base-pair case the median Lw value over the 10 folds is 96 features. This explains the poor performance of the methods that use the Fisher classifier. Possible solutions to this problem are extensively described by Skurichina (2001) and involve regularization by noise injection or by the addition of redundant features. Unfortunately, these solutions are beyond the scope of this paper.

4.3. Discussion on the effect of informative features and samples

The Liknon classifier does not perform well on the artificial set, as shown in Fig. 1. Our hypothesis is that this is due to the presence of too few informative features (20) relative to the total number of features. Therefore, we investigated how the number of informative features influences the classification performance of the three approaches, namely univariate with NMC, RSM-Liknon and classical Liknon.

Fig. 5 shows the classification error of the mentioned methodologies when applied to four artificial datasets with n = 100, p = 300, and q = [20, 50, 100, 150]. The univariate approach with the NMC classifier benefits only when the number of informative features increases from 20 to 50. Thereafter it is hampered by the fact that it does not exploit the correlation between the features. RSM-Liknon is a stable methodology that proves to perform well also under difficult conditions, i.e. even when little information is present. Classical Liknon clearly decreases its error with an increase in the number of informative features, even up to the point where it outperforms RSM-Liknon significantly (q = 150). This behavior supports our hypothesis and exemplifies the need for the RSM technique.



Fig. 6. Classification error of three methodologies, namely univariate selection with NMC, RSM-Liknon and classical Liknon, for the spectra dataset with different numbers of samples (n = [50, 100, 200]).


The good performance achieved by Liknon on the spectra dataset suggests that this dataset contains many informative features. In order to evaluate the role of the sample size on the RSM approach, we consider the spectra dataset with different numbers of samples, i.e. n = [50, 100, 200]. Fig. 6 shows the classification error of the following approaches: univariate with NMC, RSM-Liknon and classical Liknon. When the sample size is too small, the multivariate search techniques are not able to retrieve any additional information, and the performance is comparable with that of the univariate approach. For an increased sample size the multivariate approaches are beneficial, and the RSM approach obtains the best performance. A further increase in the number of samples lowers the need for resampling in a subspace; in this case the classical method is not improved by the RSM.

5. Conclusions

In small sample size problems an important step is feature selection. This should lead to an informative feature space in which the classification task can be successfully performed.

In order to perform the selection, the informativeness of the features must be evaluated. We studied several approaches, both univariate, where each feature is considered individually, and multivariate, where the criterion is dependent on multiple dimensions.

A limitation of the multivariate approaches is their sensitivity to the dimensionality of the feature space, which often causes over-training. In order to overcome this difficulty we proposed a new method based on random subspace selection (RSM). A multivariate search technique is applied on a subspace randomly selected from the original feature space. In this reduced space the technique may better handle the noise in the data and retrieve the informative features.

We introduced an artificial dataset in order to have ground truth information. The artificial dataset models a small sample size dataset with both a large number of uninformative features and correlation between the informative ones, since both conditions are assumed to be present in real datasets. We also tested our algorithm on a spectra dataset and illustrated the sensitivity of the different studied approaches to the sample size.

The results point out the importance of multivariate search techniques and the robustness and reliability of our new method. The univariate approach is outperformed by the multivariate methodologies. RSM–RFE and the other methods that use the Fisher classifier are hampered by the sensitivity of the Fisher linear discriminant to the dimensionality of the feature space; future work will address how to better overcome these limitations. The RSM is a robust and powerful approach for feature selection and classification, especially under small sample size conditions.

Acknowledgements

The authors would like to thank Pavel Paclík and Serguei Verzakov for fruitful discussions on the manuscript.

References

Allwein, E., Schapire, R., Singer, Y., 2000. Reducing multiclass to binary: a unifying approach for margin classifiers. In: Proc. 17th Internat. Conf. on Machine Learning. Morgan Kaufmann, pp. 9–16.


Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA 99 (10), 6562–6566.

Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z., 2000. Tissue classification with gene expression profiles. In: Proc. 4th Ann. Internat. Conf. on Computational Molecular Biology. ACM Press, pp. 54–64.

Bhattacharyya, C., Grate, L.R., Rizki, A., Radisky, D., Molina, F.J., Jordan, M.I., Bissell, M.J., Mian, I.S., 2003. Simultaneous classification and relevant feature identification in high-dimensional spaces: application to molecular profiling data. Signal Process. 83 (4), 729–743.

Bo, T., Jonassen, I., 2002. New feature subset selection procedures for classification of expression profiles. Genome Biol. 3.

Chow, M.L., Moler, E.J., Mian, I.S., 2001. Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol. Genomics 5, 99–111.

Dai, D.Q., Yuen, P.C., 2003. Regularized discriminant analysis and its application to face recognition. Pattern Recognition 36, 845–847.

de Veld, D.C.G., Skurichina, M., Witjes, M.J.H., Duin, R.P.W., Sterenborg, D.J.C.M., Star, W.M., Roodenburg, J.L.N., 2004. Clinical study for classification of benign, dysplastic, and malignant oral lesions using autofluorescence spectroscopy. J. Biomed. Opt. 9 (5), 940–950.

Dietterich, T., Bakiri, G., 1995. Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res., 263–286.

Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, second ed. John Wiley & Sons, Inc.

Dudoit, S., Fridlyand, J., Speed, T.P., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 (457), 77–87.

Duin, R.P.W., Juszczak, P., de Ridder, D., Paclík, P., Pekalska, E., Tax, D.M.J., 2004. PR-Tools 4.0, a Matlab toolbox for pattern recognition. Technical report, ICT Group, TU Delft, The Netherlands, January 2004. Available from: <http://www.prtools.org>.

Ein-Dor, L., Kela, I., Getz, G., Givol, D., Domany, E., 2004. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics (12).

Friedman, J.H., 1989. Regularized discriminant analysis. J. Amer. Statist. Assoc.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Grate, L.R., Bhattacharyya, C., Jordan, M.I., Mian, I.S., 2002. Simultaneous classification and relevant feature identification in high-dimensional spaces. In: Workshop on Algorithms in Bioinformatics.

Guyon, I., Weston, J., Barnhill, S., 2002. Gene selection for cancer classification using support vector machines. Machine Learn. 46, 389–422.

Ho, T.K., 1995. Random decision forests. In: 3rd Internat. Conf. on Document Analysis and Recognition, pp. 278–282.

Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Machine Intell. 20 (8), 832–844.

Jain, A., Zongker, D., 1997. Feature selection: evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Machine Intell. 19 (2).

Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med. 7 (6), 673–679.

Kohavi, R., 1995. The power of decision tables. In: Proc. European Conf. on Machine Learning.

Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artif. Intell. 97, 273–324.

Li, L., Weinberg, C.R., Darden, T.A., Pedersen, L.G., 2001. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17 (12), 1131–1142.

Paclík, P., Landgrebe, T.C.W., Duin, R.P.W., 2003. PRExp 2.0, a Matlab toolbox for evaluation of pattern recognition experiments. Technical report, ICT Group, TU Delft, The Netherlands, December.

Pudil, P., Novovicova, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15, 1119–1125.

Raudys, S., Jain, A., 1991. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans. Pattern Anal. Machine Intell. 13 (3), 252–264.

Skurichina, M., 2001. Stabilizing weak classifiers. Ph.D. thesis, Delft University of Technology.

Skurichina, M., Duin, R.P.W., 2002. Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal. Appl. 5, 121–135.

Tax, D.M.J., Duin, R.P.W., 2002. Using two-class classifiers for multiclass classification. In: Proc. 16th Internat. Conf. on Pattern Recognition, vol. 2.

Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G., 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA 99 (10), 6567–6572.

Trunk, G.V., 1979. A problem of dimensionality: a simple example. IEEE Trans. Pattern Anal. Machine Intell. 1 (3).

van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., Friend, S.H., 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

Wessels, L.F.A., Reinders, M.J.T., Hart, A.A.M., Veenman, C.J., Dai, H., He, Y.D., van 't Veer, L.J., 2005. A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics 21 (19), 3755–3762.

Xing, E., Jordan, M., Karp, R., 2001. Feature selection for high-dimensional genomic microarray data. In: Internat. Conf. on Machine Learning.

Xiong, M., La, W., Zhao, J., Jin, L., Boerwinkle, E., 2001. Feature (gene) selection in gene expression-based tumor classification. Mol. Genet. Metab. 73, 239–247.