338 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 11, NO. 3, MAY 2007

Bagging Linear Sparse Bayesian Learning Models for Variable Selection in Cancer Diagnosis

Chuan Lu, Andy Devos, Johan A. K. Suykens, Senior Member, IEEE, Carles Arús, and Sabine Van Huffel, Senior Member, IEEE

Abstract—This paper investigates variable selection (VS) and classification for biomedical datasets with a small sample size and a very high input dimension. The sequential sparse Bayesian learning methods with linear bases are used as the basic VS algorithm. Selected variables are fed to the kernel-based probabilistic classifiers: Bayesian least squares support vector machines (BayLS-SVMs) and relevance vector machines (RVMs). We employ the bagging techniques for both VS and model building in order to improve the reliability of the selected variables and the predictive performance. This modeling strategy is applied to real-life medical classification problems, including two binary cancer diagnosis problems based on microarray data and a brain tumor multiclass classification problem using spectra acquired via magnetic resonance spectroscopy. The work is experimentally compared to other VS methods. It is shown that the use of bagging can improve the reliability and stability of both VS and model prediction.

Index Terms—Bagging, kernel-based probabilistic classifiers, magnetic resonance spectroscopy (MRS), microarray, sparse Bayesian learning (SBL), variable selection (VS).

I. INTRODUCTION

RECENT advances in technologies such as microarrays and magnetic resonance (MR) have facilitated the collection of genomic, proteomic, and metabolic data that can be used for medical decision support. For example, DNA microarrays enable us to simultaneously monitor the expression of thousands of genes [1], [2]. It is then possible to compare the overall differences in gene expression between normal and diseased cells. Magnetic resonance spectroscopy (MRS) [3] is able to provide detailed chemical information about the metabolites present in living tissue. In particular, in vivo proton MRS offers considerable potential for clinical applications, e.g., for brain tumor diagnosis [3], [4].

Manuscript received July 2, 2004; revised February 14, 2006 and June 8, 2006. This work was supported in part by the IUAP under Project Phase V-22, Project KUL MEFISTO-666, and Project IDO/99/03, in part by the FWO under Project G.0407.02 and Project G.0269.02, and in part by the EU under Project BIOPATTERN (FP6-2002-IST 508803), Project eTUMOUR (FP6-2002-LIFESCIHEALTH 503094), and Project HEALTHagents (FP6-2005-IST 027214). The work of C. Lu was supported by a doctoral grant from the Katholieke Universiteit Leuven. The work of A. Devos was supported by an IWT grant (IWT-Vlaanderen).

C. Lu is with the Department of Computer Science, University of Wales, Aberystwyth SY23 3DB, U.K. (e-mail: [email protected]).

A. Devos, J. A. K. Suykens, and S. Van Huffel are with SCD-SISTA, ESAT, Department of Electrical Engineering, Katholieke Universiteit Leuven, B-3001 Leuven, Belgium (e-mail: [email protected]; [email protected]; [email protected]).

C. Arús is with the Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain (e-mail: [email protected]).

Digital Object Identifier 10.1109/TITB.2006.889702

Much attention has been paid to class prediction in the context of such new diagnostic tools, particularly for cancer diagnosis. The task is to classify and predict the category of a sample on the basis of its gene-expression profile or its MR spectrum. Conventional cancer diagnosis has been based on biopsy and the examination by a pathologist of the morphological appearance of stained tissue specimens under the microscope [1], [2]. This method depends strongly on the expertise of the pathologist. Microarrays and MRS offer the hope that cancer classification can be objective and highly accurate, helping clinicians to choose appropriate treatments. The challenge of classification using microarrays and MR spectra lies in: 1) the large number of input variables and the relatively small number of samples, and 2) the presence of noise and artefacts.

Kernel-based methods are of particular interest for this task, since they can naturally deal with high-dimensional data and have been supported by both statistical learning theory and empirical results [5]–[7]. Despite these early successes, the presence of a significant number of irrelevant variables (features) or of measurement noise might hamper the performance and interpretation of predictive models.

Variable selection (VS) is, therefore, used to identify the variables most relevant for classification. This is important for medical classification, as it can have an impact not only on the accuracy and complexity of the classifiers, but also on the economics of data acquisition. It is also helpful for understanding the underlying mechanisms of the disease, which could assist drug discovery and early diagnosis.

Several statistical and computational approaches to VS exist for classifying such data. The first approach is variable ranking followed by a selection (or filtering) step, usually accompanied by cross-validation (CV), to determine the number of variables to be used in the classifier. The ranking criteria can be based on, e.g., correlation, t-statistics, or multivariate methods such as recursive feature elimination (RFE) with support vector machines (SVMs) [5], [6]. The second approach is the so-called wrapper approach, which searches for the optimal combination of variables according to some performance measure of the models [8]–[10].

The embedded approach combines the two tasks of VS and model fitting into one optimization procedure. Embedded SVM-based algorithms typically reformulate the standard SVM optimization problem in order to select only a fixed number of variables. This can be done by imposing additional constraints or by adopting objective functions such as a generalization bound [7]. Nevertheless, these methods usually require an additional CV step to choose the predefined number of variables.



In [11], Bayesian automatic relevance determination (ARD) algorithms were exploited. These types of embedded methods can automatically determine the number of selected variables. However, they seem to be sensitive to small perturbations of the training set, rendering their results less reliable from a biological point of view.

In this paper, we explore an alternative method that follows the Bayesian ARD approach, and show how the reliability of the selected variables and the classification performance can be improved by using bagging and by feeding the variables selected on bootstrap samples to various types of probabilistic models. Moreover, by utilizing sparse Bayesian learning (SBL) with logistic functions, this method requires no tuning of nuisance hyperparameters.

The rest of the paper is organized as follows. Section II introduces the SBL algorithms. A brief review of the two kernel-based probabilistic classification algorithms, namely Bayesian least squares support vector machines (BayLS-SVMs) and relevance vector machines (RVMs), is given in Section III. The bagging strategy for VS and modeling is proposed in Section IV. Section V describes the methods compared for VS and modeling. These methods are applied to three cancer-classification problems, as described in Section VI, together with results and the biological interpretation of some selected variables. Sections VII and VIII present a discussion and some conclusions.

II. BASIC ALGORITHM FOR VS

A. SBL

Supervised learning infers a functional relation $y = f(x)$ from a training set $D = \{x_n, y_n\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$. SBL applies Bayesian ARD to models linear in their parameters so that sparse solutions (i.e., with many parameters equal to zero) can be obtained [12]. Its prediction on $y$ given $x$ is based upon

$$f(x; w) = \sum_{m=0}^{M} w_m \phi_m(x) = w^{\mathrm{T}} \phi(x). \tag{1}$$

Two forms of basis functions $\phi_m(z)$ are considered here: $\phi_m(z) = z_m$, $m = 1, \ldots, d$ (i.e., $\phi_m(z)$ is the $m$th original input variable), and $\phi_m(z) = K(z, x_m)$, $m = 1, \ldots, N$, where $K(\cdot, \cdot)$ denotes some symmetric kernel function. $\phi_0(z)$ is set to 1 in order to include an intercept (bias) term in the emerging model. The basic VS algorithm relies on the SBL model using the first form of basis functions, termed linear basis functions. In contrast, the RVM takes the kernel representation for the basis functions. The RVM will be revisited in Section III-B as a probabilistic classifier.

For a regression problem, the likelihood of the data for an SBL model can be expressed as

$$p(y \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\|y - \Phi w\|^2\right) \tag{2}$$

where $\sigma^2$ is the variance of the i.i.d. noise, and $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_N)]^{\mathrm{T}}$ is the $N \times M$ design matrix. The parameters $w$ are given a Gaussian prior

$$p(w \mid \alpha) = \prod_{m=0}^{M} \mathcal{N}(w_m \mid 0, \alpha_m^{-1}) \tag{3}$$

where $\alpha = \{\alpha_m\}$ is a vector of hyperparameters, with one hyperparameter $\alpha_m$ assigned to each model parameter $w_m$. As illustrated in [12] and [13], this is equivalent to using regularization with a penalty of $\sum_m \log|w_m|$, which encourages sparsity. Given $\alpha$, the posterior parameter distribution can be derived via Bayes' rule

$$p(w \mid y, \alpha, \sigma^2) = p(y \mid w, \sigma^2)\, p(w \mid \alpha) / p(y \mid \alpha, \sigma^2) \tag{4}$$

which is also Gaussian, with covariance and mean

$$\Sigma = (\sigma^{-2}\Phi^{\mathrm{T}}\Phi + A)^{-1}, \qquad \mu = \sigma^{-2}\Sigma\Phi^{\mathrm{T}} y. \tag{5}$$

The hyperparameters $\alpha$ can be estimated using type-II maximum likelihood, in which the marginal likelihood is maximized, where

$$p(y \mid \alpha, \sigma^2) = \int_{-\infty}^{\infty} p(y \mid w, \sigma^2)\, p(w \mid \alpha)\, dw = (2\pi)^{-N/2} |C|^{-1/2} \exp\left(-\frac{1}{2} y^{\mathrm{T}} C^{-1} y\right) \tag{6}$$

with $C = B^{-1} + \Phi A^{-1} \Phi^{\mathrm{T}}$, $B = \sigma^{-2} I$, and $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$.
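As a concrete illustration of (5) and (6), the following NumPy sketch computes the weight posterior and the log marginal likelihood for fixed hyperparameters; the names `Phi`, `alpha`, and `sigma2` are illustrative and not taken from the authors' Matlab code.

```python
import numpy as np

def sbl_posterior(Phi, y, alpha, sigma2):
    """Weight posterior (5) for fixed hyperparameters alpha and noise variance sigma2."""
    A = np.diag(alpha)                                   # A = diag(alpha_0, ..., alpha_M)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)      # Sigma = (sigma^-2 Phi^T Phi + A)^-1
    mu = Sigma @ Phi.T @ y / sigma2                      # mu = sigma^-2 Sigma Phi^T y
    return Sigma, mu

def log_marginal(Phi, y, alpha, sigma2):
    """Log of the marginal likelihood (6), with C = sigma^2 I + Phi A^-1 Phi^T."""
    N = len(y)
    C = sigma2 * np.eye(N) + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(C, y))
```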

For binary classification problems, one can utilize the logistic function $g(a) = 1/(1 + e^{-a})$ [12]. The likelihood is then based on the Bernoulli distribution

$$p(y \mid w) = \prod_{n=1}^{N} g(f(x_n; w))^{y_n} \left[1 - g(f(x_n; w))\right]^{1 - y_n} \tag{7}$$

where $y_n \in \{0, 1\}$. There is no noise variance in this case, and a local Gaussian approximation is used to compute the posterior distribution of the weights $p(w \mid y, \alpha)$ and the marginal likelihood $p(y \mid \alpha)$. For a given $\alpha$, we can estimate the mean and covariance of the weights ($\mu$ and $\Sigma$) by an iteratively reweighted least squares (IRLS) algorithm (e.g., the Newton-Raphson method [14]). The following expressions are exploited [12]:

$$\Sigma = (\Phi^{\mathrm{T}} B \Phi + A)^{-1}, \qquad \mu = \Sigma \Phi^{\mathrm{T}} B \hat{y}$$

and

$$\hat{y} = \Phi\mu + B^{-1}\left(y - g(\Phi\mu)\right)$$

where $B = \mathrm{diag}(\beta_1, \ldots, \beta_N)$, with $\beta_n = g(f(x_n; \mu))[1 - g(f(x_n; \mu))]$.
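A minimal sketch of the IRLS iteration just described, computing the Laplace approximation $(\Sigma, \mu)$ of the weight posterior for the logit SBL model; the clipping of $\beta_n$ is a numerical safeguard added here, not part of the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_posterior(Phi, y, alpha, n_iter=25, tol=1e-8):
    """Laplace approximation of p(w | y, alpha) for the logit SBL model."""
    A = np.diag(alpha)
    mu = np.zeros(Phi.shape[1])
    Sigma = None
    for _ in range(n_iter):
        p = sigmoid(Phi @ mu)
        beta = np.clip(p * (1.0 - p), 1e-10, None)   # beta_n = g(f)[1 - g(f)], clipped
        B = np.diag(beta)
        Sigma = np.linalg.inv(Phi.T @ B @ Phi + A)   # Sigma = (Phi^T B Phi + A)^-1
        y_hat = Phi @ mu + (y - p) / beta            # y_hat = Phi mu + B^-1 (y - g(Phi mu))
        mu_new = Sigma @ Phi.T @ B @ y_hat           # mu = Sigma Phi^T B y_hat
        if np.max(np.abs(mu_new - mu)) < tol:
            mu = mu_new
            break
        mu = mu_new
    return Sigma, mu
```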

The optimization process, i.e., maximization of the marginal likelihood with respect to $\alpha$, and possibly $\sigma^2$, can be performed efficiently using an iterative re-estimation procedure [12], [13]. A fast sequential learning algorithm has also been introduced in [15], which enables us to efficiently process data of high dimensionality. We have adapted this algorithm to our applications, which will be detailed in the next section.

The most relevant variables for the classifier can be obtained from the resulting sparse solutions, if the original variables are taken as basis functions in the SBL model. Models of this type are referred to as linear SBL models in this paper.

B. Sequential SBL Algorithm

The sequential SBL algorithm [15] starts from a zero basis, adds or deletes a basis function at each iteration step, or updates a hyperparameter $\alpha_m$, until convergence.

For optimization of the hyperparameters $\alpha$, the objective function is the logarithm of the marginal likelihood, $L(\alpha) = \log p(y \mid \alpha, \sigma^2)$. It is shown in [13] and [15] that we may analyze the properties of $L(\alpha)$ by decomposing it into $L(\alpha_{-i})$, the marginal likelihood with $\phi_i$ (the $i$th column of $\Phi$) excluded, and $\ell(\alpha_i)$, the contribution of $\phi_i$ alone. That is, $L(\alpha) = L(\alpha_{-i}) + \ell(\alpha_i)$, where

$$\ell(\alpha_i) = \frac{1}{2}\left[\log \alpha_i - \log(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i}\right]$$

with $s_i = \phi_i^{\mathrm{T}} C_{-i}^{-1} \phi_i$ and $q_i = \phi_i^{\mathrm{T}} C_{-i}^{-1} y$, where $C_{-i}$ is $C$ with the contribution of $\phi_i$ removed. Since $s_i$ and $q_i$ are independent of $\alpha_i$, one can obtain a unique maximum of $L(\alpha)$ with respect to $\alpha_i$ by setting the first derivative of $\ell(\alpha_i)$ to zero. The optimal values for $\alpha_i$ are

$$\alpha_i = \frac{s_i^2}{q_i^2 - s_i}, \quad \text{if } q_i^2 > s_i \tag{8}$$

$$\alpha_i = \infty, \quad \text{if } q_i^2 \le s_i. \tag{9}$$

One convenient way to derive $s_i$ and $q_i$ is to utilize the expressions $s_i = \alpha_i S_i/(\alpha_i - S_i)$ and $q_i = \alpha_i Q_i/(\alpha_i - S_i)$; when $\alpha_i = \infty$, $s_i = S_i$ and $q_i = Q_i$. In practice, the two quantities $S_i$ and $Q_i$ are computed using the following equations:

$$S_i = \phi_i^{\mathrm{T}} C^{-1} \phi_i = \phi_i^{\mathrm{T}} B \phi_i - \phi_i^{\mathrm{T}} B \Phi \Sigma \Phi^{\mathrm{T}} B \phi_i \tag{10}$$

$$Q_i = \phi_i^{\mathrm{T}} C^{-1} \hat{y} = \phi_i^{\mathrm{T}} B \hat{y} - \phi_i^{\mathrm{T}} B \Phi \mu \tag{11}$$

where $\Phi$, $\Sigma$, and $\mu$ contain only the parts corresponding to the basis functions included in the model (with $\alpha_m < \infty$).

The marginal likelihood maximization algorithm jointly optimizes the weights and the hyperparameters $\{\alpha_m\}_{m=0}^{M_{\mathrm{all}}}$, with $M_{\mathrm{all}}$ the maximum index of the basis functions. In the case of linear basis functions, $M_{\mathrm{all}} = d$; in the case of kernel basis functions, $M_{\mathrm{all}} = N$. Define $I_{\mathrm{all}}$, the complete set of possible indices for the basis functions, containing the integers from 0 to $M_{\mathrm{all}}$. Our modified SBL algorithm for classification utilizing the logistic function is as follows.

1) Initialize the model with only an intercept: $\alpha_0 < \infty$ (e.g., $\alpha_0 = (y^{\mathrm{T}}y/N)^{-2}$), and $\alpha_m = \infty$ for all $m > 0$. Initialize the index set of the bases in the model: $I_{\mathrm{sel}} \leftarrow \{0\}$.

2) Given the current $\alpha$, estimate $\Sigma$ and $\mu$ using the IRLS algorithm for the logit model. Note that $\Sigma$ and $\mu$ relate only to the basis functions included in the current model, initially with only one scalar element; $\Phi$ starts with only one column vector $\phi_0 = [1, \ldots, 1]^{\mathrm{T}}$ of length $N$.

3) Randomly select $M_{\mathrm{out}}$ bases with indices $I_{\mathrm{out}} \subseteq (I_{\mathrm{all}} \setminus I_{\mathrm{sel}})$. Set $I_{\mathrm{can}} \leftarrow I_{\mathrm{out}} \cup I_{\mathrm{sel}}$.

4) For each basis vector $\phi_m$, $m \in I_{\mathrm{can}}$, in the candidate set, compute the values of $s_m$ and $q_m$, find the optimal action with respect to each $\alpha_m$, and then calculate $\nabla_m$, the corresponding change in marginal likelihood $L(\alpha)$ after taking that action. Denoting by $\tilde{\alpha}_m$ the re-estimated value from (8), the following rules are used:

   If $q_m^2 > s_m$ and $\alpha_m < \infty$ (i.e., $\phi_m$ is in the model), re-estimate $\alpha_m$ using (8); $\nabla_m = Q_m^2/\left(S_m + \left[\tilde{\alpha}_m^{-1} - \alpha_m^{-1}\right]^{-1}\right) - \log\left(1 + S_m\left[\tilde{\alpha}_m^{-1} - \alpha_m^{-1}\right]\right)$.

   If $q_m^2 > s_m$ and $\alpha_m = \infty$, add $\phi_m$ to the model, computing $\tilde{\alpha}_m$ using (8); $\nabla_m = (Q_m^2 - S_m)/S_m + \log\left(S_m/Q_m^2\right)$.

   If $q_m^2 \le s_m$ and $\alpha_m < \infty$, delete $\phi_m$ and set $\tilde{\alpha}_m = \infty$; $\nabla_m = Q_m^2/(S_m - \alpha_m) - \log(1 - S_m/\alpha_m)$.

5) Select the basis $m^{*} = \arg\max_m \nabla_m$, take the corresponding action, i.e., $\alpha_{m^{*}} \leftarrow \tilde{\alpha}_{m^{*}}$, and update $\Phi$ and $I_{\mathrm{sel}}$.

6) If convergence is reached, stop; otherwise go to step 2).

The number of bases to be screened for updating $\alpha_m$ is the number of bases in the model plus $M_{\mathrm{out}}$, the predefined number of randomly selected bases from those not used by the model. Although $M_{\mathrm{out}}$ should be chosen empirically with computational efficiency in mind, quite a wide range of values gives satisfactory results; in our experiments, it is fixed at 100. The optimization procedure is considered to have converged when the maximum value of $|\log(\tilde{\alpha}_m/\alpha_m)|$, $m \in I_{\mathrm{can}}$, in step 4) is very small, e.g., lower than $10^{-6}$. A sketch of the step-4 computations follows.
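A hedged sketch of step 4 under the above definitions: for one candidate basis $\phi_m$, compute $S_m$ and $Q_m$ via (10) and (11), convert them to $s_m$ and $q_m$, and return the optimal action together with $\nabla_m$ (constant factors do not affect the $\arg\max$ in step 5). All argument names are illustrative.

```python
import numpy as np

def action_for_basis(phi_m, alpha_m, B, Phi, Sigma, mu, y_hat):
    """Return (new alpha_m, change in marginal likelihood) for one candidate basis.

    Phi, Sigma, mu hold only the bases currently in the model; phi_m is a full
    N-vector; y_hat are the IRLS linearized targets."""
    Bphi = B @ phi_m
    S = phi_m @ Bphi - Bphi @ Phi @ Sigma @ Phi.T @ Bphi      # (10)
    Q = phi_m @ B @ y_hat - Bphi @ Phi @ mu                   # (11)
    if np.isinf(alpha_m):                                     # basis currently excluded
        s, q = S, Q
    else:
        s = alpha_m * S / (alpha_m - S)
        q = alpha_m * Q / (alpha_m - S)
    if q**2 > s and not np.isinf(alpha_m):                    # re-estimate alpha_m
        alpha_new = s**2 / (q**2 - s)                         # (8)
        d = 1.0 / alpha_new - 1.0 / alpha_m
        delta_L = Q**2 / (S + 1.0 / d) - np.log1p(S * d)
    elif q**2 > s:                                            # add the basis
        alpha_new = s**2 / (q**2 - s)                         # (8)
        delta_L = (Q**2 - S) / S + np.log(S / Q**2)
    elif not np.isinf(alpha_m):                               # delete the basis (9)
        alpha_new = np.inf
        delta_L = Q**2 / (S - alpha_m) - np.log1p(-S / alpha_m)
    else:                                                     # keep it excluded
        alpha_new, delta_L = np.inf, 0.0
    return alpha_new, delta_L
```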

However, we should also be aware of the uncertainty involved in the basis function selection, which might result from the existence of multiple solutions and the sensitivity of the algorithm to small perturbations of experimental conditions. Attempts to tackle this problem include, for example, bagging and committee machines. Here, we will focus on the very simple bagging approach, described in Section IV.

III. KERNEL-BASED PROBABILISTIC CLASSIFIERS

SVMs are now a state-of-the-art technique for pattern recognition [16]. A standard SVM classifier takes the form $y(x) = \mathrm{sign}[w_f^{\mathrm{T}}\varphi(x) + b]$ in the feature (primal) space, with $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{d_f}$, where $d_f$ is the dimension of the feature space. It is inferred from data with binary targets $y_i \in \{\pm 1\}$ by solving the following optimization problem:

$$\min_{w_f, b, \xi}\; \mathcal{J}(w_f, b, \xi) = \frac{1}{2} w_f^{\mathrm{T}} w_f + C \sum_{i=1}^{N} \xi_i \tag{12}$$

subject to $y_i(w_f^{\mathrm{T}}\varphi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, N$. This can be conveniently solved in its dual formulation. It turns out that $f(x; w_f, b) = w_f^{\mathrm{T}}\varphi(x) + b = \sum_{i=1}^{N} a_i y_i K(x, x_i) + b$, where $a_i$ is called a support value and $K(\cdot, \cdot)$ is a chosen positive definite kernel. The most common kernels include linear kernels and radial basis function (RBF) kernels. Here, we only considered models with linear kernels, defined as $K(x, z) = x^{\mathrm{T}} z$.

A. Bayesian LS-SVM Classifier

The LS-SVM is a least squares version of the SVM, and is closely related to Gaussian processes and kernel Fisher discriminant analysis [17], [18]. The training procedure for the LS-SVM is reformulated as

$$\min_{w_f, b, e}\; \mathcal{J}(w_f, b, e) = \frac{1}{2} w_f^{\mathrm{T}} w_f + \frac{\lambda}{2} \sum_{i=1}^{N} e_i^2 \tag{13}$$

subject to $y_i[w_f^{\mathrm{T}}\varphi(x_i) + b] = 1 - e_i$, $i = 1, \ldots, N$.

This optimization problem can be transformed into and solved through a linear system in the dual space, instead of the quadratic programming problem of the standard SVM case [18]:

$$\begin{bmatrix} 0 & y^{\mathrm{T}} \\ y & \Omega + \lambda^{-1} I \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix} \tag{14}$$

where $a = [a_1, \ldots, a_N]^{\mathrm{T}}$ and $\mathbf{1} = [1, \ldots, 1]^{\mathrm{T}}$. The matrix $\Omega$ is defined as $\Omega_{ij} = y_i y_j \varphi(x_i)^{\mathrm{T}}\varphi(x_j) = y_i y_j K(x_i, x_j)$.
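A minimal sketch of solving the dual system (14) with a linear kernel; `lam` stands for the regularization parameter $\lambda$, and the latent outcome $f(x)$ follows the dual expansion given above.

```python
import numpy as np

def lssvm_train(X, y, lam):
    """X: N x d inputs; y: labels in {-1, +1}; lam: regularization parameter."""
    N = len(y)
    K = X @ X.T                                   # linear kernel K(x_i, x_j) = x_i^T x_j
    Omega = np.outer(y, y) * K                    # Omega_ij = y_i y_j K(x_i, x_j)
    lhs = np.zeros((N + 1, N + 1))                # assemble the KKT system (14)
    lhs[0, 1:] = y
    lhs[1:, 0] = y
    lhs[1:, 1:] = Omega + np.eye(N) / lam
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(lhs, rhs)
    return sol[0], sol[1:]                        # bias b, support values a

def lssvm_latent(X_train, y_train, a, b, x):
    """Latent outcome f(x) = sum_i a_i y_i K(x, x_i) + b (linear kernel)."""
    return np.sum(a * y_train * (X_train @ x)) + b
```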

In the Bayesian LS-SVM (BayLS-SVM) [18], [19], the LS-SVM is integrated with the evidence framework [14], within which the regularization parameter $\lambda$ is optimized by maximizing the posterior probability of the model. The posterior class probabilities can then be calculated by incorporating the prior class probabilities via Bayes' rule.

B. RVMs for Classification

The RVM is a special case of the SBL model in which the basis functions are given by kernel functions of the same type as for SVMs. The sequential learning algorithm introduced in Section II-B is again applied to the optimization of RVMs. The predicted probability of being positive for a given input $x_*$ can be computed using the logistic function

$$p(y_* = 1 \mid x_*, y, \alpha) = \frac{1}{1 + e^{-w^{\mathrm{T}}\phi(x_*)}}. \tag{15}$$

No simulation results are reported for models with nonlinear kernels in this paper, since linear classifiers perform sufficiently well for our problems and nonlinear models have shown no improvement over the simple linear classifiers, according to our preliminary experiments and other studies on the same datasets [6], [20].

IV. BAGGING STRATEGY

A. Bagging the Selected Variables and Models

Bagging is a bootstrap ensemble method that generates the individuals of its ensemble by training each classifier on a random redistribution of the training set [21]. Each classifier's training set is generated by randomly drawing, with replacement, the same number of examples as in the original training set.

It is shown that the bootstrap mean is approximately a posterior average of a quantity of interest [22]. Suppose a model is fitted to our training set $D$, obtaining the prediction $f(x)$ at input $x$. This prediction could be the latent outcome of, e.g., a standard SVM model, or the predicted class probability of a probabilistic model. Bootstrap aggregation, or bagging, averages this prediction over a collection of bootstrap samples, thereby reducing its variance. For each bootstrap sample $D^{*b}$, $b = 1, 2, \ldots, B$, we fit the model, giving prediction $f^{*b}(x)$. The bagging estimate is defined by

$$f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f^{*b}(x). \tag{16}$$

The final class label is then decided by thresholding the bootstrap estimate of the class probability or the latent outcome. Bagging can push a good but unstable procedure a significant step toward optimality, as has been witnessed both empirically and theoretically [21], [22].

An alternative bagging strategy is to bag only the predicted class labels, with the final prediction given by voting. However, a reliable estimate of the class probability is essential for medical diagnosis, and the prediction-averaging strategy tends to produce bagged estimates with lower variance, especially for small $B$. Therefore, the prediction-averaging strategy is preferred and advocated here.

The bagging strategy for VS and modeling is outlined as follows. Given a training set, $B$ bootstrap datasets are randomly generated with replacement. For each bootstrap training set, one subset of variables is selected via the linear SBL logit model, and these variables are then fed to a model of interest such as the Bayesian LS-SVM. In this way, $B$ subsets of variables are chosen and $B$ corresponding models are built from the $B$ bootstrap training sets. Given an input $x$, the class probability or latent outcome of the bagged binary model is the average of the $B$ class probabilities or latent outcomes. The pseudocode is given in Algorithm 1 and sketched below; $B$ is empirically set to 30 in our experiments.

Algorithm 1: Ensemble Modeling Using LinSBL+Bag for VS.
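The pseudocode figure for Algorithm 1 is not reproduced in this extraction; the following is a minimal Python sketch of the procedure described above. The helper names `select_vars_linsbl` (the linear SBL logit selector) and `fit_classifier` with its `predict_proba` method are hypothetical stand-ins, not from the authors' code.

```python
import numpy as np

def linsbl_bag(X, y, fit_classifier, select_vars_linsbl, B=30, seed=0):
    """Bag variable selection and model building over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    members = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)           # draw N samples with replacement
        Xb, yb = X[idx], y[idx]
        vars_b = select_vars_linsbl(Xb, yb)        # one variable subset per bootstrap
        model_b = fit_classifier(Xb[:, vars_b], yb)
        members.append((vars_b, model_b))

    def predict_proba(x):
        # (16): average the B predicted class probabilities (or latent outcomes)
        return float(np.mean([m.predict_proba(x[v]) for v, m in members]))

    return predict_proba   # threshold at 0.5 for the final class label
```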

B. Strategy for the Multiclass Classification Problems

For multiclass classification problems, we reduce the $k$-class classification problem to $k(k-1)/2$ pairwise binary classification problems, which yield conditional pairwise probability estimates for a given input $x$. The conditional probabilities are then coupled to obtain the joint posterior probability for each class using Hastie's method [23] (a sketch is given below). The final class prediction is the one achieving the highest joint probability. Accordingly, the variables used are the union of the $k(k-1)/2$ sets of variables used in the corresponding binary classifiers. Bagging is applied to each binary classification individually; only the mean predicted probabilities from the bagged binary classifiers are coupled to obtain the final joint posterior probabilities for the multiclass classification problem.
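The coupling step is not spelled out in the text; below is a hedged sketch of the pairwise-coupling iteration of Hastie and Tibshirani [23], assuming equal sample sizes for all pairs. Here `r[i, j]` holds the (bagged) pairwise estimate of P(class i | class i or j).

```python
import numpy as np

def couple_pairwise(r, n_iter=100, tol=1e-8):
    """Couple pairwise probabilities r (k x k, r[j, i] = 1 - r[i, j]) into class posteriors."""
    k = r.shape[0]
    p = np.full(k, 1.0 / k)                       # start from uniform posteriors
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(k):
            mu = np.array([p[i] / (p[i] + p[j]) for j in range(k) if j != i])
            ri = np.array([r[i, j] for j in range(k) if j != i])
            p[i] *= ri.sum() / mu.sum()           # multiplicative update toward r
        p /= p.sum()                              # renormalize after each sweep
        if np.max(np.abs(p - p_old)) < tol:
            break
    return p                                      # predicted class: argmax(p)
```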


V. COMPARED METHODS

In order to see the potential performance gain of our proposed methods, we have also assessed the performance of some reference methods. We denote the proposed VS approach as "LinSBL+Bag" (see Algorithm 1), which bags the variables selected from the linear SBL logit models; accordingly, the model fitting and prediction are "bagged" as well. Its counterpart, "LinSBL," forms a classifier using only one subset of variables selected from a single linear SBL logit model, which is based on the whole training set without bootstrap repetition.

Two other VS methods adopted for comparison involve variable ranking followed by selecting the $N_v$ variables with the highest ranks. One is the popular SVM-based RFE method [6]. The idea of this method is to recursively eliminate the variable that contributes the least to the SVM model, and then rank the variables in the reverse order of their elimination. The contribution of the $m$th variable is evaluated by means of the change in the cost function $\nabla J_m$ caused by removing it. When a linear kernel is used, $\nabla J_m = w_m^2$, with $w_m$ the corresponding weight in the linear SVM model: $w_m = \sum_{i=1}^{N} a_i x_{im} y_i$.

The variables can also be ranked using Fisher's criterion [14], which is a measure of the correlation between a variable and the class labels. For binary classification, the Fisher discriminant criterion for an individual variable is given by $(\mu_{m,+} - \mu_{m,-})^2 / (\sigma_{m,+}^2 + \sigma_{m,-}^2)$, where $\mu_{m,+}$ and $\mu_{m,-}$ are the means of variable $m$ within the positive and negative class, respectively, and $\sigma_{m,+}$ and $\sigma_{m,-}$ are the standard deviations of the variable within each class. The larger the Fisher criterion, the higher the ranking of the variable. A sketch of both ranking criteria follows.
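A minimal sketch of the two reference ranking criteria, assuming a linear kernel and labels in {-1, +1}. The helper `fit_linear_svm`, returning the dual coefficients $a_i$, is a hypothetical stand-in; RFE here eliminates one variable per iteration, while chunked elimination is common in practice.

```python
import numpy as np

def rfe_ranking(X, y, fit_linear_svm):
    """Rank variables by recursive elimination of the smallest w_m^2 (linear SVM)."""
    remaining = list(range(X.shape[1]))
    elimination_order = []
    while remaining:
        a = fit_linear_svm(X[:, remaining], y)    # dual coefficients a_i, length N
        w = (a * y) @ X[:, remaining]             # w_m = sum_i a_i y_i x_im
        worst = int(np.argmin(w**2))              # least-contributing variable
        elimination_order.append(remaining.pop(worst))
    return elimination_order[::-1]                # best-first: last eliminated first

def fisher_scores(X, y):
    """Fisher criterion per variable; larger score means higher rank."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / den
```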

$N_v$ is tuned by tenfold CV using SVMs. A coarse-to-fine strategy is utilized to search for $N_v$ within a range of possible values. The $N_v$ with the lowest tenfold CV error rate is selected, with ties broken by choosing the smallest number of variables. These two VS methods are denoted "RFE+CV" and "Fisher+CV," respectively.

Note that no bagging has been applied for these reference methods. Our preliminary experiments show that the effect of bagging the models is more prominent when the variables vary among different bootstrap datasets; however, it becomes too time-consuming to bootstrap VS with the "Ranking+CV" methods (see Algorithm 2, sketched below).

Algorithm 2: Modeling Using Ranking+CV for VS.
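Algorithm 2's pseudocode figure is likewise not reproduced; the following sketch shows the $N_v$-tuning step over a single grid of candidate values (the paper's coarse-to-fine search refines such a grid in stages). The helpers `rank_variables` and `tenfold_cv_error` (SVM-based) are hypothetical stand-ins.

```python
def choose_nv(X, y, rank_variables, tenfold_cv_error, grid):
    """Pick the smallest Nv from `grid` achieving the lowest tenfold CV error."""
    ranked = rank_variables(X, y)                 # variable indices, best first
    errors = [tenfold_cv_error(X[:, ranked[:nv]], y) for nv in grid]
    best = min(errors)
    nv = min(n for n, e in zip(grid, errors) if e == best)   # tie-break: fewest variables
    return ranked[:nv]                            # fit the final classifier on these
```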

Concerning the modeling techniques, in addition to the advocated probabilistic models, we use the standard linear SVM classifier as a baseline model. We fix the regularization hyperparameter of the SVM to $10^6$, high enough to keep the training error low. Unlike the other probabilistic models, the final SVM classifiers do not naturally generate probability outputs. Hence, for the multiclass classification problem, the final predicted class labels are decided by voting, using the pairwise binary SVM classification results. We also compare the kernel-based models with the decision tree models obtained from C4.5 [24], a classical machine learning algorithm. The output of the bagged C4.5 models is given by voting.

VI. EXPERIMENTS

A. Experimental Settings

The generalization performance of the models in conjunction with VS was evaluated by 30 runs of randomized holdout CV. For each run, a fixed proportion of the data was taken for training and the rest for testing; the splitting of the dataset was random and stratified.

We applied a full CV, where VS was conducted prior to each model fitting process for each realization of the training data. An incomplete CV, i.e., a CV performed after VS, may lead to a serious underestimation of the prediction error [25].

The performance is measured by the mean accuracy (Acc), the mean area under the ROC curve (AUC) [26], and the corresponding standard error (SE) of the mean.

Our Matlab programs used for these experiments were built upon several toolboxes, including SparseBayes V1.00¹ (with modifications) for the sequential SBL, the Spider² for RFE, SVM, and C4.5 modeling, and LS-SVMlab³ for Bayesian LS-SVM modeling.

Note that, in our experiments, all classifiers were tested withthe same series of VS techniques.

B. Binary Cancer Classification Based on Microarray Data

Two benchmark binary cancer classification problems based on DNA microarray data have been considered. The first problem aims to discriminate between two types of leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The dataset includes 72 samples (47 of ALL and 25 of AML) with 7129 gene expression values⁴ [2], [6].

The second problem aims to differentiate tumors from normal tissues using colon cancer data. The dataset contains 40 tumor and 22 normal colon tissue samples; each tissue is associated with 2000 gene expression values⁵ [1], [6].

All microarray data have been normalized to zero mean and unit variance. Each realization of the training set contains 50 data points; the test set includes the rest of the data, of size 22 and 12 for the leukemia and colon cancer data, respectively. These two problems are both linearly separable; however, they both have very high dimensionality and small sample size.

By default, the class priors were set to the proportions of the classes in the training set for the binary classifications. The accuracy and AUC for the leukemia and colon cancer classification problems are reported in Tables I and II, respectively, where the highest value of accuracy or AUC for each type of classifier (in each row) is indicated in bold. The mean number of selected variables ($N_v$) for each VS method within one trial is also given. Note that only the accuracy measure is reported for the C4.5 decision tree classifiers, as the latent outcome needed for ROC analysis is not available for C4.5 models.

TABLE I: TEST RESULTS FOR LEUKEMIA CANCER CLASSIFICATION

TABLE II: TEST RESULTS FOR COLON CANCER CLASSIFICATION

¹http://research.microsoft.com/mlp/RVM/SparseBayesV1.00.tar.gz
²http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html
³http://www.esat.kuleuven.ac.be/sista/lssvmlab/
⁴[Online]. Available: www.genome.wi.mit.edu/MPR/data_set_ALL_AML.html
⁵[Online]. Available: microarray.princeton.edu/oncology/affydata/index.html

We see that RFE+CV and LinSBL selected fewer variables and resulted in consistently lower test performance than did Fisher+CV and LinSBL+Bag. Models using only a small subset of variables selected by Fisher+CV and LinSBL+Bag achieved test results similar to those of the models without VS. Comparing the different C4.5 models in terms of accuracy, the bagged C4.5 performed significantly better than did the single C4.5 models. All the other kernel-based models performed better than these decision trees. Among the kernel-based models, the use of VS matters more for model performance than the choice of any particular type of classifier.

Additionally, the test performances of LinSBL and LinSBL+Bag were compared via paired t-tests for each classifier. The p-values for the comparison on AUC are all < 10⁻⁴ for the leukemia data, and all < 0.015 for the colon data.

C. Classification of Brain Tumors Based on MRS Data

The method has also been applied to a multiclass classification problem of brain tumors using short echo time ¹H MRS data. The dataset consists of 205 spectra in the frequency domain. The full spectrum (a row vector of magnitude values) has been normalized to unit norm. Only the frequency region of interest, from 4.17 to 0 ppm (a measure of the chemical shift on a field-independent frequency scale), was used in this study, corresponding to 138 input variables. The dataset contains the records of four major types of brain tumor: meningiomas (Class 1, 57 spectra), astrocytomas grade II (Class 2, 22 spectra), glioblastomas (87 spectra), and metastases (39 spectra) [20]. However, the last two tumor types are very difficult to distinguish; our experience with this dataset is that trained models did not perform as well as a majority classifier, which assigns the majority class of the training set to all test cases. Therefore, we merged the two tumor types, glioblastomas and metastases, into one class of aggressive tumors (Class 3), and dealt only with the resulting three-class classification problem. For details of the data acquisition and preprocessing procedures for this dataset, readers are referred to [20].

Since the data are unbalanced, a model using the default priors will lead to a relatively low sensitivity for astrocytomas grade II. Thus, we decided to use equal priors for all binary classifiers, which resulted in a "satisfactory" sensitivity and specificity for all three classes. Table III reports the average test AUC for each pairwise binary classification, and Table IV presents both the training and test accuracy for the brain tumor classification problem using equal class priors. Again, in these tables, the highest AUC of each binary classification for each type of model (in each row) is indicated in bold.

For the pairwise binary classifications, observations similar to those for the microarray data can be made. For the three-class brain tumor diagnosis, as reported in Table IV, LinSBL+Bag achieved consistently higher accuracy than did the rest of the VS methods. Also, paired t-tests on accuracy indicate that LinSBL+Bag performed significantly better than did LinSBL, with a p-value < 10⁻⁴ for each model type.

D. Biological Relevance of the Selected Variables

We examined the most frequently selected variables from the LinSBL+Bag method and their biological relevance for each dataset. The heatmaps in Figs. 1 and 2 show the occurrence of such highly selected genes in each randomized CV for the leukemia data and the colon cancer data, respectively.

TABLE III: TEST AUC (%) FOR PAIRWISE BINARY CLASSIFICATION OF BRAIN TUMORS

TABLE IV: TRAINING AND TEST ACCURACY FOR BRAIN TUMOR THREE-CLASS CLASSIFICATION

Fig. 1. Genes selected by LinSBL+Bag from the 30 realizations of the training sets for the leukemia cancer microarray data. The x-axis labels at the bottom give the rank of a gene and, at the top, its index in the original microarray data matrix. The y-axis refers to the run number in the 30 randomized CVs. Only the genes that were selected more than 30 times over all 30 × 30 = 900 linear SBL models are shown. The gray level in each cell corresponds to the number of times a gene was selected in bootstrapping for one realization of the training set.

It is noteworthy that the genes selected by the LinSBL+Bag method are mostly biologically interpretable. In the leukemia cancer classification, the three top-ranked genes identified by our algorithm are all among the informative genes according to [2]. The highest ranked gene is zyxin (Gene 4847, according to its order in the original dataset), which encodes an LIM domain protein localized at focal contacts in adherent erythroleukemia cells. CD33 (Gene 1834) is the differentiation antigen-encoding cell surface protein for which monoclonal antibodies have been demonstrated to be useful in distinguishing lymphoid from myeloid lineage cells.

Fig. 2. Genes selected by LinSBL+Bag from the 30 realizations of the training set for the colon cancer microarray data. See Fig. 1 for details.

In the colon cancer data, the most important gene identified by our method corresponds to the mRNA for the uroguanylin precursor (Gene 377). Guanylin and uroguanylin have recently been found to be linked to colon cancer, and treatment with uroguanylin was found to have possible therapeutic significance [11], [27]. The gene with the second highest rank (Gene 1772) is a collagen alpha 2(XI) chain, which is involved in cell adhesion; collagen-degrading activity is part of the metastatic process for colon carcinoma cells [6], [28].


Fig. 3. Mean spectrum of the brain tumors and the selection rate of the variables using LinSBL+Bag from the 30 runs of CV for the pairwise binary classification. The dotted lines are the mean ± SD spectra.

For the class prediction of brain tumors, we examined the corresponding metabolites whose magnitude values appeared to be significant in the pairwise binary classification problems. Fig. 3 depicts the mean spectra of the three classes, as well as the metabolites associated with their resonance frequencies [29]. The two horizontal bars below each mean spectrum represent the selection rate of each variable in pairwise discrimination with another tumor class. The selection rate for variable m was computed by dividing the total number of selection occurrences by 30 × 30 = 900. We can take the selection rate as a measure of importance for the variables. By examining the figure and incorporating domain knowledge, we were able to identify the metabolites that are important or useful for the classification.

For example, in contrast to the other two classes, the astrocytomas grade II have a relatively high level⁶ in the frequency regions of both total creatine (Cr) and myo-inositol (mI)/glycine (Gly). These variables were also selected most frequently in all three pairwise binary classification problems, particularly for differentiating Class 1 from Class 2. Indeed, in these regions, the selection rate has the darkest color and reaches a value close to 0.5. To discriminate meningiomas from the aggressive class, more frequency regions are used: not only Cr and mI/Gly, but also glutamate (Glu), glutamine (Gln), lipids, and N-acetyl containing macromolecules (NAC). Interestingly, Cr does play a role in the maintenance of energy metabolism, while NAC resonances at the usual N-acetyl aspartate (NAA) chemical shift may appear in the solid or cystic areas of brain tumors.

However, one must be cautious when interpreting the selected variables for such MR spectra: resonances at the same position may originate from different compounds depending on the tumor type. For example, the 2.03 ppm peak originates mostly from lipids in Class 3 tumors, while it is safe to label it NAC for Classes 1 and 2. It may have an NAA contribution, but other N-acetyl compounds contribute varying amounts [30]. The whole region at 2-2.6 ppm may have variable contributions from macromolecules (mostly proteins).

⁶The "level" here refers to relative intensity in the spectra, as they are scaled to unit norm.

VII. DISCUSSION

From both the binary and multiclass examples, we can clearly see how bagging improves on the performance of a single model using only one subset of variables. There is a significant gap in predictive performance between LinSBL and LinSBL+Bag in our experiments. This gap might result from, on the one hand, the large uncertainty due to the small sample size of the training data, and on the other hand, the sensitivity of the SBL algorithm itself.

The results also show that models with only a small portion of the variables can perform as well as, and sometimes even better than, models with the complete set of variables. More importantly, the selected variables can help in interpreting and understanding the underlying mechanisms of the diseases.

As for the modeling techniques, the kernel-based models performed consistently better than did the decision-tree models, and the Bayesian probabilistic models performed somewhat better than did the standard SVMs. This might be partially due to the fact that the hyperparameters of the SVMs were not optimized. Our main focus was on models with probabilistic outputs, which are important in biomedical diagnosis, without the burden of CV for hyperparameter tuning.

It is also worth mentioning that the linear SBL model is itself a probabilistic model with an automatic VS mechanism. It achieved performance similar to that of the probabilistic kernel models in all these experiments. For example, in the brain tumor diagnosis problem, a single LinSBL model and a bagged LinSBL model yielded mean test accuracies of 86.03 ± 0.63% and 89.46 ± 0.52%, respectively.

To get an idea of the computational efficiency of our VS methods, we computed the average CPU time per CV trial consumed by LinSBL+Bag on 30 bootstrap training samples. The simulations were conducted on cluster machines with Pentium processors (1 GHz). Around 7 and 3 min were needed for the leukemia data and the colon data, respectively. For the prediction of brain tumors, 24 min in total were used for all three pairwise binary classification problems.

One limitation of the bagging strategy is that no single model is returned. To deal with this problem, one can adopt an approach similar to that described in [9], in which linear discriminant analysis (LDA) was bootstrapped, to generate an "averaged" classifier using the weighted average of the B sets of model parameters. To do so, a transformation from our linear kernel-based models to variable-based models could be performed, and the selection frequency of the variables should also be taken into account. One direction for future investigation could be to establish a mechanism for integrating multiple models into a single structured model, which would be easier to explain clinically.


VIII. CONCLUSION

The most significant classification problem addressed here lies in the use of datasets with a small sample size and a huge dimensionality. The populations are usually underrepresented in this situation, which can result in a serious bias toward the training set, i.e., a high training performance for a single model after VS and, possibly, a much lower generalization performance on unseen data. This motivates the use of a bagging strategy in order to improve the reliability and lower the uncertainty of both VS and modeling.

Experimental results confirm the advantages of the bagging strategy. Indeed, bagging can enhance the reliability of VS and model prediction, thereby increasing the generalization performance of the models. Unlike the popular variable ranking methods such as RFE and Fisher's criterion, our proposed method requires no additional step to decide on the number of variables to be used in the models. The variables are selected within a Bayesian framework, and the procedure is shown to be computationally efficient when the sample size is small. The number of times a variable is selected can serve as an importance measure for that variable. Our results imply that linear SBL plus bagging deserves to play an important role in VS for biomedical classification tasks.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their detailed and valuable comments. They also gratefully acknowledge the use of the brain tumor data provided by the EU-funded INTERPRET Project IST-1999-10310 (http://carbon.uab.es/INTERPRET).

REFERENCES

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Natl. Acad. Sci. USA, vol. 96, pp. 6745–6750, 1999.

[2] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531–537, 1999.

[3] S. K. Mukherji, Ed., Clinical Applications of Magnetic Resonance Spectroscopy. New York: Wiley-Liss, 1998.

[4] S. J. Nelson, "Multivoxel magnetic resonance spectroscopy of brain tumors," Mol. Cancer Therapeutics, vol. 2, pp. 497–507, 2003.

[5] L. M. Fu and E. S. Youn, "Improving reliability of gene selection from microarray functional genomics data," IEEE Trans. Inf. Technol. Biomed., vol. 7, no. 3, pp. 191–196, Sep. 2003.

[6] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, pp. 389–422, 2002.

[7] J. Weston, A. Elisseeff, M. Tipping, and B. Schölkopf, "Use of the zero norm with linear models and kernel methods," J. Mach. Learn. Res., vol. 3, pp. 1439–1461, 2002.

[8] A. E. Nikulin, B. Dolenko, T. Bezabeh, and R. L. Somorjai, "Near-optimal region selection for feature space reduction: Novel preprocessing methods for classifying MR spectra," NMR Biomed., vol. 11, pp. 209–216, 1998.

[9] R. L. Somorjai, B. Dolenko, A. Nikulin, P. Nickerson, D. Rush, A. Shaw, M. Glogowski, J. Rendell, and R. Deslauriers, "Distinguishing normal from rejecting renal allografts: Application of a three-stage classification strategy," Vibr. Spectrosc., vol. 28, pp. 97–102, 2002.

[10] M. Xiong, X. Fang, and J. Zhao, "Biomarker identification by feature wrappers," Genome Res., vol. 11, pp. 1878–1887, 2001.

[11] Y. Li, C. Campbell, and M. Tipping, "Bayesian automatic relevance determination algorithms for classifying gene expression data," Bioinformatics, vol. 18, pp. 1332–1339, 2002.

[12] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, 2001.

[13] C. M. Bishop and M. E. Tipping, "Bayesian regression and classification," in Advances in Learning Theory: Methods, Models and Applications (NATO Science Series III: Computer and Systems Sciences, vol. 190), J. A. K. Suykens et al., Eds. Amsterdam, The Netherlands: IOS Press, 2003, pp. 267–288.

[14] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford Univ. Press, 1995.

[15] M. E. Tipping and A. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," presented at the 9th Int. Workshop Artif. Intell. Stat., Key West, FL, 2003.

[16] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[17] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999.

[18] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. Singapore: World Scientific, 2002.

[19] T. Van Gestel, J. A. K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, and J. Vandewalle, "A Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis," Neural Comput., vol. 14, pp. 1115–1148, 2002.

[20] A. Devos, L. Lukas, J. A. K. Suykens, L. Vanhamme, A. R. Tate, F. A. Howe, C. Majós, A. Moreno-Torres, M. van der Graaf, C. Arús, and S. Van Huffel, "Classification of brain tumours using short echo time 1H MR spectra," J. Magn. Reson., vol. 173, no. 2, pp. 218–228, 2004.

[21] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, pp. 123–140, 1996.

[22] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.

[23] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," in Advances in Neural Information Processing Systems, vol. 10, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds. Cambridge, MA: MIT Press, 1998.

[24] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.

[25] R. Simon, M. Radmacher, K. Dobbin, and L. McShane, "Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification," J. Natl. Cancer Inst., vol. 95, pp. 14–18, 2003.

[26] J. A. Hanley and B. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, pp. 29–36, 1982.

[27] K. Shailubhai, H. H. Yu, K. Karunanandaa, J. Y. Wang, S. L. Eber, Y. Wang, N. S. Joo, H. D. Kim, B. W. Miedema, S. Z. Abbas, et al., "Uroguanylin treatment suppresses polyp formation in the Apc^{Min/+} mouse and induces apoptosis in human colon adenocarcinoma cells via cyclic GMP," Cancer Res., vol. 60, pp. 5151–5163, 2000.

[28] G. Karakiulakis, C. Papanikolaou, S. M. Jankovic, A. Aletras, E. Papakonstantinou, E. Vretou, and V. Mirtsou-Fidani, "Increased type IV collagen-degrading activity in metastases originating from primary tumors of the human colon," Invasion Metastasis, vol. 17, no. 3, pp. 158–168, 1997.

[29] V. Govindaraju, K. Young, and A. A. Maudsley, "Proton NMR chemical shifts and coupling constants for brain metabolites," NMR Biomed., vol. 13, pp. 129–153, 2000.

[30] A. P. Candiota, C. Majós, A. Bassols, M. E. Cabañas, J. J. Acebes, M. R. Quintero, and C. Arús, "Assignment of the 2.03 ppm resonance in in vivo 1H MRS of human brain tumour cystic fluid: Contribution of macromolecules," MAGMA, vol. 17, pp. 36–46, 2004.

Chuan Lu received the B.E. degree in management information systems from Tianjin University, Tianjin, China, in 1995, and the Master's degree in artificial intelligence and the Ph.D. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 2000 and 2005, respectively.

She is currently a Research Associate in the Computational Biology Group, Department of Computer Science, University of Wales, Aberystwyth, U.K. Her current research interests include statistics and machine learning applied to biology and medicine.


Andy Devos received two Master's degrees, in mathematics and artificial intelligence, and the Ph.D. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1999, 2000, and 2005, respectively.

He is currently with the Department of Electrical Engineering, Katholieke Universiteit Leuven.

Johan A. K. Suykens (SM'04) received the degree in electro-mechanical engineering and the Ph.D. degree in applied sciences from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1989 and 1995, respectively.

He has been a Postdoctoral Researcher with the Fund for Scientific Research (FWO), Flanders, Belgium. During 1996, he was a Visiting Postdoctoral Researcher at the University of California, Berkeley. He is currently a Professor at the Katholieke Universiteit Leuven. His current research interests are mainly in the areas of the theory and application of neural networks and nonlinear systems. He is the author or coauthor of more than 300 papers and three books, and has edited two books.

Dr. Suykens is the recipient of the IEEE Signal Processing Society 1999 Best Paper (Senior) Award and several Best Paper Awards at international conferences. He has also received the International Neural Networks Society (INNS) 2000 Young Investigator Award for significant contributions in the field of neural networks. From 1997 to 1999, he was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART I, and since 2004, of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART II. Since 1998, he has been an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS.

Carles Arús received the B.Sc. degree in biology and the Ph.D. degree in chemistry from the Universitat Autònoma de Barcelona (UAB), Barcelona, Spain, in 1976 and 1981, respectively.

From 1982 to 1985, he was a Postdoctoral Researcher at the University of Illinois, Chicago, and Purdue University, West Lafayette, IN. He is currently a "Catedrático" (Full Professor) in the Departament de Bioquímica i Biologia Molecular, UAB. His current research interests include the field of tumor spectroscopy and the use of ¹H MRS of human brain tumors, biopsies, and cell models for diagnosis, prognosis, and therapy planning. Since 1977, he has published 56 PubMed-accessible articles.

Dr. Arús received the Best Thesis Award from the Faculty of Sciences, UAB, in 1982.

Sabine Van Huffel (SM'99) received two Master's degrees, in computer science engineering and biomedical engineering, and the Ph.D. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1981, 1985, and 1987, respectively.

She is currently a Full Professor in the Department of Electrical Engineering, Katholieke Universiteit Leuven. Her current research interests include signal processing, numerical linear algebra, errors-in-variables regression, system identification, pattern recognition, (non)linear modeling, software, and statistics, applied to biomedicine. She is the author or coauthor of two books, two edited books, more than 150 papers in international journals, four book chapters, and more than 140 conference contributions.