Bayesian Variable Selection in Clustering High-Dimensional Data
Mahlet G. TADESSE, Naijun SHA, and Marina VANNUCCI
Over the last decade, technological advances have generated an explosion of data with substantially smaller sample size relative to the number of covariates ($p \gg n$). A common goal in the analysis of such data involves uncovering the group structure of the observations and identifying the discriminating variables. In this article we propose a methodology for addressing these problems simultaneously. Given a set of variables, we formulate the clustering problem in terms of a multivariate normal mixture model with an unknown number of components and use the reversible-jump Markov chain Monte Carlo technique to define a sampler that moves between different dimensional spaces. We handle the problem of selecting a few predictors among the prohibitively vast number of variable subsets by introducing a binary exclusion/inclusion latent vector, which gets updated via stochastic search techniques. We specify conjugate priors and exploit the conjugacy by integrating out some of the parameters. We describe strategies for posterior inference and explore the performance of the methodology with simulated and real datasets.
KEY WORDS: Bayesian variable selection; Bayesian clustering; Label switching; Reversible-jump Markov chain Monte Carlo.
1 INTRODUCTION
Over the last decade, technological advances have generated an explosion of data for which the number of covariates is considerably larger than the sample size. These data pose a challenge to standard statistical methods and have revived a strong interest in clustering algorithms. The goals are to uncover the group structure of the observations and to determine the discriminating variables. The analysis of DNA microarray datasets is a typical high-dimensional example where variable selection and outcome prediction have become a major focus of research. The technology provides an automated method for simultaneously quantifying thousands of genes, but its high cost constrains researchers to a few experimental units. The discovery of different types of tissues or subtypes of a disease, and the identification of genes that best distinguish them, is believed to provide a better understanding of the underlying biological processes. This in turn could lead to better treatment choices for patients in different risk groups, and the selected genes can serve as biomarkers to improve diagnosis and therapeutic intervention.
When dealing with high-dimensional datasets, the cluster structure of the observations is often confined to a small subset of variables. As pointed out by several authors (e.g., Fowlkes, Gnanadesikan, and Kettering 1988; Milligan 1989; Gnanadesikan, Kettering, and Tao 1995; Brusco and Cradit 2001), the inclusion of unnecessary covariates could complicate or even mask the recovery of the clusters. Common approaches to mitigating the effect of noisy variables or identifying those that define true cluster structure involve differentially weighting the covariates or selecting the discriminating ones. Gnanadesikan et al. (1995) showed that variable weighting schemes were often outperformed by variable selection
Mahlet G. Tadesse is Assistant Professor, Department of Biostatistics, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: mtadesse@cceb.upenn.edu). Naijun Sha is Assistant Professor, Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX 79968 (E-mail: nsha@utep.edu). Marina Vannucci is Associate Professor, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: mvannucci@stat.tamu.edu). Tadesse's research is supported by grant CA90301 from the National Cancer Institute. Sha's work is supported in part by BBRC/RCMI National Institute of Health grant 2G12RR08124 and a University Research Institute (UTEP) grant, and Vannucci's research is supported by National Science Foundation CAREER award number DMS-00-93208. The authors thank the associate editor and two referees for their thorough reading of the manuscript and their input, which have improved the article considerably.
procedures. In addition to improving the prediction of cluster membership, the variable selection task reduces the measurement and storage requirements for future samples, thereby providing more cost-effective predictors.
In this article we focus on the analysis of high-dimensional datasets characterized by a sample size $n$ that is substantially smaller than the number of covariates $p$. We propose a Bayesian approach for simultaneously selecting discriminating variables and uncovering the cluster structure of the observations. The article is organized as follows. In Section 2 we give a brief review of existing procedures and discuss their shortcomings. In Section 3 we describe our model specification, which makes use of model-based clustering and introduces latent indicators to identify discriminating variables. In Section 4 we present the prior assumptions and the resulting full conditionals; we also provide guidelines for choosing the hyperparameters. In Section 5 we discuss in detail the Markov chain Monte Carlo (MCMC) updates for each of the model parameters, and in Section 6 we address the issue of label switching and describe the inference mechanism. In Section 7 we assess the performance of our methodology using various datasets. We conclude the article with a brief discussion in Section 8.
2 REVIEW OF EXISTING METHODS
The most widely used approaches separate the variable selection/weighting and clustering tasks. One common variable selection approach involves fitting univariate models on each covariate and selecting a small subset that passes some threshold for significance (McLachlan, Bean, and Peel 2002). Another standard approach uses dimension-reduction techniques, such as principal component analysis, and focuses on the leading components (Ghosh and Chinnaiyan 2002). The reduced data are then clustered using the procedure of choice. These selection steps are suboptimal. The first approach does not assess the joint effect of multiple variables and could throw away potentially valuable covariates, which are not predictive individually but may provide significant improvement in conjunction with others. The second approach results in linear combinations and does not allow evaluation of the original
© 2005 American Statistical Association
Journal of the American Statistical Association
June 2005, Vol. 100, No. 470, Theory and Methods
DOI 10.1198/016214504000001565
variables. In addition, it has been shown that the leading components do not necessarily contain the most information about the cluster structure (Chang 1983).
The literature includes several procedures that combine the variable selection and clustering tasks. Fowlkes et al. (1988) used a forward selection approach in the context of complete linkage hierarchical clustering. The variables are added using information on the between-cluster and total sums of squares, and their significance is judged based on graphical information; the authors characterized this as an "informal" assessment. Brusco and Cradit (2001) proposed a variable selection heuristic for k-means clustering using a similar forward selection procedure. Their approach uses the adjusted Rand index to measure cluster recovery. Recently, Friedman and Meulman (2003) proposed a hierarchical clustering procedure that uncovers cluster structure on separate subsets of variables. The algorithm does not explicitly select variables, but rather assigns them different weights, which can be used to extract relevant covariates. This approach is somewhat different from ours in that it detects subgroups of observations that preferentially cluster on different subsets of variables, rather than clustering on the same subset of variables. These methods all work in conjunction with k-means or hierarchical clustering algorithms, which have several limitations. One major shortcoming is the lack of a statistical criterion for assessing the number of clusters. These procedures also do not provide a measure of the uncertainty associated with the sample allocations. In addition, with respect to variable selection, the existing approaches use greedy deterministic search procedures, which can get stuck at local minima. They also have the limitation of presuming the existence of a single "best" subset of clustering variables. In practice, however, there may be several equally good subsets that define the true cluster structure.
The Bayesian paradigm offers a coherent framework for addressing the problems of variable selection and clustering simultaneously. Recent developments in MCMC techniques (Gilks, Richardson, and Spiegelhalter 1996) have led to substantial research in both areas. Bayesian variable selection methods so far have been developed mainly in the context of regression analysis. (See George and McCulloch 1997 for a review of prior specifications and MCMC implementations, and Brown, Vannucci, and Fearn 1998b for extensions to the case of multivariate responses.) The Bayesian clustering approach uses a mixture of probability distributions, each distribution representing a different cluster. Diebolt and Robert (1994) presented a comprehensive treatment of MCMC strategies when the number of mixture components is known. For the general case of an unknown number of components, Richardson and Green (1997) successfully applied the reversible-jump MCMC (RJMCMC) technique, but considered only univariate data. Stephens (2000a) proposed an approach based on continuous-time Markov birth–death processes and applied it to a bivariate setting.
We propose a method that simultaneously selects the discriminating variables and clusters the samples into $G$ groups, where $G$ is unknown. The computational burden of searching through all $2^p$ possible variable subsets is handled through the introduction of a latent $p$-vector with binary entries. We use a stochastic search method to explore the space of possible values for this latent vector. The clustering problem is formulated
in terms of a multivariate mixture model with an unknown number of components. We work with a marginalized likelihood, where some of the model parameters are integrated out, and adapt the RJMCMC technique of Richardson and Green (1997) to the multivariate setting. Our inferential procedure provides selection of discriminating variables, estimates of the number of clusters and the sample allocations, and class prediction for future observations.
3 MODEL SPECIFICATION
3.1 Clustering via Mixture Models
The goal of cluster analysis is to partition a collection of observations into homogeneous groups. Model-based clustering has received great attention recently and has provided promising results in various applications. In this approach the data are viewed as coming from a mixture of distributions, each distribution representing a different cluster. Let $X = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be independent $p$-dimensional observations from $G$ populations. The problem of clustering the $n$ samples can be formulated in terms of a mixture of $G$ underlying probability distributions,
$$f(\mathbf{x}_i \mid w, \theta) = \sum_{k=1}^{G} w_k\, f(\mathbf{x}_i \mid \theta_k), \qquad (1)$$

where $f(\mathbf{x}_i \mid \theta_k)$ is the density of an observation $\mathbf{x}_i$ from the $k$th component and $w = (w_1, \ldots, w_G)^T$ are the component weights ($w_k \ge 0$, $\sum_{k=1}^{G} w_k = 1$).
To identify the cluster from which each observation is drawn, latent variables $y = (y_1, \ldots, y_n)^T$ are introduced, where $y_i = k$ if the $i$th observation comes from component $k$. The $y_i$'s are assumed to be independently and identically distributed with probability mass function $p(y_i = k) = w_k$. We consider the case where $f(\mathbf{x}_i \mid \theta_k)$ is a multivariate normal density with parameters $\theta_k = (\mu_k, \Sigma_k)$. Thus, for sample $i$, we have

$$\mathbf{x}_i \mid y_i = k, w, \theta \sim \mathcal{N}(\mu_k, \Sigma_k). \qquad (2)$$
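As an aside, the generative process in (1) and (2) is straightforward to simulate. The sketch below draws allocations $y_i$ and observations $\mathbf{x}_i$; the values of $G$, $w$, $\mu_k$, and $\Sigma_k$ are made-up illustrations, not quantities from the paper.

```python
import numpy as np

# Illustrative simulation of the mixture model in (1)-(2); G, w, mu, and
# Sigma are arbitrary example values, not taken from the paper.
rng = np.random.default_rng(0)

G = 3
w = np.array([0.5, 0.3, 0.2])                    # component weights (sum to 1)
p = 2
mu = np.array([[0., 0.], [4., 4.], [-4., 4.]])   # component means mu_k
Sigma = np.stack([np.eye(p)] * G)                # component covariances Sigma_k

n = 100
y = rng.choice(G, size=n, p=w)                   # latent allocations, P(y_i = k) = w_k
X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in y])
```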
3.2 Identifying Discriminating Variables
In high-dimensional data it is often the case that a large number of variables provide very little, if any, information about the group structure of the observations. Their inclusion could be detrimental, because it might obscure the recovery of the true cluster structure. To identify the relevant variables, we introduce a latent $p$-vector $\gamma$ with binary entries: $\gamma_j$ takes value 1 if the $j$th variable defines a mixture distribution for the data, and 0 otherwise. We use $(\gamma)$ and $(\gamma^c)$ to index the discriminating variables and those that favor a single multivariate normal density. Notice that this approach is different from what is commonly done in Bayesian variable selection for regression models, where the latent indicator is used to induce mixture priors on the regression coefficients. The likelihood function is then given by
$$\begin{aligned}
L(G, \gamma, w, \mu, \Sigma, \eta, \Omega \mid X, y)
&= (2\pi)^{-(p-p_\gamma)n/2}\, \bigl|\Omega_{(\gamma^c)}\bigr|^{-n/2} \\
&\quad\times \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} \bigl(\mathbf{x}_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^T \Omega_{(\gamma^c)}^{-1} \bigl(\mathbf{x}_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr) \right\} \\
&\quad\times \prod_{k=1}^{G} (2\pi)^{-p_\gamma n_k/2}\, \bigl|\Sigma_{k(\gamma)}\bigr|^{-n_k/2}\, w_k^{n_k} \\
&\quad\times \exp\left\{ -\frac{1}{2} \sum_{\mathbf{x}_i \in C_k} \bigl(\mathbf{x}_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^T \Sigma_{k(\gamma)}^{-1} \bigl(\mathbf{x}_{i(\gamma)} - \mu_{k(\gamma)}\bigr) \right\},
\end{aligned} \qquad (3)$$

where $(\eta_{(\gamma^c)}, \Omega_{(\gamma^c)})$ denote the mean and covariance parameters for the nondiscriminating variables, $(\mu_{k(\gamma)}, \Sigma_{k(\gamma)})$ are the mean and covariance parameters of cluster $k$, $C_k = \{\mathbf{x}_i \mid y_i = k\}$ with cardinality $n_k$, and $p_\gamma = \sum_{j=1}^{p} \gamma_j$.
4 PRIOR SETTING AND FULL CONDITIONALS
4.1 Prior Formulation and Choice of Hyperparameters
For the variable selection indicator $\gamma$, we assume that its elements $\gamma_j$ are independent Bernoulli random variables,

$$p(\gamma) = \prod_{j=1}^{p} \varphi^{\gamma_j} (1 - \varphi)^{1-\gamma_j}. \qquad (4)$$
The number of covariates included in the model, $p_\gamma$, thus follows a binomial distribution, and $\varphi$ can be elicited as the proportion of variables expected a priori to discriminate the different groups, $\varphi = p_{\text{prior}}/p$. This prior assumption can be relaxed by formulating a beta$(a, b)$ hyperprior on $\varphi$, which leads to a beta-binomial prior for $p_\gamma$ with expectation $p\frac{a}{a+b}$. A vague prior can be elicited by setting $a + b = 2$, which leads to setting $a = 2p_{\text{prior}}/p$ and $b = 2 - a$.
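The elicitation above amounts to simple arithmetic; the following sketch uses illustrative values of $p$ and $p_{\text{prior}}$, not those from the paper's examples.

```python
# Sketch of the hyperparameter elicitation described above; p and p_prior
# are illustrative values, not those used in the paper's examples.
p = 1000        # total number of covariates
p_prior = 20    # number of variables expected a priori to discriminate

phi = p_prior / p        # Bernoulli probability in (4)
a = 2 * p_prior / p      # vague beta(a, b) hyperprior with a + b = 2
b = 2 - a

# the beta-binomial expectation p * a / (a + b) recovers the prior guess
assert abs(p * a / (a + b) - p_prior) < 1e-9
```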
Natural priors for the number of components $G$ are a truncated Poisson,

$$P(G = g) = \frac{e^{-\lambda}\lambda^g / g!}{1 - \bigl(e^{-\lambda}(1+\lambda) + \sum_{j=G_{\max}+1}^{\infty} e^{-\lambda}\lambda^j / j!\bigr)}, \qquad g = 2, \ldots, G_{\max}, \qquad (5)$$

or a discrete uniform on $[2, G_{\max}]$, $P(G = g) = \frac{1}{G_{\max}-1}$, where $G_{\max}$ is chosen arbitrarily large. For the vector of component weights $w$, we specify a symmetric Dirichlet prior, $w \mid G \sim \text{Dirichlet}(\alpha, \ldots, \alpha)$.
Updating the mean and covariance parameters is somewhat intricate. When deleting or creating components, the corresponding appropriate changes must be made to the component parameters. However, it is not clear whether adequate proposals can be constructed for the reversible jump in the multivariate setting. Even if this difficulty can be overcome, as $\gamma$ gets updated, the dimensions of $\mu_{k(\gamma)}$, $\Sigma_{k(\gamma)}$, $\eta_{(\gamma^c)}$, and $\Omega_{(\gamma^c)}$ change, requiring another sampler that moves between different dimensional spaces. The algorithm is much more efficient if we can integrate out these parameters. This also helps substantially accelerate model fitting, because the set of parameters that need to be updated then consists only of $(\gamma, w, y, G)$. The integration can be facilitated by taking conjugate priors,
$$\begin{aligned}
\mu_{k(\gamma)} \mid \Sigma_{k(\gamma)}, G &\sim N\bigl(\mu_{0(\gamma)},\, h_1 \Sigma_{k(\gamma)}\bigr), \\
\eta_{(\gamma^c)} \mid \Omega_{(\gamma^c)} &\sim N\bigl(\mu_{0(\gamma^c)},\, h_0 \Omega_{(\gamma^c)}\bigr), \\
\Sigma_{k(\gamma)} \mid G &\sim \text{IW}\bigl(\delta, Q_{1(\gamma)}\bigr), \\
\Omega_{(\gamma^c)} &\sim \text{IW}\bigl(\delta, Q_{0(\gamma^c)}\bigr),
\end{aligned} \qquad (6)$$
where $\text{IW}(\delta, Q_{1(\gamma)})$ indicates the inverse-Wishart distribution with dimension $p_\gamma$, shape parameter $\delta = n - p_\gamma + 1$, $n$ degrees of freedom, and mean $Q_{1(\gamma)}/(\delta - 2)$ (Brown 1993). Weak prior information is specified by taking $\delta$ small; we set $\delta = 3$, the smallest integer such that the expectations of the covariance matrices are defined. We take $Q_1 = \kappa_1 I_{p\times p}$ and $Q_0 = \kappa_0 I_{p\times p}$. Some care in the choice of $\kappa_1$ and $\kappa_0$ is needed. These hyperparameters need to be in the range of variability of the data; at the same time, we do not want the prior information to overwhelm the data and drive the inference. We experimented with different values and found that values commensurate with the order of magnitude of the eigenvalues of $\text{cov}(X)$, where $X$ is the full data matrix, lead to reasonable results. We suggest setting $\kappa_1$ between 1% and 10% of the upper decile of the $n-1$ nonzero eigenvalues and $\kappa_0$ proportional to their lower decile.
For the mean parameters, we take the priors to be fairly flat over the region where the data are defined. Each element of $\mu_0$ is set to the corresponding covariate interval midpoint, and $h_0$ and $h_1$ are chosen arbitrarily large. This prevents the specification of priors that do not overlap with the likelihood and allows for mixtures with widely different component means. We found that values of $h_0$ and $h_1$ between 10 and 1,000 performed reasonably well.
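The guidelines above can be sketched as follows; the data matrix, the 5% multiplier for $\kappa_1$, and the value $h_0 = h_1 = 100$ are illustrative assumptions within the suggested ranges.

```python
import numpy as np

# Sketch of the guidelines above: kappa_1 from the upper decile and kappa_0
# from the lower decile of the n-1 nonzero eigenvalues of cov(X); mu_0 at
# the covariate interval midpoints.  The data, the 5% multiplier, and
# h0 = h1 = 100 are illustrative assumptions within the suggested ranges.
rng = np.random.default_rng(1)
n, p = 20, 50
X = rng.normal(size=(n, p))

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
nonzero = np.sort(eigvals[eigvals > 1e-10])     # at most n - 1 nonzero eigenvalues

kappa1 = 0.05 * np.quantile(nonzero, 0.9)       # e.g. 5% of the upper decile
kappa0 = np.quantile(nonzero, 0.1)              # proportional to the lower decile
mu0 = (X.min(axis=0) + X.max(axis=0)) / 2       # covariate interval midpoints
h0 = h1 = 100                                   # arbitrarily large, fairly flat priors
```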
4.2 Full Conditionals
As we stated earlier, the parameters $\mu$, $\Sigma$, $\eta$, and $\Omega$ are integrated out, and we need to sample only from the joint posterior of $(G, y, \gamma, w)$. We have

$$f(X, y \mid G, w, \gamma) = \int f(X, y \mid \gamma, G, w, \mu, \Sigma, \eta, \Omega)\, f(\mu \mid \Sigma, G, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega. \qquad (7)$$
For completeness, we provide the full conditionals under the assumptions of equal and unequal covariances across clusters. However, in practice, unless there is prior knowledge that favors the former, we recommend working with heterogeneous covariances. After some algebra (see the Appendix), we get the following.
1. Under the assumption of homogeneous covariance,

$$\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2}\, K_{(\gamma)} \cdot \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \left|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\right|^{-(n+\delta+p_\gamma-1)/2} \\
&\quad\times H_{(\gamma^c)} \cdot \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(n+\delta+p-p_\gamma-1)/2},
\end{aligned} \qquad (8)$$

where

$$K_{(\gamma)} = \left[\prod_{k=1}^{G} w_k^{n_k} (h_1 n_k + 1)^{-p_\gamma/2}\right] \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{1}{2}(n+\delta+p_\gamma-j)\bigr)}{\Gamma\bigl(\frac{1}{2}(\delta+p_\gamma-j)\bigr)},$$

$$H_{(\gamma^c)} = (h_0 n + 1)^{-(p-p_\gamma)/2} \prod_{j=1}^{p-p_\gamma} \frac{\Gamma\bigl(\frac{1}{2}(n+\delta+p-p_\gamma-j)\bigr)}{\Gamma\bigl(\frac{1}{2}(\delta+p-p_\gamma-j)\bigr)},$$

$$S_{k(\gamma)} = \sum_{\mathbf{x}_{i(\gamma)} \in C_k} \bigl(\mathbf{x}_{i(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)\bigl(\mathbf{x}_{i(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)^T + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)^T,$$

and

$$S_{0(\gamma^c)} = \sum_{i=1}^{n} \bigl(\mathbf{x}_{i(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)\bigl(\mathbf{x}_{i(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)^T + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)^T.$$

Here $\bar{\mathbf{x}}_{k(\gamma)}$ is the sample mean of cluster $k$, and $\bar{\mathbf{x}}_{(\gamma^c)}$ corresponds to the sample means of the nondiscriminating variables. As $h_1 \to \infty$, $S_{k(\gamma)}$ reduces to the within-group sum of squares for the $k$th cluster, and $\sum_{k=1}^{G} S_{k(\gamma)}$ becomes the total within-group sum of squares.

2. Assuming heterogeneous covariances across clusters,
$$\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \prod_{k=1}^{G} \Bigl[ K_{k(\gamma)} \cdot \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(n_k+\delta+p_\gamma-1)/2} \Bigr] \\
&\quad\times H_{(\gamma^c)} \cdot \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(n+\delta+p-p_\gamma-1)/2},
\end{aligned} \qquad (9)$$

where $K_{k(\gamma)} = w_k^{n_k} (h_1 n_k + 1)^{-p_\gamma/2} \prod_{j=1}^{p_\gamma} \frac{\Gamma(\frac{1}{2}(n_k+\delta+p_\gamma-j))}{\Gamma(\frac{1}{2}(\delta+p_\gamma-j))}$.
The full conditionals for the model parameters are then given by

$$f(y \mid G, w, \gamma, X) \propto f(X, y \mid G, w, \gamma), \qquad (10)$$

$$f(\gamma \mid G, w, y, X) \propto f(X, y \mid G, w, \gamma)\, p(\gamma \mid G), \qquad (11)$$

and

$$w \mid G, \gamma, y, X \sim \text{Dirichlet}(\alpha + n_1, \ldots, \alpha + n_G). \qquad (12)$$
5 MODEL FITTING
Posterior samples for the parameters of interest are obtained using Metropolis moves and RJMCMC embedded within a Gibbs sampler. Our MCMC procedure comprises the following five steps:

1. Update $\gamma$ from its full conditional in (11).
2. Update $w$ from its full conditional in (12).
3. Update $y$ from its full conditional in (10).
4. Split one mixture component into two, or merge two into one.
5. Birth or death of an empty component.

The birth/death moves specifically deal with creating/deleting empty components and do not involve reallocation of the observations. We could have extended these to include nonempty components and removed the split/merge moves. However, as described by Richardson and Green (1997), including both steps provides more efficient moves and improves convergence.
We would like to point out that as we iterate through these steps, the cluster structure evolves with the choice of variables. This makes the problem of variable selection in the context of clustering much more complicated than in classification or regression analysis. In classification, the group structure of the observations is prespecified in the training data and guides the selection; here, however, the analysis is fully data-based. In addition, in the context of clustering, including unnecessary variables can have severe implications, because it may obscure the true grouping of the observations.
5.1 Variable Selection Update
In step 1 we update $\gamma$ using a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied in regression by Brown et al. (1998a), among others, and recently by Sha et al. (2004) in multinomial probit models for classification. In this approach a new candidate $\gamma^{\text{new}}$ is generated by randomly choosing one of the following two transition moves:

1. Add/delete: Randomly choose one of the $p$ indices in $\gamma^{\text{old}}$ and change its value, from 0 to 1 or from 1 to 0, to become $\gamma^{\text{new}}$.
2. Swap: Choose independently and at random a 0 and a 1 in $\gamma^{\text{old}}$ and switch their values to get $\gamma^{\text{new}}$.

The new candidate $\gamma^{\text{new}}$ is accepted with probability

$$\min\left\{1, \frac{f(\gamma^{\text{new}} \mid X, y, w, G)}{f(\gamma^{\text{old}} \mid X, y, w, G)}\right\}.$$
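A minimal sketch of this Metropolis update follows; here `log_post` is a hypothetical placeholder standing in for the log of the unnormalized full conditional $f(\gamma \mid X, y, w, G)$, which in the actual algorithm comes from (11).

```python
import math
import random

# Sketch of the add/delete/swap Metropolis search described above.
# `log_post` is a hypothetical placeholder for log f(gamma | X, y, w, G).

def propose_gamma(gamma):
    """Candidate gamma_new via a random add/delete or swap transition move."""
    gamma_new = list(gamma)
    ones = [j for j, g in enumerate(gamma) if g == 1]
    zeros = [j for j, g in enumerate(gamma) if g == 0]
    if random.random() < 0.5 or not ones or not zeros:
        j = random.randrange(len(gamma))        # add/delete: flip one entry
        gamma_new[j] = 1 - gamma_new[j]
    else:                                       # swap: exchange a 0 and a 1
        gamma_new[random.choice(ones)] = 0
        gamma_new[random.choice(zeros)] = 1
    return gamma_new

def metropolis_step(gamma, log_post):
    """Accept gamma_new with probability min{1, f(new | ...) / f(old | ...)}."""
    gamma_new = propose_gamma(gamma)
    if math.log(random.random()) < log_post(gamma_new) - log_post(gamma):
        return gamma_new
    return gamma
```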
5.2 Update of Weights and Cluster Allocation
In step 2 we use a Gibbs move to sample $w$ from its full conditional in (12). This can be done by drawing independent gamma random variables with common scale and shape parameters $\alpha + n_1, \ldots, \alpha + n_G$, then scaling them to sum to 1.
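This standard construction can be sketched as follows, with illustrative values for $\alpha$ and the cluster counts $n_k$:

```python
import numpy as np

# Sketch of the Gibbs update for w: independent gamma draws with shapes
# alpha + n_k and common scale, normalized to sum to 1.  alpha and the
# cluster counts n_k below are illustrative values.
rng = np.random.default_rng(2)
alpha = 1.0
nk = np.array([12, 30, 8])                # cluster counts n_1, ..., n_G

g = rng.gamma(shape=alpha + nk, scale=1.0)
w = g / g.sum()                           # w ~ Dirichlet(alpha + n_1, ..., alpha + n_G)
```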
In step 3 we update the allocation vector $y$ one element at a time via a sub-Gibbs sampling strategy. The full conditional probability that the $i$th observation is in the $k$th cluster is given by

$$f\bigl(y_i = k \mid X, y_{(-i)}, \gamma, w, G\bigr) \propto f\bigl(X, y_i = k, y_{(-i)} \mid G, w, \gamma\bigr), \qquad (13)$$

where $y_{(-i)}$ is the cluster assignment for all observations except the $i$th one.
5.3 Reversible-Jump MCMC
The last two steps, the split/merge and birth/death moves, cause the creation or deletion of components and consequently require a sampler that jumps between different dimensional spaces. A popular method for defining such a sampler is RJMCMC (Green 1995; Richardson and Green 1997), which extends the Metropolis–Hastings algorithm to general state spaces. Suppose that a move of type $m$, from a countable family of move types, is proposed from $\psi$ to a higher-dimensional space $\psi'$. This will very often be implemented by drawing a vector of continuous random variables $u$, independent of $\psi$, and setting $\psi'$ to be a deterministic and invertible function of $\psi$ and $u$. The reverse of this move (from $\psi'$ to $\psi$) can be accomplished by using the inverse transformation, so that in this
direction the proposal is deterministic. The move is then accepted with probability

$$\min\left\{1,\; \frac{f(\psi' \mid X)\, r_m(\psi')}{f(\psi \mid X)\, r_m(\psi)\, q(u)} \left|\frac{\partial \psi'}{\partial(\psi, u)}\right|\right\}, \qquad (14)$$

where $r_m(\psi)$ is the probability of choosing move type $m$ when in state $\psi$ and $q(u)$ is the density function of $u$. The final term in the foregoing ratio is a Jacobian arising from the change of variable from $(\psi, u)$ to $\psi'$.
5.3.1 Split and Merge Moves. In step 4 we make a random choice between attempting to split a nonempty cluster or combine two clusters, with probabilities $b_k = .5$ and $d_k = 1 - b_k$ ($k = 2, \ldots, G_{\max} - 1$, and $b_{G_{\max}} = 0$).

The split move begins by randomly selecting a nonempty cluster $l$ and dividing it into two components, $l_1$ and $l_2$, with weights $w_{l_1} = w_l u$ and $w_{l_2} = w_l(1 - u)$, where $u$ is a beta(2, 2) random variable. The parameters $\psi = (G, w^{\text{old}}, \gamma, y^{\text{old}})$ are updated to $\psi' = (G + 1, w^{\text{new}}, \gamma, y^{\text{new}})$. The acceptance probability for this move is given by $\min(1, A)$, where

$$\begin{aligned}
A &= \frac{f(G+1, w^{\text{new}}, \gamma, y^{\text{new}} \mid X)}{f(G, w^{\text{old}}, \gamma, y^{\text{old}} \mid X)} \times \frac{q(\psi \mid \psi')}{q(\psi' \mid \psi) \times f(u)} \times \left|\frac{\partial(w_{l_1}, w_{l_2})}{\partial(w_l, u)}\right| \\
&= \frac{f(X, y^{\text{new}} \mid w^{\text{new}}, \gamma, G+1) \times f(w^{\text{new}} \mid G+1) \times f(G+1)}{f(X, y^{\text{old}} \mid w^{\text{old}}, \gamma, G) \times f(w^{\text{old}} \mid G) \times f(G)} \\
&\quad\times \frac{d_{G+1} \times P_{\text{chos}}}{b_G \times G_1^{-1} \times P_{\text{alloc}} \times f(u)} \times w_l,
\end{aligned} \qquad (15)$$

with

$$\frac{f(w^{\text{new}} \mid G+1)}{f(w^{\text{old}} \mid G)} = \frac{w_{l_1}^{\alpha-1}\, w_{l_2}^{\alpha-1}}{w_l^{\alpha-1}\, B(\alpha, G\alpha)}$$

and

$$\frac{f(G+1)}{f(G)} = \begin{cases} 1 & \text{if } G \sim \text{uniform}\,[2, G_{\max}], \\ \dfrac{\lambda}{G+1} & \text{if } G \sim \text{truncated Poisson}(\lambda). \end{cases}$$
Here $G_1$ is the number of nonempty components before the split, and $P_{\text{alloc}}$ is the probability that the particular allocation is made; it is set to $u^{n_{l_1}}(1-u)^{n_{l_2}}$, where $n_{l_1}$ and $n_{l_2}$ are the numbers of observations assigned to $l_1$ and $l_2$. Note that Richardson and Green (1997) calculated $P_{\text{alloc}}$ using the full conditionals of the allocation variables, $P(y_i = k \mid \text{rest})$. In our case, because the component parameters $\mu$ and $\Sigma$ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. $P_{\text{chos}}$ is the probability of selecting the two components for the reverse move. Obviously, we want to combine clusters that are closest to one another. However, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:
a. The split results in two mutually adjacent components $\Rightarrow P_{\text{chos}} = \frac{G^*_1}{G^*} \times \frac{2}{G^*_1}$.

b. The split results in components that are not mutually adjacent, that is, $l_1$ has $l_2$ as its closest component, but there is another cluster that is closer to $l_2$ than $l_1$ $\Rightarrow P_{\text{chos}} = \frac{G^*_1}{G^*} \times \frac{1}{G^*_1}$.

c. One of the resulting components is empty $\Rightarrow P_{\text{chos}} = \frac{G^*_0}{G^*} \times \frac{1}{G^*_0} \times \frac{1}{G^*_1}$.

Here $G^*$ is the total number of clusters after the split, and $G^*_0$ and $G^*_1$ are the numbers of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
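The weight transformation used by the split move can be sketched directly; the value of $w_l$ below is illustrative, and the Jacobian of $(w_l, u) \rightarrow (w_{l_1}, w_{l_2})$ is the factor $w_l$ appearing in (15).

```python
import numpy as np

# Sketch of the split-move weight proposal: u ~ beta(2, 2),
# w_l1 = w_l * u, w_l2 = w_l * (1 - u).  The Jacobian of
# (w_l, u) -> (w_l1, w_l2) is w_l, the factor appearing in (15).
# The value of w_l is illustrative.
rng = np.random.default_rng(3)
w_l = 0.4
u = rng.beta(2, 2)
w_l1, w_l2 = w_l * u, w_l * (1 - u)
jacobian = w_l                            # |det| of [[u, w_l], [1-u, -w_l]]
```

The merge move simply inverts this map, recovering $w_l = w_{l_1} + w_{l_2}$ and $u = w_{l_1}/w_l$.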
The merge move begins by randomly choosing two components, $l_1$ and $l_2$, to be combined into a single cluster $l$. The updated parameters are $\psi = (G, w^{\text{old}}, \gamma, y^{\text{old}}) \rightarrow \psi' = (G - 1, w^{\text{new}}, \gamma, y^{\text{new}})$, where $w_l = w_{l_1} + w_{l_2}$. The acceptance probability for this move is given by $\min(1, A)$, where

$$\begin{aligned}
A &= \frac{f(X, y^{\text{new}} \mid \gamma, w^{\text{new}}, G-1) \times f(w^{\text{new}} \mid G-1) \times f(G-1)}{f(X, y^{\text{old}} \mid \gamma, w^{\text{old}}, G) \times f(w^{\text{old}} \mid G) \times f(G)} \\
&\quad\times \frac{b_{G-1} \times P_{\text{alloc}} \times (G_1 - 1)^{-1} \times f(u)}{d_G \times P_{\text{chos}}} \times (w_{l_1} + w_{l_2})^{-1},
\end{aligned} \qquad (16)$$

where $\frac{f(w^{\text{new}} \mid G-1)}{f(w^{\text{old}} \mid G)} = \frac{w_l^{\alpha-1}\, B(\alpha, (G-1)\alpha)}{w_{l_1}^{\alpha-1}\, w_{l_2}^{\alpha-1}}$ and $\frac{f(G-1)}{f(G)}$ equals 1 or $\frac{G}{\lambda}$ for the uniform and Poisson priors on $G$; $G_1$ is the number of nonempty clusters before the combining move; $f(u)$ is the beta(2, 2) density evaluated at $u = \frac{w_{l_1}}{w_l}$; and $P_{\text{alloc}}$ and $P_{\text{chos}}$ are defined similarly to the split move, but now $G^*_0$ and $G^*_1$ in $P_{\text{chos}}$ correspond to the numbers of empty and nonempty clusters after the merge.
5.3.2 Birth and Death Moves. We first make a random choice between a birth move and a death move, using the same probabilities $b_k$ and $d_k = 1 - b_k$ as earlier. For a birth move, the updated parameters are $\psi = (G, w^{\text{old}}, \gamma, y) \rightarrow \psi' = (G + 1, w^{\text{new}}, \gamma, y)$. The weight for the proposed new component is drawn using $w^* \sim \text{beta}(1, G)$, and the existing weights are rescaled to $w'_k = w_k(1 - w^*)$, $k = 1, \ldots, G$ (where $k$ indexes the component labels), so that all of the weights sum to 1. Let $G_0$ be the number of empty components before the birth. The acceptance probability for a birth is $\min(1, A)$, where

$$\begin{aligned}
A &= \frac{f(X, y \mid w^{\text{new}}, \gamma, G+1) \times f(w^{\text{new}} \mid G+1) \times f(G+1)}{f(X, y \mid w^{\text{old}}, \gamma, G) \times f(w^{\text{old}} \mid G) \times f(G)} \\
&\quad\times \frac{d_{G+1} \times (G_0 + 1)^{-1}}{b_G \times f(w^*)} \times (1 - w^*)^{G-1},
\end{aligned} \qquad (17)$$

$\frac{f(w^{\text{new}} \mid G+1)}{f(w^{\text{old}} \mid G)} = \frac{w^{*\,\alpha-1}(1-w^*)^{G(\alpha-1)}}{B(\alpha, G\alpha)}$, and $(1 - w^*)^{G-1}$ is the Jacobian of the transformation $(w^{\text{old}}, w^*) \rightarrow w^{\text{new}}$.

For the death move, an empty component is chosen at random and deleted. Let $w_l$ be the weight of the deleted component. The remaining weights are rescaled to sum to 1 ($w'_k = w_k/(1 - w_l)$), and $\psi = (G, w^{\text{old}}, \gamma, y) \rightarrow \psi' = (G - 1, w^{\text{new}}, \gamma, y)$. The acceptance probability for this move is $\min(1, A)$, where

$$\begin{aligned}
A &= \frac{f(X, y \mid w^{\text{new}}, \gamma, G-1) \times f(w^{\text{new}} \mid G-1) \times f(G-1)}{f(X, y \mid w^{\text{old}}, \gamma, G) \times f(w^{\text{old}} \mid G) \times f(G)} \\
&\quad\times \frac{b_{G-1} \times f(w_l)}{d_G \times G_0^{-1}} \times (1 - w_l)^{-(G-2)},
\end{aligned} \qquad (18)$$

$\frac{f(w^{\text{new}} \mid G-1)}{f(w^{\text{old}} \mid G)} = \frac{B(\alpha, (G-1)\alpha)}{w_l^{\alpha-1}(1-w_l)^{(G-1)(\alpha-1)}}$, $G_0$ is the number of empty components before the death move, and $f(w_l)$ is the beta(1, G) density.
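The birth-move weight proposal and its Jacobian $(1 - w^*)^{G-1}$ from (17) can be sketched as follows; the weight vector $w^{\text{old}}$ is an illustrative example.

```python
import numpy as np

# Sketch of the birth-move weight proposal: w* ~ beta(1, G), existing
# weights rescaled by (1 - w*); the Jacobian is (1 - w*)^(G - 1), the
# factor appearing in (17).  The weight vector w_old is illustrative.
rng = np.random.default_rng(4)
w_old = np.array([0.5, 0.3, 0.2])
G = len(w_old)

w_star = rng.beta(1, G)
w_new = np.append(w_old * (1 - w_star), w_star)   # proposed (G + 1)-vector
jacobian = (1 - w_star) ** (G - 1)
```

The death move inverts this: the empty component's weight is dropped and the remaining weights are divided by $(1 - w_l)$.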
6 POSTERIOR INFERENCE
We draw inference on the sample allocations conditional on $G$. Thus we first need to compute an estimate of this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.
6.1 Label Switching
In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions with up to $G!$ copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance, $w_1 < \cdots < w_G$. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.
Let $P(\xi) = [p_{ik}(\xi)]$ be the matrix of allocation probabilities,

$$p_{ik}(\xi) = \Pr(y_i = k \mid X, \gamma, w, \theta) = \frac{w_k\, f(\mathbf{x}_{i(\gamma)} \mid \theta_{k(\gamma)})}{\sum_j w_j\, f(\mathbf{x}_{i(\gamma)} \mid \theta_{j(\gamma)})}. \qquad (19)$$

In the context of clustering, the loss function can be defined on the cluster labels $y$,

$$L_0(y, \xi) = -\sum_{i=1}^{n} \log p_{i y_i}(\xi). \qquad (20)$$
Let $\xi^{(1)}, \ldots, \xi^{(M)}$ be the sampled parameters, and let $\nu_1, \ldots, \nu_M$ be the permutations applied to them. The parameters $\xi^{(t)}$ correspond to the component parameters $w^{(t)}$, $\mu^{(t)}$, and $\Sigma^{(t)}$ sampled at iteration $t$. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set $\xi^{(t)} = \{w_1^{(t)}, \ldots, w_G^{(t)}, \bar{\mathbf{x}}_{1(\gamma)}^{(t)}, \ldots, \bar{\mathbf{x}}_{G(\gamma)}^{(t)}, S_{1(\gamma)}^{(t)}, \ldots, S_{G(\gamma)}^{(t)}\}$, where $\bar{\mathbf{x}}_{k(\gamma)}^{(t)}$ and $S_{k(\gamma)}^{(t)}$ are the estimates of cluster $k$'s mean and covariance based on the model visited at iteration $t$.

The relabeling algorithm proceeds by selecting initial values for the $\nu_t$'s, which we take to be the identity permutation, then iterating the following steps until a fixed point is reached:

a. Choose $\hat{y}$ to minimize $\sum_{t=1}^{M} L_0\bigl(\hat{y}, \nu_t(\xi^{(t)})\bigr)$.
b. For $t = 1, \ldots, M$, choose $\nu_t$ to minimize $L_0\bigl(\hat{y}, \nu_t(\xi^{(t)})\bigr)$.
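For small $G$, the two steps above can be sketched by brute force over the $G!$ permutations; the allocation-probability matrices below are random illustrative input rather than actual sampler output.

```python
import numpy as np
from itertools import permutations

# Sketch of the relabeling iteration above for small G: given M allocation
# probability matrices P(xi^(t)) (each n x G, rows summing to 1), alternate
# between choosing y-hat (step a) and the permutations nu_t (step b) that
# minimize the loss L0 in (20).  The matrices are random illustrative input.
rng = np.random.default_rng(5)
n, G, M = 6, 2, 4
P = rng.dirichlet(np.ones(G), size=(M, n))       # M matrices of p_ik

def loss(y, P_t):
    # L0(y, xi) = -sum_i log p_{i, y_i}(xi)
    return -np.log(P_t[np.arange(len(y)), y]).sum()

nus = [tuple(range(G))] * M                      # start from identity permutations
for _ in range(20):
    # step a: y-hat minimizing the summed loss (elementwise argmax of summed logs)
    log_p = sum(np.log(P[t][:, list(nus[t])]) for t in range(M))
    y_hat = np.argmax(log_p, axis=1)
    # step b: best permutation nu_t for each iteration t
    new_nus = [min(permutations(range(G)),
                   key=lambda nu: loss(y_hat, P[t][:, list(nu)]))
               for t in range(M)]
    if new_nus == nus:
        break                                    # fixed point reached
    nus = new_nus
```

Enumerating all permutations is only feasible for small $G$; for larger $G$, step b is usually solved as an assignment problem.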
6.2 Posterior Densities and Posterior Estimates
Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector $y$ and the variable selection vector $\gamma$.
We compute the marginal posterior probability that sample $i$ is allocated to cluster $k$, $y_i = k$, as

$$\begin{aligned}
p(y_i = k \mid X, G)
&= \int p\bigl(y_i = k, y_{(-i)}, \gamma, w \mid X, G\bigr)\, dy_{(-i)}\, d\gamma\, dw \\
&\propto \int p\bigl(X, y_i = k, y_{(-i)}, \gamma, w \mid G\bigr)\, dy_{(-i)}\, d\gamma\, dw \\
&= \int p\bigl(X, y_i = k, y_{(-i)} \mid G, \gamma, w\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, dy_{(-i)}\, d\gamma\, dw \\
&\approx \sum_{t=1}^{M} p\bigl(X, y_i = k, y_{(-i)}^{(t)} \mid G, \gamma^{(t)}, w^{(t)}\bigr)\, p\bigl(\gamma^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr),
\end{aligned} \qquad (21)$$

where $y_{(-i)}^{(t)}$ is the vector $y^{(t)}$ at the $t$th MCMC iteration without the $i$th element. The posterior allocation of sample $i$ can then be estimated by the mode of its marginal posterior density,

$$\hat{y}_i = \arg\max_{1 \le k \le G} p(y_i = k \mid X, G). \qquad (22)$$
Similarly, the marginal posterior for $\gamma_j = 1$ can be computed as

$$\begin{aligned}
p(\gamma_j = 1 \mid X, G)
&= \int p\bigl(\gamma_j = 1, \gamma_{(-j)}, w, y \mid X, G\bigr)\, d\gamma_{(-j)}\, dy\, dw \\
&\propto \int p\bigl(X, y, \gamma_j = 1, \gamma_{(-j)}, w \mid G\bigr)\, d\gamma_{(-j)}\, dy\, dw \\
&= \int p\bigl(X, y \mid G, \gamma_j = 1, \gamma_{(-j)}, w\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, d\gamma_{(-j)}\, dy\, dw \\
&\approx \sum_{t=1}^{M} p\bigl(X, y^{(t)} \mid G, \gamma_j = 1, \gamma_{(-j)}^{(t)}, w^{(t)}\bigr)\, p\bigl(\gamma_j = 1, \gamma_{(-j)}^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr),
\end{aligned} \qquad (23)$$

where $\gamma_{(-j)}^{(t)}$ is the vector $\gamma^{(t)}$ at the $t$th iteration without the $j$th element. The best discriminating variables can then be identified as those with largest marginal posterior, $p(\gamma_j = 1 \mid X, G) > a$, where $a$ is chosen arbitrarily,

$$\hat{\gamma}_j = I_{\{p(\gamma_j = 1 \mid X, G) > a\}}. \qquad (24)$$
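In practice, a common Monte Carlo shortcut approximates $p(\gamma_j = 1 \mid X, G)$ by the relative frequency of $\gamma_j = 1$ across the sampled vectors, an unweighted simplification of the sum in (23). A sketch with an illustrative sample matrix and threshold $a = .5$:

```python
import numpy as np

# Simplified sketch: estimate p(gamma_j = 1 | X, G) by the relative
# frequency of gamma_j = 1 among the M sampled vectors (an unweighted
# shortcut rather than the weighted sum in (23)), then threshold at a.
# The simulated sample matrix and a = 0.5 are illustrative.
rng = np.random.default_rng(6)
M, p = 1000, 8
incl_prob = [.9, .8, .1, .1, .1, .1, .1, .1]      # made-up inclusion frequencies
gamma_samples = (rng.random((M, p)) < incl_prob).astype(int)

marginal = gamma_samples.mean(axis=0)      # estimated p(gamma_j = 1 | X, G)
a = 0.5
gamma_hat = (marginal > a).astype(int)     # indicator rule in the spirit of (24)
```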
An alternative variable selection can be performed by considering the vector $\gamma$ with largest posterior probability among all visited vectors,

$$\gamma^* = \arg\max_{1 \le t \le M}\, p\big(\gamma^{(t)} \mid X, G, \bar{w}, \hat{y}\big), \qquad (25)$$

where $\hat{y}$ is the vector of sample allocations estimated via (22) and $\bar{w}$ is given by $\bar{w} = \frac{1}{M}\sum_{t=1}^{M} w^{(t)}$. The estimate given in (25) considers the joint density of $\gamma$, rather than the marginal distributions of the individual elements as (24) does.

608 Journal of the American Statistical Association, June 2005

In the same spirit, an estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,

$$y^* = \arg\max_{1 \le t \le M}\, p\big(y^{(t)} \mid X, G, \bar{w}, \hat{\gamma}\big), \qquad (26)$$

where $\hat{\gamma}$ is the estimate in (24).
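The two selection rules can be sketched as follows, with Monte Carlo inclusion frequencies standing in for the marginal probabilities of (23), and illustrative per-draw scores standing in for the posterior probabilities used in (25); all names are ours.

```python
import numpy as np

def select_marginal(gamma_draws, a=0.5):
    # Rule (24): keep variables whose estimated marginal inclusion
    # probability exceeds the (arbitrarily chosen) threshold a.
    return np.flatnonzero(gamma_draws.mean(axis=0) > a)

def select_joint(gamma_draws, scores):
    # Rule (25): keep the single visited gamma vector with highest score.
    return gamma_draws[int(np.argmax(scores))]

gammas = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [1, 1, 0, 1]])
marginal = select_marginal(gammas, a=0.5)            # variables 0 and 1
best = select_joint(gammas, scores=[0.2, 0.1, 0.7])  # third visited vector
```

The joint rule respects correlations among the elements of $\gamma$, whereas the marginal rule treats each variable separately.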
6.3 Class Prediction
The MCMC output can also be used to predict the class membership of future observations $X^f$. The predictive density is given by

$$\begin{aligned}
p(y^f = k \mid X^f, X, G) &= \int p(y^f = k, y, w, \gamma \mid X^f, X, G)\, dy\, d\gamma\, dw \\
&\propto \int p(X^f, X, y^f = k, y \mid G, w, \gamma)\, p(\gamma \mid G)\, p(w \mid G)\, dy\, d\gamma\, dw \\
&\approx \sum_{t=1}^{M} p\big(X^f, X, y^f = k, y^{(t)} \mid \gamma^{(t)}, w^{(t)}, G\big)\, p\big(\gamma^{(t)} \mid G\big)\, p\big(w^{(t)} \mid G\big), \qquad (27)
\end{aligned}$$
and the observations are allocated according to

$$\hat{y}^f = \arg\max_{1 \le k \le G}\, p(y^f = k \mid X^f, X, G). \qquad (28)$$
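An illustrative sketch of the prediction rule (27)-(28): average, over retained MCMC draws, the probability that a future observation belongs to each cluster, then allocate to the modal one. For simplicity, the per-draw terms are computed here from explicit Gaussian component parameters (weights, means, covariances), a stand-in for the marginalized quantities appearing in (27); all names and numbers are ours.

```python
import numpy as np

def log_gauss(x, mu, cov):
    """Multivariate normal log-density."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def predict(x_f, draws):
    """draws: list of (weights, means, covs) tuples, one per MCMC iteration."""
    G = len(draws[0][0])
    p = np.zeros(G)
    for w, mus, covs in draws:
        lik = np.array([w[k] * np.exp(log_gauss(x_f, mus[k], covs[k]))
                        for k in range(G)])
        p += lik / lik.sum()       # per-draw allocation probabilities
    p /= len(draws)
    return p, int(p.argmax())      # (28): allocate to the modal cluster

draw = ([0.5, 0.5],
        [np.zeros(2), 5 * np.ones(2)],
        [np.eye(2), np.eye(2)])
p, k = predict(np.array([4.8, 5.1]), [draw])  # close to the second mean
```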
7. PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates (petal width, petal height, sepal width, and sepal height) measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. The dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h1 = 100. We set Gmax = 10 and κ1 = .3, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, Gmax]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases, we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for the sample allocations ŷ, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor, except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped
Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.
Tadesse Sha and Vannucci Bayesian Variable Selection 609
Table 1. Iris Data: Posterior Distribution of G

        Poisson(λ = 5)           Poisson(λ = 10)          Uniform[1, 10]
k    Different Σk  Equal Σ    Different Σk  Equal Σ    Different Σk  Equal Σ
3       .6466        0           .5019        0           .8548        0
4       .3124       .7413        .4291       .4384        .1417       .6504
5       .0403       .3124        .0545       .2335        .0035       .3446
6       .0007       .0418        .0138       .2786        0           .0050
7       0            0           .0007       .0484        0            0
8       0            0           0           .0011        0            0
together, except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(yi = k|X, G = 4) computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y104 = 3|X, G) = .4169 and p(y104 = 4|X, G) = .5823.
We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X) assuming G ∼ Poisson(λ = 10) and G ∼ uniform[1, Gmax] are given in Table 1. Under both priors, we obtained sample allocations identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation, and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(yi = k|X, G); there is little support for assigning them to separate clusters.
We use this example also to illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400, between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and that the multimodality in the estimates of the marginal posterior distributions of wk is removed. It is now straightforward to obtain sensible estimates using the posterior means.
Table 2. Iris Data: Sample Allocation ŷ

                  Heterogeneous covariance    Homogeneous covariance
                     I     II    III            I     II    III    IV
Iris setosa         50      0      0           50      0      0     0
Iris versicolor      0     45      5            0     49      1     0
Iris virginica       0      0     50            0      2     36    12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 4), i = 1, ..., 150; k = 1, ..., 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

$$x_{ij} \sim I_{\{1 \le i \le 4\}} N(\mu_1, \sigma_1^2) + I_{\{5 \le i \le 7\}} N(\mu_2, \sigma_2^2) + I_{\{8 \le i \le 13\}} N(\mu_3, \sigma_3^2) + I_{\{14 \le i \le 15\}} N(\mu_4, \sigma_4^2),$$
$$i = 1, \ldots, 15;\quad j = 1, \ldots, 20,$$

where $I_{\{\cdot\}}$ is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ1 = 5, μ2 = 2, μ3 = −3, and μ4 = −6, and the component variances were set to σ²1 = 1.5, σ²2 = 1, σ²3 = .5, and σ²4 = 2. We drew an additional set of (p − 20) noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
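The data-generating scheme above (including the column permutation mentioned next) can be sketched as follows. The component variances are taken as 1.5, 1, .5, and 2, reconstructing decimal points that appear to have been lost in the printed text, and the seed and function name are arbitrary:

```python
import numpy as np

def simulate(p=50, seed=0, variances=(1.5, 1.0, 0.5, 2.0)):
    rng = np.random.default_rng(seed)
    sizes = [4, 3, 6, 2]           # n = 15 observations in four groups
    means = [5, 2, -3, -6]
    # 20 discriminating variables drawn from the four normal components.
    X_disc = np.vstack([rng.normal(m, np.sqrt(v), size=(nk, 20))
                        for nk, m, v in zip(sizes, means, variances)])
    # (p - 20) standard-normal noise variables with no cluster information.
    X_noise = rng.normal(size=(15, p - 20))
    X = np.hstack([X_disc, X_noise])
    X = X[:, rng.permutation(p)]   # disperse the predictors
    labels = np.repeat([0, 1, 2, 3], sizes)
    return X, labels

X, y = simulate(p=50)
```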
We permuted the columns of the data matrix X to disperse the predictors. For each dataset, we took δ = 3, α = 1,
Table 3. Iris Data: Sensitivity to Hyperparameter h1 (results with h1 = 10)

Posterior distribution of G
k             4      5      6      7      8      9      10
p(G = k|X)  .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations ŷ
                   I     II    III    IV     V     VI
Iris setosa       50      0     0      0     0     0
Iris versicolor    0     45     1      0     4     0
Iris virginica     0      2    35     12     0     1
Figure 3. Iris Data, Sensitivity to h1, Results for h1 = 10: Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(yi = k|X, G = 6), i = 1, ..., 5; k = 1, ..., 6.
h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0: their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases, we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(yi = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30-35 variables. After about 40,000 iterations, the chain stabilized to models with 15-20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8-13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1-20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

k     p(G = k|X)
2       .0111
3       .1447
4       .3437
5       .3375
6       .1326
7       .0216
8       .0040
9       .0015
10      .0008
11      .0005
12      .0007
13      .0012
14      .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 4), i = 1, ..., 15; k = 1, ..., 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables p(γj = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box-Cox transformation. We also rescaled each variable by its range.
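The preprocessing steps just described can be sketched as below; the toy matrix and function name are ours, while the detection limits 20 and 16,000 are from the text:

```python
import numpy as np

def preprocess(X, lo=20.0, hi=16000.0):
    # Drop probe sets with at least one reading outside the reliable limits.
    reliable = ((X >= lo) & (X <= hi)).all(axis=0)
    X = np.log(X[:, reliable])                 # log-transform
    X = X / (X.max(axis=0) - X.min(axis=0))    # rescale each variable by its range
    return X, reliable

X_raw = np.array([[25.0, 10.0, 100.0],
                  [40.0, 30.0, 20000.0],
                  [30.0, 50.0, 500.0]])
X_scaled, keep = preprocess(X_raw)   # only the first probe set survives
```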
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = .01 and κ0 = .1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n, and took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25-35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables, p(γj = 1|X, G = 3), for Each of the Four Chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8. DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we have provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data: Concordance of Results Among the Four Chains. Pairwise scatterplots of frequencies for inclusion of genes.
Figure 12. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities p(γj = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 3), i = 1, ..., 14; k = 1, ..., 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering with these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

$$f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \eta, \Sigma, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.$$

Writing out the normal likelihood together with the normal priors on the $\mu_{k(\gamma)}$ and on $\eta_{(\gamma^c)}$ and the inverse-Wishart priors on $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$, completing the square in each $\mu_{k(\gamma)}$ and in $\eta_{(\gamma^c)}$ and integrating the resulting normal kernels, and then integrating the remaining inverse-Wishart kernels in $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$ yields

$$\begin{aligned}
f(X, y \mid G, w, \gamma) ={}& \pi^{-np/2} \Bigg[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Bigg] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
&\times \prod_{j=1}^{p_\gamma} \frac{\Gamma\big((\delta + n + p_\gamma - j)/2\big)}{\Gamma\big((\delta + p_\gamma - j)/2\big)} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\big((\delta + n + p - p_\gamma - j)/2\big)}{\Gamma\big((\delta + p - p_\gamma - j)/2\big)} \\
&\times \big|Q_{1(\gamma)}\big|^{(\delta + p_\gamma - 1)/2}\, \big|Q_{0(\gamma^c)}\big|^{(\delta + p - p_\gamma - 1)/2} \\
&\times \Bigg|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Bigg|^{-(\delta + n + p_\gamma - 1)/2} \big|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\big|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}$$
where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

$$\begin{aligned}
S_{k(\gamma)} &= \sum_{x_i \in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^T - (n_k + h_1^{-1})^{-1} \Bigg(\sum_{x_i \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Bigg) \Bigg(\sum_{x_i \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Bigg)^T \\
&= \sum_{x_i \in C_k} \big(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\big)\big(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\big)^T + \frac{n_k}{h_1 n_k + 1} \big(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\big)\big(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\big)^T
\end{aligned}$$

and

$$\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^T - (n + h_0^{-1})^{-1} \Bigg(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Bigg) \Bigg(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Bigg)^T \\
&= \sum_{i=1}^{n} \big(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)\big(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)^T + \frac{n}{h_0 n + 1} \big(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)\big(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)^T,
\end{aligned}$$

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_i \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
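The equality of the two displayed forms of $S_{k(\gamma)}$ (the cross-product form and the centered form plus a shrinkage term) can be checked numerically; the data and hyperparameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
Xk = rng.normal(size=(6, 3))   # observations x_i(gamma) in cluster k
mu0 = rng.normal(size=3)       # prior mean mu_0(gamma)
h1 = 100.0
nk = Xk.shape[0]

# First form: cross-products minus the completed-square term.
s = Xk.sum(axis=0)
form1 = (Xk.T @ Xk + np.outer(mu0, mu0) / h1
         - np.outer(s + mu0 / h1, s + mu0 / h1) / (nk + 1 / h1))

# Second form: centered sum of squares plus the shrinkage term.
xbar = Xk.mean(axis=0)
Xc = Xk - xbar
form2 = Xc.T @ Xc + nk / (h1 * nk + 1) * np.outer(mu0 - xbar, mu0 - xbar)
```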
Heterogeneous Covariances

$$f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \eta, \Sigma, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.$$

The factor corresponding to the nonselected variables integrates exactly as in the homogeneous case. For the selected variables, completing the square in each $\mu_{k(\gamma)}$ and integrating the normal kernels, then integrating each $\Sigma_{k(\gamma)}$ against its inverse-Wishart kernel, gives

$$\begin{aligned}
f(X, y \mid G, w, \gamma) ={}& \pi^{-np/2} \prod_{k=1}^{G} \Bigg[(h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \prod_{j=1}^{p_\gamma} \frac{\Gamma\big((\delta + n_k + p_\gamma - j)/2\big)}{\Gamma\big((\delta + p_\gamma - j)/2\big)} \\
&\qquad\times \big|Q_{1(\gamma)}\big|^{(\delta + p_\gamma - 1)/2} \big|Q_{1(\gamma)} + S_{k(\gamma)}\big|^{-(\delta + n_k + p_\gamma - 1)/2}\Bigg] \\
&\times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\big((\delta + n + p - p_\gamma - j)/2\big)}{\Gamma\big((\delta + p - p_\gamma - j)/2\big)} \\
&\times \big|Q_{0(\gamma^c)}\big|^{(\delta + p - p_\gamma - 1)/2} \big|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\big|^{-(\delta + n + p - p_\gamma - 1)/2}.
\end{aligned}$$
[Received May 2003. Revised August 2004.]
REFERENCES

Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
(1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
(2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
variables. In addition, it has been shown that the leading components do not necessarily contain the most information about the cluster structure (Chang 1983).
The literature includes several procedures that combine the variable selection and clustering tasks. Fowlkes et al. (1988) used a forward selection approach in the context of complete linkage hierarchical clustering. The variables are added using information of the between-cluster and total sums of squares, and their significance is judged based on graphical information; the authors characterized this as an "informal" assessment. Brusco and Cradit (2001) proposed a variable selection heuristic for k-means clustering using a similar forward selection procedure. Their approach uses the adjusted Rand index to measure cluster recovery. Recently, Friedman and Meulman (2003) proposed a hierarchical clustering procedure that uncovers cluster structure on separate subsets of variables. The algorithm does not explicitly select variables, but rather assigns them different weights, which can be used to extract relevant covariates. This approach is somewhat different from ours, in that it detects subgroups of observations that preferentially cluster on different subsets of variables, rather than clustering on the same subset of variables. These methods all work in conjunction with k-means or hierarchical clustering algorithms, which have several limitations. One major shortcoming is the lack of a statistical criterion for assessing the number of clusters. These procedures also do not provide a measure of the uncertainty associated with the sample allocations. In addition, with respect to variable selection, the existing approaches use greedy deterministic search procedures, which can get stuck at local minima. They also have the limitation of presuming the existence of a single "best" subset of clustering variables. In practice, however, there may be several equally good subsets that define the true cluster structure.
The Bayesian paradigm offers a coherent framework for addressing the problems of variable selection and clustering simultaneously. Recent developments in MCMC techniques (Gilks, Richardson, and Spiegelhalter 1996) have led to substantial research in both areas. Bayesian variable selection methods so far have been developed mainly in the context of regression analysis (see George and McCulloch 1997 for a review of prior specifications and MCMC implementations, and Brown, Vannucci, and Fearn 1998b for extensions to the case of multivariate responses). The Bayesian clustering approach uses a mixture of probability distributions, each distribution representing a different cluster. Diebolt and Robert (1994) presented a comprehensive treatment of MCMC strategies when the number of mixture components is known. For the general case of an unknown number of components, Richardson and Green (1997) successfully applied the reversible-jump MCMC (RJMCMC) technique, but considered only univariate data. Stephens (2000a) proposed an approach based on continuous-time Markov birth-death processes and applied it to a bivariate setting.
We propose a method that simultaneously selects the discriminating variables and clusters the samples into G groups, where G is unknown. The computational burden of searching through all possible $2^p$ variable subsets is handled through the introduction of a latent p-vector with binary entries. We use a stochastic search method to explore the space of possible values for this latent vector. The clustering problem is formulated in terms of a multivariate mixture model with an unknown number of components. We work with a marginalized likelihood, where some of the model parameters are integrated out, and adapt the RJMCMC technique of Richardson and Green (1997) to the multivariate setting. Our inferential procedure provides selection of discriminating variables, estimates for the number of clusters and the sample allocations, and class prediction for future observations.
3 MODEL SPECIFICATION
3.1 Clustering via Mixture Models
The goal of cluster analysis is to partition a collection of observations into homogeneous groups. Model-based clustering has received great attention recently and has provided promising results in various applications. In this approach the data are viewed as coming from a mixture of distributions, each distribution representing a different cluster. Let $X = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be independent p-dimensional observations from G populations. The problem of clustering the n samples can be formulated in terms of a mixture of G underlying probability distributions,

$$f(\mathbf{x}_i \mid w, \theta) = \sum_{k=1}^{G} w_k\, f(\mathbf{x}_i \mid \theta_k), \qquad (1)$$

where $f(\mathbf{x}_i \mid \theta_k)$ is the density of an observation $\mathbf{x}_i$ from the kth component and $w = (w_1, \ldots, w_G)^T$ are the component weights ($w_k \ge 0$, $\sum_{k=1}^G w_k = 1$).
To identify the cluster from which each observation is drawn, latent variables $y = (y_1, \ldots, y_n)^T$ are introduced, where $y_i = k$ if the ith observation comes from component k. The $y_i$'s are assumed to be independently and identically distributed with probability mass function $p(y_i = k) = w_k$. We consider the case where $f(\mathbf{x}_i \mid \theta_k)$ is a multivariate normal density with parameters $\theta_k = (\mu_k, \Sigma_k)$. Thus, for sample i, we have

$$\mathbf{x}_i \mid y_i = k, w, \theta \sim \mathcal{N}(\mu_k, \Sigma_k). \qquad (2)$$
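The generative model in (1)-(2) can be sketched in a few lines of code. The snippet below is our own illustrative Python (not the authors' implementation); `sample_mixture` and the toy parameter values are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, w, mus, Sigmas, rng):
    """Draw n observations from the mixture (1)-(2).

    w: (G,) component weights; mus: (G, p) means; Sigmas: (G, p, p) covariances.
    Returns X (n, p) and the latent allocations y (n,), with y_i = k
    when observation i comes from component k.
    """
    G, p = mus.shape
    y = rng.choice(G, size=n, p=w)          # p(y_i = k) = w_k
    X = np.empty((n, p))
    for k in range(G):
        idx = np.where(y == k)[0]
        if idx.size:
            X[idx] = rng.multivariate_normal(mus[k], Sigmas[k], size=idx.size)
    return X, y

# toy example: two well-separated bivariate components
w = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
X, y = sample_mixture(500, w, mus, Sigmas, rng)
```

Simulated draws of this kind are useful for checking a sampler before applying it to real data.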
3.2 Identifying Discriminating Variables
In high-dimensional data, it is often the case that a large number of variables provide very little, if any, information about the group structure of the observations. Their inclusion could be detrimental, because it might obscure the recovery of the true cluster structure. To identify the relevant variables, we introduce a latent p-vector $\gamma$ with binary entries: $\gamma_j$ takes value 1 if the jth variable defines a mixture distribution for the data, and 0 otherwise. We use $(\gamma)$ and $(\gamma^c)$ to index the discriminating variables and those that favor a single multivariate normal density. Notice that this approach is different from what is commonly done in Bayesian variable selection for regression models, where the latent indicator is used to induce mixture priors on the regression coefficients. The likelihood function is then given by
$$
L(G, \gamma, w, \mu, \eta, \Sigma \mid X, y)
= (2\pi)^{-(p - p_\gamma)n/2}\, \bigl|\Sigma_{(\gamma^c)}\bigr|^{-n/2}
\times \exp\Biggl\{-\frac{1}{2}\sum_{i=1}^{n}\bigl(\mathbf{x}_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^T \Sigma_{(\gamma^c)}^{-1}\bigl(\mathbf{x}_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)\Biggr\}
$$
$$
\times \prod_{k=1}^{G}(2\pi)^{-p_\gamma n_k/2}\, \bigl|\Sigma_{k(\gamma)}\bigr|^{-n_k/2}\, w_k^{n_k}
\times \exp\Biggl\{-\frac{1}{2}\sum_{\mathbf{x}_i \in C_k}\bigl(\mathbf{x}_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^T \Sigma_{k(\gamma)}^{-1}\bigl(\mathbf{x}_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\Biggr\}, \qquad (3)
$$

where $(\eta_{(\gamma^c)}, \Sigma_{(\gamma^c)})$ denote the mean and covariance parameters for the nondiscriminating variables, $(\mu_{k(\gamma)}, \Sigma_{k(\gamma)})$ are the mean and covariance parameters of cluster k, $C_k = \{\mathbf{x}_i \mid y_i = k\}$ with cardinality $n_k$, and $p_\gamma = \sum_{j=1}^{p}\gamma_j$.

604 Journal of the American Statistical Association, June 2005
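The two-block factorization in (3), one multivariate normal for the nondiscriminating variables and cluster-specific normals for the rest, can be evaluated directly when all parameters are given. The sketch below is our own illustrative code (function name and argument layout are ours, not the paper's); it assumes $\gamma$ selects at least one variable on each side.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik(X, y, gamma, w, mus, Sigmas, eta, Sigma0):
    """Log of the likelihood (3): variables with gamma == 1 follow the
    cluster-specific normals; the rest follow a single normal (eta, Sigma0)."""
    d = gamma.astype(bool)                   # indices (gamma)
    # (gamma^c) block: one normal shared by all observations
    ll = multivariate_normal.logpdf(X[:, ~d], eta, Sigma0).sum()
    # (gamma) block: cluster-specific normals, with weight term w_k^{n_k}
    for k in np.unique(y):
        Ck = X[y == k][:, d]
        nk = Ck.shape[0]
        ll += nk * np.log(w[k])
        ll += multivariate_normal.logpdf(Ck, mus[k], Sigmas[k]).sum()
    return ll
```

In the sampler itself this quantity is never used directly, because $\mu$, $\eta$, and the covariances are integrated out (Section 4.2), but it is handy for sanity checks.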
4 PRIOR SETTING AND FULL CONDITIONALS
4.1 Prior Formulation and Choice of Hyperparameters
For the variable selection indicator $\gamma$, we assume that its elements $\gamma_j$ are independent Bernoulli random variables,

$$p(\gamma) = \prod_{j=1}^{p} \varphi^{\gamma_j}(1 - \varphi)^{1 - \gamma_j}. \qquad (4)$$
The number of covariates included in the model, $p_\gamma$, thus follows a binomial distribution, and $\varphi$ can be elicited as the proportion of variables expected a priori to discriminate the different groups, $\varphi = p_{prior}/p$. This prior assumption can be relaxed by formulating a beta(a, b) hyperprior on $\varphi$, which leads to a beta-binomial prior for $p_\gamma$ with expectation $p\,\frac{a}{a+b}$. A vague prior can be elicited by setting $a + b = 2$, which leads to setting $a = 2p_{prior}/p$ and $b = 2 - a$.
Natural priors for the number of components G are a truncated Poisson,

$$P(G = g) = \frac{e^{-\lambda}\lambda^g / g!}{1 - \bigl(e^{-\lambda}(1 + \lambda) + \sum_{j=G_{\max}+1}^{\infty} e^{-\lambda}\lambda^j / j!\bigr)}, \qquad g = 2, \ldots, G_{\max}, \qquad (5)$$

or a discrete uniform on $[2, G_{\max}]$, $P(G = g) = \frac{1}{G_{\max} - 1}$, where $G_{\max}$ is chosen arbitrarily large. For the vector of component weights w, we specify a symmetric Dirichlet prior, $w \mid G \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha)$.
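The normalizing constant in (5) simply removes the Poisson mass at $g = 0$, $g = 1$, and $g > G_{\max}$, so renormalizing the retained probabilities is equivalent. A small sketch (our own code, illustrative only):

```python
import numpy as np
from scipy.stats import poisson

def truncated_poisson_pmf(lam, G_max):
    """Prior (5): Poisson(lam) restricted to g = 2, ..., G_max and renormalized.
    Dividing by the retained mass equals the normalization written in (5)."""
    g = np.arange(2, G_max + 1)
    pmf = poisson.pmf(g, lam)
    return g, pmf / pmf.sum()

g, prior = truncated_poisson_pmf(5.0, 10)
```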
Updating the mean and covariance parameters is somewhat intricate. When deleting or creating new components, the corresponding appropriate changes must be made to the component parameters. However, it is not clear whether adequate proposals can be constructed for the reversible jump in the multivariate setting. Even if this difficulty can be overcome, as $\gamma$ gets updated, the dimensions of $\mu_{k(\gamma)}$, $\Sigma_{k(\gamma)}$, $\eta_{(\gamma^c)}$, and $\Sigma_{(\gamma^c)}$ change, requiring another sampler that moves between different dimensional spaces. The algorithm is much more efficient if we can integrate out these parameters. This also helps substantially accelerate model fitting, because the set of parameters that needs to be updated would then consist only of $(\gamma, w, y, G)$. The integration can be facilitated by taking conjugate priors,
$$
\mu_{k(\gamma)} \mid \Sigma_{k(\gamma)}, G \sim \mathcal{N}\bigl(\mu_{0(\gamma)}, h_1\Sigma_{k(\gamma)}\bigr), \qquad
\eta_{(\gamma^c)} \mid \Sigma_{(\gamma^c)} \sim \mathcal{N}\bigl(\mu_{0(\gamma^c)}, h_0\Sigma_{(\gamma^c)}\bigr),
$$
$$
\Sigma_{k(\gamma)} \mid G \sim \mathrm{IW}\bigl(\delta, Q_{1(\gamma)}\bigr), \qquad
\Sigma_{(\gamma^c)} \sim \mathrm{IW}\bigl(\delta, Q_{0(\gamma^c)}\bigr), \qquad (6)
$$
where $\mathrm{IW}(\delta, Q_{1(\gamma)})$ indicates the inverse-Wishart distribution with dimension $p_\gamma$, shape parameter $\delta = n - p_\gamma + 1$ for n degrees of freedom, and mean $Q_{1(\gamma)}/(\delta - 2)$ (Brown 1993). Weak prior information is specified by taking $\delta$ small; we set $\delta = 3$, the smallest integer such that the expectations of the covariance matrices are defined. We take $Q_1 = 1/\kappa_1 \mathbf{I}_{p\times p}$ and $Q_0 = 1/\kappa_0 \mathbf{I}_{p\times p}$. Some care in the choice of $\kappa_1$ and $\kappa_0$ is needed. These hyperparameters need to be in the range of variability of the data. At the same time, we do not want the prior information to overwhelm the data and drive the inference. We experimented with different values and found that values commensurate with the order of magnitude of the eigenvalues of cov(X), where X is the full data matrix, lead to reasonable results. We suggest setting $\kappa_1$ between 1% and 10% of the upper decile of the $n - 1$ non-zero eigenvalues, and $\kappa_0$ proportional to their lower decile.
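Our reading of this eigenvalue heuristic can be sketched as follows (illustrative code; `frac` is an assumed tuning fraction within the suggested 1%-10% range, and the zero-eigenvalue cutoff is our own choice):

```python
import numpy as np

def kappa_from_eigenvalues(X, frac=0.05):
    """Heuristic of Section 4.1 (our reading): set kappa_1 to a few percent of
    the upper decile of the non-zero eigenvalues of cov(X), and kappa_0
    proportional to their lower decile."""
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eig = eig[eig > 1e-10]                   # keep the (at most n - 1) non-zero eigenvalues
    kappa1 = frac * np.quantile(eig, 0.9)    # upper decile
    kappa0 = frac * np.quantile(eig, 0.1)    # lower decile
    return kappa1, kappa0
```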
For the mean parameters, we take the priors to be fairly flat over the region where the data are defined. Each element of $\mu_0$ is set to the corresponding covariate interval midpoint, and $h_0$ and $h_1$ are chosen arbitrarily large. This prevents the specification of priors that do not overlap with the likelihood and allows for mixtures with widely different component means. We found that values of $h_0$ and $h_1$ between 10 and 1,000 performed reasonably well.
4.2 Full Conditionals
As we stated earlier, the parameters $\mu$, $\eta$, and $\Sigma$ are integrated out, and we need to sample only from the joint posterior of $(G, y, \gamma, w)$. We have

$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid \gamma, G, w, \mu, \eta, \Sigma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Sigma, \gamma)\, f(\Sigma \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Sigma. \qquad (7)
$$
For completeness, we provide the full conditionals under the assumptions of equal and unequal covariances across clusters. However, in practice, unless there is prior knowledge that favors the former, we recommend working with heterogeneous covariances. After some algebra (see the Appendix), we get the following.
1. Under the assumption of homogeneous covariance $\Sigma$,

$$
f(X, y \mid G, w, \gamma) = \pi^{-np/2}\, K_{(\gamma)} \cdot \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \times \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(n + \delta + p_\gamma - 1)/2}
$$
$$
\times H_{(\gamma^c)} \cdot \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \times \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(n + \delta + p - p_\gamma - 1)/2}, \qquad (8)
$$

where

$$
K_{(\gamma)} = \Biggl[\prod_{k=1}^{G} w_k^{n_k}(h_1 n_k + 1)^{-p_\gamma/2}\Biggr] \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{1}{2}(n + \delta + p_\gamma - j)\bigr)}{\Gamma\bigl(\frac{1}{2}(\delta + p_\gamma - j)\bigr)},
$$

$$
H_{(\gamma^c)} = (h_0 n + 1)^{-(p - p_\gamma)/2} \times \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{1}{2}(n + \delta + p - p_\gamma - j)\bigr)}{\Gamma\bigl(\frac{1}{2}(\delta + p - p_\gamma - j)\bigr)},
$$

$$
S_{k(\gamma)} = \sum_{\mathbf{x}_{i(\gamma)} \in C_k}\bigl(\mathbf{x}_{i(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)\bigl(\mathbf{x}_{i(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)^T + \frac{n_k}{h_1 n_k + 1}\bigl(\mu_{0(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{\mathbf{x}}_{k(\gamma)}\bigr)^T,
$$

$$
S_{0(\gamma^c)} = \sum_{i=1}^{n}\bigl(\mathbf{x}_{i(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)\bigl(\mathbf{x}_{i(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)^T + \frac{n}{h_0 n + 1}\bigl(\mu_{0(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{\mathbf{x}}_{(\gamma^c)}\bigr)^T,
$$

$\bar{\mathbf{x}}_{k(\gamma)}$ is the sample mean of cluster k, and $\bar{\mathbf{x}}_{(\gamma^c)}$ corresponds to the sample means of the nondiscriminating variables. As $h_1 \to \infty$, $S_{k(\gamma)}$ reduces to the within-group sum of squares for the kth cluster, and $\sum_{k=1}^{G} S_{k(\gamma)}$ becomes the total within-group sum of squares.

2. Assuming heterogeneous covariances across clusters,

$$
f(X, y \mid G, w, \gamma) = \pi^{-np/2} \prod_{k=1}^{G}\Bigl[K_{k(\gamma)} \cdot \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \times \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(n_k + \delta + p_\gamma - 1)/2}\Bigr]
$$
$$
\times H_{(\gamma^c)} \cdot \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \times \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(n + \delta + p - p_\gamma - 1)/2}, \qquad (9)
$$

where

$$
K_{k(\gamma)} = w_k^{n_k}(h_1 n_k + 1)^{-p_\gamma/2} \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{1}{2}(n_k + \delta + p_\gamma - j)\bigr)}{\Gamma\bigl(\frac{1}{2}(\delta + p_\gamma - j)\bigr)}.
$$
The full conditionals for the model parameters are then given by

$$f(y \mid G, w, \gamma, X) \propto f(X, y \mid G, w, \gamma), \qquad (10)$$

$$f(\gamma \mid G, w, y, X) \propto f(X, y \mid G, w, \gamma)\, p(\gamma \mid G), \qquad (11)$$

and

$$w \mid G, \gamma, y, X \sim \mathrm{Dirichlet}(\alpha + n_1, \ldots, \alpha + n_G). \qquad (12)$$
5 MODEL FITTING

Posterior samples for the parameters of interest are obtained using Metropolis moves and RJMCMC embedded within a Gibbs sampler. Our MCMC procedure comprises the following five steps:

1. Update $\gamma$ from its full conditional in (11).
2. Update w from its full conditional in (12).
3. Update y from its full conditional in (10).
4. Split one mixture component into two, or merge two into one.
5. Birth or death of an empty component.
The birth/death moves specifically deal with creating/deleting empty components and do not involve reallocation of the observations. We could have extended these to include nonempty components and removed the split/merge moves. However, as described by Richardson and Green (1997), including both steps provides more efficient moves and improves convergence.
We would like to point out that as we iterate through these steps, the cluster structure evolves with the choice of variables. This makes the problem of variable selection in the context of clustering much more complicated than in classification or regression analysis. In classification, the group structure of the observations is prespecified in the training data and guides the selection. Here, however, the analysis is fully data-based. In addition, in the context of clustering, including unnecessary variables can have severe implications, because it may obscure the true grouping of the observations.
5.1 Variable Selection Update
In step 1 we update $\gamma$ using a Metropolis search, as suggested for model selection by Madigan and York (1995), applied in regression by Brown et al. (1998a), among others, and recently used by Sha et al. (2004) in multinomial probit models for classification. In this approach a new candidate $\gamma^{new}$ is generated by randomly choosing one of the following two transition moves:

1. Add/delete: Randomly choose one of the p indices in $\gamma^{old}$ and change its value from 0 to 1, or from 1 to 0, to become $\gamma^{new}$.
2. Swap: Choose independently and at random a 0 and a 1 in $\gamma^{old}$ and switch their values to get $\gamma^{new}$.

The new candidate $\gamma^{new}$ is accepted with probability

$$\min\biggl\{1, \frac{f(\gamma^{new} \mid X, y, w, G)}{f(\gamma^{old} \mid X, y, w, G)}\biggr\}.$$
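The two transition moves can be sketched as follows (our own illustrative code; the 50/50 choice between move types is an assumption, since the text does not state the mixing proportion):

```python
import numpy as np

def propose_gamma(gamma, rng):
    """Add/delete or swap proposal on the inclusion vector (Section 5.1)."""
    g = gamma.copy()
    ones = np.where(g == 1)[0]
    zeros = np.where(g == 0)[0]
    # fall back to add/delete when a swap is impossible (all 0s or all 1s)
    if ones.size == 0 or zeros.size == 0 or rng.random() < 0.5:
        j = rng.integers(g.size)            # add/delete: flip one entry
        g[j] = 1 - g[j]
    else:                                   # swap: exchange a 0 and a 1
        g[rng.choice(ones)] = 0
        g[rng.choice(zeros)] = 1
    return g
```

The proposal is symmetric, which is why the acceptance ratio above reduces to a ratio of full conditionals.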
5.2 Update of Weights and Cluster Allocation
In step 2 we use a Gibbs move to sample w from its full conditional in (12). This can be done by drawing independent gamma random variables with common scale and shape parameters $\alpha + n_1, \ldots, \alpha + n_G$, then scaling them to sum to 1.
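The gamma-rescaling construction of the Dirichlet draw in (12) is one line of code (illustrative sketch, our own naming):

```python
import numpy as np

def sample_weights(alpha, counts, rng):
    """Gibbs draw of w from (12): independent Gamma(alpha + n_k, 1) draws,
    rescaled to sum to 1, give a Dirichlet(alpha + n_1, ..., alpha + n_G)."""
    g = rng.gamma(shape=alpha + counts, scale=1.0)
    return g / g.sum()

rng = np.random.default_rng(0)
w = sample_weights(1.0, np.array([50, 45, 55]), rng)
```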
In step 3 we update the allocation vector y one element at a time, via a sub-Gibbs sampling strategy. The full conditional probability that the ith observation is in the kth cluster is given by

$$f\bigl(y_i = k \mid X, y_{(-i)}, \gamma, w, G\bigr) \propto f\bigl(X, y_i = k, y_{(-i)} \mid G, w, \gamma\bigr), \qquad (13)$$

where $y_{(-i)}$ is the cluster assignment for all observations except the ith one.
5.3 Reversible-Jump MCMC
The last two steps, the split/merge and birth/death moves, cause the creation or deletion of components and consequently require a sampler that jumps between different dimensional spaces. A popular method for defining such a sampler is RJMCMC (Green 1995; Richardson and Green 1997), which extends the Metropolis-Hastings algorithm to general state spaces. Suppose that a move of type m, from a countable family of move types, is proposed from $\psi$ to a higher-dimensional space $\psi'$. This will very often be implemented by drawing a vector of continuous random variables u, independent of $\psi$, and setting $\psi'$ to be a deterministic and invertible function of $\psi$ and u. The reverse of this move (from $\psi'$ to $\psi$) can be accomplished by using the inverse transformation, so that in this direction the proposal is deterministic. The move is then accepted with probability

$$\min\Biggl\{1, \frac{f(\psi' \mid X)\, r_m(\psi')}{f(\psi \mid X)\, r_m(\psi)\, q(u)}\biggl|\frac{\partial \psi'}{\partial(\psi, u)}\biggr|\Biggr\}, \qquad (14)$$

where $r_m(\psi)$ is the probability of choosing move type m when in state $\psi$, and $q(u)$ is the density function of u. The final term in the foregoing ratio is a Jacobian arising from the change of variable from $(\psi, u)$ to $\psi'$.
5.3.1 Split and Merge Moves. In step 4 we make a random choice between attempting to split a nonempty cluster or combining two clusters, with probabilities $b_k = 0.5$ and $d_k = 1 - b_k$ ($k = 2, \ldots, G_{\max} - 1$, and $b_{G_{\max}} = 0$).

The split move begins by randomly selecting a nonempty cluster l and dividing it into two components, $l_1$ and $l_2$, with weights $w_{l_1} = w_l u$ and $w_{l_2} = w_l(1 - u)$, where u is a beta(2, 2) random variable. The parameters $\psi = (G, w^{old}, \gamma, y^{old})$ are updated to $\psi' = (G + 1, w^{new}, \gamma, y^{new})$. The acceptance probability for this move is given by $\min(1, A)$, where

$$
A = \frac{f(G + 1, w^{new}, \gamma, y^{new} \mid X)}{f(G, w^{old}, \gamma, y^{old} \mid X)} \times \frac{q(\psi \mid \psi')}{q(\psi' \mid \psi)\, f(u)} \times \biggl|\frac{\partial(w_{l_1}, w_{l_2})}{\partial(w_l, u)}\biggr|
$$
$$
= \frac{f(X, y^{new} \mid w^{new}, \gamma, G + 1) \times f(w^{new} \mid G + 1) \times f(G + 1)}{f(X, y^{old} \mid w^{old}, \gamma, G) \times f(w^{old} \mid G) \times f(G)}
\times \frac{d_{G+1} \times P_{chos}}{b_G \times G_1^{-1} \times P_{alloc} \times f(u)} \times w_l, \qquad (15)
$$

with

$$\frac{f(w^{new} \mid G + 1)}{f(w^{old} \mid G)} = \frac{w_{l_1}^{\alpha - 1}\, w_{l_2}^{\alpha - 1}}{w_l^{\alpha - 1}\, B(\alpha, G\alpha)}$$

and

$$\frac{f(G + 1)}{f(G)} = \begin{cases} 1 & \text{if } G \sim \text{uniform}[2, G_{\max}], \\ \lambda/(G + 1) & \text{if } G \sim \text{truncated Poisson}(\lambda). \end{cases}$$
Here $G_1$ is the number of nonempty components before the split, and $P_{alloc}$ is the probability that the particular allocation is made, which we set to $u^{n_{l_1}}(1 - u)^{n_{l_2}}$, where $n_{l_1}$ and $n_{l_2}$ are the numbers of observations assigned to $l_1$ and $l_2$. Note that Richardson and Green (1997) calculated $P_{alloc}$ using the full conditionals of the allocation variables, $P(y_i = k \mid \text{rest})$. In our case, because the component parameters $\mu$ and $\Sigma$ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. $P_{chos}$ is the probability of selecting the two components for the reverse move. Obviously, we want to combine clusters that are closest to one another. However, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:

a. The split results in two mutually adjacent components $\Rightarrow P_{chos} = \frac{G_1^*}{G^*} \times \frac{2}{G_1^*}$.

b. The split results in components that are not mutually adjacent, that is, $l_1$ has $l_2$ as its closest component, but there is another cluster that is closer to $l_2$ than $l_1$ $\Rightarrow P_{chos} = \frac{G_1^*}{G^*} \times \frac{1}{G_1^*}$.

c. One of the resulting components is empty $\Rightarrow P_{chos} = \frac{G_0^*}{G^*} \times \frac{1}{G_0^*} \times \frac{1}{G_1^*}$.

Here $G^*$ is the total number of clusters after the split, and $G_0^*$ and $G_1^*$ are the numbers of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
The merge move begins by randomly choosing two components, $l_1$ and $l_2$, to be combined into a single cluster l. The updated parameters are $\psi = (G, w^{old}, \gamma, y^{old}) \rightarrow \psi' = (G - 1, w^{new}, \gamma, y^{new})$, where $w_l = w_{l_1} + w_{l_2}$. The acceptance probability for this move is given by $\min(1, A)$, where

$$
A = \frac{f(X, y^{new} \mid \gamma, w^{new}, G - 1) \times f(w^{new} \mid G - 1) \times f(G - 1)}{f(X, y^{old} \mid \gamma, w^{old}, G) \times f(w^{old} \mid G) \times f(G)}
\times \frac{b_{G-1} \times P_{alloc} \times (G_1 - 1)^{-1} \times f(u)}{d_G \times P_{chos}} \times (w_{l_1} + w_{l_2})^{-1}, \qquad (16)
$$

where $\frac{f(w^{new} \mid G - 1)}{f(w^{old} \mid G)} = \frac{w_l^{\alpha - 1}\, B(\alpha, (G - 1)\alpha)}{w_{l_1}^{\alpha - 1}\, w_{l_2}^{\alpha - 1}}$, and $\frac{f(G - 1)}{f(G)}$ equals 1 or $G/\lambda$ for the uniform and Poisson priors on G. Here $G_1$ is the number of nonempty clusters before the combining move, $f(u)$ is the beta(2, 2) density evaluated at $u = w_{l_1}/w_l$, and $P_{alloc}$ and $P_{chos}$ are defined similarly to the split move, but now $G_0^*$ and $G_1^*$ in $P_{chos}$ correspond to the numbers of empty and nonempty clusters after the merge.
5.3.2 Birth and Death Moves. We first make a random choice between a birth move and a death move, using the same probabilities $b_k$ and $d_k = 1 - b_k$ as earlier. For a birth move, the updated parameters are $\psi = (G, w^{old}, \gamma, y) \rightarrow \psi' = (G + 1, w^{new}, \gamma, y)$. The weight for the proposed new component is drawn using $w^* \sim \text{beta}(1, G)$, and the existing weights are rescaled to $w_k' = w_k(1 - w^*)$, $k = 1, \ldots, G$ (where k indexes the component labels), so that all of the weights sum to 1. Let $G_0$ be the number of empty components before the birth. The acceptance probability for a birth is $\min(1, A)$, where

$$
A = \frac{f(X, y \mid w^{new}, \gamma, G + 1) \times f(w^{new} \mid G + 1) \times f(G + 1)}{f(X, y \mid w^{old}, \gamma, G) \times f(w^{old} \mid G) \times f(G)}
\times \frac{d_{G+1} \times (G_0 + 1)^{-1}}{b_G \times f(w^*)} \times (1 - w^*)^{G - 1}, \qquad (17)
$$

$\frac{f(w^{new} \mid G + 1)}{f(w^{old} \mid G)} = \frac{w^{*\,\alpha - 1}(1 - w^*)^{G(\alpha - 1)}}{B(\alpha, G\alpha)}$, and $(1 - w^*)^{G - 1}$ is the Jacobian of the transformation $(w^{old}, w^*) \rightarrow w^{new}$.

For the death move, an empty component is chosen at random and deleted. Let $w_l$ be the weight of the deleted component. The remaining weights are rescaled to sum to 1 ($w_k' = w_k/(1 - w_l)$), and $\psi = (G, w^{old}, \gamma, y) \rightarrow \psi' = (G - 1, w^{new}, \gamma, y)$. The acceptance probability for this move is $\min(1, A)$, where

$$
A = \frac{f(X, y \mid w^{new}, \gamma, G - 1) \times f(w^{new} \mid G - 1) \times f(G - 1)}{f(X, y \mid w^{old}, \gamma, G) \times f(w^{old} \mid G) \times f(G)}
\times \frac{b_{G-1} \times f(w_l)}{d_G \times G_0^{-1}} \times (1 - w_l)^{-(G - 2)}, \qquad (18)
$$

$\frac{f(w^{new} \mid G - 1)}{f(w^{old} \mid G)} = \frac{B(\alpha, (G - 1)\alpha)}{w_l^{\alpha - 1}(1 - w_l)^{(G - 1)(\alpha - 1)}}$, $G_0$ is the number of empty components before the death move, and $f(w_l)$ is the beta(1, G) density.
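The mechanics of the birth proposal, drawing $w^* \sim \text{beta}(1, G)$ and rescaling, can be sketched as follows (illustrative code; names are ours, and the acceptance step using (17) is omitted):

```python
import numpy as np

def birth_weights(w, G, rng):
    """Birth proposal of Section 5.3.2: draw w* ~ beta(1, G) for the new empty
    component and rescale the old weights by (1 - w*); the Jacobian of
    (w_old, w*) -> w_new is (1 - w*)^(G - 1)."""
    w_star = rng.beta(1.0, G)
    w_new = np.append(w * (1 - w_star), w_star)
    jacobian = (1 - w_star) ** (G - 1)
    return w_new, w_star, jacobian
```

The death move is the exact reverse: delete an empty component's weight and divide the rest by $1 - w_l$.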
6 POSTERIOR INFERENCE
We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate of this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.
6.1 Label Switching
In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions with up to G! copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance, $w_1 < \cdots < w_G$. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.
Let $P(\xi) = [p_{ik}(\xi)]$ be the matrix of allocation probabilities,

$$p_{ik}(\xi) = \Pr(y_i = k \mid X, \gamma, w, \theta) = \frac{w_k\, f(\mathbf{x}_{i(\gamma)} \mid \theta_{k(\gamma)})}{\sum_j w_j\, f(\mathbf{x}_{i(\gamma)} \mid \theta_{j(\gamma)})}. \qquad (19)$$

In the context of clustering, the loss function can be defined on the cluster labels y,

$$L_0(y, \xi) = -\sum_{i=1}^{n} \log p_{i y_i}(\xi). \qquad (20)$$
Let $\xi^{(1)}, \ldots, \xi^{(M)}$ be the sampled parameters, and let $\nu_1, \ldots, \nu_M$ be the permutations applied to them. The parameters $\xi^{(t)}$ correspond to the component parameters $w^{(t)}$, $\mu^{(t)}$, and $\Sigma^{(t)}$ sampled at iteration t. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set $\xi^{(t)} = \{w_1^{(t)}, \ldots, w_G^{(t)}, \bar{\mathbf{x}}_{1(\gamma)}^{(t)}, \ldots, \bar{\mathbf{x}}_{G(\gamma)}^{(t)}, S_{1(\gamma)}^{(t)}, \ldots, S_{G(\gamma)}^{(t)}\}$, where $\bar{\mathbf{x}}_{k(\gamma)}^{(t)}$ and $S_{k(\gamma)}^{(t)}$ are the estimates of cluster k's mean and covariance based on the model visited at iteration t.
The relabeling algorithm proceeds by selecting initial values for the $\nu_t$'s, which we take to be the identity permutation, then iterating the following steps until a fixed point is reached:

a. Choose $\hat{y}$ to minimize $\sum_{t=1}^{M} L_0(\hat{y}, \nu_t(\xi^{(t)}))$.
b. For $t = 1, \ldots, M$, choose $\nu_t$ to minimize $L_0(\hat{y}, \nu_t(\xi^{(t)}))$.
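For small G, the coordinate descent above can be implemented by enumerating all G! permutations. The sketch below is our own illustrative rendering (it takes precomputed allocation-probability matrices as input, rather than recomputing (19) from the sampled summaries):

```python
import numpy as np
from itertools import permutations

def relabel(P, max_iter=20):
    """Relabeling in the spirit of Section 6.1 (Stephens 2000b), small G.
    P is (M, n, G): P[t, i, k] estimates p_ik at MCMC sweep t."""
    M, n, G = P.shape
    perms = list(permutations(range(G)))
    nu = [tuple(range(G))] * M                   # start from identity permutations
    for _ in range(max_iter):
        # step (a): y_hat_i maximizes sum_t log p_{i, nu_t(k)}
        logQ = sum(np.log(P[t][:, list(nu[t])]) for t in range(M))
        y_hat = logQ.argmax(axis=1)
        # step (b): per-sweep permutation minimizing the loss (20)
        nu_new = [min(perms,
                      key=lambda pm, t=t: -np.log(
                          P[t][np.arange(n), np.asarray(pm)[y_hat]]).sum())
                  for t in range(M)]
        if nu_new == nu:                         # fixed point reached
            break
        nu = nu_new
    return y_hat, nu
```

A quick check: feeding in sweeps where the two labels are deliberately swapped, the algorithm recovers a consistent reference allocation and flips the swapped sweep back.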
6.2 Posterior Densities and Posterior Estimates
Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector $\gamma$.
We compute the marginal posterior probability that sample i is allocated to cluster k, $y_i = k$, as

$$
p(y_i = k \mid X, G) = \int p\bigl(y_i = k, y_{(-i)}, \gamma, w \mid X, G\bigr)\, dy_{(-i)}\, d\gamma\, dw
$$
$$
\propto \int p\bigl(X, y_i = k, y_{(-i)}, \gamma, w \mid G\bigr)\, dy_{(-i)}\, d\gamma\, dw
$$
$$
= \int p\bigl(X, y_i = k, y_{(-i)} \mid G, \gamma, w\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, dy_{(-i)}\, d\gamma\, dw
$$
$$
\approx \sum_{t=1}^{M} p\bigl(X, y_i^{(t)} = k, y_{(-i)}^{(t)} \mid G, \gamma^{(t)}, w^{(t)}\bigr)\, p\bigl(\gamma^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (21)
$$

where $y_{(-i)}^{(t)}$ is the vector $y^{(t)}$ at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density,

$$\hat{y}_i = \arg\max_{1 \le k \le G}\, p(y_i = k \mid X, G). \qquad (22)$$
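In practice, once the draws have been relabeled, the modal estimate (22) can be approximated from the sampled allocation vectors. The sketch below is our own simplification: it weights all sweeps equally (relative frequencies), whereas (21) weights each sweep by its joint density.

```python
import numpy as np

def allocation_mode(Y):
    """Approximate (22) from relabeled MCMC draws: Y is (M, n), with Y[t, i]
    the cluster label of sample i at sweep t. Uses equal sweep weights as a
    simplification of the weighted sum in (21), then takes the mode."""
    M, n = Y.shape
    G = Y.max() + 1
    counts = np.stack([(Y == k).sum(axis=0) for k in range(G)])  # (G, n)
    return counts.argmax(axis=0)

Y = np.array([[0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
y_hat = allocation_mode(Y)   # -> [0, 0, 1]
```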
Similarly, the marginal posterior probability that $\gamma_j = 1$ can be computed as

$$
p(\gamma_j = 1 \mid X, G) = \int p\bigl(\gamma_j = 1, \gamma_{(-j)}, w, y \mid X, G\bigr)\, d\gamma_{(-j)}\, dy\, dw
$$
$$
\propto \int p\bigl(X, y, \gamma_j = 1, \gamma_{(-j)}, w \mid G\bigr)\, d\gamma_{(-j)}\, dy\, dw
$$
$$
= \int p\bigl(X, y \mid G, \gamma_j = 1, \gamma_{(-j)}, w\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, d\gamma_{(-j)}\, dy\, dw
$$
$$
\approx \sum_{t=1}^{M} p\bigl(X, y^{(t)} \mid G, \gamma_j^{(t)} = 1, \gamma_{(-j)}^{(t)}, w^{(t)}\bigr)\, p\bigl(\gamma_j = 1, \gamma_{(-j)}^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (23)
$$

where $\gamma_{(-j)}^{(t)}$ is the vector $\gamma^{(t)}$ at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, $p(\gamma_j = 1 \mid X, G) > a$, where a is chosen arbitrarily,

$$\hat{\gamma}_j = I_{\{p(\gamma_j = 1 \mid X, G) > a\}}. \qquad (24)$$
An alternative variable selection can be performed by considering the vector $\gamma$ with largest posterior probability among all visited vectors,

$$\gamma^* = \arg\max_{1 \le t \le M}\, p\bigl(\gamma^{(t)} \mid X, G, \hat{w}, \hat{y}\bigr), \qquad (25)$$

where $\hat{y}$ is the vector of sample allocations estimated via (22) and $\hat{w}$ is given by $\hat{w} = \frac{1}{M}\sum_{t=1}^{M} w^{(t)}$. The estimate given in (25) considers the joint density of $\gamma$, rather than the marginal distributions of its individual elements, as in (24). In the same spirit, an estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,

$$y^* = \arg\max_{1 \le t \le M}\, p\bigl(y^{(t)} \mid X, G, \hat{w}, \hat{\gamma}\bigr), \qquad (26)$$

where $\hat{\gamma}$ is the estimate in (24).
6.3 Class Prediction
The MCMC output can also be used to predict the class membership of future observations $X_f$. The predictive density is given by

$$
p(y_f = k \mid X_f, X, G) = \int p(y_f = k, y, w, \gamma \mid X_f, X, G)\, dy\, d\gamma\, dw
$$
$$
\propto \int p(X_f, X, y_f = k, y \mid G, w, \gamma)\, p(\gamma \mid G)\, p(w \mid G)\, dy\, d\gamma\, dw
$$
$$
\approx \sum_{t=1}^{M} p\bigl(X_f, X, y_f = k, y^{(t)} \mid \gamma^{(t)}, w^{(t)}, G\bigr)\, p\bigl(\gamma^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (27)
$$

and the observations are allocated according to

$$\hat{y}_f = \arg\max_{1 \le k \le G}\, p(y_f = k \mid X_f, X, G). \qquad (28)$$
7 PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates (petal width, petal height, sepal width, and sepal height) measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took $\delta = 3$, $\alpha = 1$, and $h_1 = 100$. We set $G_{\max} = 10$ and $\kappa_1 = 0.3$, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with $\lambda = 5$ and $\lambda = 10$, and a discrete uniform on $[1, G_{\max}]$. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 iterations and a main run of 40,000 iterations for inference.
Let us first focus on the results obtained using a truncated Poisson($\lambda = 5$) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities $p(G = k \mid X)$. Table 2 shows the posterior estimates for the sample allocations $\hat{y}$, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped
Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.
Table 1. Iris Data: Posterior Distribution of G

                Poisson(λ = 5)           Poisson(λ = 10)          Uniform[1, 10]
 k         Different Σk   Equal Σ   Different Σk   Equal Σ   Different Σk   Equal Σ
 3            .6466          0         .5019          0         .8548          0
 4            .3124       .7413        .4291       .4384        .1417       .6504
 5            .0403       .3124        .0545       .2335        .0035       .3446
 6            .0007       .0418        .0138       .2786          0         .0050
 7              0            0         .0007       .0484          0            0
 8              0            0           0         .0011          0            0
together, except for one. Two virginica flowers are assigned to the versicolor group, and the remaining virginica are divided into two new components. Figure 2 shows the marginal posterior probabilities $p(y_i = k \mid X, G = 4)$ computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors $p(y_{104} = 3 \mid X, G) = .4169$ and $p(y_{104} = 4 \mid X, G) = .5823$.

We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities $p(G = k \mid X)$ assuming $G \sim$ Poisson($\lambda = 10$) and $G \sim$ uniform$[1, G_{\max}]$ are given in Table 1. Under both priors, we obtained sample allocations identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter $h_1$, with values between 10 and 1,000 giving satisfactory results. Smaller values of $h_1$ tended to favor slightly more components, but with some clusters containing very few observations. For example, for $h_1 = 10$, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, $p(y_i = k \mid X, G)$; there is little support for assigning them to separate clusters.
We use this example also to illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after postprocessing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400, between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of $w_k$ is removed. It is now straightforward to obtain sensible estimates using the posterior means.
Table 2. Iris Data: Sample Allocation ŷ

                  Heterogeneous covariance     Homogeneous covariance
                      I     II    III            I     II    III    IV
Iris setosa          50      0      0           50      0      0     0
Iris versicolor       0     45      5            0     49      1     0
Iris virginica        0      0     50            0      2     36    12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(y_i = k|X, G = 4), i = 1, ..., 150; k = 1, ..., 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

$$
x_{ij} \sim I_{\{1 \le i \le 4\}}\,\mathcal{N}(\mu_1, \sigma_1^2) + I_{\{5 \le i \le 7\}}\,\mathcal{N}(\mu_2, \sigma_2^2) + I_{\{8 \le i \le 13\}}\,\mathcal{N}(\mu_3, \sigma_3^2) + I_{\{14 \le i \le 15\}}\,\mathcal{N}(\mu_4, \sigma_4^2),
$$
$$
i = 1, \ldots, 15,\quad j = 1, \ldots, 20,
$$

where $I_{\{\cdot\}}$ is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to $\mu_1 = 5$, $\mu_2 = 2$, $\mu_3 = -3$, and $\mu_4 = -6$, and the component variances were set to $\sigma_1^2 = 1.5$, $\sigma_2^2 = 1$, $\sigma_3^2 = 0.5$, and $\sigma_4^2 = 2$. We drew an additional set of $(p - 20)$ noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1,000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
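This simulation design is easy to reproduce; a sketch (our own code, with the column permutation described in the next paragraph included, and an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_data(p=1000, rng=rng):
    """Simulated data of Section 7.2: 15 observations, 20 discriminating
    variables from four normal components, and p - 20 N(0, 1) noise variables,
    with columns permuted to disperse the predictors."""
    sizes = [4, 3, 6, 2]
    means = [5.0, 2.0, -3.0, -6.0]
    sds = np.sqrt([1.5, 1.0, 0.5, 2.0])
    blocks = [rng.normal(m, s, size=(nk, 20)) for nk, m, s in zip(sizes, means, sds)]
    signal = np.vstack(blocks)                   # (15, 20) discriminating block
    noise = rng.normal(size=(15, p - 20))        # nonclustering variables
    X = np.hstack([signal, noise])
    X = X[:, rng.permutation(p)]                 # disperse the predictors
    y = np.repeat([0, 1, 2, 3], sizes)
    return X, y
```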
We permuted the columns of the data matrix X to disperse the predictors. For each dataset we took $\delta = 3$, $\alpha = 1$,

Table 3. Iris Data: Sensitivity to Hyperparameter h1. Results With h1 = 10

Posterior distribution of G
 k              4      5      6      7      8      9      10
 p(G = k|X)  .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations ŷ
                    I     II    III    IV     V    VI
Iris setosa        50      0      0     0     0     0
Iris versicolor     0     45      1     0     4     0
Iris virginica      0      2     35    12     0     1
Figure 3. Iris Data, Sensitivity to h1: Results for h1 = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k|X, G = 6), i = 1, ..., 5; k = 1, ..., 6.
$h_1 = h_0 = 100$, and $G_{\max} = n$, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with $\lambda = 5$. As described in Section 4.1, some care is needed when choosing $\kappa_1$ and $\kappa_0$; their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is, of course, widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the non-zero eigenvalues. For the prior on $\gamma$, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets: in all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30-35 variables. After about 40,000 iterations, the chain stabilized to models with 15-20 covariates. Figure 7 displays the marginal posterior probabilities p(γ_j = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8-13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1-20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Tadesse Sha and Vannucci Bayesian Variable Selection 611
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables p_γ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

   k    p(G = k|X)
   2      .0111
   3      .1447
   4      .3437
   5      .3375
   6      .1326
   7      .0216
   8      .0040
   9      .0015
  10      .0008
  11      .0005
  12      .0007
  13      .0012
  14      .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 4), i = 1, ..., 15, k = 1, ..., 4.
7.3 Application to Microarray Data

We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables, p(γ_j = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box-Cox transformation. We also rescaled each variable by its range.
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.

To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γ_j set to 0 for all j's except one randomly selected; (2) γ_j set to 1 for 10 randomly selected j's; (3) γ_j set to 1 for 25 randomly selected j's; and (4) γ_j set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with a stronger support for G = 3 clusters. Figure 9(b) shows the trace plot of the number of included variables for one of the chains, which visited mostly models with 25-35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γ_j = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables p_γ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.

We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γ_j = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k|X, G = 3), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables, p(γ_j = 1|X, G = 3), for Each of the Four Chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8. DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data: Concordance of Results Among the Four Chains. Pairwise scatterplots of frequencies for inclusion of genes.
Figure 12. Endometrial Data: Union of Four Chains. Marginal posterior probabilities p(γ_j = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data: Union of Four Chains. Marginal posterior probabilities of sample allocations, p(y_i = k|X, G = 3), i = 1, ..., 14, k = 1, ..., 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering with these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
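The pairwise summary described above is straightforward to compute from MCMC output, since label switching cancels out of the "same cluster" events. A minimal sketch (the function name and toy draws are our own, not from the paper):

```python
import numpy as np

def coallocation_matrix(allocations):
    """Posterior probability that observations i and j share a cluster,
    estimated as the fraction of MCMC draws in which their labels agree."""
    A = np.asarray(allocations)                      # shape (n_draws, n)
    return (A[:, :, None] == A[:, None, :]).mean(axis=0)  # (n, n) in [0, 1]

# Two toy draws: observations 0 and 1 cluster together in both draws;
# observation 2 joins them in the second draw only.
draws = [[1, 1, 2],
         [3, 3, 3]]
S = coallocation_matrix(draws)
print(S[0, 1], S[0, 2])  # 1.0 0.5
```

Note that the label values themselves (1, 2 vs. 3) are irrelevant; only agreement within a draw matters, which is why the summary is invariant to label switching.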
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_{1(\gamma)} = \cdots = \Sigma_{G(\gamma)} = \Sigma_{(\gamma)}$

\[
f(X, y|G, w, \gamma) = \int f(X, y|G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu|\Sigma, G, \gamma)\, f(\Sigma|G, \gamma)\, f(\eta|\Omega, \gamma)\, f(\Omega|\gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\]

The integrand is the product of the likelihood, the normal priors on the $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$, and the inverse-Wishart priors on $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$:

\begin{align*}
&\int \prod_{k=1}^{G} w_k^{n_k}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2} \\
&\quad \times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\ \sum_{x_{i(\gamma)}\in C_k} \bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^T \Sigma_{(\gamma)}^{-1} \bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\} \\
&\quad \times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G} \bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)^T \bigl(h_1\Sigma_{(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)\Biggr\} \\
&\quad \times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4} \Biggl(\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2} \exp\Bigl\{-\tfrac{1}{2}\mathrm{tr}\bigl(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\} \\
&\quad \times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} (2\pi)^{-(n+1)(p-p_\gamma)/2}\, h_0^{-(p-p_\gamma)/2} \exp\Biggl\{-\frac{1}{2}\sum_{i=1}^{n} \bigl(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\bigr)^T \Omega_{(\gamma^c)}^{-1} \bigl(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\bigr)\Biggr\} \\
&\quad \times \exp\Bigl\{-\tfrac{1}{2}\bigl(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\bigr)^T \bigl(h_0\Omega_{(\gamma^c)}\bigr)^{-1} \bigl(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\bigr)\Bigr\} \\
&\quad \times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4} \Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\Bigl(\frac{\delta+p-p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+2(p-p_\gamma))/2} \exp\Bigl\{-\tfrac{1}{2}\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{align*}

Completing the square in each $\mu_{k(\gamma)}$, whose conditional posterior mean is $(n_k + h_1^{-1})^{-1}(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)})$, and in $\eta_{(\gamma^c)}$, whose conditional posterior mean is $(n + h_0^{-1})^{-1}(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)})$, integrating out these mean parameters, and recognizing the remaining integrals over $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$ as inverse-Wishart normalizing constants, we obtain

\begin{align*}
f(X, y|G, w, \gamma) &= \pi^{-np/2} \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p-p_\gamma)/2} \\
&\quad \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl((\delta+n+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)} \prod_{j=1}^{p-p_\gamma} \frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)} \\
&\quad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta+n+p_\gamma-1)/2} \\
&\quad \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{align*}

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

\begin{align*}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}x_{i(\gamma)}^T + h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^T \\
&\quad - (n_k + h_1^{-1})^{-1} \Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\Biggr) \Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\Biggr)^T \\
&= \sum_{x_{i(\gamma)}\in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
\end{align*}

and

\begin{align*}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)}x_{i(\gamma^c)}^T + h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^T \\
&\quad - (n + h_0^{-1})^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\Biggr) \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\Biggr)^T \\
&= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T,
\end{align*}

with $\bar{x}_{k(\gamma)} = n_k^{-1}\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = n^{-1}\sum_{i=1}^{n} x_{i(\gamma^c)}$.
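The two expressions given for $S_{k(\gamma)}$, the completed-square form and the centered form, are algebraically equal, which can be spot-checked numerically. The sketch below is illustrative only; the dimensions and hyperparameter values are invented for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
nk, pg, h1 = 6, 3, 50.0                  # toy cluster size, dimension, h_1
Xk = rng.standard_normal((nk, pg))       # observations in cluster k
mu0 = rng.standard_normal(pg)            # prior mean mu_0
xbar = Xk.mean(axis=0)
s = Xk.sum(axis=0)

# Form 1: as it comes out of completing the square in the integral.
S1 = (Xk.T @ Xk + np.outer(mu0, mu0) / h1
      - np.outer(s + mu0 / h1, s + mu0 / h1) / (nk + 1 / h1))

# Form 2: within-cluster sum of squares plus a shrinkage term toward mu_0.
D = Xk - xbar
S2 = D.T @ D + nk / (h1 * nk + 1) * np.outer(mu0 - xbar, mu0 - xbar)

print(np.allclose(S1, S2))  # True
```

As the text notes, letting h1 grow makes the second term vanish, leaving the ordinary within-group sum of squares.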
Heterogeneous Covariances

Under cluster-specific covariances $\Sigma_{k(\gamma)} \sim IW(\delta, Q_{1(\gamma)})$, the factor for the nondiscriminating variables $(\gamma^c)$ integrates exactly as in the homogeneous case. The discriminating-variable factor is handled cluster by cluster: completing the square in each $\mu_{k(\gamma)}$ and integrating each $\Sigma_{k(\gamma)}$ against its inverse-Wishart prior yields

\begin{align*}
f(X, y|G, w, \gamma) &= \pi^{-np/2} \prod_{k=1}^{G} \Biggl[(h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k} \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl((\delta+n_k+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)} \\
&\qquad\qquad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta+n_k+p_\gamma-1)/2}\Biggr] \\
&\quad \times (h_0 n + 1)^{-(p-p_\gamma)/2} \prod_{j=1}^{p-p_\gamma} \frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)} \\
&\quad \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{align*}

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined above.
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2-5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803-821.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173-182.
Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627-641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249-270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267-275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363-375.
Fowlkes, E. B., Gnanadesikan, R., and Kettenring, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205-228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics, and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339-373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275-286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettenring, J. R., and Tsao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113-136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711-732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413-422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194-1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53-71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165-190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731-792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812-819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components: An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40-74.
Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795-809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542-554.
\[
\cdots \times \prod_{k=1}^{G} (2\pi)^{-p_\gamma n_k/2}\, \bigl|\Sigma_{k(\gamma)}\bigr|^{-n_k/2}\, w_k^{n_k}\, \exp\Biggl\{-\frac{1}{2}\sum_{x_i\in C_k} \bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^T \Sigma_{k(\gamma)}^{-1} \bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\},
\tag{3}
\]

where $(\eta_{(\gamma^c)}, \Omega_{(\gamma^c)})$ denote the mean and covariance parameters for the nondiscriminating variables, $(\mu_{k(\gamma)}, \Sigma_{k(\gamma)})$ are the mean and covariance parameters of cluster k, $C_k = \{x_i \mid y_i = k\}$ with cardinality $n_k$, and $p_\gamma = \sum_{j=1}^{p} \gamma_j$.
4. PRIOR SETTING AND FULL CONDITIONALS

4.1 Prior Formulation and Choice of Hyperparameters

For the variable selection indicator γ, we assume that its elements γ_j are independent Bernoulli random variables,

\[
p(\gamma) = \prod_{j=1}^{p} \varphi^{\gamma_j}(1-\varphi)^{1-\gamma_j}.
\tag{4}
\]
The number of covariates included in the model, p_γ, thus follows a binomial distribution, and φ can be elicited as the proportion of variables expected a priori to discriminate the different groups, φ = p_prior/p. This prior assumption can be relaxed by formulating a beta(a, b) hyperprior on φ, which leads to a beta-binomial prior for p_γ with expectation p·a/(a + b). A vague prior can be elicited by setting a + b = 2, which leads to setting a = 2p_prior/p and b = 2 − a.
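This elicitation amounts to a few lines of arithmetic; the numbers below use the p = 762 microarray setting from Section 7.3 as an illustration:

```python
p, p_prior = 762, 10          # total variables; expected number included
phi = p_prior / p             # Bernoulli inclusion probability
a = 2 * p_prior / p           # vague beta(a, b) hyperprior with a + b = 2
b = 2 - a

# The beta-binomial expectation p * a / (a + b) recovers p_prior.
print(p * a / (a + b))  # 10.0
```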
Natural priors for the number of components G are a truncated Poisson,

\[
P(G = g) = \frac{e^{-\lambda}\lambda^g / g!}{1 - \Bigl[e^{-\lambda}(1+\lambda) + \sum_{j=G_{\max}+1}^{\infty} e^{-\lambda}\lambda^j / j!\Bigr]}, \qquad g = 2, \ldots, G_{\max},
\tag{5}
\]

or a discrete uniform on [2, G_max], P(G = g) = 1/(G_max − 1), where G_max is chosen arbitrarily large. For the vector of component weights w, we specify a symmetric Dirichlet prior, w|G ~ Dirichlet(α, ..., α).
Updating the mean and covariance parameters is somewhat intricate. When deleting or creating new components, the corresponding appropriate changes must be made for the component parameters. However, it is not clear whether adequate proposals can be constructed for the reversible jump in the multivariate setting. Even if this difficulty can be overcome, as γ gets updated, the dimensions of μ_k(γ), Σ_k(γ), η(γ^c), and Ω(γ^c) change, requiring another sampler that moves between different dimensional spaces. The algorithm is much more efficient if we can integrate out these parameters. This also helps substantially accelerate model fitting, because the set of parameters that need to be updated would then consist only of (γ, w, y, G). The integration can be facilitated by taking conjugate priors:
microk(γ )
∣∣k(γ )G sim N(micro0(γ )h1k(γ )
)
η(γ c)
∣∣(γ c) sim N(micro0(γ c)h0(γ c)
)
(6)k(γ )|G sim IW
(δQ1(γ )
)
(γ c) sim IW(δQ0(γ c)
)
where IW(δQ1(γ )) indicates the inverse-Wishart distributionwith dimension pγ shape parameter δ = n minus pγ + 1 n degreesof freedom and mean Q1(γ )(δ minus 2) (Brown 1993) Weak priorinformation is specified by taking δ small which we set to 3 thesmallest integer such that the expectations of the covariance ma-trices are defined We take Q1 = 1κ1Iptimesp and Q0 = 1κ0IptimespSome care in the choice of κ1 and κ0 is needed These hyperpa-rameters need to be in the range of variability of the data At thesame time we do not want the prior information to overwhelmthe data and drive the inference We experimented with differ-ent values and found that values commensurate with the orderof magnitude of the eigenvalues of cov(X) where X is the fulldata matrix lead to reasonable results We suggest setting κ1between 1 and 10 of the upper decile of the n minus 1 non-zeroeigenvalues and κ0 proportional to their lower decile
For the mean parameters, we take the priors to be fairly flat over the region where the data are defined. Each element of μ_0 is set to the corresponding covariate interval midpoint, and h_0 and h_1 are chosen arbitrarily large. This prevents the specification of priors that do not overlap with the likelihood and allows for mixtures with widely different component means. We found that values of h_0 and h_1 between 10 and 1,000 performed reasonably well.
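The eigenvalue-decile recipe for κ_1 and κ_0 can be sketched as follows. The function name and the 5% scaling factor are our own illustrative choices (the text suggests a range rather than a single value):

```python
import numpy as np

def elicit_kappas(X, frac=0.05):
    """Sketch of the eigenvalue-based elicitation of kappa_1 and kappa_0:
    scale the upper and lower deciles of the n-1 nonzero eigenvalues
    of cov(X). 'frac' is an illustrative proportionality constant."""
    n = X.shape[0]
    evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    nz = evals[:n - 1]                      # at most n-1 nonzero eigenvalues
    kappa1 = frac * np.quantile(nz, 0.9)    # upper decile
    kappa0 = frac * np.quantile(nz, 0.1)    # lower decile
    return kappa1, kappa0

# Toy p >> n data: 15 samples, 100 variables.
rng = np.random.default_rng(2)
k1, k0 = elicit_kappas(rng.standard_normal((15, 100)))
print(k1 > k0 > 0)  # True
```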
4.2 Full Conditionals

As we stated earlier, the parameters μ, η, Σ, and Ω are integrated out, and we need to sample only from the joint posterior of (G, y, γ, w). We have

\[
f(X, y|G, w, \gamma) = \int f(X, y|\gamma, G, w, \mu, \Sigma, \eta, \Omega)\, f(\mu|\Sigma, G, \gamma)\, f(\Sigma|G, \gamma)\, f(\eta|\Omega, \gamma)\, f(\Omega|\gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\tag{7}
\]
For completeness, we provide the full conditionals under the assumptions of equal and unequal covariances across clusters. However, in practice, unless there is prior knowledge that favors the former, we recommend working with heterogeneous covariances. After some algebra (see the Appendix), we get the following.
1. Under the assumption of homogeneous covariance,

\begin{align*}
f(X, y|G, w, \gamma) &= \pi^{-np/2}\, K(\gamma) \cdot \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(n+\delta+p_\gamma-1)/2} \\
&\quad \times H(\gamma^c) \cdot \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(n+\delta+p-p_\gamma-1)/2},
\tag{8}
\end{align*}

where

\begin{align*}
K(\gamma) &= \Biggl[\prod_{k=1}^{G} w_k^{n_k} (h_1 n_k + 1)^{-p_\gamma/2}\Biggr] \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\tfrac{1}{2}(n+\delta+p_\gamma-j)\bigr)}{\Gamma\bigl(\tfrac{1}{2}(\delta+p_\gamma-j)\bigr)}, \\
H(\gamma^c) &= (h_0 n + 1)^{-(p-p_\gamma)/2} \prod_{j=1}^{p-p_\gamma} \frac{\Gamma\bigl(\tfrac{1}{2}(n+\delta+p-p_\gamma-j)\bigr)}{\Gamma\bigl(\tfrac{1}{2}(\delta+p-p_\gamma-j)\bigr)}, \\
S_{k(\gamma)} &= \sum_{x_{i(\gamma)}\in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T, \\
S_{0(\gamma^c)} &= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T,
\end{align*}

$\bar{x}_{k(\gamma)}$ is the sample mean of cluster k, and $\bar{x}_{(\gamma^c)}$ corresponds to the sample mean of the nondiscriminating variables. As $h_1 \to \infty$, $S_{k(\gamma)}$ reduces to the within-group sum of squares for the kth cluster, and $\sum_{k=1}^{G} S_{k(\gamma)}$ becomes the
total within-group sum of squares.

2. Assuming heterogeneous covariances across clusters,

\begin{align*}
f(X, y|G, w, \gamma) &= \pi^{-np/2} \prod_{k=1}^{G} K_k(\gamma) \cdot \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(n_k+\delta+p_\gamma-1)/2} \\
&\quad \times H(\gamma^c) \cdot \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(n+\delta+p-p_\gamma-1)/2},
\tag{9}
\end{align*}

where $K_k(\gamma) = w_k^{n_k} (h_1 n_k + 1)^{-p_\gamma/2} \prod_{j=1}^{p_\gamma} \Gamma\bigl(\tfrac{1}{2}(n_k+\delta+p_\gamma-j)\bigr) / \Gamma\bigl(\tfrac{1}{2}(\delta+p_\gamma-j)\bigr)$.
The full conditionals for the model parameters are then given by

\[
f(y|G, w, \gamma, X) \propto f(X, y|G, w, \gamma),
\tag{10}
\]
\[
f(\gamma|G, w, y, X) \propto f(X, y|G, w, \gamma)\, p(\gamma|G),
\tag{11}
\]

and

\[
w|G, \gamma, y, X \sim \text{Dirichlet}(\alpha + n_1, \ldots, \alpha + n_G).
\tag{12}
\]
5. MODEL FITTING

Posterior samples for the parameters of interest are obtained using Metropolis moves and RJMCMC embedded within a Gibbs sampler. Our MCMC procedure comprises the following five steps:

1. Update γ from its full conditional in (11).
2. Update w from its full conditional in (12).
3. Update y from its full conditional in (10).
4. Split one mixture component into two, or merge two into one.
5. Birth or death of an empty component.

The birth/death moves specifically deal with creating/deleting empty components and do not involve reallocation of the observations. We could have extended these to include nonempty components and removed the split/merge moves. However, as described by Richardson and Green (1997), including both steps provides more efficient moves and improves convergence.

We would like to point out that as we iterate through these steps, the cluster structure evolves with the choice of variables. This makes the problem of variable selection in the context of clustering much more complicated than in classification or regression analysis. In classification, the group structure of the observations is prespecified in the training data and guides the selection. Here, however, the analysis is fully data-based. In addition, in the context of clustering, including unnecessary variables can have severe implications, because it may obscure the true grouping of the observations.
5.1 Variable Selection Update

In step 1, we update γ using a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied in regression by Brown et al. (1998a), among others, and recently by Sha et al. (2004) in multinomial probit models for classification. In this approach a new candidate γ^new is generated by randomly choosing one of the following two transition moves:

1. Add/delete: Randomly choose one of the p indices in γ^old and change its value, from 0 to 1 or from 1 to 0, to become γ^new.
2. Swap: Choose independently and at random a 0 and a 1 in γ^old, and switch their values to get γ^new.

The new candidate γ^new is accepted with probability

\[
\min\Biggl\{1,\; \frac{f(\gamma^{new}|X, y, w, G)}{f(\gamma^{old}|X, y, w, G)}\Biggr\}.
\]
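The two transition moves and the accept/reject step can be sketched compactly. This is an illustrative skeleton (names and the toy posterior are ours); in the actual sampler, `log_post` would be the log of (11):

```python
import numpy as np

def propose_gamma(gamma, rng):
    """Add/delete or swap move on the inclusion vector gamma."""
    g = gamma.copy()
    ones, zeros = np.flatnonzero(g), np.flatnonzero(~g)
    if rng.random() < 0.5 or len(ones) == 0 or len(zeros) == 0:
        j = rng.integers(len(g))       # add/delete: flip one coordinate
        g[j] = ~g[j]
    else:                              # swap: move a 1 onto a current 0
        g[rng.choice(ones)] = False
        g[rng.choice(zeros)] = True
    return g

def metropolis_step(gamma, log_post, rng):
    """Accept gamma_new with probability min{1, f(new)/f(old)}."""
    g_new = propose_gamma(gamma, rng)
    if np.log(rng.random()) < log_post(g_new) - log_post(gamma):
        return g_new
    return gamma

# Toy run with a posterior that penalizes model size: log f(gamma) = -p_gamma.
rng = np.random.default_rng(4)
gamma = np.zeros(10, dtype=bool)
gamma[0] = True
for _ in range(100):
    gamma = metropolis_step(gamma, lambda g: -float(g.sum()), rng)
print(gamma.sum())
```

Because both moves are symmetric proposals, the proposal density cancels and the acceptance ratio reduces to the posterior ratio, as in the text.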
5.2 Update of Weights and Cluster Allocation

In step 2, we use a Gibbs move to sample w from its full conditional in (12). This can be done by drawing independent gamma random variables with common scale and shape parameters α + n_1, ..., α + n_G, then scaling them to sum to 1.
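The gamma-then-normalize construction of a Dirichlet draw is one line each; a minimal sketch, with the cluster counts (4, 3, 6, 2) borrowed from the simulation example:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, counts = 1.0, np.array([4, 3, 6, 2])       # alpha and n_1, ..., n_G

g = rng.gamma(shape=alpha + counts, scale=1.0)    # independent gammas
w = g / g.sum()                                   # normalize onto the simplex
print(np.isclose(w.sum(), 1.0))  # True
```

This is exactly a draw from Dirichlet(α + n_1, ..., α + n_G), since normalized independent gammas with a common scale are Dirichlet distributed.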
In step 3, we update the allocation vector y one element at a time via a sub-Gibbs sampling strategy. The full conditional probability that the ith observation is in the kth cluster is given by

\[
f\bigl(y_i = k \mid X, y^{(-i)}, \gamma, w, G\bigr) \propto f\bigl(X, y_i = k, y^{(-i)} \mid G, w, \gamma\bigr),
\tag{13}
\]

where y^{(−i)} is the cluster assignment for all observations except the ith one.
5.3 Reversible-Jump MCMC

The last two steps, split/merge and birth/death moves, cause the creation or deletion of new components and consequently require a sampler that jumps between different dimensional spaces. A popular method for defining such a sampler is RJMCMC (Green 1995; Richardson and Green 1997), which extends the Metropolis-Hastings algorithm to general state spaces. Suppose that a move of type m from a countable family of move types is proposed, from ψ to a higher-dimensional space ψ′. This will very often be implemented by drawing a vector of continuous random variables u, independent of ψ, and setting ψ′ to be a deterministic and invertible function of ψ and u. The reverse of this move (from ψ′ to ψ) can be accomplished by using the inverse transformation, so that in this direction the proposal is deterministic. The move is then accepted with probability

\[
\min\Biggl\{1,\; \frac{f(\psi'|X)\, r_m(\psi')}{f(\psi|X)\, r_m(\psi)\, q(u)} \left|\frac{\partial \psi'}{\partial(\psi, u)}\right|\Biggr\},
\tag{14}
\]

where r_m(ψ) is the probability of choosing move type m when in state ψ, and q(u) is the density function of u. The final term in the foregoing ratio is a Jacobian arising from the change of variable from (ψ, u) to ψ′.
5.3.1 Split and Merge Moves. In step 4 we make a random choice between attempting to split a nonempty cluster or combine two clusters, with probabilities b_k = .5 and d_k = 1 − b_k (k = 2, …, G_max − 1, and b_{G_max} = 0).

The split move begins by randomly selecting a nonempty cluster l and dividing it into two components l_1 and l_2 with weights w_{l_1} = w_l u and w_{l_2} = w_l(1 − u), where u is a beta(2, 2) random variable. The parameters ψ = (G, w^old, γ, y^old) are updated to ψ′ = (G + 1, w^new, γ, y^new). The acceptance probability for this move is given by min(1, A), where
\begin{aligned}
A &= \frac{f(G+1, w^{new}, \gamma, y^{new}\mid X)}{f(G, w^{old}, \gamma, y^{old}\mid X)} \times \frac{q(\psi\mid\psi')}{q(\psi'\mid\psi)\times f(u)} \times \left|\frac{\partial(w_{l_1}, w_{l_2})}{\partial(w_l, u)}\right|\\
&= \frac{f(X, y^{new}\mid w^{new}, \gamma, G+1)\times f(w^{new}\mid G+1)\times f(G+1)}{f(X, y^{old}\mid w^{old}, \gamma, G)\times f(w^{old}\mid G)\times f(G)}\\
&\qquad \times \frac{d_{G+1}\times P_{chos}}{b_G\times G_1^{-1}\times P_{alloc}\times f(u)} \times w_l, \qquad (15)
\end{aligned}
where

\frac{f(w^{new}\mid G+1)}{f(w^{old}\mid G)} = \frac{w_{l_1}^{\alpha-1}\, w_{l_2}^{\alpha-1}}{w_l^{\alpha-1}\, B(\alpha, G\alpha)}

and

\frac{f(G+1)}{f(G)} = \begin{cases} 1 & \text{if } G \sim \text{uniform}[2, G_{max}]\\ \lambda/(G+1) & \text{if } G \sim \text{truncated Poisson}(\lambda). \end{cases}
Here G_1 is the number of nonempty components before the split, and P_alloc is the probability that the particular allocation is made; it is set to u^{n_{l_1}}(1 − u)^{n_{l_2}}, where n_{l_1} and n_{l_2} are the number of observations assigned to l_1 and l_2. Note that Richardson and Green (1997) calculated P_alloc using the full conditional of the allocation variables, P(y_i = k | rest). In our case, because the component parameters μ and Σ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. P_chos is the probability of selecting two components for the reverse move. Obviously, we want to combine clusters that are closest to one another. However, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:
a. Split results in two mutually adjacent components ⇒ P_chos = (G_1^*/G^*) × (2/G_1^*).

b. Split results in components that are not mutually adjacent, that is, l_1 has l_2 as its closest component, but there is another cluster that is closer to l_2 than l_1 ⇒ P_chos = (G_1^*/G^*) × (1/G_1^*).

c. One of the resulting components is empty ⇒ P_chos = (G_0^*/G^*) × (1/G_0^*) × (1/G_1^*).
Here G^* is the total number of clusters after the split, and G_0^* and G_1^* are the number of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
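The weight part of the split transformation, and the Jacobian w_l appearing in (15), can be checked numerically; a sketch under our own naming:

```python
import numpy as np

def split_weights(w_l, u):
    """Split weight w_l into (w_l1, w_l2) = (w_l * u, w_l * (1 - u))."""
    return w_l * u, w_l * (1.0 - u)

# Analytic Jacobian: |d(w_l1, w_l2) / d(w_l, u)| = w_l.
w_l, u = 0.4, 0.3
eps = 1e-6
J = np.empty((2, 2))
J[:, 0] = (np.array(split_weights(w_l + eps, u)) -
           np.array(split_weights(w_l, u))) / eps   # partials w.r.t. w_l
J[:, 1] = (np.array(split_weights(w_l, u + eps)) -
           np.array(split_weights(w_l, u))) / eps   # partials w.r.t. u
numeric = abs(np.linalg.det(J))
```

The determinant is u·(−w_l) − w_l·(1 − u) = −w_l, so its absolute value is exactly the w_l entering the split acceptance ratio.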
The merge move begins by randomly choosing two components l_1 and l_2 to be combined into a single cluster l. The updated parameters are ψ = (G, w^old, γ, y^old) → ψ′ = (G − 1, w^new, γ, y^new), where w_l = w_{l_1} + w_{l_2}. The acceptance probability for this move is given by min(1, A), where

\begin{aligned}
A &= \frac{f(X, y^{new}\mid \gamma, w^{new}, G-1)\times f(w^{new}\mid G-1)\times f(G-1)}{f(X, y^{old}\mid \gamma, w^{old}, G)\times f(w^{old}\mid G)\times f(G)}\\
&\qquad \times \frac{b_{G-1}\times P_{alloc}\times (G_1-1)^{-1}\times f(u)}{d_G\times P_{chos}} \times (w_{l_1}+w_{l_2})^{-1}, \qquad (16)
\end{aligned}
where

\frac{f(w^{new}\mid G-1)}{f(w^{old}\mid G)} = \frac{w_l^{\alpha-1}\, B(\alpha, (G-1)\alpha)}{w_{l_1}^{\alpha-1}\, w_{l_2}^{\alpha-1}}

and f(G − 1)/f(G) equals 1 or G/λ for the uniform and Poisson priors on G. G_1 is the number of nonempty clusters before the combining move, f(u) is the beta(2, 2) density for u = w_{l_1}/w_l, and P_alloc and P_chos are defined similarly to the split move, but now G_0^* and G_1^* in P_chos correspond to the number of empty and nonempty clusters after the merge.
5.3.2 Birth and Death Moves. We first make a random choice between a birth move and a death move, using the same probabilities b_k and d_k = 1 − b_k as earlier. For a birth move, the updated parameters are ψ = (G, w^old, γ, y) → ψ′ = (G + 1, w^new, γ, y). The weight for the proposed new component is drawn using w* ~ beta(1, G), and the existing weights are rescaled to w′_k = w_k(1 − w*), k = 1, …, G (where k indexes the component labels), so that all of the weights sum to 1. Let G_0 be the number of empty components before the birth. The acceptance probability for a birth is min(1, A), where
\begin{aligned}
A &= \frac{f(X, y\mid w^{new}, \gamma, G+1)\times f(w^{new}\mid G+1)\times f(G+1)}{f(X, y\mid w^{old}, \gamma, G)\times f(w^{old}\mid G)\times f(G)}\\
&\qquad \times \frac{d_{G+1}\times (G_0+1)^{-1}}{b_G\times f(w^*)} \times (1-w^*)^{G-1}, \qquad (17)
\end{aligned}

where

\frac{f(w^{new}\mid G+1)}{f(w^{old}\mid G)} = \frac{(w^*)^{\alpha-1}(1-w^*)^{G(\alpha-1)}}{B(\alpha G, \alpha)},

and (1 − w*)^{G−1} is the Jacobian of the transformation (w^old, w*) → w^new.

For the death move, an empty component is chosen at random and deleted. Let w_l be the weight of the deleted component. The remaining weights are rescaled to sum to 1 (w′_k = w_k/(1 − w_l)), and ψ = (G, w^old, γ, y) → ψ′ = (G − 1, w^new, γ, y). The acceptance probability for this move is min(1, A), where

\begin{aligned}
A &= \frac{f(X, y\mid w^{new}, \gamma, G-1)\times f(w^{new}\mid G-1)\times f(G-1)}{f(X, y\mid w^{old}, \gamma, G)\times f(w^{old}\mid G)\times f(G)}\\
&\qquad \times \frac{b_{G-1}\times f(w_l)}{d_G\times G_0^{-1}} \times (1-w_l)^{-(G-2)}, \qquad (18)
\end{aligned}
where

\frac{f(w^{new}\mid G-1)}{f(w^{old}\mid G)} = \frac{B(\alpha, (G-1)\alpha)}{w_l^{\alpha-1}(1-w_l)^{(G-1)(\alpha-1)}},

G_0 is the number of empty components before the death move, and f(w_l) is the beta(1, G) density.
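The birth-move weight bookkeeping can be sketched as follows (names are ours); the closed-form Jacobian (1 − w*)^{G−1} of (w^old, w*) → w^new accompanies the rescaled weights:

```python
import numpy as np

def birth_weights(w_old, w_star):
    """Rescale the G existing weights by (1 - w*) and append the
    new component's weight w*, so the new vector sums to 1."""
    return np.append(w_old * (1.0 - w_star), w_star)

rng = np.random.default_rng(3)
G = 4
w_old = rng.dirichlet(np.ones(G))
w_star = rng.beta(1, G)              # proposal for the new component's weight
w_new = birth_weights(w_old, w_star)
jacobian = (1.0 - w_star) ** (G - 1)
```

The death move simply inverts this: delete an empty component's weight w_l and divide the remaining weights by (1 − w_l).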
6 POSTERIOR INFERENCE
We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate for this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.
6.1 Label Switching
In finite mixture models an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions, with up to G! copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance, w_1 < ⋯ < w_G. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.
Let P(ξ) = [p_{ik}(ξ)] be the matrix of allocation probabilities,

p_{ik}(\xi) = \Pr(y_i = k\mid X, \gamma, w, \theta) = \frac{w_k\, f(x_{i(\gamma)}\mid \theta_{k(\gamma)})}{\sum_j w_j\, f(x_{i(\gamma)}\mid \theta_{j(\gamma)})}. \qquad (19)
In the context of clustering, the loss function can be defined on the cluster labels y:

L_0(y, \xi) = -\sum_{i=1}^n \log p_{i y_i}(\xi). \qquad (20)
Let ξ^(1), …, ξ^(M) be the sampled parameters, and let ν_1, …, ν_M be the permutations applied to them. The parameters ξ^(t) correspond to the component parameters w^(t), μ^(t), and Σ^(t) sampled at iteration t. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set ξ^(t) = (w_1^(t), …, w_G^(t), x̄_{1(γ)}^(t), …, x̄_{G(γ)}^(t), S_{1(γ)}^(t), …, S_{G(γ)}^(t)), where x̄_{k(γ)}^(t) and S_{k(γ)}^(t) are the estimates of cluster k's mean and covariance based on the model visited at iteration t.
The relabeling algorithm proceeds by selecting initial values for the ν_t's, which we take to be the identity permutation, then iterating the following steps until a fixed point is reached:

a. Choose ŷ to minimize \sum_{t=1}^M L_0(ŷ, ν_t(ξ^{(t)})).

b. For t = 1, …, M, choose ν_t to minimize L_0(ŷ, ν_t(ξ^{(t)})).
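For modest G, this relabeling scheme can be sketched with a brute-force search over all G! permutations at each step (function names are ours, and the allocation-probability array stands in for the p_ik(ξ^(t)) of (19)):

```python
import numpy as np
from itertools import permutations

def loss(Pt, y_hat, perm):
    """L0(y, xi) = -sum_i log p_{i, y_i} after permuting labels by perm."""
    perm = np.asarray(perm)
    return -np.sum(np.log(Pt[np.arange(len(y_hat)), perm[y_hat]]))

def relabel(P):
    """Stephens-type relabeling for an (M, n, G) array of allocation
    probabilities. Returns per-iteration permutations and y_hat."""
    M, n, G = P.shape
    nu = [np.arange(G) for _ in range(M)]        # identity permutations
    while True:
        # Step a: y_hat minimizing sum_t L0(y_hat, nu_t(xi^(t))).
        S = np.mean([np.log(P[t][:, nu[t]]) for t in range(M)], axis=0)
        y_hat = S.argmax(axis=1)
        # Step b: for each t, the permutation minimizing L0.
        new_nu = [np.asarray(min(permutations(range(G)),
                                 key=lambda pm: loss(P[t], y_hat, pm)))
                  for t in range(M)]
        if all((a == b).all() for a, b in zip(nu, new_nu)):
            return new_nu, y_hat
        nu = new_nu

# Toy run: iteration 2 has its two component labels switched.
base = np.array([[.9, .1], [.85, .15], [.2, .8], [.1, .9]])
P = np.stack([base, base, base[:, ::-1]])
nu, y_hat = relabel(P)
```

On this toy input the algorithm recovers the swap in the third iteration and aligns all draws to a common labeling. For large G, the exhaustive search over permutations would be replaced by an assignment-problem solver.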
6.2 Posterior Densities and Posterior Estimates
Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector γ. We compute the marginal posterior probability that sample i is allocated to cluster k, y_i = k, as
\begin{aligned}
p(y_i = k\mid X, G) &= \int p(y_i = k, y_{(-i)}, \gamma, w \mid X, G)\, dy_{(-i)}\, d\gamma\, dw\\
&\propto \int p(X, y_i = k, y_{(-i)}, \gamma, w \mid G)\, dy_{(-i)}\, d\gamma\, dw\\
&= \int p(X, y_i = k, y_{(-i)} \mid G, \gamma, w)\, p(\gamma\mid G)\, p(w\mid G)\, dy_{(-i)}\, d\gamma\, dw\\
&\approx \sum_{t=1}^M p(X, y_i^{(t)} = k, y_{(-i)}^{(t)} \mid G, \gamma^{(t)}, w^{(t)})\, p(\gamma^{(t)}\mid G)\, p(w^{(t)}\mid G), \qquad (21)
\end{aligned}

where y_{(-i)}^{(t)} is the vector y^{(t)} at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density,

\hat{y}_i = \arg\max_{1\le k\le G} p(y_i = k\mid X, G). \qquad (22)
Similarly, the marginal posterior for γ_j = 1 can be computed as

\begin{aligned}
p(\gamma_j = 1\mid X, G) &= \int p(\gamma_j = 1, \gamma_{(-j)}, w, y \mid X, G)\, d\gamma_{(-j)}\, dy\, dw\\
&\propto \int p(X, y, \gamma_j = 1, \gamma_{(-j)}, w \mid G)\, d\gamma_{(-j)}\, dy\, dw\\
&= \int p(X, y \mid G, \gamma_j = 1, \gamma_{(-j)}, w)\, p(\gamma\mid G)\, p(w\mid G)\, d\gamma_{(-j)}\, dy\, dw\\
&\approx \sum_{t=1}^M p(X, y^{(t)} \mid G, \gamma_j^{(t)} = 1, \gamma_{(-j)}^{(t)}, w^{(t)})\, p(\gamma_j = 1, \gamma_{(-j)}^{(t)}\mid G)\, p(w^{(t)}\mid G), \qquad (23)
\end{aligned}

where γ_{(-j)}^{(t)} is the vector γ^{(t)} at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, p(γ_j = 1 | X, G) > a, where a is chosen arbitrarily:

\hat{\gamma}_j = I_{\{p(\gamma_j = 1\mid X, G) > a\}}. \qquad (24)
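As a rough simplification of this thresholding, the marginal inclusion probabilities are sometimes approximated by the raw frequency with which each γ_j = 1 across the sampled models, rather than by the posterior-weighted sum in (23); a sketch of that simplified rule (our shortcut, not the paper's exact estimator):

```python
import numpy as np

def select_variables(gamma_samples, a=0.5):
    """Approximate p(gamma_j = 1 | X, G) by the frequency with which
    variable j is included across MCMC draws, then threshold at a."""
    gamma_samples = np.asarray(gamma_samples)
    marginals = gamma_samples.mean(axis=0)     # inclusion frequency per variable
    selected = np.flatnonzero(marginals > a)
    return marginals, selected

# Toy draws: variables 0 and 2 are included in most sampled models.
draws = np.array([[1, 0, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 1, 0],
                  [1, 0, 0, 0]])
marginals, selected = select_variables(draws, a=0.5)
```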
An alternative variable selection can be performed by considering the vector γ with largest posterior probability among all visited vectors,

\gamma^* = \arg\max_{1\le t\le M} p(\gamma^{(t)}\mid X, G, \hat{w}, \hat{y}), \qquad (25)

where ŷ is the vector of sample allocations estimated via (22) and ŵ is given by ŵ = (1/M)\sum_{t=1}^M w^{(t)}. The estimate given in (25) considers the joint density of γ, rather than the marginal distributions of the individual elements as in (24). In the same spirit, an
estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,

y^* = \arg\max_{1\le t\le M} p(y^{(t)}\mid X, G, \hat{w}, \hat{\gamma}), \qquad (26)

where γ̂ is the estimate in (24).
6.3 Class Prediction
The MCMC output can also be used to predict the class membership of future observations X^f. The predictive density is given by

\begin{aligned}
p(y^f = k\mid X^f, X, G) &= \int p(y^f = k, y, w, \gamma\mid X^f, X, G)\, dy\, d\gamma\, dw\\
&\propto \int p(X^f, X, y^f = k, y\mid G, w, \gamma)\, p(\gamma\mid G)\, p(w\mid G)\, dy\, d\gamma\, dw\\
&\approx \sum_{t=1}^M p(X^f, X, y^f = k, y^{(t)}\mid \gamma^{(t)}, w^{(t)}, G)\, p(\gamma^{(t)}\mid G)\, p(w^{(t)}\mid G), \qquad (27)
\end{aligned}

and the observations are allocated according to

\hat{y}^f = \arg\max_{1\le k\le G} p(y^f = k\mid X^f, X, G). \qquad (28)
7 PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates: petal width, petal height, sepal width, and sepal height, measured on 50 flowers from each of three species, Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h_1 = 100. We set G_max = 10 and κ_1 = .03, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, G_max]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumption of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for sample allocations ŷ, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is a strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is a strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped
Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.
Table 1. Iris Data: Posterior Distribution of G

       Poisson(λ = 5)          Poisson(λ = 10)         Uniform[1, 10]
k      Different Σk  Equal Σ   Different Σk  Equal Σ   Different Σk  Equal Σ
3      .6466         0         .5019         0         .8548         0
4      .3124         .7413     .4291         .4384     .1417         .6504
5      .0403         .3124     .0545         .2335     .0035         .3446
6      .0007         .0418     .0138         .2786     0             .0050
7      0             0         .0007         .0484     0             0
8      0             0         0             .0011     0             0
together, except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k|X, G = 4), computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y_104 = 3|X, G) = .4169 and p(y_104 = 4|X, G) = .5823.

We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X) assuming G ~ Poisson(λ = 10) and G ~ uniform[1, G_max] are given in Table 1. Under both priors we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h_1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h_1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h_1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation, and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k|X, G); there is little support for assigning them to separate clusters.
We use this example to also illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400 between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of w_k is removed. It is now straightforward to obtain sensible estimates using the posterior means.
Table 2. Iris Data: Sample Allocation ŷ

                  Heterogeneous covariance    Homogeneous covariance
                  I     II    III             I     II    III   IV
Iris setosa       50    0     0               50    0     0     0
Iris versicolor   0     45    5               0     49    1     0
Iris virginica    0     0     50              0     2     36    12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(y_i = k|X, G = 4), i = 1, …, 150, k = 1, …, 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

x_{ij} \sim I_{\{1\le i\le 4\}} N(\mu_1, \sigma_1^2) + I_{\{5\le i\le 7\}} N(\mu_2, \sigma_2^2) + I_{\{8\le i\le 13\}} N(\mu_3, \sigma_3^2) + I_{\{14\le i\le 15\}} N(\mu_4, \sigma_4^2),
\qquad i = 1, \ldots, 15,\ j = 1, \ldots, 20,

where I_{(·)} is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ_1 = 5, μ_2 = 2, μ_3 = −3, and μ_4 = −6, and the component variances were set to σ_1² = 1.5, σ_2² = 1, σ_3² = .5, and σ_4² = 2. We drew an additional set of (p − 20) noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1,000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
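The data-generating scheme above can be sketched as follows (the seed and helper name are ours):

```python
import numpy as np

def simulate(p=1000, rng=None):
    """15 x p data matrix: the first 20 columns discriminate four
    groups of sizes (4, 3, 6, 2); the remaining p - 20 columns are
    N(0, 1) noise."""
    rng = rng or np.random.default_rng(0)
    sizes = [4, 3, 6, 2]                   # group sizes, n = 15 total
    means = [5.0, 2.0, -3.0, -6.0]         # mu_1, ..., mu_4
    sds = np.sqrt([1.5, 1.0, 0.5, 2.0])    # sigma_1, ..., sigma_4
    blocks = [rng.normal(m, s, size=(n_k, 20))
              for n_k, m, s in zip(sizes, means, sds)]
    noise = rng.normal(size=(15, p - 20))  # nonclustering variables
    return np.hstack([np.vstack(blocks), noise])

X = simulate(p=1000)
```

In the study the columns are then permuted to disperse the 20 discriminating predictors among the noise variables.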
We permuted the columns of the data matrix X to disperse the predictors. For each dataset we took δ = 3, α = 1,
Table 3. Iris Data: Sensitivity to Hyperparameter h_1. Results With h_1 = 10

Posterior distribution of G
k            4      5      6      7      8      9      10
p(G = k|X)   .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations ŷ
                  I     II    III   IV    V     VI
Iris setosa       50    0     0     0     0     0
Iris versicolor   0     45    1     0     4     0
Iris virginica    0     2     35    12    0     1
Figure 3. Iris Data, Sensitivity to h_1: Results for h_1 = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k|X, G = 6), i = 1, …, 5, k = 1, …, 6.
h_1 = h_0 = 100, and G_max = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G. Here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ_1 and κ_0; their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the non-zero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is a strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γ_j = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables p_γ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

k     p(G = k|X)
2     .0111
3     .1447
4     .3437
5     .3375
6     .1326
7     .0216
8     .0040
9     .0015
10    .0008
11    .0005
12    .0007
13    .0012
14    .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(y_i = k|X, G = 4), i = 1, …, 15, k = 1, …, 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables, p(γ_j = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h_1 = h_0 = 100. We set κ_1 = .001 and κ_0 = .01 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with G_max = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γ_j set to 0 for all j's except one randomly selected; (2) γ_j set to 1 for 10 randomly selected j's; (3) γ_j set to 1 for 25 randomly selected j's; and (4) γ_j set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as a burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminate among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with a stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γ_j = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables p_γ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γ_j = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γ_j = 1|X, G = 3), for each of the four chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8 DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data: Concordance of results among the four chains. Pairwise scatterplots of frequencies for inclusion of genes.
Figure 12. Endometrial Data, Union of Four Chains: Marginal posterior probabilities p(γ_j = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data, Union of Four Chains: Marginal posterior probabilities of sample allocations p(y_i = k|X, G = 3), i = 1, …, 14, k = 1, …, 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to \binom{n}{2} probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates, and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: Σ₁ = ··· = Σ_G = Σ

f(X, y | G, w, γ)

  = ∫ f(X, y | G, w, μ, Σ, η, Ω, γ) f(μ | G, Σ, γ) f(Σ | G, γ) f(η | Ω, γ) f(Ω | γ) dμ dΣ dη dΩ

  = ∫ ∏_{k=1}^G w_k^{n_k} |Σ_(γ)|^{−(n_k+1)/2} (2π)^{−(n_k+1)p_γ/2} h₁^{−p_γ/2}

    × exp{ −(1/2) ∑_{k=1}^G ∑_{x_i(γ) ∈ C_k} (x_i(γ) − μ_k(γ))ᵀ Σ_(γ)⁻¹ (x_i(γ) − μ_k(γ)) }

    × exp{ −(1/2) ∑_{k=1}^G (μ_k(γ) − μ_0(γ))ᵀ (h₁ Σ_(γ))⁻¹ (μ_k(γ) − μ_0(γ)) }

    × 2^{−p_γ(δ+p_γ−1)/2} π^{−p_γ(p_γ−1)/4} [ ∏_{j=1}^{p_γ} Γ((δ+p_γ−j)/2) ]⁻¹ |Q_1(γ)|^{(δ+p_γ−1)/2} |Σ_(γ)|^{−(δ+2p_γ)/2} exp{ −(1/2) tr(Σ_(γ)⁻¹ Q_1(γ)) }

    × |Ω_(γᶜ)|^{−(n+1)/2} (2π)^{−(n+1)(p−p_γ)/2} h₀^{−(p−p_γ)/2}

    × exp{ −(1/2) ∑_{i=1}^n (x_i(γᶜ) − η_(γᶜ))ᵀ Ω_(γᶜ)⁻¹ (x_i(γᶜ) − η_(γᶜ)) }

    × exp{ −(1/2) (η_(γᶜ) − μ_0(γᶜ))ᵀ (h₀ Ω_(γᶜ))⁻¹ (η_(γᶜ) − μ_0(γᶜ)) }

    × 2^{−(p−p_γ)(δ+p−p_γ−1)/2} π^{−(p−p_γ)(p−p_γ−1)/4} [ ∏_{j=1}^{p−p_γ} Γ((δ+p−p_γ−j)/2) ]⁻¹ |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Ω_(γᶜ)|^{−(δ+2(p−p_γ))/2} exp{ −(1/2) tr(Ω_(γᶜ)⁻¹ Q_0(γᶜ)) } dμ dΣ dη dΩ

Completing the squares in μ_k(γ) and η_(γᶜ) gives

  = ∫ ∏_{k=1}^G (2π)^{−(n_k+1)p_γ/2} h₁^{−p_γ/2} w_k^{n_k} |Σ_(γ)|^{−(n_k+1)/2} exp{ −(1/2) tr(Σ_(γ)⁻¹ S_k(γ)) }

    × exp{ −(1/2) ∑_{k=1}^G (μ_k(γ) − m_k(γ))ᵀ [(n_k + h₁⁻¹)⁻¹ Σ_(γ)]⁻¹ (μ_k(γ) − m_k(γ)) },  where m_k(γ) = ( ∑_{x_i(γ) ∈ C_k} x_i(γ) + h₁⁻¹ μ_0(γ) ) / (n_k + h₁⁻¹),

    × 2^{−p_γ(δ+p_γ−1)/2} π^{−p_γ(p_γ−1)/4} [ ∏_{j=1}^{p_γ} Γ((δ+p_γ−j)/2) ]⁻¹ |Q_1(γ)|^{(δ+p_γ−1)/2} |Σ_(γ)|^{−(δ+2p_γ)/2} exp{ −(1/2) tr(Σ_(γ)⁻¹ Q_1(γ)) }

    × (2π)^{−(n+1)(p−p_γ)/2} h₀^{−(p−p_γ)/2} |Ω_(γᶜ)|^{−(n+1)/2} exp{ −(1/2) tr(Ω_(γᶜ)⁻¹ S_0(γᶜ)) }

    × exp{ −(1/2) (η_(γᶜ) − m_0(γᶜ))ᵀ [(n + h₀⁻¹)⁻¹ Ω_(γᶜ)]⁻¹ (η_(γᶜ) − m_0(γᶜ)) },  where m_0(γᶜ) = ( ∑_{i=1}^n x_i(γᶜ) + h₀⁻¹ μ_0(γᶜ) ) / (n + h₀⁻¹),

    × 2^{−(p−p_γ)(δ+p−p_γ−1)/2} π^{−(p−p_γ)(p−p_γ−1)/4} [ ∏_{j=1}^{p−p_γ} Γ((δ+p−p_γ−j)/2) ]⁻¹ |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Ω_(γᶜ)|^{−(δ+2(p−p_γ))/2} exp{ −(1/2) tr(Ω_(γᶜ)⁻¹ Q_0(γᶜ)) } dμ dΣ dη dΩ

Integrating out μ and η then yields

  = [ ∏_{k=1}^G (h₁ n_k + 1)^{−p_γ/2} w_k^{n_k} ] π^{−n p_γ/2} |Q_1(γ)|^{(δ+p_γ−1)/2} 2^{−p_γ(δ+n+p_γ−1)/2} π^{−p_γ(p_γ−1)/4} [ ∏_{j=1}^{p_γ} Γ((δ+p_γ−j)/2) ]⁻¹

    × ∫ |Σ_(γ)|^{−(δ+n+2p_γ)/2} exp{ −(1/2) tr( Σ_(γ)⁻¹ [ Q_1(γ) + ∑_{k=1}^G S_k(γ) ] ) } dΣ
616 Journal of the American Statistical Association June 2005
    × (h₀ n + 1)^{−(p−p_γ)/2} π^{−n(p−p_γ)/2} |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} 2^{−(p−p_γ)(δ+n+p−p_γ−1)/2} π^{−(p−p_γ)(p−p_γ−1)/4} [ ∏_{j=1}^{p−p_γ} Γ((δ+p−p_γ−j)/2) ]⁻¹

    × ∫ |Ω_(γᶜ)|^{−(δ+n+2(p−p_γ))/2} exp{ −(1/2) tr( Ω_(γᶜ)⁻¹ [ Q_0(γᶜ) + S_0(γᶜ) ] ) } dΩ

  = π^{−np/2} [ ∏_{k=1}^G (h₁ n_k + 1)^{−p_γ/2} w_k^{n_k} ] (h₀ n + 1)^{−(p−p_γ)/2}

    × ∏_{j=1}^{p_γ} [ Γ((δ+n+p_γ−j)/2) / Γ((δ+p_γ−j)/2) ] × ∏_{j=1}^{p−p_γ} [ Γ((δ+n+p−p_γ−j)/2) / Γ((δ+p−p_γ−j)/2) ]

    × |Q_1(γ)|^{(δ+p_γ−1)/2} |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} | Q_1(γ) + ∑_{k=1}^G S_k(γ) |^{−(δ+n+p_γ−1)/2} | Q_0(γᶜ) + S_0(γᶜ) |^{−(δ+n+p−p_γ−1)/2},

where S_k(γ) and S_0(γᶜ) are given by

S_k(γ) = ∑_{x_i(γ) ∈ C_k} x_i(γ) x_i(γ)ᵀ + h₁⁻¹ μ_0(γ) μ_0(γ)ᵀ − (n_k + h₁⁻¹)⁻¹ ( ∑_{x_i(γ) ∈ C_k} x_i(γ) + h₁⁻¹ μ_0(γ) )( ∑_{x_i(γ) ∈ C_k} x_i(γ) + h₁⁻¹ μ_0(γ) )ᵀ

       = ∑_{x_i(γ) ∈ C_k} (x_i(γ) − x̄_k(γ))(x_i(γ) − x̄_k(γ))ᵀ + [ n_k / (h₁ n_k + 1) ] (μ_0(γ) − x̄_k(γ))(μ_0(γ) − x̄_k(γ))ᵀ

and

S_0(γᶜ) = ∑_{i=1}^n x_i(γᶜ) x_i(γᶜ)ᵀ + h₀⁻¹ μ_0(γᶜ) μ_0(γᶜ)ᵀ − (n + h₀⁻¹)⁻¹ ( ∑_{i=1}^n x_i(γᶜ) + h₀⁻¹ μ_0(γᶜ) )( ∑_{i=1}^n x_i(γᶜ) + h₀⁻¹ μ_0(γᶜ) )ᵀ

        = ∑_{i=1}^n (x_i(γᶜ) − x̄_(γᶜ))(x_i(γᶜ) − x̄_(γᶜ))ᵀ + [ n / (h₀ n + 1) ] (μ_0(γᶜ) − x̄_(γᶜ))(μ_0(γᶜ) − x̄_(γᶜ))ᵀ,

with x̄_k(γ) = (1/n_k) ∑_{x_i(γ) ∈ C_k} x_i(γ) and x̄_(γᶜ) = (1/n) ∑_{i=1}^n x_i(γᶜ).
Heterogeneous Covariances: Σ₁, …, Σ_G

f(X, y | G, w, γ)

  = ∫ f(X, y | G, w, μ, Σ, η, Ω, γ) f(μ | G, Σ, γ) f(Σ | G, γ) f(η | Ω, γ) f(Ω | γ) dμ dΣ dη dΩ

  = ∫ ∏_{k=1}^G w_k^{n_k} |Σ_k(γ)|^{−(n_k+1)/2} (2π)^{−(n_k+1)p_γ/2} h₁^{−p_γ/2} × 2^{−p_γ(δ+p_γ−1)/2} π^{−p_γ(p_γ−1)/4} [ ∏_{j=1}^{p_γ} Γ((δ+p_γ−j)/2) ]⁻¹

    × exp{ −(1/2) ∑_{k=1}^G ∑_{x_i(γ) ∈ C_k} (x_i(γ) − μ_k(γ))ᵀ Σ_k(γ)⁻¹ (x_i(γ) − μ_k(γ)) }

    × exp{ −(1/2) ∑_{k=1}^G (μ_k(γ) − μ_0(γ))ᵀ (h₁ Σ_k(γ))⁻¹ (μ_k(γ) − μ_0(γ)) }

    × ∏_{k=1}^G |Q_1(γ)|^{(δ+p_γ−1)/2} |Σ_k(γ)|^{−(δ+2p_γ)/2} exp{ −(1/2) tr(Σ_k(γ)⁻¹ Q_1(γ)) } dμ dΣ

    × π^{−n(p−p_γ)/2} (h₀ n + 1)^{−(p−p_γ)/2} ∏_{j=1}^{p−p_γ} [ Γ((δ+n+p−p_γ−j)/2) / Γ((δ+p−p_γ−j)/2) ] × |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Q_0(γᶜ) + S_0(γᶜ)|^{−(δ+n+p−p_γ−1)/2},

where the factors in η and Ω have been integrated exactly as in the homogeneous case. Completing the squares in μ_k(γ) gives

  = ∫_Σ ∫_μ ∏_{k=1}^G (2π)^{−(n_k+1)p_γ/2} h₁^{−p_γ/2} w_k^{n_k} 2^{−p_γ(δ+p_γ−1)/2} π^{−p_γ(p_γ−1)/4} [ ∏_{j=1}^{p_γ} Γ((δ+p_γ−j)/2) ]⁻¹ |Σ_k(γ)|^{−(n_k+1)/2}

    × exp{ −(1/2) ∑_{k=1}^G (μ_k(γ) − m_k(γ))ᵀ [(n_k + h₁⁻¹)⁻¹ Σ_k(γ)]⁻¹ (μ_k(γ) − m_k(γ)) },  where m_k(γ) = ( ∑_{x_i(γ) ∈ C_k} x_i(γ) + h₁⁻¹ μ_0(γ) ) / (n_k + h₁⁻¹),

    × ∏_{k=1}^G |Q_1(γ)|^{(δ+p_γ−1)/2} |Σ_k(γ)|^{−(δ+2p_γ)/2} exp{ −(1/2) tr( Σ_k(γ)⁻¹ [ Q_1(γ) + S_k(γ) ] ) } dμ dΣ

    × π^{−n(p−p_γ)/2} (h₀ n + 1)^{−(p−p_γ)/2}
Tadesse Sha and Vannucci Bayesian Variable Selection 617
    × ∏_{j=1}^{p−p_γ} [ Γ((δ+n+p−p_γ−j)/2) / Γ((δ+p−p_γ−j)/2) ] × |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Q_0(γᶜ) + S_0(γᶜ)|^{−(δ+n+p−p_γ−1)/2}

  = π^{−np/2} ∏_{k=1}^G { (h₁ n_k + 1)^{−p_γ/2} w_k^{n_k} |Q_1(γ)|^{(δ+p_γ−1)/2} 2^{−p_γ(δ+n_k+p_γ−1)/2} π^{−p_γ(p_γ−1)/4} [ ∏_{j=1}^{p_γ} Γ((δ+p_γ−j)/2) ]⁻¹ }

    × ∫ ∏_{k=1}^G |Σ_k(γ)|^{−(δ+n_k+2p_γ)/2} exp{ −(1/2) tr( Σ_k(γ)⁻¹ [ Q_1(γ) + S_k(γ) ] ) } dΣ

    × (h₀ n + 1)^{−(p−p_γ)/2} ∏_{j=1}^{p−p_γ} [ Γ((δ+n+p−p_γ−j)/2) / Γ((δ+p−p_γ−j)/2) ] × |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Q_0(γᶜ) + S_0(γᶜ)|^{−(δ+n+p−p_γ−1)/2}

  = π^{−np/2} ∏_{k=1}^G { (h₁ n_k + 1)^{−p_γ/2} w_k^{n_k} ∏_{j=1}^{p_γ} [ Γ((δ+n_k+p_γ−j)/2) / Γ((δ+p_γ−j)/2) ] |Q_1(γ)|^{(δ+p_γ−1)/2} |Q_1(γ) + S_k(γ)|^{−(δ+n_k+p_γ−1)/2} }

    × (h₀ n + 1)^{−(p−p_γ)/2} ∏_{j=1}^{p−p_γ} [ Γ((δ+n+p−p_γ−j)/2) / Γ((δ+p−p_γ−j)/2) ]

    × |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Q_0(γᶜ) + S_0(γᶜ)|^{−(δ+n+p−p_γ−1)/2}.
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics, and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
H(γᶜ) = (h₀ n + 1)^{−(p−p_γ)/2} ∏_{j=1}^{p−p_γ} [ Γ((n+δ+p−p_γ−j)/2) / Γ((δ+p−p_γ−j)/2) ],

S_k(γ) = ∑_{x_i(γ) ∈ C_k} (x_i(γ) − x̄_k(γ))(x_i(γ) − x̄_k(γ))ᵀ + [ n_k / (h₁ n_k + 1) ] (μ_0(γ) − x̄_k(γ))(μ_0(γ) − x̄_k(γ))ᵀ,

S_0(γᶜ) = ∑_{i=1}^n (x_i(γᶜ) − x̄_(γᶜ))(x_i(γᶜ) − x̄_(γᶜ))ᵀ + [ n / (h₀ n + 1) ] (μ_0(γᶜ) − x̄_(γᶜ))(μ_0(γᶜ) − x̄_(γᶜ))ᵀ.

Here x̄_k(γ) is the sample mean of cluster k, and x̄_(γᶜ) corresponds to the sample means of the nondiscriminating variables. As h₁ → ∞, S_k(γ) reduces to the within-group sum of squares for the kth cluster, and ∑_{k=1}^G S_k(γ) becomes the total within-group sum of squares.

2. Assuming heterogeneous covariances across clusters,

f(X, y | G, w, γ) = π^{−np/2} ∏_{k=1}^G K_k(γ) · |Q_1(γ)|^{(δ+p_γ−1)/2} |Q_1(γ) + S_k(γ)|^{−(n_k+δ+p_γ−1)/2}
    × H(γᶜ) · |Q_0(γᶜ)|^{(δ+p−p_γ−1)/2} |Q_0(γᶜ) + S_0(γᶜ)|^{−(n+δ+p−p_γ−1)/2},    (9)

where K_k(γ) = w_k^{n_k} (h₁ n_k + 1)^{−p_γ/2} ∏_{j=1}^{p_γ} [ Γ((n_k+δ+p_γ−j)/2) / Γ((δ+p_γ−j)/2) ].
The full conditionals for the model parameters are then given by

f(y | G, w, γ, X) ∝ f(X, y | G, w, γ),    (10)

f(γ | G, w, y, X) ∝ f(X, y | G, w, γ) p(γ | G),    (11)

and

w | G, γ, y, X ~ Dirichlet(α + n₁, …, α + n_G).    (12)
5. MODEL FITTING
Posterior samples for the parameters of interest are obtained using Metropolis moves and RJMCMC embedded within a Gibbs sampler. Our MCMC procedure comprises the following five steps:

1. Update γ from its full conditional in (11).
2. Update w from its full conditional in (12).
3. Update y from its full conditional in (10).
4. Split one mixture component into two, or merge two into one.
5. Birth or death of an empty component.

The birth/death moves specifically deal with creating/deleting empty components and do not involve reallocation of the observations. We could have extended these to include nonempty components and removed the split/merge moves. However, as described by Richardson and Green (1997), including both steps provides more efficient moves and improves convergence.

We would like to point out that, as we iterate through these steps, the cluster structure evolves with the choice of variables. This makes the problem of variable selection in the context of clustering much more complicated than in classification or regression analysis. In classification, the group structure of the observations is prespecified in the training data and guides the selection. Here, however, the analysis is fully data-based. In addition, in the context of clustering, including unnecessary variables can have severe implications, because it may obscure the true grouping of the observations.
5.1 Variable Selection Update
In step 1, we update γ using a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied in regression by Brown et al. (1998a), among others, and recently by Sha et al. (2004) in multinomial probit models for classification. In this approach a new candidate γ_new is generated by randomly choosing one of the following two transition moves:

1. Add/delete: Randomly choose one of the p indices in γ_old and change its value, from 0 to 1 or from 1 to 0, to become γ_new.
2. Swap: Choose independently and at random a 0 and a 1 in γ_old and switch their values to get γ_new.

The new candidate γ_new is accepted with probability

min{ 1, f(γ_new | X, y, w, G) / f(γ_old | X, y, w, G) }.
5.2 Update of Weights and Cluster Allocation
In step 2, we use a Gibbs move to sample w from its full conditional in (12). This can be done by drawing independent gamma random variables with common scale and shape parameters α + n₁, …, α + n_G, then scaling them to sum to 1.
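A minimal sketch of this gamma-scaling construction of a Dirichlet draw (the function name `sample_weights` is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(alpha, counts, rng):
    """Draw w ~ Dirichlet(alpha + n_1, ..., alpha + n_G), as in (12), by
    scaling independent Gamma(alpha + n_k, 1) variables to sum to 1."""
    g = rng.gamma(shape=alpha + np.asarray(counts, dtype=float), scale=1.0)
    return g / g.sum()

# Example cluster counts (n_1, ..., n_G) with alpha = 1
w = sample_weights(alpha=1.0, counts=[50, 49, 39, 12], rng=rng)
```

The resulting vector is a valid set of mixture weights: nonnegative and summing to 1.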
In step 3, we update the allocation vector y one element at a time via a sub-Gibbs sampling strategy. The full conditional probability that the ith observation is in the kth cluster is given by

f(y_i = k | X, y_(−i), γ, w, G) ∝ f(X, y_i = k, y_(−i) | G, w, γ),    (13)

where y_(−i) is the cluster assignment for all observations except the ith one.
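One way to sketch this sub-Gibbs step: evaluate the unnormalized full conditional (13) for each candidate label, normalize, and sample. Here `log_marginal` is a hypothetical stand-in for the marginalized likelihood (e.g., (9)) evaluated at a candidate allocation; this is a sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def update_allocation(i, y, G, log_marginal, rng):
    """Sub-Gibbs update of y[i]: f(y_i = k | ...) is proportional to
    f(X, y_i = k, y_(-i) | G, w, gamma), computed here via log_marginal."""
    logp = np.empty(G)
    for k in range(G):
        y_try = y.copy()
        y_try[i] = k
        logp[k] = log_marginal(y_try)
    p = np.exp(logp - logp.max())   # subtract max for numerical stability
    p /= p.sum()
    y[i] = rng.choice(G, p=p)
    return y
```

In the actual sampler each evaluation of `log_marginal` would reuse sufficient statistics rather than rescan the data, since only observation i changes cluster.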
5.3 Reversible-Jump MCMC
The last two steps, the split/merge and birth/death moves, cause the creation or deletion of new components and consequently require a sampler that jumps between different dimensional spaces. A popular method for defining such a sampler is RJMCMC (Green 1995; Richardson and Green 1997), which extends the Metropolis–Hastings algorithm to general state spaces. Suppose that a move of type m, from a countable family of move types, is proposed from ψ to a higher-dimensional space ψ′. This will very often be implemented by drawing a vector of continuous random variables u, independent of ψ, and setting ψ′ to be a deterministic and invertible function of ψ and u. The reverse of this move (from ψ′ to ψ) can be accomplished by using the inverse transformation, so that in this direction the proposal is deterministic. The move is then accepted with probability

min{ 1, [ f(ψ′ | X) r_m(ψ′) ] / [ f(ψ | X) r_m(ψ) q(u) ] × | ∂ψ′ / ∂(ψ, u) | },    (14)

where r_m(ψ) is the probability of choosing move type m when in state ψ, and q(u) is the density function of u. The final term in the foregoing ratio is a Jacobian arising from the change of variable from (ψ, u) to ψ′.
5.3.1 Split and Merge Moves

In step 4, we make a random choice between attempting to split a nonempty cluster or to combine two clusters, with probabilities b_k = 0.5 and d_k = 1 − b_k (k = 2, …, G_max − 1, and b_{G_max} = 0).

The split move begins by randomly selecting a nonempty cluster l and dividing it into two components, l₁ and l₂, with weights w_{l1} = w_l u and w_{l2} = w_l (1 − u), where u is a beta(2, 2) random variable. The parameters ψ = (G, w_old, γ, y_old) are updated to ψ′ = (G + 1, w_new, γ, y_new). The acceptance probability for this move is given by min(1, A), where

A = [ f(G + 1, w_new, γ, y_new | X) / f(G, w_old, γ, y_old | X) ] × [ q(ψ | ψ′) / (q(ψ′ | ψ) f(u)) ] × | ∂(w_{l1}, w_{l2}) / ∂(w_l, u) |

  = [ f(X, y_new | w_new, γ, G + 1) f(w_new | G + 1) f(G + 1) ] / [ f(X, y_old | w_old, γ, G) f(w_old | G) f(G) ]
    × [ d_{G+1} P_chos ] / [ b_G G₁⁻¹ P_alloc f(u) ] × w_l,    (15)

f(w_new | G + 1) / f(w_old | G) = w_{l1}^{α−1} w_{l2}^{α−1} / [ w_l^{α−1} B(α, Gα) ],

and

f(G + 1) / f(G) = 1 if G ~ uniform[2, G_max];  λ/(G + 1) if G ~ truncated Poisson(λ).

Here G₁ is the number of nonempty components before the split, and P_alloc is the probability that the particular allocation is made, which is set to u^{n_{l1}} (1 − u)^{n_{l2}}, where n_{l1} and n_{l2} are the numbers of observations assigned to l₁ and l₂. Note that Richardson and Green (1997) calculated P_alloc using the full conditionals of the allocation variables, P(y_i = k | rest). In our case, because the component parameters μ and Σ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. P_chos is the probability of selecting two components for the reverse move. Obviously, we want to combine clusters that are closest to one another. However, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:
a. The split results in two mutually adjacent components ⇒ P_chos = (G₁*/G*) × (2/G₁*).
b. The split results in components that are not mutually adjacent; that is, l₁ has l₂ as its closest component, but there is another cluster that is closer to l₂ than l₁ ⇒ P_chos = (G₁*/G*) × (1/G₁*).
c. One of the resulting components is empty ⇒ P_chos = (G₀*/G*) × (1/G₀*) × (1/G₁*).

Here G* is the total number of clusters after the split, and G₀* and G₁* are the numbers of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
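The weight bookkeeping for the split and its reverse merge can be sketched as follows; the Jacobian w_l that enters (15) comes from the transformation (w_l, u) → (w_l u, w_l(1 − u)). Function names are ours, and the allocation and acceptance steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

def split_weight(w, l, rng):
    """Split component l: u ~ Beta(2, 2), w_l1 = w_l * u, w_l2 = w_l * (1 - u).
    The Jacobian |d(w_l1, w_l2)/d(w_l, u)| = w_l enters the ratio (15)."""
    u = rng.beta(2.0, 2.0)
    w_l1, w_l2 = w[l] * u, w[l] * (1.0 - u)
    w_new = np.concatenate([np.delete(w, l), [w_l1, w_l2]])
    log_jacobian = np.log(w[l])
    return w_new, u, log_jacobian

def merge_weight(w, l1, l2):
    """Reverse move: combine two components, w_l = w_l1 + w_l2."""
    w_l = w[l1] + w[l2]
    w_new = np.concatenate([np.delete(w, [l1, l2]), [w_l]])
    return w_new, w_l
```

Splitting then merging the two resulting components recovers the original weight vector (up to reordering), which is the deterministic invertibility required by the RJMCMC construction.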
The merge move begins by randomly choosing two components, l₁ and l₂, to be combined into a single cluster l. The updated parameters are ψ = (G, w_old, γ, y_old) → ψ′ = (G − 1, w_new, γ, y_new), where w_l = w_{l1} + w_{l2}. The acceptance probability for this move is given by min(1, A), where

A = [ f(X, y_new | γ, w_new, G − 1) f(w_new | G − 1) f(G − 1) ] / [ f(X, y_old | γ, w_old, G) f(w_old | G) f(G) ]
    × [ b_{G−1} P_alloc (G₁ − 1)⁻¹ f(u) ] / [ d_G P_chos ] × (w_{l1} + w_{l2})⁻¹,    (16)

where f(w_new | G − 1) / f(w_old | G) = w_l^{α−1} B(α, (G − 1)α) / [ w_{l1}^{α−1} w_{l2}^{α−1} ], and f(G − 1)/f(G) equals 1 or G/λ for the uniform and Poisson priors on G. G₁ is the number of nonempty clusters before the combining move, f(u) is the beta(2, 2) density evaluated at u = w_{l1}/w_l, and P_alloc and P_chos are defined similarly to the split move, but now G₀* and G₁* in P_chos correspond to the numbers of empty and nonempty clusters after the merge.
5.3.2 Birth and Death Moves

We first make a random choice between a birth move and a death move, using the same probabilities b_k and d_k = 1 − b_k as earlier. For a birth move, the updated parameters are ψ = (G, w_old, γ, y) → ψ′ = (G + 1, w_new, γ, y). The weight for the proposed new component is drawn using w* ~ beta(1, G), and the existing weights are rescaled to w′_k = w_k (1 − w*), k = 1, …, G (where k indexes the component labels), so that all of the weights sum to 1. Let G₀ be the number of empty components before the birth. The acceptance probability for a birth is min(1, A), where

A = [ f(X, y | w_new, γ, G + 1) f(w_new | G + 1) f(G + 1) ] / [ f(X, y | w_old, γ, G) f(w_old | G) f(G) ]
    × [ d_{G+1} (G₀ + 1)⁻¹ ] / [ b_G f(w*) ] × (1 − w*)^{G−1},    (17)

f(w_new | G + 1) / f(w_old | G) = w*^{α−1} (1 − w*)^{G(α−1)} / B(α, Gα), and (1 − w*)^{G−1} is the Jacobian of the transformation (w_old, w*) → w_new.

For the death move, an empty component is chosen at random and deleted. Let w_l be the weight of the deleted component. The remaining weights are rescaled to sum to 1, w′_k = w_k / (1 − w_l), and ψ = (G, w_old, γ, y) → ψ′ = (G − 1, w_new, γ, y). The acceptance probability for this move is min(1, A), where

A = [ f(X, y | w_new, γ, G − 1) f(w_new | G − 1) f(G − 1) ] / [ f(X, y | w_old, γ, G) f(w_old | G) f(G) ]
    × [ b_{G−1} f(w_l) ] / [ d_G G₀⁻¹ ] × (1 − w_l)^{−(G−2)},    (18)
Here f(w_new | G − 1) / f(w_old | G) = B(α, (G − 1)α) / [ w_l^{α−1} (1 − w_l)^{(G−1)(α−1)} ], G₀ is the number of empty components before the death move, and f(w_l) is the beta(1, G) density.
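The weight rescalings used by the birth and death moves can be sketched as follows (function names ours; the acceptance step is omitted). Note that a death applied to a just-born component exactly inverts the birth, as the dimension-jumping construction requires.

```python
import numpy as np

rng = np.random.default_rng(4)

def birth(w, G, rng):
    """Birth of an empty component: draw w* ~ Beta(1, G) and rescale the
    existing weights by (1 - w*); the Jacobian is (1 - w*)**(G - 1)."""
    w_star = rng.beta(1.0, G)
    w_new = np.concatenate([w * (1.0 - w_star), [w_star]])
    return w_new, w_star

def death(w, l):
    """Death of empty component l: delete w_l and rescale to sum to 1."""
    w_l = w[l]
    w_new = np.delete(w, l) / (1.0 - w_l)
    return w_new, w_l
```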
6. POSTERIOR INFERENCE
We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate for this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.
6.1 Label Switching
In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions with up to G! copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance w₁ < ··· < w_G. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.
Let P(ξ) = [p_ik(ξ)] be the matrix of allocation probabilities, with

p_ik(ξ) = Pr(y_i = k | X, γ, w, θ) = w_k f(x_i(γ) | θ_k(γ)) / ∑_j w_j f(x_i(γ) | θ_j(γ)).    (19)

In the context of clustering, the loss function can be defined on the cluster labels y:

L₀(y, ξ) = − ∑_{i=1}^n log p_{i y_i}(ξ).    (20)

Let ξ^(1), …, ξ^(M) be the sampled parameters, and let ν₁, …, ν_M be the permutations applied to them. The parameters ξ^(t) correspond to the component parameters w^(t), μ^(t), and Σ^(t) sampled at iteration t. However, in our methodology, the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set ξ^(t) = {w₁^(t), …, w_G^(t), x̄₁(γ)^(t), …, x̄_G(γ)^(t), S₁(γ)^(t), …, S_G(γ)^(t)}, where x̄_k(γ)^(t) and S_k(γ)^(t) are the estimates of cluster k's mean and covariance based on the model visited at iteration t.
The relabeling algorithm proceeds by selecting initial values for the ν_t's, which we take to be the identity permutation, then iterating the following steps until a fixed point is reached:

a. Choose ŷ to minimize ∑_{t=1}^M L₀(ŷ, ν_t(ξ^(t))).
b. For t = 1, …, M, choose ν_t to minimize L₀(ŷ, ν_t(ξ^(t))).
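Steps a and b above can be sketched as follows. This is a simplified version of Stephens' relabeling, working directly on the stored allocation-probability matrices; it enumerates all G! column permutations, so it is only practical for small G, and the function name is ours.

```python
import numpy as np
from itertools import permutations

def relabel(P, n_iter=20):
    """Iterate steps (a) and (b) for the loss (20) until a fixed point.
    P has shape (M, n, G): one allocation-probability matrix per MCMC
    iteration.  Returns the relabeled P and the label estimate y_hat."""
    M, n, G = P.shape
    P = P.copy()
    perms = [list(pm) for pm in permutations(range(G))]
    identity = list(range(G))
    rows = np.arange(n)
    for _ in range(n_iter):
        # (a) y_hat minimizes the summed loss, i.e. maximizes sum_t log p_{i,k}
        y_hat = np.argmax(np.log(P).sum(axis=0), axis=1)
        changed = False
        # (b) for each iteration t, apply the loss-minimizing column permutation
        for t in range(M):
            losses = [-np.log(P[t][:, pm][rows, y_hat]).sum() for pm in perms]
            best = perms[int(np.argmin(losses))]
            if best != identity:
                P[t] = P[t][:, best]
                changed = True
        if not changed:          # fixed point reached
            break
    return P, y_hat
```

For larger G, the permutation search in step b is usually solved with an assignment algorithm rather than by enumeration.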
6.2 Posterior Densities and Posterior Estimates
Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector γ.
We compute the marginal posterior probability that sample i is allocated to cluster k, y_i = k, as

p(y_i = k | X, G) = ∫ p(y_i = k, y_(−i), γ, w | X, G) dy_(−i) dγ dw

  ∝ ∫ p(X, y_i = k, y_(−i), γ, w | G) dy_(−i) dγ dw

  = ∫ p(X, y_i = k, y_(−i) | G, γ, w) p(γ | G) p(w | G) dy_(−i) dγ dw

  ≈ ∑_{t=1}^M p(X, y_i^(t) = k, y_(−i)^(t) | G, γ^(t), w^(t)) p(γ^(t) | G) p(w^(t) | G),    (21)

where y_(−i)^(t) is the vector y^(t) at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density,

ŷ_i = arg max_{1≤k≤G} p(y_i = k | X, G).    (22)
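As an illustration, the mode in (22) can be approximated directly from the matrix of sampled allocations. This unweighted frequency version is a sketch (function name ours): it ignores the weights p(γ^(t) | G) p(w^(t) | G) that appear in (21) and simply counts how often each observation visits each cluster.

```python
import numpy as np

def allocation_mode(Y):
    """Y has shape (M, n): sampled allocation vectors y^(t).
    Returns, for each observation, the most frequently visited cluster."""
    M, n = Y.shape
    G = int(Y.max()) + 1
    counts = np.stack([(Y == k).sum(axis=0) for k in range(G)])  # G x n
    return counts.argmax(axis=0)

# Toy chain of M = 4 draws for n = 3 observations
Y = np.array([[0, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [2, 0, 1]])
y_hat = allocation_mode(Y)
```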
Similarly, the marginal posterior for γ_j = 1 can be computed as

p(γ_j = 1 | X, G) = ∫ p(γ_j = 1, γ_(−j), w, y | X, G) dγ_(−j) dy dw

  ∝ ∫ p(X, y, γ_j = 1, γ_(−j), w | G) dγ_(−j) dy dw

  = ∫ p(X, y | G, γ_j = 1, γ_(−j), w) p(γ | G) p(w | G) dγ_(−j) dy dw

  ≈ ∑_{t=1}^M p(X, y^(t) | G, γ_j^(t) = 1, γ_(−j)^(t), w^(t)) p(γ_j = 1, γ_(−j)^(t) | G) p(w^(t) | G),    (23)

where γ_(−j)^(t) is the vector γ^(t) at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, p(γ_j = 1 | X, G) > a, where a is chosen arbitrarily:

γ̂_j = I_{p(γ_j = 1 | X, G) > a}.    (24)
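The thresholding rule (24) can be sketched with an unweighted frequency approximation to (23): estimate each marginal inclusion probability by the proportion of MCMC draws with γ_j = 1 and keep the variables exceeding the threshold a. Function name ours; the exact computation would weight each draw as in (23).

```python
import numpy as np

def select_variables(Gamma, a=0.5):
    """Gamma has shape (M, p): sampled binary inclusion vectors gamma^(t).
    Returns the estimated marginal inclusion probabilities and the indices
    of the variables whose estimate exceeds the threshold a."""
    marginal = Gamma.mean(axis=0)
    return marginal, np.flatnonzero(marginal > a)

# Toy chain of M = 3 draws over p = 4 variables
Gamma = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1]])
marg, keep = select_variables(Gamma, a=0.5)
```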
An alternative variable selection can be performed by considering the vector γ with largest posterior probability among all visited vectors,

γ* = arg max_{1≤t≤M} p(γ^(t) | X, G, ŵ, ŷ),    (25)

where ŷ is the vector of sample allocations estimated via (22) and ŵ = (1/M) ∑_{t=1}^M w^(t). The estimate given in (25) considers the joint density of γ, rather than the marginal distributions of the individual elements as in (24). In the same spirit, an estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,

ŷ* = arg max_{1≤t≤M} p(y^(t) | X, G, ŵ, γ̂),    (26)

where γ̂ is the estimate in (24).
6.3 Class Prediction
The MCMC output can also be used to predict the class membership of future observations, X_f. The predictive density is given by

p(y_f = k | X_f, X, G) = ∫ p(y_f = k, y, w, γ | X_f, X, G) dy dγ dw

  ∝ ∫ p(X_f, X, y_f = k, y | G, w, γ) p(γ | G) p(w | G) dy dγ dw

  ≈ ∑_{t=1}^M p(X_f, X, y_f = k, y^(t) | γ^(t), w^(t), G) p(γ^(t) | G) p(w^(t) | G),    (27)

and the observations are allocated according to

ŷ_f = arg max_{1≤k≤G} p(y_f = k | X_f, X, G).    (28)
7. PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates (petal width, petal height, sepal width, and sepal height) measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h₁ = 100. We set G_max = 10 and κ₁ = 0.3, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, G_max]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases, we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k | X). Table 2 shows the posterior estimates for the sample allocations ŷ, computed conditional on the most probable G, according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped
Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.
Table 1. Iris Data: Posterior Distribution of G

       Poisson(λ = 5)          Poisson(λ = 10)         Uniform[1, 10]
k    Different Σk  Equal Σ   Different Σk  Equal Σ   Different Σk  Equal Σ
3       .6466       0           .5019       0           .8548       0
4       .3124       .7413       .4291       .4384       .1417       .6504
5       .0403       .3124       .0545       .2335       .0035       .3446
6       .0007       .0418       .0138       .2786       0           .0050
7       0           0           .0007       .0484       0           0
8       0           0           0           .0011       0           0
together, except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k | X, G = 4), computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y₁₀₄ = 3 | X, G) = .4169 and p(y₁₀₄ = 4 | X, G) = .5823.

We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k | X) assuming G ~ Poisson(λ = 10) and G ~ uniform[1, G_max] are given in Table 1. Under both priors, we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h₁, with values between 10 and 1,000 giving satisfactory results. Smaller values of h₁ tended to favor slightly more components, but with some clusters containing very few observations. For example, for h₁ = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k | X, G); there is little support for assigning them to separate clusters.

We use this example to also illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400, between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of w_k is removed. It is now straightforward to obtain sensible estimates using the posterior means.
Table 2. Iris Data: Sample Allocation ŷ

                  Heterogeneous covariance     Homogeneous covariance
                    I     II    III              I     II    III    IV
Iris setosa        50      0      0             50      0      0     0
Iris versicolor     0     45      5              0     49      1     0
Iris virginica      0      0     50              0      2     36    12
Figure 2. Iris Data: Poisson(λ = 5) With Equal Σ. Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, …, 150, k = 1, …, 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

x_ij ~ I_{1≤i≤4} N(μ₁, σ₁²) + I_{5≤i≤7} N(μ₂, σ₂²) + I_{8≤i≤13} N(μ₃, σ₃²) + I_{14≤i≤15} N(μ₄, σ₄²),
    i = 1, …, 15, j = 1, …, 20,

where I_{·} is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ₁ = 5, μ₂ = 2, μ₃ = −3, and μ₄ = −6, and the component variances were set to σ₁² = 1.5, σ₂² = 1, σ₃² = 0.5, and σ₄² = 2. We drew an additional set of (p − 20) noisy variables that do not distinguish between the clusters from a standard normal density. We considered several values of p: p = 50, 100, 500, 1,000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
We permuted the columns of the data matrix X to disperse the predictors. For each dataset, we took δ = 3, α = 1,
Table 3. Iris Data: Sensitivity to Hyperparameter h₁. Results With h₁ = 10

Posterior distribution of G
k              4      5      6      7      8      9      10
p(G = k|X)   .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations ŷ
                   I    II   III   IV    V   VI
Iris setosa       50     0     0    0    0    0
Iris versicolor    0    45     1    0    4    0
Iris virginica     0     2    35   12    0    1
Figure 3. Iris Data: Sensitivity to h₁. Results for h₁ = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k | X, G = 6), i = 1, …, 5, k = 1, …, 6.
h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0: their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets: in all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(yi = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables; after about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one cluster that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Tadesse Sha and Vannucci Bayesian Variable Selection 611
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

 k    p(G = k|X)
 2      .0111
 3      .1447
 4      .3437
 5      .3375
 6      .1326
 7      .0216
 8      .0040
 9      .0015
10      .0008
11      .0005
12      .0007
13      .0012
14      .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations, p(yi = k|X, G = 4), i = 1, …, 15, k = 1, …, 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables, p(γj = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = .01 and κ0 = .1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables; the other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
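The concordance check described here amounts to correlating, across chains, the relative frequencies with which each gene is included. A toy version with two hypothetical chains follows; the frequencies are simulated for illustration and are not the actual endometrial results.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy inclusion frequencies for p = 100 genes from two hypothetical chains:
# a shared signal plus small chain-specific noise, clipped to [0, 1]
signal = rng.random(100)
freq_chain1 = np.clip(signal + 0.05 * rng.standard_normal(100), 0, 1)
freq_chain2 = np.clip(signal + 0.05 * rng.standard_normal(100), 0, 1)

# Correlation coefficient between the two chains' inclusion frequencies
r = np.corrcoef(freq_chain1, freq_chain2)[0, 1]
print(r > 0.9)  # high concordance between the chains
```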
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γj = 1|X, G = 3), for each of the four chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8 DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data: Concordance of results among the four chains. Pairwise scatterplots of frequencies for inclusion of genes.
Figure 12. Endometrial Data, Union of Four Chains: Marginal posterior probabilities, p(γj = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all of the different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data, Union of Four Chains: Marginal posterior probabilities of sample allocations, p(yi = k|X, G = 3), i = 1, …, 14, k = 1, …, 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to n(n − 1)/2 probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma) \\
&\qquad \times f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{align*}

Completing the square in the cluster means $\mu_{k(\gamma)}$ and in $\eta_{(\gamma^c)}$, integrating out the mean parameters, and then integrating the inverse-Wishart factors over $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$ yields

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \left[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}\right] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
&\quad \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\{(\delta + n + p_\gamma - j)/2\}}{\Gamma\{(\delta + p_\gamma - j)/2\}}
\times \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\{(\delta + n + p - p_\gamma - j)/2\}}{\Gamma\{(\delta + p - p_\gamma - j)/2\}} \\
&\quad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \\
&\quad \times \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta + n + p_\gamma - 1)/2}
\times \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{align*}

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

\begin{align*}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^T \\
&\quad - (n_k + h_1^{-1})^{-1} \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)
\Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)^T \\
&= \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
+ \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
\end{align*}

and

\begin{align*}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^T \\
&\quad - (n + h_0^{-1})^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)
\Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)^T \\
&= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T
+ \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T,
\end{align*}

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
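The two expressions given for $S_{k(\gamma)}$ (raw cross-products with the completed-square correction, versus centered cross-products plus a shrinkage term toward the prior mean) are algebraically identical. A quick numerical check, with arbitrary dimensions and an arbitrary value of $h_1$ of our own choosing, illustrates the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
nk, p_gamma, h1 = 6, 3, 10.0            # arbitrary cluster size, dimension, prior scale
X = rng.standard_normal((nk, p_gamma))  # observations in cluster k, selected columns
mu0 = rng.standard_normal(p_gamma)      # prior mean

# First form: raw cross-products minus the completed-square term
s = X.sum(axis=0) + mu0 / h1
S1 = X.T @ X + np.outer(mu0, mu0) / h1 - np.outer(s, s) / (nk + 1.0 / h1)

# Second form: centered cross-products plus a shrinkage term toward mu0
xbar = X.mean(axis=0)
Xc = X - xbar
S2 = Xc.T @ Xc + nk / (h1 * nk + 1.0) * np.outer(mu0 - xbar, mu0 - xbar)

print(np.allclose(S1, S2))  # True
```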
Heterogeneous Covariances

Under cluster-specific covariances $\Sigma_{k(\gamma)}$, the same calculation applies cluster by cluster. Integrating out $\mu$, $\Sigma_1, \ldots, \Sigma_G$, $\eta$, and $\Omega$ gives

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \prod_{k=1}^{G} \Biggl[(h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}
\prod_{j=1}^{p_\gamma} \frac{\Gamma\{(\delta + n_k + p_\gamma - j)/2\}}{\Gamma\{(\delta + p_\gamma - j)/2\}} \\
&\qquad\qquad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\,
\bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta + n_k + p_\gamma - 1)/2}\Biggr] \\
&\quad \times (h_0 n + 1)^{-(p - p_\gamma)/2}
\prod_{j=1}^{p - p_\gamma} \frac{\Gamma\{(\delta + n + p - p_\gamma - j)/2\}}{\Gamma\{(\delta + p - p_\gamma - j)/2\}} \\
&\quad \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\,
\bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{align*}

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined for the homogeneous case.
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173–182.
(1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components: An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
(2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
direction, the proposal is deterministic. The move is then accepted with probability

$$\min\left\{1,\; \frac{f(\psi' \mid X)\, r_m(\psi')}{f(\psi \mid X)\, r_m(\psi)\, q(u)} \left|\frac{\partial \psi'}{\partial(\psi, u)}\right|\right\}, \qquad (14)$$

where $r_m(\psi)$ is the probability of choosing move type m when in state $\psi$ and $q(u)$ is the density function of u. The final term in the foregoing ratio is a Jacobian arising from the change of variable from $(\psi, u)$ to $\psi'$.
5.3.1 Split and Merge Moves. In step 4 we make a random choice between attempting to split a nonempty cluster or combine two clusters, with probabilities $b_k = .5$ and $d_k = 1 - b_k$ ($k = 2, \ldots, G_{\max} - 1$, with $b_{G_{\max}} = 0$).

The split move begins by randomly selecting a nonempty cluster l and dividing it into two components, $l_1$ and $l_2$, with weights $w_{l_1} = w_l u$ and $w_{l_2} = w_l(1 - u)$, where u is a beta(2, 2) random variable. The parameters $\psi = (G, w^{\text{old}}, \gamma, y^{\text{old}})$ are updated to $\psi' = (G + 1, w^{\text{new}}, \gamma, y^{\text{new}})$. The acceptance probability for this move is given by $\min(1, A)$, where
\begin{align*}
A &= \frac{f(G + 1, w^{\text{new}}, \gamma, y^{\text{new}} \mid X)}{f(G, w^{\text{old}}, \gamma, y^{\text{old}} \mid X)}
\times \frac{q(\psi \mid \psi')}{q(\psi' \mid \psi)\, f(u)}
\times \left|\frac{\partial(w_{l_1}, w_{l_2})}{\partial(w_l, u)}\right| \\
&= \frac{f(X, y^{\text{new}} \mid w^{\text{new}}, \gamma, G + 1)\, f(w^{\text{new}} \mid G + 1)\, f(G + 1)}{f(X, y^{\text{old}} \mid w^{\text{old}}, \gamma, G)\, f(w^{\text{old}} \mid G)\, f(G)}
\times \frac{d_{G+1} P_{\text{chos}}}{b_G\, G_1^{-1}\, P_{\text{alloc}}\, f(u)} \times w_l, \qquad (15)
\end{align*}

with

$$\frac{f(w^{\text{new}} \mid G + 1)}{f(w^{\text{old}} \mid G)} = \frac{w_{l_1}^{\alpha-1} w_{l_2}^{\alpha-1}}{w_l^{\alpha-1}\, B(\alpha, G\alpha)}
\qquad \text{and} \qquad
\frac{f(G + 1)}{f(G)} =
\begin{cases}
1 & \text{if } G \sim \text{uniform}[2, G_{\max}] \\
\lambda/(G + 1) & \text{if } G \sim \text{truncated Poisson}(\lambda).
\end{cases}$$

Here $G_1$ is the number of nonempty components before the split, and $P_{\text{alloc}}$ is the probability that the particular allocation is made, set to $u^{n_{l_1}}(1 - u)^{n_{l_2}}$, where $n_{l_1}$ and $n_{l_2}$ are the numbers of observations assigned to $l_1$ and $l_2$. Note that Richardson and Green (1997) calculated $P_{\text{alloc}}$ using the full conditionals of the allocation variables, $P(y_i = k \mid \text{rest})$. In our case, because the component parameters $\mu$ and $\Sigma$ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. $P_{\text{chos}}$ is the probability of selecting the two components for the reverse move. Obviously, we want to combine clusters that are closest to one another; however, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:
a. The split results in two mutually adjacent components
$\Rightarrow P_{\text{chos}} = \dfrac{G_1^*}{G^*} \times \dfrac{2}{G_1^*}$.

b. The split results in components that are not mutually adjacent; that is, $l_1$ has $l_2$ as its closest component, but there is another cluster that is closer to $l_2$ than $l_1$
$\Rightarrow P_{\text{chos}} = \dfrac{G_1^*}{G^*} \times \dfrac{1}{G_1^*}$.

c. One of the resulting components is empty
$\Rightarrow P_{\text{chos}} = \dfrac{G_0^*}{G^*} \times \dfrac{1}{G_0^*} \times \dfrac{1}{G_1^*}$.

Here $G^*$ is the total number of clusters after the split, and $G_0^*$ and $G_1^*$ are the numbers of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
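A sketch of how $P_{\text{chos}}$ could be evaluated for a proposed split, using Euclidean distance between cluster means to define adjacency as above. The function and its inputs are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def p_chos(means, counts, l1, l2):
    """Probability of choosing components l1, l2 for the reverse (merge) move.
    means: list of cluster mean vectors; counts: observations per cluster.
    The three cases follow the adjacency rules given in the text."""
    G_star = len(counts)
    nonempty = [k for k in range(G_star) if counts[k] > 0]
    G1_star = len(nonempty)
    G0_star = G_star - G1_star

    if counts[l1] == 0 or counts[l2] == 0:           # case c: one component empty
        return (G0_star / G_star) * (1.0 / G0_star) * (1.0 / G1_star)

    def closest(k):
        # nearest nonempty cluster to k, by Euclidean distance between means
        d = {j: np.linalg.norm(means[k] - means[j]) for j in nonempty if j != k}
        return min(d, key=d.get)

    if closest(l1) == l2 and closest(l2) == l1:      # case a: mutually adjacent
        return (G1_star / G_star) * (2.0 / G1_star)
    return (G1_star / G_star) * (1.0 / G1_star)      # case b: not mutually adjacent

means = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
print(p_chos(means, counts=[3, 4, 8], l1=0, l2=1))  # mutually adjacent: (3/3) * (2/3)
```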
The merge move begins by randomly choosing two components, $l_1$ and $l_2$, to be combined into a single cluster l. The updated parameters are $\psi = (G, w^{\text{old}}, \gamma, y^{\text{old}}) \rightarrow \psi' = (G - 1, w^{\text{new}}, \gamma, y^{\text{new}})$, where $w_l = w_{l_1} + w_{l_2}$. The acceptance probability for this move is given by $\min(1, A)$, where

\begin{align*}
A &= \frac{f(X, y^{\text{new}} \mid \gamma, w^{\text{new}}, G - 1)\, f(w^{\text{new}} \mid G - 1)\, f(G - 1)}{f(X, y^{\text{old}} \mid \gamma, w^{\text{old}}, G)\, f(w^{\text{old}} \mid G)\, f(G)} \\
&\quad \times \frac{b_{G-1}\, P_{\text{alloc}}\, (G_1 - 1)^{-1}\, f(u)}{d_G\, P_{\text{chos}}} \times (w_{l_1} + w_{l_2})^{-1}, \qquad (16)
\end{align*}

where $f(w^{\text{new}} \mid G - 1)/f(w^{\text{old}} \mid G) = w_l^{\alpha-1} B(\alpha, (G-1)\alpha)/(w_{l_1}^{\alpha-1} w_{l_2}^{\alpha-1})$, and $f(G - 1)/f(G)$ equals 1 or $G/\lambda$ for the uniform and Poisson priors on G. Here $G_1$ is the number of nonempty clusters before the combining move, $f(u)$ is the beta(2, 2) density evaluated at $u = w_{l_1}/w_l$, and $P_{\text{alloc}}$ and $P_{\text{chos}}$ are defined similarly to the split move, but now $G_0^*$ and $G_1^*$ in $P_{\text{chos}}$ correspond to the numbers of empty and nonempty clusters after the merge.
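The deterministic part of the split and merge proposals, the weight transformation and its Jacobian $|\partial(w_{l_1}, w_{l_2})/\partial(w_l, u)| = w_l$, can be sketched as follows. This is an illustrative fragment, not the full sampler; for simplicity the new components are appended at the end of the weight vector.

```python
import numpy as np

rng = np.random.default_rng(2)

def split_weight(w, l, u):
    """Split component l with weight w[l] into weights w[l]*u and w[l]*(1-u).
    Returns the new weight vector and the Jacobian |d(w_l1, w_l2)/d(w_l, u)| = w[l]."""
    w_l1, w_l2 = w[l] * u, w[l] * (1.0 - u)
    w_new = np.concatenate([np.delete(w, l), [w_l1, w_l2]])
    return w_new, w[l]

def merge_weight(w, l1, l2):
    """Merge components l1 and l2 into a single component with weight w[l1] + w[l2]."""
    w_l = w[l1] + w[l2]
    return np.concatenate([np.delete(w, [l1, l2]), [w_l]])

w = np.array([0.5, 0.3, 0.2])
u = rng.beta(2, 2)                       # u ~ beta(2, 2), as in the split proposal
w_split, jac = split_weight(w, 0, u)
print(np.isclose(w_split.sum(), 1.0), np.isclose(jac, 0.5))  # True True
w_back = merge_weight(w_split, len(w_split) - 2, len(w_split) - 1)
print(np.isclose(w_back.sum(), 1.0))     # True: merging the two halves restores a valid vector
```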
5.3.2 Birth and Death Moves. We first make a random choice between a birth move and a death move, using the same probabilities $b_k$ and $d_k = 1 - b_k$ as earlier. For a birth move, the updated parameters are $\psi = (G, w^{\text{old}}, \gamma, y) \rightarrow \psi' = (G + 1, w^{\text{new}}, \gamma, y)$. The weight for the proposed new component is drawn using $w^* \sim \text{beta}(1, G)$, and the existing weights are rescaled to $w'_k = w_k(1 - w^*)$, $k = 1, \ldots, G$ (where k indexes the component labels), so that all of the weights sum to 1. Let $G_0$ be the number of empty components before the birth. The acceptance probability for a birth is $\min(1, A)$, where

\begin{align*}
A &= \frac{f(X, y \mid w^{\text{new}}, \gamma, G + 1)\, f(w^{\text{new}} \mid G + 1)\, f(G + 1)}{f(X, y \mid w^{\text{old}}, \gamma, G)\, f(w^{\text{old}} \mid G)\, f(G)} \\
&\quad \times \frac{d_{G+1}\, (G_0 + 1)^{-1}}{b_G\, f(w^*)} \times (1 - w^*)^{G-1}, \qquad (17)
\end{align*}

where $f(w^{\text{new}} \mid G + 1)/f(w^{\text{old}} \mid G) = w^{*\,\alpha-1}(1 - w^*)^{G(\alpha-1)}/B(\alpha, G\alpha)$, and $(1 - w^*)^{G-1}$ is the Jacobian of the transformation $(w^{\text{old}}, w^*) \rightarrow w^{\text{new}}$.

For the death move, an empty component is chosen at random and deleted. Let $w_l$ be the weight of the deleted component. The remaining weights are rescaled to sum to 1, $w'_k = w_k/(1 - w_l)$, and $\psi = (G, w^{\text{old}}, \gamma, y) \rightarrow \psi' = (G - 1, w^{\text{new}}, \gamma, y)$. The acceptance probability for this move is $\min(1, A)$, where

\begin{align*}
A &= \frac{f(X, y \mid w^{\text{new}}, \gamma, G - 1)\, f(w^{\text{new}} \mid G - 1)\, f(G - 1)}{f(X, y \mid w^{\text{old}}, \gamma, G)\, f(w^{\text{old}} \mid G)\, f(G)} \\
&\quad \times \frac{b_{G-1}\, f(w_l)}{d_G\, G_0^{-1}} \times (1 - w_l)^{-(G-2)}, \qquad (18)
\end{align*}

where $f(w^{\text{new}} \mid G - 1)/f(w^{\text{old}} \mid G) = B(\alpha, (G-1)\alpha)/\{w_l^{\alpha-1}(1 - w_l)^{(G-1)(\alpha-1)}\}$, $G_0$ is the number of empty components before the death move, and $f(w_l)$ is the beta(1, G) density.
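The weight rescaling for the birth and death moves can be sketched similarly. As before, this is an illustrative fragment of our own, with the new component appended last.

```python
import numpy as np

rng = np.random.default_rng(3)

def birth_weights(w, w_star):
    """Add a new component with weight w_star; rescale existing weights by (1 - w_star)."""
    return np.append(w * (1.0 - w_star), w_star)

def death_weights(w, l):
    """Delete component l and rescale the remaining weights to sum to 1."""
    return np.delete(w, l) / (1.0 - w[l])

w = np.array([0.5, 0.3, 0.2])
w_star = rng.beta(1, len(w))            # proposed weight, w* ~ beta(1, G)
w_birth = birth_weights(w, w_star)
print(np.isclose(w_birth.sum(), 1.0))   # True

w_death = death_weights(w_birth, len(w_birth) - 1)
print(np.allclose(w_death, w))          # True: the death move undoes the birth
```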
6 POSTERIOR INFERENCE
We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate for this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.
6.1 Label Switching
In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions with up to G! copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance, w1 < ··· < wG. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach: the procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.
Let $P(\xi) = [p_{ik}(\xi)]$ be the matrix of allocation probabilities,

$$p_{ik}(\xi) = \Pr(y_i = k \mid X, \gamma, w, \theta) = \frac{w_k\, f(x_{i(\gamma)} \mid \theta_{k(\gamma)})}{\sum_j w_j\, f(x_{i(\gamma)} \mid \theta_{j(\gamma)})}. \qquad (19)$$
In the context of clustering, the loss function can be defined on the cluster labels y,

$$L_0(y, \xi) = -\sum_{i=1}^{n} \log p_{i y_i}(\xi). \qquad (20)$$
Let $\xi^{(1)}, \ldots, \xi^{(M)}$ be the sampled parameters, and let $\nu_1, \ldots, \nu_M$ be the permutations applied to them. The parameters $\xi^{(t)}$ correspond to the component parameters $w^{(t)}$, $\mu^{(t)}$, and $\Sigma^{(t)}$ sampled at iteration t. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set $\xi^{(t)} = \{w_1^{(t)}, \ldots, w_G^{(t)}, \bar{x}_{1(\gamma)}^{(t)}, \ldots, \bar{x}_{G(\gamma)}^{(t)}, S_{1(\gamma)}^{(t)}, \ldots, S_{G(\gamma)}^{(t)}\}$, where $\bar{x}_{k(\gamma)}^{(t)}$ and $S_{k(\gamma)}^{(t)}$ are the estimates of cluster k's mean and covariance based on the model visited at iteration t.
The relabeling algorithm proceeds by selecting initial values for the $\nu_t$'s, which we take to be the identity permutation, and then iterating the following steps until a fixed point is reached:

a. Choose $\hat{y}$ to minimize $\sum_{t=1}^{M} L_0\{\hat{y}, \nu_t(\xi^{(t)})\}$.
b. For $t = 1, \ldots, M$, choose $\nu_t$ to minimize $L_0\{\hat{y}, \nu_t(\xi^{(t)})\}$.
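A minimal sketch of this two-step iteration, using the loss (20) on toy allocation probabilities. The function name and the brute-force enumeration of permutations (feasible only for small G) are our own choices, not the authors' implementation.

```python
import itertools
import numpy as np

def relabel(P_list, n_iter=20):
    """Stephens-style relabeling: P_list[t] is an (n x G) matrix of allocation
    probabilities at MCMC iteration t. Returns one label permutation per draw
    minimizing L0(y, xi) = -sum_i log p_{i, y_i}, plus the fixed-point labels."""
    n, G = P_list[0].shape
    perms = [tuple(range(G))] * len(P_list)          # start from identity permutations
    all_perms = list(itertools.permutations(range(G)))
    for _ in range(n_iter):
        # Step a: choose y-hat minimizing the summed loss over permuted draws
        total = sum(np.log(P[:, perm]) for P, perm in zip(P_list, perms))
        y_hat = total.argmax(axis=1)
        # Step b: for each draw, pick the permutation minimizing -sum_i log p_{i, y_hat_i}
        new_perms = []
        for P in P_list:
            losses = [-np.log(P[np.arange(n), np.array(perm)[y_hat]]).sum()
                      for perm in all_perms]
            new_perms.append(all_perms[int(np.argmin(losses))])
        if new_perms == perms:                       # fixed point reached
            break
        perms = new_perms
    return perms, y_hat

# Toy example: the second draw has its two labels switched relative to the first
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = P1[:, ::-1]                                     # label-switched copy
perms, y_hat = relabel([P1, P2])
print(perms)  # [(0, 1), (1, 0)]
```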
6.2 Posterior Densities and Posterior Estimates

Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector γ.
We compute the marginal posterior probability that sample i is allocated to cluster k, yi = k, as

$$
\begin{aligned}
p(y_i = k \mid X, G) &= \int p(y_i = k, y_{(-i)}, \gamma, w \mid X, G)\, dy_{(-i)}\, d\gamma\, dw \\
&\propto \int p(X, y_i = k, y_{(-i)}, \gamma, w \mid G)\, dy_{(-i)}\, d\gamma\, dw \\
&= \int p(X, y_i = k, y_{(-i)} \mid G, \gamma, w)\, p(\gamma \mid G)\, p(w \mid G)\, dy_{(-i)}\, d\gamma\, dw \\
&\approx \sum_{t=1}^M p(X, y_i^{(t)} = k, y_{(-i)}^{(t)} \mid G, \gamma^{(t)}, w^{(t)})\, p(\gamma^{(t)} \mid G)\, p(w^{(t)} \mid G), \qquad (21)
\end{aligned}
$$

where $y_{(-i)}^{(t)}$ is the vector $y^{(t)}$ at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density,

$$\hat{y}_i = \arg\max_{1 \le k \le G} p(y_i = k \mid X, G). \qquad (22)$$
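One simple reading of (21)-(22) weights each relabeled draw by its unnormalized posterior score and tallies weighted allocation frequencies. This is a sketch: the helper name `allocation_marginals` and the availability of per-draw log posterior scores are assumptions, not the authors' code:

```python
import numpy as np

def allocation_marginals(y_draws, log_post, G):
    """Monte Carlo estimate of p(y_i = k | X, G) in the spirit of (21).

    y_draws: (M, n) relabeled sample allocations from the MCMC output.
    log_post: (M,) unnormalized log posterior score of each draw
              (hypothetical values standing in for the terms summed in (21)).
    """
    M, n = y_draws.shape
    w = np.exp(log_post - np.max(log_post))   # stabilize before normalizing
    w = w / w.sum()
    probs = np.zeros((n, G))
    for t in range(M):
        probs[np.arange(n), y_draws[t]] += w[t]
    return probs

# The posterior allocation (22) is then the componentwise mode:
#   y_hat = allocation_marginals(...).argmax(axis=1)
```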
Similarly, the marginal posterior for γj = 1 can be computed as

$$
\begin{aligned}
p(\gamma_j = 1 \mid X, G) &= \int p(\gamma_j = 1, \gamma_{(-j)}, w, y \mid X, G)\, d\gamma_{(-j)}\, dy\, dw \\
&\propto \int p(X, y, \gamma_j = 1, \gamma_{(-j)}, w \mid G)\, d\gamma_{(-j)}\, dy\, dw \\
&= \int p(X, y \mid G, \gamma_j = 1, \gamma_{(-j)}, w)\, p(\gamma \mid G)\, p(w \mid G)\, d\gamma_{(-j)}\, dy\, dw \\
&\approx \sum_{t=1}^M p(X, y^{(t)} \mid G, \gamma_j^{(t)} = 1, \gamma_{(-j)}^{(t)}, w^{(t)})\, p(\gamma_j = 1, \gamma_{(-j)}^{(t)} \mid G)\, p(w^{(t)} \mid G), \qquad (23)
\end{aligned}
$$

where $\gamma_{(-j)}^{(t)}$ is the vector $\gamma^{(t)}$ at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, p(γj = 1|X, G) > a, where a is chosen arbitrarily:

$$\hat{\gamma}_j = I_{\{p(\gamma_j = 1 \mid X, G) > a\}}. \qquad (24)$$
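A matching sketch for (23)-(24), again assuming relabeled draws and hypothetical per-draw log posterior scores (the helper name `inclusion_marginals` is not from the paper):

```python
import numpy as np

def inclusion_marginals(gamma_draws, log_post, a=0.5):
    """Weighted inclusion frequencies, in the spirit of (23), thresholded as in (24).

    gamma_draws: (M, p) binary inclusion vectors visited by the sampler.
    log_post: (M,) unnormalized log posterior scores (assumed available).
    a: arbitrary inclusion threshold, as in the text.
    """
    w = np.exp(log_post - np.max(log_post))
    w = w / w.sum()
    marg = w @ gamma_draws          # weighted frequency of gamma_j = 1
    return marg, (marg > a).astype(int)
```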
An alternative variable selection can be performed by considering the vector γ with largest posterior probability among all visited vectors,

$$\gamma^* = \arg\max_{1 \le t \le M} p(\gamma^{(t)} \mid X, G, \hat{w}, \hat{y}), \qquad (25)$$

where ŷ is the vector of sample allocations estimated via (22) and ŵ is given by $\hat{w} = \frac{1}{M}\sum_{t=1}^M w^{(t)}$. The estimate given in (25) considers the joint density of γ, rather than the marginal distributions of the individual elements as in (24).

608 Journal of the American Statistical Association June 2005

In the same spirit, an estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,

$$y^* = \arg\max_{1 \le t \le M} p(y^{(t)} \mid X, G, \hat{w}, \hat{\gamma}), \qquad (26)$$

where γ̂ is the estimate in (24).
6.3 Class Prediction
The MCMC output can also be used to predict the class membership of future observations Xf. The predictive density is given by

$$
\begin{aligned}
p(y_f = k \mid X_f, X, G) &= \int p(y_f = k, y, w, \gamma \mid X_f, X, G)\, dy\, d\gamma\, dw \\
&\propto \int p(X_f, X, y_f = k, y \mid G, w, \gamma)\, p(\gamma \mid G)\, p(w \mid G)\, dy\, d\gamma\, dw \\
&\approx \sum_{t=1}^M p(X_f, X, y_f = k, y^{(t)} \mid \gamma^{(t)}, w^{(t)}, G)\, p(\gamma^{(t)} \mid G)\, p(w^{(t)} \mid G), \qquad (27)
\end{aligned}
$$

and the observations are allocated according to

$$\hat{y}_f = \arg\max_{1 \le k \le G} p(y_f = k \mid X_f, X, G). \qquad (28)$$
7 PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates—petal width, petal height, sepal width, and sepal height—measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h1 = 100. We set Gmax = 10 and κ1 = 0.3, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, Gmax]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumption of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for sample allocations ŷ, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped together except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(yi = k|X, G = 4), computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y104 = 3|X, G) = .4169 and p(y104 = 4|X, G) = .5823.

Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.

Table 1. Iris Data: Posterior Distribution of G

            Poisson(λ = 5)          Poisson(λ = 10)         Uniform[1, 10]
 k     Different Σk   Equal Σ   Different Σk   Equal Σ   Different Σk   Equal Σ
 3        .6466          0         .5019          0         .8548          0
 4        .3124       .7413        .4291       .4384        .1417       .6504
 5        .0403       .3124        .0545       .2335        .0035       .3446
 6        .0007       .0418        .0138       .2786          0         .0050
 7          0            0         .0007       .0484          0            0
 8          0            0           0         .0011          0            0
We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X) assuming G ∼ Poisson(λ = 10) and G ∼ uniform[1, Gmax] are given in Table 1. Under both priors, we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(yi = k|X, G); there is little support for assigning them to separate clusters.
We use this example to also illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400 between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of wk is removed. It is now straightforward to obtain sensible estimates using the posterior means.
Table 2. Iris Data: Sample Allocation ŷ

                    Heterogeneous covariance      Homogeneous covariance
                       I     II    III            I     II    III    IV
Iris setosa           50      0      0           50      0      0     0
Iris versicolor        0     45      5            0     49      1     0
Iris virginica         0      0     50            0      2     36    12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 4), i = 1, …, 150, k = 1, …, 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

$$x_{ij} \sim I_{\{1 \le i \le 4\}} N(\mu_1, \sigma_1^2) + I_{\{5 \le i \le 7\}} N(\mu_2, \sigma_2^2) + I_{\{8 \le i \le 13\}} N(\mu_3, \sigma_3^2) + I_{\{14 \le i \le 15\}} N(\mu_4, \sigma_4^2),$$
$$i = 1, \ldots, 15,\quad j = 1, \ldots, 20,$$

where I{·} is the indicator function equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ1 = 5, μ2 = 2, μ3 = −3, and μ4 = −6, and the component variances were set to σ1² = 1.5, σ2² = 1, σ3² = .5, and σ4² = 2. We drew an additional set of (p − 20) noisy variables that do not distinguish between the clusters from a standard normal density. We considered several values of p: p = 50, 100, 500, 1000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
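The simulated-data design above can be reproduced in a few lines. This is a sketch under stated assumptions: the random seed is arbitrary, and the decimal variances 1.5 and .5 are restorations of values garbled in extraction:

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
p = 1000                                       # also 50, 100, 500 in the paper
means = [5, 2, -3, -6]
variances = [1.5, 1.0, 0.5, 2.0]               # assumed restored decimals
sizes = [4, 3, 6, 2]                           # n = 15 observations in 4 groups

# 20 discriminating variables: each block of rows shares one normal density.
blocks = [rng.normal(m, np.sqrt(v), size=(n_k, 20))
          for m, v, n_k in zip(means, variances, sizes)]
X = np.vstack(blocks)
# (p - 20) standard normal noise variables that carry no cluster signal.
X = np.hstack([X, rng.normal(size=(15, p - 20))])
X = X[:, rng.permutation(p)]                   # disperse the predictors
```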
We permuted the columns of the data matrix X to disperse the predictors. For each dataset, we took δ = 3, α = 1, h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0; their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.

Table 3. Iris Data: Sensitivity to Hyperparameter h1; Results With h1 = 10

Posterior distribution of G:
 k            4      5      6      7      8      9      10
 p(G = k|X)  .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations ŷ:
                    I     II    III    IV     V    VI
Iris setosa        50      0      0     0     0     0
Iris versicolor     0     45      1     0     4     0
Iris virginica      0      2     35    12     0     1

Figure 3. Iris Data, Sensitivity to h1: Results for h1 = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(yi = k|X, G = 6), i = 1, …, 5, k = 1, …, 6.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases, we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(yi = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrogram of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

 k     p(G = k|X)
 2       .0111
 3       .1447
 4       .3437
 5       .3375
 6       .1326
 7       .0216
 8       .0040
 9       .0015
10       .0008
11       .0005
12       .0007
13       .0012
14       .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 4), i = 1, …, 15, k = 1, …, 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables p(γj = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate a transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
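The filtering steps just described can be sketched as follows. The function name `preprocess` is a hypothetical helper; the thresholds 20 and 16,000 come from the text:

```python
import numpy as np

def preprocess(expr, lo=20.0, hi=16000.0):
    """Sketch of the described pipeline: drop probe sets with any reading
    outside the reliable detection range, log-transform, and rescale each
    remaining variable by its range.

    expr: (n_samples, n_probes) array of average-difference values.
    """
    reliable = np.all((expr >= lo) & (expr <= hi), axis=0)
    X = np.log(expr[:, reliable])                  # log-transform kept probes
    X = X / (X.max(axis=0) - X.min(axis=0))        # rescale by the range
    return X
```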
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.

We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables p(γj = 1|X, G = 3) for Each of the Four Chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8 DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Figure 11. Endometrial Data: Concordance of Results Among the Four Chains. Pairwise scatterplots of frequencies for inclusion of genes.

Figure 12. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities p(γj = 1|X, G = 3).

Figure 13. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 3), i = 1, …, 14, k = 1, …, 3.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
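The pairwise co-clustering idea can be sketched directly from the MCMC allocations; because the matrix compares labels within each draw, it is unaffected by label switching or a varying number of components. The helper name `coclustering_matrix` is hypothetical:

```python
import numpy as np

def coclustering_matrix(y_draws):
    """Estimated posterior probability that observations i and j share a
    cluster, averaged over MCMC draws.

    y_draws: (M, n) sampled allocation vectors (any labeling, any G).
    """
    M, n = y_draws.shape
    C = np.zeros((n, n))
    for y in y_draws:
        C += (y[:, None] == y[None, :])   # 1 where labels agree in this draw
    return C / M

# Medvedovic and Sivaganesan (2002) use such pairwise probabilities as a
# similarity measure; 1 - C could feed a hierarchical clustering routine.
```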
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS
Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: Σ1 = ··· = ΣG = Σ

$$f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \eta, \gamma)\, f(\mu \mid G, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.$$

Completing the squares in the Gaussian exponents and integrating out the component means μk(γ) and the mean η(γc) of the nondiscriminating variables, and then integrating out Σ(γ) and Ω(γc) against their inverse-Wishart priors, gives

$$
\begin{aligned}
f(X, y \mid G, w, \gamma) &= \pi^{-np/2} \left[\prod_{k=1}^G (h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}\right] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
&\quad \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\{(\delta + n + p_\gamma - j)/2\}}{\Gamma\{(\delta + p_\gamma - j)/2\}}\; \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\{(\delta + n + p - p_\gamma - j)/2\}}{\Gamma\{(\delta + p - p_\gamma - j)/2\}} \\
&\quad \times |Q_{1(\gamma)}|^{(\delta + p_\gamma - 1)/2}\, \Big|Q_{1(\gamma)} + \sum_{k=1}^G S_{k(\gamma)}\Big|^{-(\delta + n + p_\gamma - 1)/2} \\
&\quad \times |Q_{0(\gamma^c)}|^{(\delta + p - p_\gamma - 1)/2}\, |Q_{0(\gamma^c)} + S_{0(\gamma^c)}|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

where Sk(γ) and S0(γc) are given by

$$
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^T \\
&\quad - (n_k + h_1^{-1})^{-1} \Big(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Big) \Big(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Big)^T \\
&= \sum_{x_{i(\gamma)} \in C_k} (x_{i(\gamma)} - \bar{x}_{k(\gamma)})(x_{i(\gamma)} - \bar{x}_{k(\gamma)})^T + \frac{n_k}{h_1 n_k + 1}\, (\mu_{0(\gamma)} - \bar{x}_{k(\gamma)})(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)})^T
\end{aligned}
$$

and

$$
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^n x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^T \\
&\quad - (n + h_0^{-1})^{-1} \Big(\sum_{i=1}^n x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Big) \Big(\sum_{i=1}^n x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Big)^T \\
&= \sum_{i=1}^n (x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)})(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)})^T + \frac{n}{h_0 n + 1}\, (\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)})(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)})^T,
\end{aligned}
$$

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^n x_{i(\gamma^c)}$.
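The appendix writes Sk(γ) both as a raw sum of outer products with a conjugate-prior correction and as a centered sum of squares plus a shrinkage term toward μ0. The two algebraic forms can be checked numerically; this is a sketch with arbitrary made-up data (the dimensions, seed, and h1 value are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
h1 = 100.0                      # prior scale, as used in the examples
Xk = rng.normal(size=(6, 3))    # hypothetical observations x_i(gamma) in cluster k
mu0 = rng.normal(size=3)        # prior mean mu_0(gamma)
nk = Xk.shape[0]
xbar = Xk.mean(axis=0)

# Form 1: raw outer products with the conjugate correction term.
s = Xk.sum(axis=0) + mu0 / h1
S1 = Xk.T @ Xk + np.outer(mu0, mu0) / h1 - np.outer(s, s) / (nk + 1 / h1)

# Form 2: centered sum of squares plus shrinkage toward mu0.
Xc = Xk - xbar
S2 = Xc.T @ Xc + nk / (h1 * nk + 1) * np.outer(mu0 - xbar, mu0 - xbar)

assert np.allclose(S1, S2)      # the two algebraic forms agree
```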
Heterogeneous Covariances

Proceeding as in the homogeneous case, but integrating each Σk(γ) separately against its inverse-Wishart prior,

$$
\begin{aligned}
f(X, y \mid G, w, \gamma) &= \int f(X, y \mid G, w, \mu, \eta, \gamma)\, f(\mu \mid G, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \pi^{-np/2} \prod_{k=1}^G \left[ (h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k} \prod_{j=1}^{p_\gamma} \frac{\Gamma\{(\delta + n_k + p_\gamma - j)/2\}}{\Gamma\{(\delta + p_\gamma - j)/2\}}\; |Q_{1(\gamma)}|^{(\delta + p_\gamma - 1)/2}\, |Q_{1(\gamma)} + S_{k(\gamma)}|^{-(\delta + n_k + p_\gamma - 1)/2} \right] \\
&\quad \times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\{(\delta + n + p - p_\gamma - j)/2\}}{\Gamma\{(\delta + p - p_\gamma - j)/2\}} \\
&\quad \times |Q_{0(\gamma^c)}|^{(\delta + p - p_\gamma - 1)/2}\, |Q_{0(\gamma^c)} + S_{0(\gamma^c)}|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

with Sk(γ) and S0(γc) as defined in the homogeneous case.
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
Tadesse, Sha, and Vannucci: Bayesian Variable Selection 607
$$
\frac{f(w^{\text{new}} \mid G-1)}{f(w^{\text{old}} \mid G)} = \frac{B(\alpha, (G-1)\alpha)}{w_l^{\alpha-1}(1-w_l)^{(G-1)(\alpha-1)}},
$$

$G_0$ is the number of empty components before the death move, and $f(w_l)$ is the beta$(1, G)$ density.
6. POSTERIOR INFERENCE
We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate of this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.
6.1 Label Switching
In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions, with up to $G!$ copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance $w_1 < \cdots < w_G$. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.
Let $P(\xi) = [p_{ik}(\xi)]$ be the matrix of allocation probabilities,

$$
p_{ik}(\xi) = \Pr(y_i = k \mid X, \gamma, w, \theta) = \frac{w_k\, f(x_{i(\gamma)} \mid \theta_{k(\gamma)})}{\sum_j w_j\, f(x_{i(\gamma)} \mid \theta_{j(\gamma)})}. \qquad (19)
$$
In the context of clustering, the loss function can be defined on the cluster labels y:

$$
L_0(y, \xi) = -\sum_{i=1}^{n} \log p_{i y_i}(\xi). \qquad (20)
$$
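As a concrete illustration of (19) and (20), the allocation probabilities and the loss can be computed as follows. This is a minimal sketch, not the authors' code: it assumes univariate normal components with known means and variances, whereas the paper works with multivariate components whose parameters are estimated from the MCMC output.

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate normal density (stand-in for f(x_i | theta_k))."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def allocation_probs(x, weights, means, variances):
    """One row of P(xi): p_ik = w_k f(x|theta_k) / sum_j w_j f(x|theta_j), eq. (19)."""
    num = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(num)
    return [v / total for v in num]

def loss(y, P):
    """L0(y, xi) = -sum_i log p_{i, y_i}, eq. (20)."""
    return -sum(math.log(P[i][y[i]]) for i in range(len(y)))

# toy data: two well-separated components
X = [-2.1, -1.9, 2.0, 2.2]
P = [allocation_probs(x, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]) for x in X]
labels = [max(range(2), key=lambda k: P[i][k]) for i in range(len(X))]
```

The rows of P sum to 1, and a labeling that respects the component structure attains a smaller loss than a mislabeled one.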
Let $\xi^{(1)}, \ldots, \xi^{(M)}$ be the sampled parameters, and let $\nu_1, \ldots, \nu_M$ be the permutations applied to them. The parameters $\xi^{(t)}$ correspond to the component parameters $w^{(t)}$, $\mu^{(t)}$, and $\Sigma^{(t)}$ sampled at iteration t. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set $\xi^{(t)} = \{w^{(t)}_1, \ldots, w^{(t)}_G, \bar{x}^{(t)}_{1(\gamma)}, \ldots, \bar{x}^{(t)}_{G(\gamma)}, S^{(t)}_{1(\gamma)}, \ldots, S^{(t)}_{G(\gamma)}\}$, where $\bar{x}^{(t)}_{k(\gamma)}$ and $S^{(t)}_{k(\gamma)}$ are the estimates of cluster k's mean and covariance based on the model visited at iteration t.
The relabeling algorithm proceeds by selecting initial values for the $\nu_t$'s, which we take to be the identity permutation, then iterating the following steps until a fixed point is reached:

a. Choose $\hat{y}$ to minimize $\sum_{t=1}^{M} L_0(\hat{y}, \nu_t(\xi^{(t)}))$.
b. For $t = 1, \ldots, M$, choose $\nu_t$ to minimize $L_0(\hat{y}, \nu_t(\xi^{(t)}))$.
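The two steps above can be sketched as follows. This is not the authors' implementation: it brute-forces the G! permutations (viable only for small G), and the convention that $\nu_t$ relabels component k as `nu[k]` is our assumption.

```python
import math
from itertools import permutations

def loss0(y, nu, P):
    """L0 for one iteration: -sum_i log p_{i, nu(y_i)} under permutation nu."""
    return -sum(math.log(P[i][nu[y[i]]]) for i in range(len(y)))

def relabel(P_list, G):
    """Alternate steps (a) and (b) until the permutations stop changing.
    P_list[t][i][k] holds the allocation probabilities at MCMC iteration t."""
    M, n = len(P_list), len(P_list[0])
    nus = [tuple(range(G))] * M              # start from identity permutations
    while True:
        # (a) choose y minimizing the summed loss (it separates over samples i)
        y = [max(range(G),
                 key=lambda k, i=i: sum(math.log(P_list[t][i][nus[t][k]])
                                        for t in range(M)))
             for i in range(n)]
        # (b) choose each nu_t minimizing L0(y, nu_t(xi_t))
        new_nus = [min(permutations(range(G)),
                       key=lambda nu, t=t: loss0(y, nu, P_list[t]))
                   for t in range(M)]
        if new_nus == nus:
            return y, nus
        nus = new_nus

# demo: two sweeps whose component labels are switched relative to each other
P1 = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]
P2 = [[0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]
y_hat, nus = relabel([P1, P2], G=2)   # the second sweep gets relabeled
```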
6.2 Posterior Densities and Posterior Estimates
Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector γ.
We compute the marginal posterior probability that sample i is allocated to cluster k, $y_i = k$, as

$$
\begin{aligned}
p(y_i = k \mid X, G) &= \int p(y_i = k, y_{(-i)}, \gamma, w \mid X, G)\, dy_{(-i)}\, d\gamma\, dw \\
&\propto \int p(X, y_i = k, y_{(-i)}, \gamma, w \mid G)\, dy_{(-i)}\, d\gamma\, dw \\
&= \int p(X, y_i = k, y_{(-i)} \mid G, \gamma, w)\, p(\gamma \mid G)\, p(w \mid G)\, dy_{(-i)}\, d\gamma\, dw \\
&\approx \sum_{t=1}^{M} p\big(X, y_i^{(t)} = k, y_{(-i)}^{(t)} \mid G, \gamma^{(t)}, w^{(t)}\big)\, p\big(\gamma^{(t)} \mid G\big)\, p\big(w^{(t)} \mid G\big), \qquad (21)
\end{aligned}
$$

where $y_{(-i)}^{(t)}$ is the vector $y^{(t)}$ at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density:

$$
\hat{y}_i = \arg\max_{1 \le k \le G} p(y_i = k \mid X, G). \qquad (22)
$$
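A schematic Monte Carlo version of (21) and (22): given sampled allocation vectors and an unnormalized posterior weight per sweep (a stand-in for the terms $p(X, y \mid G, \gamma, w)\,p(\gamma \mid G)\,p(w \mid G)$, which we simply treat as given here), the marginal allocation probabilities and their modes accumulate as follows. A sketch, not the authors' implementation.

```python
def allocation_posterior(y_draws, weights, G):
    """Approximate p(y_i = k | X, G) as a weighted average over MCMC draws,
    then take the per-sample mode, as in (21)-(22)."""
    n, Z = len(y_draws[0]), float(sum(weights))
    post = [[0.0] * G for _ in range(n)]
    for y_vec, wt in zip(y_draws, weights):
        for i, k in enumerate(y_vec):
            post[i][k] += wt / Z           # each draw contributes its weight
    y_hat = [max(range(G), key=lambda k, i=i: post[i][k]) for i in range(n)]
    return post, y_hat

# three hypothetical sweeps over n = 2 samples, with weights 2:1:1
post, y_hat = allocation_posterior([[0, 1], [0, 1], [1, 1]],
                                   [2.0, 1.0, 1.0], G=2)
```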
Similarly, the marginal posterior for $\gamma_j = 1$ can be computed as

$$
\begin{aligned}
p(\gamma_j = 1 \mid X, G) &= \int p(\gamma_j = 1, \gamma_{(-j)}, w, y \mid X, G)\, d\gamma_{(-j)}\, dy\, dw \\
&\propto \int p(X, y, \gamma_j = 1, \gamma_{(-j)}, w \mid G)\, d\gamma_{(-j)}\, dy\, dw \\
&= \int p(X, y \mid G, \gamma_j = 1, \gamma_{(-j)}, w)\, p(\gamma \mid G)\, p(w \mid G)\, d\gamma_{(-j)}\, dy\, dw \\
&\approx \sum_{t=1}^{M} p\big(X, y^{(t)} \mid G, \gamma_j^{(t)} = 1, \gamma_{(-j)}^{(t)}, w^{(t)}\big)\, p\big(\gamma_j = 1, \gamma_{(-j)}^{(t)} \mid G\big)\, p\big(w^{(t)} \mid G\big), \qquad (23)
\end{aligned}
$$

where $\gamma_{(-j)}^{(t)}$ is the vector $\gamma^{(t)}$ at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, $p(\gamma_j = 1 \mid X, G) > a$, where a is chosen arbitrarily:

$$
\hat{\gamma}_j = I_{\{p(\gamma_j = 1 \mid X, G) > a\}}. \qquad (24)
$$
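The same weighting scheme gives the marginal inclusion probabilities of (23) and the thresholding rule (24). As before, the per-sweep weights are hypothetical stand-ins for the unnormalized posterior mass of each visited model; this is a sketch, not the paper's code.

```python
def inclusion_posterior(gamma_draws, weights, a=0.5):
    """Approximate p(gamma_j = 1 | X, G) by weighting each visited gamma
    vector by its unnormalized posterior mass, then threshold at `a` (eq. 24)."""
    p, Z = len(gamma_draws[0]), float(sum(weights))
    marg = [sum(wt for g, wt in zip(gamma_draws, weights) if g[j] == 1) / Z
            for j in range(p)]
    return marg, [int(m > a) for m in marg]

# three hypothetical visited gamma vectors over p = 3 variables
marg, selected = inclusion_posterior([[1, 0, 1], [1, 1, 0], [1, 0, 0]],
                                     [1.0, 1.0, 2.0])
```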
An alternative variable selection can be performed by considering the vector γ with largest posterior probability among all visited vectors:

$$
\gamma^{*} = \arg\max_{1 \le t \le M} p\big(\gamma^{(t)} \mid X, G, \hat{w}, \hat{y}\big), \qquad (25)
$$

where $\hat{y}$ is the vector of sample allocations estimated via (22) and $\hat{w} = \frac{1}{M}\sum_{t=1}^{M} w^{(t)}$. The estimate given in (25) considers the joint density of γ, rather than the marginal distributions of the individual elements as in (24). In the same spirit, an
608 Journal of the American Statistical Association, June 2005
estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability:

$$
y^{*} = \arg\max_{1 \le t \le M} p\big(y^{(t)} \mid X, G, \hat{w}, \hat{\gamma}\big), \qquad (26)
$$

where $\hat{\gamma}$ is the estimate in (24).
6.3 Class Prediction
The MCMC output can also be used to predict the class membership of future observations $X_f$. The predictive density is given by

$$
\begin{aligned}
p(y_f = k \mid X_f, X, G) &= \int p(y_f = k, y, w, \gamma \mid X_f, X, G)\, dy\, d\gamma\, dw \\
&\propto \int p(X_f, X, y_f = k, y \mid G, w, \gamma)\, p(\gamma \mid G)\, p(w \mid G)\, dy\, d\gamma\, dw \\
&\approx \sum_{t=1}^{M} p\big(X_f, X, y_f = k, y^{(t)} \mid \gamma^{(t)}, w^{(t)}, G\big)\, p\big(\gamma^{(t)} \mid G\big)\, p\big(w^{(t)} \mid G\big), \qquad (27)
\end{aligned}
$$

and the observations are allocated according to

$$
\hat{y}_f = \arg\max_{1 \le k \le G} p(y_f = k \mid X_f, X, G). \qquad (28)
$$
7. PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates (sepal length, sepal width, petal length, and petal width) measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h1 = 100. We set Gmax = 10 and κ1 = 0.3, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, ..., Gmax]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for the sample allocations ŷ, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped
Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.
Table 1. Iris Data: Posterior Distribution of G

       Poisson(λ = 5)         Poisson(λ = 10)        Uniform[1, 10]
k    Different Σk  Equal Σ  Different Σk  Equal Σ  Different Σk  Equal Σ
3      .6466        0         .5019        0         .8548        0
4      .3124       .7413      .4291       .4384      .1417       .6504
5      .0403       .3124      .0545       .2335      .0035       .3446
6      .0007       .0418      .0138       .2786      0           .0050
7      0            0         .0007       .0484      0            0
8      0            0         0           .0011      0            0
together, except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k|X, G = 4) computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y_104 = 3|X, G) = .4169 and p(y_104 = 4|X, G) = .5823.
We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X) assuming G ~ Poisson(λ = 10) and G ~ uniform[1, ..., Gmax] are given in Table 1. Under both priors, we obtained sample allocations identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation, and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k|X, G); there is little support for assigning them to separate clusters.
We use this example also to illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights, under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400, among components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of w_k is removed. It is now straightforward to obtain sensible estimates using the posterior means.
Table 2. Iris Data: Sample Allocation ŷ

                  Heterogeneous covariance    Homogeneous covariance
                     I    II   III               I    II   III   IV
Iris setosa         50     0     0              50     0     0    0
Iris versicolor      0    45     5               0    49     1    0
Iris virginica       0     0    50               0     2    36   12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 4), i = 1, ..., 150, k = 1, ..., 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

$$
x_{ij} \sim I_{\{1 \le i \le 4\}} N(\mu_1, \sigma_1^2) + I_{\{5 \le i \le 7\}} N(\mu_2, \sigma_2^2) + I_{\{8 \le i \le 13\}} N(\mu_3, \sigma_3^2) + I_{\{14 \le i \le 15\}} N(\mu_4, \sigma_4^2),
$$
$$
i = 1, \ldots, 15, \quad j = 1, \ldots, 20,
$$

where $I_{\{\cdot\}}$ is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ1 = 5, μ2 = 2, μ3 = −3, and μ4 = −6, and the component variances were set to σ1² = 1.5, σ2² = 1, σ3² = .5, and σ4² = 2. We drew an additional set of (p − 20) noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1,000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
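A simulation of this form can be reproduced along the following lines. This is a sketch under stated assumptions: NumPy is assumed, and the choice p = 50 and the fixed seed are illustrative, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 15, 50                          # p - 20 of the columns are pure noise
sizes = [4, 3, 6, 2]                   # cluster sizes of the 15 samples
means = [5.0, 2.0, -3.0, -6.0]
variances = [1.5, 1.0, 0.5, 2.0]

# 20 discriminating variables, drawn cluster by cluster
X_disc = np.vstack([rng.normal(m, np.sqrt(v), size=(nk, 20))
                    for nk, m, v in zip(sizes, means, variances)])
X_noise = rng.normal(size=(n, p - 20))   # standard normal noise variables
X = np.hstack([X_disc, X_noise])[:, rng.permutation(p)]  # disperse predictors
```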
We permuted the columns of the data matrix X to disperse the predictors. For each dataset, we took δ = 3, α = 1,
Table 3. Iris Data: Sensitivity to Hyperparameter h1: Results With h1 = 10

Posterior distribution of G
k              4      5      6      7      8      9      10
p(G = k|X)   .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations ŷ
                    I    II   III   IV    V    VI
Iris setosa        50     0     0    0    0     0
Iris versicolor     0    45     1    0    4     0
Iris virginica      0     2    35   12    0     1
Figure 3. Iris Data, Sensitivity to h1, Results for h1 = 10: Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k|X, G = 6), i = 1, ..., 5, k = 1, ..., 6.
h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0: their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior on γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases, we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1,000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.

We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1,000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Figure 5. Trace Plots for Simulated Data With p = 1,000. (a) Number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1,000: Posterior Distribution of G

k     p(G = k|X)
2       .0111
3       .1447
4       .3437
5       .3375
6       .1326
7       .0216
8       .0040
9       .0015
10      .0008
11      .0005
12      .0007
13      .0012
14      .0002
Figure 6. Simulated Data With p = 1,000: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 4), i = 1, ..., 15, k = 1, ..., 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1,000: Marginal Posterior Probabilities for Inclusion of Variables, p(γj = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1,000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed from the analysis probe sets with at least one unreliable reading. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
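The filtering and transformation steps just described amount to the following. A sketch, assuming NumPy; the detection limits are the values quoted above, and the toy matrix is purely illustrative.

```python
import numpy as np

def preprocess(X, lo=20.0, hi=16000.0):
    """Drop probe sets (columns) with any reading outside the reliable
    detection limits, log-transform, and rescale each variable by its range."""
    X = np.asarray(X, dtype=float)
    keep = ((X >= lo) & (X <= hi)).all(axis=0)   # reliable in every tissue
    Xk = np.log(X[:, keep])
    return Xk / (Xk.max(axis=0) - Xk.min(axis=0)), keep

# toy expression matrix: 2 tissues x 3 probe sets
Xs, keep = preprocess([[25.0, 10.0, 100.0],
                       [50.0, 30.0, 20000.0]])
```

After rescaling, each retained variable has unit range across the tissues.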
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot of the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables, p(γj = 1|X, G = 3), for Each of the Four Chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8. DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data, Concordance of Results Among the Four Chains: Pairwise Scatterplots of Frequencies for Inclusion of Genes.
Figure 12. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities p(γj = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all of the different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 3), i = 1, ..., 14, k = 1, ..., 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.

Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$
$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid \Sigma, G, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
$$

The integrand factors into a term for the selected variables $x_{(\gamma)}$, involving the normal likelihoods of the $G$ clusters, the normal priors on the $\mu_{k(\gamma)}$, and the inverse-Wishart prior on $\Sigma_{(\gamma)}$, and an analogous term for the nonselected variables $x_{(\gamma^c)}$, involving $\eta_{(\gamma^c)}$ and $\Omega_{(\gamma^c)}$. Completing the quadratic forms in $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$ and integrating out these normal kernels leaves inverse-Wishart kernels in $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$, with scale matrices $Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}$ and $Q_{0(\gamma^c)} + S_{0(\gamma^c)}$. Evaluating these remaining integrals yields

$$
\begin{aligned}
f(X, y \mid G, w, \gamma) = {} & \pi^{-np/2} \Bigg[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Bigg] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
& \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\big((\delta + n + p_\gamma - j)/2\big)}{\Gamma\big((\delta + p_\gamma - j)/2\big)} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\big((\delta + n + p - p_\gamma - j)/2\big)}{\Gamma\big((\delta + p - p_\gamma - j)/2\big)} \\
& \times \big|Q_{1(\gamma)}\big|^{(\delta + p_\gamma - 1)/2}\, \big|Q_{0(\gamma^c)}\big|^{(\delta + p - p_\gamma - 1)/2} \\
& \times \Bigg|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Bigg|^{-(\delta + n + p_\gamma - 1)/2} \big|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\big|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

$$
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^T - (n_k + h_1^{-1})^{-1} \Bigg(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Bigg) \Bigg(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Bigg)^T \\
&= \sum_{x_{i(\gamma)} \in C_k} \big(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\big)\big(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\big)^T + \frac{n_k}{h_1 n_k + 1} \big(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\big)\big(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\big)^T
\end{aligned}
$$

and

$$
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^T - (n + h_0^{-1})^{-1} \Bigg(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Bigg)\Bigg(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Bigg)^T \\
&= \sum_{i=1}^{n} \big(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)\big(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)^T + \frac{n}{h_0 n + 1} \big(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)\big(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\big)^T,
\end{aligned}
$$

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
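For reference, the factors of this closed form that involve the selected variables can be evaluated numerically in a few lines. This is a sketch assuming NumPy, and the analogous factor for the nonselected variables (same form, with $h_0$, $Q_{0(\gamma^c)}$, and a single "cluster" of all n observations) is omitted.

```python
import math
import numpy as np

def log_marginal_gamma_part(X, y, w, h1, delta, mu0, Q1):
    """Log of the selected-variable factors of f(X, y | G, w, gamma) under
    equal covariances. X: n x p_gamma array restricted to the selected
    variables; y: cluster labels 0..G-1; w: component weights."""
    n, pg = X.shape
    G = len(w)
    out = -0.5 * n * pg * math.log(math.pi)
    S_sum = np.zeros((pg, pg))
    for k in range(G):
        Xk = X[np.asarray(y) == k]
        nk = len(Xk)
        out += nk * math.log(w[k]) - 0.5 * pg * math.log(h1 * nk + 1)
        xbar = Xk.mean(axis=0)
        C = (Xk - xbar).T @ (Xk - xbar)            # within-cluster scatter
        d = (mu0 - xbar)[:, None]
        S_sum += C + (nk / (h1 * nk + 1)) * (d @ d.T)   # S_k(gamma)
    for j in range(1, pg + 1):                      # ratio of Gamma products
        out += math.lgamma((delta + n + pg - j) / 2.0) \
             - math.lgamma((delta + pg - j) / 2.0)
    _, ld1 = np.linalg.slogdet(Q1)
    _, ld2 = np.linalg.slogdet(Q1 + S_sum)
    out += 0.5 * (delta + pg - 1) * ld1 - 0.5 * (delta + n + pg - 1) * ld2
    return out
```

Because the value depends on the data only through cluster sizes, means, and scatters, it is invariant to permuting observations within a cluster, which gives a quick sanity check.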
Heterogeneous Covariances

$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid \Sigma, G, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
$$

The factor for the nonselected variables is identical to the homogeneous case. For the selected variables, the integration now proceeds cluster by cluster: completing the quadratic form in each $\mu_{k(\gamma)}$ and integrating out the normal kernel leaves, for each cluster, an inverse-Wishart kernel in $\Sigma_{k(\gamma)}$ with scale matrix $Q_{1(\gamma)} + S_{k(\gamma)}$, which integrates to give

$$
\begin{aligned}
f(X, y \mid G, w, \gamma) = {} & \pi^{-np/2} \prod_{k=1}^{G} \Bigg[ (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \prod_{j=1}^{p_\gamma} \frac{\Gamma\big((\delta + n_k + p_\gamma - j)/2\big)}{\Gamma\big((\delta + p_\gamma - j)/2\big)} \\
& \qquad\qquad \times \big|Q_{1(\gamma)}\big|^{(\delta + p_\gamma - 1)/2} \big|Q_{1(\gamma)} + S_{k(\gamma)}\big|^{-(\delta + n_k + p_\gamma - 1)/2} \Bigg] \\
& \times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\big((\delta + n + p - p_\gamma - j)/2\big)}{\Gamma\big((\delta + p - p_\gamma - j)/2\big)} \\
& \times \big|Q_{0(\gamma^c)}\big|^{(\delta + p - p_\gamma - 1)/2} \big|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\big|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined in the homogeneous case.
REFERENCES
Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5
Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821
Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press
Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182
(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641
Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270
Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275
Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375
Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228
Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center
George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373
Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286
Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall
Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136
Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732
Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232
McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422
Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206
Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71
Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190
Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792
Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819
Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74
(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809
Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554
608 Journal of the American Statistical Association June 2005
estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,
\[
y^{*} = \arg\max_{1 \le t \le M}\, p\bigl(y^{(t)} \mid X, G, w, \hat{\gamma}\bigr), \qquad (26)
\]
where $\hat{\gamma}$ is the estimate in (24).
6.3 Class Prediction

The MCMC output can also be used to predict the class membership of future observations $X_f$. The predictive density is given by
\[
\begin{aligned}
p(y_f = k \mid X_f, X, G)
&= \int p(y_f = k, y, w, \gamma \mid X_f, X, G)\, dy\, d\gamma\, dw \\
&\propto \int p(X_f, X, y_f = k, y \mid G, w, \gamma)\, p(\gamma \mid G)\, p(w \mid G)\, dy\, d\gamma\, dw \\
&\approx \sum_{t=1}^{M} p\bigl(X_f, X, y_f = k, y^{(t)} \mid \gamma^{(t)}, w^{(t)}, G\bigr)\, p\bigl(\gamma^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (27)
\end{aligned}
\]
and the observations are allocated according to
\[
\hat{y}_f = \arg\max_{1 \le k \le G}\, p(y_f = k \mid X_f, X, G). \qquad (28)
\]
7 PERFORMANCE OF OUR METHODOLOGY
In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.
7.1 Iris Data
This dataset consists of four covariates (petal width, petal height, sepal width, and sepal height) measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.
We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h1 = 100. We set Gmax = 10 and κ1 = 0.3, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, Gmax]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for the sample allocations y, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped
Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.
Tadesse Sha and Vannucci Bayesian Variable Selection 609
Table 1. Iris Data: Posterior Distribution of G

       Poisson(λ = 5)         Poisson(λ = 10)        Uniform[1, 10]
k    Different Σk  Equal Σ   Different Σk  Equal Σ   Different Σk  Equal Σ
3       .6466       0           .5019       0           .8548       0
4       .3124       .7413       .4291       .4384       .1417       .6504
5       .0403       .3124       .0545       .2335       .0035       .3446
6       .0007       .0418       .0138       .2786       0           .0050
7       0           0           .0007       .0484       0           0
8       0           0           0           .0011       0           0
together except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k|X, G = 4) computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y_104 = 3|X, G) = .4169 and p(y_104 = 4|X, G) = .5823.
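The marginal allocation probabilities quoted above (e.g., those for observation 104) are estimated as relative frequencies of cluster assignments across the relabeled MCMC sweeps. A minimal sketch, with hypothetical allocation draws:

```python
# Hypothetical relabeled allocation draws y^(t) for n = 5 observations,
# each row one MCMC sweep conditional on G = 2. The marginal probability
# p(y_i = k | X, G) is estimated by the relative frequency with which
# observation i is assigned to cluster k.
allocation_draws = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
]

def allocation_probs(draws, G):
    n = len(draws[0])
    counts = [[0] * G for _ in range(n)]
    for y in draws:
        for i, k in enumerate(y):
            counts[i][k] += 1
    return [[c / len(draws) for c in row] for row in counts]

probs = allocation_probs(allocation_draws, G=2)
y_hat = [max(range(2), key=row.__getitem__) for row in probs]  # modal cluster
```

An observation assigned to cluster 2 in three of the four sweeps gets marginal probability .75, the kind of "close call" discussed in the text.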
We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X) assuming G ∼ Poisson(λ = 10) and G ∼ uniform[1, Gmax] are given in Table 1. Under both priors we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h1 = 10 under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k|X, G); there is little support for assigning them to separate clusters.
We use this example to also illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400 between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of wk is removed. It is now straightforward to obtain sensible estimates using the posterior means.
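One simple way to see what such postprocessing does: relabel each draw by the permutation of component labels that best matches a reference ordering. The sketch below matches component means by brute-force search over permutations; it is a toy stand-in for decision-theoretic relabeling algorithms such as that of Stephens (2000b), and all numerical values are hypothetical:

```python
from itertools import permutations

# Reference ordering of component means (e.g., from a well-separated
# draw); each sweep's labels are permuted to match it, so that the
# traces of w_k become unimodal.
reference = [0.0, 5.0, 10.0]

def relabel(mu_draw, w_draw, reference):
    G = len(reference)
    # Pick the label permutation minimizing squared distance to the reference.
    best = min(permutations(range(G)),
               key=lambda p: sum((mu_draw[p[k]] - reference[k]) ** 2
                                 for k in range(G)))
    return [mu_draw[k] for k in best], [w_draw[k] for k in best]

# A draw whose first and third labels have switched:
mu, w = relabel([9.8, 5.1, 0.2], [0.2, 0.3, 0.5], reference)
```

After relabeling, the weights follow their components consistently across sweeps, which is what removes the multimodality in the marginal posteriors of wk.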
Table 2. Iris Data: Sample Allocation y

                  Heterogeneous covariance     Homogeneous covariance
                    I     II    III             I     II    III   IV
Iris setosa        50      0      0            50      0      0     0
Iris versicolor     0     45      5             0     49      1     0
Iris virginica      0      0     50             0      2     36    12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 4), i = 1, ..., 150, k = 1, ..., 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that
\[
x_{ij} \sim I_{1\le i\le 4}\, N(\mu_1, \sigma_1^2) + I_{5\le i\le 7}\, N(\mu_2, \sigma_2^2) + I_{8\le i\le 13}\, N(\mu_3, \sigma_3^2) + I_{14\le i\le 15}\, N(\mu_4, \sigma_4^2),
\]
\[
i = 1, \ldots, 15, \qquad j = 1, \ldots, 20,
\]
where $I_{\cdot}$ is the indicator function equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ1 = 5, μ2 = 2, μ3 = −3, and μ4 = −6, and the component variances were set to σ1² = 1.5, σ2² = 1, σ3² = .5, and σ4² = 2. We drew an additional set of (p − 20) noisy variables that do not distinguish between the clusters from a standard normal density. We considered several values of p: p = 50, 100, 500, 1000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
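This simulation design can be sketched as follows. The choice of p, the seed, and the use of Python's random module are illustrative, and the component variances are read as 1.5, 1, .5, and 2 (the decimal points are lost in the extraction), so treat those values as assumptions:

```python
import random

random.seed(1)

# 15 observations: the first 20 columns discriminate the four groups;
# the remaining p - 20 "noisy" columns are standard normal.
p = 50
group_of = [1] * 4 + [2] * 3 + [3] * 6 + [4] * 2   # group sizes 4, 3, 6, 2
mean = {1: 5.0, 2: 2.0, 3: -3.0, 4: -6.0}
sd = {1: 1.5 ** 0.5, 2: 1.0, 3: 0.5 ** 0.5, 4: 2.0 ** 0.5}

X = [[random.gauss(mean[g], sd[g]) for _ in range(20)] +
     [random.gauss(0.0, 1.0) for _ in range(p - 20)]
     for g in group_of]
```

In the study the columns of X are then permuted so that the discriminating variables are dispersed among the noisy ones.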
We permuted the columns of the data matrix X to disperse the predictors. For each dataset we took δ = 3, α = 1,
Table 3. Iris Data: Sensitivity to Hyperparameter h1; Results With h1 = 10

Posterior distribution of G
k              4       5       6       7       8       9       10
p(G = k|X)    .0747   .1874   .3326   .2685   .1120   .0233   .0015

Sample allocations y
                    I     II    III    IV     V     VI
Iris setosa        50      0      0     0     0      0
Iris versicolor     0     45      1     0     4      0
Iris virginica      0      2     35    12     0      1
Figure 3. Iris Data, Sensitivity to h1, Results for h1 = 10: Marginal Posterior Probabilities for the Samples Allocated to Clusters V and VI, p(y_i = k|X, G = 6), i = 1, ..., 5, k = 1, ..., 6.
h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0; their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k|X, G = 4). We note that these match the group structure used to simulate the data. As for the selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γ_j = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
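Variable selection from the MCMC output reduces to tabulating inclusion frequencies and thresholding them, as with the .9 and .4 cutoffs above. A minimal sketch with hypothetical post-burn-in draws of γ:

```python
# Hypothetical draws of the inclusion vector gamma for p = 6 variables;
# p(gamma_j = 1 | X, G) is estimated by the proportion of sweeps in
# which variable j is included, and variables are selected by
# thresholding that marginal probability.
gamma_draws = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0],
]

p_incl = [sum(col) / len(gamma_draws) for col in zip(*gamma_draws)]
selected = [j for j, pj in enumerate(p_incl) if pj > 0.5]
```

Lowering the threshold (e.g., from .9 to .4 in the text) admits variables whose inclusion is supported in a smaller fraction of the visited models.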
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

k     p(G = k|X)
2       .0111
3       .1447
4       .3437
5       .3375
6       .1326
7       .0216
8       .0040
9       .0015
10      .0008
11      .0005
12      .0007
13      .0012
14      .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 4), i = 1, ..., 15, k = 1, ..., 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables, p(γ_j = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for the delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
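The preprocessing just described (filtering on the reliable detection limits, log-transforming, and rescaling by range) can be sketched as follows; the probe names and readings are made up for illustration:

```python
import math

LOW, HIGH = 20.0, 16000.0  # reliable detection limits for this array

expr = {  # probe set -> readings across tissues (hypothetical values)
    "probe_a": [150.0, 300.0, 1200.0],
    "probe_b": [5.0, 40.0, 90.0],        # unreliable low reading -> dropped
    "probe_c": [800.0, 20000.0, 950.0],  # saturated reading -> dropped
    "probe_d": [60.0, 600.0, 6000.0],
}

# Drop probe sets with at least one reading outside the reliable range.
reliable = {k: v for k, v in expr.items() if all(LOW <= x <= HIGH for x in v)}
# Log-transform, then rescale each remaining variable by its range.
logged = {k: [math.log(x) for x in v] for k, v in reliable.items()}
scaled = {k: [x / (max(v) - min(v)) for x in v] for k, v in logged.items()}
```

After rescaling, each retained variable has range 1, which puts the covariates on a comparable scale before the priors of Section 4.1 are applied.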
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γ_j set to 0 for all j's except one randomly selected; (2) γ_j set to 1 for 10 randomly selected j's; (3) γ_j set to 1 for 25 randomly selected j's; and (4) γ_j set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot of the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γ_j = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
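The concordance check described above amounts to correlating the gene-inclusion frequencies obtained from different chains. A self-contained sketch, with hypothetical frequencies for two chains:

```python
# Pearson correlation of inclusion frequencies between two chains.
def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Hypothetical inclusion frequencies for five genes from two chains.
chain1 = [0.91, 0.10, 0.75, 0.05, 0.60]
chain2 = [0.88, 0.15, 0.70, 0.08, 0.55]
r = corr(chain1, chain2)
```

A correlation near 1, as here, indicates that the chains are visiting the same genes with similar frequency despite their different starting models.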
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γ_j = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables, p(γ_j = 1|X, G = 3), for Each of the Four Chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8 DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data, Concordance of Results Among the Four Chains: Pairwise Scatterplots of Frequencies for Inclusion of Genes.
Figure 12. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities p(γ_j = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities of Sample Allocations, p(y_i = k|X, G = 3), i = 1, ..., 14, k = 1, ..., 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G} w_k^{n_k}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2} \\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^{T}\Sigma_{(\gamma)}^{-1}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\} \\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)^{T}\bigl(h_1\Sigma_{(\gamma)}\bigr)^{-1}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)\Biggr\} \\
&\quad\times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\} \\
&\quad\times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2}(2\pi)^{-(n+1)(p-p_\gamma)/2}\, h_0^{-(p-p_\gamma)/2}
\exp\Biggl\{-\frac{1}{2}\sum_{i=1}^{n}\bigl(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\bigr)^{T}\Omega_{(\gamma^c)}^{-1}\bigl(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\bigr)\Biggr\} \\
&\quad\times \exp\Bigl\{-\frac{1}{2}\bigl(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\bigr)^{T}\bigl(h_0\Omega_{(\gamma^c)}\bigr)^{-1}\bigl(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\bigr)\Bigr\} \\
&\quad\times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\Bigl(\frac{\delta+p-p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+2(p-p_\gamma))/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
\]
Completing the squares in $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$ gives
\[
\begin{aligned}
&= \int \prod_{k=1}^{G}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\, w_k^{n_k}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Sigma_{(\gamma)}^{-1}S_{k(\gamma)}\bigr)\Bigr\} \\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\Biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\Biggr)^{T}\bigl(n_k+h_1^{-1}\bigr)\Sigma_{(\gamma)}^{-1}\Biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\Biggr)\Biggr\} \\
&\quad\times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\} \\
&\quad\times (2\pi)^{-(n+1)(p-p_\gamma)/2}\, h_0^{-(p-p_\gamma)/2}\bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1}S_{0(\gamma^c)}\bigr)\Bigr\} \\
&\quad\times \exp\Biggl\{-\frac{1}{2}\Biggl(\eta_{(\gamma^c)}-\frac{\sum_{i=1}^{n} x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}}{n+h_0^{-1}}\Biggr)^{T}\bigl(n+h_0^{-1}\bigr)\Omega_{(\gamma^c)}^{-1}\Biggl(\eta_{(\gamma^c)}-\frac{\sum_{i=1}^{n} x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}}{n+h_0^{-1}}\Biggr)\Biggr\} \\
&\quad\times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\Bigl(\frac{\delta+p-p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+2(p-p_\gamma))/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
\]
Integrating out $\mu$ and $\eta$ yields
\[
\begin{aligned}
&= \Biggl[\prod_{k=1}^{G}\bigl(h_1 n_k+1\bigr)^{-p_\gamma/2} w_k^{n_k}\Biggr]\pi^{-n p_\gamma/2}\,
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\, 2^{-p_\gamma(\delta+n+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad\times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+n+2p_\gamma)/2}\exp\Biggl\{-\frac{1}{2}\,\mathrm{tr}\Biggl(\Sigma_{(\gamma)}^{-1}\Biggl[Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Biggr]\Biggr)\Biggr\}\, d\Sigma \\
&\quad\times \bigl(h_0 n+1\bigr)^{-(p-p_\gamma)/2}\pi^{-n(p-p_\gamma)/2}\,
\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\, 2^{-(p-p_\gamma)(\delta+n+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\Bigl(\frac{\delta+p-p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad\times \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+n+2(p-p_\gamma))/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1}\bigl[Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\, d\Omega,
\end{aligned}
\]
and, finally, integrating out $\Sigma$ and $\Omega$,
\[
\begin{aligned}
&= \pi^{-np/2}\Biggl[\prod_{k=1}^{G}\bigl(h_1 n_k+1\bigr)^{-p_\gamma/2} w_k^{n_k}\Biggr]\bigl(h_0 n+1\bigr)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)} \\
&\quad\times \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}
\Biggl|Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Biggr|^{-(\delta+n+p_\gamma-1)/2}
\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]
where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by
\[
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}x_{i(\gamma)}^{T} + h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^{T}
- \bigl(n_k+h_1^{-1}\bigr)^{-1}\Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Biggr)\Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Biggr)^{T} \\
&= \sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^{T}
+ \frac{n_k}{h_1 n_k+1}\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^{T}
\end{aligned}
\]
and
\[
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)}x_{i(\gamma^c)}^{T} + h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^{T}
- \bigl(n+h_0^{-1}\bigr)^{-1}\Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Biggr)\Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Biggr)^{T} \\
&= \sum_{i=1}^{n}\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^{T}
+ \frac{n}{h_0 n+1}\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^{T},
\end{aligned}
\]
with $\bar{x}_{k(\gamma)} = \frac{1}{n_k}\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n}\sum_{i=1}^{n} x_{i(\gamma^c)}$.
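The equality of the two expressions for $S_{k(\gamma)}$ above (the cross-product form and the centered form with the $n_k/(h_1 n_k+1)$ shrinkage term) can be checked numerically in one dimension; the data values below are arbitrary:

```python
import random

random.seed(0)

# One-dimensional check (p_gamma = 1) that the raw cross-product form
# of S_k equals the centered form with the shrinkage term.
h1 = 100.0
mu0 = 0.5
xs = [random.gauss(2.0, 1.0) for _ in range(7)]  # a toy cluster C_k
nk = len(xs)
xbar = sum(xs) / nk

raw = (sum(x * x for x in xs) + mu0 * mu0 / h1
       - (sum(xs) + mu0 / h1) ** 2 / (nk + 1.0 / h1))
centered = (sum((x - xbar) ** 2 for x in xs)
            + nk / (h1 * nk + 1.0) * (mu0 - xbar) ** 2)
```

The two quantities agree to floating-point precision, which is the scalar case of the matrix identity stated above.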
Heterogeneous Covariances
\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma_1, \ldots, \Sigma_G, \eta, \Omega, \gamma)\, f(\mu \mid G, \gamma)\, f(\Sigma_1, \ldots, \Sigma_G \mid G, \gamma)\, f(\eta \mid \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma_1\cdots d\Sigma_G\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G}\Biggl[w_k^{n_k}\, \bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\,
2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1}\Biggr] \\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^{T}\Sigma_{k(\gamma)}^{-1}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\} \\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)^{T}\bigl(h_1\Sigma_{k(\gamma)}\bigr)^{-1}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)\Biggr\} \\
&\quad\times \prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\frac{1}{2}\,\mathrm{tr}\bigl(\Sigma_{k(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\}\, d\mu\, d\Sigma_1\cdots d\Sigma_G \\
&\quad\times \pi^{-n(p-p_\gamma)/2}\bigl(h_0 n+1\bigr)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\,
\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]
where the factors involving the nonselected variables have already been integrated out exactly as in the homogeneous case. Completing the square in $\mu_{k(\gamma)}$ and integrating out $\mu$ and then $\Sigma_1, \ldots, \Sigma_G$, as before, gives
\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2}\prod_{k=1}^{G}\Biggl[\bigl(h_1 n_k+1\bigr)^{-p_\gamma/2} w_k^{n_k}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n_k+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}\,
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|Q_{1(\gamma)}+S_{k(\gamma)}\bigr|^{-(\delta+n_k+p_\gamma-1)/2}\Biggr] \\
&\quad\times \bigl(h_0 n+1\bigr)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\,
\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}.
\end{aligned}
\]
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71
Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190
Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792
Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819
Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74
(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809
Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554
Tadesse Sha and Vannucci Bayesian Variable Selection 609
Table 1. Iris Data: Posterior Distribution of G

              Poisson(λ = 5)          Poisson(λ = 10)         Uniform[1, 10]
 k      Different Σk  Equal Σ    Different Σk  Equal Σ    Different Σk  Equal Σ
 3         .6466        0           .5019        0           .8548        0
 4         .3124       .7413        .4291       .4384        .1417       .6504
 5         .0403       .3124        .0545       .2335        .0035       .3446
 6         .0007       .0418        .0138       .2786         0          .0050
 7          0           0           .0007       .0484         0           0
 8          0           0            0          .0011         0           0
together except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k | X, G = 4), computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y_104 = 3 | X, G) = .4169 and p(y_104 = 4 | X, G) = .5823.
We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k | X) assuming G ∼ Poisson(λ = 10) and G ∼ Uniform[1, Gmax] are given in Table 1. Under both priors, we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h1, with values between 10 and 1000 giving satisfactory results. Smaller values of h1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k | X, G); there is little support for assigning them to separate clusters.
We use this example to also illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400 between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of w_k is removed. It is now straightforward to obtain sensible estimates using the posterior means.
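The postprocessing idea can be illustrated with a small relabeling pass over the stored draws. This is a minimal sketch, not the paper's exact procedure (which follows Stephens 2000b): it greedily permutes each iteration's labels so the component means match a running reference, and all names below are our own, with one-dimensional means for simplicity.

```python
import numpy as np
from itertools import permutations

def relabel_draws(means, weights):
    """Undo label switching by permuting each iteration's component labels
    so the component means best match a running reference.
    means, weights: (T x G) arrays of MCMC draws."""
    T, G = means.shape
    out_m, out_w = means.copy(), weights.copy()
    ref = out_m[0].astype(float).copy()        # reference order = first draw
    for t in range(1, T):
        # exhaustive search over the G! label permutations
        best = min(permutations(range(G)),
                   key=lambda perm: np.sum((out_m[t, list(perm)] - ref) ** 2))
        out_m[t] = out_m[t, list(best)]
        out_w[t] = out_w[t, list(best)]
        ref += (out_m[t] - ref) / (t + 1)      # update running mean
    return out_m, out_w
```

Exhaustive search over permutations is fine for the small G used here; for larger G one would switch to an assignment solver.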
Table 2. Iris Data: Sample Allocation y

                    Heterogeneous covariance     Homogeneous covariance
                        I     II    III            I     II    III    IV
 Iris setosa           50      0      0           50      0      0     0
 Iris versicolor        0     45      5            0     49      1     0
 Iris virginica         0      0     50            0      2     36    12
Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, …, 150; k = 1, …, 4.
7.2 Simulated Data
We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that
\[
x_{ij} \sim I_{1\le i\le 4}\,N(\mu_1,\sigma_1^2) + I_{5\le i\le 7}\,N(\mu_2,\sigma_2^2) + I_{8\le i\le 13}\,N(\mu_3,\sigma_3^2) + I_{14\le i\le 15}\,N(\mu_4,\sigma_4^2),
\]
i = 1, …, 15; j = 1, …, 20, where I_{(·)} is the indicator function equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ_1 = 5, μ_2 = 2, μ_3 = −3, and μ_4 = −6, and the component variances were set to σ_1² = 1.5, σ_2² = 1, σ_3² = .5, and σ_4² = 2. We drew an additional set of (p − 20) noisy variables that do not distinguish between the clusters from a standard normal density. We considered several values of p: p = 50, 100, 500, 1000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
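For concreteness, this simulation setup can be reproduced as follows; a hedged sketch (the function name and seed handling are our own, and the column permutation applied afterward in the text is omitted here):

```python
import numpy as np

def simulate_data(p=50, seed=0):
    """15 observations: 20 discriminating variables drawn from four normal
    components (cluster sizes 4, 3, 6, 2) plus (p - 20) standard-normal
    noise variables that carry no cluster information."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1, 2, 3], [4, 3, 6, 2])
    means = np.array([5.0, 2.0, -3.0, -6.0])
    sds = np.sqrt([1.5, 1.0, 0.5, 2.0])
    disc = rng.normal(means[labels, None], sds[labels, None], size=(15, 20))
    noise = rng.normal(size=(15, p - 20))
    return np.hstack([disc, noise]), labels
```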
We permuted the columns of the data matrix X to disperse the predictors. For each dataset, we took δ = 3, α = 1,
Table 3. Iris Data: Sensitivity to Hyperparameter h1, Results With h1 = 10

Posterior distribution of G

 k             4      5      6      7      8      9     10
 p(G = k|X)  .0747  .1874  .3326  .2685  .1120  .0233  .0015

Sample allocations y

                    I    II   III   IV    V   VI
 Iris setosa       50     0     0    0    0    0
 Iris versicolor    0    45     1    0    4    0
 Iris virginica     0     2    35   12    0    1
610 Journal of the American Statistical Association June 2005
Figure 3. Iris Data, Sensitivity to h1: Results for h1 = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k | X, G = 6), i = 1, …, 5; k = 1, …, 6.
h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0: their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
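The eigenvalue-decile heuristic just described can be sketched as follows. This is our own illustrative reading of it: the function name is an assumption, and which decile anchors which of κ1, κ0 (as well as the proportionality constant) is left to the analyst.

```python
import numpy as np

def eigenvalue_deciles(X):
    """Upper and lower deciles of the nonzero eigenvalues of the empirical
    covariance of X; kappa_1 and kappa_0 would then be set proportional
    to these so the Wishart scales match the data's variability."""
    ev = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    nonzero = ev[ev > 1e-10 * ev.max()]        # drop numerically zero modes
    return np.percentile(nonzero, 90), np.percentile(nonzero, 10)
```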
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k | X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γ_j = 1 | X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
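A quick way to obtain such marginal inclusion probabilities from stored draws is the post-burn-in visit frequency of each variable; note this ergodic-average estimate is a simplification of the renormalized model-probability estimate in (23), and the names below are illustrative:

```python
import numpy as np

def inclusion_probabilities(gamma_draws, threshold=0.9):
    """Marginal inclusion probabilities as post-burn-in visit frequencies.
    gamma_draws: (iterations x p) binary matrix of sampled gamma vectors.
    Returns per-variable probabilities and the indices above threshold."""
    probs = gamma_draws.mean(axis=0)
    return probs, np.flatnonzero(probs > threshold)
```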
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrogram of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1000: Posterior Distribution of G

  k    p(G = k|X)
  2      .0111
  3      .1447
  4      .3437
  5      .3375
  6      .1326
  7      .0216
  8      .0040
  9      .0015
 10      .0008
 11      .0005
 12      .0007
 13      .0012
 14      .0002
Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, …, 15; k = 1, …, 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables p(γ_j = 1 | X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate a transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
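These filtering and transformation steps can be sketched as follows (an illustrative reading of the description above; the function name and return values are our own):

```python
import numpy as np

def preprocess_expression(expr, lower=20.0, upper=16000.0):
    """Drop probe sets with any reading outside the reliable detection
    range, log-transform, and rescale each remaining variable by its range.
    expr: (samples x probes) matrix of expression readings."""
    reliable = ((expr >= lower) & (expr <= upper)).all(axis=0)
    X = np.log(expr[:, reliable])
    X = X / (X.max(axis=0) - X.min(axis=0))   # each column now has range 1
    return X, reliable
```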
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γ_j set to 0 for all j's except one randomly selected; (2) γ_j set to 1 for 10 randomly selected j's; (3) γ_j set to 1 for 25 randomly selected j's; and (4) γ_j set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as a burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminate among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γ_j = 1 | X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γ_j = 1 | X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k | X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γ_j = 1 | X, G = 3), for each of the four chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8 DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data: Concordance of results among the four chains. Pairwise scatterplots of frequencies for inclusion of genes.
Figure 12. Endometrial Data, Union of four chains: Marginal posterior probabilities p(γ_j = 1 | X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data, Union of four chains: Marginal posterior probabilities of sample allocations p(y_i = k | X, G = 3), i = 1, …, 14; k = 1, …, 3.
problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
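The pairwise co-clustering probabilities discussed above are simple to estimate from stored allocation draws; this sketch (names our own) also shows why they sidestep label switching, since the event y_i = y_j is invariant to relabeling:

```python
import numpy as np

def coclustering_matrix(alloc_draws):
    """Posterior probability that observations i and j fall in the same
    cluster, averaged over MCMC allocation draws.
    alloc_draws: (iterations x n) integer cluster labels."""
    T, n = alloc_draws.shape
    P = np.zeros((n, n))
    for z in alloc_draws:
        P += (z[:, None] == z[None, :])   # co-clustering indicator matrix
    return P / T
```

The matrix 1 − P can then serve as a dissimilarity for hierarchical clustering, as in Medvedovic and Sivaganesan (2002).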
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model and the prior on γ could be modified to accommodate the additional terms.
APPENDIX FULL CONDITIONALS
Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega\\[4pt]
&= \int \prod_{k=1}^{G} w_k^{n_k}\,\bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}h_1^{-p_\gamma/2}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^{T}\Sigma_{(\gamma)}^{-1}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)^{T}\bigl(h_1\Sigma_{(\gamma)}\bigr)^{-1}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)\Biggr\}\\
&\quad\times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\biggl(\frac{\delta+p_\gamma-j}{2}\biggr)\Biggr)^{-1}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\}\\
&\quad\times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2}(2\pi)^{-(n+1)(p-p_\gamma)/2}h_0^{-(p-p_\gamma)/2}
\exp\Biggl\{-\frac{1}{2}\sum_{i=1}^{n}\bigl(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\bigr)^{T}\Omega_{(\gamma^c)}^{-1}\bigl(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\bigr)\Biggr\}\\
&\quad\times \exp\Bigl\{-\tfrac{1}{2}\bigl(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\bigr)^{T}\bigl(h_0\Omega_{(\gamma^c)}\bigr)^{-1}\bigl(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\bigr)\Bigr\}\\
&\quad\times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\biggl(\frac{\delta+p-p_\gamma-j}{2}\biggr)\Biggr)^{-1}\\
&\quad\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+2(p-p_\gamma))/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{align*}

Completing the squares in $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$,

\begin{align*}
&= \int \prod_{k=1}^{G}(2\pi)^{-(n_k+1)p_\gamma/2}h_1^{-p_\gamma/2}w_k^{n_k}\bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1}S_{k(\gamma)}\bigr)\Bigr\}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\biggr)^{T}\bigl[(n_k+h_1^{-1})^{-1}\Sigma_{(\gamma)}\bigr]^{-1}\biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\biggr)\Biggr\}\\
&\quad\times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\biggl(\frac{\delta+p_\gamma-j}{2}\biggr)\Biggr)^{-1}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\}\\
&\quad\times (2\pi)^{-(n+1)(p-p_\gamma)/2}h_0^{-(p-p_\gamma)/2}\bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1}S_{0(\gamma^c)}\bigr)\Bigr\}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\biggl(\eta_{(\gamma^c)}-\frac{\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}}{n+h_0^{-1}}\biggr)^{T}\bigl[(n+h_0^{-1})^{-1}\Omega_{(\gamma^c)}\bigr]^{-1}\biggl(\eta_{(\gamma^c)}-\frac{\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}}{n+h_0^{-1}}\biggr)\Biggr\}\\
&\quad\times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\biggl(\frac{\delta+p-p_\gamma-j}{2}\biggr)\Biggr)^{-1}\\
&\quad\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+2(p-p_\gamma))/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{align*}

Integrating out $\mu$ and $\eta$,

\begin{align*}
&= \Biggl[\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}w_k^{n_k}\Biggr]\pi^{-np_\gamma/2}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}2^{-p_\gamma(\delta+n+p_\gamma-1)/2}\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\biggl(\frac{\delta+p_\gamma-j}{2}\biggr)\Biggr)^{-1}\\
&\quad\times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+n+2p_\gamma)/2}\exp\Biggl\{-\frac{1}{2}\operatorname{tr}\Biggl(\Sigma_{(\gamma)}^{-1}\Biggl[Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Biggr]\Biggr)\Biggr\}\, d\Sigma\\
&\quad\times (h_0 n+1)^{-(p-p_\gamma)/2}\pi^{-n(p-p_\gamma)/2}\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}2^{-(p-p_\gamma)(\delta+n+p-p_\gamma-1)/2}\pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p-p_\gamma}\Gamma\biggl(\frac{\delta+p-p_\gamma-j}{2}\biggr)\Biggr)^{-1}\\
&\quad\times \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+n+2(p-p_\gamma))/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1}\bigl[Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\, d\Omega\\[4pt]
&= \pi^{-np/2}\Biggl[\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}w_k^{n_k}\Biggr](h_0 n+1)^{-(p-p_\gamma)/2}\\
&\quad\times \prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}\
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\\
&\quad\times \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\Biggl|Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Biggr|^{-(\delta+n+p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{align*}

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

\begin{align*}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}x_{i(\gamma)}^{T}+h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^{T}
-(n_k+h_1^{-1})^{-1}\Biggl(\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Biggr)\Biggl(\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Biggr)^{T}\\
&= \sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^{T}+\frac{n_k}{h_1 n_k+1}\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^{T}
\end{align*}

and

\begin{align*}
S_{0(\gamma^c)} &= \sum_{i=1}^{n}x_{i(\gamma^c)}x_{i(\gamma^c)}^{T}+h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^{T}
-(n+h_0^{-1})^{-1}\Biggl(\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Biggr)\Biggl(\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Biggr)^{T}\\
&= \sum_{i=1}^{n}\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^{T}+\frac{n}{h_0 n+1}\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^{T},
\end{align*}

with $\bar{x}_{k(\gamma)}=\frac{1}{n_k}\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)}=\frac{1}{n}\sum_{i=1}^{n}x_{i(\gamma^c)}$.
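As a numerical sanity check, the final closed form above can be evaluated directly on the log scale. The sketch below is our own illustrative implementation, not the authors' code: the function and argument names are assumptions, Q1 and Q0 are passed as full p × p scale matrices and restricted to the selected and nonselected coordinates inside the function, and the S matrices use the centered form (the two displayed forms are algebraically equal).

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal_homogeneous(X, y, w, gamma, mu0, h1, h0, delta, Q1, Q0):
    """Evaluate log f(X, y | G, w, gamma) for the homogeneous-covariance
    case, term by term following the final closed-form expression."""
    n, p = X.shape
    g = np.asarray(gamma, dtype=bool)
    pg = int(g.sum())
    G = int(np.max(y)) + 1
    Xg, Xc = X[:, g], X[:, ~g]

    def shrink_ss(Z, m0, h):
        # S matrix in its centered form
        m = Z.mean(axis=0)
        C = (Z - m).T @ (Z - m)
        return C + Z.shape[0] / (h * Z.shape[0] + 1.0) * np.outer(m0 - m, m0 - m)

    out = -0.5 * n * p * log(pi) - 0.5 * (p - pg) * log(h0 * n + 1.0)
    S_sum = np.zeros((pg, pg))
    for k in range(G):
        Zk = Xg[np.asarray(y) == k]
        nk = Zk.shape[0]
        out += nk * log(w[k]) - 0.5 * pg * log(h1 * nk + 1.0)
        S_sum += shrink_ss(Zk, mu0[g], h1)
    # ratios of Gamma functions, selected and nonselected blocks
    for j in range(1, pg + 1):
        out += lgamma((delta + n + pg - j) / 2) - lgamma((delta + pg - j) / 2)
    for j in range(1, p - pg + 1):
        out += lgamma((delta + n + p - pg - j) / 2) - lgamma((delta + p - pg - j) / 2)
    # determinant terms
    Q1g, Q0c = Q1[np.ix_(g, g)], Q0[np.ix_(~g, ~g)]
    S0 = shrink_ss(Xc, mu0[~g], h0)
    out += 0.5 * (delta + pg - 1) * np.linalg.slogdet(Q1g)[1]
    out -= 0.5 * (delta + n + pg - 1) * np.linalg.slogdet(Q1g + S_sum)[1]
    out += 0.5 * (delta + p - pg - 1) * np.linalg.slogdet(Q0c)[1]
    out -= 0.5 * (delta + n + p - pg - 1) * np.linalg.slogdet(Q0c + S0)[1]
    return out
```

Label-permutation invariance (relabeling the clusters together with equal weights leaves the value unchanged) gives a quick check of such an implementation.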
Heterogeneous Covariances
\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega\\[4pt]
&= \int \prod_{k=1}^{G} w_k^{n_k}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}h_1^{-p_\gamma/2}\,2^{-p_\gamma(\delta+p_\gamma-1)/2}\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\biggl(\frac{\delta+p_\gamma-j}{2}\biggr)\Biggr)^{-1}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^{T}\Sigma_{k(\gamma)}^{-1}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)^{T}\bigl(h_1\Sigma_{k(\gamma)}\bigr)^{-1}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)\Biggr\}\\
&\quad\times \prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\}\, d\mu\, d\Sigma\\
&\quad\times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{align*}

where the terms in $\gamma^c$ integrate exactly as in the homogeneous case. Completing the square in $\mu_{k(\gamma)}$,

\begin{align*}
&= \int_{\Sigma}\int_{\mu} \prod_{k=1}^{G}(2\pi)^{-(n_k+1)p_\gamma/2}h_1^{-p_\gamma/2}w_k^{n_k}\,2^{-p_\gamma(\delta+p_\gamma-1)/2}\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\biggl(\frac{\delta+p_\gamma-j}{2}\biggr)\Biggr)^{-1}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2}\\
&\quad\times \exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\biggr)^{T}\bigl[(n_k+h_1^{-1})^{-1}\Sigma_{k(\gamma)}\bigr]^{-1}\biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\biggr)\Biggr\}\\
&\quad\times \prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1}\bigl[Q_{1(\gamma)}+S_{k(\gamma)}\bigr]\bigr)\Bigr\}\, d\mu\, d\Sigma\\
&\quad\times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}\\[4pt]
&= \pi^{-np/2}\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}w_k^{n_k}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}2^{-p_\gamma(\delta+n_k+p_\gamma-1)/2}\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\prod_{j=1}^{p_\gamma}\Gamma\biggl(\frac{\delta+p_\gamma-j}{2}\biggr)\Biggr)^{-1}\\
&\quad\times \int \prod_{k=1}^{G}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+n_k+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1}\bigl[Q_{1(\gamma)}+S_{k(\gamma)}\bigr]\bigr)\Bigr\}\, d\Sigma\\
&\quad\times (h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}.
\end{align*}

Finally, integrating out the $\Sigma_{k(\gamma)}$,

\begin{align*}
&= \pi^{-np/2}\prod_{k=1}^{G}\Biggl[(h_1 n_k+1)^{-p_\gamma/2}w_k^{n_k}\prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n_k+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}\Biggr]
\prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|Q_{1(\gamma)}+S_{k(\gamma)}\bigr|^{-(\delta+n_k+p_\gamma-1)/2}\\
&\quad\times (h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}.
\end{align*}
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
(1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
(2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
610 Journal of the American Statistical Association June 2005
Figure 3 Iris Data Sensitivity to h1 Results for h1 = 10ndashMarginalPosterior Probabilities for Samples Allocated to Clusters V and VIp(yi = k|X G = 6) i = 1 5 k = 1 6
h1 = h0 = 100 and Gmax = n and assumed unequal covari-ances across clusters Results from the previous example haveshown little sensitivity to the choice of prior on G Here weconsidered a truncated Poisson prior with λ = 5 As describedin Section 41 some care is needed when choosing κ1 and κ0their values need to be commensurate with the variability of thedata The amount of total variability in the different simulateddatasets is of course widely different and in each case thesehyperparameters were chosen proportionally to the upper andlower decile of the non-zero eigenvalues For the prior of γ we chose a Bernoulli distribution and set the expected num-ber of included variables to 10 We used a starting model withone randomly selected variable and ran the MCMC chains for100000 iterations with 40000 sweeps as burn-in
Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1,000. Figure 5(a) shows the trace plot for the number of visited components, G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(yi = k|X, G = 4). We note that these match the group structure used to simulate the data. As for the selection of predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
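Summaries such as Table 4 and the marginal inclusion probabilities are simple tallies over the stored post-burn-in sweeps. A minimal sketch with simulated draws standing in for actual sampler output (all variable names and the draw-generation scheme are mine):

```python
import numpy as np
from collections import Counter

# Hypothetical post-burn-in draws: G_draws[t] is the number of components
# visited at sweep t; gamma_draws[t] is the binary inclusion vector.
rng = np.random.default_rng(1)
T, p = 6000, 100
G_draws = rng.choice([4, 5], size=T, p=[0.52, 0.48])
gamma_draws = (rng.random((T, p)) < 0.1).astype(int)

# p(G = k | X): fraction of sweeps spent in models with k components.
p_G = {int(k): c / T for k, c in sorted(Counter(G_draws).items())}

# p(gamma_j = 1 | X, G = 4): inclusion frequency among sweeps with G = 4.
p_gamma = gamma_draws[G_draws == 4].mean(axis=0)

# Variables passing a .9 posterior-probability threshold.
selected = np.flatnonzero(p_gamma > 0.9)
```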
We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1,000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average-linkage dendrogram [Fig. 8(b)] delineates one cluster that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.
Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.
Tadesse Sha and Vannucci Bayesian Variable Selection 611
Figure 5. Trace Plots for Simulated Data With p = 1,000: (a) number of clusters G; (b) number of included variables pγ.
Table 4. Simulated Data With p = 1,000: Posterior Distribution of G

 k     p(G = k|X)
 2       .0111
 3       .1447
 4       .3437
 5       .3375
 6       .1326
 7       .0216
 8       .0040
 9       .0015
10       .0008
11       .0005
12       .0007
13       .0012
14       .0002
Figure 6. Simulated Data With p = 1,000: Marginal Posterior Probabilities of Sample Allocations, p(yi = k|X, G = 4), i = 1, …, 15; k = 1, …, 4.
7.3 Application to Microarray Data
We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.
Figure 7. Simulated Data With p = 1,000: Marginal Posterior Probabilities for Inclusion of Variables, p(γj = 1|X, G = 4).
Figure 8. Analysis of Simulated Data With p = 1,000, Using COSA: (a) single linkage; (b) average linkage; (c) complete linkage; (d) relevance for the delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
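The preprocessing just described (detection limits of 20 and 16,000, removal of probe sets with any unreliable reading, log transform, rescaling by range) can be sketched as follows; the function name, matrix layout, and simulated values are my own assumptions:

```python
import numpy as np

def preprocess(expr, lo=20.0, hi=16000.0):
    """expr: samples x probe-sets matrix of expression values. Keep probe
    sets whose readings all fall within the reliable detection range,
    log-transform, and rescale each remaining variable by its range."""
    reliable = np.all((expr >= lo) & (expr <= hi), axis=0)
    X = np.log(expr[:, reliable])
    X = X / (X.max(axis=0) - X.min(axis=0))   # rescale each variable by its range
    return X, reliable

rng = np.random.default_rng(2)
expr = rng.uniform(100.0, 10000.0, size=(14, 200))  # 14 tissues, 200 probe sets
expr[0, :5] = 5.0         # force five probe sets below the detection limit
X, kept = preprocess(expr)
```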
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1, and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.

Figure 9. MCMC Output for Endometrial Data: (a) number of visited clusters G; (b) number of included variables pγ.
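The concordance check amounts to correlating the per-gene inclusion frequencies between each pair of chains. A minimal sketch with simulated indicator draws (the array names and the data-generating scheme are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical inclusion indicators: draws[c][t, j] = 1 when variable j is
# in the model at sweep t of chain c; a shared "signal" vector mimics four
# chains exploring the same posterior.
rng = np.random.default_rng(3)
p, T = 762, 6000
signal = rng.random(p) * 0.3
draws = [(rng.random((T, p)) < signal).astype(int) for _ in range(4)]

# Per-chain inclusion frequencies (one row per chain), as compared in Figure 11.
freqs = np.array([d.mean(axis=0) for d in draws])

# Pairwise correlation coefficients between the four chains.
corr = np.corrcoef(freqs)
```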
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables, p(γj = 1|X, G = 3), for Each of the Four Chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single-linkage and average-linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8 DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to n(n − 1)/2 pairwise probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.

Figure 11. Endometrial Data: Concordance of Results Among the Four Chains. Pairwise Scatterplots of Frequencies for Inclusion of Genes.

Figure 12. Endometrial Data: Union of Four Chains. Marginal Posterior Probabilities p(γj = 1|X, G = 3).

Figure 13. Endometrial Data: Union of Four Chains. Marginal Posterior Probabilities of Sample Allocations, p(yi = k|X, G = 3), i = 1, …, 14; k = 1, …, 3.

Figure 14. Analysis of Endometrial Data Using COSA: (a) single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.
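A sketch of this pairwise co-clustering summary, with simulated allocation draws standing in for real MCMC output; the draw-generation scheme and the .5 grouping threshold (a crude stand-in for the hierarchical clustering of Medvedovic and Sivaganesan 2002) are my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
truth = np.repeat([0, 1, 2], 5)        # 15 observations in 3 true groups
T = 2000
# Hypothetical allocation draws: each sweep keeps an observation's true
# label with probability .9, otherwise assigns a random one.
y_draws = np.array([np.where(rng.random(15) < 0.9, truth,
                             rng.integers(0, 3, 15)) for _ in range(T)])

# Posterior probability that observations i and j are co-clustered,
# regardless of how many components were visited at each sweep.
co = (y_draws[:, :, None] == y_draws[:, None, :]).mean(axis=0)

# Group observations whose pairwise co-clustering probability exceeds .5.
labels = -np.ones(15, dtype=int)
for i in range(15):
    if labels[i] < 0:
        members = np.flatnonzero(co[i] > 0.5)
        labels[members] = labels.max() + 1
```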
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates, and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \eta, \Sigma, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G} w_k^{n_k} \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^{T} \Sigma_{(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\biggr\} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^{T} \bigl(h_1 \Sigma_{(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)\biggr\} \\
&\quad \times 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\} \\
&\quad \times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} (2\pi)^{-(n+1)(p - p_\gamma)/2} h_0^{-(p - p_\gamma)/2} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^{T} \Omega_{(\gamma^c)}^{-1} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)\biggr\} \times \exp\Bigl\{-\frac{1}{2} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)^{T} \bigl(h_0 \Omega_{(\gamma^c)}\bigr)^{-1} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)\Bigr\} \\
&\quad \times 2^{-(p - p_\gamma)(\delta + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + 2(p - p_\gamma))/2} \\
&\quad \times \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{align*}

Completing the squares in $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$, and writing $m_{k(\gamma)} = \bigl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\bigr)\big/\bigl(n_k + h_1^{-1}\bigr)$ and $m_{(\gamma^c)} = \bigl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\bigr)\big/\bigl(n + h_0^{-1}\bigr)$, this becomes

\begin{align*}
&= \int \prod_{k=1}^{G} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2} w_k^{n_k} \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1} S_{k(\gamma)}\bigr)\Bigr\} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - m_{k(\gamma)}\bigr)^{T} \bigl[(n_k + h_1^{-1})^{-1} \Sigma_{(\gamma)}\bigr]^{-1} \bigl(\mu_{k(\gamma)} - m_{k(\gamma)}\bigr)\biggr\} \\
&\quad \times 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\} \\
&\quad \times (2\pi)^{-(n+1)(p - p_\gamma)/2} h_0^{-(p - p_\gamma)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} S_{0(\gamma^c)}\bigr)\Bigr\} \\
&\quad \times \exp\Bigl\{-\frac{1}{2} \bigl(\eta_{(\gamma^c)} - m_{(\gamma^c)}\bigr)^{T} \bigl[(n + h_0^{-1})^{-1} \Omega_{(\gamma^c)}\bigr]^{-1} \bigl(\eta_{(\gamma^c)} - m_{(\gamma^c)}\bigr)\Bigr\} \\
&\quad \times 2^{-(p - p_\gamma)(\delta + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + 2(p - p_\gamma))/2} \\
&\quad \times \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{align*}

Integrating out $\mu$ and $\eta$ gives

\begin{align*}
&= \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] \pi^{-n p_\gamma/2}\, \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, 2^{-p_\gamma(\delta + n + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + n + 2p_\gamma)/2} \exp\Biggl\{-\frac{1}{2} \operatorname{tr}\Biggl(\Sigma_{(\gamma)}^{-1} \Biggl[Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr]\Biggr)\Biggr\}\, d\Sigma \\
&\quad \times (h_0 n + 1)^{-(p - p_\gamma)/2}\, \pi^{-n(p - p_\gamma)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\, 2^{-(p - p_\gamma)(\delta + n + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + n + 2(p - p_\gamma))/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} \bigl[Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\, d\Omega,
\end{align*}

and evaluating the two inverse-Wishart integrals yields

\begin{align*}
&= \pi^{-np/2} \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p_\gamma - j}{2}\bigr)} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \\
&\quad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta + n + p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{align*}

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

\begin{align*}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^{T} + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^{T} - (n_k + h_1^{-1})^{-1} \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr) \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)^{T} \\
&= \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr) \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{T} + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr) \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{T}
\end{align*}

and

\begin{align*}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^{T} + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^{T} - (n + h_0^{-1})^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr) \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)^{T} \\
&= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr) \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{T} + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr) \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{T},
\end{align*}

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
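The closed-form homogeneous-covariance marginal likelihood above can be evaluated on the log scale. The following sketch is my own implementation of the displayed formula (not code from the paper), using log-gamma functions and log-determinants for numerical stability:

```python
import math
import numpy as np

def log_marginal_homogeneous(X, y, w, gamma, mu0, Q1, Q0, delta, h1, h0):
    """Log of f(X, y | G, w, gamma) under equal covariances across clusters.
    X: n x p data; y: integer labels 0..G-1; gamma: boolean inclusion vector;
    Q1: p_gamma x p_gamma prior scale; Q0: (p - p_gamma) square prior scale."""
    n, p = X.shape
    pg = int(gamma.sum())
    Xg, Xc = X[:, gamma], X[:, ~gamma]
    m0g, m0c = mu0[gamma], mu0[~gamma]
    G = len(w)

    def S(Xs, m0, h):                       # S_k(gamma) and S_0(gamma^c)
        nk = Xs.shape[0]
        xbar = Xs.mean(axis=0)
        C = (Xs - xbar).T @ (Xs - xbar)     # within-group scatter
        d = (m0 - xbar)[:, None]
        return C + nk / (h * nk + 1.0) * (d @ d.T)

    logdet = lambda A: np.linalg.slogdet(A)[1]
    logf = -0.5 * n * p * math.log(math.pi)
    Ssum = np.zeros((pg, pg))
    for k in range(G):
        Xk = Xg[y == k]
        logf += Xk.shape[0] * math.log(w[k]) \
              - 0.5 * pg * math.log(h1 * Xk.shape[0] + 1.0)
        Ssum += S(Xk, m0g, h1)
    logf -= 0.5 * (p - pg) * math.log(h0 * n + 1.0)
    for j in range(1, pg + 1):              # ratio of Gamma functions, gamma part
        logf += math.lgamma((delta + n + pg - j) / 2.0) \
              - math.lgamma((delta + pg - j) / 2.0)
    for j in range(1, p - pg + 1):          # ratio of Gamma functions, gamma^c part
        logf += math.lgamma((delta + n + p - pg - j) / 2.0) \
              - math.lgamma((delta + p - pg - j) / 2.0)
    logf += 0.5 * (delta + pg - 1) * logdet(Q1) \
          + 0.5 * (delta + p - pg - 1) * logdet(Q0) \
          - 0.5 * (delta + n + pg - 1) * logdet(Q1 + Ssum) \
          - 0.5 * (delta + n + p - pg - 1) * logdet(Q0 + S(Xc, m0c, h0))
    return logf

# Illustrative call with small dimensions (all values are assumptions).
rng = np.random.default_rng(5)
X = rng.normal(size=(12, 6))
y = np.repeat([0, 1], 6)
gamma = np.array([True, True, True, False, False, False])
ll = log_marginal_homogeneous(X, y, [0.4, 0.6], gamma, np.zeros(6),
                              np.eye(3), np.eye(3), delta=3.0, h1=10.0, h0=10.0)
```

A useful sanity check is label-permutation invariance: relabeling the clusters while permuting the weights accordingly must leave the value unchanged.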
Heterogeneous Covariances

Integrating out the variables indexed by $\gamma^c$ exactly as in the homogeneous case,

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \eta, \Sigma, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G} w_k^{n_k} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2}\, 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^{T} \Sigma_{k(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\biggr\} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^{T} \bigl(h_1 \Sigma_{k(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)\biggr\} \\
&\quad \times \prod_{k=1}^{G} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\}\, d\mu\, d\Sigma \\
&\quad \times \pi^{-n(p - p_\gamma)/2} (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2}.
\end{align*}

Completing the square in $\mu_{k(\gamma)}$, with $m_{k(\gamma)} = \bigl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\bigr)\big/\bigl(n_k + h_1^{-1}\bigr)$, gives

\begin{align*}
&= \int_{\Sigma} \int_{\mu} \prod_{k=1}^{G} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2} w_k^{n_k}\, 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2} \\
&\quad \times \exp\biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - m_{k(\gamma)}\bigr)^{T} \bigl[(n_k + h_1^{-1})^{-1} \Sigma_{k(\gamma)}\bigr]^{-1} \bigl(\mu_{k(\gamma)} - m_{k(\gamma)}\bigr)\biggr\} \\
&\quad \times \prod_{k=1}^{G} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1} \bigl[Q_{1(\gamma)} + S_{k(\gamma)}\bigr]\bigr)\Bigr\}\, d\mu\, d\Sigma \\
&\quad \times \pi^{-n(p - p_\gamma)/2} (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2}.
\end{align*}

Integrating out $\mu$,

\begin{align*}
&= \pi^{-np/2} \prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\, \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, 2^{-p_\gamma(\delta + n_k + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \int \prod_{k=1}^{G} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + n_k + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1} \bigl[Q_{1(\gamma)} + S_{k(\gamma)}\bigr]\bigr)\Bigr\}\, d\Sigma \\
&\quad \times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{align*}

and evaluating the $G$ inverse-Wishart integrals yields

\begin{align*}
&= \pi^{-np/2} \prod_{k=1}^{G} \Biggl[(h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n_k + p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p_\gamma - j}{2}\bigr)} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta + n_k + p_\gamma - 1)/2}\Biggr] \\
&\quad \times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{align*}

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ defined as in the homogeneous case.
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
times exp
minus1
2tr
(minus1
(γ )
[Q1(γ ) +
Gsum
k=1
Sk(γ )
])d
616 Journal of the American Statistical Association June 2005
times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2
times ∣∣Q0(γ c)
∣∣(δ+pminuspγ minus1)2
times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4
times( pminuspγprod
j=1
(δ + p minus pγ minus j
2
))minus1
timesint
∣∣(γ c)
∣∣minus(δ+n+2(pminuspγ ))2
times exp
minus1
2tr(minus1
(γ c)
[Q0(γ ) + S0(γ c)
])d13
= πminusnp2
[ Gprod
k=1
(h1nk + 1)minuspγ 2wnkk
](h0n + 1)minus(pminuspγ )2
timespγprod
j=1
(
(δ + n + pγ minus j
2
)
(δ + pγ minus j
2
))
timespminuspγprod
j=1
(
(δ + n + p minus pγ minus j
2
)
(δ + p minus pγ minus j
2
))
times ∣∣Q1(γ )
∣∣(δ+pγ minus1)2∣∣Q0(γ c)
∣∣(δ+pminuspγ minus1)2
times∣∣∣∣∣Q1(γ ) +
Gsum
k=1
Sk(γ )
∣∣∣∣∣
minus(δ+n+pγ minus1)2
times ∣∣Q0(γ c) + S0(γ c)
∣∣minus(δ+n+pminuspγ minus1)2
where Sk(γ ) and S0(γ ) are given by
Sk(γ ) =sum
xi(γ )isinCk
xi(γ )xTi(γ ) + hminus1
1 micro0(γ )microT0(γ )
minus (nk + hminus11 )minus1
( sum
xi(γ )isinCk
xi(γ ) + hminus11 micro0(γ )
)
times( sum
xi(γ )isinCk
xi(γ ) + hminus11 micro0(γ )
)T
=sum
xi(γ )isinCk
(xi(γ ) minus xk(γ )
)(xi(γ ) minus xk(γ )
)T
+ nk
h1nk + 1
(micro0(γ ) minus xk(γ )
)(micro0(γ ) minus xk(γ )
)T
and
S0(γ ) =nsum
i=1
xi(γ c)xTi(γ c) + hminus1
0 micro0(γ c)microT0(γ c)
minus (n + hminus10 )minus1
( nsum
i=1
xi(γ c) + hminus10 micro0(γ c)
)
times( nsum
i=1
xi(γ c) + hminus10 micro0(γ c)
)T
=nsum
i=1
(xi(γ c) minus x(γ c)
)(xi(γ c) minus x(γ c)
)T
+ n
h0n + 1
(micro0(γ c) minus x(γ c)
)(micro0(γ c) minus x(γ c)
)T
with xk(γ ) = 1nk
sumxi(γ )isinCk
xi(γ ) and x(γ c) = 1n
sumni=1 xi(γ c)
Heterogeneous Covariances
f (Xy|Gwγ )
=int
f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )
times f (η|γ )f (|γ )dmicrod dη d
=int Gprod
k=1
wnk
k
∣∣k(γ )
∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h
minuspγ 21
times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4
( pγprod
j=1
(δ + pγ minus j
2
))minus1
times exp
minus1
2
Gsum
k=1
sum
xi(γ )isinCk
(xi(γ ) minus microk
)Tminus1
k(γ )
(xi(γ ) minus microk
)
times exp
minus1
2
Gsum
k=1
(microk(γ ) minus micro0(γ )
)T(h1k(γ )
)minus1
times (microk(γ ) minus micro0(γ )
)
timesGprod
k=1
∣∣Q1(γ )
∣∣(δ+pγ minus1)2∣∣k(γ )
∣∣minus(δ+2pγ )2
times exp
minus1
2tr(minus1
k(γ )Q1(γ )
)dmicrod
times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2
timespminuspγprod
j=1
(
(δ + n + p minus pγ minus j
2
)
(δ + p minus pγ minus j
2
))
times ∣∣Q0(γ c)
∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)
∣∣minus(δ+n+pminuspγ minus1)2
=int
int
micro
Gprod
k=1
(2π)minus(nk+1)pγ 2h
minuspγ 21 wnk
k 2minuspγ (δ+pγ minus1)2
times πminuspγ (pγ minus1)4
( pγprod
j=1
(δ + pγ minus j
2
))minus1
times ∣∣k(γ )
∣∣minus(nk+1)2
times exp
minus1
2
Gsum
k=1
(microk(γ ) minus
sumxi(γ )isinCk
xi(γ ) + hminus11 micro0(γ )
nk + hminus11
)T
times (nk + hminus11 )minus1
k(γ )
(microk(γ ) minus
sumxi(γ )isinCk
xi(γ ) + hminus11 micro0(γ )
nk + hminus11
)
timesGprod
k=1
∣∣Q1(γ )
∣∣(δ+pγ minus1)2∣∣k(γ )
∣∣minus(δ+2pγ )2
times exp
minus1
2tr(minus1
k(γ )
[Q1(γ ) + Sk(γ )
])dmicrod
times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2
Tadesse Sha and Vannucci Bayesian Variable Selection 617
timespminuspγprod
j=1
(
(δ + n + p minus pγ minus j
2
)
(δ + p minus pγ minus j
2
))
times ∣∣Q0(γ c)
∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)
∣∣minus(δ+n+pminuspγ minus1)2
= πminusnp2Gprod
k=1
(hnk + 1)minuspγ 2wnk
k
∣∣Q(γ )
∣∣(δ+pγ minus1)2
times 2minuspγ (δ+nk+pγ minus1)2πminuspγ (pγ minus1)4
times( pγprod
j=1
(δ + pγ minus j
2
))minus1
timesint
Gprod
k=1
∣∣k(γ )
∣∣minus(δ+nk+2pγ )2
times exp
minus1
2tr(minus1
k(γ )
[Q(γ ) + Sk(γ )
])d
times (h0n + 1)minus(pminuspγ )2
timespminuspγprod
j=1
(
(δ + n + p minus pγ minus j
2
)
(δ + p minus pγ minus j
2
))
times ∣∣Q0(γ c)
∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)
∣∣minus(δ+n+pminuspγ minus1)2
= πminusnp2Gprod
k=1
(hnk + 1)minuspγ 2wnk
k
timespγprod
j=1
(
(δ + nk + pγ minus 1
2
)(
(δ + pγ minus 1
2
)))
timesGprod
k=1
∣∣Q(γ )
∣∣(δ+pγ minus1)2∣∣Q(γ ) + Sk(γ )
∣∣minus(δ+nk+pγ minus1)2
times (h0n + 1)minus(pminuspγ )2
timespminuspγprod
j=1
(
(δ + n + p minus pγ minus j
2
)(
(δ + p minus pγ minus j
2
)))
times ∣∣Q0(γ c)
∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)
∣∣minus(δ+n+pminuspγ minus1)2
[Received May 2003 Revised August 2004]
REFERENCES
Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5
Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821
Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press
Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182
(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641
Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270
Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275
Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375
Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228
Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center
George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373
Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286
Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall
Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136
Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732
Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232
McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422
Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206
Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71
Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190
Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792
Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819
Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74
(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809
Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554
612 Journal of the American Statistical Association June 2005
Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.
Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1, and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.
To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as a burn-in.
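The prior setting and chain initializations above can be written down compactly. The sketch below is our own illustration (function and variable names are ours, not the authors' code); it sets the Bernoulli prior inclusion probability implied by an expectation of 10 included variables out of p = 762, and builds the four starting values of γ.

```python
import random

def init_gamma(p, n_included, rng):
    """Binary inclusion vector gamma with n_included randomly chosen 1's."""
    gamma = [0] * p
    for j in rng.sample(range(p), n_included):
        gamma[j] = 1
    return gamma

p = 762                 # probe sets retained after filtering
prior_incl = 10 / p     # Bernoulli prior: 10 variables expected a priori

rng = random.Random(0)
# Four widely different starting points, one per chain, as in the text
chains = [init_gamma(p, 1, rng),    # (1) a single randomly selected variable
          init_gamma(p, 10, rng),   # (2) 10 randomly selected variables
          init_gamma(p, 25, rng),   # (3) 25 randomly selected variables
          init_gamma(p, 50, rng)]   # (4) 50 randomly selected variables
```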
We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with a stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in
Tadesse Sha and Vannucci Bayesian Variable Selection 613
Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.
Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
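The concordance check described above can be sketched as follows: compute, for each chain, the frequency with which each variable was included, then correlate these frequency vectors across chains. This is our own minimal illustration with toy draws, not the authors' code.

```python
import numpy as np

def inclusion_frequencies(draws):
    """draws: (T, p) binary gamma draws from one chain.
    Returns the per-variable inclusion frequency (length-p vector)."""
    return np.asarray(draws, dtype=float).mean(axis=0)

# Toy post-burn-in draws for two chains over p = 4 variables
chain_a = [[1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 0, 0]]
chain_b = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 1, 0]]

fa = inclusion_frequencies(chain_a)
fb = inclusion_frequencies(chain_b)
r = np.corrcoef(fa, fb)[0, 1]   # concordance between the two chains
```

With four real chains one would compute the six pairwise coefficients and inspect the corresponding scatterplots, as in Figure 11.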
We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
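In its simplest (unweighted) form, the marginal inclusion probability of variable j is the ergodic average of its indicator over the pooled post-burn-in draws; the authors' equation (23) additionally weights draws by relative model posterior probabilities, which we omit here. A simplified sketch, with names of our own choosing:

```python
import numpy as np

def marginal_inclusion(draws):
    """draws: (T, p) binary array of pooled, relabeled gamma draws.
    Returns the p-vector of estimated marginal inclusion probabilities."""
    return np.asarray(draws, dtype=float).mean(axis=0)

# Toy example: 4 pooled draws over p = 3 variables
draws = [[1, 0, 1],
         [1, 0, 0],
         [1, 1, 1],
         [1, 0, 0]]
probs = marginal_inclusion(draws)
selected = np.where(probs > 0.5)[0]   # variables with probability > .5
```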
Figure 10. Endometrial Data. Marginal posterior probabilities for inclusion of variables, p(γj = 1|X, G = 3), for each of the four chains.
We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.
8. DISCUSSION
We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.
Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive
Figure 11. Endometrial Data. Concordance of results among the four chains: pairwise scatterplots of frequencies for inclusion of genes.
Figure 12. Endometrial Data. Union of four chains: marginal posterior probabilities p(γj = 1|X, G = 3).
approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open
Figure 13. Endometrial Data. Union of four chains: marginal posterior probabilities of sample allocations, p(yi = k|X, G = 3), i = 1, …, 14, k = 1, …, 3.
Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.

problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
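The pairwise coallocation idea can be made concrete: from the stored allocation vectors, estimate the probability that observations i and i′ fall in the same cluster, a quantity that is invariant to label switching. A small sketch of our own (not the authors' implementation):

```python
import numpy as np

def coclustering_matrix(allocations):
    """allocations: (T, n) array of cluster labels y across MCMC sweeps.
    Returns the n x n matrix of estimated posterior probabilities that
    two observations share a cluster; invariant to label switching."""
    A = np.asarray(allocations)
    T, n = A.shape
    P = np.zeros((n, n))
    for y in A:
        # indicator matrix: entry (i, j) is 1 when y_i == y_j in this sweep
        P += (y[:, None] == y[None, :])
    return P / T

labels = [[0, 0, 1, 1],
          [1, 1, 0, 0],   # same partition as above, labels switched
          [0, 0, 0, 1]]
P = coclustering_matrix(labels)
```

The matrix 1 − P can then be passed to any hierarchical clustering routine as a dissimilarity, in the spirit of Medvedovic and Sivaganesan (2002).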
An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates, and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumption of equal and unequal covariances across clusters.

Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
$$

Writing out the densities, this equals

$$
\begin{aligned}
\int \prod_{k=1}^{G} & w_k^{n_k}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2} \\
&\times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^T \Sigma_{(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\Biggr\} \\
&\times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^T \bigl(h_1 \Sigma_{(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)\Biggr\} \\
&\times 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\} \\
&\times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} (2\pi)^{-(n+1)(p - p_\gamma)/2} h_0^{-(p - p_\gamma)/2} \\
&\times \exp\Biggl\{-\frac{1}{2} \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^T \Omega_{(\gamma^c)}^{-1} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)\Biggr\} \\
&\times \exp\Bigl\{-\frac{1}{2} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)^T \bigl(h_0 \Omega_{(\gamma^c)}\bigr)^{-1} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)\Bigr\} \\
&\times 2^{-(p - p_\gamma)(\delta + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + 2(p - p_\gamma))/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} Q_{0(\gamma^c)}\bigr)\Bigr\}\; d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
$$

Completing the square in each $\mu_{k(\gamma)}$ around its posterior mean $\bigl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\bigr)\big/\bigl(n_k + h_1^{-1}\bigr)$, with covariance matrix $(n_k + h_1^{-1})^{-1}\Sigma_{(\gamma)}$, and in $\eta_{(\gamma^c)}$ around $\bigl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\bigr)\big/\bigl(n + h_0^{-1}\bigr)$, with covariance matrix $(n + h_0^{-1})^{-1}\Omega_{(\gamma^c)}$, and then integrating out $\mu$ and $\eta$, gives

$$
\begin{aligned}
&\Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] \pi^{-n p_\gamma/2}\, \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, 2^{-p_\gamma(\delta + n + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad\times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + n + 2p_\gamma)/2} \exp\Biggl\{-\frac{1}{2} \operatorname{tr}\Biggl(\Sigma_{(\gamma)}^{-1} \Bigl[Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Bigr]\Biggr)\Biggr\}\, d\Sigma \\
&\quad\times (h_0 n + 1)^{-(p - p_\gamma)/2}\, \pi^{-n(p - p_\gamma)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\, 2^{-(p - p_\gamma)(\delta + n + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad\times \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + n + 2(p - p_\gamma))/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} \bigl[Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\, d\Omega.
\end{aligned}
$$

The two remaining integrals are inverse-Wishart normalizing constants, so that finally

$$
\begin{aligned}
f(X, y \mid G, w, \gamma) ={}& \pi^{-np/2} \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
&\times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p_\gamma - j}{2}\bigr)} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \\
&\times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta + n + p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

$$
\begin{aligned}
S_{k(\gamma)} ={}& \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^T \\
&- (n_k + h_1^{-1})^{-1} \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr) \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)^T \\
={}& \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
\end{aligned}
$$

and

$$
\begin{aligned}
S_{0(\gamma^c)} ={}& \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^T \\
&- (n + h_0^{-1})^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr) \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)^T \\
={}& \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T,
\end{aligned}
$$

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
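The two expressions given for $S_{k(\gamma)}$ (raw outer products minus the completed-square correction versus centered sum of squares plus a shrinkage term toward $\mu_0$) are algebraically identical. A quick numerical check of the identity, with made-up dimensions of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
nk, pg, h1 = 5, 3, 100.0
X = rng.normal(size=(nk, pg))     # observations in cluster k (selected variables)
mu0 = rng.normal(size=pg)         # prior mean
xbar = X.mean(axis=0)

# First form: raw outer products minus the completed-square correction
s = X.sum(axis=0) + mu0 / h1
S1 = X.T @ X + np.outer(mu0, mu0) / h1 - np.outer(s, s) / (nk + 1.0 / h1)

# Second form: centered sum of squares plus a shrinkage term toward mu0
Xc = X - xbar
S2 = Xc.T @ Xc + nk / (h1 * nk + 1.0) * np.outer(mu0 - xbar, mu0 - xbar)
```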
Heterogeneous Covariances
$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
$$

The integrand is as in the homogeneous case, except that each cluster now has its own covariance matrix $\Sigma_{k(\gamma)}$, with independent $\mathcal{IW}(\delta; Q_1)$ priors; the factor for the nonselected variables does not involve the cluster structure and marginalizes exactly as before. This gives

$$
\begin{aligned}
={}& \int \prod_{k=1}^{G} \Biggl[ w_k^{n_k}\, \bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2}\, 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\qquad\times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\} \Biggr] \\
&\times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^T \Sigma_{k(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\Biggr\} \\
&\times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^T \bigl(h_1 \Sigma_{k(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)\Biggr\}\, d\mu\, d\Sigma \\
&\times \pi^{-n(p - p_\gamma)/2} (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2}.
\end{aligned}
$$

Completing the squares in the $\mu_{k(\gamma)}$'s, exactly as in the homogeneous case but with $\Sigma_{k(\gamma)}$ in place of $\Sigma_{(\gamma)}$, and integrating out $\mu$ yields

$$
\begin{aligned}
={}& \pi^{-np/2} \prod_{k=1}^{G} \Biggl[ (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\, \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, 2^{-p_\gamma(\delta + n_k + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \Biggr] \\
&\times \int \prod_{k=1}^{G} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + n_k + 2p_\gamma)/2} \exp\Bigl\{-\frac{1}{2} \operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1} \bigl[Q_{1(\gamma)} + S_{k(\gamma)}\bigr]\bigr)\Bigr\}\, d\Sigma \\
&\times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

and the $G$ remaining integrals are again inverse-Wishart normalizing constants, so that finally

$$
\begin{aligned}
f(X, y \mid G, w, \gamma) ={}& \pi^{-np/2} \prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \prod_{k=1}^{G} \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n_k + p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p_\gamma - j}{2}\bigr)} \\
&\times \prod_{k=1}^{G} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta + n_k + p_\gamma - 1)/2} \\
&\times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \\
&\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined in the homogeneous case.
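The closed form above can be evaluated directly, up to a choice of hyperparameters. The sketch below is our own illustration (argument names, the default $Q_1 = q_1 I$, and the default $\mu_0 = \bar{x}$ are ours, not the authors' settings); it implements the heterogeneous-covariance marginal in the special case where γ selects all p variables, so the $(\gamma^c)$ factor drops out.

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal_hetero(X, y, w, h1=100.0, delta=3.0, q1=0.1, mu0=None):
    """Log of f(X, y | G, w) under cluster-specific covariances, in the
    special case p_gamma = p (all variables selected).  Sketch only."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if mu0 is None:
        mu0 = X.mean(axis=0)          # illustrative default prior mean
    Q1 = q1 * np.eye(p)               # illustrative default scale matrix
    out = -0.5 * n * p * log(pi)
    for k, wk in enumerate(w):
        Xk = X[np.asarray(y) == k]
        nk = len(Xk)
        xbar = Xk.mean(axis=0)
        Xc = Xk - xbar
        # S_k via the centered form: sum of squares + shrinkage toward mu0
        Sk = Xc.T @ Xc + nk / (h1 * nk + 1.0) * np.outer(mu0 - xbar, mu0 - xbar)
        out += nk * log(wk) - 0.5 * p * log(h1 * nk + 1.0)
        out += sum(lgamma((delta + nk + p - j) / 2.0)
                   - lgamma((delta + p - j) / 2.0) for j in range(1, p + 1))
        out += 0.5 * (delta + p - 1) * np.linalg.slogdet(Q1)[1]
        out -= 0.5 * (delta + nk + p - 1) * np.linalg.slogdet(Q1 + Sk)[1]
    return out

# Two tight, well-separated clusters of three points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
w = [0.5, 0.5]
good = log_marginal_hetero(X, [0, 0, 0, 1, 1, 1], w)  # correct allocation
bad = log_marginal_hetero(X, [0, 0, 1, 1, 1, 0], w)   # scrambled allocation
```

As a sanity check, the well-separated allocation scores much higher than the scrambled one, since scrambling inflates the $|Q_1 + S_k|$ determinants.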
[Received May 2003. Revised August 2004.]
REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
Tadesse Sha and Vannucci Bayesian Variable Selection 613
(a) (b)
Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ
Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed
We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues
Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains
We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both
8 DISCUSSION
We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors
Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive
Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes
614 Journal of the American Statistical Association June 2005
Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)
approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open
Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3
problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-
(a) (b)
(c) (d)
Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2
Tadesse Sha and Vannucci Bayesian Variable Selection 615
servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to
(n2
)proba-
bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure
An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters
Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model and the prior on γ modified to accommodate the additional terms.
APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.
Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={} & \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma) \\
& \times f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
={} & \int \prod_{k=1}^{G} w_k^{n_k} \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2} \\
& \times \exp\biggl\{ -\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^{T} \Sigma_{(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr) \biggr\} \\
& \times \exp\biggl\{ -\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^{T} \bigl(h_1 \Sigma_{(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr) \biggr\} \\
& \times 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl( \prod_{j=1}^{p_\gamma} \Gamma\Bigl( \frac{\delta + p_\gamma - j}{2} \Bigr) \Biggr)^{-1} \\
& \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{ -\tfrac{1}{2} \operatorname{tr}\bigl( \Sigma_{(\gamma)}^{-1} Q_{1(\gamma)} \bigr) \Bigr\} \\
& \times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} (2\pi)^{-(n+1)(p - p_\gamma)/2}\, h_0^{-(p - p_\gamma)/2} \\
& \times \exp\biggl\{ -\frac{1}{2} \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^{T} \Omega_{(\gamma^c)}^{-1} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr) \biggr\} \\
& \times \exp\Bigl\{ -\tfrac{1}{2} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)^{T} \bigl(h_0 \Omega_{(\gamma^c)}\bigr)^{-1} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr) \Bigr\} \\
& \times 2^{-(p - p_\gamma)(\delta + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl( \prod_{j=1}^{p - p_\gamma} \Gamma\Bigl( \frac{\delta + p - p_\gamma - j}{2} \Bigr) \Biggr)^{-1} \\
& \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + 2(p - p_\gamma))/2} \exp\Bigl\{ -\tfrac{1}{2} \operatorname{tr}\bigl( \Omega_{(\gamma^c)}^{-1} Q_{0(\gamma^c)} \bigr) \Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
$$

Completing the squares in $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$ produces Normal kernels with means $\bigl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\bigr)/(n_k + h_1^{-1})$ and $\bigl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\bigr)/(n + h_0^{-1})$ and covariances $(n_k + h_1^{-1})^{-1}\Sigma_{(\gamma)}$ and $(n + h_0^{-1})^{-1}\Omega_{(\gamma^c)}$, at the cost of the residual terms $\exp\{-\frac{1}{2}\operatorname{tr}(\Sigma_{(\gamma)}^{-1} S_{k(\gamma)})\}$ and $\exp\{-\frac{1}{2}\operatorname{tr}(\Omega_{(\gamma^c)}^{-1} S_{0(\gamma^c)})\}$. Integrating out $\mu$ and $\eta$, and then $\Sigma$ and $\Omega$, gives

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={} & \Biggl[ \prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \Biggr] \pi^{-n p_\gamma/2} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, 2^{-p_\gamma(\delta + n + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl( \prod_{j=1}^{p_\gamma} \Gamma\Bigl( \frac{\delta + p_\gamma - j}{2} \Bigr) \Biggr)^{-1} \\
& \times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + n + 2p_\gamma)/2} \exp\Biggl\{ -\frac{1}{2} \operatorname{tr}\Biggl( \Sigma_{(\gamma)}^{-1} \Biggl[ Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)} \Biggr] \Biggr) \Biggr\}\, d\Sigma \\
& \times (h_0 n + 1)^{-(p - p_\gamma)/2}\, \pi^{-n(p - p_\gamma)/2} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\, 2^{-(p - p_\gamma)(\delta + n + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \\
& \times \Biggl( \prod_{j=1}^{p - p_\gamma} \Gamma\Bigl( \frac{\delta + p - p_\gamma - j}{2} \Bigr) \Biggr)^{-1} \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + n + 2(p - p_\gamma))/2} \exp\Bigl\{ -\tfrac{1}{2} \operatorname{tr}\bigl( \Omega_{(\gamma^c)}^{-1} \bigl[ Q_{0(\gamma^c)} + S_{0(\gamma^c)} \bigr] \bigr) \Bigr\}\, d\Omega \\
={} & \pi^{-np/2} \Biggl[ \prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \Biggr] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
& \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl( (\delta + n + p_\gamma - j)/2 \bigr)}{\Gamma\bigl( (\delta + p_\gamma - j)/2 \bigr)} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl( (\delta + n + p - p_\gamma - j)/2 \bigr)}{\Gamma\bigl( (\delta + p - p_\gamma - j)/2 \bigr)} \\
& \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \Biggl| Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)} \Biggr|^{-(\delta + n + p_\gamma - 1)/2} \bigl| Q_{0(\gamma^c)} + S_{0(\gamma^c)} \bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

$$
\begin{aligned}
S_{k(\gamma)} ={} & \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^{T} + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^{T} \\
& - \bigl(n_k + h_1^{-1}\bigr)^{-1} \Biggl( \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)} \Biggr) \Biggl( \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)} \Biggr)^{T} \\
={} & \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr) \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{T} + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr) \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{T}
\end{aligned}
$$

and

$$
\begin{aligned}
S_{0(\gamma^c)} ={} & \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^{T} + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^{T} \\
& - \bigl(n + h_0^{-1}\bigr)^{-1} \Biggl( \sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)} \Biggr) \Biggl( \sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)} \Biggr)^{T} \\
={} & \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr) \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{T} + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr) \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{T},
\end{aligned}
$$

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
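The equality of the two algebraic forms of $S_{k(\gamma)}$ above (the cross-product form and the centered form obtained by completing the square) can be verified numerically. This is a minimal sketch with made-up dimensions and hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n_k, p_gamma, h1 = 7, 3, 2.5
X = rng.normal(size=(n_k, p_gamma))   # cluster C_k, selected variables only
mu0 = rng.normal(size=p_gamma)        # prior mean mu_0(gamma)

# Cross-product form:
# S_k = sum x x' + h1^{-1} mu0 mu0' - (n_k + h1^{-1})^{-1} s s',
# with s = sum x + h1^{-1} mu0.
s = X.sum(axis=0) + mu0 / h1
S_cross = (X.T @ X + np.outer(mu0, mu0) / h1
           - np.outer(s, s) / (n_k + 1.0 / h1))

# Centered form:
# S_k = sum (x - xbar)(x - xbar)' + n_k/(h1 n_k + 1) (mu0 - xbar)(mu0 - xbar)'.
xbar = X.mean(axis=0)
Xc = X - xbar
S_centered = (Xc.T @ Xc
              + (n_k / (h1 * n_k + 1.0)) * np.outer(mu0 - xbar, mu0 - xbar))

assert np.allclose(S_cross, S_centered)
```

The centered form is the one to use in practice: it avoids the differencing of large cross-product matrices and makes explicit the shrinkage of the prior mean toward the cluster mean.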
Heterogeneous Covariances

Here the terms involving the nonselected variables $(\gamma^c)$ are marginalized exactly as before, and the cluster-specific covariances $\Sigma_{k(\gamma)}$ are integrated out one cluster at a time:

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={} & \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma) \\
& \times f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
={} & \int \prod_{k=1}^{G} w_k^{n_k} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\, 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl( \prod_{j=1}^{p_\gamma} \Gamma\Bigl( \frac{\delta + p_\gamma - j}{2} \Bigr) \Biggr)^{-1} \\
& \times \exp\biggl\{ -\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^{T} \Sigma_{k(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr) \biggr\} \\
& \times \exp\biggl\{ -\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^{T} \bigl(h_1 \Sigma_{k(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr) \biggr\} \\
& \times \prod_{k=1}^{G} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{ -\tfrac{1}{2} \operatorname{tr}\bigl( \Sigma_{k(\gamma)}^{-1} Q_{1(\gamma)} \bigr) \Bigr\}\, d\mu\, d\Sigma \\
& \times \pi^{-n(p - p_\gamma)/2} (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl( (\delta + n + p - p_\gamma - j)/2 \bigr)}{\Gamma\bigl( (\delta + p - p_\gamma - j)/2 \bigr)} \\
& \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl| Q_{0(\gamma^c)} + S_{0(\gamma^c)} \bigr|^{-(\delta + n + p - p_\gamma - 1)/2}.
\end{aligned}
$$

Completing the square in each $\mu_{k(\gamma)}$, with posterior mean $\bigl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\bigr)/(n_k + h_1^{-1})$ and covariance $(n_k + h_1^{-1})^{-1}\Sigma_{k(\gamma)}$, and integrating it out leaves, for each cluster, the factor $(h_1 n_k + 1)^{-p_\gamma/2}$ and the trace term $\exp\{-\frac{1}{2}\operatorname{tr}(\Sigma_{k(\gamma)}^{-1}[Q_{1(\gamma)} + S_{k(\gamma)}])\}$, so that

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={} & \pi^{-np/2} \prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, 2^{-p_\gamma(\delta + n_k + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl( \prod_{j=1}^{p_\gamma} \Gamma\Bigl( \frac{\delta + p_\gamma - j}{2} \Bigr) \Biggr)^{-1} \\
& \times \int \prod_{k=1}^{G} \bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + n_k + 2p_\gamma)/2} \exp\Bigl\{ -\tfrac{1}{2} \operatorname{tr}\bigl( \Sigma_{k(\gamma)}^{-1} \bigl[ Q_{1(\gamma)} + S_{k(\gamma)} \bigr] \bigr) \Bigr\}\, d\Sigma \\
& \times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl( (\delta + n + p - p_\gamma - j)/2 \bigr)}{\Gamma\bigl( (\delta + p - p_\gamma - j)/2 \bigr)} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl| Q_{0(\gamma^c)} + S_{0(\gamma^c)} \bigr|^{-(\delta + n + p - p_\gamma - 1)/2} \\
={} & \pi^{-np/2} \prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k} \prod_{k=1}^{G} \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl( (\delta + n_k + p_\gamma - j)/2 \bigr)}{\Gamma\bigl( (\delta + p_\gamma - j)/2 \bigr)} \\
& \times \prod_{k=1}^{G} \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl| Q_{1(\gamma)} + S_{k(\gamma)} \bigr|^{-(\delta + n_k + p_\gamma - 1)/2} \\
& \times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl( (\delta + n + p - p_\gamma - j)/2 \bigr)}{\Gamma\bigl( (\delta + p - p_\gamma - j)/2 \bigr)} \\
& \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl| Q_{0(\gamma^c)} + S_{0(\gamma^c)} \bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined in the homogeneous case.
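To make the final closed-form expression concrete, the sketch below evaluates its logarithm with `scipy.special.gammaln` and `numpy.linalg.slogdet`. This is our own illustrative implementation, not the authors' code; the diagonal choices $Q_{1(\gamma)} = q_1 I$, $Q_{0(\gamma^c)} = q_0 I$, the prior mean $\mu_0 = 0$, and all hyperparameter values are assumptions made for the example. A useful sanity check is that the value is invariant under relabeling the clusters when the weights are permuted accordingly:

```python
import numpy as np
from scipy.special import gammaln

def _S(Xs, mu0, h):
    """Centered form of the S matrix: sum of squares about the group mean
    plus a shrinkage term toward the prior mean mu0."""
    m = Xs.shape[0]
    xbar = Xs.mean(axis=0)
    Xc = Xs - xbar
    return Xc.T @ Xc + (m / (h * m + 1.0)) * np.outer(mu0 - xbar, mu0 - xbar)

def log_marginal_hetero(X, y, w, gamma, delta=3.0, h0=10.0, h1=10.0,
                        q1=0.1, q0=0.1):
    """log f(X, y | G, w, gamma), heterogeneous-covariance case,
    with Q1 = q1*I, Q0 = q0*I, and mu0 = 0 (our assumptions)."""
    n, p = X.shape
    gamma = np.asarray(gamma, dtype=bool)
    pg = int(gamma.sum())
    pc = p - pg
    Xg, Xn = X[:, gamma], X[:, ~gamma]
    Q1, Q0 = q1 * np.eye(pg), q0 * np.eye(pc)
    logf = -0.5 * n * p * np.log(np.pi)
    # Selected variables: one Gaussian/inverse-Wishart block per cluster.
    for k in np.unique(y):
        Xk = Xg[y == k]
        nk = Xk.shape[0]
        Sk = _S(Xk, np.zeros(pg), h1)
        j = np.arange(1, pg + 1)
        logf += nk * np.log(w[k]) - 0.5 * pg * np.log(h1 * nk + 1.0)
        logf += np.sum(gammaln((delta + nk + pg - j) / 2.0)
                       - gammaln((delta + pg - j) / 2.0))
        logf += 0.5 * (delta + pg - 1) * np.linalg.slogdet(Q1)[1]
        logf -= 0.5 * (delta + nk + pg - 1) * np.linalg.slogdet(Q1 + Sk)[1]
    # Nonselected variables: a single component over all n observations.
    S0 = _S(Xn, np.zeros(pc), h0)
    j = np.arange(1, pc + 1)
    logf += -0.5 * pc * np.log(h0 * n + 1.0)
    logf += np.sum(gammaln((delta + n + pc - j) / 2.0)
                   - gammaln((delta + pc - j) / 2.0))
    logf += 0.5 * (delta + pc - 1) * np.linalg.slogdet(Q0)[1]
    logf -= 0.5 * (delta + n + pc - 1) * np.linalg.slogdet(Q0 + S0)[1]
    return logf

# Sanity check: swapping cluster labels (and the weights) changes nothing.
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 5))
X[:6, :2] += 3.0                      # separation on the two selected variables
y = np.array([0] * 6 + [1] * 6)
gamma = np.array([1, 1, 0, 0, 0], dtype=bool)
l1 = log_marginal_hetero(X, y, np.array([0.3, 0.7]), gamma)
l2 = log_marginal_hetero(X, 1 - y, np.array([0.7, 0.3]), gamma)
assert np.isfinite(l1) and np.isclose(l1, l2)
```

Working on the log scale is essential here: the products of gamma-function ratios and determinant powers over- or underflow quickly as $n$ and $p_\gamma$ grow.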
[Received May 2003. Revised August 2004.]
REFERENCES
Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components: An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.
Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732
Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232
McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422
Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206
Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71
Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190
Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792
Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819
Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74
(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809
Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554
616 Journal of the American Statistical Association June 2005
$$
\begin{aligned}
&\times (h_0 n+1)^{-(p-p_\gamma)/2}\,\pi^{-n(p-p_\gamma)/2}\,\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\\
&\times 2^{-(p-p_\gamma)(\delta+n+p-p_\gamma-1)/2}\,\pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}
\Biggl(\,\prod_{j=1}^{p-p_\gamma}\Gamma\Bigl(\frac{\delta+p-p_\gamma-j}{2}\Bigr)\Biggr)^{-1}\\
&\times\int\bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+n+2(p-p_\gamma))/2}
\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1}\bigl[Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\,d\Omega\\[4pt]
&=\pi^{-np/2}\Biggl[\,\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}\,w_k^{n_k}\Biggr](h_0 n+1)^{-(p-p_\gamma)/2}\\
&\quad\times\prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}
\;\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\\
&\quad\times\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\,\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}
\,\Biggl|Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Biggr|^{-(\delta+n+p_\gamma-1)/2}\\
&\quad\times\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
$$
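Each of the $\Sigma$- and $\Omega$-integrations above reduces to the normalizing constant of an inverse-Wishart density in the parameterization $\Sigma\sim\mathrm{IW}(\delta,Q)$, which produces the products of univariate gamma functions in the closed form. The identity can be checked numerically; the following is a sketch (the function names are illustrative, not from the paper), comparing SciPy's multivariate gamma against the expanded product of univariate gammas used in the appendix.

```python
import numpy as np
from scipy.special import gammaln, multigammaln

def log_iw_integral(delta, n, p, A):
    """log of  int |Sigma|^{-(delta+n+2p)/2} exp{-tr(Sigma^{-1} A)/2} dSigma,
    written via the multivariate gamma function Gamma_p."""
    a = (delta + n + p - 1) / 2.0          # shape parameter of the resulting IW
    _, logdet_A = np.linalg.slogdet(A)
    return p * a * np.log(2.0) + multigammaln(a, p) - a * logdet_A

def log_iw_integral_expanded(delta, n, p, A):
    """The same constant written with the product of univariate gamma
    functions, as it appears in the closed-form expressions above."""
    _, logdet_A = np.linalg.slogdet(A)
    lg = sum(gammaln((delta + n + p - j) / 2.0) for j in range(1, p + 1))
    return (p * (delta + n + p - 1) / 2.0) * np.log(2.0) \
        + (p * (p - 1) / 4.0) * np.log(np.pi) + lg \
        - ((delta + n + p - 1) / 2.0) * logdet_A

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)                # a positive-definite scale matrix
val1 = log_iw_integral(3.0, 10, 4, A)
val2 = log_iw_integral_expanded(3.0, 10, 4, A)
```

The agreement follows from $\Gamma_p(a)=\pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\bigl(a+(1-j)/2\bigr)$ with $a=(\delta+n+p-1)/2$.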
where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by
$$
\begin{aligned}
S_{k(\gamma)}&=\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}x_{i(\gamma)}^T+h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^T\\
&\quad-(n_k+h_1^{-1})^{-1}\Biggl(\,\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Biggr)\Biggl(\,\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Biggr)^T\\
&=\sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^T
+\frac{n_k}{h_1 n_k+1}\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^T
\end{aligned}
$$
and
$$
\begin{aligned}
S_{0(\gamma^c)}&=\sum_{i=1}^{n}x_{i(\gamma^c)}x_{i(\gamma^c)}^T+h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^T\\
&\quad-(n+h_0^{-1})^{-1}\Biggl(\,\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Biggr)\Biggl(\,\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Biggr)^T\\
&=\sum_{i=1}^{n}\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^T
+\frac{n}{h_0 n+1}\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^T,
\end{aligned}
$$
with $\bar{x}_{k(\gamma)}=\frac{1}{n_k}\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)}=\frac{1}{n}\sum_{i=1}^{n}x_{i(\gamma^c)}$.
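The equality of the two forms of $S_{k(\gamma)}$ (the cross-product form and the centered within-cluster-scatter form) can be verified numerically. A minimal sketch, with arbitrary illustrative dimensions and values:

```python
import numpy as np

# Check that the cross-product and centered forms of S_k(gamma) agree.
rng = np.random.default_rng(0)
n_k, p_gamma, h1 = 8, 3, 0.5
X = rng.normal(size=(n_k, p_gamma))        # rows are the x_i(gamma) in C_k
mu0 = rng.normal(size=p_gamma)             # prior mean mu_0(gamma)

# cross-product form
s = X.sum(axis=0) + mu0 / h1
S_cross = X.T @ X + np.outer(mu0, mu0) / h1 - np.outer(s, s) / (n_k + 1.0 / h1)

# centered form: within-cluster scatter plus a shrinkage term toward mu_0
xbar = X.mean(axis=0)
Xc = X - xbar
S_centered = Xc.T @ Xc + (n_k / (h1 * n_k + 1.0)) * np.outer(mu0 - xbar, mu0 - xbar)
```

The centered form is what one would compute in practice, since it avoids forming large cross-product matrices of raw data.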
Heterogeneous Covariances
$$
\begin{aligned}
f(X&\mid G,w,\gamma)\\
&=\int f(X\mid G,w,\mu,\Sigma,\eta,\Omega,\gamma)\,f(\mu\mid\Sigma,G,\gamma)\,f(\Sigma\mid G,\gamma)\,f(\eta\mid\Omega,\gamma)\,f(\Omega\mid\gamma)\,d\mu\,d\Sigma\,d\eta\,d\Omega\\
&=\int\prod_{k=1}^{G}w_k^{n_k}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}h_1^{-p_\gamma/2}\\
&\qquad\times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\,\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1}\\
&\qquad\times\exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)^T\Sigma_{k(\gamma)}^{-1}\bigl(x_{i(\gamma)}-\mu_{k(\gamma)}\bigr)\Biggr\}\\
&\qquad\times\exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)^T\bigl(h_1\Sigma_{k(\gamma)}\bigr)^{-1}\bigl(\mu_{k(\gamma)}-\mu_{0(\gamma)}\bigr)\Biggr\}\\
&\qquad\times\prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1}Q_{1(\gamma)}\bigr)\Bigr\}\,d\mu\,d\Sigma\\
&\qquad\times\pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\\
&\qquad\times\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}\\[4pt]
&=\int_{\Sigma}\int_{\mu}\prod_{k=1}^{G}(2\pi)^{-(n_k+1)p_\gamma/2}h_1^{-p_\gamma/2}w_k^{n_k}2^{-p_\gamma(\delta+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\,\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(n_k+1)/2}\\
&\qquad\times\exp\Biggl\{-\frac{1}{2}\sum_{k=1}^{G}\Biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\Biggr)^T\bigl(n_k+h_1^{-1}\bigr)\Sigma_{k(\gamma)}^{-1}\Biggl(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\Biggr)\Biggr\}\\
&\qquad\times\prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1}\bigl[Q_{1(\gamma)}+S_{k(\gamma)}\bigr]\bigr)\Bigr\}\,d\mu\,d\Sigma\\
&\qquad\times\pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}
\end{aligned}
$$
$$
\begin{aligned}
&\qquad\times\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}
\times\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}\\[4pt]
&=\pi^{-np/2}\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}w_k^{n_k}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}
\times 2^{-p_\gamma(\delta+n_k+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\Biggl(\,\prod_{j=1}^{p_\gamma}\Gamma\Bigl(\frac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1}\\
&\qquad\times\int\prod_{k=1}^{G}\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta+n_k+2p_\gamma)/2}\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Sigma_{k(\gamma)}^{-1}\bigl[Q_{1(\gamma)}+S_{k(\gamma)}\bigr]\bigr)\Bigr\}\,d\Sigma\\
&\qquad\times(h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}
\times\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}\\[4pt]
&=\pi^{-np/2}\prod_{k=1}^{G}\Biggl[(h_1 n_k+1)^{-p_\gamma/2}w_k^{n_k}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n_k+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}\Biggr]\\
&\qquad\times\prod_{k=1}^{G}\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}\bigl|Q_{1(\gamma)}+S_{k(\gamma)}\bigr|^{-(\delta+n_k+p_\gamma-1)/2}\\
&\qquad\times(h_0 n+1)^{-(p-p_\gamma)/2}\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\\
&\qquad\times\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}\bigl|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2}.
\end{aligned}
$$
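In practice this closed form is evaluated on the log scale, with log-gamma and log-determinant routines, to avoid numerical overflow for large $n_k$ or $p_\gamma$. The sketch below computes the cluster-specific (selected-variable) factor only; the function name and argument layout are ours, not from the paper, and the complementary $\gamma^c$ factor would be added analogously.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_hetero(clusters, w, mu0, h1, delta, Q1):
    """Log of the gamma-indexed factor of f(X | G, w, gamma) under
    cluster-specific covariances, using the closed form derived above.
    `clusters` is a list of (n_k x p_gamma) arrays, one per cluster C_k."""
    p = Q1.shape[0]                                  # p_gamma
    _, logdet_Q1 = np.linalg.slogdet(Q1)
    total = 0.0
    for Xk, wk in zip(clusters, w):
        nk = Xk.shape[0]
        xbar = Xk.mean(axis=0)
        Xc = Xk - xbar
        # S_k(gamma) in its centered form
        Sk = Xc.T @ Xc + (nk / (h1 * nk + 1.0)) * np.outer(mu0 - xbar, mu0 - xbar)
        _, logdet_QS = np.linalg.slogdet(Q1 + Sk)
        total += -0.5 * nk * p * np.log(np.pi)       # cluster share of pi^{-np/2}
        total += -0.5 * p * np.log(h1 * nk + 1.0) + nk * np.log(wk)
        total += sum(gammaln((delta + nk + p - j) / 2.0)
                     - gammaln((delta + p - j) / 2.0) for j in range(1, p + 1))
        total += 0.5 * (delta + p - 1) * logdet_Q1 \
            - 0.5 * (delta + nk + p - 1) * logdet_QS
    return total

rng = np.random.default_rng(1)
clusters = [rng.normal(size=(5, 3)), rng.normal(loc=2.0, size=(7, 3))]
logml = log_marginal_hetero(clusters, [0.3, 0.7], np.zeros(3), 1.0, 3.0, np.eye(3))
logml_flat = log_marginal_hetero(clusters, [1.0, 1.0], np.zeros(3), 1.0, 3.0, np.eye(3))
```

Since the weights enter only through $\prod_k w_k^{n_k}$, the two calls above differ exactly by $\sum_k n_k\log w_k$, which gives a quick internal consistency check.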
[Received May 2003. Revised August 2004.]
REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components: An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.