Birth-Death MCMCs, Annals of Statistics, 2000, Stephens


Bayesian Analysis of Mixture Models with an Unknown Number of Components - An Alternative to Reversible Jump Methods
Author(s): Matthew Stephens
Source: The Annals of Statistics, Vol. 28, No. 1 (Feb., 2000), pp. 40-74
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2673981
Accessed: 23/06/2010 16:00

The Annals of Statistics
2000, Vol. 28, No. 1, 40-74

BAYESIAN ANALYSIS OF MIXTURE MODELS WITH AN UNKNOWN NUMBER OF COMPONENTS - AN ALTERNATIVE TO REVERSIBLE JUMP METHODS¹

BY MATTHEW STEPHENS
University of Oxford

Richardson and Green present a method of performing a Bayesian analysis of data from a finite mixture distribution with an unknown number of components. Their method is a Markov chain Monte Carlo (MCMC) approach, which makes use of the "reversible jump" methodology described by Green. We describe an alternative MCMC method which views the parameters of the model as a (marked) point process, extending methods suggested by Ripley to create a Markov birth-death process with an appropriate stationary distribution. Our method is easy to implement, even in the case of data in more than one dimension, and we illustrate it on both univariate and bivariate data. There appears to be considerable potential for applying these ideas to other contexts, as an alternative to more general reversible jump methods, and we conclude with a brief discussion of how this might be achieved.

1. Introduction. Finite mixture models are typically used to model data where each observation is assumed to have arisen from one of k groups, each group being suitably modeled by a density from some parametric family. The density of each group is referred to as a component of the mixture, and is weighted by the relative frequency of the group in the population. This model provides a framework by which observations may be clustered together into groups for discrimination or classification [see, e.g., McLachlan and Basford (1988)]. For a comprehensive list of such applications, see Titterington, Smith and Makov (1985). Mixture models also provide a convenient and flexible family of distributions for estimating or approximating distributions which are not well modeled by any standard parametric family, and provide a parametric alternative to non-parametric methods of density estimation, such as kernel density estimation. See, for example, Roeder (1990), West (1993) and Priebe (1994).

This paper is principally concerned with the analysis of mixture models in which the number of components k is unknown. In applications where the components have a physical interpretation, inference for k may be of interest in itself. Where the mixture model is being used purely as a parametric alternative to non-parametric density estimation, the value of k chosen affects the flexibility of the model and thus the smoothness of the resulting density estimate. Inference for k may then be seen as analogous to bandwidth selection

Received December 1998.
¹Supported by an EPSRC studentship and a grant from the University of Oxford.
AMS 1991 subject classifications. Primary 62F15.
Key words and phrases. Bayesian analysis, birth-death process, Markov process, MCMC, mixture model, model choice, reversible jump, spatial point process.

in kernel density estimation. Procedures which allow k to vary may therefore be of interest whether or not k has a physical interpretation.

Inference for k may be seen as a specific example of the very common problem of choosing a model from a given set of competing models. Taking a Bayesian approach to this problem, as we do here, has the advantage that it provides not only a way of selecting a single "best" model, but also a coherent way of combining results over different models. In the mixture model context this might include performing density estimation by taking an appropriate average of density estimates obtained using different values of k. While model choice (and model averaging) within the Bayesian framework are both theoretically straightforward, they often provide a computational challenge, particularly when (as here) the competing models are of differing dimension. The use of Markov chain Monte Carlo (MCMC) methods [see for an introduction Gilks, Richardson and Spiegelhalter (1996)] to perform Bayesian analysis is now very common, but MCMC methods which are able to jump between models of differing dimension have become popular only recently, in particular through the use of the "reversible jump" methodology developed by Green (1995). Reversible jump methods allow the construction of an ergodic Markov chain with the joint posterior distribution of the parameters and the model as its stationary distribution. Moves between models are achieved by periodically proposing a move to a different model, and rejecting it with appropriate probability to ensure that the chain possesses the required stationary distribution. Ideally these proposed moves are designed to have a high probability of acceptance so that the algorithm explores the different models adequately, though this is not always easy to achieve in practice. As usual in MCMC methods, quantities of interest may be estimated by forming sample path averages over simulated realizations of this Markov chain. The reversible jump methodology has now been applied to a wide range of model choice problems, including change point analysis [Green (1995)], Quantitative Trait Locus analysis [Stephens and Fisch (1998)] and mixture models [Richardson and Green (1997)].

In this paper we present an alternative method of constructing an ergodic Markov chain with appropriate stationary distribution, when the number of components k is considered unknown. The method is based on the construction of a continuous time Markov birth-death process as described by Preston (1976) with the appropriate stationary distribution. MCMC methods based on these (and related) processes have been used extensively in the point process literature to simulate realizations of point processes which are difficult to simulate from directly; an idea which originated with Kelly and Ripley (1976) and Ripley (1977) [see also Glotzl (1981), Stoyan, Kendall and Mecke (1987)]. These realizations can then be used for significance testing [as in Ripley (1977)], or likelihood inference for the parameters of the model [see, e.g., Geyer and Møller (1994) and references therein]. More recently such MCMC methods have been used to perform Bayesian inference for the parameters of a point process model, where the parameters themselves are (modeled by) a point process [see, e.g., Baddeley and van Lieshout (1993), Lawson (1996)].

In order to apply these MCMC methods to the mixture model context, we view the parameters of the model as a (marked) point process, with each point representing a component of the mixture. The MCMC scheme allows the number of components to vary by allowing new components to be "born" and existing components to "die." These births and deaths occur in continuous time, and the relative rates at which they occur determine the stationary distribution of the process. The relationship between these rates and the stationary distribution is formalized in Section 3 (Theorem 3.1). We then use this to construct an easily simulated process, in which births occur at a constant rate from the prior, and deaths occur at a rate which is very low for components which are critical in explaining the data, and very high for components which do not help explain the data. The accept-reject mechanism of reversible jump is thus replaced by a mechanism which allows both "good" and "bad" births to occur, but reverses bad births very quickly through a very quick death.

Our method is illustrated in Section 4, by fitting mixtures of normal (and t) distributions to univariate and bivariate data. We found that the posterior distribution of the number of components for a given data set typically depends heavily on modeling assumptions such as the form of the distribution for the components (normals or ts) and the priors used for the parameters of these distributions. In contrast, predictive density estimates tend to be relatively insensitive to these modeling assumptions. Our method appears to have similar computational expense to that of Richardson and Green (1997) in the context of mixtures of univariate normal distributions, though direct comparisons are difficult. Both methods certainly give computationally tractable solutions to the problem, with rough results available in a matter of minutes. However, our approach appears the more natural and elegant in this context, exploiting the natural nested structure of the models and exchangeability of the mixture components. As a result we remove the need for calculation of a complicated Jacobian, reducing the potential for making algebraic errors. In addition, the changes necessary to explore alternative models for the mixture components (replacing normals with t distributions, e.g.) are trivial.

We conclude with a discussion of the potential for extending the birth-death methodology (BDMCMC) to other contexts, as an alternative to more general reversible jump (RJMCMC) methods. One interpretation of BDMCMC is as a continuous-time version of RJMCMC, with a limit on the types of moves which are permitted in order to simplify implementation. BDMCMC is easily applied to any context where the parameters of interest may be viewed as a point process, and where the likelihood of these parameters may be explicitly calculated (this latter rules out Hidden Markov Models for example). We consider briefly some examples (a multiple change-point problem, and variable selection in regression models) where these conditions are fulfilled, and discuss the difficulties of designing suitable birth-death moves. Where such moves are sufficient to achieve adequate mixing BDMCMC provides an attractive easily-implemented alternative to more general RJMCMC schemes.

2. Bayesian methods for mixtures.

2.1. Notation and missing data formulation. We consider a finite mixture model in which data x^n = x_1, ..., x_n are assumed to be independent observations from a mixture density with k (k possibly unknown but finite) components,

(1) p(x | π, φ, η) = π_1 f(x; φ_1, η) + ... + π_k f(x; φ_k, η),

where π = (π_1, ..., π_k) are the mixture proportions which are constrained to be non-negative and sum to unity; φ = (φ_1, ..., φ_k) are the (possibly vector) component specific parameters, with φ_i being specific to component i; and η is a (possibly vector) common parameter which is common to all components. Throughout this paper p(· | ·) will be used to denote both conditional densities and distributions.

It is convenient to introduce the missing data formulation of the model, in which each observation x_j is assumed to arise from a specific but unknown component z_j of the mixture. The model (1) can be written in terms of the missing data, with z_1, ..., z_n assumed to be realizations of independent and identically distributed discrete random variables Z_1, ..., Z_n with probability mass function

(2) Pr(Z_j = i | π, η) = π_i    (j = 1, ..., n; i = 1, ..., k).

Conditional on the Zs, x_1, ..., x_n are assumed to be independent observations from the densities

(3) p(x_j | Z_j = i, π, φ, η) = f(x_j; φ_i, η)    (j = 1, ..., n).

Integrating out the missing data Z_1, ..., Z_n then yields the model (1).
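To make the notation concrete, the following sketch (our own illustration, not code from the paper) evaluates the log-likelihood of model (1) for univariate normal components f(x; φ_i) = N(x; μ_i, σ_i²) with no common parameter η, summing over the components rather than conditioning on the allocations Z; all function and variable names are ours.

    import numpy as np
    from scipy.stats import norm

    def mixture_loglik(x, weights, means, sds):
        """Log-likelihood of model (1) with univariate normal components,
        obtained by summing over (integrating out) the allocations Z."""
        x = np.asarray(x)[:, None]                      # shape (n, 1)
        comp_dens = norm.pdf(x, loc=means, scale=sds)   # shape (n, k): f(x_j; phi_i)
        mix_dens = comp_dens @ np.asarray(weights)      # shape (n,): sum_i pi_i f(x_j; phi_i)
        return np.sum(np.log(mix_dens))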

2.2. Hierarchical model. We assume a hierarchical model for the prior on the parameters (k, π, φ, η), with (π_1, φ_1), ..., (π_k, φ_k) being exchangeable. [For an alternative approach see Escobar and West (1995) who use a prior structure based on the Dirichlet process.] Specifically we assume that the prior distribution for (k, π, φ) given hyperparameters ω, and common component parameters η, has Radon-Nikodym derivative ("density") r(k, π, φ | ω, η) with respect to an underlying symmetric measure M (defined below). For notational convenience we drop for the rest of the paper the explicit dependence of r(· | ω, η) on ω and η. To ensure exchangeability we require that, for any given k, r(·) is invariant under relabeling of the components, in that

(4) r(k, (π_1, ..., π_k), (φ_1, ..., φ_k)) = r(k, (π_{ε(1)}, ..., π_{ε(k)}), (φ_{ε(1)}, ..., φ_{ε(k)}))

for all permutations ε of 1, ..., k.

In order to define the symmetric measure M we introduce some notation. Let U_{k-1} denote the Uniform distribution on the simplex Ψ_{k-1} = {(π_1, ..., π_k) : π_i ≥ 0, π_1 + ... + π_k = 1}.

Let Φ denote the parameter space for the φ_i (so φ_i ∈ Φ for all i), let ν be some measure on Φ, and let ν^k be the induced product measure on Φ^k. (For most of this paper Φ will be R^m for some m, and ν can be assumed to be Lebesgue measure.) Now let μ_k be the product measure ν^k × U_{k-1} on Φ^k × Ψ_{k-1}, and finally define M to be the induced measure on the disjoint union ∪_{k≥1} (Φ^k × Ψ_{k-1}).

A special case. Given ω and η, let k have prior probability mass distribution p(k | ω, η). Suppose φ and π are a priori independent given k, ω and η, with φ_1, ..., φ_k being independent and identically distributed from a distribution with density p̃(φ | ω, η) with respect to ν, and π having a uniform distribution on the simplex Ψ_{k-1}. Then

(5) r(k, π, φ) = p(k | ω, η) p̃(φ_1 | ω, η) ··· p̃(φ_k | ω, η).

Note that this special case includes the specific models used by Diebolt and Robert (1994) and Richardson and Green (1997) in the context of mixtures of univariate normal distributions.

2.3. Bayesian inference via MCMC. Given data x^n, Bayesian inference may be performed using MCMC methods, which involve the construction of a Markov chain {θ^(t)} with the posterior distribution p(θ | x^n) of the parameters θ = (k, π, φ, η) as its stationary distribution. Given suitable regularity conditions [see, e.g., Tierney (1996), page 65], quantities of interest may be consistently estimated by sample path averages. For example, if θ^(0), θ^(1), ... is a sampled realization of such a Markov chain, then inference for k may be based on an estimate of the marginal posterior distribution

(6) Pr(k = i | x^n) = lim_{N→∞} (1/N) #{t ≤ N : k^(t) = i} ≈ (1/N) #{t ≤ N : k^(t) = i}    (N large),

and similarly the predictive density for a future observation may be estimated by

(7) p̂(x_{n+1} | x^n) = (1/N) Σ_{t=1}^{N} p(x_{n+1} | θ^(t)).
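To illustrate how (6) and (7) are used, here is a sketch (ours) of the corresponding sample path averages, assuming the sampler output has been stored as a list of states, each a dict with entries "k", "weights", "means" and "sds" for univariate normal components; this storage format is our own assumption.

    import numpy as np
    from scipy.stats import norm

    def posterior_k(samples, i):
        """Estimate Pr(k = i | x^n) as in (6)."""
        return np.mean([s["k"] == i for s in samples])

    def predictive_density(samples, grid):
        """Estimate the predictive density (7) on a grid of new x values."""
        dens = np.zeros_like(grid, dtype=float)
        for s in samples:
            comp = norm.pdf(grid[:, None], loc=s["means"], scale=s["sds"])
            dens += comp @ np.asarray(s["weights"])     # mixture density under theta^(t)
        return dens / len(samples)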

More details, including details of the construction of a suitable Markov chain when k is fixed, can be found in the paper by Diebolt and Robert (1994), chapters of the books by Robert (1994) and Gelman et al. (1995), and the article by Robert (1996). Richardson and Green (1997) describe the construction of a suitable Markov chain when k is allowed to vary using the reversible

jump methodology developed by Green (1995). We now describe an alternative approach.

3. Constructing a Markov chain via simulation of point processes.

3.1. The parameters as a point process. Our strategy is to view each component of the mixture as a point in parameter space, and adapt theory from the simulation of point processes to help construct a Markov chain with the posterior distribution of the parameters as its stationary distribution. Since, for given k, the prior distribution for (π, φ) defined at (4) does not depend on the labeling of the components, and the likelihood

(8) L(k, π, φ, η) = p(x^n | k, π, φ, η) = Π_{j=1}^{n} [π_1 f(x_j; φ_1, η) + ... + π_k f(x_j; φ_k, η)]

is also invariant under permutations of the component labels, the posterior distribution

(9) p(k, π, φ | x^n, ω, η) ∝ L(k, π, φ, η) r(k, π, φ)

will be similarly invariant. Fixing ω and η, we can thus ignore the labeling of the components and can consider any set of k parameter values {(π_1, φ_1), ..., (π_k, φ_k)} as a set of k points in [0, 1] × Φ, with the constraint that π_1 + ... + π_k = 1 (see, e.g., Figure 1a). The posterior distribution p(k, π, φ | x^n, ω, η) can then be seen as a (suitably constrained) distribution of points in [0, 1] × Φ, or in other words a point process on [0, 1] × Φ. Equivalently the posterior distribution can be seen as a marked point process on Φ, with each point φ_i having an associated mark π_i ∈ [0, 1], with the marks being constrained to sum to unity.

This view of the parameters as a marked point process [which is also outlined by Dawid (1997)] allows us to use methods similar to those in Ripley (1977) to construct a continuous time Markov birth-death process with stationary distribution p(k, π, φ | x^n, ω, η), with ω and η kept fixed. Details of this construction are given in the next section. In Section 3.4 we combine this process with standard (fixed-dimension) MCMC update steps which allow ω and η to vary, to create a Markov chain with stationary distribution p(k, π, φ, ω, η | x^n).

3.2. Birth-death processes for the components of a mixture model. Let Ω_k denote the parameter space of the mixture model with k components, ignoring the labeling of the components, and let Ω = ∪_{k≥1} Ω_k. We will use set notation to refer to members of Ω, writing y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k to represent the parameters of the model (1) keeping η fixed, and so we may write (π_i, φ_i) ∈ y for i = 1, ..., k. Note that (for given ω and η) the invariance of L(·) and

[Figure 1 appears here: three panels of components plotted as points in parameter space, with the component mean on the x axis (-2 to 2) and the mixture proportions shown as labels.]

FIG. 1. Illustration of births and deaths as defined by (10) and (11). (a) Representation of 0.2 N(-1, 1) + 0.6 N(1, 2) + 0.2 N(1, 3) as a set of points in parameter space. N(μ, σ²) denotes the univariate normal distribution with mean μ and variance σ². (b) Resulting model after death of component 0.6 N(1, 2) in (a). (c) Resulting model after birth of component 0.2 N(0.5, 2) in (b).

r(·) under permutation of the component labels allows us to define L(y) and r(y) in an obvious way. We define births and deaths on Ω as follows:

Births: If at time t our process is at y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k and a birth is said to occur at (π, φ) ∈ [0, 1] × Φ, then the process jumps to

(10) y ∪ (π, φ) := {(π_1(1 - π), φ_1), ..., (π_k(1 - π), φ_k), (π, φ)} ∈ Ω_{k+1}.

Deaths: If at time t our process is at y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k and a death is said to occur at (π_i, φ_i) ∈ y, then the process jumps to

(11) y \ (π_i, φ_i) := {(π_1/(1 - π_i), φ_1), ..., (π_{i-1}/(1 - π_i), φ_{i-1}), (π_{i+1}/(1 - π_i), φ_{i+1}), ..., (π_k/(1 - π_i), φ_k)} ∈ Ω_{k-1}.

Thus a birth increases the number of components by one, while a death decreases the number of components by one. These definitions have been chosen so that births and deaths are inverse operations to each other, and the constraint π_1 + ... + π_k = 1 remains satisfied after a birth or death; they are

illustrated in Figure 1. With births and deaths thus defined, we consider the following continuous time Markov birth-death process:

When the process is at y ∈ Ω_k, let births and deaths occur as independent Poisson processes as follows:

Births: Births occur at overall rate β(y), and when a birth occurs it occurs at a point (π, φ) ∈ [0, 1] × Φ, chosen according to density b(y; (π, φ)) with respect to the product measure U × ν, where U is the uniform (Lebesgue) measure on [0, 1].

Deaths: When the process is at y = {(π_1, φ_1), ..., (π_k, φ_k)}, each point (π_j, φ_j) dies independently of the others as a Poisson process with rate

(12) δ_j(y) = d(y \ (π_j, φ_j); (π_j, φ_j))

for some d: Ω × ([0, 1] × Φ) → R_+. The overall death rate is then given by δ(y) = Σ_j δ_j(y). The time to the next birth/death event is then exponentially distributed, with mean 1/(β(y) + δ(y)), and it will be a birth with probability β(y)/(β(y) + δ(y)), and a death of component j with probability δ_j(y)/(β(y) + δ(y)). In order to ensure that the birth-death process doesn't jump to an area with zero "density" we impose the following conditions on b and d:

(13) b(y; (π, φ)) = 0 whenever r(y ∪ (π, φ)) L(y ∪ (π, φ)) = 0,
(14) d(y; (π, φ)) = 0 whenever r(y) L(y) = 0.
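A minimal sketch (ours) of the birth move (10), the death move (11) and the choice of the next event may help fix ideas; a model is represented as a list of (weight, parameter) pairs, and the names are ours.

    import numpy as np

    def birth(y, pi_new, phi_new):
        """Birth (10): scale existing weights by (1 - pi_new) and append the new point."""
        return [(w * (1.0 - pi_new), phi) for (w, phi) in y] + [(pi_new, phi_new)]

    def death(y, j):
        """Death (11): remove component j and renormalize the remaining weights."""
        w_j = y[j][0]
        return [(w / (1.0 - w_j), phi) for i, (w, phi) in enumerate(y) if i != j]

    def next_event(beta, deltas, rng):
        """Exponential waiting time to the next jump, and its type: a birth with
        probability beta/(beta + delta), else the death of component j with
        probability delta_j/(beta + delta)."""
        total = beta + sum(deltas)
        wait = rng.exponential(1.0 / total)
        if rng.uniform() < beta / total:
            return wait, "birth"
        j = rng.choice(len(deltas), p=np.asarray(deltas) / sum(deltas))
        return wait, ("death", int(j))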

The following theorem then gives sufficient conditions on b and d for this process to have stationary distribution p(k, π, φ | x^n, ω, η).

THEOREM 3.1. Assuming the general hierarchical prior on (k, π, φ) given in Section 2.2, and keeping ω and η fixed, the birth-death process defined above has stationary distribution p(k, π, φ | x^n, ω, η), provided b and d satisfy

(15) (k + 1) d(y; (π, φ)) r(y ∪ (π, φ)) L(y ∪ (π, φ)) k(1 - π)^{k-1} = β(y) b(y; (π, φ)) r(y) L(y)

for all y ∈ Ω_k and (π, φ) ∈ [0, 1] × Φ.

PROOF. The proof is deferred to the Appendix.

3.3. Naive algorithm for a special case. We now consider the special case described at (5), where

(16) r(y) = p(k | ω, η) p̃(φ_1 | ω, η) ··· p̃(φ_k | ω, η).

Suppose that we can simulate from p̃(· | ω, η), and consider the process obtained by setting β(y) = λ_b (a constant), with

b(y; (π, φ)) = k(1 - π)^{k-1} p̃(φ | ω, η).

Applying Theorem 3.1 we find that the process has the correct stationary distribution, provided that when the process is at y = {(π_1, φ_1), ..., (π_k, φ_k)},

each point (π_j, φ_j) dies independently of the others as a Poisson process with rate

(17) d(y \ (π_j, φ_j); (π_j, φ_j)) = λ_b · [L(y \ (π_j, φ_j)) / L(y)] · [p(k - 1 | ω, η) / (k p(k | ω, η))]    (j = 1, ..., k).

Algorithm 3.1 below simulates this process. We note that the algorithm is very straightforward to implement, requiring only the ability to simulate from p̃(· | ω, η), and to calculate the model likelihood for any given model. The main computational burden is in calculating the likelihood, and it is important that calculations of densities are stored and reused where possible.

ALGORITHM 3.1. To simulate a process with appropriate stationary distribution.

Starting with initial model y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k, iterate the following steps:

1. Let the birth rate β(y) = λ_b.
2. Calculate the death rate for each component, the death rate for component j being given by (17):

(18) δ_j(y) = λ_b · [L(y \ (π_j, φ_j)) / L(y)] · [p(k - 1 | ω, η) / (k p(k | ω, η))]    (j = 1, ..., k).

3. Calculate the total death rate δ(y) = Σ_j δ_j(y).
4. Simulate the time to the next jump from an exponential distribution with mean 1/(β(y) + δ(y)).
5. Simulate the type of jump: birth or death with respective probabilities

Pr(birth) = β(y)/(β(y) + δ(y)),    Pr(death) = δ(y)/(β(y) + δ(y)).

6. Adjust y to reflect the birth or death [as defined by (10) and (11)]:

Birth: Simulate the point (π, φ) at which a birth takes place from the density b(y; (π, φ)) = k(1 - π)^{k-1} p̃(φ | ω, η) by simulating π and φ independently from densities k(1 - π)^{k-1} and p̃(φ | ω, η) respectively. We note that the former is the Beta distribution with parameters (1, k), which is easily simulated from by simulating Y_1 ~ Γ(1, 1) and Y_2 ~ Γ(k, 1) and setting π = Y_1/(Y_1 + Y_2), where Γ(n, λ) denotes the Gamma distribution with mean n/λ.

Death: Select a component to die: (π_j, φ_j) ∈ y being selected with probability δ_j(y)/δ(y) for j = 1, ..., k.

7. Return to step 2.
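The following sketch (ours, not the author's code) puts the pieces together for one run of Algorithm 3.1 over a fixed virtual time t0, assuming univariate normal components, a truncated Poisson(λ) prior on k, and the helper functions mixture_loglik, birth, death and next_event sketched earlier; prior_sample is a hypothetical routine drawing φ = (μ, σ) from p̃(· | ω, η).

    import numpy as np

    def loglik(y, x):
        """log L(y) for univariate normal components, via mixture_loglik above."""
        weights, params = zip(*y)
        mus, sigmas = zip(*params)
        return mixture_loglik(x, np.array(weights), np.array(mus), np.array(sigmas))

    def run_birth_death(y, x, lam_b, log_prior_k, prior_sample, t0, rng):
        """One pass of Algorithm 3.1; y is a list of (weight, (mu, sigma)) pairs."""
        t = 0.0
        while True:
            k = len(y)
            loglik_y = loglik(y, x)
            # Step 2: death rates (18), computed on the log scale for stability.
            if k == 1:
                deltas = [0.0]            # deleting the last component has prior probability zero here
            else:
                deltas = []
                for j in range(k):
                    y_minus_j = death(y, j)
                    log_rate = (np.log(lam_b) + loglik(y_minus_j, x) - loglik_y
                                + log_prior_k(k - 1) - np.log(k) - log_prior_k(k))
                    deltas.append(np.exp(log_rate))
            # Steps 4-5: waiting time and type of the next jump.
            dt, event = next_event(lam_b, deltas, rng)
            t += dt
            if t > t0:
                return y                  # virtual time is up; return the current model
            # Step 6: carry out the jump.
            if event == "birth":
                pi_new = rng.beta(1, k)   # density k(1 - pi)^(k-1)
                y = birth(y, pi_new, prior_sample(rng))
            else:
                y = death(y, event[1])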

REMARK 3.2. Algorithm 3.1 seems rather naive in that births occur (in some sense) from the prior, which may lead to many births of components which do not help to explain the data. Such components will have a high death rate (17) and so will die very quickly, which is inefficient in the same way as an accept-reject simulation algorithm is inefficient if many samples are rejected. However, in the examples we consider in the next section this naive algorithm performs reasonably well, and so we have not considered any cleverer choices of b(y; (π, φ)) which may allow births to occur in a less naive way (see Section 5.2 for further discussion).

3.4. Constructing a Markov chain. If we fix ω and η then Algorithm 3.1 simulates a birth-death process with stationary distribution p(k, π, φ | x^n, ω, η). This can be combined with MCMC update steps which allow ω and η to vary, to create a Markov chain with stationary distribution p(k, π, φ, ω, η | x^n). By augmenting the data x^n by the missing data z^n = (z_1, ..., z_n) described in Section 2.1, and assuming the existence and use of the necessary conjugate priors, we can use Gibbs sampling steps to achieve this as in Algorithm 3.2 below; Metropolis-Hastings updates could also be used, removing the need to introduce the missing data or use conjugate priors.

ALGORITHM 3.2. To simulate a Markov chain with appropriate stationary distribution.

Given the state θ^(t) at time t, simulate a value for θ^(t+1) as follows:

Step 1. Sample (k^(t)', π^(t)', φ^(t)') by running the birth-death process for a fixed time t_0, starting from (k^(t), π^(t), φ^(t)) and fixing (ω, η) to be (ω^(t), η^(t)). Set k^(t+1) = k^(t)'.
Step 2. Sample (z^n)^(t+1) from p(z | k^(t+1), π^(t)', φ^(t)', ω^(t), η^(t), x^n).
Step 3. Sample η^(t+1), ω^(t+1) from p(η, ω | k^(t+1), π^(t)', φ^(t)', x^n, z^n).
Step 4. Sample π^(t+1), φ^(t+1) from p(π, φ | k^(t+1), η^(t+1), ω^(t+1), x^n, z^n).

Provided the full conditional posterior distributions for each parameter give support to all parts of the parameter space, this will define an irreducible Markov chain with stationary distribution p(k, π, φ, ω, η, z^n | x^n), suitable for estimating quantities of interest by forming sample path averages as in (6) and (7). The proof is straightforward and is omitted here [see Stephens (1997), page 84]. Step 1 of the algorithm involves movements between different values of k by allowing new components to be "born," and existing components to "die." Steps 2, 3 and 4 allow the parameters to vary with k kept fixed. Step 4 is not strictly necessary to ensure convergence of the Markov chain to the correct stationary distribution, but is included to improve mixing. Note that (as usual in Gibbs sampling) the algorithm remains valid if any or all of ω, η and φ are partitioned into separate components which are updated one at a time by a Gibbs sampling step, as will be the case in our examples.
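A skeleton of Algorithm 3.2 in the same style (again only a sketch; the gibbs_update_* routines stand in for the conjugate full-conditional draws given in Section 4 and are hypothetical names):

    def bdmcmc_chain(y0, x, n_iter, lam_b, log_prior_k, prior_sample, t0,
                     gibbs_update_allocations, gibbs_update_hyper,
                     gibbs_update_components, rng):
        """Skeleton of Algorithm 3.2: a birth-death step followed by Gibbs steps."""
        y, samples = y0, []
        for _ in range(n_iter):
            # Step 1: run the birth-death process for virtual time t0 (k may change).
            y = run_birth_death(y, x, lam_b, log_prior_k, prior_sample, t0, rng)
            # Step 2: allocations z^n given the current components.
            z = gibbs_update_allocations(y, x, rng)
            # Step 3: hyperparameters (omega, eta) given the allocations.
            hyper = gibbs_update_hyper(y, z, x, rng)
            # Step 4: weights and component parameters given allocations and hyperparameters.
            y = gibbs_update_components(y, z, hyper, x, rng)
            samples.append({"k": len(y), "state": list(y)})
        return samples

In a full implementation prior_sample would also be refreshed from the current hyperparameters after Step 3, so that births in Step 1 are drawn from p̃(· | ω^(t), η^(t)).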

4. Examples. Our examples demonstrate the use of Algorithm 3.2 to perform inference in the context of both univariate and bivariate data x^n, which are assumed to be independent observations from a mixture of an unknown

(finite) number of normal distributions:

(19) p(x | π, μ, Σ) = π_1 N(x; μ_1, Σ_1) + ... + π_k N(x; μ_k, Σ_k).

Here N(x; μ_i, Σ_i) denotes the density function of the r-dimensional multivariate normal distribution with mean μ_i and variance-covariance matrix Σ_i. In the univariate case (r = 1) we may write σ_i² for Σ_i.

Prior distributions. We assume a truncated Poisson prior on the number of components k:

(20) p(k) ∝ λ^k / k!    (k = 1, ..., k_max = 100),

where λ is a constant; we will perform analyses with several different values of λ. Conditional on k we base our prior for the model parameters on the hierarchical prior suggested by Richardson and Green (1997) in the context of mixtures of univariate normal distributions. A natural generalization of their prior for r dimensions is obtained by replacing univariate normal distributions with multivariate normal distributions, and replacing gamma distributions with Wishart distributions, to give

(21) μ_i ~ N_r(ξ, κ^{-1})    (i = 1, ..., k),
(22) Σ_i^{-1} | β ~ W_r(2α, (2β)^{-1})    (i = 1, ..., k),
(23) β ~ W_r(2g, (2h)^{-1}),
(24) π ~ D(γ),

where β is a hyperparameter; κ, β and h are r × r matrices; ξ is an r × 1 vector; α, γ and g are scalars; D(γ) denotes the symmetric Dirichlet distribution with parameter γ and density

[Γ(kγ) / Γ(γ)^k] π_1^{γ-1} ··· π_{k-1}^{γ-1} (1 - π_1 - ... - π_{k-1})^{γ-1};

and W_r(m, A) denotes the Wishart distribution in r dimensions with parameters m and A. This last is usually introduced as the distribution of the sample covariance matrix, for a sample of size m from a multivariate normal distribution in r dimensions with covariance matrix A. Because of this interpretation m is usually taken as an integer, and for m > r, W_r(m, A) has density

(25) W_r(V; m, A) = K |A|^{-m/2} |V|^{(m-r-1)/2} exp{-(1/2) tr(A^{-1} V)} I(V positive definite)

on the space of all symmetric matrices (≅ R^{r(r+1)/2}), where I(·) denotes an indicator function and

K^{-1} = 2^{mr/2} π^{r(r-1)/4} Π_{i=1}^{r} Γ((m + 1 - i)/2).
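Since the naive birth mechanism of Algorithm 3.1 draws new component parameters from this prior, a sketch of such a draw for the univariate case (r = 1) may be useful; here W_1(2α, (2β)^{-1}) reduces to a Gamma distribution with shape α and rate β for σ^{-2}, and the function name is ours.

    import numpy as np

    def sample_component_prior(xi, kappa, alpha, beta, rng):
        """Draw (mu, sigma) from the univariate form of priors (21)-(22):
        mu ~ N(xi, 1/kappa), sigma^{-2} ~ Gamma(shape=alpha, rate=beta)."""
        mu = rng.normal(xi, 1.0 / np.sqrt(kappa))
        precision = rng.gamma(alpha, 1.0 / beta)   # numpy parametrizes the Gamma by scale = 1/rate
        return mu, 1.0 / np.sqrt(precision)

This is the role played by prior_sample in the sketch of Algorithm 3.1, with β itself refreshed in the Gibbs steps via (23).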

However, (25) also defines a density for non-integer m, provided m > r - 1. Methods of simulating from the Wishart distribution which work for non-integer m (> r - 1) may be found in Ripley (1987). For m < r - 1 we will use W_r(m, A) to represent the improper distribution with density proportional to (25). (This is not the usual definition of W_r(m, A) for m < r - 1, which is a singular distribution confined to a subspace of symmetric matrices.) Where an improper prior distribution is used, it is important to check the integrability of the posterior.

For univariate data we follow Richardson and Green (1997), who take (ξ, κ, α, g, h, γ) to be (data-dependent) constants with the following values:

ξ = ξ_1,    κ = 1/R_1²,    α = 2,    g = 0.2,    h = 100g/(α R_1²),    γ = 1,

where ξ_1 is the midpoint of the observed interval of variation of the data, and R_1 is the length of this interval. The value α = 2 was chosen to express the belief that the variances of the components are similar, without restricting them to be equal. For bivariate data (r = 2) we felt that a slightly stronger constraint would be appropriate, and so increased α to 3, making a corresponding change in g and obvious generalizations for the other constants to give

ξ = (ξ_1, ξ_2),    κ = diag(1/R_1², 1/R_2²),    α = 3,    g = 0.3,    h = (100g/α) diag(1/R_1², 1/R_2²),    γ = 1,

where ξ_1 and ξ_2 are the midpoints of the observed intervals of variation of the data in the first and second dimension respectively, and R_1 and R_2 are the respective lengths of these intervals. We note that the prior on β in the bivariate case, β ~ W_2(0.6, (2h)^{-1}), is an improper distribution, but careful checking of the necessary integrals shows that the posterior distributions are proper.

In our examples we consider the following priors:

1. The Fixed-κ prior, which is the name we give to the prior given above. The full conditional posterior distributions required for the Gibbs sampling

updates (Steps 2-4 in Algorithm 3.2) are then (using ... to denote conditioning on all other variables)

(26) p(z_j = i | ...) ∝ π_i N(x_j; μ_i, Σ_i),
(27) β | ... ~ W_r(2g + 2kα, [2h + 2 Σ_i Σ_i^{-1}]^{-1}),
(28) π | ... ~ D(γ + n_1, ..., γ + n_k),
(29) μ_i | ... ~ N_r((n_i Σ_i^{-1} + κ)^{-1} (n_i Σ_i^{-1} x̄_i + κ ξ), (n_i Σ_i^{-1} + κ)^{-1}),
(30) Σ_i^{-1} | ... ~ W_r(2α + n_i, [2β + Σ_{j: z_j = i} (x_j - μ_i)(x_j - μ_i)']^{-1}),

for i = 1, ..., k and j = 1, ..., n, where n_i is the number of observations allocated to class i (n_i = #{j : z_j = i}) and x̄_i is the mean of the observations allocated to class i (x̄_i = Σ_{j: z_j = i} x_j / n_i). The Gibbs sampling updates were performed in the order z, π, μ, Σ, β.
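A sketch of two of these updates for the univariate case, the allocation step (26) and the weight step (28), again with our own names and a numpy random Generator assumed:

    import numpy as np
    from scipy.stats import norm

    def update_allocations(x, weights, means, sds, rng):
        """Gibbs step (26): p(z_j = i | ...) proportional to pi_i N(x_j; mu_i, sigma_i^2)."""
        probs = np.asarray(weights) * norm.pdf(np.asarray(x)[:, None], loc=means, scale=sds)
        probs /= probs.sum(axis=1, keepdims=True)
        return np.array([rng.choice(len(weights), p=p) for p in probs])

    def update_weights(z, k, gamma, rng):
        """Gibbs step (28): pi ~ Dirichlet(gamma + n_1, ..., gamma + n_k)."""
        counts = np.bincount(z, minlength=k)
        return rng.dirichlet(gamma + counts)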

2. The Variable-κ prior, in which ξ and κ are also treated as hyperparameters on which we place "vague" priors. This is an attempt to represent the belief that the means will be close together when viewed on some scale, without being informative about their actual location. It is also an attempt to address some of the objections to the Fixed-κ prior discussed in Section 5.1. We chose to place an improper uniform prior distribution on ξ and a "vague" W_r(l, (l I_r)^{-1}) distribution on κ, where I_r is the r × r identity matrix. In order to ensure the posterior distribution for ξ is proper, this distribution is required to be proper, and so we require l > r - 1. We used l = r - 1 + 0.001 as our default value for l. (In general, fixing a distribution to be proper in this way is not a good idea. However, in this case it can be shown that if l = r - 1 + ε then inference for μ, Σ and k is not sensitive to ε for small ε, although numerical problems may occur for very small ε.) The full conditional posteriors are then as for the Fixed-κ prior, with the addition of

(31) ξ | ... ~ N_r(μ̄, (kκ)^{-1}), where μ̄ = (1/k) Σ_i μ_i,
(32) κ | ... ~ W_r(l + k, [l I_r + Σ_i (μ_i - ξ)(μ_i - ξ)']^{-1}).

Values for (t_0, λ_b). Algorithm 3.1 requires the specification of a birth-rate λ_b, and Algorithm 3.2 requires the specification of a (virtual) time t_0 for which the birth-death process is run. Doubling λ_b is mathematically equivalent to doubling t_0, and so we are free to fix t_0 = 1, and specify a value for λ_b. In all our examples we used λ_b = λ [the parameter of the Poisson prior in (20)], which gives a convenient form of the death rates (18) as a likelihood ratio which does not depend on λ. Larger values of λ_b will result in better mixing over k, at the cost of more computation time per iteration of Algorithm 3.2, and it is not clear how an optimal balance between these factors should be achieved.
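To spell out the simplification claimed here (our own working, not a display from the paper): with p(k) ∝ λ^k / k! and λ_b = λ, the prior ratio in (18) is

p(k - 1 | ω, η) / (k p(k | ω, η)) = [λ^{k-1}/(k-1)!] / [k λ^k/k!] = 1/λ,

so that

δ_j(y) = λ · [L(y \ (π_j, φ_j)) / L(y)] · (1/λ) = L(y \ (π_j, φ_j)) / L(y),

a pure likelihood ratio independent of λ (the truncation of the prior at k_max changes nothing here, since its normalizing constant cancels).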

4.1. Example 1: Galaxy data. As our first example we consider the galaxy data first presented by Postman, Huchra and Geller (1986), consisting of the velocities (in 10³ km/s) of distant galaxies diverging from our own, from six well-separated conic sections of the Corona Borealis. The original data consist of 83 observations, but one of these observations (a velocity of 5.607 × 10³ km/s) does not appear in the version of the data given by Roeder (1990), which has since been analyzed under a variety of mixture models by a number of authors, including Crawford (1994), Chib (1995), Carlin and Chib (1995), Escobar and West (1995), Phillips and Smith (1996) and Richardson and Green (1997). In order to make our analysis comparable with these we have chosen to ignore the missing observation. A histogram of the data overlaid with a Gaussian kernel density estimate is shown in Figure 2. The multimodality of the velocities may indicate the presence of superclusters of galaxies surrounded by large voids, each mode representing a cluster as it moves away at its own speed [Roeder (1990) gives more background].

[Figure 2 appears here: histogram with overlaid kernel density estimate, velocity (0-40) on the x axis.]

FIG. 2. Histogram of the galaxy data, with bin-widths chosen by eye. Since histograms are rather unreliable density estimation devices [see, e.g., Roeder (1990)] we have overlaid the histogram with a non-parametric density estimate using Gaussian kernel density estimation, with bandwidth chosen automatically according to a rule given by Sheather and Jones (1991), calculated using the S function width.SJ from Venables and Ripley (1997).

We use Algorithm 3.2 to fit the following mixture models to the galaxy data:

(a) A mixture of normal distributions using the Fixed-κ prior described in Section 4.
(b) A mixture of normal distributions using the Variable-κ prior described in Section 4.
(c) A mixture of t distributions on p = 4 degrees of freedom:

(33) p(x | π, μ, σ²) = π_1 t_p(x; μ_1, σ_1²) + ... + π_k t_p(x; μ_k, σ_k²),

where t_p(x; μ_i, σ_i²) is the density of the t-distribution with p degrees of freedom, with mean μ_i and variance p σ_i²/(p - 2) [see, e.g., Gelman et al. (1995), page 476]. The value p = 4 was chosen to give a distribution similar to the normal distribution with slightly "fatter tails," since there was some evidence when fitting the normal distributions that extra components were being used to create longer tails. We used the Fixed-κ prior for (π, μ, σ²). Adjusting the birth-death algorithm to fit t distributions is simply a matter of replacing the normal density with the t density when calculating the likelihood. The Gibbs sampling steps are performed as explained in Stephens (1997).
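The density swap described here amounts to a one-line change in the likelihood computation; a sketch (ours), using scipy's t density with location μ_i and scale σ_i so that the component variance is p σ_i²/(p - 2) as stated above:

    import numpy as np
    from scipy.stats import t as student_t

    def mixture_loglik_t4(x, weights, means, sigmas, p=4):
        """As mixture_loglik, but with each normal density replaced by a
        t density with p degrees of freedom, location mu_i and scale sigma_i."""
        comp_dens = student_t.pdf(np.asarray(x)[:, None], df=p, loc=means, scale=sigmas)
        return np.sum(np.log(comp_dens @ np.asarray(weights)))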

We will refer to these three models as "Normal, Fixed-κ," "Normal, Variable-κ" and "t_4, Fixed-κ" respectively. For each of the three models we performed the analysis with four different values of the parameter λ (the parameter of the truncated Poisson prior on k): 1, 3, 6 and 25. The choice of λ = 25 was considered in order to give some idea of how the method would behave as λ was allowed to get very large.

Starting points, computational expense and mixing behavior. For each prior we performed 20,000 iterations of Algorithm 3.2, with the starting point being chosen by setting k = 1, setting (ξ, κ) to the values chosen for the Fixed-κ prior, and sampling the other parameters from their joint prior distribution. In each case the sampler moved quickly from the low likelihood of the starting point to an area of parameter space with higher likelihood. The computational expense was not great. For example, the runs for λ = 3 took 150-250 seconds (CPU times on a Sun UltraSparc 200 workstation, 1997), which corresponds to about 80-130 iterations per second. Roughly the same amount of time was spent performing the Gibbs sampling steps as performing the birth-death calculations. The main expense of the birth-death process calculations is in calculating the model likelihood, and a significant saving could be made by using a look-up table for the normal density (this was not done).

In assessing the convergence and mixing properties of our algorithm we follow Richardson and Green (1997) in examining firstly the mixing over k, and then the mixing over the other parameters within k. Figure 3a shows the sampled values of k for the runs with λ = 3. A rough idea of how well the algorithm is exploring the space may be obtained from the percentages of iterations which changed k, which in this case were 36%, 52% and 38% for models (a)-(c) respectively. More information can be obtained from the autocorrelation of the sampled values of k (Figure 3b) which show that successive samples have a high autocorrelation.

[Figure 3 appears here. Panel (a): sampled values of k against iteration (0-20000) for the three models. Panel (b): autocorrelations (lag 0-100) of the sampled values of k.]

FIG. 3. Results from using Algorithm 3.2 to fit the three different models to the galaxy data using λ = 3. The columns show results for Left: Normals, Fixed-κ; Middle: Normals, Variable-κ; Right: t_4s, Fixed-κ.

This is due to the fact that k tends to change by at most one in each iteration, and so many iterations are required to move between small and large values of k.

In order to obtain a comparison with the performance of the reversible jump sampler of Richardson and Green (1997) we also performed runs with the prior they used for this data; namely a uniform prior on k = 1, ..., 30 and the Fixed-κ prior on the parameters. For this prior our sampler took 170 seconds and changed k in 34% of iterations, which compares favorably with the 11-18% of iterations obtained by Richardson and Green (1997) using the reversible jump sampler (their Table 1). We also tried applying the convergence diagnostic suggested by Gelman and Rubin (1992) which requires more than one chain to be run from over-dispersed starting points (see the reviews by Cowles and Carlin (1996) or Brooks and Roberts (1998) for alternative diagnostics). Based on four chains of length 20,000, with two started from k = 1 and two started from k = 30, convergence was diagnosed for the output of Algorithm 3.2 within 2500 iterations.

Richardson and Green (1997) note that allowing k to vary can result in much improved mixing behavior of the sampler over the mixture model parameters within k. For example, if we fix k and use Gibbs sampling to fit k = 3 t_4 distributions to the galaxy data with the Fixed-κ prior, there are two well-separated modes (a major mode with means near 10, 20 and 23 and a minor mode with means near 10, 21 and 34). Our Gibbs sampler with fixed

k struggled to move between these modes, moving from major mode to minor mode and back only once in 10,000 iterations (results not shown). We applied Algorithm 3.2 to this problem, using λ = 1. Of the 10,000 points sampled, there were 1913 visits to k = 3, during which the minor mode was visited on at least 6 different occasions (Figure 4). In this case the improved mixing behavior results from the ability to move between the modes for k = 3 via states with k = 4: that is (roughly speaking), from the major mode to the minor mode via a four component model with means near 10, 20, 23 and 34. If we are genuinely only interested in the case k = 3 then the improved mixing behavior of the variable k sampler must be balanced against its increased computational cost, particularly as we generated only 1913 samples from k = 3 in 10,000 iterations of the sampler. By truncating the prior on k to allow only k = 3 and k = 4, and using λ = 0.1 to favor the 3 component model strongly, we were able to increase this to 7371 samples with k = 3 in 10,000 iterations, with about 6 separate visits to the minor mode. Alternative strategies for obtaining a sample from the birth-death process conditional on a fixed value of k are given by Ripley (1977).

Inference. The results in this section are based on runs of length 20,000 with the first 10,000 iterations being discarded as burn-in, numbers we believe to be large enough to give meaningful results based on our investigations of the mixing properties of our chain. Estimates of the posterior distribution of k (Figure 5) show that it is highly sensitive to the prior used, both in terms of choice of λ and the prior (Variable-κ or Fixed-κ) used on the parameters (μ, σ²). Corresponding estimates of the predictive density (Figure 6) show that this is less sensitive to choice of model. Although the density estimates become less smooth as λ increases, even the density estimates for the unreasonably large value of λ = 25 do not appear to be over-fitting badly.

The large number of normal components being fitted to the data suggests that the data is not well modeled by a mixture of normal distributions. Further investigation shows that many of these components have small weight and are being used to effectively "fatten the tails" of the normal distributions, which explains why fewer t_4 components are required to model the data. Parsimony suggests that we should prefer the t_4 model, and we can formalize this as follows.

[Figure 4 appears here: three panels of sampled component means against sample point (0-2000).]

FIG. 4. Sampled values of means for three components, sampled using Algorithm 3.2 when fitting a variable number of t_4 components to the galaxy data, with Fixed-κ prior, λ = 1, and conditioning the resulting output on k = 3. The output is essentially "unlabeled," and so labeling of the points was achieved by applying Algorithm 3.3 of Stephens (1997). The variable k sampler visits the minor mode at least 6 separate times in 1913 iterations, compared with once in 10,000 iterations for a fixed k sampler.

[Figure 5 appears here: bar charts of the estimated Pr(k = i) over k = 0-14, in three columns (models) and three rows (a) λ = 1, (b) λ = 3, (c) λ = 6.]

FIG. 5. Graphs showing estimates (6) of Pr(k = i) for i = 1, 2, ..., for the galaxy data. These estimates are based on the values of k sampled using Algorithm 3.2 when fitting the three different models to the galaxy data with λ = 1, 3, 6, with in each case the first 10,000 samples having been discarded as burn-in. The three columns show results for Left: Normals, Fixed-κ; Middle: Normals, Variable-κ; Right: t_4s, Fixed-κ. The posterior distribution of k can be seen to depend on the type of mixture used (normal or t_4), the prior distribution for k (value of λ), and the prior distribution for (μ, σ²) (Variable-κ or Fixed-κ).

[Figure 6 appears here: predictive density estimates plotted against velocity (0-40), in three columns (models) and four rows (a) λ = 1, (b) λ = 3, (c) λ = 6, (d) λ = 25.]

FIG. 6. Predictive density estimates (7) for the galaxy data. These are based on the output of Algorithm 3.2 when fitting the three different models to the galaxy data with λ = 1, 3, 6, 25. The three columns show results for Left: Normals, Fixed-κ; Middle: Normals, Variable-κ; Right: t_4s, Fixed-κ. The density estimates become less smooth as λ increases, corresponding to a prior distribution which favors a larger number of components. However, the method appears to perform acceptably for even unreasonably large values of λ.

Suppose we assume that the data has arisen from either a mixture of normals or a mixture of t_4s, with p(t_4) = p(normal) = 0.5. For the Fixed-κ prior with λ = 1 we can estimate p(k | t_4, x^n) and p(k | normal, x^n) using Algorithm 3.2 (Table 1). By Bayes' theorem we have

(34) p(k | t_4, x^n) = p(k, t_4 | x^n) / p(t_4 | x^n)    for all k,

and so

(35) p(t_4 | x^n) = p(k, t_4 | x^n) / p(k | t_4, x^n) = p(x^n | k, t_4) p(k, t_4) / [p(k | t_4, x^n) p(x^n)]    for all k,

and similarly

(36) p(normal | x^n) = p(x^n | k, normal) p(k, normal) / [p(k | normal, x^n) p(x^n)]    for all k.

Thus if we can estimate p(x^n | k, t_4) for some k and p(x^n | k, normal) for some k then we can estimate p(t_4 | x^n) and p(normal | x^n). Mathieson (1997) describes a method [a type of importance sampling which he refers to as Truncated Harmonic Mean (THM) and which is similar to the method described by DiCiccio, Kass, Raftery and Wasserman (1997)] of obtaining estimates for p(x^n | k, t_4) and p(x^n | k, normal), and uses this method to obtain the estimates

-log p(x^n | k = 3, t_4) ≈ 227.64 and -log p(x^n | k = 3, normal) ≈ 229.08,

giving [using equations (35) and (36)]

p(t_4 | x^n) ≈ 0.916 and p(normal | x^n) ≈ 0.084,

from which we can estimate p(t_4, k | x^n) = p(t_4 | x^n) p(k | t_4, x^n), and similarly for normals; the results are shown in Table 2. We conclude that for the prior distributions used, mixtures of t_4 distributions are heavily favored over mixtures of normal distributions, with four t_4 components having the highest posterior probability.

TABLE 1
Estimates of the posterior probabilities p(k | t_4, x^n) and p(k | normal, x^n) for the galaxy data (Fixed-κ prior, λ = 1). These are the means of the estimates from five separate runs of Algorithm 3.2, each run consisting of 20,000 iterations with the first 10,000 iterations being discarded as burn-in; the standard errors of these estimates are shown in brackets

k =                    2        3        4        5        6        >6
p̂(k | t_4, x^n)       0.056    0.214    0.601    0.115    0.012    0.001
                      (0.014)  (0.009)  (0.011)  (0.005)  (0.001)  (0.000)
p̂(k | normal, x^n)    0.000    0.554    0.338    0.093    0.013    0.001
                               (0.014)  (0.011)  (0.004)  (0.001)  (0.000)
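As a check on these figures (our own working): taking k = 3 in (35) and (36), the common prior p(k = 3, t_4) = p(k = 3, normal) and the factor p(x^n) cancel in the ratio, so

p(t_4 | x^n) / p(normal | x^n) = [p(x^n | 3, t_4) / p̂(3 | t_4, x^n)] / [p(x^n | 3, normal) / p̂(3 | normal, x^n)]
                               = exp(229.08 - 227.64) × (0.554 / 0.214) ≈ 10.9,

which gives p(t_4 | x^n) ≈ 10.9/11.9 ≈ 0.92, consistent with the value 0.916 quoted above.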

TABLE 2
Estimates of the posterior probabilities p(t_4, k | x^n) and p(normal, k | x^n) for the galaxy data (Fixed-κ prior, λ = 1). See text for details of how these were obtained

k =                    2        3        4        5        6        >6
p̂(t_4, k | x^n)       0.051    0.196    0.551    0.105    0.011    0.000
p̂(normal, k | x^n)    0.000    0.047    0.028    0.008    0.001    0.000

It would be relatively straightforward to modify our algorithm to fit t distributions with an unknown number of degrees of freedom, thus automating the above model choice procedure. It would also be straightforward to allow each component of the mixture to have a different number of degrees of freedom.

4.2. Example 2: Old Faithful data. For our second example, we consider the Old Faithful data [the version from Härdle (1991) also considered by Venables and Ripley (1994)] which consists of data on 272 eruptions of the Old Faithful geyser in the Yellowstone National Park. Each observation consists of two measurements: the duration (in minutes) of the eruption, and the waiting time (in minutes) before the next eruption. A scatter plot of the data in two dimensions shows two moderately separated groups (Figure 7). We used Algorithm 3.2 to fit a mixture of an unknown number of bivariate normal distributions to the data, using λ = 1, 3 and both the Fixed-κ and Variable-κ priors detailed in Section 4.

Each run consisted of 20,000 iterations of Algorithm 3.2, with the starting point being chosen by setting k = 1, setting (ξ, κ) to the values chosen for the Fixed-κ prior, and sampling the other parameters from their joint prior distribution.

[Figure 7 appears here: scatter plot, duration (1-6) on the x axis.]

FIG. 7. Scatter plot of the Old Faithful data [from Härdle (1991)]. The x axis shows the duration (in minutes) of the eruption, and the y axis shows the waiting time (in minutes) before the next eruption.

In each case the sampler moved quickly from the low likelihood of the starting point to an area of parameter space with higher likelihood. The runs for λ = 3 took about 7-8 minutes. Figure 8a shows the resulting sampled values of the number of components, which can be seen to vary more rapidly for the Variable-κ model, due in part to its greater permissiveness of extra components. For the runs with λ = 3 the proportions of iterations which resulted in a change in k were 9% (Fixed-κ) and 39% (Variable-κ). For λ = 1 the corresponding figures were 3% and 10% respectively. Graphs of the autocorrelations (Figure 8b) suggest that the mixing is slightly poorer than for the galaxy data, presumably due to births of reasonable components being less likely in the two-dimensional case. This poorer mixing means that longer runs may be necessary to obtain accurate estimates of p(k | x^n). The method of Gelman and Rubin (1992) applied to two runs of length 20,000 starting from k = 1 and k = 30 diagnosed convergence within 10,000 iterations for the Fixed-κ prior with λ = 1, 3.

Estimates of the posterior distribution for k (Figure 8c) show that it depends heavily on the prior used, while estimates of the predictive density (Figure 8d) are less sensitive to changes in the prior. Where more than two components are fitted to the data the extra components appear to be modeling deviations from normality in the two obvious groups, rather than interpretable extra groups.

4.3. Example 3: Iris Virginica data. We now briefly consider the famous Iris data, collected by Anderson (1935), which consists of four measurements (petal and sepal length and width) for 50 specimens of each of three species (setosa, versicolor and virginica) of iris. Wilson (1982) suggests that the virginica and versicolor species may each be split into subspecies, though analysis by McLachlan (1992) using maximum likelihood methods suggests that this is not justified by the data. We investigated this question for the virginica species by fitting a mixture of an unknown number of bivariate normal distributions to the 50 observations of sepal length and petal length for this species, which are shown in Figure 9.

Our analysis was performed with λ = 1, 3 and with both Fixed-κ and Variable-κ priors. We applied Algorithm 3.2 to obtain a sample of size 20,000 from a random starting point, and discarded the first 10,000 observations as burn-in. The mixing behavior of the chain over k was reasonable, with the percentages of sample points for which k changed being 6% (λ = 1) and 21% (λ = 3) for the Fixed-κ prior, and 5% (λ = 1) and 36% (λ = 3) for the Variable-κ prior. The mode of the resulting estimates for the posterior distribution of k is at k = 1 for at least three of the four priors used (Figure 10a) and the results seem to support the conclusion of McLachlan (1992) that the data does not support a division into subspecies (though we note that in our analysis we used only two of the four measurements available for each specimen). The full predictive density estimates in Figure 10b indicate that where more than one component is fitted to the data they are again being used to model lack of normality in the data, rather than interpretable groups in the data.

[Figure 8 appears here, in four columns and four rows of panels: (a) sampled values of k against iteration; (b) autocorrelations of the sampled values of k (lag 0-100); (c) estimates (6) of Pr(k = i); (d) predictive density estimates (7), dark shading corresponding to regions of high density, all shaded on the same scale.]

FIG. 8. Results for using Algorithm 3.2 to fit a mixture of normal distributions to the Old Faithful data. The columns show results for Left: Fixed-κ prior, λ = 1; Left-middle: Variable-κ prior, λ = 1; Right-middle: Fixed-κ prior, λ = 3; Right: Variable-κ prior, λ = 3. The posterior distribution of k can be seen to depend on both the prior distribution for k (value of λ), and the prior distribution for (μ, Σ) (Variable-κ or Fixed-κ). The density estimates appear to be less sensitive to choice of prior.

[Figure 9 appears here: scatter plot, sepal length (4-9) on the x axis.]

FIG. 9. Scatter plot of petal length against sepal length for the Iris Virginica data.

[Figure 10 appears here, in four columns and two rows of panels: (a) estimates (6) of Pr(k = i); (b) predictive density estimates (7), dark shading corresponding to regions of high density, all shaded on the same scale.]

FIG. 10. Results for using Algorithm 3.2 to fit a mixture of normal distributions to the Iris Virginica data. The columns show results for Left: Fixed-κ prior, λ = 1; Left-middle: Variable-κ prior, λ = 1; Right-middle: Fixed-κ prior, λ = 3; Right: Variable-κ prior, λ = 3. The mode of the estimates of Pr(k = i) is k = 1 for at least three of the four priors used, and seems to indicate that the data does not support splitting the species into sub-species.

5. Discussion.

5.1. Density estimation, inference for k and priors. Our examples demonstrate that a Bayesian approach to density estimation using mixtures of (univariate or bivariate) normal distributions with an unknown number of components is computationally feasible, and that the resulting density estimates are reasonably robust to the modeling assumptions and priors used. Extension to higher dimensions is likely to provide computational challenges, but might be possible with suitable constraints on the covariance matrices (requiring them all to be equal or all to be diagonal for example).

Our examples also highlight the fact that while inference for the number of components k in the mixture is also computationally feasible, the posterior distribution for k can be highly dependent on not just the prior chosen for k, but also the prior chosen for the other parameters of the mixture model. Richardson and Green (1997), in their investigation of one-dimensional data, note that when using the Fixed-κ prior, the value chosen for κ in the prior N(ξ, κ^{-1}) for the means μ_1, ..., μ_k has a subtle effect on the posterior distribution of k. A very large value of κ, representing a strong belief that the means lie at ξ (chosen to be the midpoint of the range of the data) will favor models with a small number of components and larger variances. Decreasing κ to represent vaguer prior knowledge about the means will initially encourage the fitting of more components with means spread across the range of the data. However, continuing to decrease κ, to represent vaguer and vaguer knowledge on the location of the means, eventually favors fitting fewer components. In the limit, as κ → 0, the posterior distribution of k becomes independent of the data, and depends only on the number of observations, heavily favoring a one component model for reasonable numbers of observations [Stephens (1997), Jennison (1997)]. Priors which appear to be only "weakly" informative for the parameters of the mixture components may thus be highly informative for the number of components in the mixture. Since very large and very small values of κ in the Fixed-κ prior both lead to priors which are highly informative for k, it might be interesting to search for a value of κ (probably depending on the observed data) which leads to a Fixed-κ prior which is "minimally informative" for k in some well-defined way.

Where the main aim of the analysis is to define groups for discrimination (as in taxonomic applications such as the Iris data, e.g.) it seems natural that the priors should reflect our belief that this is a reasonable aim, and thus avoid fitting several similar components where one will suffice. This idea is certainly not captured by the priors we used here, which Richardson and Green (1997) suggest are more appropriate for "exploring heterogeneity." Inhibition priors from spatial point processes [as used by, e.g., Baddeley and van Lieshout (1993)] provide one way of expressing a prior belief that the components present will be somewhat distinct. Alternatively we might try distinguishing between the number of components in the model, and the number of "groups" in the data, by allowing each group to be modeled by several "similar" components. For example, group means might be a priori distributed on

the scale of the data, and each group might consist of an unknown number of normal components, with means distributed around the group mean on a smaller scale than the data. The discussion following Richardson and Green (1997) provides a number of other avenues for further investigation of suitable priors, and we hope that the computational tools described in this paper will help make such further investigation possible.

    5.2. Choice of birth distribution. The choice of birth distribution we made in Algorithm 3.1 is rather naive, and indeed we were rather surprised that we were able to make much progress with this approach. Its success in the Fixed-K model appears to stem from the fact that the (data-dependent) independent priors on the parameters are not so vague as to never produce a reasonable birth event, and yet not so tight as to always propose components which are very similar to those already present. In the Variable-K model the success of the naive algorithm seems to be due to the way in which the hyperparameters "adapt" the birth distribution to make the birth of better components more likely. Here we may have been lucky, since the priors were not chosen with these properties in mind. In general then it may be necessary to spend more effort designing sensible birth-death schemes to achieve adequate mixing. Our results suggest that a strategy of allowing the birth distribution b(y; (π, φ)) to be independent of y, but depend on the data, may result in a simple algorithm with reasonable mixing properties. An ad hoc approach to improving mixing might involve simply investigating mixing behavior for more or less "vague" choices of b. A more principled approach would be to choose a birth distribution which can be both easily calculated and simulated from directly, and which roughly approximates the (marginal) posterior distribution of a randomly chosen component. Such an approximation might be obtained from a preliminary analysis with a naive birth mechanism, or perhaps standard fixed-dimension MCMC with large k.

    A more sophisticated approach might allow the birth distribution b(y; (π, φ)) to depend on y. Indeed, the opposite extreme to our naive approach would be to allow all points to die at a constant rate, and find the corresponding birth distribution using (15) [as in, e.g., Ripley (1977)]. However, much effort may then be required to calculate the birth rate β(·) (perhaps by Monte Carlo integration), which limits the appeal of this approach. [This problem did not arise in Ripley (1977) where simulations were performed conditional on a fixed value of k by alternating births and deaths.] For this reason we believe that it is easier to concentrate on designing efficient birth distributions which can be simulated from directly and whose densities can be calculated explicitly so that the death rates (15) are easily computed.
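    As a purely illustrative sketch of such a data-dependent (but y-independent) birth distribution for univariate normal components: the new mean is drawn from a kernel-style distribution centred at a randomly chosen observation, the new variance from an inverse-gamma distribution, and the new weight from a Beta(1, k) distribution; the corresponding log density, which is what the death-rate computation needs, is also shown. The particular distributions, the bandwidth and the hyperparameters a0, b0 are assumptions made for illustration only and are not the b used in this paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sample_birth(data, k, bandwidth, a0=2.0, b0=1.0):
    """Draw a new component (pi, mu, sigma2) from a data-dependent birth
    distribution: mu from an equally weighted normal kernel placed on each
    observation, sigma2 from an inverse-gamma, and pi from Beta(1, k)."""
    mu = rng.choice(data) + bandwidth * rng.standard_normal()
    sigma2 = 1.0 / rng.gamma(a0, 1.0 / b0)      # inverse-gamma(a0, b0) draw
    pi = rng.beta(1.0, k)
    return pi, mu, sigma2

def log_birth_density(data, k, bandwidth, pi, mu, sigma2, a0=2.0, b0=1.0):
    """log b(. ; (pi, mu, sigma2)) for the same birth distribution;
    needed explicitly when computing the death rates."""
    log_mu = np.log(np.mean(stats.norm.pdf(mu, loc=data, scale=bandwidth)))
    log_sigma2 = stats.invgamma.logpdf(sigma2, a0, scale=b0)
    log_pi = stats.beta.logpdf(pi, 1.0, k)
    return log_mu + log_sigma2 + log_pi
```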

    5.3. Extension to other contexts. It appears from our results that, for finite mixture problems, our birth-death algorithm provides an attractive alternative to the algorithm used by Richardson and Green (1997). There seems to be considerable potential for applying similar birth-death schemes in other contexts as an alternative to more general reversible jump methods. We now attempt to give some insight into for which problems such an approach is likely to be feasible. We begin our discussion by highlighting the main differences between our Algorithm 3.1 and the algorithm used by Richardson and Green (1997).

    A. Our algorithm operates in continuous time, replacing the accept-reject scheme by allowing events to occur at differing rates.
    B. Our dimension-changing birth and death moves do not make use of the missing data z_1, ..., z_n, effectively integrating out over them when calculating the likelihood.
    C. Our birth and death moves take advantage of the natural nested structure of the models, removing the need for the calculation of a complicated Jacobian, and making implementation more straightforward.
    D. Our birth and death moves treat the parameters as a point process, and do not make use of any constraint such as μ_1 < ⋯ < μ_k [used by Richardson and Green (1997) in defining their split and combine moves].

    We consider A to be the least important distinction. Indeed, a discrete time version of our birth-death process using an accept-reject step could be designed along the lines of Geyer and Møller (1994), or using the general reversible jump formulation of Green (1995). (Similarly one can envision a continuous time version of the general reversible jump formulation.) We have no good intuition for whether discrete time or continuous time versions are likely to be more efficient in general, although Geyer and Møller (1994) suggests that it is easier to obtain analytical results relating to mixing for the discrete time version.

    Point B raises an important requirement for application of our algorithm: we must be able to calculate the likelihood for any given parameters. This requirement makes the method difficult to apply to Hidden Markov Models, or other missing data problems where calculation of the likelihood requires knowledge of the missing data. One solution to this problem would be to introduce the missing data into the MCMC scheme, and perform births and deaths while keeping the missing data fixed [along the lines of the births and deaths of "empty" components in Richardson and Green (1997)]. However, where the missing data is highly informative for k this seems likely to lead to poor mixing, and reversible jump methods which propose joint updates to the missing data and the dimension of the model appear more sensible here.
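    Point B only requires that the observed-data likelihood can be evaluated with the allocations integrated out analytically; for a univariate normal mixture this is a single sum over components for each observation, as in the following minimal sketch (the parameter layout and names here are ours, chosen for illustration).

```python
import numpy as np
from scipy import stats

def log_likelihood(x, weights, means, variances):
    """Observed-data log-likelihood of a univariate normal mixture, with the
    allocations integrated out: L = prod_i sum_j w_j N(x_i; mu_j, sigma_j^2)."""
    x = np.asarray(x, dtype=float)[:, None]                       # n x 1
    dens = stats.norm.pdf(x, loc=np.asarray(means),
                          scale=np.sqrt(np.asarray(variances)))   # n x k
    return float(np.sum(np.log(dens @ np.asarray(weights))))
```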

    In order to take advantage of the simplicity of the birth-death methodology, we must be able to view the parameters of our model as a point process, and in particular we must be able to express our prior in terms of a Radon-Nikodym derivative, r(·), with respect to a symmetric measure, as in Section 3. This is not a particularly restrictive requirement, and we give two concrete examples below. These examples are in many ways simpler than the mixture problem since there are no mixture proportions, and the marked point process becomes a point process on a space Φ. The analogue of Theorem 3.1 for this simpler case [which essentially follows directly from Preston (1976) and Ripley (1977)] may be obtained by replacing condition (15) with

    (37)    (k + 1) d(y; φ) r(y ∪ φ) L(y ∪ φ) = β(y) b(y; φ) r(y) L(y).

    Provided we can calculate the likelihood L(y), the viability of the birth-death methodology will depend on being able to find a birth distribution which gives adequate mixing. The comments in Section 5.2 provide some guidance here. It is clear that in some applications the use of birth and death moves alone will make it difficult to achieve adequate mixing. However, the ease with which different birth distributions may be tried, and the success of our algorithm in the mixture context with minimal effort in designing efficient birth distributions, suggests that this type of algorithm is worth trying before more complex reversible jump proposal distributions are implemented.
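    For concreteness, the sketch below shows how condition (37) translates into a simulation step in this simpler setting. Given user-supplied functions for the log likelihood, the log prior density log r, and a birth density b that can be both simulated from and evaluated, the death rate of each point follows by solving (37), and the waiting time and the type of the next event are then simulated as for any Markov jump process. The constant birth rate and the function names are illustrative assumptions; this is a sketch of the general scheme, not Algorithm 3.1 itself.

```python
import math
import random

def death_rates(y, beta, log_lik, log_r, log_b):
    """Death rate of each point phi in the current configuration y.
    Writing y' = y \\ {phi} (so y' has k points and y has k + 1),
    condition (37) gives
        d(y'; phi) = beta * b(y'; phi) * r(y') L(y') / ((k + 1) r(y) L(y))."""
    rates = []
    for j, phi in enumerate(y):
        rest = y[:j] + y[j + 1:]
        log_rate = (math.log(beta) + log_b(rest, phi)
                    + log_r(rest) + log_lik(rest)
                    - math.log(len(y)) - log_r(y) - log_lik(y))
        rates.append(math.exp(log_rate))
    return rates

def birth_death_event(y, beta, log_lik, log_r, log_b, sample_b):
    """Simulate one continuous-time event: wait an exponential time with rate
    beta + sum(death rates), then either add a point drawn from b (a birth)
    or remove a point chosen with probability proportional to its death rate."""
    d = death_rates(y, beta, log_lik, log_r, log_b)
    total = beta + sum(d)
    waiting_time = random.expovariate(total)
    if random.random() < beta / total:
        y = y + [sample_b(y)]                                  # birth
    else:
        j = random.choices(range(len(y)), weights=d)[0]        # death
        y = y[:j] + y[j + 1:]
    return y, waiting_time
```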

    Example 1: Change point analysis. Consider the change-point problem from Green (1995). The parameters of this model are the number of change points k, the positions 0 < s_{(1)} < ⋯ < s_{(k)} < L


    68 M. STEPHENSfor he new h to dependon the new s, again being entered n regionswhichappear to be goodcandidates ased on the data.Now suppose that [as in Green (1995)] S(l), ... ., S(k) are, given k,a prioridis-tributed s the even-numberedrder tatistics f2k+ 1points ndependentlyand uniformlyistributedn [0,L]:

    (40Sp(s() ' * X(k)) - L2k+ 1) (S(1) - 0)(S(2) - s(1))... (S(k) - S(k1))(L - S(k))I(O < S()
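    A direct transcription of (40) may help fix ideas; the sketch below evaluates the log prior density of a vector of change-point positions (the function name and interface are ours, and the positions are sorted internally for convenience).

```python
import numpy as np
from math import lgamma, log

def log_prior_positions(s, L):
    """Log of the prior density (40): the positions are distributed as the
    even-numbered order statistics of 2k + 1 points uniform on [0, L]."""
    s = np.sort(np.asarray(s, dtype=float))
    k = len(s)
    if k == 0:
        return 0.0
    if s[0] <= 0.0 or s[-1] >= L:
        return -np.inf
    gaps = np.diff(np.concatenate(([0.0], s, [L])))   # s1 - 0, s2 - s1, ..., L - sk
    return lgamma(2 * k + 2) - (2 * k + 1) * log(L) + float(np.sum(np.log(gaps)))
```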


    p(β_i), again independent for all i. Then we have

    (43)    r(k, (i_1, β_{i_1}), ..., (i_k, β_{i_k})) =
                { 0,                                          if i_a = i_b for some a ≠ b,
                { p_{i_1} p(β_{i_1}) ⋯ p_{i_k} p(β_{i_k}),    otherwise.

    The choice of birth distribution b(y; (i, β_i)) must in this case depend on y, in order to avoid adding variables which are already present. A naive suggestion would be to set

    (44)    b(y; (i, β_i)) = b_i p(β_i)

    with b_i ∝ p_i for the variables i not already present in y. Again, more efficient schemes could be devised by letting the births be data-dependent, possibly through examining the marginal posterior distributions of the β_i in preliminary analyses.
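    To illustrate, a naive birth move consistent with (44) can be simulated as follows: a variable i not already in the model is chosen with probability proportional to p_i, and its coefficient is then drawn from the prior p(β_i) (taken to be standard normal here purely for illustration); the corresponding log birth density, which a death-rate calculation would need, is also shown. This is a hedged sketch of the suggestion above, not code accompanying the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def sample_variable_birth(current_vars, p):
    """Naive birth move of (44): choose an absent variable i with probability
    b_i proportional to p_i, then draw beta_i from its prior (standard normal
    here as a stand-in for p(beta_i))."""
    p = np.asarray(p, dtype=float)
    absent = np.array([i for i in range(len(p)) if i not in current_vars])
    probs = p[absent] / p[absent].sum()
    i = int(rng.choice(absent, p=probs))
    beta_i = float(rng.standard_normal())
    return i, beta_i

def log_variable_birth_density(current_vars, p, i, beta_i):
    """log b(y; (i, beta_i)) for the same birth distribution."""
    p = np.asarray(p, dtype=float)
    absent = np.array([j for j in range(len(p)) if j not in current_vars])
    return float(np.log(p[i] / p[absent].sum()) + stats.norm.logpdf(beta_i))
```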

    APPENDIX: PROOF OF THEOREM 3.1

    PROOF. Our proof draws heavily on the theory derived by Preston (1976) for general Markov birth-death processes on state space Ω = ∪_k Ω_k where the Ω_k are disjoint. The process evolves by jumps, of which only a finite number can occur in a finite time. The jumps are of two types: "births," which are jumps from a point in Ω_k to Ω_{k+1}, and "deaths," which are jumps from a point in Ω_k to a point in Ω_{k−1}. When the process is at y ∈ Ω_k the behavior of the process is defined by the birth rate β(y), the death rate δ(y), and the birth and death transition kernels K_β^{(k)}(y; ·) and K_δ^{(k)}(y; ·), which are probability measures on Ω_{k+1} and Ω_{k−1} respectively. Births and deaths occur as independent Poisson processes, with rates β(y) and δ(y) respectively. If a birth occurs then the process jumps to a point in Ω_{k+1}, with the probability that this point is in any particular set F ⊆ Ω_{k+1} being given by K_β^{(k)}(y; F). If a death occurs then the process jumps to a point in Ω_{k−1}, with the probability that this point is in any particular set G ⊆ Ω_{k−1} being given by K_δ^{(k)}(y; G). Preston (1976) showed that for such a process to possess stationary distribution μ it is sufficient that the following detailed balance conditions hold:

    DEFINITION 1 (Detailed balance conditions). μ is said to satisfy detailed balance conditions if

    (45)    ∫_F β(y) dμ_k(y) = ∫_{Ω_{k+1}} δ(z) K_δ^{(k+1)}(z; F) dμ_{k+1}(z)    for k > 0, F ⊆ Ω_k

    and

    (46)    ∫_G δ(z) dμ_{k+1}(z) = ∫_{Ω_k} β(y) K_β^{(k)}(y; G) dμ_k(y)    for k > 0, G ⊆ Ω_{k+1}.

    These have the intuitive meaning that the rate at which the process leaves any set through the occurrence of a birth is exactly matched by the rate at which the process enters that set through the occurrence of a death, and vice-versa. □

    We therefore check that the posterior distribution p(k, π, φ | x_1, ..., x_n) satisfies the detailed balance conditions for our process, which corresponds to the general Markov birth-death process with birth rate β(y), death rate δ(y), and birth and death transition kernels K_β^{(k)}(y; ·) and K_δ^{(k)}(y; ·) which satisfy

    (47)    K_β^{(k)}(y; F) = ∫_{(π,φ): y ∪ (π,φ) ∈ F} b(y; (π, φ)) dπ ν(dφ)

    and

    (48)    δ(y) K_δ^{(k)}(y; F) = Σ_{(π,φ) ∈ y : y\(π,φ) ∈ F} d(y\(π, φ); (π, φ)).

    We begin by introducing some notation. Let A_k represent the parameter space for the k-component model, with the labeling of the parameters taken into account, and let Ω_k be the corresponding space obtained by ignoring the labeling of the components. If (π, φ) ∈ A_k, then we will write [π, φ] for the corresponding member of Ω_k. With A = ∪_{k≥1} A_k, let P(·) and P̃(·) be the prior and posterior probability measures on A, and let P_k(·) and P̃_k(·) denote their respective restrictions to A_k. The prior distribution has Radon-Nikodym derivative r(k, π, φ) with respect to the product of (k − 1)-dimensional Lebesgue measure and ν^k. Thus for (π, φ) ∈ A_k we have

    (49)    dP_k{(π, φ)} = r(k, π, φ) (k − 1)! dπ_1 ⋯ dπ_{k−1} ν(dφ_1) ⋯ ν(dφ_k).

    Also, by Bayes theorem we have

            dP̃{(π, φ)} ∝ L([π, φ]) dP{(π, φ)}

    and so we will write

            dP̃{(π, φ)} = f([π, φ]) dP{(π, φ)}

    for some f([π, φ]) ∝ L([π, φ]). Now let μ(·) and μ̃(·) be the probability measures induced on Ω by P(·) and P̃(·) respectively, and let μ_k(·) and μ̃_k(·) denote their respective restrictions to Ω_k. Then for any function g: Ω_k → ℝ we have

    (50)    ∫_{Ω_k} g(y) dμ_k(y) = ∫_{A_k} g([π, φ]) dP_k{(π, φ)}

    and

            ∫_{Ω_k} g(y) dμ̃_k(y) = ∫_{A_k} g([π, φ]) dP̃_k{(π, φ)}
    (51)                         = ∫_{A_k} g([π, φ]) f([π, φ]) dP_k{(π, φ)}
                                 = ∫_{Ω_k} g(y) f(y) dμ_k(y).


    We define births in A by

    (52)    (π, φ) ∪ (π, φ) := ((π_1(1 − π), φ_1), ..., (π_k(1 − π), φ_k), (π, φ))

    and will require the following lemma (which is essentially a simple change of variable formula).

    LEMMA 5.1. If (π, φ) ∈ A_k and (π, φ) ∈ [0, 1] × Φ then

            r(k, π, φ) dP_{k+1}{(π, φ) ∪ (π, φ)} = r(k + 1, (π, φ) ∪ (π, φ)) k(1 − π)^{k−1} dπ ν(dφ) dP_k{(π, φ)}.

    Suppose for the moment that r(y)L(y) > 0 for all y. Let I(·) denote the generic indicator function, so I(x ∈ A) = 1 if x ∈ A and 0 otherwise. We check the first part of the detailed balance conditions (45) as follows:

    LHS = ∫_F β(y) dμ̃_k(y)

        = ∫_{Ω_k} I(y ∈ F) β(y) f(y) dμ_k(y)    [equation (51)]

        = ∫_{Ω_k} I(y ∈ F) β(y) f(y) ∫ b(y; (π, φ)) dπ ν(dφ) dμ_k(y)    [b must integrate to 1]

    RHS = ∫_{Ω_{k+1}} δ(z) K_δ^{(k+1)}(z; F) dμ̃_{k+1}(z)

        = ∫_{Ω_{k+1}} δ(z) K_δ^{(k+1)}(z; F) f(z) dμ_{k+1}(z)    [equation (51)]

        = ∫_{Ω_{k+1}} Σ_{(π,φ) ∈ z : z\(π,φ) ∈ F} d(z\(π, φ); (π, φ)) f(z) dμ_{k+1}(z)    [equation (48)]


        = ∫_{A_{k+1}} Σ_{i=1}^{k+1} I([π, φ]\(π_i, φ_i) ∈ F) d([π, φ]\(π_i, φ_i); (π_i, φ_i)) f([π, φ]) dP_{k+1}{(π, φ)}    [equation (50)]

        = ∫_{A_{k+1}} (k + 1) I([π, φ]\(π_{k+1}, φ_{k+1}) ∈ F) d([π, φ]\(π_{k+1}, φ_{k+1}); (π_{k+1}, φ_{k+1})) f([π, φ]) dP_{k+1}{(π, φ)}    [by symmetry of P_{k+1}(·)]

        = ∫_{A_{k+1}} (k + 1) I([π', φ'] ∈ F) d([π', φ']; (π, φ)) f([π', φ'] ∪ (π, φ)) dP_{k+1}{(π', φ') ∪ (π, φ)}    [(π', φ') ∪ (π, φ) = (π, φ)]

        = ∫_{A_k} ∫_{[0,1]×Φ} I([π', φ'] ∈ F) (k + 1) d([π', φ']; (π, φ)) f([π', φ'] ∪ (π, φ)) (r(k + 1, (π', φ') ∪ (π, φ)) / r(k, π', φ')) k(1 − π)^{k−1} dπ ν(dφ) dP_k{(π', φ')}    [Lemma 5.1]

        = ∫_{Ω_k} ∫_{[0,1]×Φ} I(y ∈ F) (k + 1) d(y; (π, φ)) f(y ∪ (π, φ)) (r(y ∪ (π, φ)) / r(y)) k(1 − π)^{k−1} dπ ν(dφ) dμ_k(y)    [equation (50)]

    and so LHS = RHS provided

            (k + 1) d(y; (π, φ)) f(y ∪ (π, φ)) (r(y ∪ (π, φ)) / r(y)) k(1 − π)^{k−1} = β(y) b(y; (π, φ)) f(y),

    which is equivalent to the conditions (15) stated in the Theorem as f(y) ∝ L(y). The remaining detailed balance conditions (46) can be shown to hold in a similar way. The condition that r(y)L(y) > 0 for all y can now be relaxed by applying the conditions (13) and (14), and restricting the spaces A_k and Ω_k to {y : r(y)L(y) > 0}. □

    Acknowledgments. I would like to thank my D.Phil. supervisor, Professor Brian Ripley, for suggesting this approach to the problem, and for valuable comments on earlier versions. I would also like to thank Mark Mathieson for helpful discussions on this work, and Peter Donnelly, Peter Green, two anonymous reviewers, an Associate Editor and the Editor for helpful advice on improving the manuscript.


    REFERENCES

    ANDERSON, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society 59 2-5.
    BADDELEY, A. J. and VAN LIESHOUT, M. N. M. (1993). Stochastic geometry models in high-level vision. In Advances in Applied Statistics (K. V. Mardia and G. K. Kanji, eds.) 1 231-256. Carfax, Abingdon, UK.
    BROOKS, S. P. and ROBERTS, G. O. (1998). Convergence assessment techniques for Markov chain Monte Carlo. Statist. Comput. 8 319-335.
    CARLIN, B. P. and CHIB, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 57 473-484.
    CHIB, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90 1313-1321.
    COWLES, M. K. and CARLIN, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Amer. Statist. Assoc. 91 883-904.
    CRAWFORD, S. L. (1994). An application of the Laplace method to finite mixture distributions. J. Amer. Statist. Assoc. 89 259-267.
    DAWID, A. P. (1997). Contribution to the discussion of paper by Richardson and Green (1997). J. Roy. Statist. Soc. Ser. B 59 772-773.
    DICICCIO, T., KASS, R., RAFTERY, A. and WASSERMAN, L. (1997). Computing Bayes factors by posterior simulation and asymptotic approximations. J. Amer. Statist. Assoc. 92 903-915.
    DIEBOLT, J. and ROBERT, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. J. Roy. Statist. Soc. Ser. B 56 363-375.
    ESCOBAR, M. D. and WEST, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 577-588.
    GELMAN, A. and RUBIN, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statist. Sci. 7 457-511.
    GELMAN, A. G., CARLIN, J. B., STERN, H. S. and RUBIN, D. B. (1995). Bayesian Data Analysis. Chapman & Hall, London.
    GEORGE, E. I. and MCCULLOCH, R. E. (1996). Stochastic search variable selection. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) Chapman & Hall, London.
    GEYER, C. J. and MØLLER, J. (1994). Simulation procedures and likelihood inference for spatial point processes. Scand. J. Statist. 21 359-373.
    GILKS, W. R., RICHARDSON, S. and SPIEGELHALTER, D. J., eds. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
    GLÖTZL, E. (1981). Time-reversible and Gibbsian point processes. I. Markovian spatial birth and death process on a general phase space. Math. Nachr. 102 217-222.
    GREEN, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711-732.
    HÄRDLE, W. (1991). Smoothing Techniques with Implementation in S. Springer, New York.
    JENNISON, C. (1997). Comment on "On Bayesian analysis of mixtures with an unknown number of components," by S. Richardson and P. J. Green. J. Roy. Statist. Soc. Ser. B 59 778-779.
    KELLY, F. P. and RIPLEY, B. D. (1976). A note on Strauss's model for clustering. Biometrika 63 357-360.
    LAWSON, A. B. (1996). Markov chain Monte Carlo methods for spatial cluster processes. In Computer Science and Statistics: Proceedings of the 27th Symposium on the Interface 314-319.
    MATHIESON, M. J. (1997). Ordinal models and predictive methods in pattern recognition. Ph.D. thesis, Univ. Oxford.
    MCLACHLAN, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
    MCLACHLAN, G. J. and BASFORD, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Dekker, New York.
    PHILLIPS, D. B. and SMITH, A. F. M. (1996). Bayesian model comparison via jump diffusions. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 215-239. Chapman & Hall, London.
    POSTMAN, M., HUCHRA, J. P. and GELLER, M. J. (1986). Probes of large-scale structure in the Corona Borealis region. The Astronomical Journal 92 1238-1247.
    PRESTON, C. J. (1976). Spatial birth-and-death processes. Bull. Inst. Internat. Statist. 46 371-391.
    PRIEBE, C. E. (1994). Adaptive mixtures. J. Amer. Statist. Assoc. 89 796-806.
    RICHARDSON, S. and GREEN, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. Ser. B 59 731-792.
    RIPLEY, B. D. (1977). Modelling spatial patterns (with discussion). J. Roy. Statist. Soc. Ser. B 39 172-212.
    RIPLEY, B. D. (1987). Stochastic Simulation. Wiley, New York.
    ROBERT, C. P. (1994). The Bayesian Choice: A Decision-Theoretic Motivation. Springer, New York.
    ROBERT, C. P. (1996). Mixtures of distributions: inference and estimation. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) Chapman & Hall, London.
    ROEDER, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. J. Amer. Statist. Assoc. 85 617-624.
    SHEATHER, S. J. and JONES, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53 683-690.
    STEPHENS, D. A. and FISCH, R. D. (1998). Bayesian analysis of quantitative trait locus data using reversible jump Markov chain Monte Carlo. Biometrics 54 1334-1367.
    STEPHENS, M. (1997). Bayesian Methods for Mixtures of Normal Distributions. Ph.D. thesis, Univ. Oxford. Available from www.stats.ox.ac.uk/~stephens.
    STOYAN, D., KENDALL, W. S. and MECKE, J. (1987). Stochastic Geometry and Its Applications, 1st ed. Wiley, New York.

    TIERNEY, L. (1996). Introduction to general state-space Markov chain theory. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 59-74. Chapman & Hall, London.
    TITTERINGTON, D. M., SMITH, A. F. M. and MAKOV, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
    VENABLES, W. N. and RIPLEY, B. D. (1994). Modern Applied Statistics with S-Plus. Springer, New York.
    VENABLES, W. N. and RIPLEY, B. D. (1997). Modern Applied Statistics with S-Plus, 2nd ed. Springer, New York.
    WEST, M. (1993). Approximating posterior distributions by mixtures. J. Roy. Statist. Soc. Ser. B 55 409-422.
    WILSON, S. R. (1982). Sound and exploratory data analysis. In COMPSTAT 1982, Proceedings in Computational Statistics (H. Caussinus, P. Ettinger and R. Tamassone, eds.) 447-450. Physica, Vienna.

    DEPARTMENT OF STATISTICS
    1 SOUTH PARKS ROAD
    OXFORD OX1 3TG
    UNITED KINGDOM
    E-MAIL: [email protected]