



Classical Model Selection via Simulated Annealing

S.P. Brooks†

University of Cambridge, UK

N. Friel

University of Glasgow, UK

R. King

University of Cambridge, UK

Summary. The classical approach to statistical analysis is usually based upon finding values for model parameters that maximise the likelihood function. Model choice in this context is often also based upon the likelihood function, but with the addition of a penalty term for the number of parameters. Though models may be compared pairwise using likelihood ratio tests, for example, various criteria such as the AIC have been proposed as alternatives when multiple models need to be compared. In practical terms, the classical approach to model selection usually involves maximising the likelihood function associated with each competing model and then calculating the corresponding criterion value(s). However, when large numbers of models are possible, this quickly becomes infeasible unless a method that simultaneously maximises over both parameter and model space is available.

In this paper we propose an extension to the traditional simulated annealing algorithm that allows for moves that not only change parameter values but also move between competing models. This trans-dimensional simulated annealing algorithm can therefore be used to locate models and parameters that minimise criteria such as the AIC within a single algorithm, removing the need for large numbers of simulations to be run.

We discuss the implementation of the trans-dimensional simulated annealing algorithm and use simulation studies to examine its performance in realistically complex modelling situations. We illustrate our ideas with a pedagogic example based upon the analysis of an autoregressive time series and two more detailed examples: one on variable selection for logistic regression and the other on model selection for the analysis of integrated recapture/recovery data.

Keywords: Autoregressive time series; Capture-recapture; Classical statistics; Information criteria; Logistic regression; Markov chain Monte Carlo; Optimisation; Reversible jump MCMC; Variable selection.

1. Introduction

The classical approach to statistical analysis is usually based upon finding values for model parameters that maximise the likelihood function. Various techniques have been proposed that may be used when these maxima cannot be obtained analytically. See Nelder and Mead (1965), Ingber (1992), Dempster et al (1977), Glover and Laguna (1997), Brent (1973), Fletcher (1987) and Smyth (1996), for example. The earliest techniques tended to suffer from the problem that they could not be guaranteed to locate the global (as opposed to a local) maximum of the objective function (typically the likelihood). More recent techniques tend to be more robust, but often with an associated increase in computational expense.

One such method is that of simulated annealing (SA; Kirkpatrick et al 1983; Brooks and Morgan 1995). The term simulated annealing derives from the physical process of heating and then slowly cooling a crystalline substance, and the observation that if the structure is cooled slowly enough, the molecules will line up in a rigid pattern corresponding to a state of minimum energy. The simulated annealing algorithm mimics this process by producing a sequence of draws from a series of statistical distributions that move toward a point mass at the minimum of a chosen objective function as the "temperature" is lowered. The method has enjoyed wide success and has been applied across a wide spectrum of practical problems. See Corana et al (1987), Simkin and Trowbridge (1992), Vanderbilt and Louie (1984), Brooks et al (1997), Draper and Fouskakis (2001) and Drago et al (1992), for example.

† Address for correspondence: Steve Brooks, Statistical Laboratory, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB, UK. E-mail: [email protected]

In this paper we focus upon the problem of model selection, rather than the somewhat simpler problem of parameter estimation under a fixed model. Typically, model selection is also based upon the likelihood function, but with the addition of a penalty term for the number of parameters. Though models may be compared pairwise using likelihood ratio tests, for example, various alternative criteria such as the AIC (Akaike 1974) and/or BIC (Schwarz 1978) have been proposed for when several models need to be compared. In practical terms, the classical approach to model selection usually involves maximising the likelihood function associated with each competing model and then calculating the corresponding criterion value(s). However, when large numbers of models are possible, this quickly becomes infeasible unless a method that simultaneously (and efficiently) maximises over both parameter and model space is available.

Recent advances in the Bayesian literature and, in particular, the advent of reversible jump Markov chain Monte Carlo algorithms (Green 1995) have greatly simplified the Bayesian model determination problem (Gelfand 1996; Gamerman 1995, section 7). These methods are based upon the construction of Markov chains that traverse both parameter and model space so that the best (of perhaps many millions) of models can be identified. In this paper we take ideas from the Bayesian Markov chain Monte Carlo (MCMC) literature and propose an extension to the traditional simulated annealing algorithm that allows for moves that not only change parameter values but also move between competing models. This trans-dimensional simulated annealing (TDSA) algorithm can therefore be used to locate models and parameters that minimise criteria such as the AIC, but within a single algorithm. This removes the need for large numbers of simulations to be run, thereby greatly simplifying (and essentially automating) the classical model selection problem.

We begin, in Section 2, with a brief review of the simulated annealing algorithm introduced by Kirkpatrick et al (1983). We motivate the algorithm as a method for simulating a sequence of Markov chains with different target densities and discuss how recent developments in the MCMC literature can help provide more efficient SA algorithms. We also introduce a pedagogic example concerning the analysis of autoregressive time series models to illustrate some of these ideas. In Section 3, we introduce the general model selection problem and motivate the need for Markov chains that move between states of varying dimension. We motivate the use of reversible jump (RJ) MCMC updates as a means of moving from one model to another within the simulation, returning to the autoregressive example as an illustration. Finally, we introduce two examples. The first, introduced in Section 4, concerns a variable selection problem for modelling credit scoring data using logistic regression; whilst the second concerns the analysis of integrated recapture/recovery data for a population of shags ringed on the Isle of May. Both examples illustrate how our trans-dimensional SA algorithm can be used to explore model space when complete enumeration is impossible, and use simulation studies to investigate the reliability and performance of the method in realistically complex model fitting situations.
2. Fixed-Dimensional Simulated Annealing

The traditional fixed-dimensional simulated annealing algorithm can be described as follows. Given an objective function $f(\theta)$ that we wish to minimise over the $p$-dimensional vector of parameters $\theta \in \Theta$, the corresponding Boltzmann distribution (Kirkpatrick 1984; Geman and Geman 1984) admits a density $b_T(\theta) \propto \exp[-f(\theta)/T]$. Obviously, we assume here that the integral $\int_\Theta b_T(\theta)\,d\theta$ is finite; otherwise some appropriate transformation of $f$ can be made in order to ensure that it is. We begin by simulating a Markov chain with stationary density $b_T(\theta)$ at some initial temperature $T = T_0$, say, and allow the chain to reach equilibrium. We then decrease the temperature and continue the chain until equilibrium is achieved at the new temperature. We continue to repeat this process and, as $T$ decreases, the corresponding stationary density for the chain moves closer and closer to a point mass at the minimum of $f$, and the system essentially "freezes" in this minimal state. See Brooks and Morgan (1995), for example. We obtain the following algorithm.

Step 1. Choose some initial temperature $T_0$ and some initial starting configuration $\theta_0$.

Step 2. Propose a new configuration of the parameter space, $\theta' = g(\theta, u)$, where $u \in U$ is drawn from an essentially arbitrary proposal distribution $q(u)$ and $g: \Theta \times U \to \Theta$ denotes any function such that for all $\theta \in \Theta$ and $u \in U$ there exists a $u' \in U$ such that $g(g(\theta, u), u') = \theta$. This ensures that the reverse move from $\theta'$ to $\theta$ is always possible.

Step 3. Accept the move to $\theta'$ with probability
\[
\alpha(\theta, \theta') = \min\left\{1, \frac{b_T(\theta')\,q(u')}{b_T(\theta)\,q(u)}\right\}, \qquad (1)
\]
where $u'$ satisfies $g(\theta', u') = \theta$. Otherwise, the chain remains in state $\theta$.

Step 4. Repeat Steps 2 and 3 for a specified number of iterations, until the chain is deemed to have reached an equilibrium state.

Step 5. Lower the temperature $T$ according to some predetermined schedule and repeat Steps 2-4, until some stopping criterion is met. The final configuration $\theta$ will approximate the minimum of $f(\theta)$.

Steps 2-4 of the SA algorithm simply describe the simulation of a Markov chain with limiting distribution $b_T(\theta) \propto \exp[-f(\theta)/T]$. Traditionally, the chain is simulated using Metropolis updates (Metropolis et al 1953; Kirkpatrick et al 1983). However, the algorithm allows for more general updating schemes, such as those arising from the recent interest in Markov chain Monte Carlo methods (Gilks et al 1996; Gamerman 1995; Brooks 1998). In particular, we shall make use of Gibbs sampler updates (Geman and Geman 1984; Besag and York 1989; Gelfand and Smith 1990) wherever possible.

In order to implement the SA algorithm it is also necessary to choose both a stopping rule and a cooling schedule. Various stopping criteria have been proposed, though most are based upon the proportion of accepted moves within the last temperature: if this is small (or even zero), the simulation ends and the chain is assumed to have "frozen". For the remainder of this paper, we shall stop the algorithm whenever no moves are accepted within a particular temperature. For the cooling schedule, various authors (see Geman and Geman 1984 and Gidas 1985, for example) have used detailed studies of non-homogeneous Markov chains to suggest that at iteration $t$ of the algorithm, we should take $T \propto 1/\log(1 + t)$. As the algorithm proceeds, the temperature reductions decrease dramatically, and it is impossible in practice to follow these general guidelines. In practice, far faster cooling rates are typically used, and so the point at which the SA algorithm freezes is no longer guaranteed to be the minimum of the objective function, $f$. However, the sample path of the SA algorithm will increasingly concentrate on regions of highest mass under $b_T$ as $T$ tends to zero and, by choosing the observed sample value minimising $f$, we obtain a far more efficient means of exploring parameter space than evaluating the objective function at either randomly or systematically chosen points; see Ripley (1987). An alternative is to use the SA algorithm to identify starting points for much simpler downhill search schemes that offer higher accuracy when started within a neighbourhood of the true solution; see Brooks and Morgan (1994). In this paper we shall adopt a geometric cooling schedule, so that the new temperature is some constant multiple $0 < \alpha < 1$ of the last, and choose, as our solution, the point along the sample path observed to minimise $f$. As we shall see in the examples introduced in the next few sections, this is sufficient to provide us with accurate parameter estimates but, most importantly, a highly accurate and reliable model selection procedure.
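To fix ideas, here is a minimal Python sketch of Steps 1-5, assuming symmetric random-walk proposals (so that the $q$ terms in (1) cancel), the geometric cooling schedule, and the no-acceptance stopping rule described above. The toy objective and all tuning constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulated_annealing(f, theta0, T0=10.0, alpha=0.95, n_iter=1000,
                        prop_sd=1.0, rng=None):
    """Minimise f by SA with geometric cooling and random-walk Metropolis.

    At each temperature T, updates target b_T(theta) ~ exp(-f(theta)/T);
    we stop once no move is accepted within a whole temperature and
    return the best point seen along the sample path.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    f_cur = f(theta)
    best_theta, best_f = theta.copy(), f_cur
    T = T0
    while True:
        accepted = 0
        for _ in range(n_iter):
            theta_prop = theta + rng.normal(scale=prop_sd, size=theta.shape)
            f_prop = f(theta_prop)
            # Metropolis acceptance; proposal terms cancel by symmetry
            if np.log(rng.uniform()) < (f_cur - f_prop) / T:
                theta, f_cur = theta_prop, f_prop
                accepted += 1
                if f_cur < best_f:
                    best_theta, best_f = theta.copy(), f_cur
        if accepted == 0:       # the system has "frozen"
            return best_theta, best_f
        T *= alpha              # geometric cooling: T_new = alpha * T_old

# Example: minimise a simple tilted double-well objective
if __name__ == "__main__":
    f = lambda th: (th[0] ** 2 - 1.0) ** 2 + 0.5 * th[0]
    print(simulated_annealing(f, theta0=[3.0], prop_sd=0.5))
```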
2.1. Example: autoregressive time series

Suppose we are given data $x = (x_1, x_2, \ldots, x_N)$ from an autoregressive (AR) process of order $k$, so that the data are generated by the process
\[
X_t = \sum_{s=1}^{k} a_s X_{t-s} + \varepsilon_t, \qquad t = k, \ldots, N, \qquad (2)
\]

where $\varepsilon_t \sim N(0, 1/\tau)$. Then, the likelihood associated with these data is approximated by
\[
L(x \mid \theta) = \prod_{t=k+1}^{N} \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{\tau}{2}\left[x_t - \sum_{s=1}^{k} a_s x_{t-s}\right]^2\right), \qquad (3)
\]
where $\theta = (a_1, \ldots, a_k, \tau)$ denotes the vector of model parameters. Maximising the likelihood is equivalent to minimising $f(\theta) = -\log L(x \mid \theta)$, so that at temperature $T$ we have target distribution
\[
b_T(\theta) \propto \exp(\log L(x \mid \theta)/T) = L(x \mid \theta)^{1/T}.
\]

Before we discuss the implementation of the SA algorithm for this particular problem, let us first consider the Bayesian approach, in which a prior $p(\theta)$ is placed upon the model parameters in order to obtain a posterior distribution $\pi(\theta \mid x) \propto L(x \mid \theta)\,p(\theta)$. If the prior is simply a constant function, then $\pi(\theta \mid x) \propto b_T(\theta)$ when $T = 1$. Several authors (Troughton and Godsill 1997; Ehlers and Brooks 2002; Marriott et al 1996) discuss MCMC simulation methods for exploring the posterior distribution $\pi$, and these techniques and experiences can be used directly to motivate different approaches to the implementation of the SA algorithm. For example, Marriott et al (1996) use a combination of Gibbs and Metropolis-Hastings updates to simulate their chain, whilst Troughton and Godsill (1997) and Ehlers and Brooks (2002) use Gibbs updates, updating the autoregressive parameters both separately and as a block (Roberts and Sahu 1997), conditioning on the error variance.

Here our SA algorithm at temperature $T$ has target density
\[
b_T(\theta) \propto \left\{\tau^{(N-k)/2} \exp\left(-\frac{\tau}{2} \sum_{t=k+1}^{N}\left[x_t - \sum_{s=1}^{k} a_s x_{t-s}\right]^2\right)\right\}^{1/T}
\]
and it is easy to show that the conditional distribution of the error precision given the autoregressive parameters is given by
\[
b_T(\tau \mid a_1, \ldots, a_k, x) \sim \Gamma\left(1 + \frac{N-k}{2T},\; \frac{\sum_{t=k+1}^{N}\left(x_t - \sum_{s=1}^{k} a_s x_{t-s}\right)^2}{2T}\right). \qquad (4)
\]
Thus, we can use a Gibbs update for the error precision.

Similarly, Ehlers and Brooks (2002) show that if we let $x_k$ denote the vector with elements $(x_{k+1}, \ldots, x_N)$ and $X_k$ denote the $(N-k) \times k$ matrix with $(i,j)$th element $x_{k-i+j}$, then Equation (2) can be rewritten as $x_k = X_k a + e$, with obvious notation. In this case, the full conditional distribution for the autoregressive parameters $a$ given the error precision $\tau$ is given by
\[
b_T(a \mid \tau) = N_k(\mu, \Sigma), \qquad (5)
\]
where
\[
\mu = \Sigma\,\frac{\tau X_k^T x_k}{T}, \qquad \Sigma^{-1} = \frac{\tau X_k^T X_k}{T}.
\]

The SA algorithm therefore proceeds by updating first the error precision and then the autoregressive parameter vector within each iteration, essentially by generating $u_1 \sim b_T(\tau \mid a, x)$ and $u_2 \sim b_T(a \mid u_1, x)$ and setting $(\tau', a') = g(\tau, a, u_1, u_2) = (u_1, u_2)$ in Step 2 of the annealing algorithm. Since we are using Gibbs updates, the corresponding acceptance ratio in Step 3 is one, and we can simply remove this step from the algorithm. One distinct advantage of the use of Gibbs updates here is that they automatically adapt as the target distribution changes with temperature. This is particularly important at low temperatures when the system begins to freeze, since parameter updates necessarily become increasingly difficult. When more general Metropolis-Hastings updates are used, it is important to scale the proposal variance with decreasing temperature; this occurs automatically with the Gibbs updates.
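The two Gibbs updates can be sketched as follows (a minimal Python illustration of the conditionals (4) and (5); the function names and data layout are our own assumptions, and note that numpy parameterises the Gamma distribution by shape and scale, so the rate in (4) must be inverted).

```python
import numpy as np

def ar_design(x, k):
    """Return (x_k, X_k): response vector and lag matrix for the AR(k) model."""
    N = len(x)
    xk = x[k:]                                    # (x_{k+1}, ..., x_N)
    Xk = np.column_stack([x[k - s:N - s] for s in range(1, k + 1)])
    return xk, Xk

def gibbs_sweep(x, k, a, T, rng):
    """One Gibbs update of (tau, a) targeting b_T for the AR(k) model."""
    xk, Xk = ar_design(x, k)
    resid = xk - Xk @ a
    # Error precision, eq. (4): shape 1 + (N-k)/(2T), rate sum(resid^2)/(2T)
    shape = 1.0 + len(xk) / (2.0 * T)
    rate = resid @ resid / (2.0 * T)
    tau = rng.gamma(shape, 1.0 / rate)            # numpy uses scale = 1/rate
    # AR coefficients, eq. (5): N(mu, Sigma) with Sigma^{-1} = tau X'X / T
    Sigma = np.linalg.inv(tau * (Xk.T @ Xk) / T)
    mu = Sigma @ (tau * Xk.T @ xk / T)
    a = rng.multivariate_normal(mu, Sigma)
    return tau, a
```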

It is clear from (4) and (5) that as $T \to 0$, the variances of the conditionals decrease to zero, so that the conditional distributions simply converge to point masses corresponding to the usual MLE equations (Box and Jenkins 1970, p277). Thus, the algorithm is clearly guaranteed to freeze at the MLE in the limit as $T$ tends to zero.

3. Trans-Dimensional Simulated Annealing

Though TDSA need not necessarily be applied in the context of model selection problems, we shall restrict our attention only to problems of this sort here. Suppose that we have a collection of models $M_1, M_2, \ldots$ indexed by the parameter $k \in K$, and that under each model we have an associated parameter vector $\theta_k \in \Theta_k$ of length $n_k$ with corresponding likelihood function $L_k(\theta_k)$. In addition, suppose that we wish to find the minimum of some general function $f(\theta_k, k)$ over values of $(\theta_k, k)$. For example, if we take $f(\theta_k, k) = n_k - \log L_k(\theta_k)$, which is half the usual Akaike Information Criterion (Akaike 1974), then minimising $f$ is equivalent to minimising the AIC. The Boltzmann distribution associated with this objective function can then be defined in the usual way, and we set
\[
b_T(\theta_k, k) \propto \exp[-f(\theta_k, k)/T].
\]
At temperature $T = 1$, the Boltzmann distribution corresponds to the Bayesian posterior with a flat prior over parameter space, but a prior for model $k$ proportional to $\exp(-n_k/T)$ (see also Andrieu et al 2001; and George and Foster 2000). Of course, $b_T$ is no longer a density but a distribution over $K \times \Theta$, where $\Theta = \cup_{k \in K} \Theta_k$. We therefore require a more general updating scheme in order to construct a chain with $b_T$ as its stationary distribution.

3.1. Reversible jump updates

Suppose that we wish to construct a general Markov chain on some state space $C$ with stationary distribution $b_T$, which is currently in some state $\theta$. The generalised Metropolis-Hastings approach involves proposing a new state $\theta'$ for the chain from some proposal measure $q(\theta, d\theta')$, which is then subsequently either accepted or rejected. Green (1995) shows that detailed balance is preserved if we accept the proposed new state with probability
\[
\alpha(\theta, \theta') = \min\left\{1, \frac{b_T(d\theta')\,q(\theta', d\theta)}{b_T(d\theta)\,q(\theta, d\theta')}\right\};
\]
see also Norman and Filinov (1969). This expression clearly generalises the usual Metropolis-Hastings acceptance probability given in (1).

This general formulation can be simplified in the context of model selection problems by restricting attention to certain trans-dimensional jump constructions. Suppose that we wish to perform a trans-dimensional jump moving from $(\theta_i, i)$ to $(\theta_j, j)$, where typically $n_i \neq n_j$ and $i, j \in K$. Suppose, without loss of generality, that $\Theta_i = \mathbb{R}^{n_i}$, and let $\theta = (\theta_i, i)$ denote the state vector for the chain. Then for a given model $M_k$, $\theta \in C_k = \{k\} \times \mathbb{R}^{n_k}$ and, more generally, $\theta \in C = \cup_{k \in K} C_k$. Though there are a wide variety of implementations, perhaps the most straightforward is to follow a similar approach to that described in Section 2. Suppose that we choose a move of type $r \in R$ to take the chain from $\theta$ to $\theta'$. We can do this by generating a random vector $u \in U_{ij}$ from some proposal $q_{r,i,j}(u)$ and choosing as our proposed state $\theta' = (g(\theta_i, u), j)$ for some function $g$.
This proposal would then be accepted with probability $\alpha_r(\theta, \theta') = \min(1, A)$, where
\[
A = \frac{b_T(\theta')\,p_{r'}(\theta')\,q_{r',j,i}(u')}{b_T(\theta)\,p_r(\theta)\,q_{r,i,j}(u)} \left|\frac{\partial(g(\theta_i, u), u')}{\partial(\theta_i, u)}\right|, \qquad (6)
\]
where $p_r(\theta)$ denotes the probability of proposing a move of type $r$ in state $\theta$, move type $r'$ is the opposite move type to move $r$, and $u'$ satisfies $g(g(\theta_i, u), u') = \theta_i$, with $g$ chosen so that $u'$ always exists. See Green (1995) and Brooks et al (2001), for example.

An alternative to the RJMCMC updating scheme described above was proposed by Stephens (2000), who simulates a continuous-time birth-death process in which parameters are added (births) and deleted (deaths) in order to move between models. This can be easily implemented by letting the union of all state spaces (under the different models) be described as a marked Poisson process. This process is run for a fixed length of time within each iteration, so that many model moves can be made within each iteration and the chain may appear to move more rapidly. However, an efficient length of time to run the birth-death process may sometimes be difficult to determine and is essentially confounded with the problem of selecting the cooling schedule for the annealing component. Though future research in this area may provide an efficient implementation for trans-dimensional simulated annealing, we will not discuss it further here, but concentrate solely on the use of reversible jump MCMC updates.

However between-model moves are performed, the target distribution $b_T(\theta_k, k)$ can be explored and the minima identified from observations from the chain. However, as before, the identification of minima from samples of any size is problematic at best, and so we use the annealing idea to "freeze" the chain at a point minimising $f$, as described in Section 2. The only difference is that, in this case, we minimise over both parameters and models, so that, at any temperature $T$, the distribution $b_T(\theta_k, k)$ is explored using a combination of MCMC updates for within-model moves (i.e., fixed $k$) and RJMCMC updates for moves between models, i.e., updating $k$.

The closest existing approach to ours is based upon an observation by George and Foster (2000), who show that model selection via various criteria (e.g. the AIC) is equivalent to finding the maximum a posteriori (MAP) estimate of a "calibrated" posterior distribution. Andrieu et al (2001) develop this result and construct an algorithm combining both RJMCMC and SA to optimise the posterior of interest. However, though the original method of George and Foster can be shown to maximise several standard criteria, it cannot be directly extended to more general function optimisation problems, and so the method of Andrieu et al (2001) is similarly restricted. Further, these methods critically rely upon providing an "uninformative" prior for the model parameters under all models, which often requires substantial analytic work and means that the maximum obtained does not strictly correspond to the MLE. Alternatively, flat priors may be used, but then the approach suffers from the potential problems of posterior impropriety and scale invariance, and is therefore similarly limited in its applicability.

3.2. Example: autoregressive time series revisited

Given data $x_1, \ldots, x_N$, we shall restrict our attention to autoregressive models of order $k = 1, \ldots, k_{\max} < N$ and, in order to retain a fair comparison between models, we use the following approximation to the likelihood in (3):
\[
L_k(\theta_k) = \prod_{t=k_{\max}+1}^{N} \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{\tau}{2}\left[x_t - \sum_{s=1}^{k} a_s x_{t-s}\right]^2\right),
\]
where $\theta_k$ denotes the $(k+1)$-dimensional vector of model parameters $(a_1, \ldots, a_k, \tau)$. Thus, if we use the AIC to distinguish between competing models, then
\[
b_T(\theta_k, k) \propto \exp([\log L_k(\theta_k) - n_k]/T) = L_k(\theta_k)^{1/T} \exp(-n_k/T). \qquad (7)
\]

Suppose that we are currently in state $\theta$ in model $M_k$ and that we propose a move to model $M_{k+1}$. In order to perform this move, we need to select a state $\theta_{k+1}$ in the larger model to which we propose to move.
Within the context of the Bayesian analysis of autoregressive time series via MCMC, Ehlers and Brooks (2002) extend the work of Brooks et al (2001) on constructing efficient proposal schemes and conclude that the best trans-dimensional proposals across a broad range of alternatives are those in which the autoregressive parameter values in $\theta_{k+1}$ are drawn from their joint conditional distribution given the value of the error variance, which remains unchanged. This is accomplished by setting $g(\theta_i, u) = (u, \tau)$ and taking $b_T(a_1, \ldots, a_j \mid \tau, j)$ as our proposal $q$. That is, we take the same proposal for the autoregressive parameters in the new model as we would for the within-model updates of the autoregressive parameters in model $j$, given in (5). This form of proposal is particularly useful in the context of TDSA as it automatically adapts (i.e., the proposal variance decreases) as the temperature is lowered. This ensures good mixing at all temperatures.
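A sketch of this order-changing move for the AR example follows (a hedged illustration, not the authors' code): the new coefficient vector is drawn from its full conditional (5) with $\tau$ held fixed, the Jacobian in (6) is one, and we assume moves to $k \pm 1$ are proposed with equal probability. The lag matrix uses the common conditioning set $t > k_{\max}$ of the approximate likelihood above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ar_design(x, k, k_max):
    """Response and lag matrix for AR(k), conditioning on t > k_max
    so that all candidate orders are compared on the same observations."""
    N = len(x)
    return x[k_max:], np.column_stack(
        [x[k_max - s:N - s] for s in range(1, k + 1)])

def conditional_a(x, k, tau, k_max, T):
    """Mean and covariance of the full conditional (5) for the coefficients."""
    xk, Xk = ar_design(x, k, k_max)
    Sigma = np.linalg.inv(tau * (Xk.T @ Xk) / T)
    return Sigma @ (tau * Xk.T @ xk / T), Sigma

def log_target(x, a, tau, k_max, T):
    """log b_T(theta_k, k) from (7): [log L_k(theta_k) - n_k] / T."""
    xk, Xk = ar_design(x, len(a), k_max)
    resid = xk - Xk @ a
    loglik = 0.5 * len(xk) * np.log(tau / (2 * np.pi)) - 0.5 * tau * resid @ resid
    return (loglik - (len(a) + 1)) / T            # n_k = k + 1 parameters

def order_move(x, a, tau, k_max, T, rng):
    """One reversible jump move k -> k +/- 1, drawing the new coefficient
    vector from its full conditional given tau; the Jacobian in (6) is one."""
    k_new = len(a) + rng.choice([-1, 1])
    if not 1 <= k_new <= k_max:
        return a                                   # stay put at the boundary
    mu_new, Sig_new = conditional_a(x, k_new, tau, k_max, T)
    mu_old, Sig_old = conditional_a(x, len(a), tau, k_max, T)
    a_new = rng.multivariate_normal(mu_new, Sig_new)
    log_A = (log_target(x, a_new, tau, k_max, T)
             + multivariate_normal.logpdf(a, mean=mu_old, cov=Sig_old)
             - log_target(x, a, tau, k_max, T)
             - multivariate_normal.logpdf(a_new, mean=mu_new, cov=Sig_new))
    return a_new if np.log(rng.uniform()) < log_A else a
```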


3.3. Simulation study

To illustrate the effectiveness of the TDSA algorithm in this simple context, we conduct a simulation study in which 50 data sets are simulated from autoregressive models of various orders. The data sets are generated by first randomly selecting an order $k \in \{5, 7, \ldots, 15\}$ and then simulating 300 data points from a model of that order with randomly selected parameter values. In each case, we take the error variance to be 4.0. To investigate the efficiency of the algorithm, and to see how this depends upon the tuning parameters, we take $k_{\max} = 20$, and then run our TDSA algorithm 10 times on each data set with randomly chosen starting points and different random seeds, for different values of $N$ and $\alpha$. Here, five different cooling schedules were used, $T_t = 10\alpha^t$ with $\alpha \in \{0.5, 0.75, 0.9, 0.95, 0.99\}$, and we take $N \in \{10, 50, 100, 500, 1000\}$. Thus, 250 simulations are performed on each of the 50 data sets.

Since the data are randomly generated, the model that minimises the AIC need not be the model from which the data were originally generated. However, an exhaustive search over model and parameter space is feasible in this simple pedagogic case. For example, it is straightforward to use Splus for this purpose to find the model which minimises the AIC criterion, returning both the optimal model and its coefficients. The results presented in Table 1 display the proportion of times the optimal model was chosen by the TDSA algorithm for each combination of $\alpha$ and $N$, and it is clear that as the cooling schedule slows and as $N$ increases, the performance of the algorithm improves. Acceptable performance seems to be achieved for $\alpha \geq 0.95$ and for $N \geq 500$.

[Table 1 here.]

3.4. Data and Results

To illustrate the performance of our TDSA algorithm in the context of a real autoregressive model selection analysis, we take the well-known lynx data set of Priestley (1981, section 5.5). This data set records the number of lynx trapped in the Mackenzie river in Canada from 1821 to 1934 and is well described by an AR process. We can see from Figure 1a that the AIC identifies the AR(11) as the best model and that the model space is reasonably flat for many lower order models, so that movement towards the best model should be fairly rapid as the temperature decreases.

[Figure 1 here.]

To examine the performance of our algorithm, we take $k_{\max} = 15$ and an initial temperature of $T_0 = 10$, which is reduced every 1000 iterations by a factor of 0.99, as suggested by the simulation study. With these settings, we typically observe a total of approximately 600 temperature reductions within the algorithm. Figure 1b provides a plot of the model order each time the temperature is reduced. Clearly, the chain converges fairly rapidly to the AR(11) process as the temperature is decreased. Figures 1c and 1d provide the corresponding plots of the autoregressive parameter $a_1$ and the error variance $\sigma^2$. Each of these plots demonstrates a great deal of variability at high temperatures that decreases as the temperature is lowered. The model order, lying on a discrete space, freezes completely after about 520 temperature reductions, but the model parameters continue to exhibit some variability even at the lowest temperature, as discussed in Section 2.

For this simple case the MLEs can be computed directly, and an exhaustive search over the 15 models is almost instantaneously performed by any standard package such as Splus.
Nonetheless, this simple example demonstrates the ease with which the TDSA algorithm may be implemented, and allows us to conduct a suitably sized simulation study to investigate its performance.

We next turn to two more realistic examples to provide a real test of performance for our TDSA algorithm. The first uses TDSA to perform model selection for a logistic regression problem, exploring a space of around 500,000 competing models. Though it is possible to automate a process for exhaustively enumerating each of these models, the computational cost is enormous and TDSA provides the only viable alternative. The second example analyses a series of integrated recapture/recovery data recorded on a population of shags, and also explores a model space comprising around 500,000 models. The models used here are most efficiently analysed via one or two specialised packages, but these analyses cannot be automated. Even if one model could be fitted per minute, complete enumeration would take over 50 weeks. TDSA provides the only viable mechanism for model selection over such a broad range of models.
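For the pedagogic AR example, by contrast, the exhaustive baseline mentioned above is easy to sketch directly (a hypothetical stand-in for the Splus search, not the authors' code; conditional MLEs are ordinary least squares on the common conditioning set $t > k_{\max}$ of Section 3.2):

```python
import numpy as np

def ar_aic(x, k, k_max):
    """Conditional MLE of an AR(k) model on t > k_max, and its AIC.

    Maximising the approximate likelihood of Section 3.2 is ordinary
    least squares in the coefficients, with tau-hat = n / RSS.
    """
    N = len(x)
    y = x[k_max:]
    X = np.column_stack([x[k_max - s:N - s] for s in range(1, k + 1)])
    a, rss = np.linalg.lstsq(X, y, rcond=None)[:2]
    rss = float(rss[0]) if rss.size else float(((y - X @ a) ** 2).sum())
    n = len(y)
    loglik = 0.5 * n * (np.log(n / rss) - np.log(2 * np.pi) - 1.0)
    return 2 * (k + 1) - 2 * loglik, a             # AIC with n_k = k + 1

def best_order(x, k_max=15):
    """Enumerate AR(1), ..., AR(k_max) and return the AIC-minimising order."""
    aics = {k: ar_aic(x, k, k_max)[0] for k in range(1, k_max + 1)}
    return min(aics, key=aics.get), aics
```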


4. Example I: Logistic Regression

Here we consider a logistic regression analysis in which we wish to find the best fitting and most parsimonious, yet plausible, model to describe the relationship between a binary outcome and a set of independent explanatory variables. We denote the binary response variable by $y = (y_1, \ldots, y_n)$, and assume that we have $P$ potential explanatory variables $X = \{x_1, x_2, \ldots, x_P\}$.

In the credit business, banks are interested in whether prospective consumers will pay back their credit or not. The aim of credit scoring is to model or predict the probability that a consumer possessing certain characteristics is to be considered a potential risk. As an illustration of a logistic regression variable selection problem, we examine a credit scoring data set in which the credit worthiness of 1000 individuals is to be explained by any or all of 19 covariates, given in Table 2. The original data set can be found at www.stat.uni-muenchen.de/service/datenarchiv/kredit/kredit_e.html. Since each variable can either be included in the model or not, there are a total of $2^{19} = 524{,}288$ possible models.

[Table 2 here.]

In order to model these data, we shall assume that each observed value $y_i$ is an independent Bernoulli outcome, with probability of "success" (i.e., repaying the debt) $p_i$. Individual models can be uniquely identified by the set of predictor variables that they contain. We let $\mathcal{M}$ denote the power set (i.e., the set of all subsets) of $M = \{1, \ldots, P\}$; then model $m \in \mathcal{M}$ denotes that with variables $x_j$, $j \in m$, and therefore has parameters $\beta^m = \{\beta_j : j \in m\}$. Under model $m$, we have
\[
p_i = \frac{\exp(\beta_0^m + \sum_{j \in m} \beta_j^m x_{ji})}{1 + \exp(\beta_0^m + \sum_{j \in m} \beta_j^m x_{ji})} = \frac{e_i(X, \beta^m, m)}{1 + e_i(X, \beta^m, m)},
\]
where $x_{ji}$ denotes the $i$th element of $x_j$ and $\beta_j^m$ denotes the $j$th parameter under model $m$. Note that we could let $M = \{0, 1, \ldots, P\}$ and allow the possibility of removing the constant term. However, for this example, we shall assume that the constant term is present in every plausible model.

Model selection for models of this sort is usually performed on the basis of the likelihood function
\[
L(y \mid X, \beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.
\]
Various criteria are used to compare competing models that allow a trade-off between goodness-of-fit (maximised by including all explanatory variables) and parsimony (maximised by excluding all explanatory variables). Techniques such as stepwise regression using likelihood ratio statistics are routinely used (Lemeshow and Hosmer 1998). In order to fully explore both parameter and model space simultaneously, we shall adopt the AIC criterion, so that our target distribution for $\beta^m$ under model $m$ at temperature $T$ is given by
\[
b_T(\beta, m) \propto \exp\left(\left[\log L_m(y \mid X, \beta) - (|m| + 1)\right]/T\right) = \prod_{i=1}^{n} \left(\frac{e_i(X, \beta^m, m)}{1 + e_i(X, \beta^m, m)}\right)^{y_i/T} \left(\frac{1}{1 + e_i(X, \beta^m, m)}\right)^{(1-y_i)/T} \exp(-(|m| + 1)/T), \qquad (8)
\]
where $|m|$ denotes the number of elements of the set $m$, i.e., the number of explanatory variables present in model $m$, and $L_m(\cdot \mid \cdot)$ denotes the corresponding likelihood function.

From (8) it is clear that the conditional distributions of the $\beta_j$ are non-standard. This means that we must use Metropolis-Hastings jumps to perform the within-model moves. Given the difficulty of proposing multivariate jumps with high acceptance probabilities, we shall use univariate random walk Metropolis updates to update each element of the $\beta$ vector in turn. In practice, for parameter $\beta_j$, $j \in m$, we propose moving from $\beta$ to $\beta' = (\beta_0, \ldots, \beta_{j-1}, \beta_j', \beta_{j+1}, \ldots)$ by generating $\beta_j' \sim N(\beta_j, \sigma_w^2)$ and accepting this move with probability $\min[1, b_T(\beta', m)/b_T(\beta, m)]$.

In order to move between models, we propose two distinct schemes. One performs "large" jumps by swapping covariates, i.e., simultaneously deleting one parameter whilst adding another. This involves no change in dimension and may also be performed via Metropolis-Hastings. The second between-model move involves simply adding or deleting a parameter from the model. This involves a change in dimension of the model space and therefore requires RJMCMC updates. Suppose that with probability $r_b^m$ we wish to move from state $\beta$ in model $m$ to model $m'$ by adding a new explanatory variable $x_j$, $j \notin m$ say. In this case, we need to add a new parameter $\beta_{|m'|}$, which we generate from a Normal distribution with mean $\mu_j$ and variance $\sigma_j^2$, so that $\beta' = (\beta, \beta_{|m'|})$. We will return to the problem of the appropriate choice of the proposal parameters later. Here the Jacobian term in the acceptance ratio of (6) is just 1 and the acceptance ratio becomes
\[
A = \frac{b_T(\beta', m')}{b_T(\beta, m)} \frac{r_d^{m'}}{r_b^m} \frac{1}{q(\beta_{|m'|})}, \qquad (9)
\]
where $q(u) = (2\pi\sigma_j^2)^{-1/2} \exp(-(u - \mu_j)^2/2\sigma_j^2)$, $r_b^m = 1/(P - |m|)$ for $m \neq M$, and $r_d^{m'} = 1/|m'|$ denotes the probability of proposing a death when in model $m'$.

In order to choose the proposal parameters $(\mu_j, \sigma_j^2)$, we first calculate MLEs of the parameter values for the "saturated" model using, for example, simulated annealing on the full model. We then take $\mu_j$ to be the MLE for $\beta_j$. The corresponding proposal variances are obtained via a small pilot study to ensure reasonably frequent movement between models.
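A sketch of the within-model and "birth" moves just described (illustrative Python under our own naming conventions, not the authors' code: a model is a sorted tuple of covariate indices, $\beta$ holds the intercept first, and the move probabilities $r_b^m$ and $r_d^{m'}$ enter exactly as in (9)):

```python
import numpy as np

def log_b_T(y, X, m, beta, T):
    """log b_T(beta, m) from (8), up to an additive constant.

    m is a sorted tuple of included covariate indices; beta holds the
    intercept followed by one coefficient per included covariate.
    """
    eta = beta[0] + X[:, list(m)] @ beta[1:]
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return (loglik - (len(m) + 1)) / T              # AIC penalty |m| + 1

def within_model_sweep(y, X, m, beta, T, rng, sd_w=0.1):
    """Univariate random-walk Metropolis over each element of beta in turn."""
    cur = log_b_T(y, X, m, beta, T)
    for j in range(len(beta)):
        prop = beta.copy()
        prop[j] += rng.normal(scale=sd_w)
        new = log_b_T(y, X, m, prop, T)
        if np.log(rng.uniform()) < new - cur:
            beta, cur = prop, new
    return beta

def birth_move(y, X, m, beta, T, P, mu, sd, rng):
    """Add one covariate with proposal N(mu[j], sd[j]^2); accept via (9)."""
    avail = [j for j in range(P) if j not in m]
    if not avail:
        return m, beta
    j = avail[rng.integers(len(avail))]             # r_b^m = 1/(P - |m|)
    b_new = rng.normal(mu[j], sd[j])
    m2 = tuple(sorted(m + (j,)))
    beta2 = np.insert(beta, 1 + m2.index(j), b_new)
    log_q = (-0.5 * np.log(2 * np.pi * sd[j] ** 2)
             - (b_new - mu[j]) ** 2 / (2 * sd[j] ** 2))
    log_A = (log_b_T(y, X, m2, beta2, T) - log_b_T(y, X, m, beta, T)
             + np.log(1.0 / len(m2))                # r_d^{m'} = 1/|m'|
             - np.log(1.0 / (P - len(m)))           # r_b^{m}
             - log_q)
    if np.log(rng.uniform()) < log_A:
        return m2, beta2
    return m, beta
```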

4.1. A Simulation Study

In order to assess the performance of the TDSA algorithm for the variable selection problem, response variables for 20 different data sets were generated by randomly choosing different subsets of covariates with corresponding randomly chosen regression coefficients. The aim here, as in Section 3.3, is to assess the performance of TDSA for combinations of $\alpha$ and $N$, and to investigate the reliability of the method.

To choose the proposal parameters we first find the MLEs of the parameter values for the saturated model, i.e. the model with all possible covariates. This may be done using simulated annealing on the saturated model, or otherwise using standard statistical software. Similarly, taking a proposal variance $\sigma_j^2$ equal to the sum of the corresponding standard error and 0.1 appeared to work well. For the study, we examined the performance of the algorithm for values of $\alpha \in \{0.9, 0.95, 0.99, 0.995\}$ and $N \in \{100, 500, 1000\}$. For each data set we recorded the lowest AIC value visited.

In this situation we do not know a priori which model will minimise the AIC, since this need not necessarily be the model from which the data were simulated. Since exhaustive enumeration is impossible, we adopted an ad-hoc model search technique in which all neighbours of the original model (from which the data were simulated) are fitted via a traditional SA algorithm. Here, a neighbour denotes any model that can be obtained from the current model by adding, deleting or swapping a single parameter. If no neighbours have a lower AIC value than the current model, then the scheme stops. However, if a better model is found, then the process is begun again using this model, i.e., all neighbours of the new model are explored.
Since the model minimising the AIC is likely to be reasonably close to the original model used to generate the data, this provides a reasonably efficient method for exploring model space without having to evaluate all models. Of course, in any real application, the original model will not be known, and so no similar process could be used when examining real data sets.

The proportions of correctly chosen models for each combination of $\alpha$ and $N$ are presented in Table 3. Again it can be seen that the performance of the algorithm improves as $\alpha$ and $N$ increase. In particular, these results suggest that choosing $\alpha \geq 0.99$ and $N \geq 500$ should lead to reasonable results. As a comparison, a backward selection procedure in which we begin with the saturated model and sequentially remove parameters so as to reduce the AIC manages to locate the correct model for only 11 of the 20 data sets, a success rate of 55%. What this statistic masks is the fact that when the TDSA algorithm obtains the wrong model, it always identifies a very close neighbour. However, when the backward selection method fails, the final model is commonly quite different from the true one.

[Table 3 here.]
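The ad-hoc neighbourhood search used as the benchmark above can be sketched as follows, assuming a hypothetical fit_aic(m) that returns the (minimised) AIC of model m, e.g. from a fixed-dimensional SA run or a standard GLM fit:

```python
def neighbours(m, P):
    """All models reachable from m by adding, deleting or swapping one covariate."""
    m = set(m)
    out = [m | {j} for j in range(P) if j not in m]          # additions
    out += [m - {j} for j in m]                              # deletions
    out += [(m - {j}) | {i} for j in m
            for i in range(P) if i not in m]                 # swaps
    return [tuple(sorted(s)) for s in out]

def neighbourhood_search(m0, P, fit_aic):
    """Iterate best-of-neighbourhood descent on the AIC surface from model m0."""
    best = tuple(sorted(m0))
    best_aic = fit_aic(best)
    while True:
        cand = min(neighbours(best, P), key=fit_aic)         # fit each neighbour
        cand_aic = fit_aic(cand)
        if cand_aic >= best_aic:
            return best, best_aic
        best, best_aic = cand, cand_aic
```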


4.2. Results

Applying the TDSA algorithm to the credit scoring data described above, we take an initial temperature $T_0 = 10$, which we reduce by a factor of 0.995 every 1000 iterations. We perform 1700 temperature reductions, so that our final temperature is approximately 0.002, and, as before, base our proposal parameters upon the MLEs from the saturated model. The TDSA algorithm was run 20 times from various initial model configurations, and each run took approximately 6 hours to complete. The following model, based upon 11 covariates, was obtained from 17 of these runs:
\[
\mathrm{logit}\, p_i = 3.160 + 1.459 x_{1i} - 0.0275 x_{2i} - 0.898 x_{3i} - 0.0000644 x_{5i} - 0.864 x_{6i} + 0.541 x_{7i} - 0.515 x_{8i} + 0.414 x_{9i} - 0.315 x_{13i} + 0.305 x_{18i} - 1.302 x_{19i}.
\]
This model has an associated AIC value of 1017.828, and complete enumeration of all models (taking approximately 250 hours) confirms that this model does indeed minimise the AIC.

5. Example II: Recapture/Recovery Models

Data collection for wildlife populations is often performed using capture-recapture and/or tag-recovery techniques (Schwarz and Seber 1998). The parameters that are usually of interest relate to the survival, recapture (of live animals) and/or recovery (of dead animals) rates. Often, these parameters may be dependent on such factors as time, location and/or age of the animal.

Within this field, model discrimination is paramount amongst questions of primary interest, since each model tells us something quite different about the underlying dynamics of the population under study, e.g., whether survival is dependent upon age and/or location. Unfortunately, even the most simple of modelling situations can give rise to enormous numbers of plausible model structures that need to be explored in order to find that which fits the data best.

Several packages such as SURVIV and MARK exist to analyse data of this sort. However, it is difficult to automate extensive model search schemes within these packages and, though individual analyses can be performed very rapidly, the sheer number of plausible models available in any realistic application means that exhaustive enumeration is impossible. Therefore, current best practice is to restrict the model space on the basis of biological arguments and to consider all models within a very small class. However, as we shall see, this can often lead to the automatic rejection of large areas of model space that contain better models than those actually considered in the analysis. This is particularly likely when the underlying population is not yet well understood, or when even simple, but unpredictable, interactions drive the underlying population dynamics.

We shall consider the recapture/recovery data relating to a population of island shags presented by Catchpole et al (1998). Shags are termed pulli in their first year and immature in their next two years, before becoming adults in their fourth year. The survival, recapture and recovery parameters are assumed to be possibly dependent on the year and/or age of the shag, so that the saturated model has parameters
\[
\begin{aligned}
\phi_c(t) &= \Pr(\text{an animal of age } c \text{ alive in year } t \text{ survives until year } t+1),\\
p_c(t+1) &= \Pr(\text{an animal of age } c \text{ alive in year } t+1 \text{ is resighted at that time}),\\
\lambda_c(t) &= \Pr(\text{an animal of age } c \text{ which dies in } [t, t+1] \text{ has its band returned}),
\end{aligned}
\]
for $c \in \{1, 2, 3, A\}$, where $A$ represents adult. Different models can then be represented by the different restrictions that are placed upon these parameters.

For biological reasons we assume that the survival rates may only be the same for consecutively aged shags.
Conversely, we allow all possible combinations of ages to have common recapture and/or recovery parameters. Finally, for each of these parameters there is either a year or no-year dependence. These represent moderate restrictions on the model space, leaving 477,184 distinct biologically plausible models. However, these restrictions could also be removed with little additional computational complexity or expense.

In the context of a Bayesian analysis of these data, King and Brooks (2002) describe a reversible jump MCMC scheme for sampling from the corresponding posterior distribution. The sampling scheme used here follows theirs almost identically, but with the removal of their prior terms and the addition of the temperature term intrinsic to our TDSA scheme.

Thus the computational details are omitted here, and the reader is referred to King and Brooks (2002) for a more detailed discussion.

We begin with a short simulation study, in which 80 data sets are simulated and analysed using the TDSA algorithm, in order to confirm that our TDSA algorithm is indeed capable of searching model space to obtain the best model.

5.1. A Simulation Study

Each of the 80 simulated data sets (each comprising 10,000 individuals, so as to be comparable to the size of the shag data set analysed below) is simulated from a randomly selected model. We perform the TDSA algorithm twice for each set of data, once starting from the "saturated" model, with full age and year dependence for survival, recapture and recovery, and once from the "smallest" model, with common survival, recapture and recovery parameters over all ages and years. Since in total 160 simulations are to be run, only relatively short TDSA runs are possible in order to retain a viable simulation study. With that in mind, each simulation was started at a temperature of $T_0 = 10$, which was decreased by a factor of $\alpha = 0.99$ after every $N = 5000$ iterations, and 1000 temperature reductions were used within the procedure.

As before, the model that minimises the AIC for any particular data set need not necessarily be that from which it was originally simulated. Thus we adopt the same approach as for the logistic regression example and calculate the AIC statistic for each of the models "neighbouring" the original true model by a single parameter dependency. If a model with a lower AIC is found, the procedure is repeated using this model, until no further model is found with a lower AIC statistic.

With the settings above, the TDSA algorithm identifies the same or a better model (in terms of the AIC) than the ad-hoc model selection procedure described above in excess of 86% of the time. In the few cases in which the TDSA algorithm did not identify the correct model, an immediate neighbour with a very similar AIC value was identified instead. This suggests that the cooling schedule is indeed too rapid and/or the value of $N$ too small. With the current values, each simulation took approximately 6 hours, and to increase these settings would make the entire simulation study infeasibly time-consuming. However, even with this rapid cooling schedule, we get the correct answer over 86% of the time, and are within the immediate neighbourhood of the correct model 100% of the time.

5.2. Results

On the basis of the simulation study described above, we take $\alpha = 0.99$ and $N = 10{,}000$, with a total of 1000 temperature reductions. Following similar principles to those applied to MCMC output in the Bayesian literature, we also run five separate replications in order to reassure ourselves that the correct model is obtained. In fact all five replications identify the same model, and give essentially identical MLEs. Each TDSA run took around 12 hours to complete, but trace plots suggest that the system froze in the correct model in around one fiftieth of that time.

The model identified as having the lowest AIC statistic (12138.72) imposes the following restrictions:
\[
\begin{aligned}
&\phi_2(t) = \phi_3(t) = \phi_{\mathrm{imm}};\\
&p_1(t) = p_1, \quad p_2(t) = p_2;\\
&\lambda_1(t) = \lambda_1, \quad \lambda_2(t) = \lambda_3(t) = \lambda_{\mathrm{imm}}, \quad \lambda_A(t) = \lambda_A.
\end{aligned}
\]
We note that a previous Bayesian analysis of these data by King and Brooks (2002), using Uniform priors over both the parameters and models, identifies exactly the same model as being a posteriori most probable.
5.3. Comparison with Competing Methods

Catchpole et al (1998) perform a classical analysis of these data, using likelihood ratio tests to compare different models. However, only a very limited number of models is considered, since each model needs to be analysed individually. The final model chosen within their analysis has the following restrictions:

\[
\begin{aligned}
&\phi_2(t) = \phi_3(t) = \phi_{\mathrm{imm}}, \quad \phi_A(t) = \phi_A;\\
&p_1(t) = p_1, \quad p_2(t) = p_2, \quad p_3(t) = p_3;\\
&\lambda_1(t) = \lambda_2(t) = \lambda_3(t) = \lambda_A(t) = \lambda(t).
\end{aligned}
\]
The corresponding AIC for this model is 12197.7, which is clearly much larger than the AIC (of 12138.72) for the model identified within the TDSA algorithm. Though our model has a larger number of parameters, it has a much smaller deviance and is clearly preferred.

The earlier analysis of Catchpole et al (1998) suggests that recovery parameters are year dependent, but that dead birds of each age are equally likely to be recovered. However, the discovery of our alternative (and better) model suggests that the recovery probabilities are generally age dependent but independent of the year of the study. This is more consistent with expert opinion, since "the less experienced juveniles and first winter birds seem to die in a way associated with man (e.g. caught in nets or lobster pots) or in places where they are more likely to be found (e.g. blown inland or into estuaries, which rarely happens to older birds)" (Mike Harris, pers. comm.). Thus, we should expect a larger recovery rate for these shags and a corresponding age-dependent structure for recoveries, which was not observed in the earlier model.

6. Discussion

In this paper we develop a trans-dimensional simulated annealing algorithm capable of exploring both parameter and model space simultaneously to obtain the parameter and model combination that minimises any given criterion, such as the AIC. Though, in practice, many users might prefer to take a more hands-on approach to model selection, when the number of models is large this becomes infeasible and some form of automated model exploration technique must be applied. By far the most common is some form of sequential method that, by its very nature, is liable to become stuck in local optima. However, we demonstrate that our TDSA algorithm is capable of exploring large model spaces and reliably locating the optimum model in the context of three realistic examples.

The TDSA algorithm is essentially a series of (RJ)MCMC algorithms run on a sequence of target distributions, and highlights the fact that (RJ)MCMC methods need not only be useful to Bayesians. As with standard MCMC algorithms, convergence is an issue, but standard MCMC convergence assessment techniques cannot be applied in this context. We suggest a pragmatic approach in which several independent replications of the TDSA algorithm are applied to any particular problem in order to ensure that the correct solution is obtained. Though this increases computational expense, in many realistically complex problems running several replications of a TDSA algorithm may be many orders of magnitude quicker than complete enumeration of all possible models.

In this paper, we conduct several simulation studies to assess the performance of our method and to compare it with its most natural competitors. For the more complex problems, even with simulated data, the model that actually minimises the AIC will not be known. Thus, we introduce an ad-hoc neighbourhood exploration algorithm that exhaustively enumerates all neighbours of the model from which the data were originally generated, in order to see if any of these describe the simulated data better. This process is repeated iteratively until no improvement can be found. In any real problem, this process could not be adopted, since we would not know where to start the algorithm.
However, in the spirit of the hybrid SA algorithm of Brooks and Morgan (1994), we could combine this approach with the TDSA algorithm, forming a hybrid algorithm that uses TDSA to obtain a starting point for the ad-hoc neighbourhood search scheme. Using this approach, the success rate for the capture-recapture example in Section 5.1 increases from 86% to 98%.

Though, broadly speaking, the performance of the TDSA algorithm generally remains insensitive to the values of the tuning parameters within a reasonable range, some degree of pilot tuning is often necessary to obtain optimal starting temperatures, cooling rates, cooling intervals and proposal parameters. Though the wealth of theoretical results available for traditional SA algorithms remains valid here, there is still scope for further work in automating the selection of appropriate values (though of course the proposal generating mechanisms of Brooks et al (2001) provide some useful guidance here). When constructing $b_T(\theta)$, we could replace $f(\theta)$ by any monotonic function of $f$, as we do in Section 2.1 when we take $f(\theta) = -\log L(x \mid \theta)$. This gives us additional freedom to affect the behaviour of the chain without altering the solution itself. Mechanisms for determining reparameterisations which optimise the performance of the TDSA algorithm are currently under development.

7. Acknowledgments

The authors would like to thank the participants of the Workshop on Reversible Jump MCMC Methods held in Spetses in 2001 for their constructive feedback on an early draft of this paper. This workshop and the work of the second author were supported by the EU TMR Network ERB-FMRX-CT96-0095 on "Computational and Statistical Methods for the Analysis of Spatial Data". The work of the first and third authors was supported by the EPSRC under grant numbers AF/000537 and RG/M76157 respectively.

References

Akaike, H. (1974), A new look at the statistical identification model. IEEE Transactions on Automatic Control 19, 716-723.

Andrieu, C., N. de Freitas and A. Doucet (2001), Reversible Jump MCMC Simulated Annealing for Neural Networks. Neural Computing, to appear.

Besag, J. E. and J. C. York (1989), Bayesian Restoration of Images. In T. Matsunawa (ed.), Analysis of Statistical Information, pp. 491-507, Tokyo: Institute of Statistical Mathematics.

Box, G. E. P. and G. M. Jenkins (1970), Time Series Analysis: Forecasting and Control. Holden-Day: San Francisco.

Brent, R. P. (1973), Algorithms for Minimizing without Derivatives. Prentice-Hall.

Brooks, S. P. (1998), Markov Chain Monte Carlo Method and its Application. The Statistician 47, 69-100.

Brooks, S. P., P. Giudici and G. O. Roberts (2001), Efficient Construction of Reversible Jump Proposal Distributions. Technical report, University of Cambridge.

Brooks, S. P. and B. J. T. Morgan (1994), Automatic Starting Point Selection for Function Optimisation. Statistics and Computing 4, 173-177.

Brooks, S. P. and B. J. T. Morgan (1995), Optimisation using Simulated Annealing. The Statistician 44, 241-257.

Brooks, S. P., B. J. T. Morgan, M. S. Ridout and S. E. Pack (1997), Finite Mixture Models for Proportions. Biometrics 53, 1097-1115.

Catchpole, E. A., S. N. Freeman, B. J. T. Morgan and M. P. Harris (1998), Integrated Recovery/Recapture Data Analysis. Biometrics 54, 33-46.

Corana, A., M. Marchesi, C. Martini and S. Ridella (1987), Minimizing Multimodal Functions of Continuous Variables with the Simulated Annealing Algorithm. ACM Transactions on Mathematical Software 13, 262-280.

Dempster, A. P., N. M. Laird and D. B. Rubin (1977), Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38.

Drago, G., A. Manella, M. Nervi, M. Repetto and G. Secondo (1992), A Combined Strategy for Optimization in Non Linear Magnetic Problems Using Simulated Annealing and Search Techniques. IEEE Transactions on Magnetics 28, 1541-1544.

Draper, D. and D. Fouskakis (2001), A Case Study of Stochastic Optimization in Health Policy: Problem Formulation and Preliminary Results. Journal of Global Optimization, to appear.

Ehlers, R. S. and S. P. Brooks (2002), Model Uncertainty in Integrated ARMA Processes. Technical report, University of Cambridge.

Fletcher, R. (1987), Practical Methods of Optimization. Wiley: New York.

Gamerman, D. (1995), Monte Carlo Markov Chains for Dynamic Generalized Linear Models. Technical report, Universidade Federal do Rio de Janeiro, Brazil.

Gelfand, A. E. (1996), Model Determination using Sampling-based Methods. In W. R. Gilks, S. Richardson and D. J. Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice, pp. 145-162, Chapman and Hall.

Gelfand, A. E. and A. F. M. Smith (1990), Sampling Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association 85, 398-409.

Geman, S. and D. Geman (1984), Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

George, E. I. and D. P. Foster (2000), Calibration and Empirical Bayes Variable Selection. Biometrika 87, 731-747.

Gidas, B. (1985), Nonstationary Markov Chains and the Convergence of the Annealing Algorithm. Journal of Statistical Physics 39, 73-131.

Gilks, W. R., S. Richardson and D. J. Spiegelhalter (1996), Markov Chain Monte Carlo in Practice. Chapman and Hall.

Glover, F. and M. Laguna (1997), TABU Search. Kluwer.

Green, P. J. (1995), Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711-732.

Ingber, L. (1992), Genetic Algorithms and Very Fast Simulated Re-annealing: A Comparison. Mathematical and Computer Modelling 16, 87-100.

King, R. and S. P. Brooks (2002), Model Selection for Integrated Recovery/Recapture Data. Biometrics, in press.

Kirkpatrick, S. (1984), Optimization by Simulated Annealing: Quantitative Studies. Journal of Statistical Physics 34, 975-986.

Kirkpatrick, S., C. D. Gelatt and M. P. Vecchi (1983), Optimisation using Simulated Annealing. Science 220, 671-680.

Lemeshow, S. L. and D. W. Hosmer (1998), Logistic Regression. In P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics, pp. 2311-2327, Chichester: Wiley.

Marriott, J., N. Ravishanker, A. Gelfand and J. Pai (1996), Bayesian Analysis of ARMA Processes. In D. Berry, K. Chaloner and J. Geweke (eds.), Bayesian Statistics and Econometrics: Essays in Honor of Arnold Zellner, pp. 243-256, North Holland: Amsterdam.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller (1953), Equations of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21, 1087-1091.

Nelder, J. A. and R. Mead (1965), A Simplex Method for Function Minimization. The Computer Journal 7, 308-313.

Norman, G. E. and V. S. Filinov (1969), Investigation of Phase Transitions by a Monte Carlo Method. High Temperature 7, 216-222.

Priestley, M. B. (1981), Spectral Analysis and Time Series. Volume 1: Univariate Series. London: Academic Press.

Ripley, B. D. (1987), Stochastic Simulation. John Wiley and Sons.

Roberts, G. O. and S. K. Sahu (1997), Updating Schemes, Covariance Structure, Blocking and Parameterisation for the Gibbs Sampler. Journal of the Royal Statistical Society, Series B 59, 291-318.

Schwarz, C. J. and G. A. F. Seber (1998), A Review of Estimating Animal Abundance III. Statistical Science 14, 427-456.

Schwarz, G. (1978), Estimating the Dimension of a Model. Annals of Statistics 6, 461-464.

Simkin, J. and C. W. Trowbridge (1992), Optimizing Electromagnetic Devices Combining Direct Search Methods with Simulated Annealing. IEEE Transactions on Magnetics 28, 1545-1548.

Smyth, G. K. (1996), Partitioned Algorithms for Maximum Likelihood and other Nonlinear Estimation. Statistics and Computing 6, 201-216.

Stephens, M. (2000), Bayesian Analysis of Mixture Models with an Unknown Number of Components: An Alternative to Reversible Jump Methods. Annals of Statistics 28, 40-74.

Troughton, P. and S. Godsill (1997), Reversible Jump Samplers for Autoregressive Time Series, Employing Full Conditionals to Achieve Efficient Model Space Moves. Technical report, University of Cambridge.

Vanderbilt, D. and S. G. Louie (1984), A Monte Carlo Simulated Annealing Approach to Optimization over Continuous Variables. Journal of Computational Physics 56, 259-271.

             α
  N       0.5    0.75   0.9    0.95   0.99
  10      0.23   0.30   0.53   0.57   0.77
  50      0.27   0.42   0.65   0.69   0.86
  100     0.53   0.69   0.81   0.90   0.97
  500     0.77   0.82   0.93   0.98   0.99
  1000    0.87   0.95   0.97   0.99   0.99

Table 1. Each table entry gives the proportion of correct models chosen for each combination of α and N.

X1  = Indicator of balance: 0 = no balance/account, 1 = balance.
X2  = Duration of account (months).
X3  = Previous credit problems: 0 = no problems, 1 = some problems.
X4  = Purpose of credit: 0 = household purchase, 1 = other.
X5  = Amount of credit (Deutsche Mark).
X6  = Value of savings: 0 = ≤ 500 DM, 1 = > 500 DM.
X7  = Duration of employment: 0 = unemployed or ≤ 4 years, 1 = > 4 years.
X8  = Instalment as % of income: 0 = > 25%, 1 = < 25%.
X9  = Guarantor: 0 = no, 1 = yes.
X10 = Duration living in current household: 0 = < 4 years, 1 = ≥ 4 years.
X11 = Valuable assets: 0 = car/no asset, 1 = house/other savings.
X12 = Age (years).
X13 = Further credits: 0 = no, 1 = yes.
X14 = Type of apartment: 0 = owned, 1 = other.
X15 = No. of previous credits: 0 = ≤ 3 credits, 1 = ≥ 4 credits.
X16 = Occupation: 0 = unemployed/unskilled, 1 = skilled.
X17 = No. of persons entitled to maintenance: 0 = 0 to 2, 1 = 3 or more.
X18 = Telephone: 0 = no, 1 = yes.
X19 = Foreigner: 0 = no, 1 = yes.

Table 2. Description of covariates in the credit scoring data set.

             α
  N       0.9    0.95   0.99   0.995
  100     0.13   0.24   0.82   0.90
  500     0.57   0.79   0.91   0.92
  1000    0.76   0.88   0.93   0.95

Table 3. Each table entry gives the proportion of 'best' models chosen for each combination of α and N.

[Figure 1 about here.]

Fig. 1. TDSA output for the lynx example. (a) AIC against autoregressive model order. (b) Trace plot of model order against number of temperature reductions for the TDSA algorithm. (c) Corresponding plot of a1. (d) Corresponding plot for σ².