
Three roads diverged? Routes to phylogeographic inference


Erik W. Bloomquist 1, Philippe Lemey 2 and Marc A. Suchard 3,4

1 Mathematical Biosciences Institute, The Ohio State University, Columbus, OH 43210, USA
2 Department of Microbiology and Immunology, Rega Institute, K.U. Leuven, Leuven 3000, Belgium
3 Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095, USA
4 Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA

Opinion

Phylogeographic methods facilitate inference of the geographical history of genetic lineages. Recent examples explore human migration and the origins of viral pandemics. There is longstanding disagreement over the use and validity of certain phylogeographic inference methodologies. In this paper, we highlight three distinct frameworks for phylogeographic inference to give a taste of this disagreement. Each of the three approaches presents a different viewpoint on phylogeography, most fundamentally on how we view the relationship between the inferred history of a sample and the history of the population the sample is embedded in. Satisfactory resolution of this relationship between history of the tree and history of the population remains a challenge for all but the most trivial models of phylogeographic processes. Intriguingly, we believe that some recent methods that entirely avoid inference about the history of the population will eventually help to reach a resolution.

Emerging pathways of phylogeographic inference

The influence of phylogeography is spreading throughout biology. Among other examples, phylogeographic techniques have enabled us to infer the origins of mice [1], modern humans [2,3], and man's "best friend", the domesticated dog [4]. Phylogeographic analyses also enable public health officials to understand the origin and spread of emerging infectious diseases [5–8]. In spite of these successes, disagreement and confusion persist regarding the most effective ways to learn about phylogeographic processes from geospatially identified molecular sequence data. Major points of contention include: how to model the phylogeographic spread of represented taxa under study; what statistical frameworks provide effective inference tools; and how to best reconcile population frameworks with geographic information. Over the past few years, these questions have been extensively addressed in other reviews and we do not repeat points already made [9–12]. Instead, we highlight several newer phylogeographic approaches: a Bayesian approach to nested clade phylogeographic analysis (NCPA) [13,14], stochastic-process-driven spatial diffusion models that map viral outbreaks [15,16], and several recent population genetic approaches to phylogeography [17–19]. We draw parallels between these methods in an attempt to shed light on their similarities and differences. We hope that by exposing a general audience to a cross-section of methods, new light will be shed on longstanding issues in phylogeography.

Glossary

Ancestral history: any information about the direct ancestors of a sample of molecular sequences. This term can refer, for example, to inferred sequence composition or phenotype, such as geography, and is often associated with a time scale.

Approximate Bayesian computation (ABC): simulation technique used to draw statistical inference based on data summaries, often used when computation of the full data likelihood is impractical.

Bayes factor: ratio of the marginal likelihoods of a given data set comparing two competing models that naturally incorporates uncertainty about unknown parameters in both models.

Bayesian stochastic search variable selection: framework that estimates the posterior probability that a particular explanatory variable should be included in a model. The method is most commonly used in Bayesian inference of linear regression.

BEAST: open-source MCMC package for analysis of several Bayesian evolutionary models for molecular sequences and associated traits, such as geography.

Brownian diffusion: stochastic process on the real number line or, in the work described in this article, on a geographic surface in which increments are independent and normally distributed with mean zero and variance that scales linearly with duration.

Continuous-time Markov chain: stochastic process on a discrete state space or, in the work described in this article, a set of locations that is memoryless and whose waiting times between transitions are exponentially distributed.

Comparative approach: framework for relating observed phenotype information to an evolutionary history. Nested clade phylogeographic analysis assumes that geography is a phenotypic trait and falls into this category.

Model-based approach: method in which a fully specified probabilistic model describes how observed data are generated. Unknown parameters can characterize this model, and statistical inference involves estimating and testing these parameters.

Nested clade phylogeographic analysis (NCPA): resampling-based approach for inference of geographic information from haplotype trees and networks.

Phylogeography: interdisciplinary field involving study of the evolutionary history and geographic spread of biological populations or taxa.

Population genetics: as used in this article, a framework that uses a sample of molecular sequences to make statements about a population under study. The coalescent is the most widely used population genetics framework. Extensions of the coalescent to phylogeography typically focus on migration rates between multiple populations fixed in space.

Spatial diffusion model: application of independent stochastic processes to describe changes in geographic phenotypes on a two-dimensional surface. Under the discrete framework described in this article, the surface is divided into discrete regions and movement between regions is modeled as a continuous-time Markov chain. Under the continuous diffusion framework, these regions are subdivided until they become infinitesimal and generalizations of Brownian diffusion are then considered.

Structured coalescent: extension of the basic coalescent model to multiple populations. This method focuses on ancestral population sizes and migration rates between populations.

Corresponding author: Suchard, M.A. ([email protected]).

0169-5347/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2010.08.010 Trends in Ecology and Evolution, November 2010, Vol. 25, No. 11

Opinion Trends in Ecology and Evolution Vol.25 No.11

Comparative approach

For almost a decade, supporters of NCPA [14] and of model-based phylogeographic methods [20] have argued over the merits of each method. As with previous debates in phylogenetics, this has generated several positive outcomes; most importantly, investigators now vigorously question assumptions when analyzing geographic, demographic and evolutionary data [21]. Nevertheless, this long-standing discussion has introduced considerable confusion [22,23]. The debate now seems to be coming to an end [20]. In addition, a new model-based approach (highlighted below) overcomes many points of contention over NCPA, making the debate increasingly moot. For those not familiar with the debate over NCPA and model-based approaches, we provide a synopsis in Box 1.

Before discussing this newer approach, we provide a short background to NCPA, a technique that stood alone for many years in explicitly modeling geography in an evolutionary context. The method originated in previous efforts to jointly analyze phenotypic and phylogenetic data in a comparative framework [24,25] and follows the same general three-step scheme [9,13,26,27]. First, a haplotype tree or network is constructed from molecular data using a variety of approaches [28–30]. Second, clades on the haplotype tree are nested, typically following the stepwise guidelines of Templeton [31,32]. Finally, using these nested clades, a permutation test assesses the significance of the geographical spread.
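The third step can be illustrated with a minimal sketch of a permutation test. This is a simplified stand-in, not Templeton's clade and nested-clade distance statistics: the statistic `clade_distance`, the toy coordinates and the function names are all assumptions made for illustration. Clade labels are shuffled across sampling locations, and the p-value is the share of shuffles in which the focal clade is at least as geographically restricted as observed.

```python
import math
import random

def clade_distance(points):
    """Mean distance of sampled points from their geographic centre."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)

def permutation_p_value(locations, labels, focal, n_perm=2000, seed=1):
    """One-sided p-value: is the focal clade more restricted than chance?"""
    rng = random.Random(seed)

    def stat(lbls):
        return clade_distance([p for p, l in zip(locations, lbls) if l == focal])

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break any link between clade and place
        if stat(shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: two clades sampled in two well-separated areas.
locations = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
p = permutation_p_value(locations, labels, focal="A")
```

A small `p` here indicates that clade A is more tightly clustered than random relabelling would produce. Repeating such a test for every nested clade raises the multiple-testing concern discussed in the text.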

Box 1. Background to the NCPA debate

NCPA originated in a single-locus formulation in a landmark paper by Templeton [26] that addresses the construction of a haplotype tree [28], the nesting of clades on a haplotype tree [31,32], and the use of a permutation test and inference key to interpret significant results [26,68]. The debate began when Knowles and Maddison showed that single-locus NCPA has a high false-positive rate [69]. Later, Panchal and Beaumont found similar high false-positive rates [27,36].

In response to these claims, Templeton [35,70] modified his original NCPA, introduced a multi-locus cross-validated version, and conducted an extensive empirical validation study of single-locus NCPA. Nevertheless, supporters of NCPA and those against it soon released a flurry of papers either denouncing the method [34,71,72] or defending it [73–75]. Following this, both sides began to question the philosophical basis of the two methods [9,76]. Within the past 6 months, Templeton [14,77] and those criticizing his method [20] have continued this debate.

The debate has focused on two general points: (i) do single-locus and multi-locus NCPA have an inherently high false-positive rate, and does this preclude their use? and (ii) do model-based methods or NCPA provide a more appropriate way to analyze phylogeographic data? A decade ago, the answer to the latter question was NCPA, because model-based methods for phylogeography were not widely available and, arguably, only NCPA could explicitly incorporate geographic features. Today, however, this situation has changed. Spatial diffusion and population genetics approaches are attracting attention; the former currently incorporates important features, such as physical distance and known geographic barriers, and the latter has the potential to do so. We believe that Beaumont et al. [20] resolve the questions posed above in favor of model-based methods. Nevertheless, we encourage others to read the relevant publications to form their own opinion.

Significant ambiguity occurs in all three stages of the NCPA approach [9]. During the first and second stages, alternative methods can lead to different, and possibly better, haplotype tree inference and nesting structures [13,33]. More importantly, in the third stage, determination of the significance level for a particular clade does not typically take into account multiple testing, which probably leads to a high false-positive rate [9,34] (Box 1). Templeton introduced a multi-locus cross-validated remedy for the apparent high false-positive rate [35], although this revision does not provide a complete solution [20,36]. It should be noted that Templeton disagrees with the ambiguous label for NCPA and its apparent high false-positive rate [14], but extensive research suggests otherwise (Box 1).
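A short calculation shows why uncorrected testing of many clades inflates false positives. The sketch assumes the per-clade tests are independent, which nested clades generally are not, so the numbers indicate scale only; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

# Ten clades, each tested at the 5% level with no correction.
rate_for_ten_clades = familywise_error_rate(0.05, 10)
```

Under these assumptions, ten uncorrected tests at the 5% level give roughly a 40% chance of at least one spurious significant clade.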

In addition to these issues, the pipeline nature of NCPA leads to overconfidence in the final result [13]. This phenomenon is well known in the literature on statistics and model-based molecular evolution. Multiple sequence alignment provides an excellent example in which conditioning on an unknown multiple sequence alignment can lead to biased conclusions and overconfidence [37,38]. Fortunately, a joint Bayesian approach provides a formal and straightforward fix to pipelined analyses [39–41]. Brooks et al. [13] and Manolopoulou (PhD thesis, University of Cambridge, 2008) adopted this simultaneous paradigm to combine the three stages of NCPA. For the first stage, Manolopoulou introduced a tractable procedure to infer a haplotype tree from molecular sequence data that takes into account possible homoplasy. She then used approximate Bayesian computation to achieve computational efficiency for this part. For the second stage of NCPA, Brooks et al. [13] described a multivariate clustering model applicable to both phenotypic and phylogeographic data. The authors did not provide an analog of the third stage of NCPA, but suggested that posterior probabilities of various qualitative events provide a solid alternative to the inference key of NCPA. To demonstrate the model, Manolopoulou applied both methods to three empirical examples. In general, the two methods give similar conclusions, but the Bayesian approach formalizes the intuitive results from NCPA. This Bayesian framework should excite those who favor the comparative method because it retains the spirit of the comparative method, but corrects many of the statistical issues that plagued previous efforts. We look forward to many new developments arising from this approach.

Spatial diffusion approach

As an alternative to the comparative method and its Bayesian extensions, we now highlight recent developments towards model-based approaches that take a probabilistic perspective on spatial diffusion [15,16]. Although much of this work was implemented as part of a comprehensive statistical inference package that led to fruitful advances in demographic models [42,43], we want to emphasize that, unlike spatial coalescent approaches [44], phylogenetic diffusion models do not infer population-based spatial histories. Instead, they identify the ancestral history of a particular sample of molecular sequences, namely when and where direct ancestors of the sample existed. Nevertheless, such models can extract important information about the processes that underlie geographic structure in genetic data. For example, the inferred movement of direct ancestors reflects movement of or between historical populations. Notably, by limiting the search to ancestors, the spatial diffusion approach can accommodate geographical context more explicitly by modeling the diffusion as trait evolution in a phylogeny.

Phylogenies naturally lend themselves to exploration of population structure and to assessment of significance in various ways [45]. More challenging, however, is an analytical reconstruction of how this structure took shape throughout the phylogenetic history. It is almost a law of statistical inquiry that challenging problems are initially probed by heuristic approaches. Parsimony methods represent popular model-free heuristic approaches to reconstruct ancestral histories [46] for nucleotides, discrete organism traits and geographical locations [47]. Probabilistic approaches to ancestral reconstruction use continuous-time Markov chain (CTMC) models to infer discrete realizations in continuous time. Although conceptually simple, they can capture considerable complexity of the discrete transition process among many states, such as geographic locations, albeit at the expense of an increased number of instantaneous rate parameters that make up the CTMC rate matrix. An obvious criticism is that a single observation does not hold the information required to estimate all parameters in the rate matrix because the intermediate states of this process are not observed. Nonetheless, this is not a reason to abandon probabilistic approaches.

Others have reviewed ancestral reconstruction methods and highlighted the advantages of likelihood-based statistical methods [48,49]. Maximum likelihood methods and, to a greater extent, Bayesian inference accommodate important sources of uncertainty inherent in estimating character evolution [49]. A recent example of a Bayesian approach to island biogeography, in which specific Canary Islands represent the character state data, demonstrated great potential when applied to many groups of organisms evolving in independent phylogenies [50]. For a single character, however, a probabilistic approach that diligently accounts for phylogenetic and parameter uncertainty can yield particularly high variance estimates.

Recent developments in Bayesian phylogeography have yielded great advances towards efficient statistical inference of spatial diffusion [15]. In general, Bayesian approaches are more robust against over-parameterization via priors and this can be exploited most efficiently in discrete phylogeography by specifying informative priors for the rate parameters. For example, prior distributions based on distance could formalize the prior expectation of higher organismal lineage exchange between adjacent island groups in the Canary Island biogeography [50]. Such distance-informed priors have recently been explored in viral epidemiology [15]. Arguably a more important advance in the latter study is the ability to reduce the set of rate parameters to a limited number that provides a parsimonious description of the spatial diffusion process. This was achieved using a Bayesian stochastic search for variable selection [51], which augments the rate matrix with indicator variables and imposes a prior distribution on the total sum of non-zero rate parameters. Parameterization of the truncated Poisson prior distribution proposed by Lemey et al. yields statistical efficiency and was motivated by the belief that most pairwise exchange rates are probably not required to explain diffusion along a phylogeny [15]. This approach has been used to construct a formal Bayes factor test to establish epidemiological linkage in viral phylogeographic histories [15] and removes the need to resort to ad hoc procedures [47].
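As a rough sketch of how indicator variables can feed a Bayes factor test, the function below compares posterior and prior odds that a given rate is non-zero. It assumes the prior inclusion probability implied by the chosen prior on the number of non-zero rates is already known and simply passed in; the function name, the clamping guard and the example numbers are illustrative assumptions rather than the published implementation.

```python
def inclusion_bayes_factor(indicator_samples, prior_inclusion):
    """Bayes factor for 'rate is non-zero' from sampled 0/1 indicators."""
    n = len(indicator_samples)
    posterior = sum(indicator_samples) / n
    # Guard against a posterior of exactly 0 or 1 in a finite MCMC sample.
    posterior = min(max(posterior, 0.5 / n), 1.0 - 0.5 / n)
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior_inclusion / (1.0 - prior_inclusion)
    return posterior_odds / prior_odds

# Hypothetical MCMC output: the indicator was 1 in 90 of 100 samples,
# under a prior that includes this rate with probability 0.2.
samples = [1] * 90 + [0] * 10
bf = inclusion_bayes_factor(samples, prior_inclusion=0.2)
```

In this toy case the posterior odds are 9 and the prior odds 0.25, so the data favor including the rate by a factor of about 36.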

A basic assumption of discretized dispersal is that ancestral organisms at any point along a phylogeny reside in the locations from which samples were drawn. Spatial diffusion in continuous space represents an attractive alternative for more realistic phylogeographic inference. Following Brownian motion models for quantitative traits, bivariate Brownian random walks have been used for phylogeographic analysis in a likelihood approach [52]. This basic model was greatly expanded and introduced into a Bayesian statistical framework [16]. To overcome the limiting assumption of a constant diffusion rate throughout a phylogeny, rate heterogeneity is achieved using notions borrowed from uncorrelated relaxed clock models [53]. In our opinion, this represents a much-needed extension because it might be difficult to find any real-life example of spatial diffusion that adheres to a homogeneous Brownian process. For example, mountains, valleys, water bodies and the environmental tolerance of organisms complicate the situation.

In particular, these relaxed random walks admit several diffusion processes commonly used to model animal movement in ecological contexts [54]. These include independent Cauchy- and, more generally, Student-t-distributed increments along each branch of an uncertain phylogeny. These heavy-tailed distributions can have infinite variance and derive motivation from Lévy flight models [55,56]. Further, the relaxed walks do not require strict enforcement of the power-law tail behavior that remains contentious in some animal movement studies [57]. Encouragingly, even for relatively simple epidemic expansions, relaxed random walks yield reconstructions with increased statistical efficiency. We believe this can add considerable credibility to spatial reconstructions in continuous space, in particular because Brownian trait estimates have often been found too variable for practical use [58].
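The contrast between a homogeneous Brownian walk and a relaxed walk can be sketched by simulating displacements along single branches. The construction below is one standard way to obtain Student-t increments, assumed here for illustration: each branch draws its own rate scalar from a gamma distribution, and the normal variance is divided by that scalar. The function name, `sigma`, and the sample sizes are hypothetical.

```python
import math
import random

def branch_displacement(duration, sigma, rng, nu=None):
    """Displacement (dx, dy) along one branch of the given duration.

    nu=None gives homogeneous Brownian diffusion. Otherwise the branch gets
    its own rate scalar phi ~ Gamma(shape=nu/2, scale=2/nu); dividing the
    variance by phi makes the marginal increment Student-t with nu degrees
    of freedom (Cauchy when nu == 1).
    """
    phi = 1.0 if nu is None else rng.gammavariate(nu / 2.0, 2.0 / nu)
    sd = sigma * math.sqrt(duration / phi)
    return rng.gauss(0.0, sd), rng.gauss(0.0, sd)

rng = random.Random(42)
brownian = [branch_displacement(0.25, 2.0, rng) for _ in range(20000)]
relaxed = [branch_displacement(0.25, 2.0, rng, nu=1.0) for _ in range(20000)]

# Brownian variance scales linearly with duration: sigma**2 * duration = 1.0.
brownian_var_x = sum(dx * dx for dx, _ in brownian) / len(brownian)
largest_brownian = max(abs(dx) for dx, _ in brownian)
largest_relaxed = max(abs(dx) for dx, _ in relaxed)
```

The relaxed walk occasionally produces very long jumps on individual branches, which is the behavior that lets it accommodate rare long-distance dispersal without inflating the diffusion rate everywhere.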

The continuous and discrete stochastic processes furnish complementary models that expand our ability to infer phylogeographic processes. They are both implemented in the BEAST software package that has matured into a rich resource of stochastic evolutionary models [59]. Focusing entirely on time-measured phylogenies and offering different flavors of demographic models, such an integrated approach seems to be very suitable for biogeographical analyses. The first phylogeographic steps with this framework were made in the field of viral epidemiology, in particular for rapidly evolving RNA viruses [8,16]. As general tools, however, these models are also useful in analyses of genetic data from more slowly evolving organisms. To illustrate this, we provide a phylogeographic reconstruction in continuous space for the freshwater snail Biomphalaria glabrata, a major vector of schistosomiasis in the New World [60] (Box 2). Box 2 also opens a discussion on visualization of inferred evolutionary histories through both time and space.

Box 2. Continuous phylogeographic diffusion and visualization challenges

We revisit the phylogeography of the freshwater snail Biomphalaria glabrata, which has limited dispersal ability and occupies fragmented habitats in South America and the Caribbean islands [60]. Mavarez et al. originally obtained sequence data from populations of B. glabrata sampled at the scale of its current geographic distribution to study its phylogeography [60]. As illustrated in Figure I, we applied the recently developed continuous diffusion model to 17 concatenated nuclear internal transcribed spacer-2 (484 bp) gene sequences and the partial mitochondrial large ribosomal subunit (374 bp) generated in the original study. The reconstruction confirms clear separation into northern (Venezuela and Lesser Antilles) and southern clades (Brazil), in line with the Amazon River as an important barrier to gene flow [60]. Because the equatorial forest of the Amazon basin does not provide appropriate habitats for this species, it might represent an additional barrier to gene flow [78]. The considerable divergence among sampled haplotypes led to the suggestion that B. glabrata constitutes a species complex [60]. As suggested by clustering of the Lesser Antilles haplotypes in the same monophyletic clade [79,80], there also seems to be significant isolation by distance on a more restricted geographic scale.

This illustration also touches on visualization tools that complement Bayesian inference. Phylogenies have previously been projected in virtual globe software [81], but we argue that these efforts should be directed at model-based estimates. If probabilistic approaches can rigorously capture the uncertainty in phylogeographic inference, visualization tools that can fully present this uncertainty are critical for appropriate interpretation. For discrete ancestral reconstruction, this still remains an open issue; whereas location probability distributions for ancestral nodes can be readily obtained, phylogenetic uncertainty adds a level of complexity that hinders straightforward visual summaries. Phylogeographic estimates through time and continuous space are more naturally streamlined into rich visualization [16]. Not only does this model offer an ability to draw inference about location realizations and various summary statistics at arbitrary points in time, but researchers can now also explore results in an interactive fashion (for several examples, refer to www.phylogeography.org). Finally, more work is needed to introduce these summaries in more dedicated visualization software [82] or accommodate them in feature-rich, multi-purpose geographic information systems software [83].

Figure I. Freshwater snail phylogeny (maximum clade credibility tree) obtained using Bayesian phylogeographic inference in continuous space. The maps in the virtual globe visualization are based on satellite pictures available in Google Earth (http://earth.google.com). (a) Projection of the complete phylogeny with a red-white gradient denoting the relative age of the branches. The location uncertainty for each node is visualized by a polygon representing an 80% credibility contour. (b) Detailed view of the Northern phylogeographic clade, as indicated by the rectangle in panel a. The orientation of the view is indicated by matching arrows in both panels. Taxon habitat (freshwater) is an important geographic feature to begin modeling directly.


Population genetics approach

By far the most popular statistical approaches to phylogeography rely on the structured-coalescent framework [9,44,61,62]. In general, these methods assume that evolutionary trees are random draws from some underlying population-level process [9]. Essentially, population-level processes fossilize their histories as evolutionary trees that we indirectly view through molecular sequence and other data [10,61]. These processes include selection, migration, population size changes and recombination.

Nearly all population-level approaches embed evolutionary trees in a structured-coalescent framework to make inference [63]. Herein lie the strengths and limitations of these methods. If constructed appropriately, structured-coalescent methods allow us to infer population-level information from a relatively small sample. Unfortunately, constructing these methods in an appropriate manner requires extensive time, computation, and ingenuity. The task is not trivial. More than a decade ago, Felsenstein et al. reported that any implementation of structured-coalescent and related models took nearly 2 years to develop [64]. Nielsen and Beaumont recently suggested that this is still the case [9]. Moreover, the field still has no way to handle even simple recombination between linked loci in a structured-coalescent framework.

Box 3. Open issues approaching solution

Availability of geo-coded sequence data

The inclusion of geographical data in molecular phylogenetics increases the size and complexity of data sets. As a benefit, investigators can learn the relationships between several disparate aspects, gaining a better understanding of biological diversity and history. A disadvantage of these complex data sets, however, is their reproducibility and validity. In the past, investigators could reproduce and validate inferences because models and data sets were readily available. At present, however, inferences are more difficult to reproduce because geographic data are rarely provided with the molecular sequences and many modeling assumptions go

Nevertheless, population genetic approaches for ancestral inference are flourishing. In the past few years, numerous highly significant contributions have been published. As examples, Liu and Pearl [65] and Heled and Drummond [18] fully took into account gene tree uncertainty and species tree uncertainty through Bayesian statistical modeling [66,67]. Focusing on gene flow and migration, Hey [17] extended his previous work to incorporate migration between multiple species. Also focusing on gene flow, Beerli and Palczewski [19] used thermodynamic integration to compute Bayes factors assessing panmixia and migration in a structured-coalescent framework [44]. Approximate Bayesian computation enables empirical investigators to assess complex demographic histories, conditional likelihood methods facilitate approximation of the coalescent with recombination, and efficient calculation can be used for full assessment [9].

In summary,much aswith the previous two approaches,application of population genetics to phylogeographyremains a vibrant field of research. Highly significant workis being carried out, but numerous biologically relevantmethods need to be created, implemented and adapted.This route to phylogeography is far from a dead end.

Three routes to the same destinationWith the expansion of phylogeography in new Bayesiandirections involving NCPA and spatial diffusion, the fieldmight seem to be fragmenting. However, we believe allthree approaches address the same basic question posedat the beginning of this article: what are the most effec-tive ways to learn about phylogeographic processes fromgeospatially identified molecular sequence data? Essen-tially, we believe that all three frameworks produceeffective answers, if we know what questions to ask.Under the comparative approach, if we want to assess[(Figure_1)TD$FIG]

Figure 1. Three approaches to phylogeography. In all three windows, the data

represent three species (red, purple and blue) sampled across an island (green). For

the comparative approach, the haplotype tree is displayed on the left and concentric

circles on the island represent the geographical distribution of the three species. For

the spatial diffusion approach, the phylogenetic tree superimposed on the

geographical location of the samples is displayed on the left. The phylogenetic

tree is shown in black to reinforce the notion that population information is not

inferred. The gray areas behind the phylogenetic tree and on the island represent a

high-probability region contour for the locations of the ancestors of the taxa

sampled. For the population genetics approach, the five colored polygons represent

the ancestral populations and their sizes. Arrows represent strong migrations

between the populations. Currently, the idealized population polygons abstract all

geographic features to keep the models tractable. This image is based on Hey [17].

630

geographical spread and molecular data without model-ing this spread explicitly, the method of Manolopoulouwill be the most appropriate. If we want to model rates ofspread and take into account geographical features, themethod of Lemey et al. [16] will be most appropriate. Ifwe want information about population size and migra-tion, but consider population to be a coarse feature, thepopulation genetics approach will be most appropriate.

unreported. One way to avoid this difficultly is to require geo-coded

sequence submission and a central online repository in which

investigators can upload their computational scripts. Some journals

have started to require such uploads for publications, but the push

needs to be stronger.

Incorporating geographic features and niche modeling

The structured-coalescent and discretized spatial diffusion ap-

proaches generate models into which researchers can input a

limited amount of information about geographic or niche features.

This input involves the definition of prior distributions for migration

rates between populations or geographic locations. For these forms

of discrete Markov chain models, O’Brien et al. offered advice on

how to make summary statistics about the diffusion process robust

to some of the model misspecification that results from ignoring

additional features [84]. However, it is unclear how effective

informed prior specification will remain when researchers attempt

to learn from observed data which geographic or niche features

significantly affect migration or diffusion. In these situations, direct

modeling of geographic features as hard or soft barriers in a

continuous diffusion framework seems to be a tenable solution.

Ranking of features according to their barrier strength or testing

which strengths do not significantly differ from zero produces a

sound statistical framework. Efficient computation of transition

probabilities along the tree of migration or diffusion in the presence

of multiple, possibly irregular, barriers remains a modeling

challenge.

Next-generation sequencing

With the advent of next-generation sequencing technologies,

biologists can now obtain de novo genomic samples from entire

biological communities. However, metagenomic studies are still in

their infancy. In particular, the analysis and design of metagenomic

studies are unrefined and rudimentary. Advances in computational

efficiency will be vital for application of model-based methods in

this area. Moreover, model-based methods will need to account for

the data collection technology, as well as the phylogeographic

history of a sample under study. Currently, a lack of theoretical tools

is hindering scientific progress in this area.

To tree or not to tree

The relationship between an inferred phylogenetic tree and its

embedded population has been extensively described over the past

30 years [9,62,85]. Nevertheless, this relationship remains poorly

understood, as evidenced by the three distinct approaches high-

lighted in this paper. Put simply, we still poorly understand whether

inferred evolutionary history is a nuisance or a fundamental entity

[86]. Resolution of this issue will probably require extensive work in

the coming decade.

Opinion Trends in Ecology and Evolution Vol.25 No.11

Figure 1 provides a schematic of these three phylogeo-graphical frameworks.

The question arises as to whether the advantages of all three approaches can be combined into a single framework. We hesitate to speculate on this particular question, and simply leave it as open and tractable and encourage its exploration (see Box 3 for other open issues on the verge of being solved). We note that with a shift towards likelihood-based statistical inference, ad hoc procedures are no longer the only or preferable methods available. Instead, seeking novel insights into phylogeographic processes is reduced to envisioning stochastic process-driven models that incorporate the relevant aspects of the research question at hand and then developing the tools necessary to efficiently fit these models to data. The limiting aspects of this approach are chiefly creativity in modeling and effective strategies for reducing high-dimensional model fits to a human-interpretable form.

Acknowledgments

We thank Chris Simon, Allen Rodrigo, Brian C. Carstens, H. Lisle Gibbs, Laura S. Kubatko, Peter Beerli and Ioanna Manolopoulou for their comments and suggestions. We also thank the National Evolutionary Synthesis Center (NSF #EF-0423641) for fostering our collaboration. Alexei J. Drummond contributed greatly to discussions on an earlier version of this paper. EWB is supported in part by the National Science Foundation under Agreement No. 0635561. PL is supported by a postdoctoral fellowship from the Fund for Scientific Research (FWO) Flanders. MAS is supported in part by the United States National Institutes of Health (R01 GM086887), the National Science Foundation (DMS 0856099) and the Marsden Fund, New Zealand. Research leading to this article received funding from the European Research Council under the European Union Seventh Framework Programme (FP7/2007-2013) and ERC Grant agreement #260864.

References

1 Searle, J. et al. (2009) Of mice and (Viking?) men: phylogeography of British and Irish house mice. Proc. R. Soc. B Biol. Sci. 276, 201–207
2 Fagundes, N. et al. (2007) Statistical evaluation of alternative models of human evolution. Proc. Natl. Acad. Sci. U. S. A. 104, 17614–17619
3 Li, J. et al. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104
4 von Holdt, B. et al. (2010) Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature 464, 898–902
5 Biek, R. et al. (2006) A virus reveals population structure and recent demographic history of its carnivore host. Science 311, 538–541
6 Biek, R. et al. (2007) A high-resolution genetic signature of demographic and spatial expansion in epizootic rabies virus. Proc. Natl. Acad. Sci. U. S. A. 104, 7993–7998
7 Smith, G. et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459, 1122–1125
8 Lemey, P. et al. (2009) Reconstructing the initial global spread of a human influenza pandemic: a Bayesian spatial–temporal model for the global spread of H1N1. PLoS Curr. Influenza RRN1031
9 Nielsen, R. and Beaumont, M. (2009) Statistical inferences in phylogeography. Mol. Ecol. 18, 1034–1047
10 Knowles, L. (2009) Statistical phylogeography. Annu. Rev. Ecol. Evol. Syst. 40, 593–612
11 Avise, J. (2009) Phylogeography: retrospect and prospect. J. Biogeog. 36, 3–15
12 Hickerson, M. et al. (2010) Phylogeography’s past, present, and future: 10 years after Avise, 2000. Mol. Phylogenet. Evol. 54, 291–301
13 Brooks, S. et al. (2007) Assessing the effect of genetic mutation: a Bayesian framework for determining population history from DNA sequence data. In Bayesian Statistics (Vol. 8) (Bernardo, J.M. et al., eds), pp. 1–26, Oxford University Press
14 Templeton, A. (2010) Coalescent-based, maximum likelihood inference in phylogeography. Mol. Ecol. 19, 431–435
15 Lemey, P. et al. (2009) Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5, e1000520
16 Lemey, P. et al. (2010) Phylogeography takes a relaxed random walk in continuous space and time. Mol. Biol. Evol. 27, 1877–1885
17 Hey, J. (2010) Isolation with migration models for more than two populations. Mol. Biol. Evol. 27, 905–920
18 Heled, J. and Drummond, A. (2010) Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27, 570–580
19 Beerli, P. and Palczewski, M. (2010) Unified framework to evaluate panmixia and migration direction among multiple sampling locations. Genetics 185, 313–326
20 Beaumont, M. et al. (2010) In defence of model-based inference in phylogeography. Mol. Ecol. 19, 436–446
21 Camargo, A. et al. (2009) Phylogeography of the frog Leptodactylus validus (Amphibia: Anura): patterns and timing of colonization events in the Lesser Antilles. Mol. Phylogenet. Evol. 53, 571–579
22 Phillipsen, I. and Metcalf, A. (2009) Phylogeography of a stream-dwelling frog (Pseudacris cadaverina) in southern California. Mol. Phylogenet. Evol. 53, 152–170
23 Gante, H. et al. (2009) Diversification within glacial refugia: tempo and mode of evolution of the polytypic fish Barbus sclateri. Mol. Ecol. 18, 3240–3255
24 Felsenstein, J. (1985) Phylogenies and the comparative method. Am. Nat. 125, 1–15
25 Harvey, P. and Pagel, D. (1991) The Comparative Method in Evolutionary Biology, Oxford University Press
26 Templeton, A. (1998) Nested clade analyses of phylogeographic data: testing hypotheses about gene flow and population history. Mol. Ecol. 7, 381–397
27 Panchal, M. and Beaumont, M. (2007) The automation and evaluation of nested clade phylogeographic analysis. Evolution 61, 1466–1480
28 Templeton, A. et al. (1992) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. III. Cladogram estimation. Genetics 134, 659–669
29 Clement, M. et al. (2000) TCS: a computer program to estimate gene genealogies. Mol. Ecol. 9, 1657–1659
30 Posada, D. and Crandall, K. (2001) Intraspecific gene genealogies: trees grafting into networks. Trends Ecol. Evol. 16, 37–45
31 Templeton, A. et al. (1987) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. I. Basic theory and an analysis of alcohol dehydrogenase activity in Drosophila. Genetics 117, 343–351
32 Templeton, A. et al. (1993) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. IV. Nested analyses with cladogram uncertainty and recombination. Genetics 134, 659–669
33 Cassens, I. et al. (2005) Evaluating intraspecific “network” construction methods using simulated sequence data: do existing algorithms outperform the global maximum parsimony approach? Syst. Biol. 54, 363–372
34 Knowles, L. (2008) Why does a method that fails continue to be used? Evolution 62, 2713–2717
35 Templeton, A. (2004) A maximum likelihood framework for cross validation of phylogeographic hypotheses. In Evolutionary Theory and Processes: Modern Horizons (Wasser, S., ed.), pp. 209–230, Kluwer Academic Publishers
36 Panchal, M. and Beaumont, M. (2010) Evaluating nested clade phylogenetic analysis under models of restricted gene flow. Syst. Biol. 59, 415–432
37 Wong, K. et al. (2008) Alignment uncertainty and genomic analysis. Science 319, 473–476
38 Redelings, B. and Suchard, M. (2009) Robust inferences from ambiguous alignments. In Sequence Alignment: Methods, Models, Concepts, and Strategies (Rosenberg, M., ed.), pp. 209–271, University of California Press
39 Suchard, M. et al. (2001) Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18, 1001–1013
40 Redelings, B. and Suchard, M. (2005) Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54, 401–418


41 Novak, A. et al. (2008) StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24, 2403–2404
42 Drummond, A. et al. (2005) Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22, 1185–1192
43 Minin, V. et al. (2008) Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Mol. Biol. Evol. 25, 1459–1471
44 Beerli, P. and Felsenstein, J. (2001) Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. U. S. A. 98, 4563–4568
45 Zarate, S. et al. (2007) Comparative study of methods for detecting sequence compartmentalization in human immunodeficiency virus type 1. Virology 81, 6643–6651
46 Swofford, D. and Maddison, W. (1992) Parsimony, character-state reconstructions and evolutionary inferences. In Systematics, Historical Ecology, and North American Freshwater Fishes (Mayden, R., ed.), pp. 186–223, Stanford University Press
47 Wallace, R. et al. (2007) A statistical phylogeography of influenza A H5N1. Proc. Natl. Acad. Sci. U. S. A. 104, 4473–4478
48 Cunningham, C. et al. (1998) Reconstructing ancestral character states: a critical reappraisal. Trends Ecol. Evol. 13, 361–366
49 Ronquist, F. (2004) Bayesian inference of character evolution. Trends Ecol. Evol. 19, 475–481
50 Sanmartin, I. et al. (2008) Inferring dispersal: a Bayesian approach to phylogeny-based island biogeography, with special reference to the Canary Islands. J. Biogeog. 35, 428–449
51 Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B 60, 65–81
52 Lemmon, A. and Lemmon, E. (2008) A likelihood framework for estimating phylogeographical history on a continuous landscape. Syst. Biol. 57, 544–561
53 Drummond, A. et al. (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol. 4, e88
54 Paradis, E. et al. (2002) Modeling large-scale dispersal distances. Ecol. Model. 151, 279–292
55 Viswanathan, G. et al. (1996) Levy flight search patterns of wandering albatrosses. Nature 381, 413–415
56 Reynolds, A. and Rhodes, C. (2009) The Levy flight paradigm: random search patterns and mechanisms. Ecology 90, 877–887
57 Edwards, A. et al. (2007) Revisiting Levy flight search patterns of wandering albatrosses, bumblebees and deer. Nature 449, 1044–1048
58 Schluter, D. et al. (1997) Likelihood of ancestral states in adaptive radiation. J. Org. Evol. 51, 1699–1711
59 Drummond, A. and Rambaut, A. (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214
60 Mavarez, J. et al. (2002) Evolutionary history and phylogeography of the schistosome-vector freshwater snail Biomphalaria glabrata based upon nuclear and mitochondrial DNA sequences. Heredity 89, 266–272
61 Avise, J. (2000) Phylogeography: The History and Formation of Species, Harvard University Press
62 Hey, J. and Machado, C. (2003) The study of structured populations – a new hope for a difficult and divided science. Nat. Rev. Genet. 4, 535–543
63 Kingman, J. (1982) The coalescent. Stochastic Processes Appl. 13, 235–248


64 Felsenstein, J. et al. (1999) Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. In Statistics in Molecular Biology (Seillier-Moiseiwitsch, F., ed.) IMS Lecture Notes, Vol. 33, pp. 163–184, Institute of Mathematical Statistics
65 Liu, L. and Pearl, D. (2007) Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol. 56, 504–514
66 Edwards, S. (2009) Is a new and general theory of molecular systematics emerging? Evolution 63, 1–19
67 Degnan, J. and Rosenberg, N. (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340
68 Templeton, A. et al. (1995) Separating population structure from population history – a cladistic analysis of the geographical distribution of mitochondrial DNA haplotypes in the tiger salamander, Ambystoma tigrinum. Genetics 140, 767–782
69 Knowles, L. and Maddison, W. (2002) Statistical phylogeography. Mol. Ecol. 11, 2623–2635
70 Templeton, A. (2004) Statistical phylogeography: methods of evaluating and minimizing inference errors. Mol. Ecol. 13, 789–809
71 Petit, R. (2008) The coup de grace for nested clade phylogeographic analysis. Mol. Ecol. 17, 516–518
72 Beaumont, M. and Panchal, M. (2008) On the validity of nested clade phylogeographic analysis. Mol. Ecol. 17, 2563–2565
73 Garrick, R. et al. (2008) Babies and bathwater: a comment on the premature obituary for nested clade phylogeographical analysis. Mol. Ecol. 17, 1401–1403
74 Templeton, A. (2008) Nested clade analysis: an extensively validated method for strong phylogeographic inference. Mol. Ecol. 17, 1877–1880
75 Templeton, A. (2009) Why does a method that fails continue to be used? The answer. Evolution 63, 807–812
76 Templeton, A. (2009) Statistical hypothesis testing in intraspecific phylogeography: nested clade phylogeographic analysis vs. approximate Bayesian computation. Mol. Ecol. 18, 319–331
77 Templeton, A. (2010) Coherent and incoherent inference in phylogeography and human evolution. Proc. Natl. Acad. Sci. U. S. A. 107, 6376–6381
78 Paraense, W. (1983) A survey of planorbid molluscs in the Amazonian region of Brazil. Mem. Inst. Oswaldo Cruz 78, 343–361
79 Mavarez, J. et al. (2002) Fine-scale population structure and dispersal of Biomphalaria glabrata, the intermediate snail host of Schistosoma mansoni, in Venezuela. Mol. Ecol. 11, 879–899
80 Mavarez, J. et al. (2002) Genetic differentiation, dispersal and mating system in the schistosome-transmitting freshwater snail Biomphalaria glabrata. Heredity 89, 258–265
81 Janies, D. et al. (2007) Genomic analysis and geographic visualization of the spread of avian influenza (H5N1). Syst. Biol. 56, 321–329
82 Parks, D. et al. (2009) GenGIS: a geospatial information system for genomic data. Genome Res. 19, 1896–1904
83 Kidd, D. and Ritchie, M. (2006) Phylogeographic information systems: putting the geography into phylogeography. J. Biogeog. 33, 1851–1865
84 O’Brien, J. et al. (2009) Learning to count: robust estimates for labeled distances between molecular sequences. Mol. Biol. Evol. 26, 801–814
85 Felsenstein, J. (1988) Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22, 521–565
86 Smouse, P. (1998) To tree or not to tree. Mol. Ecol. 7, 399–412