Three roads diverged? Routes to phylogeographic inference
Erik W. Bloomquist1, Philippe Lemey2 and Marc A. Suchard3,4
1 Mathematical Biosciences Institute, The Ohio State University, Columbus, OH 43210, USA
2 Department of Microbiology and Immunology, Rega Institute, K.U. Leuven, Leuven 3000, Belgium
3 Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095, USA
4 Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA
Opinion
Glossary
Ancestral history: any information about the direct ancestors of a sample of
molecular sequences. This term can refer, for example, to inferred sequence
composition or phenotype, such as geography, and is often associated with a
time scale.
Approximate Bayesian computation (ABC): simulation technique used to draw
statistical inference based on data summaries, often used when computation
of the full data likelihood is impractical.
Bayes factor: ratio of the marginal likelihoods of a given data set comparing
two competing models that naturally incorporates uncertainty about unknown
parameters in both models.
Bayesian stochastic search variable selection: framework that estimates the
posterior probability that a particular explanatory variable should be included
in a model. The method is most commonly used in Bayesian inference of linear
regression.
BEAST: open-source Markov chain Monte Carlo (MCMC) package for analysis of
several Bayesian evolutionary models for molecular sequences and associated
traits, such as geography.
Brownian diffusion: stochastic process on the real number line or, in the work
described in this article, on a geographic surface in which increments are
independent and normally distributed with mean zero and variance that scales
linearly with duration.
Continuous-time Markov chain: stochastic process on a discrete state
space or, in the work described in this article, a set of locations that is
memoryless and whose waiting times between transitions are exponentially
distributed.
Comparative approach: framework for relating observed phenotype information
to an evolutionary history. Nested clade phylogeographic analysis
assumes that geography is a phenotypic trait and falls into this category.
Model-based approach: method in which a fully specified probabilistic model
describes how observed data are generated. Unknown parameters can
characterize this model, and statistical inference involves estimating and
testing these parameters.
Nested clade phylogeographic analysis (NCPA): resampling-based approach
for inference of geographic information from haplotype trees and networks.
Phylogeography: interdisciplinary field involving study of the evolutionary
history and geographic spread of biological populations or taxa.
Population genetics: as used in this article, a framework that uses a sample of
molecular sequences to make statements about a population under study. The
coalescent is the most widely used population genetics framework. Extensions
of the coalescent to phylogeography typically focus on migration rates
between multiple populations fixed in space.
Spatial diffusion model: application of independent stochastic processes to
describe changes in geographic phenotypes on a two-dimensional surface.
Under the discrete framework described in this article, the surface is divided
into discrete regions and movement between regions is modeled as a
continuous-time Markov chain. Under the continuous diffusion framework,
these regions are subdivided until they become infinitesimal and
generalizations of Brownian diffusion are then considered.
Structured coalescent: extension of the basic coalescent model to multiple
populations. This method focuses on ancestral population sizes and migration
rates between populations.

Phylogeographic methods facilitate inference of the geographical history of genetic lineages. Recent examples explore human migration and the origins of viral pandemics. There is longstanding disagreement over the use and validity of certain phylogeographic inference methodologies. In this paper, we highlight three distinct frameworks for phylogeographic inference to give a taste of this disagreement. Each of the three approaches presents a different viewpoint on phylogeography, most fundamentally on how we view the relationship between the inferred history of a sample and the history of the population the sample is embedded in. Satisfactory resolution of this relationship between history of the tree and history of the population remains a challenge for all but the most trivial models of phylogeographic processes. Intriguingly, we believe that some recent methods that entirely avoid inference about the history of the population will eventually help to reach a resolution.

Emerging pathways of phylogeographic inference
The influence of phylogeography is spreading throughout biology. Among other examples, phylogeographic techniques have enabled us to infer the origins of mice [1], modern humans [2,3], and man’s “best friend”, the domesticated dog [4]. Phylogeographic analyses also enable public health officials to understand the origin and spread of emerging infectious diseases [5–8]. In spite of these successes, disagreement and confusion persist regarding the most effective ways to learn about phylogeographic processes from geospatially identified molecular sequence data. Major points of contention include: how to model the phylogeographic spread of represented taxa under study; what statistical frameworks provide effective inference tools; and how to best reconcile population frameworks with geographic information. Over the past few years, these questions have been extensively addressed in other reviews and we do not repeat points already made [9–12]. Instead, we highlight several newer phylogeographic approaches: a Bayesian approach to nested clade phylogeographic analysis (NCPA) [13,14], stochastic-process-driven spatial diffusion models that map viral outbreaks [15,16], and several recent population genetic approaches to phylogeography [17–19]. We draw parallels between these methods in an attempt to shed light on their similarities and differences. We hope that by exposing a general audience to a cross-section of methods, new perspectives will emerge on longstanding issues in phylogeography.

Corresponding author: Suchard, M.A. ([email protected]).
0169-5347/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2010.08.010 Trends in Ecology and Evolution, November 2010, Vol. 25, No. 11
Comparative approach
For almost a decade, supporters of NCPA [14] and of model-based phylogeographic methods [20] have argued over the merits of each method. As with previous debates in phylogenetics, this has generated several positive outcomes; most importantly, investigators now vigorously question assumptions when analyzing geographic, demographic and evolutionary data [21]. Nevertheless, this long-standing discussion has introduced considerable confusion [22,23]. However, the debate now seems to be coming to an end [20]. In addition, a new model-based approach (highlighted below) overcomes many points of contention over NCPA, making the debate increasingly moot at this point. For those not familiar with the debate over NCPA and model-based approaches, we provide a synopsis in Box 1.
Before discussing this newer approach, we provide a short background to NCPA, a technique that stood alone for many years in explicitly modeling geography in an evolutionary context. The method originated in previous efforts to jointly analyze phenotypic and phylogenetic data in a comparative framework [24,25] and follows the same general three-step scheme [9,13,26,27]. First, a haplotype tree or network is constructed from molecular data using a variety of approaches [28–30]. Second, clades on the haplotype tree are nested, typically following the stepwise guidelines of Templeton [31,32]. Finally, using these nested clades, a permutation test assesses the significance of the geographical spread.
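The third step can be illustrated with a generic permutation test. The sketch below is not Templeton's procedure: the test statistic (mean planar distance of a clade's samples to their centroid), the function names and the example coordinates are all assumptions chosen for illustration. It asks only whether a focal clade is more geographically restricted than expected if clade membership were unrelated to sampling location.

```python
import math
import random


def clade_distance(points):
    """Mean planar distance of a clade's samples to their centroid (illustrative statistic)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)


def permutation_p_value(locations, in_clade, n_perm=9999, seed=0):
    """One-sided p-value: is the focal clade more geographically restricted than chance?

    locations: list of (x, y) sampling coordinates, one per sampled sequence
    in_clade:  list of bools, True where the sequence belongs to the focal clade
    """
    rng = random.Random(seed)
    observed = clade_distance([p for p, m in zip(locations, in_clade) if m])
    labels = list(in_clade)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between clade membership and place
        permuted = clade_distance([p for p, m in zip(locations, labels) if m])
        if permuted <= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: two tight geographic clusters; the focal clade is the first.
locations = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
in_clade = [True, True, True, True, False, False, False, False]
p_value = permutation_p_value(locations, in_clade)
```

A small p-value here says only that the focal clade is spatially clustered under this particular statistic; it does not by itself identify the process that produced the clustering.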
Box 1. Background to the NCPA debate
NCPA originated in a single-locus formulation in a landmark paper by Templeton [26] that addresses the construction of a haplotype tree [28], the nesting of clades on a haplotype tree [31,32], and the use of a permutation test and inference key to interpret significant results [26,68]. The debate began when Knowles and Maddison showed that single-locus NCPA has a high false-positive rate [69]. Later, Panchal and Beaumont found similar high false-positive rates [27,36].
In response to these claims, Templeton [35,70] modified his original NCPA, introduced a multi-locus cross-validated version, and conducted an extensive empirical validation study of single-locus NCPA. Nevertheless, supporters of NCPA and those against it soon released a flurry of papers either denouncing the method [34,71,72] or defending it [73–75]. Following this, both sides began to question the philosophical basis of the two methods [9,76]. Within the past 6 months, Templeton [14,77] and those criticizing his method [20] have continued this debate.
The debate has focused around two general points: (i) do single-locus and multi-locus NCPA have an inherently high false-positive rate and does this preclude their use? and (ii) do model-based methods or NCPA provide a more appropriate way to analyze phylogeographic data? A decade ago, the answer to the latter question was NCPA, because model-based methods for phylogeography were not widely available and, arguably, only NCPA could explicitly incorporate geographic features. Today, however, this situation has changed. Spatial diffusion and population genetics approaches are attracting attention; the former currently incorporates important features, such as physical distance and known geographic barriers, and the latter has the potential to do so. We believe that Beaumont et al. [20] resolve the questions posed above in favor of model-based methods. Nevertheless, we encourage others to read the relevant publications to form their own opinion.

Significant ambiguity occurs in all three stages of the NCPA approach [9]. During the first and second stages, alternative methods can lead to different, and possibly better, haplotype tree inference and nesting structures [13,33]. More importantly, in the third stage, determination of the significance level for a particular clade does not typically take into account multiple testing, which probably leads to a high false-positive rate [9,34] (Box 1). Templeton introduced a multi-locus cross-validated remedy for the apparent high false-positive rate [35], although this revision does not provide a complete solution [20,36]. It should be noted that Templeton disagrees with the ambiguous label for NCPA and its apparent high false-positive rate [14], but extensive research suggests otherwise (Box 1).
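The multiple-testing concern can be made concrete with a short calculation. The sketch below assumes that the clade-level tests are independent, which nested clades generally are not, so the numbers indicate only the direction and rough size of the effect. The function names are illustrative.

```python
def familywise_false_positive_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(alpha, n_tests):
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


# Ten clades, each tested at the 5% level without correction.
uncorrected = familywise_false_positive_rate(0.05, 10)
corrected = familywise_false_positive_rate(bonferroni_level(0.05, 10), 10)
```

Under the independence assumption, ten uncorrected tests at the 5% level give a family-wise rate of about 0.40, whereas the Bonferroni-adjusted tests stay just below 0.05.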
In addition to these issues, the pipeline nature of NCPA leads to overconfidence in the final result [13]. This phenomenon is well known in the literature on statistics and model-based molecular evolution. Multiple sequence alignment provides an excellent example in which conditioning on an unknown multiple sequence alignment can lead to biased conclusions and overconfidence [37,38]. Fortunately, a joint Bayesian approach provides a formal and straightforward fix to pipelined analyses [39–41]. Brooks et al. [13] and Manolopoulou (PhD thesis, University of Cambridge, 2008) adopted this simultaneous paradigm to combine the three stages of NCPA. For the first stage, Manolopoulou introduced a tractable procedure to infer a haplotype tree from molecular sequence data that takes into account possible homoplasy. She then used approximate Bayesian computation to achieve computational efficiency for this part. For the second stage of NCPA, Brooks et al. [13] described a multivariate clustering model applicable to both phenotypic and phylogeographic data. The authors did not provide an analog of the third stage of NCPA, but suggested that posterior probabilities of various qualitative events provide a solid alternative to the inference key of NCPA. To demonstrate the model, Manolopoulou applied both methods to three empirical examples. In general, the two methods give similar conclusions, but the Bayesian approach formalizes the intuitive results from NCPA. This Bayesian framework should excite those who favor the comparative method because it retains the spirit of the comparative method, but corrects many of the statistical issues that plagued previous efforts. We look forward to many new developments arising from this approach.
Spatial diffusion approach
As an alternative to the comparative method and its Bayesian extensions, we now highlight recent developments towards model-based approaches that take a probabilistic perspective on spatial diffusion [15,16]. Although much of this work was implemented as part of a comprehensive statistical inference package that led to fruitful advances in demographic models [42,43], we want to emphasize that, unlike spatial coalescent approaches [44], phylogenetic diffusion models do not infer population-based spatial histories. Instead, they identify the ancestral history of a particular sample of molecular sequences, namely when and where direct ancestors of the sample existed. Nevertheless, such models can extract important information about the processes that underlie geographic structure in genetic data. For example, the inferred movement of direct ancestors reflects movement of or between historical populations. Notably, by limiting the search to ancestors, the spatial diffusion approach can accommodate geographical context more explicitly by modeling the diffusion as trait evolution in a phylogeny.
Phylogenies naturally lend themselves to exploration of population structure and to assessment of significance in various ways [45]. More challenging, however, is an analytical reconstruction of how this structure took shape throughout the phylogenetic history. It is almost a law of statistical inquiry that challenging problems are initially probed by heuristic approaches. Parsimony methods represent popular model-free heuristic approaches to reconstruct ancestral histories [46] for nucleotides, discrete organism traits and geographical locations [47]. Probabilistic approaches to ancestral reconstruction use continuous-time Markov chain (CTMC) models to infer discrete realizations in continuous time. Although conceptually simple, they can capture considerable complexity of the discrete transition process among many states, such as geographic locations, albeit at the expense of an increased number of instantaneous rate parameters that make up the CTMC rate matrix. An obvious criticism is that a single observation does not hold the information required to estimate all parameters in the rate matrix because the intermediate states of this process are not observed. Nonetheless, this is not a reason to abandon probabilistic approaches.
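To make the role of the rate matrix concrete, the following sketch computes the location transition probabilities P(t) = exp(Qt) along a branch of duration t. It is a minimal pure-Python illustration, not the routine used by any phylogenetic package; the three-location rate matrix Q and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_probabilities(rate_matrix, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring a truncated Taylor series."""
    n = len(rate_matrix)
    scale = t / (2 ** squarings)
    step = [[q * scale for q in row] for row in rate_matrix]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds step^k / k! after this update
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(A) = exp(A / 2)^2
    return result


# Hypothetical rates among three locations A, B and C (per unit time).
# Off-diagonal entries are instantaneous rates; each row sums to zero.
Q = [
    [-0.3, 0.2, 0.1],
    [0.2, -0.4, 0.2],
    [0.1, 0.2, -0.3],
]
P = transition_probabilities(Q, t=2.0)
```

Entry P[i][j] is the probability that a lineage starting in location i is in location j after time t. With K locations there are K(K − 1) off-diagonal rates in a general matrix (half that if the matrix is constrained to be symmetric), which is why the number of parameters grows quickly with the number of sampled locations.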
Others have reviewed ancestral reconstruction methods and highlighted the advantages of likelihood-based statistical methods [48,49]. Maximum likelihood methods and, to a greater extent, Bayesian inference accommodate important sources of uncertainty inherent in estimating character evolution [49]. A recent example of a Bayesian approach to island biogeography, in which specific Canary Islands represent the character state data, demonstrated great potential when applied to many groups of organisms evolving in independent phylogenies [50]. For a single character, however, a probabilistic approach that diligently accounts for phylogenetic and parameter uncertainty can yield particularly high variance estimates.
Recent developments in Bayesian phylogeography have yielded great advances towards efficient statistical inference of spatial diffusion [15]. In general, Bayesian approaches are, through their priors, more robust against over-parameterization, and this can be exploited most efficiently in discrete phylogeography by specifying informative priors for the rate parameters. For example, prior distributions based on distance could formalize the prior expectation of higher organismal lineage exchange between adjacent island groups in the Canary Island biogeography [50]. Such distance-informed priors have recently been explored in viral epidemiology [15]. Arguably a more important advance in the latter study is the ability to reduce the set of rate parameters to a limited number that provides a parsimonious description of the spatial diffusion process. This was achieved using a Bayesian stochastic search for variable selection [51], which augments the rate matrix with indicator variables and imposes a prior distribution on the total number of non-zero rate parameters. Parameterization of the truncated Poisson prior distribution proposed by Lemey et al. yields statistical efficiency and was motivated by the belief that most pairwise exchange rates are probably not required to explain diffusion along a phylogeny [15]. This approach has been used to construct a formal Bayes factor test to establish epidemiological linkage in viral phylogeographic histories [15] and removes the need to resort to ad hoc procedures [47].
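The Bayes factor for a single rate can be sketched from its indicator variable: it is the posterior odds that the indicator equals one divided by the corresponding prior odds. The snippet below is an illustration under stated assumptions, not code from the cited study. The prior inclusion probability is passed in as a plain number (in practice it would follow from the prior on the number of non-zero rates, whose parameterization is not reproduced here), and the indicator samples are made up.

```python
def posterior_inclusion_probability(indicator_samples):
    """Fraction of MCMC samples in which the rate indicator equals 1."""
    return sum(indicator_samples) / len(indicator_samples)


def inclusion_bayes_factor(posterior_prob, prior_prob):
    """Bayes factor for including a rate: posterior odds divided by prior odds."""
    if not (0.0 < prior_prob < 1.0) or not (0.0 <= posterior_prob < 1.0):
        raise ValueError("prior must lie in (0, 1) and posterior in [0, 1)")
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


# Hypothetical indicator samples for one pairwise rate, and a hypothetical prior.
samples = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
post = posterior_inclusion_probability(samples)
bf = inclusion_bayes_factor(post, prior_prob=0.2)
```

With these made-up numbers the posterior inclusion probability is 0.8 and the Bayes factor is about 16. A posterior probability of exactly 1 is rejected because the estimated odds would be infinite; a longer chain is needed in that situation.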
A basic assumption of discretized dispersal is that ancestral organisms at any point along a phylogeny reside in the locations from which samples were drawn. Spatial diffusion in continuous space represents an attractive alternative for more realistic phylogeographic inference. Following Brownian motion models for quantitative traits, bivariate Brownian random walks have been used for phylogeographic analysis in a likelihood approach [52]. This basic model was greatly expanded and introduced into a Bayesian statistical framework [16]. To overcome the limiting assumption of a constant diffusion rate throughout a phylogeny, rate heterogeneity is achieved using notions borrowed from uncorrelated relaxed clock models [53]. In our opinion, this represents a much-needed extension because it might be difficult to find any real-life example of spatial diffusion that adheres to a homogeneous Brownian process. For example, mountains, valleys, water bodies and the environmental tolerance of organisms complicate the situation.
In particular, these relaxed random walks admit several diffusion processes commonly used to model animal movement in ecological contexts [54]. These include independent Cauchy- and, more generally, Student-t-distributed increments along each branch of an uncertain phylogeny. These distributions can have infinite variance and derive motivation from Lévy flight models [55,56]. Further, the relaxed walks do not require a strict enforcement of power-law tail behavior that remains contentious in some animal movement studies [57]. Encouragingly, even for relatively simple epidemic expansions, relaxed random walks yield reconstructions with increased statistical efficiency. We believe this can add considerable credibility to spatial reconstructions in continuous space, in particular because Brownian trait estimates have often been found too variable for practical use [58].
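The contrast between homogeneous Brownian diffusion and a relaxed random walk can be sketched by simulating the displacement along a single branch. This is an illustrative simulation, not the inference machinery of the cited work. It assumes the standard scale-mixture construction in which a branch-specific scalar drawn from a gamma distribution rescales the Brownian variance, giving Student-t increments (Cauchy when nu = 1). The names sigma2, nu and phi are chosen here for illustration.

```python
import math
import random


def brownian_increment(rng, sigma2, duration):
    """Bivariate Brownian displacement: independent N(0, sigma2 * duration) per axis."""
    sd = math.sqrt(sigma2 * duration)
    return rng.gauss(0.0, sd), rng.gauss(0.0, sd)


def relaxed_increment(rng, sigma2, duration, nu=1.0):
    """Displacement with a branch-specific rate scalar (assumed gamma scale mixture).

    Dividing the variance by phi ~ Gamma(shape=nu/2, rate=nu/2) yields
    Student-t increments with nu degrees of freedom; nu = 1 is the Cauchy case.
    """
    phi = rng.gammavariate(nu / 2.0, 2.0 / nu)  # second argument is a scale, i.e. 1/rate
    phi = max(phi, 1e-12)  # numerical guard against division by a vanishing draw
    return brownian_increment(rng, sigma2 / phi, duration)


rng = random.Random(42)
# Displacements along 1000 hypothetical branches of unit duration.
brownian = [brownian_increment(rng, 1.0, 1.0) for _ in range(1000)]
relaxed = [relaxed_increment(rng, 1.0, 1.0, nu=1.0) for _ in range(1000)]
```

Plotting the two sets of displacements would show the relaxed walk producing occasional very long jumps alongside many short ones, whereas the Brownian increments stay within a few standard deviations of the origin.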
The continuous and discrete stochastic processes furnish complementary models that expand our ability to infer phylogeographic processes. They are both implemented in the BEAST software package that has matured into a rich resource of stochastic evolutionary models [59]. Focusing entirely on time-measured phylogenies and offering different flavors of demographic models, such an integrated approach seems to be very suitable for biogeographical analyses. The first phylogeographic steps with this framework were made in the field of viral epidemiology, in particular for rapidly evolving RNA viruses [8,16]. As general tools, however, these models are also useful in analyses of genetic data from more slowly evolving organisms. To illustrate this, we provide a phylogeographic reconstruction in continuous space for the freshwater snail Biomphalaria glabrata, a major vector of schistosomiasis in the New World [60] (Box 2). Box 2 also opens a discussion on visualization of inferred evolutionary histories through both time and space.
Box 2. Continuous phylogeographic diffusion and visualization challenges
We revisit the phylogeography of the freshwater snail Biomphalaria
glabrata, which has limited dispersal ability and occupies fragmented
habitats in South America and the Caribbean islands [60]. Mavarez et
al. originally obtained sequence data from populations of B. glabrata
sampled at the scale of its current geographic distribution to study its
phylogeography [60]. As illustrated in Figure I, we applied the recently
developed continuous diffusion model to 17 concatenated nuclear
internal transcribed spacer-2 (484 bp) gene sequences and the partial
mitochondrial large ribosomal subunit (374 bp) generated in the
original study. The reconstruction confirms clear separation into
northern (Venezuela and Lesser Antilles) and southern clades (Brazil),
in line with the Amazon River as an important barrier to gene flow
[60]. Because the equatorial forest of the Amazon basin does not
provide appropriate habitats for this species, it might therefore
represent an additional barrier to gene flow [78]. The considerable
divergence among sampled haplotypes led to the suggestion that B.
glabrata constitutes a species complex [60]. As suggested by
clustering of the Lesser Antilles haplotypes in the same monophyletic
clade [79,80], there also seems to be significant isolation by distance
on a more restricted geographic scale.
This illustration also touches on visualization tools that complement
Bayesian inference. Phylogenies have previously been projected
in virtual globe software [81], but we argue that these efforts should
be directed at model-based estimates. If probabilistic approaches can
rigorously capture the uncertainty in phylogeographic inference,
visualization tools that can fully present this uncertainty are critical
for appropriate interpretation. For discrete ancestral reconstruction,
this still remains an open issue; whereas location probability
distributions for ancestral nodes can be readily obtained, phylogenetic
uncertainty adds a level of complexity that hinders straightforward
visual summaries. Phylogeographic estimates through time
and continuous space are more naturally streamlined into rich
visualization [16]. Not only does this model offer an ability to draw
inference about location realizations and various summary statistics
at arbitrary points in time, but researchers can now also explore
results in an interactive fashion (for several examples, refer to
www.phylogeography.org). Finally, more work is needed to introduce
these summaries in more dedicated visualization software [82] or
accommodate them in feature-rich, multi-purpose geographic information
systems software [83].
Figure I. Freshwater snail phylogeny (maximum clade credibility tree) obtained
using Bayesian phylogeographic inference in continuous space. The maps in the
virtual globe visualization are based on satellite pictures available in Google
Earth (http://earth.google.com). (a) Projection of the complete phylogeny with a
red-white gradient denoting the relative age of the branches. The location
uncertainty for each node is visualized by a polygon representing an 80%
credibility contour. (b) Detailed view of the Northern phylogeographic clade, as
indicated by the rectangle in panel a. The orientation of the view is indicated by
matching arrows in both panels. Taxon habitat (freshwater) is an important
geographic feature to begin modeling directly.
Population genetics approach
By far the most popular statistical approaches to phylogeography rely on the structured-coalescent framework [9,44,61,62]. In general, these methods assume that evolutionary trees are random draws from some underlying population-level process [9]. Essentially, population-level processes fossilize their histories as evolutionary trees that we indirectly view through molecular sequence and other data [10,61]. These processes include selection, migration, population size changes and recombination.
Nearly all population-level approaches embed evolutionary trees in a structured-coalescent framework to make inference [63]. Herein lie the strengths and limitations of these methods. If constructed appropriately, structured-coalescent methods allow us to infer population-level information from a relatively small sample. Unfortunately, constructing these methods in an appropriate manner requires extensive time, computation, and ingenuity. The task is not trivial. More than a decade ago, Felsenstein et al. reported that any implementation of structured-coalescent and related models took nearly 2 years to develop [64]. Nielsen and Beaumont recently suggested that this is still the case [9]. Moreover, the field still has no way to handle even simple recombination between linked loci in a structured-coalescent framework.
Nevertheless, population genetic approaches for ances-tral inference are flourishing. In the past few years, nu-merous highly significant contributions have beenpublished. As examples, Liu and Pearl [65] and Heledand Drummond [18] fully took into account gene treeuncertainty and species tree uncertainty throughBayesianstatistical modeling [66,67]. Focusing on gene flow andmigration, Hey [17] extended his previous work to incor-porate migration between multiple species. Also focusingon gene flow, Beerli and Palczewski [19] used thermody-namic integration to compute Bayes factors assessingpanmixia and migration in a structured-coalescent frame-work [44]. Approximate Bayesian computation enablesempirical investigators to assess complex demographichistories, conditional likelihood methods facilitate approx-
629
Box 3. Open issues approaching solution
Availability of geo-coded sequence data
The inclusion of geographical data in molecular phylogenetics
increases the size and complexity of data sets. As a benefit,
investigators can learn the relationships between several disparate
aspects, gaining a better understanding of biological diversity and
history. A disadvantage of these complex data sets, however, is
their reproducibility and validity. In the past, investigators could
reproduce and validate inferences because models and data sets
were readily available. At present, however, inferences are more
difficult to reproduce because geographic data are rarely provided
with the molecular sequences and many modeling assumptions go
Opinion Trends in Ecology and Evolution Vol.25 No.11
imation of the coalescent with recombination, and efficientcalculation can be used for full assessment [9].
In summary,much aswith the previous two approaches,application of population genetics to phylogeographyremains a vibrant field of research. Highly significant workis being carried out, but numerous biologically relevantmethods need to be created, implemented and adapted.This route to phylogeography is far from a dead end.
Three routes to the same destinationWith the expansion of phylogeography in new Bayesiandirections involving NCPA and spatial diffusion, the fieldmight seem to be fragmenting. However, we believe allthree approaches address the same basic question posedat the beginning of this article: what are the most effec-tive ways to learn about phylogeographic processes fromgeospatially identified molecular sequence data? Essen-tially, we believe that all three frameworks produceeffective answers, if we know what questions to ask.Under the comparative approach, if we want to assess[(Figure_1)TD$FIG]
Figure 1. Three approaches to phylogeography. In all three windows, the data
represent three species (red, purple and blue) sampled across an island (green). For
the comparative approach, the haplotype tree is displayed on the left and concentric
circles on the island represent the geographical distribution of the three species. For
the spatial diffusion approach, the phylogenetic tree superimposed on the
geographical location of the samples is displayed on the left. The phylogenetic
tree is shown in black to reinforce the notion that population information is not
inferred. The gray areas behind the phylogenetic tree and on the island represent a
high-probability region contour for the locations of the ancestors of the taxa
sampled. For the population genetics approach, the five colored polygons represent
the ancestral populations and their sizes. Arrows represent strong migrations
between the populations. Currently, the idealized population polygons abstract all
geographic features to keep the models tractable. This image is based on Hey [17].
630
geographical spread and molecular data without model-ing this spread explicitly, the method of Manolopoulouwill be the most appropriate. If we want to model rates ofspread and take into account geographical features, themethod of Lemey et al. [16] will be most appropriate. Ifwe want information about population size and migra-tion, but consider population to be a coarse feature, thepopulation genetics approach will be most appropriate.
unreported. One way to avoid this difficultly is to require geo-coded
sequence submission and a central online repository in which
investigators can upload their computational scripts. Some journals
have started to require such uploads for publications, but the push
needs to be stronger.
Incorporating geographic features and niche modeling
The structured-coalescent and discretized spatial diffusion ap-
proaches generate models into which researchers can input a
limited amount of information about geographic or niche features.
This input involves the definition of prior distributions for migration
rates between populations or geographic locations. For these forms
of discrete Markov chain models, O’Brien et al. offered advice on
how to make summary statistics about the diffusion process robust
to some of the model misspecification that results from ignoring
additional features [84]. However, it is unclear how effective
informed prior specification will remain when researchers attempt
to learn from observed data which geographic or niche features
significantly affect migration or diffusion. In these situations, direct
modeling of geographic features as hard or soft barriers in a
continuous diffusion framework seems to be a tenable solution.
Ranking of features according to their barrier strength or testing
which strengths do not significantly differ from zero produces a
sound statistical framework. Efficient computation of transition
probabilities along the tree of migration or diffusion in the presence
of multiple, possibly irregular, barriers remains a modeling
challenge.
Next-generation sequencing
With the advent of next-generation sequencing technologies,
biologists can now obtain de novo genomic samples from entire
biological communities. However, metagenomic studies are still in
their infancy. In particular, the analysis and design of metagenomic
studies are unrefined and rudimentary. Advances in computational
efficiency will be vital for application of model-based methods in
this area. Moreover, model-based methods will need to account for
the data collection technology, as well as the phylogeographic
history of a sample under study. Currently, a lack of theoretical tools
is hindering scientific progress in this area.
To tree or not to tree
The relationship between an inferred phylogenetic tree and its
embedded population has been extensively described over the past
30 years [9,62,85]. Nevertheless, this relationship remains poorly
understood, as evidenced by the three distinct approaches high-
lighted in this paper. Put simply, we still poorly understand whether
inferred evolutionary history is a nuisance or a fundamental entity
[86]. Resolution of this issue will probably require extensive work in
the coming decade.
Opinion Trends in Ecology and Evolution Vol.25 No.11
Figure 1 provides a schematic of these three phylogeographical frameworks.
The question arises as to whether the advantages of all three approaches can be combined into a single framework. We hesitate to speculate on this particular question, and simply leave it as open and tractable and encourage its exploration (see Box 3 for other open issues on the verge of being solved). We note that with a shift towards likelihood-based statistical inference, ad hoc procedures are no longer the only or preferable methods available. Instead, seeking novel insights into phylogeographic processes is reduced to envisioning stochastic process-driven models that incorporate the relevant aspects of the research question at hand and then developing the tools necessary to efficiently fit these models to data. The limiting aspects of this approach are chiefly creativity in modeling and effective strategies for reducing high-dimensional model fits to a human-interpretable form.
Acknowledgments
We thank Chris Simon, Allen Rodrigo, Brian C. Carstens, H. Lisle Gibbs, Laura S. Kubatko, Peter Beerli and Ioanna Manolopoulou for their comments and suggestions. We also thank the National Evolutionary Synthesis Center (NSF #EF-0423641) for fostering our collaboration. Alexei J. Drummond contributed greatly to discussions on an earlier version of this paper. EWB is supported in part by the National Science Foundation under Agreement No. 0635561. PL is supported by a postdoctoral fellowship from the Fund for Scientific Research (FWO) Flanders. MAS is supported in part by the United States National Institutes of Health (R01 GM086887), the National Science Foundation (DMS 0856099) and the Marsden Fund, New Zealand. Research leading to this article received funding from the European Research Council under the European Union Seventh Framework Programme (FP7/2007-2013) and ERC Grant agreement #260864.
References
1 Searle, J. et al. (2009) Of mice and (Viking?) men: phylogeography of British and Irish house mice. Proc. R. Soc. B Biol. Sci. 276, 201–207
2 Fagundes, N. et al. (2007) Statistical evaluation of alternative models of human evolution. Proc. Natl. Acad. Sci. U. S. A. 104, 17614–17619
3 Li, J. et al. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104
4 von Holdt, B. et al. (2010) Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature 464, 898–902
5 Biek, R. et al. (2006) A virus reveals population structure and recent demographic history of its carnivore host. Science 311, 538–541
6 Biek, R. et al. (2007) A high-resolution genetic signature of demographic and spatial expansion in epizootic rabies virus. Proc. Natl. Acad. Sci. U. S. A. 104, 7993–7998
7 Smith, G. et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459, 1122–1125
8 Lemey, P. et al. (2009) Reconstructing the initial global spread of a human influenza pandemic: a Bayesian spatial–temporal model for the global spread of H1N1. PLoS Curr. Influenza RRN1031
9 Nielsen, R. and Beaumont, M. (2009) Statistical inferences in phylogeography. Mol. Ecol. 18, 1034–1047
10 Knowles, L. (2009) Statistical phylogeography. Annu. Rev. Ecol. Evol. Syst. 40, 593–612
11 Avise, J. (2009) Phylogeography: retrospect and prospect. J. Biogeog. 36, 3–15
12 Hickerson, M. et al. (2010) Phylogeography’s past, present, and future: 10 years after Avise, 2000. Mol. Phylogenet. Evol. 54, 291–301
13 Brooks, S. et al. (2007) Assessing the effect of genetic mutation: a Bayesian framework for determining population history from DNA sequence data. In Bayesian Statistics (Vol. 8) (Bernardo, J.M. et al., eds), pp. 1–26, Oxford University Press
14 Templeton, A. (2010) Coalescent-based, maximum likelihood inference in phylogeography. Mol. Ecol. 19, 431–435
15 Lemey, P. et al. (2009) Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5, e1000520
16 Lemey, P. et al. (2010) Phylogeography takes a relaxed random walk in continuous space and time. Mol. Biol. Evol. 27, 1877–1885
17 Hey, J. (2010) Isolation with migration models for more than two populations. Mol. Biol. Evol. 27, 905–920
18 Heled, J. and Drummond, A. (2010) Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27, 570–580
19 Beerli, P. and Palczewski, M. (2010) Unified framework to evaluate panmixia and migration direction among multiple sampling locations. Genetics 185, 313–326
20 Beaumont, M. et al. (2010) In defence of model-based inference in phylogeography. Mol. Ecol. 19, 436–446
21 Camargo, A. et al. (2009) Phylogeography of the frog Leptodactylus validus (Amphibia: Anura): patterns and timing of colonization events in the Lesser Antilles. Mol. Phylogenet. Evol. 53, 571–579
22 Phillipsen, I. and Metcalf, A. (2009) Phylogeography of a stream-dwelling frog (Pseudacris cadaverina) in southern California. Mol. Phylogenet. Evol. 53, 152–170
23 Gante, H. et al. (2009) Diversification within glacial refugia: tempo and mode of evolution of the polytypic fish Barbus sclateri. Mol. Ecol. 18, 3240–3255
24 Felsenstein, J. (1985) Phylogenies and the comparative method. Am. Nat. 125, 1–15
25 Harvey, P. and Pagel, D. (1991) The Comparative Method in Evolutionary Biology, Oxford University Press
26 Templeton, A. (1998) Nested clade analyses of phylogeographic data: testing hypotheses about gene flow and population history. Mol. Ecol. 7, 381–397
27 Panchal, M. and Beaumont, M. (2007) The automation and evaluation of nested clade phylogeographic analysis. Evolution 61, 1466–1480
28 Templeton, A. et al. (1992) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. III. Cladogram estimation. Genetics 132, 619–633
29 Clement, M. et al. (2000) TCS: a computer program to estimate gene genealogies. Mol. Ecol. 9, 1657–1659
30 Posada, D. and Crandall, K. (2001) Intraspecific gene genealogies: trees grafting into networks. Trends Ecol. Evol. 16, 37–45
31 Templeton, A. et al. (1987) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. I. Basic theory and an analysis of alcohol dehydrogenase activity in Drosophila. Genetics 117, 343–351
32 Templeton, A. et al. (1993) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. IV. Nested analyses with cladogram uncertainty and recombination. Genetics 134, 659–669
33 Cassens, I. et al. (2005) Evaluating intraspecific “network” construction methods using simulated sequence data: do existing algorithms outperform the global maximum parsimony approach? Syst. Biol. 54, 363–372
34 Knowles, L. (2008) Why does a method that fails continue to be used? Evolution 62, 2713–2717
35 Templeton, A. (2004) A maximum likelihood framework for cross validation of phylogeographic hypotheses. In Evolutionary Theory and Processes: Modern Horizons (Wasser, S., ed.), pp. 209–230, Kluwer Academic Publishers
36 Panchal, M. and Beaumont, M. (2010) Evaluating nested clade phylogenetic analysis under models of restricted gene flow. Syst. Biol. 59, 415–432
37 Wong, K. et al. (2008) Alignment uncertainty and genomic analysis. Science 319, 473–476
38 Redelings, B. and Suchard, M. (2009) Robust inferences from ambiguous alignments. In Sequence Alignment: Methods, Models, Concepts, and Strategies (Rosenberg, M., ed.), pp. 209–271, University of California Press
39 Suchard, M. et al. (2001) Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18, 1001–1013
40 Redelings, B. and Suchard, M. (2005) Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54, 401–418
41 Novak, A. et al. (2008) StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24, 2403–2404
42 Drummond, A. et al. (2005) Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22, 1185–1192
43 Minin, V. et al. (2008) Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Mol. Biol. Evol. 25, 1459–1471
44 Beerli, P. and Felsenstein, J. (2001) Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. U. S. A. 98, 4563–4568
45 Zarate, S. et al. (2007) Comparative study of methods for detecting sequence compartmentalization in human immunodeficiency virus type 1. J. Virol. 81, 6643–6651
46 Swofford, D. and Maddison, W. (1992) Parsimony, character-state reconstructions and evolutionary inferences. In Systematics, Historical Ecology, and North American Freshwater Fishes (Mayden, R., ed.), pp. 186–223, Stanford University Press
47 Wallace, R. et al. (2007) A statistical phylogeography of influenza A H5N1. Proc. Natl. Acad. Sci. U. S. A. 104, 4473–4478
48 Cunningham, C. et al. (1998) Reconstructing ancestral character states: a critical reappraisal. Trends Ecol. Evol. 13, 361–366
49 Ronquist, F. (2004) Bayesian inference of character evolution. Trends Ecol. Evol. 19, 475–481
50 Sanmartin, I. et al. (2008) Inferring dispersal: a Bayesian approach to phylogeny-based island biogeography, with special reference to the Canary Islands. J. Biogeog. 35, 428–449
51 Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B 60, 65–81
52 Lemmon, A. and Lemmon, E. (2008) A likelihood framework for estimating phylogeographical history on a continuous landscape. Syst. Biol. 57, 544–561
53 Drummond, A. et al. (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol. 4, e88
54 Paradis, E. et al. (2002) Modeling large-scale dispersal distances. Ecol. Model. 151, 279–292
55 Viswanathan, G. et al. (1996) Levy flight search patterns of wandering albatrosses. Nature 381, 413–415
56 Reynolds, A. and Rhodes, C. (2009) The Levy flight paradigm: random search patterns and mechanisms. Ecology 90, 877–887
57 Edwards, A. et al. (2007) Revisiting Levy flight search patterns of wandering albatrosses, bumblebees and deer. Nature 449, 1044–1048
58 Schluter, D. et al. (1997) Likelihood of ancestral states in adaptive radiation. Evolution 51, 1699–1711
59 Drummond, A. and Rambaut, A. (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214
60 Mavarez, J. et al. (2002) Evolutionary history and phylogeography of the schistosome-vector freshwater snail Biomphalaria glabrata based upon nuclear and mitochondrial DNA sequences. Heredity 89, 266–272
61 Avise, J. (2000) Phylogeography: The History and Formation of Species, Harvard University Press
62 Hey, J. and Machado, C. (2003) The study of structured populations – a new hope for a difficult and divided science. Nat. Rev. Genet. 4, 535–543
63 Kingman, J. (1982) The coalescent. Stochastic Processes Appl. 13, 235–248
64 Felsenstein, J. et al. (1999) Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. In Statistics in Molecular Biology (Seillier-Moiseiwitsch, F., ed.) IMS Lecture Notes, Vol. 33, pp. 163–184, Institute of Mathematical Statistics
65 Liu, L. and Pearl, D. (2007) Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol. 56, 504–514
66 Edwards, S. (2009) Is a new and general theory of molecular systematics emerging? Evolution 63, 1–19
67 Degnan, J. and Rosenberg, N. (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340
68 Templeton, A. et al. (1995) Separating population structure from population history – a cladistic analysis of the geographical distribution of mitochondrial DNA haplotypes in the tiger salamander, Ambystoma tigrinum. Genetics 140, 767–782
69 Knowles, L. and Maddison, W. (2002) Statistical phylogeography. Mol. Ecol. 11, 2623–2635
70 Templeton, A. (2004) Statistical phylogeography: methods of evaluating and minimizing inference errors. Mol. Ecol. 13, 789–809
71 Petit, R. (2008) The coup de grace for nested clade phylogeographic analysis. Mol. Ecol. 17, 516–518
72 Beaumont, M. and Panchal, M. (2008) On the validity of nested clade phylogeographic analysis. Mol. Ecol. 17, 2563–2565
73 Garrick, R. et al. (2008) Babies and bathwater: a comment on the premature obituary for nested clade phylogeographical analysis. Mol. Ecol. 17, 1401–1403
74 Templeton, A. (2008) Nested clade analysis: an extensively validated method for strong phylogeographic inference. Mol. Ecol. 17, 1877–1880
75 Templeton, A. (2009) Why does a method that fails continue to be used? The answer. Evolution 63, 807–812
76 Templeton, A. (2009) Statistical hypothesis testing in intraspecific phylogeography: nested clade phylogeographic analysis vs. approximate Bayesian computation. Mol. Ecol. 18, 319–331
77 Templeton, A. (2010) Coherent and incoherent inference in phylogeography and human evolution. Proc. Natl. Acad. Sci. U. S. A. 107, 6376–6381
78 Paraense, W. (1983) A survey of planorbid molluscs in the Amazonian region of Brazil. Mem. Inst. Oswaldo Cruz 78, 343–361
79 Mavarez, J. et al. (2002) Fine-scale population structure and dispersal of Biomphalaria glabrata, the intermediate snail host of Schistosoma mansoni, in Venezuela. Mol. Ecol. 11, 879–899
80 Mavarez, J. et al. (2002) Genetic differentiation, dispersal and mating system in the schistosome-transmitting freshwater snail Biomphalaria glabrata. Heredity 89, 258–265
81 Janies, D. et al. (2007) Genomic analysis and geographic visualization of the spread of avian influenza (H5N1). Syst. Biol. 56, 321–329
82 Parks, D. et al. (2009) GenGIS: a geospatial information system for genomic data. Genome Res. 19, 1896–1904
83 Kidd, D. and Ritchie, M. (2006) Phylogeographic information systems: putting the geography into phylogeography. J. Biogeog. 33, 1851–1865
84 O’Brien, J. et al. (2009) Learning to count: robust estimates for labeled distances between molecular sequences. Mol. Biol. Evol. 26, 801–814
85 Felsenstein, J. (1988) Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22, 521–565
86 Smouse, P. (1998) To tree or not to tree. Mol. Ecol. 7, 399–412