12
doi: 10.1098/rstb.2008.0176 , 3985-3995 363 2008 Phil. Trans. R. Soc. B Vladimir N Minin and Marc A Suchard Fast, accurate and simulation-free stochastic mapping References http://rstb.royalsocietypublishing.org/content/363/1512/3985.full.html#ref-list-1 This article cites 23 articles, 9 of which can be accessed free Subject collections (577 articles) evolution (40 articles) bioinformatics (70 articles) taxonomy and systematics Articles on similar topics can be found in the following collections Email alerting service here right-hand corner of the article or click Receive free email alerts when new articles cite this article - sign up in the box at the top http://rstb.royalsocietypublishing.org/subscriptions go to: Phil. Trans. R. Soc. B To subscribe to This journal is © 2008 The Royal Society on 14 April 2009 rstb.royalsocietypublishing.org Downloaded from

Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

doi: 10.1098/rstb.2008.0176, 3985-3995363 2008 Phil. Trans. R. Soc. B

 Vladimir N Minin and Marc A Suchard Fast, accurate and simulation-free stochastic mapping 

Referenceshttp://rstb.royalsocietypublishing.org/content/363/1512/3985.full.html#ref-list-1

This article cites 23 articles, 9 of which can be accessed free

Subject collections

(577 articles)evolution   � (40 articles)bioinformatics   �

(70 articles)taxonomy and systematics   � Articles on similar topics can be found in the following collections

Email alerting service hereright-hand corner of the article or click

Receive free email alerts when new articles cite this article - sign up in the box at the top

http://rstb.royalsocietypublishing.org/subscriptions go to: Phil. Trans. R. Soc. BTo subscribe to

This journal is © 2008 The Royal Society

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

Page 2: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

Phil. Trans. R. Soc. B (2008) 363, 3985–3995

doi:10.1098/rstb.2008.0176

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

Fast, accurate and simulation-freestochastic mapping

Published online 7 October 2008

Vladimir N. Minin1,* and Marc A. Suchard2,3,4

One concomputa

*Autho

1Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA2Department of Biomathematics, and 3Department of Human Genetics, David Geffen School of Medicine,

University of California, Los Angeles, CA 90095-1766, USA4Department of Biostatistics, School of Public Health, University of California, Los Angeles,

CA 90095-1766, USA

Mapping evolutionary trajectories of discrete traits onto phylogenies receives considerable attentionin evolutionary biology. Given the trait observations at the tips of a phylogenetic tree, researchers areoften interested where on the tree the trait changes its state and whether some changes arepreferential in certain parts of the tree. In a model-based phylogenetic framework, such questionstranslate into characterizing probabilistic properties of evolutionary trajectories. Current methods ofassessing these properties rely on computationally expensive simulations. In this paper, we present anefficient, simulation-free algorithm for computing two important and ubiquitous evolutionarytrajectory properties. The first is the mean number of trait changes, where changes can be dividedinto classes of interest (e.g. synonymous/non-synonymous mutations). The mean evolutionaryreward, accrued proportionally to the time a trait occupies each of its states, is the second property.To illustrate the usefulness of our results, we first employ our simulation-free stochastic mapping toexecute a posterior predictive test of correlation between two evolutionary traits. We conclude bymapping synonymous and non-synonymous mutations onto branches of an HIV intrahostphylogenetic tree and comparing selection pressure on terminal and internal tree branches.

Keywords: stochastic mapping; counting processes; evolutionary rewards;posterior predictive diagnostics

1. INTRODUCTION AND BACKGROUNDReconstructing evolutionary histories from present-day

observations is a central problem in quantitative

biology. Phylogenetic estimation is one example ofsuch reconstruction. However, phylogenetic recon-

struction alone does not provide a full picture of an

evolutionary history, because evolutionary paths (map-

pings) describing trait states along the phylogenetic tree

remain hidden. Although one is rarely interested in

detailed reconstruction of such mappings, certain

probabilistic properties of the paths are frequently

used in evolutionary hypotheses testing (Nielsen 2002;Huelsenbeck et al. 2003; Leschen & Buckley 2007).

For example, given a tree and a Markov model of

amino acid evolution, one can compute the expected

number of times a transition from a hydrophobic to a

hydrophilic state occurs, conditional on the observed

amino acid sequence alignment. Such expectations can

inform researchers about model adequacy and provide

insight into the features of the evolutionary processoverlooked by standard phylogenetic techniques

(Dimmic et al. 2005).

Nielsen (2002) introduced stochastic mapping of

trait states on trees and employed this new technique in

a model-based evolutionary hypothesis testing context.

tribution of 17 to a Discussion Meeting Issue ‘Statistical andtional challenges in molecular phylogenetics and evolution’.

r for correspondence ([email protected]).

3985

The author starts with a discrete evolutionary trait Xthat attains m states. He further assumes that this traitevolves according to an evolutionary model describedby a parameter vector q, where q consists of a tree t

with n tips and branch lengths TZ ðt1;.; tBnÞ, root

distribution pZ ðp1;.;pmÞ and a continuous-timeMarkov chain (CTMC) generator QZ{qij} fori; jZ1;.;m. Let mapping MqZ ðfX1tg;.; fXBnt

gÞ bea collection of CTMC trajectories along all branches oft and H(Mq) be a real-valued summary of Mq. Clearly,even when parameters q are fixed, h(Mq) remains arandom variable. Nielsen (2002) proposed to testevolutionary hypotheses using prior and posteriorexpectations E½HðMqÞ� and E½HðMqÞ jD�, whereDZ ðD1;.;DnÞ are trait values observed at the n tipsof t. Since these expectations are deterministicfunctions of q and D, they can be used as discrepancymeasures for posterior predictive p-value calculations(Meng 1994; Gelman et al. 1996).

A major advantage of Nielsen’s stochastic mappingframework is its ability to account for uncertainty inmodel parameters, including phylogenies. A majorlimitation of stochastic mapping is its currentimplementation that relies on time-consumingsimulations. In describing his method for calculatingE½HðMqÞ� and E½HðMqÞ jD�, Nielsen (2002) wrote ‘Ingeneral, we can not evaluate sums in equations 5 and 6directly, because the set [of all possible mappings] isnot of finite size.’ However, the infinite number ofpossible mappings does not prevent one from explicitly

This journal is q 2008 The Royal Society

Page 3: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

;

3986 V. N. Minin & M. A. Suchard Simulation-free stochastic mapping

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

calculating E½HðMqÞ� for some choices of H. Forexample, if

HðMqÞZ1 if Mq is consistent with D

0 if Mq is inconsistent with D;

(ð1:1Þ

then E½HðMqÞ�ZPrðDÞ, the familiar phylogeneticlikelihood that can be evaluated without simulations(Felsenstein 2004). Therefore, hope remains that otherchoices of H may also permit evaluation of E½HðMqÞ�

and E½HðMqÞ jD� without simulations.In this paper, we consider a class of additive

mapping summaries of the form

HðMqÞZXb2U

hðfXbtgÞ; ð1:2Þ

where hðfXbtgÞ is a one-dimensional summary of theMarkov chain path along branch b and U is an arbitrarysubset of all branches of t. Moreover, we restrict ourattention to the two most popular choices of function h.Let L3f1;.;mg2 be a set of ordered index pairs thatlabel transitions of trait X. For each Markov path {Xt}and interval [0, t), we count the number of labelledtransitions in this interval and arrive at

h1ðfXtgÞZNL - number of state transitions labelled by setLð1:3Þ

where we omit dependence on q and t for brevity.Although our second choice of h is more abstract, it ismotivated by Huelsenbeck et al. (2003), who usedNielsen’s stochastic mapping algorithm to calculate themean dwelling time of a trait in a particular state. LetwZ ðw1;.;wmÞ be a set of rewards assigned to eachtrait state. Trait X is ‘rewarded’ the amount t!wi forspending time t in state i. We obtain the total reward ofMarkov path {Xt} by summing up all rewards that Xaccumulates during interval [0, t),

h2ðfXtgÞZRw - evolutionary reward defined by vector w:

ð1:4Þ

To obtain dwelling times of X in a predefined set of traitstates, we set wiZ1 if i belongs to the set of interest andwiZ0 otherwise.

For these two choices of function h, we providean algorithm for exact, simulation-free computation ofE½HðMqÞ� and E½HðMqÞ jD�. Similar to phylogeneticlikelihood calculations of Pr(D), this algorithm relies onthe eigen decomposition of Q and requires traversing t.Our work generalizes and unifies exact calculationspreviously developed for stochastic mapping (Holmes &Rubin 2002; Guindon et al. 2004; Dutheil et al. 2005).Despite the restricted form of the considered mappingsummaries, our results cover nearly all current appli-cations of stochastic mapping. We conclude with twoapplications of stochastic mapping illustrating thecapabilities of exact computation. In our first example,we examine coevolution of two binary traits anddemonstrate that a previously developed simulation-based test of independent evolution can be executedwithout simulations. We then turn to a large codonMarkov state space, on which simulation-based stochas-tic mapping generally experiences severe comput-ational limitations. Using our exact computations, we

Phil. Trans. R. Soc. B (2008)

study temporal patterns of synonymous and non-synonymous mutations in intrahost HIV evolution.

2. LOCAL, ONE BRANCH CALCULATIONSIn this section, we provide mathematical details neededfor calculating expectations of stochastic mappingsummaries on one branch of a phylogenetic tree.We first motivate the need for such local computationsby making further analogies between calculations ofPr(D) and expectations of stochastic mapping sum-maries. The additive form of H reduces calculationof E½HðMqÞ� and E½HðMqÞ jD� to compute branch-specific expectations E½hðfXbtgÞ� and E½hðfXbtÞg jD�.Recall that according to the most phylogenetic models,trait X evolves independently on each branch of t,conditional on trait states at all internal nodes of t. Thisconditional independence is the key behind the dynamicprogramming algorithm that allows for efficient calcu-lation of Pr(D) (Felsenstein 1981). For this likelihoodcalculation algorithm, it suffices to compute finite-timetransition probabilities PðtÞZ fpijðtÞg, where

pijðtÞZPrðXt Z j jX0 Z i Þ; ð2:1Þ

for arbitrary branch length t. Similarly, to obtainE½hðfXtgÞ� and E½hðfXtgÞ jD�, we require means ofcomputing local expectations Eðh; tÞZ feijðh; tÞg,where

eijðh; tÞZE�hðfXtgÞ1fXtZjg jX0 Z i

�ð2:2Þ

and 1{$} is the indicator function. After illustratinghow to compute EðNL; tÞ and EðRw; tÞ withoutresorting to simulations, we provide an algorithmthat efficiently propagates local expectations E(h, t)and finite-time transition probabilities P(t) along t

to arrive at E½hðfXtgÞ� and E½hðfXtgÞ jD�.

(a) Expected number of labelled

Markov transitions

Abstracting from phylogenetics, let NLðtÞ count thenumber of labelled transitions of a CTMC {Xt} duringtime interval [0, t). It follows from the theory of Markovchain-induced counting processes that

EðNL; tÞZ

ðt0

eQzQLeQðtKzÞ dz; ð2:3Þ

where QLZ fqij!1fði; j Þ2Lgg (Ball & Milne 2005). Sincemost evolutionary models are locally reversible, we cansafely assume that Q is diagonalizable with eigen decom-position QZU!diagðd1;.; dmÞ!UK1, where eigen-vectors of Q form the columns of U, d1;.; dm are thereal eigenvalues of Q, and diagðd1;.; dmÞ is a diagonalmatrix with elements d1;.; dm on its main diagonal.Such analytic or numeric diagonalization procedurepermits calculation of finite-time transition proba-bilities PðtÞZU !diagðed1t ;.; edmtÞ!UK1, needed forlikelihood calculations (Lange 2004). Minin & Suchard(2008) have shown that one can use the same eigendecomposition of Q to calculate local expectations

EðNL; tÞZXmiZ1

XmjZ1

SiQLSj IijðtÞ; ð2:4Þ

Page 4: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

Simulation-free stochastic mapping V. N. Minin & M. A. Suchard 3987

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

where SiZUEiUK1, Ei is a matrix with zero entries

everywhere except at the ii-th entry, which is one, and

IijðtÞZ

tedi t if di Z dj ;

edi tKedj t

diK djif disdj :

8>>><>>>:

ð2:5Þ

(b) Expected Markov rewards

For the reward processRw(t), we define a matrix cumul-ative distribution function V ðx; tÞZ fVijðx; tÞg, where

Vijðx; tÞZPrðRwðtÞ%x; Xt Z j jX0 Z i Þ: ð2:6Þ

Neuts (1995, ch. 5) demonstrated that local rewardexpectations can be expressed as

EðRw; tÞZKd

dsV

�ðs; tÞ���sZ0

; ð2:7Þ

where

V�ðs; tÞZ

ðN0

eKsx dV ðx; tÞZ e½QKdiagðw1 ;. ;wmÞs�t ð2:8Þ

is the Laplace–Stieltjes transform of V(x, t). It is easy tosee that the matrix exponential in equation (2.8) satisfiesthe following differential equation:

d

dtV

�ðs; tÞZV�ðs; tÞ½QKdiagðw1;.;wmÞs�: ð2:9Þ

Differentiating this matrix differential equation withrespect to s, exchanging the order of differentiation andevaluating both sides of the resulting equation at sZ0, wearrive at the differential equation for local expectations

d

dtEðRw; tÞZEðRw; tÞQCeQtdiagðw1;.;wmÞ; ð2:10Þ

where E(Rw, 0) is the m!m zero matrix. Multiplicationof both sides of equation (2.10) by integrating factor eKQt

from the right and integration with respect to t producesthe solution,

EðRw; tÞZ

ðt0

eQzdiagðw1;.;wmÞeQðtKzÞ dz: ð2:11Þ

Similarity between equations (2.3) and (2.11) invitescalculation of the expected Markov rewards via spectraldecomposition of Q,

EðRw; tÞZXmiZ1

XmjZ1

Sidiagðw1;.;wmÞSj IijðtÞ: ð2:12Þ

In summary, formulae (2.4) and (2.12) provide a recipefor exact calculationsof local expectations for the numberof labelled transitions and rewards.

3. ASSEMBLING PIECES TOGETHER OVERA TREE(a) Notation for tree traversal

Let us label the internal nodes of t with integers{1, ., nK1} starting from the root of the tree. Recallthat we have already arbitrarily labelled the tips of t

with integers {1, ., n}. Let I be the set of internalbranches and E be the set of terminal branches of t. Foreach branch b2I , we denote the internal node labels ofthe parent and child of branch b by p(b) and c(b),

Phil. Trans. R. Soc. B (2008)

respectively. We use the same notation for eachterminal branch b except p(b), which is an internalnode index, while c(b) is a tip index. Let iZ ði1;.; inK1Þ

denote the internal node trait states. Then, thecomplete likelihood of unobserved internal node statesand the observed states at the tips of t is

Prði;DÞZpi1

Yb2I

pip ðbÞic ðbÞ ðtbÞYb2E

pip ðbÞDc ðbÞðtbÞ: ð3:1Þ

We form the likelihood of the observed data bysumming over all possible states of internal nodes,

PrðDÞZXmi1Z1

/Xm

inK1Z1

pi1

Yb2I

pip ðbÞic ðbÞ ðtbÞYb2E

pip ðbÞDc ðbÞðtbÞ:

ð3:2Þ

Clearly, when data on the tips are not observed, theprior distribution of internal nodes becomes

PrðiÞZpi1

Yb2I

pip ðbÞic ðbÞ ðtbÞ: ð3:3Þ

(b) Posterior expectations of mapping

summaries

Consider an arbitrary branch b� connecting parentinternal node p(b�) to its child c(b�). First, we introducerestricted moments,

E�hðfXb�tgÞ1D

�ZE

�hðfXb�tgÞ jD

�!PrðDÞ: ð3:4Þ

The expectation (3.4) integrates over all evolutionarymappings consistent with D on the tips of t. Invokingthe law of total expectation and the definition ofconditional probability, we deduce

E�hðfXb�tgÞ1D

�ZE½hðfXb�tÞg jD�!PrðDÞ

ZXi

E½hðfXb�tgÞ j i;D�!Prði;DÞ

ZXi

E�hðfXb�tgÞ j ipðb�Þ; icðb�Þ

�pi1

Yb2I

pip ðbÞic ðbÞ ðtbÞ

!Yb2E

pip ðbÞDc ðbÞðtbÞ

ZXi

eip ðb�Þic ðb�Þ ðh; tb� Þpi1

Yb2Infb�g

pip ðbÞic ðbÞ ðtbÞ

!Y

b2Enfb�gpip ðbÞDc ðbÞ

ðtbÞ: ð3:5Þ

The last expression in derivation (3.5) illustrates that inorder to calculate the posterior restricted moment (3.4)along branch b� 2I , we merely need to replace finite-time transition probability pipðb�Þicðb�Þ ðtb� Þ with local

expectation eipðb�Þicðb�Þ ðh; tb� Þ in the likelihood formula

(3.2). Similarly, if b� 2E, we substitute eipðb�ÞDcðb�Þðh; tb� Þ

for pipðb�ÞDcðb�Þðtb� Þ in (3.2). Given matrices PðtbÞ for

bsb� and Eðh; tb� Þ, we can sum over internal nodestates using Felsenstein’s pruning algorithm to arriveat the restricted mean E½hðfXb�tgÞ1D� and then dividethis quantity by Pr(D) to obtain E½hðfXb�tgÞ jD�.

This procedure is efficient for calculating theposterior expectations of mapping summaries for onebranch of t. However, in practice, we need to calculatethe mapping expectations over many branches andconsequently execute the computationally intensive

Page 5: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

(b)

2

A A

T

1

3

3 3

GG

4

A A

T GG

1

2

3

4

(a)

Figure 1. Sandwich formula illustration. (a) An examplephylogenetic tree in which we label internal nodes numeri-cally and two branches b� and b 0. We break this tree at nodes3 and 4 into the subtrees shown in (b). Assuming that thetrait states are i and j at nodes 3 and 4, respectively, we markeach subtree by the corresponding quantity needed forcalculating the posterior expectation of a mapping summaryon branch b�.

3988 V. N. Minin & M. A. Suchard Simulation-free stochastic mapping

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

pruning algorithm many times. A similar problem isencountered in maximum-likelihood phylogenetic esti-mation during differentiation of the likelihood withrespect to branch lengths. An algorithm, reviewed bySchadt et al. (1998), allows for computationally efficient,repeated replacement of one of the finite-time transitionprobabilities with an arbitrary function of the corres-ponding branch length in equation (3.2). This algorithmfinds informal use since the 1980s in pedigree analysis(Cannings et al. 1980) and is implemented in mostmaximum-likelihood phylogenetic reconstruction soft-ware such as PAUP (J. Huelsenbeck 2007, personalcommunication).

Let FuZ ðFu1;.;FumÞ be the vector of forward, oftencalled partial or fractional, likelihoods at nodeu. ElementFui is the probability of the observed data at only the tipsthat descend from the node u, given that the state of u is i.If u is a tip, then we initialize partial likelihoods viaFuiZ1fiZDug

. In case of missing or ambiguous data, Du

denotes the subset of possible trait states, and forwardlikelihoods are set to FuiZ1fi2Dug

. During the first,upward traversal oft, we compute forward likelihoods foreach internal node u using the recursion

Fui ZXmjZ1

Fcðb1Þjpijðtb1

Þ

" #!

XmjZ1

Fcðb2Þjpijðtb2

Þ

" #; ð3:6Þ

Phil. Trans. R. Soc. B (2008)

where b1 and b2 are indices of the branches descendingfrom node u and c(b1) and c(b2) are the correspondingchildren of u. We denote the directional likelihoods insquare brackets of equation (3.6) by Sb1i

and Sb2iand

record SbZ ðSb1;.;SbmÞ together with Fu. Finally,we define backward likelihoods GuZ ðGu1;.;GumÞ,where Gui is the probability of observing state i at nodeu together with other tip states on the subtree of t

obtained by removing all lineages downstream of node u.A second, downward traversal of t yields Gu given theprecomputed Sb.

For each branch b�, we can sandwich pijðtb� Þ among theforward, directional and backward likelihoods and write

PrðDÞZXmiZ1

XmjZ1

Gpðb�ÞiSb0ipijðtb� ÞFcðb�Þj ; ð3:7Þ

where b0 is the second branch descending from the parentnode p(b�). Therefore, with Fu, Sb and Gu precomputedfor all nodes and branches of t, we can replace pijðtb� Þwithany other quantity for any arbitrary branch b withoutrepeatedly traversing t. In particular, the posteriorrestricted moment for branch b� can be expressed as

E½hðfXb�tgÞ1D�ZXmiZ1

XmjZ1

Gpðb�ÞiSb0ieijðh; tb� ÞFcðb�Þj :

ð3:8Þ

In figure 1, we use an example tree to illustrate thecorrespondence between each quantity in sandwichformula (3.8) and the part of the tree involved in thisquantity computation. We summarize all steps that lead tothe computation of the global mean E½HðMqÞ jD� inalgorithm 1. Note that Pr(D), needed to transitionbetween conditional and restricted expectations informula (3.4), is computed with virtually no additionalcost in step (iv) of the algorithm.

(c) Pulley principle for evolutionary mappings

Suppose that we are interested in a mean mappingsummary E½HðMqÞ jD�, obtained as a sum of localmapping summaries over all branches of the phylogen-etic tree t. We would like to know whether quantityE½HðMqÞ jD� changes when we move the root of t to adifferent location.

Recall that reversibility of the Markov chain {Xt}makes Pr(D) invariant to the root placement in t if theroot distribution p is the stationary distribution of {Xt}(Felsenstein 1981). Felsenstein’s pulley principle restson the detailed balance condition

pipijðtÞZpjpjiðtÞ ð3:9Þ

and Chapman–Kolmogorov relationship

pijðt1 C t2ÞZXmkZ1

pikðt1Þpkjðt2Þ: ð3:10Þ

Applying both formulae (3.9) and (3.10) once inequation (3.2) allows one to move the root to anyposition along the two root branches without changingPr(D). Therefore, Pr(D) is invariant to moving the rootto any position on any branch of t since we canrepeatedly apply the detailed balance condition and theChapman–Kolmogorov equation.

Invariance of Pr(D) to root placement together withformula (3.4) suggests that root position invariance of

Page 6: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

Algorithm 1. Calculating posterior expectations E[H(Mq) jD].

(i) obtain an eigen decomposition of the infinitesimalgenerator Q

(ii) use this decomposition to compute PðtbÞ for eachbranch b of t

(iii) employing the same eigen decomposition, computeEðh; tb� Þ for each branch b� in the set of interest Uusing either equation (2.4) or equation (2.12)

(iv) traverse t once and calculate Fu and Sb for each nodeu and each branch b. Compute data likelihood Pr(D)as the dot product of Froot and root distribution p

(v) traverse t the second time and calculate backwardlikelihoods Gu for all nodes u

(vi) for each b�2U, apply equation (3.8) to obtain

E½hðfXb�tgÞ1D�

(vii) return E½HðMqÞ jD�Z1=PrðDÞP

b�2UE½hðfXb�tgÞ1D�

Simulation-free stochastic mapping V. N. Minin & M. A. Suchard 3989

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

conditional expectations E½HðMqÞ jD� holds if andonly if invariance of joint expectations E½HðMqÞ1D� issatisfied. Consider a two-tip phylogeny with branchesof length t1 and t2 leading to observed trait states D1

and D2, respectively. According to formulae (1.2) and(3.5), we may expect that

E½HðMqÞ1D�ZXmkZ1

pkekD1ðh; t1ÞpkD2

ðt2Þ

CXmkZ1

pkekD2ðh; t2ÞpkD1

ðt1Þ

ZXmkZ1

pD1eD1k

ðh; t1ÞpkD2ðt2Þ

CXmkZ1

pD1ekD2

ðh; t2ÞpD1kðt1Þ

ZpD1eD1D2

ðh; t1 C t2Þ; ð3:11Þ

depends only on the sum t1Ct2. Therefore, we canmove the root anywhere on this phylogeny withoutaltering the expectations if {Xt} is reversible. It is easyto see that repeated application of derivation (3.11)readily allows for extension of the root invarianceprinciple to n-tip phylogenies.

In derivation (3.11), we use identities

eijðh; t1 C t2ÞZXmkZ1

�eikðh; t1Þpkjðt2ÞCekjðh; t2Þpikðt1Þ

�ð3:12Þ

and

pieijðh; tÞZpj ejiðh; tÞ: ð3:13Þ

Equation (3.12) splits computation of the expectedsummary on interval [0, t1Ct2) into calculations boundto intervals [0, t1) and [t1, t1Ct2) with the help of the totalexpectation law and Markov property. This derivationparallels that of Chapman–Kolmogorov equation(3.10). Identity (3.13) requires more care as it does nothold for all choices of function h. Using equation (2.12),detailed balance condition (3.13) holds for h2ZRw.However, equation (2.3) suggests that we only canguarantee the detailed balanced condition (3.13) forh1ZNL when ði; j Þ2L if and only if ð j; i Þ2L.

Phil. Trans. R. Soc. B (2008)

(d) Prior expectations of mapping summaries

In many applications of stochastic mapping, one wishesto compare the prior to posterior expectations of sum-maries (Nielsen 2002). In this section, we derive theformulae necessary for computing prior expectations.Similar to our derivation of posterior expectations, webegin by considering an arbitrary branch b� and use thelaw of total expectation to arrive at

E½hðfXb�tgÞ�

Z

Pi eipðb�Þicðb�Þ ðh; tb� Þpi1

Qb2Infb�g pipðbÞicðbÞ ðtbÞ if b� 2IP

i eipðb�Þ ðh; tb� Þpi1

Qb2I pipðbÞicðbÞ ðtbÞ if b� 2E;

(

ð3:14Þ

where eiðh; tÞZPm

jZ1 eijðh; tÞ is the marginal localexpectation of the mapping summary.

Identity PðtÞ1Z1 allows us to eliminate summationover some internal node states in formula (3.14) andconsider only those internal nodes that lie on the pathconnecting the root of t and c(b�). If p is the stationarydistribution of {Xt}, then formula (3.14) together withidentities pTPðtÞZpT and PðtÞ1Z1 simplifies priorlocal expectations even further,

E½hðfXb�tgÞ�ZXi

Xj

pieijðh; tb� Þ

ZpTEðh; tb� Þ1Z

pTQL1tb� if hZNL

XmiZ1

piwitb� if hZRw

8>><>>:

ð3:15Þ

The fact that prior local expectations at stationaritycompute with virtually no additional burden hasimmediate practical implications. In the context ofthe posterior predictive model checking, researchersoften need to simulate L independent and identicallydistributed (iid) realizations D1;.;DL of data at thetips of t and then calculate the summary-baseddiscrepancy measure 1=L

PLlZ1 E½hðfXtgÞ jDl� (Nielsen

2002). Since we know the ‘true’ model undersimulation, we can approximate this discrepancymeasure with the prior expectation

1

L

XLlZ1

E½hðfXtgÞ jDl�zXD�

E½hðfXtgÞ jD��PrðD�Þ

ZE½hðfXtgÞ�; ð3:16Þ

where D� ranges over all possible trait values at the tipsof t. In molecular evolution applications, L is of theorder of 103–105, and hence approximation (3.16)should work well.

(e) Comparison with simulation-based

stochastic mapping

We comment earlier that the Monte Carlo algorithm forstochastic mapping is an alternative and a very popularway to compute expectations of mapping summaries.This algorithm consists of two major steps. In the firststep, the tree is traversed once to compute Fu for eachnode. In the second step, internal node states i aresimulated conditional on D (Pagel 1999). Then,conditional on i, one simulates CTMC trajectories oneach branch of the tree and computes summaries of

Page 7: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

Table 1. Efficiency and accuracy of stochastic mapping. (Foreach number of iterations, we report the median number ofrejected CTMC trajectories over the entire phylogenetic treeper iteration and the sum of absolute errors (SAE) ofsimulation-based estimates of the mean number of syn-onymous mutations along branches of the phylogenetic tree.)

slow evolving site fast evolving site

iterationsrejections/iteration SAE

rejections/iteration SAE

100 100 0.0598 38 845 0.4624500 105 0.0255 39 247 0.33191000 102 0.0259 42 075 0.290510 000 106 0.0205 40 805 0.2809

3990 V. N. Minin & M. A. Suchard Simulation-free stochastic mapping

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

interest. This second step is repeated N times producinga Monte Carlo sample of mapping summaries whoseaverages approximate the branch-specific expectations.

The running time of the algorithm depends on N andthe computational efficiency of generating CTMCtrajectories. The number of samples N, required foraccurate estimation, varies with Q and T. Unfortunately,this aspect of stochastic mapping is largely ignored in theliterature and by practitioners. Although more than oneway exists to simulate trajectories conditional on startingand ending states (Rodrigue et al. 2008), rejectionsampling is the most common (Nielsen 2002). Assessingthe efficiency of rejection sampling is complicated,because the efficiency depends not only on the choiceof CTMC parameters, but also on the observed datapatternsD. We illustrate these difficulties using a CTMCon a state space of 64 codon triplets. We take two sitesfrom an alignment of 129 HIV sequences that we discusslater in one of our examples. We call the first site slowevolving as it obtains only three different codons. Thesecond, fast evolving site emits nine different codons. Wetake one-parameter slice of our Markov chain MonteCarlo (MCMC) sample and run rejection sampling toestimate the expected number of mutations for allbranches of t. For each trial, we record the total numberof rejections on all branches of t and the absolute errorsof the estimated mean number of synonymous mutationssummed over all branches. We summarize the results ofthis experiment in table 1. We see a 400! increase in thenumber of rejections required to simulate trajectoriesconditional on the fast site compared to equivalentsimulations based on the slow site. The Monte Carloerror decreases as the number of Monte Carlo iterationsincreases, but not as fast as one would hope.

In summary, simulation-based stochastic mappingrequires simulation of CTMC trajectories; this is not atrivial computational task. Assessing accuracy ofmethods is cumbersome and difficult to automate.Our algorithm 1 replaces both simulation componentsfrom stochastic mapping calculations and thereforeshould be a preferred way of calculating expectations ofmapping summaries.

4. EXAMPLES(a) Detecting coevolution via dwelling times

In this section, we reformulate a previously developedsimulation-based method for detection of correlated

Phil. Trans. R. Soc. B (2008)

evolution (Huelsenbeck et al. 2003) in terms of a Markovreward process. We consider two primate evolutionarytraits, oestrus advertisement (EA) and multimale matingsystem (MS), analysed by Pagel & Meade (2006). Theseauthors first use cytochrome b molecular sequences toestimate the posterior distribution of the phylogeneticrelationship among 60 Old World monkeys and apespecies. Using 500 MCMC samples from the posteriordistribution of phylogenetic trees, Pagel and Meade runanother MCMC chain, this time with a reversible jumpcomponent that explores a number of CTMC modelsinvolving EA and MS traits and assess the models’posterior probabilities to learn about EA/MS coevolu-tion. The authors find support in favour of a hypothesisstating that EA presence correlates with MS presence.The trait data are shown in figure 2 together with aphylogenetic tree, randomly chosen from the posteriorsample. While this is not the case for Pagel & Meade(2006), reversible jump MCMC for model selection canbe difficult to implement, especially as the number of traitstates grows. Methods that simply require data fittingunder the null model are warranted. Consequentially, werevisit this dataset and apply posterior predictive modeldiagnostics to test the hypothesis of independentevolution between EA and MS traits.

Our null model assumes that EA and MS evolveindependently as 2 two-state Markov chains X ð1Þ

t ,X ð2Þ

t 2 f0; 1g, where 0 and 1, respectively, stand for traitabsence and presence. Let the infinitesimal generatorsof the EA and MS CTMCs be

QEA ZKa0 a0

a1 Ka1

� �and

QMS ZKb0 b0

b1 Kb1

!: ð4:1Þ

We form a product Markov chain YtZ X ð1Þt ;X ð2Þ

t

� �on

the state space fð0; 0Þ; ð0;1Þ; ð1; 0Þ; ð1;1Þg that keepstrack of presence/absence of the two traits simul-taneously assuming that they evolve independently.The generator of the product chain is obtained via theKronecker sum (4),

FZQMS4QEA

Z

Kða0Cb0Þ b0 a0 0

b1 Kða0Cb1Þ 0 a0

a1 0 Kða1Cb0Þ b0

0 a1 b1 Kða1Cb1Þ

0BBBB@

1CCCCA:

ð4:2Þ

The Kronecker sum representation extends togeneral finite state-space Markov chains and to anarbitrary number of independently evolving traits(Neuts 1995). Computationally, this representationis advantageous, because eigenvalues and eigenvec-tors of a potentially high-dimensional product chaingenerator derive analytically from eigenvalues andeigenvectors of low-dimensional individual generators(Laub 2004).

To test the independent evolution model fit viaposterior predictive diagnostics, we need a discrepancymeasure (Meng 1994). Following Huelsenbeck et al.(2003), we employ mean dwelling times to form a

Page 8: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

Homo sapiensPan paniscus

Pan troglodytesGorilla gorilla

Pongo pygmaeusPongo pygmaeus abelii

Hylobates leucogenysHylobates gabriellae

Hylobates concolorHylobates hoolock

Hylobates syndactylusHylobates lar

Hylobates muelleriHylobates agilis

Hylobates molochHylobates pileatus

Hylobates klossiiMacaca arctoidesMacaca cyclopisMacaca mulatta

Macaca fascicularisMacaca silenus

Macaca sylvanusMacaca nemestrina

Macaca tonkeanaMacaca maura

Macaca heckiMacaca nigra

Macaca nigriscensMacaca brunnescens

Macaca ochreataCercocebus torquatus

Mandrillus sphinxMandrillus leucophaeus

Papio anubisPapio cynocephalus

Papio hamadryasCercopithecus aethiops

Cercopithecus monaCercopithecus nictitans

Pygathrix roxellanaRhinopithecus bieti

Rhinopithecus avunculusNasalis larvatus

Pygathrix nemaeusPresbytis senex

Trachypithecus vetulusSemnopithecus entellus

Trachypithecus johniiTrachypithecus geei

Trachypithecus pileatusTrachypithecus francoisi

Presbytis francoisiPresbytis phayrei

Trachypithecus phayreiTrachypithecus cristatus

Colobus guerezaColobus polykomosColobus angolensisProcolobus badius

0110000000000000011––111111111111111110011101001000000

0–

011

0110000000000000000––111111111111111100000000000000000

0–

000

11

EA MS

Figure 2. Primate trait data. We plot a phylogenetic tree,randomly chosen from the posterior sample, of 60 primatespecies. Branches of the sampled tree are not drawn to scale,nor is the tree ultrametric. Taxa names and trait values (‘0’,absence; ‘1’, presence; ‘K’, missing) for oestrus advertise-ment (EA) and multimale mating system (MS) are depictedat the tips of the tree.

Simulation-free stochastic mapping V. N. Minin & M. A. Suchard 3991

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

discrepancy measure. Let ZZPB

bZ1 tb be the treelength of t. We define the mean dwelling times Zð1Þ

i

and Zð2Þi of traits X ð1Þ

t and X ð2Þt in state i, and the mean

dwelling time Zij of the product chain Yt in state (i, j )

Phil. Trans. R. Soc. B (2008)

for i, jZ0, 1. More formally, we set

Zð1Þi ZE Rwi

jDð1Þ� �

; Z ð2Þi ZE Rwi

jDð2Þ� �

and

Zij ZE RwijjDð1Þ;Dð2Þ

h i; ð4:3Þ

where w1Z ð1;0Þ, w2Z ð0;1Þ, w00Z ð1;0; 0;0Þ,w01Z ð0; 1;0; 0Þ, w10Z ð0; 0;1; 0Þ, w11Z ð0; 0;0; 1Þ

and D(1), D(2) are observations of the two traits onthe tips of t.

Using the dwelling times, we define ‘expected’ 3ijand ‘observed’ hij fractions of time the two traits spendin states i and j,

3ij ZZð1Þi

Z!

Z ð2Þj

Zand hij Z

Zij

Z: ð4:4Þ

We use quotation marks because both quantities arenot observed. To quantify the deviation from the nullhypothesis of independence seen in the data, weintroduce the discrepancy measure,

DðQEA;QMS;DÞZX1

iZ0

X1

jZ0

ð3ijK hijÞ2: ð4:5Þ

This measure returns small values under the null andimplicitly depends on t and branch lengths T. Weaccount for this dependence and phylogenetic uncer-tainty by averaging our results over a finite sample fromthe posterior distribution of t and T, obtained from themolecular sequence data.

We use the software package BAYESTRAITS to accom-plish this averaging and to produce a MCMC samplefrom the posterior distribution of QEA and QMS,assuming the null model of independent evolution ofthe two traits (Pagel et al. 2004). Each iteration samplefrom the output of BAYESTRAITS consists of ðt;T ;a0;a1;b0;b1Þ drawn from their posterior distribution. Giventhese model parameters, we generate a new datasetDrep and compute the observed DðQEA;QMS;DÞ andpredicted DðQEA;QMS;D

repÞ discrepancies for eachiteration. We then compare their marginal distributionsby plotting their corresponding histograms (figure 3). Infigure 3, we also plot the observed against predicteddiscrepancies todisplay the correlationbetween these tworandom variables. The apparent disagreement betweenobserved and predicted discrepancies is a manifestationof poor fit of the independent model of evolution. Theobserved discrepancy consistently exceeds the predictedquantity. To illustrate the performance of predictivediagnostics when the independent model fits datawell, we simulate trait data under this model along oneof the 500 a posteriori supported phylogenetic trees.Figure 3c,d depicts the result of the posterior modeldiagnostics applied to the simulated data. In contrastto the observed primate data, the simulated data donot exhibit disagreement between the observed andpredicted discrepancies.

Disagreement between the observed and predicteddiscrepancies can be quantified using a tail probability,called a posterior predictive p value,

pppZPr DðQEA;QMS;DrepÞODðQEA;QMS;DÞ jD;H0

� �;

ð4:6Þ

where the tail probability is taken over the posteriordistribution of the independent model. In practice,

Page 9: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

freq

uenc

y (×

103 )

0 0.02 0.04 0.06 0.08

0

2

4

6

8(a) (b)

(c) (d )

0

2

4

6

8

discrepancy measure

freq

uenc

y (×

103 )

0 0.01 0.02 0.03

0 0.04 0.08 0.12

0

0.04

0.08

0.12

pred

icte

d di

scre

panc

y

0 0.02 0.04 0.06 0.08

0

0.02

0.04

0.06

0.08

observed discrepancy

pred

icte

d di

scre

panc

y

Figure 3. Testing coevolution. (a,c) Plots depicting observed (white bars) and predicted (grey bars) distributions of thediscrepancy measure for the (a,b) primate and (c,d ) simulated data. (b,d ) The scatter plots of these distributions.

3992 V. N. Minin & M. A. Suchard Simulation-free stochastic mapping

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

given N MCMC iterations, one estimates posteriorpredictive p values via

pppz1

N

XNgZ1

1D Q

ðgÞ

EA;Q

ðgÞ

MS;Drep;g

� �OD Q

ðgÞ

EA;Q

ðgÞ

MS;D

� �� ; ð4:7Þ

where QðgÞEA;Q

ðgÞMS are the parameter values realized at

iteration g, and Drep.g is a dataset simulated using theseparameter values. Following this recipe, we estimateppp for the primate data and the artificial data. Thedisagreement between the ‘observed’ and ‘predicted’discrepancies is reflected in a low pppZ0.0128. Bycontrast, the pppZ0.3139, for the simulated data,supports agreement between the observed and pre-dicted distributions of D.

(b) Mapping synonymous and non-synonymous

mutations

In this section, we consider the important task ofmapping synonymous and non-synonymous mutationsonto branches of a phylogenetic tree. Our point ofdeparture is a recent ambitious analysis of HIV intrahostevolution by Lemey et al. (2007), who use sequence dataoriginally reported by Shankarappa et al. (1999). Theauthors attempt to estimate branch-specific syn-onymous and non-synonymous mutation rates andthen project these measurements onto a time axis.This projection enables one to relate the time evolutionof selection processes with clinical covariates. Lemeyet al. (2007) find fitting codon models computationallyprohibitive in this case. Instead, they first fit a DNA

Phil. Trans. R. Soc. B (2008)

model to the intrahost HIV sequences, obtain aposterior sample of phylogenies with branch lengths,and then use these phylogenies to fit a codon model tothe same DNA sequence alignment. Instead of fittingtwo different models to the data, we propose to use just aDNA model and exploit mapping summaries to predictsynonymous and non-synonymous mutation rates.

Suppose we observe a multiple DNA sequencealignment DZ ðD1;.;DLÞ of a protein-coding regionwith L sites and that all CZL/3 codons are aligned toeach other such that the coding region starts at site 1 ofD (in other words, there is no frameshift). Followingthe codon partitioning model of Yang (1996), weassume that sites corresponding to all first codonpositions, D1;D4;.;DLK2, evolve according to astandard HKY model with generator QHKYðk1;p1Þ,where k1 is the transition–transversion ratio and p1 isthe stationary distribution, both just for the first codonposition appropriately constrained. Similarly, we defineCTMC generators QHKYðk2;p2Þ and QHKYðk3;p3Þ forthe other two codon positions with independentparameters. Assuming that all L nucleotide sites in D

evolve independently together with the three codonposition HKY models induces a product Markov chainmodel on the space of codons ðAAA;AAG;.;TTT Þ,where codons are arranged in lexicographic order withrespect to our nucleotide order A!G!C!T , thegenerator of this product CTMC is

Qcodon ZQHKYðk1;p1Þ4QHKYðk2;p2Þ4QHKYðk3;p3Þ:

ð4:8Þ

Page 10: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

high

(a)

(b)

(d )

(c)

low

syno

nym

ous

rate

sno

n-sy

nony

mou

sra

tes

time before last sample (months)

non-

syno

nym

ous

frac

tion

4.4×10–8

8.7×10–6

1.7×10–3

3.4×10–1

6.7×101

2.7×10–8

6.6×10–6

1.6×10–3

4.0×10–1

9.9×101

120 107 93 80 67 53 40 27 13 00

0.25

0.50

0.75

1.00

Figure 4. Time evolution of synonymous and non-synonymous rates. (a) A representative phylogeny of 129 intrahost HIVsequences. The three heat maps depict the marginal posterior densities of the (b) synonymous and (c) non-synonymous rates,and (d ) the proportion of non-synonymous mutations over time.

Simulation-free stochastic mapping V. N. Minin & M. A. Suchard 3993

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

With this Markov chain on the codon space, wedefine a labelling L(s) that contains all possible pairsof codons that translate into the same amino acid.All other codon pairs are collected into a labelling setL(n). Clearly, transitions between elements of L(s)constitute synonymous mutations, and non-synonymousmutations are represented by transitions betweenelements of L(n). In this manner, counting processesmap synonymous and non-synonymous mutationsonto specific branches of t. We consider HIV sequencesfrom patient 1 of Shankarappa et al. (1999) andapproximate the posterior distribution of our DNAmodel parameters PrðQcodon; t;T jDÞ using MCMCsampling implemented in the software package BEAST(Drummond & Rambaut 2007). The serially sampledHIV sequences permit us to estimate the branch lengthsT in units of clock time, months in this case. For eachsaved MCMC sample, we compute branch-specific ratesof synonymous and non-synonymous mutations,

rbðsÞZ1C

PCcZ1 E

�NLðsÞðtbÞ jDc:cC2

�tb

and

rbðnÞZ1C

PCcZ1 E NLðnÞðtbÞ jDc:cC2

� �tb

;ð4:9Þ

where we denote data at codon site c by Dc:cC2. Wealso record the fraction rbðnÞ=ðrbðsÞC rbðnÞÞ of non-synonymous mutations. Similar to Lemey et al.(2007), we summarize these measurements by project-ing them on the time axis. To this end, we form a finitetime grid and produce a density profile of thesynonymous and non-synonymous rates, and of the

Phil. Trans. R. Soc. B (2008)

non-synonymous mutation fractions for each timeinterval between grid points (figure 4). Both synon-ymous and non-synonymous rate density profilesare consistently bimodal across time. Interestingly,the modes also stay appreciably constant. The densityprofile of the non-synonymous mutation fractionsis multimodal and fairly complex. There are aconsiderable number of branches that exhibit strongnegative ððrbðnÞÞ=ðrbðsÞC rbðnÞÞw0Þ and positive ððrbðnÞÞ=ðrbðsÞC rbðnÞÞw1Þ selection. In general, when comparingbranches that occur earlier in the tree to those thatoccur later, the non-synonymous mutation fractionhas first a modest upward trend through time and thendescends to lower values, consistent with other patternsof evolutionary diversity reported by Shankarappaet al. (1999).

Intrigued by the multimodality observed in figure 4,we investigate this issue further. Lemey et al. (2007)consider several branch categories in their analysis, e.g.internal and terminal. We decide to test whetherdifferences exist between selection forces acting oninternal and terminal branches of t. We define thefractions of non-synonymous mutations on internal andterminal branches as

rEðQcodon;t;T ;DÞ

Z

Pb2E

PCcZ1

E�NLðnÞðtbÞ jDc:cC2

�P

b2EPCcZ1

�E�NLðsÞðtbÞ jDc:cC2

�CE

�NLðnÞðtbÞ jDc:cC2

�ð4:10Þ

Page 11: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

predicted fraction of non-synonymous mutations

freq

uenc

y

0.54 0.56 0.58 0.60 0.62 0.64 0.66

0

50

100

150

Figure 5. Bimodality of the fraction of non-synonymousmutations. We plot the predicted fraction of non-synonymousmutations computed for terminal (grey bars) and internal(white bars) branches.

3994 V. N. Minin & M. A. Suchard Simulation-free stochastic mapping

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

and

rI ðQcodon;t;T ;DÞ

Z

Pb2I

PCcZ1

E�NLðnÞðtbÞ jDc:cC2

�P

b2IPCcZ1

�E�NLðsÞðtbÞ jDc:cC2

�CE

�NLðnÞðtbÞ jDc:cC2

� :ð4:11Þ

We plot their posterior histograms in figure 5. Thesehistograms do not overlap, suggesting different fractionsof non-synonymous mutations for internal and externalbranches. To test this hypothesis more formally, we forma discrepancy measure

DðQcodon; t;T ;DÞ

Z rEðQcodon; t;T ;DÞK rI ðQcodon; t;T ;DÞ: ð4:12Þ

As in our previous example, we compare the observeddiscrepancy DðQcodon; t;T ;DÞ with the expected dis-crepancy DðQcodon; t;T ;DrepÞ, where Drep is a multiplesequence alignment simulated under the codon parti-tioning model with parameters Qcodon, t and T. Evokingapproximation (3.16) and recalling that our modelassumes the same substitution rates for each branch oft, we deduce that

DðQcodon; t;T ;DrepÞz0; ð4:13Þ

for all parameter values Qcodon, T and t and replicateddata Drep. Plugging in our new discrepancy measuresinto equation (4.7), we find that ppp!0.001. Therefore,our posterior predictive test suggests that there is asignificant heterogeneity in selective pressure among thebranches of t.

5. DISCUSSIONIn this paper, we develop a computationally efficientframework for mapping evolutionary trajectories ontophylogenies. Although we aim to keep this mathemat-ical framework fairly general, our main interest withevolutionary mappings lies in computing the mean

Phil. Trans. R. Soc. B (2008)

number of labelled trait transitions and the meanevolutionary reward that depends linearly on the time atrait occupies each of its states. These two mappingsummaries have been the most promising buildingblocks for constructing statistical tests.

We build upon our earlier work involving singlebranch calculations for Markov-induced countingprocesses (Minin & Suchard 2008). In our extension,we introduce single branch calculations for evolution-ary reward processes and devise algorithms to extendsingle branch calculations to mapping expectations ofcounting and reward processes onto branches across anentire phylogeny. Our main result generalizes Felsen-stein’s pruning algorithm that forms the workhorse ofmodern phylogenetic computation. The generalizedpruning algorithm warrants two comments about itsefficiency for performing simulation-free stochasticmapping. First, when the infinitesimal rates areunknown, a traditionally slow component of phylogen-etic inference is the eigen decomposition of theirmatrix. Fortunately, this decomposition finds immedi-ate reuse in our algorithm to calculate posteriorexpectations of mappings. Second, the algorithmrequires only two traversals of the phylogenetictree, and is therefore at most two times slower than thestandard likelihood evaluation algorithm. In practice, wefind our algorithm approximately 1.5 times slower thanthe likelihood calculation. We achieve this advantagebecause during the second traversal n terminal branchesare not visited. Finally, the Felsenstein’s algorithmanalogy yields a pulley principle for stochastic mappingand reduction in computation for prior expectations.

Our examples demonstrate how our novel algorithmfacilitates phylogenetic exploratory analysis andhypothesis testing. First, we use simulation-free stochas-tic mapping of occupancy times to re-implementHuelsenbeck et al.’s (2003) posterior predictive test ofindependent evolution. In our second example, weattempt to recover synonymous and non-synonymousmutation rates without resorting to codon models andinstead use an independent codon partitioning model.We overcome this gross model misspecification withstochastic mapping, find intriguing multimodality ofsynonymous and non-synonymous rates, and use aposterior predictive model check to test differences inselection pressures between terminal and internalbranches. We stress that our predictions are only asgood as the model we use. For example, the terminal/internal branch differences may be due to a general badfit of our purposely misspecified model to the intrahostHIV data. However, we find this scenario unlikely inlight of the recent demonstration of the excellentperformance of codon partitioning models duringanalyses of protein coding regions (Shapiro et al. 2006).

Our examples illustrate importance of hypothesistesting in statistical phylogenetics. In recent years, it hasbecome clear that an evolutionary analysis almost neverends with tree estimation. Importantly, phylogen-etic inference enables evolutionary biologists to tacklescientific hypotheses, appropriately accounting forancestry-induced correlation in observed trait values(Huelsenbeck et al. 2000; Pagel & Lutzoni 2002).Several authors demonstrate that mapping evolution-ary histories onto inferred phylogenies provides

Page 12: Fast, accurate and simulation-free stochastic mapping...Fast, accurate and simulation-free stochastic mapping Vladimir N. Minin1,* and Marc A. Suchard2,3,4 1Department of Statistics,

Simulation-free stochastic mapping V. N. Minin & M. A. Suchard 3995

on 14 April 2009rstb.royalsocietypublishing.orgDownloaded from

a convenient and probabilistically grounded basis fordesigning statistically rigorous tests of evolutionaryhypotheses (Nielsen 2002; Huelsenbeck et al. 2003;Dimmic et al. 2005). Unfortunately, this importantstatistical technique has been hampered by the highcomputational cost of stochastic mapping. Our generalmathematical framework and fast algorithms shouldsecure a central place for stochastic mapping in thestatistical toolbox of evolutionary biologists.

We would like to thank Philippe Lemey for sharing his HIVcodon alignments. We are also grateful to Andrew Meade forhis help on using the BAYESTRAITS software and Ziheng Yangand two anonymous reviewers for their constructive criticism.M.A.S. is supported by an Alfred P. Sloan ResearchFellowship, John Simon Guggenheim Memorial Fellowshipand NIH R01 GM086887.

REFERENCESBall, F. & Milne, R. 2005 Simple derivations of properties of

counting processes associated with Markov renewalprocesses. J. Appl. Prob. 42, 1031–1043. (doi:10.1239/jap/1134587814)

Cannings, C., Thompson, E. & Skolnick, M. 1980 Pedigreeanalysis of complex models. In Current developments inanthropological genetics, pp. 251–298. New York, NY:Plenum Press.

Dimmic, M., Hubisz, M., Bustamante, C. & Nielsen, R. 2005Detecting coevolving amino acid sites using Bayesianmutational mapping. Bioinformatics 21, i126–i135. (doi:10.1093/bioinformatics/bti1032)

Drummond, A. & Rambaut, A. 2007 BEAST: Bayesianevolutionary analysis by sampling trees. BMCEvol. Biol. 7,214. (doi:10.1186/1471-2148-7-214)

Dutheil, J., Pupko, T., Marie, A. & Galtier, N. 2005 A model-based approach for detecting coevolving positions in amolecule. Mol. Biol. Evol. 22, 1919–1928. (doi:10.1093/molbev/msi183)

Felsenstein, J. 1981 Evolutionary trees from DNA sequences:a maximum likelihood approach. J.Mol. Evol. 13, 93–104.

Felsenstein, J. 2004 Inferring phylogenies. Sunderland, MA:Sinauer Associates.

Gelman, A., Meng, X. & Stern, H. 1996 Posterior predictiveassessment of model fitness via realized discrepancies.Stat. Sin. 6, 733–807.

Guindon, S., Rodrigo, A., Dyer, K. & Huelsenbeck, J. 2004Modeling the site-specific variation of selection patternsalong lineages. Proc. Natl Acad. Sci. USA 101,12 957–12 962. (doi:10.1073/pnas.0402177101)

Holmes, I. & Rubin, G. M. 2002 An expectation maximiza-tion algorithm for training hidden substitution models.J. Mol. Biol 317, 753–764. (doi:10.1006/jmbi.2002.5405)

Huelsenbeck, J., Rannala, B. & Masly, J. 2000 Accommodatingphylogenetic uncertainty in evolutionary studies. Science288, 2349–2350. (doi:10.1126/science.288.5475.2349)

Phil. Trans. R. Soc. B (2008)

Huelsenbeck, J., Nielsen, R. & Bollback, J. 2003 Stochasticmapping of morphological characters. Syst. Biol. 52,131–158. (doi:10.1080/10635150390192780)

Lange, K. 2004 Applied probability. New York, NY: Springer.Laub, A. 2004 Matrix analysis for scientists and engineers.

Philadelphia, PA: SIAM.Lemey, P., Pond, S. K., Drummond, A., Pybus, O., Shapiro,

B., Barroso, H., Taveira, N. & Rambaut, A. 2007Synonymous substitution rates predict HIV disease pro-gression as a result of underlying replication dynamics.PLoSComput. Biol. 3, e29. (doi:10.1371/journal.pcbi.0030029)

Leschen, R. & Buckley, T. 2007 Multistate characters anddiet shifts: evolution of Erotylidae (Coleoptera). Syst. Biol.56, 97–112. (doi:10.1080/10635150701211844)

Meng, X. 1994 Posterior predictive p-values. Ann. Stat. 22,1142–1160. (doi:10.1214/aos/1176325622)

Minin, V. & Suchard, M. 2008 Counting labeled transitionsin continuous-time Markov models of evolution. J. Math.Biol. 56, 391–412. (doi:10.1007/s00285-007-0120-8)

Neuts, M. 1995 Algorithmic probability: a collection of problems.London, UK: Chapman and Hall.

Nielsen, R. 2002 Mapping mutations on phylogenies. Syst.Biol. 51, 729–739. (doi:10.1080/10635150290102393)

Pagel, M. 1999 The maximum likelihood approach to re-constructing ancestral character states of discrete characterson phylogenies. Syst. Biol. 48, 612–622. (doi:10.1080/106351599260184)

Pagel, M. & Lutzoni, F. 2002 Accounting for phylogeneticuncertainty in comparative studies of evolution and adap-tation.Biological evolution and statistical physics, pp. 148–161.Berlin, Germany: Springer.

Pagel, M. & Meade, A. 2006 Bayesian analysis of correlatedevolution of discrete characters by reversible-jumpMarkov chain Monte Carlo. Am. Nat. 167, 808–825.(doi:10.1086/503444)

Pagel, M., Meade, A. & Barker, D. 2004 Bayesian estimationof ancestral character states on phylogenies. Syst. Biol. 53,673–684. (doi:10.1080/10635150490522232)

Rodrigue, N., Philippe, H. & Lartillot, N. 2008 Uniformiza-tion for sampling realizations of Markov processes:applications to Bayesian implementations of codonsubstitution models. Bioinformatics 24, 56–62. (doi:10.1093/bioinformatics/btm532)

Schadt, E., Sinsheimer, J. & Lange, K. 1998 Computationaladvances in maximum likelihood methods for molecularphylogeny. Genome Res. 8, 222–233.

Shankarappa, R. et al. 1999 Consistent viral evolutionarychanges associated with the progression of humanimmunodeficiency virus type 1 infection. J. Virol. 73,10 489–10 502.

Shapiro, B., Rambaut, A. & Drummond, A. 2006 Choosingappropriate substitution models for the phylogeneticanalysis of protein-coding sequences. Mol. Biol. Evol. 23,7–9. (doi:10.1093/molbev/msj021)

Yang, Z. 1996 Maximum-likelihood models for combinedanalyses of multiple sequence data. J. Mol. Evol. 42,587–596. (doi:10.1007/BF02352289)