Improved models of molecular evolution in statistical phylogenetics, Stéphane Guindon, Department of Statistics, The University of Auckland, New Zealand


DESCRIPTION

In this talk, I will present new models of molecular evolution suitable for phylogeny estimation. These models provide an improved description of the variation of rates of evolution along genomes and during the course of evolution. I will present examples that demonstrate the superiority of these new models and show how to use them in PhyML.


Page 1: Improved Models of Molecular Evolution in Statistical Phylogenetics - Stephane Guindon

Improved models of molecular

evolution in statistical phylogenetics

Stephane Guindon

Department of Statistics, The University of Auckland

New Zealand


Outline

1 Introduction

2 Variability of rates across sites

3 Variability of rates across sites and lineages

4 Variability of selection regimes across sites and lineages

5 Conclusion


Data used in phylogenetics

Multiple sources:

(Alignment of) homologous sequences

Calibration (typically fossil data)

Geography, environment (e.g., GIS data)

In this talk, I will mainly focus on pre-determined alignment(s) of orthologous coding sequences.


Phylogenetic model

“Hybrid” object made of a discrete parameter, the tree topology, and multiple continuous parameters such as branch lengths, substitution rates between pairs of characters (nucleotides, amino acids, codons), population sizes, migration rates, etc.

Estimation relies on the likelihood, i.e., the probability of the data given the model parameter values.

Bayesian or maximum-likelihood inference, depending on the type of problem and the amount of time/computing resources available.


Likelihood

$$\Pr(S_1, S_2, \ldots, S_6 \mid M) = \Pr(S_1 = \text{AAAA} \mid M) \times \Pr(S_2 = \text{CGGC} \mid M) \times \ldots \times \Pr(S_6 = \text{GGAA} \mid M)$$


Likelihood

$4^n$ combinations (for nucleotides): not computationally tractable for the vast majority of data sets...

Clever tree traversal algorithm by Felsenstein (1981): $4 \times 4 \times n$ operations required → can go up to 5,000 - 10,000 sequences!
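A minimal sketch of the tree traversal idea, under assumptions that are not from the talk: a Jukes-Cantor substitution model, a toy rooted four-taxon tree with made-up branch lengths, and the helper names `jc69_matrix`, `partial_likelihood` and `site_likelihood`. Each internal node combines the partial likelihoods of its children with one 4 × 4 matrix-vector product per edge, and the alignment likelihood is the product of the site likelihoods.

```python
import numpy as np

STATES = "ACGT"

def jc69_matrix(length):
    """Jukes-Cantor transition probabilities for a branch of the given length."""
    decay = np.exp(-4.0 * length / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return np.full((4, 4), diff) + (same - diff) * np.eye(4)

def partial_likelihood(node, site):
    """Vector of Pr(data below node | state at node), computed leaves-to-root."""
    if isinstance(node, str):                    # leaf: node is a taxon name
        vec = np.zeros(4)
        vec[STATES.index(site[node])] = 1.0
        return vec
    vec = np.ones(4)
    for child, length in node:                   # internal node: (child, branch length) pairs
        vec *= jc69_matrix(length) @ partial_likelihood(child, site)
    return vec

def site_likelihood(tree, site):
    """Average the root partial likelihoods over uniform base frequencies."""
    return float(np.full(4, 0.25) @ partial_likelihood(tree, site))

# Hypothetical tree ((t1,t2),(t3,t4)); branch lengths are illustrative only.
TAXA = ["t1", "t2", "t3", "t4"]
TREE = [([("t1", 0.1), ("t2", 0.1)], 0.05), ([("t3", 0.2), ("t4", 0.2)], 0.05)]

# Sites are assumed independent, so the alignment likelihood is a product.
columns = ["AAAA", "CGGC", "GGAA"]
likelihood = 1.0
for column in columns:
    likelihood *= site_likelihood(TREE, dict(zip(TAXA, column)))
print(likelihood)
```

Summing `site_likelihood` over all 256 possible site patterns should give 1, which is a convenient sanity check for this kind of code.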


Core of the likelihood

$N(t)$: number of substitutions in short time interval $[0, t]$.

Poisson process:

$$\Pr(N(t + dt) - N(t) = 1) \simeq \lambda\,dt$$

$$\Pr(N(t + dt) - N(t) = 0) \simeq 1 - \lambda\,dt$$

$$\Pr(N(t + dt) - N(t) \geq 2) \simeq 0$$

Poisson probability:

$$\Pr(N(t) = k) = \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}$$


Core of the likelihood

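0/1 data, 2 substitutions

$$\begin{pmatrix} p_{0\to 0}\, p_{0\to 0} + p_{0\to 1}\, p_{1\to 0} & p_{0\to 0}\, p_{0\to 1} + p_{0\to 1}\, p_{1\to 1} \\ p_{1\to 0}\, p_{0\to 0} + p_{1\to 1}\, p_{1\to 0} & p_{1\to 0}\, p_{0\to 1} + p_{1\to 1}\, p_{1\to 1} \end{pmatrix}$$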


Core of the likelihood

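0/1 data, 2 substitutions

$$\begin{pmatrix} p_{0\to 0} & p_{0\to 1} \\ p_{1\to 0} & p_{1\to 1} \end{pmatrix} \times \begin{pmatrix} p_{0\to 0} & p_{0\to 1} \\ p_{1\to 0} & p_{1\to 1} \end{pmatrix}$$

In general, for $k \geq 0$ substitutions, the probabilities of change from one state to another are given by $R^k$, where $R$ is the matrix of one-substitution probabilities.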
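As a quick numerical illustration, the two-substitution probabilities written out term by term match the matrix product, and `numpy.linalg.matrix_power` gives $R^k$ for any $k$. The entries of `R` below are arbitrary values chosen for the example, not values from the talk.

```python
import numpy as np

# Hypothetical one-substitution probabilities for 0/1 data (each row sums to 1).
R = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Two substitutions, written out term by term as a sum over the intermediate state.
explicit = np.array([
    [R[0, 0] * R[0, 0] + R[0, 1] * R[1, 0], R[0, 0] * R[0, 1] + R[0, 1] * R[1, 1]],
    [R[1, 0] * R[0, 0] + R[1, 1] * R[1, 0], R[1, 0] * R[0, 1] + R[1, 1] * R[1, 1]],
])

# The same quantity as a matrix product.
two_substitutions = R @ R

def k_substitutions(R, k):
    """Probabilities of change after exactly k substitutions: R to the power k."""
    return np.linalg.matrix_power(R, k)

print(np.allclose(explicit, two_substitutions))
print(k_substitutions(R, 5))
```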


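Core of the likelihood

Combine the Poisson probability of $k$ substitutions happening with $R^k$ in order to derive $P(t)$, the matrix of transition probabilities between states in time $t$:

$$\begin{aligned} P(t) &= \sum_{k=0}^{\infty} R^k\, \frac{(\mu t)^k e^{-\mu t}}{k!} \\ &= e^{-\mu t} \sum_{k=0}^{\infty} \frac{(R \mu t)^k}{k!} \\ &= e^{-\mu t}\, e^{R \mu t} \\ &= e^{(R - I)\mu t} \end{aligned}$$

Writing $Q = R - I$ and $l = \mu t$:

$$P(\mu t) = e^{Q \mu t}, \qquad P(l) = e^{Ql}$$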
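The derivation can be checked numerically. The sketch below compares a truncated Poisson-weighted sum of powers of $R$ with the matrix exponential of $(R - I)\mu t$, computed here through an eigendecomposition. The matrix `R`, the value of `mu_t` and the function names are assumptions made for the example.

```python
import numpy as np

def poisson_mixture(R, mu_t, n_terms=60):
    """Truncated sum over k of R^k weighted by the Poisson probability of k events."""
    P = np.zeros_like(R)
    weight = np.exp(-mu_t)                 # Poisson probability of k = 0
    R_power = np.eye(R.shape[0])           # R^0
    for k in range(n_terms):
        P += weight * R_power
        R_power = R_power @ R              # R^(k+1)
        weight *= mu_t / (k + 1)           # Poisson probability of k + 1
    return P

def matrix_exponential(Q, l):
    """exp(Q l) via eigendecomposition; assumes Q is diagonalisable."""
    eigvals, V = np.linalg.eig(Q)
    return (V @ np.diag(np.exp(eigvals * l)) @ np.linalg.inv(V)).real

# Hypothetical one-substitution probabilities and expected number of events.
R = np.array([[0.7, 0.3],
              [0.4, 0.6]])
mu_t = 1.5

P_sum = poisson_mixture(R, mu_t)
P_exp = matrix_exponential(R - np.eye(2), mu_t)
print(np.allclose(P_sum, P_exp))
```

The eigendecomposition is used only to keep the example free of extra dependencies; a library routine for the matrix exponential would do the same job.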

2 Variability of rates across sites

Simplest model

Same rate matrix Q throughout the tree

Sites are independent and identically distributed (iid)

Edges all have the same length l


Standard model, no variation across sites

Same rate matrix throughout the tree

Sites are iid

Each edge has its own length


Standard model, variation across sites

Same rate matrix throughout the tree

Each edge has its own length

Sites are still iid, $\pi_{\text{fast}} + \pi_{\text{slow}} = 1$, $\pi_{\text{fast}}\, r_{\text{fast}} + \pi_{\text{slow}}\, r_{\text{slow}} = 1$


Continuous Gamma model


Discrete Gamma model (Yang, 1994)
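A rough sketch of the discretisation idea, not the exact computation used in any phylogenetics package: a Gamma distribution with mean 1 and shape α is cut into equally probable classes, and each class is represented by its mean rate. Here the class means are approximated by Monte Carlo sampling; the function name `discrete_gamma_rates`, the number of draws and the seed are assumptions of this example.

```python
import numpy as np

def discrete_gamma_rates(alpha, n_classes=4, n_draws=400_000, seed=0):
    """Approximate the mean rate of each equally probable Gamma class by sampling."""
    n_draws = (n_draws // n_classes) * n_classes        # make the classes equal-sized
    rng = np.random.default_rng(seed)
    # Shape alpha and scale 1/alpha give a distribution with mean 1.
    draws = np.sort(rng.gamma(shape=alpha, scale=1.0 / alpha, size=n_draws))
    class_means = draws.reshape(n_classes, -1).mean(axis=1)
    return class_means / class_means.mean()             # enforce an average rate of exactly 1

rates = discrete_gamma_rates(alpha=0.5)
print(rates)
```

A single parameter (α) fixes all class rates: small values of α spread the rates widely, large values pull them towards 1.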


Transition to new models

Benefit of Yang’s discrete Gamma approach: one parameter (α) determines the values of all the $r_i$'s.

Limiting the number of parameters to estimate is convenient from a computational perspective.

In practice, the variation of rates across sites is a strong feature of molecular evolution, i.e., estimating α is easy.

Modern genetic data sets are much bigger than they were in the 90’s.

Computers are also much faster.

It is high time we moved on...

Designing more flexible models of rate variation is relatively straightforward.


FreeRate model

Non-parametric estimation of the $\pi_i$'s and $r_i$'s: estimate these parameters under the two constraints $\sum_i \pi_i = 1$ and $\sum_i \pi_i r_i = 1$.


FreeRate model

Main drawback is the greater number of parameters to estimate: one for the discrete gamma model vs. 2C − 2 for FreeRate with C rate classes.

Benefits:

more flexibility in modelling the variability of rates across sites.

possibility to select the “best” number of rate classes using a sound statistical approach (e.g., likelihood ratio tests)
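The two FreeRate constraints and the resulting parameter count can be sketched as follows. This is an illustration only, not PhyML's implementation: the function names and the input values are made up, and the rescaling shown is just one simple way of enforcing the constraints on unconstrained class weights and rates.

```python
import numpy as np

def freerate_normalise(raw_weights, raw_rates):
    """Rescale weights and rates so that sum(pi) = 1 and sum(pi * r) = 1."""
    weights = np.asarray(raw_weights, dtype=float)
    rates = np.asarray(raw_rates, dtype=float)
    weights = weights / weights.sum()             # class frequencies sum to 1
    rates = rates / np.dot(weights, rates)        # average rate equals 1
    return weights, rates

def free_parameter_count(n_classes):
    """C weights plus C rates, minus the two constraints."""
    return 2 * n_classes - 2

# Hypothetical unconstrained values for C = 4 rate classes.
weights, rates = freerate_normalise([2.0, 5.0, 2.0, 1.0], [0.05, 0.5, 2.0, 6.0])
print(weights, rates, free_parameter_count(4))
```

With C = 4 this gives six free parameters, against a single α for the discrete Gamma model with the same number of classes.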


Results: nucleotide data sets


Results: amino-acid data sets


Prediction of amino-acid diversity


Summary

FreeRate generally fits data better than +Γ4.

Similar computational costs.

Soubrier et al. (2012, Mol. Biol. Evol.): FreeRate returns more accurate estimates of node ages compared to +Γ4.

In PhyML: command-line option --freerate (or --freerates).

3 Variability of rates across sites and lineages

Actual rate patterns (?)


Modelling site-specific rate patterns

Each site and each edge has its own rate of evolution.

No-common-mechanism model: poor statistical properties.

Alternative: each edge has the same distribution of rates.


The Integrated Length (IL) approach

The length of a branch is a random variable, characterized by a mean (number of substitutions) and a variance.

In the current implementation, the variance is proportional to the mean (one extra parameter for the whole tree compared to the standard approach).


The Integrated Length (IL) approach

Integrate over a wide range of scenarios...

Including the good ones...

...and the not so good ones.


Theory

Standard approach:

$$P(l) = e^{Ql}$$

IL approach:

$$P(l) = \int_0^{\infty} e^{Ql}\, p(l)\, dl$$

If $l$ is distributed as $\Gamma(\alpha, \beta)$, then:

$$P(\alpha, \beta) = (I - \beta Q)^{-\alpha}$$

Same computational cost as that of the standard approach.
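A small numerical sketch of the closed form, under assumptions that are not from the talk: β is taken to be the scale of the Gamma distribution (so that the mean is αβ and the variance αβ², i.e., the variance-to-mean ratio is β), Q is the Jukes-Cantor rate matrix, and the matrix power is computed through `numpy.linalg.eigh`, which requires a symmetric Q. The function names are hypothetical.

```python
import numpy as np

def jc69_rate_matrix():
    """Jukes-Cantor rate matrix scaled to one expected substitution per unit length."""
    Q = np.full((4, 4), 1.0 / 3.0)
    np.fill_diagonal(Q, -1.0)
    return Q

def il_transition_matrix(Q, alpha, beta):
    """(I - beta Q)^(-alpha) for a symmetric Q, via its eigendecomposition."""
    eigvals, V = np.linalg.eigh(Q)
    scaled = (1.0 - beta * eigvals) ** (-alpha)    # eigenvalues of Q are <= 0
    return V @ np.diag(scaled) @ V.T

def standard_transition_matrix(Q, length):
    """exp(Q l) for a symmetric Q, via the same eigendecomposition."""
    eigvals, V = np.linalg.eigh(Q)
    return V @ np.diag(np.exp(eigvals * length)) @ V.T

Q = jc69_rate_matrix()
mean_length = 0.3

# Larger beta means more variance around the same mean branch length.
for beta in (1e-6, 0.05, 0.5):
    P = il_transition_matrix(Q, alpha=mean_length / beta, beta=beta)
    print(beta, P[0, 0])
print("standard", standard_transition_matrix(Q, mean_length)[0, 0])
```

Both functions cost one eigendecomposition plus a matrix product, which is in line with the computational-cost remark above. As β approaches 0 with the mean held fixed, the IL matrix approaches the standard one.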


Results: nucleotide data sets


Results: amino-acid data sets


Summary

IL incurs approximately the same computational cost as the standard model.

The standard model is nested within IL (it is the special case where the branch length variance goes to zero): avenue for hypothesis testing.

Gamma distribution is a good model for the branch length if the rate of evolution fluctuates according to a (geometric) Brownian process (Guindon, Syst. Biol., 2013).

Large improvement for a small proportion of data sets.

In PhyML: command-line option --il.

4 Variability of selection regimes across sites and lineages

Codon models

Use alignments of homologous coding sequences to estimate the ratio of non-synonymous to synonymous (dN/dS) substitution rates.

The Q matrix is now 61 by 61 (instead of 4×4 or 20×20).

We are no longer interested in the variation of the overall rate at which substitutions accumulate. Rather, we focus on the variation of dN/dS.


Variation across sites: M2a model


M2a model: a different viewpoint

Consider a new model where each state of the Markov model is a combination of a codon state and a selection regime.

The M2a model is then defined by a rate matrix Q with dimension (3 × 61) by (3 × 61):

$$Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix}$$


Extending M2a: branch-site model

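$$Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix} + \begin{pmatrix} - & \pi_{\omega_1} I & \pi_{\omega_2} I \\ \pi_{\omega_0} I & - & \pi_{\omega_2} I \\ \pi_{\omega_0} I & \pi_{\omega_1} I & - \end{pmatrix}$$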

Guindon, Rodrigo, Dyer, Huelsenbeck (2004, PNAS)
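To make the block structure concrete, here is a toy construction of the combined matrix. It follows the matrix as written above and does not attempt to reproduce the full published parameterisation. Several things are assumptions of this example: a 4-state placeholder stands in for each 61 × 61 codon matrix, the function names `toy_rate_matrix` and `switching_model` are hypothetical, and the ω values and class frequencies are made up.

```python
import numpy as np

def toy_rate_matrix(omega, n_states=4):
    """Hypothetical stand-in for a 61 x 61 codon rate matrix Q_omega."""
    Q = np.full((n_states, n_states), float(omega))
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))            # rows of a rate matrix sum to 0
    return Q

def switching_model(omegas, class_freqs, n_states=4):
    """Block-diagonal within-class matrices plus switches between selection classes."""
    n_classes = len(omegas)
    size = n_classes * n_states
    Q = np.zeros((size, size))
    identity = np.eye(n_states)
    for i, omega in enumerate(omegas):
        rows = slice(i * n_states, (i + 1) * n_states)
        Q[rows, rows] = toy_rate_matrix(omega, n_states)
        for j, freq in enumerate(class_freqs):
            if j != i:
                cols = slice(j * n_states, (j + 1) * n_states)
                Q[rows, cols] = freq * identity    # change class, keep the same state
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))            # the "-" entries: make rows sum to 0
    return Q

# Made-up values: purifying, neutral and positive selection classes.
Q = switching_model(omegas=[0.1, 1.0, 3.0], class_freqs=[0.6, 0.3, 0.1])
print(Q.shape)
```

Setting all off-diagonal blocks to zero recovers the block-diagonal matrix of the M2a viewpoint, in which a site never changes selection class along the tree.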


Shan et al. (2009, Mol. Biol. Evol.)


Standard branch-site model

Gold standard: branch-site model (PAML), where the user specifies a priori which branches are likely to be affected by positive selection at some sites.

The stochastic branch-site model does not require such prior information.

Also, the standard branch-site model assumes that the same branches undergo positive selection in different regions of the alignment.

The stochastic branch-site approach does not impose that constraint.


Simulations


Power to detect positive selection

Truth: 50% A + 50% F      XU      XV      XW
std-BS with tree A      0.912   0.916   1.000
std-BS with tree B      0.020   0.036   0.190
std-BS with tree C      0.346   0.346   0.976
std-BS with tree D      0.006   0.010   0.166
std-BS with tree E      0.172   0.150   0.000
std-BS multi            0.022   0.040   0.264
sto-BS                  0.148   0.178   0.682

XU, XV: 20% of the sites evolve under positive selection (on green edges only).

XW: 40% of the sites evolve under positive selection (on green edges only).

Lu & Guindon (2013, Mol. Biol. Evol.)


Summary

Strong prior on where positive selection might have occurred: use the standard branch-site model.

Exploratory analysis: the stochastic branch-site model performs better. Also clearly outperforms the MEME approach implemented in HyPhy (Murrell et al., 2012).

Stochastic branch-site model implemented in fitmodel: http://code.google.com/p/fitmodel.


Conclusion

FreeRate model almost systematically outperforms the standard (+Γ4) one.

IL approach brings significant improvement for a smaller fraction of the alignments (why?)

Stochastic branch-site model of codon evolution is well suited for exploratory analysis where one does not have a clear idea a priori about the lineages evolving under positive selection.

The future?

More data means more variability to account for → improving models of molecular evolution is (still) essential.

Better models for other sources of data, in particular spatial coordinates of collected sequences and fossils.
