
A probabilistic parsimonious model for species tree reconstruction

Talk presented at the Evolution Meeting 2013 (http://www.evolutionmeeting.org/engine/search/index.php?func=detail&aid=478)

Page 1: A probabilistic parsimonious model for species tree reconstruction

A probabilistic parsimonious model for species tree reconstruction

Leonardo de Oliveira Martins and David Posada

with invaluable help from Klaus Schliep and Diego Mallo

Page 2: A probabilistic parsimonious model for species tree reconstruction

What do we want

● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative than others, and some may carry no signal at all

● To estimate species trees given arbitrary gene families ← they can contain paralogs, missing data, etc.

● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon

● Fast computation ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless

Page 3: A probabilistic parsimonious model for species tree reconstruction

Outline

Model of gene family evolution

Parsimonious estimation of disagreement

* reconciliation

* distance between trees

Hierarchical Bayesian model

Examples

* comparing many trees

* simulation

* TreeFam data set

Page 4: A probabilistic parsimonious model for species tree reconstruction

Model for the evolution of gene families

[Diagram: D1, D2, ..., Dn; gene trees G1, G2, ..., Gn; species tree S]

Page 6: A probabilistic parsimonious model for species tree reconstruction

Model for the evolution of gene families

[Diagram: D1, gene tree G1 and species tree S; labels: distance between G and S, P(G | S)]

Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees

● we may use several such simple explanations

● work with unrooted gene trees

● penalize gene trees very different from species tree

Ref.: Rodrigo and Steel (2008) ML supertrees. Syst Biol 57: 243
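
As a sketch only (the slide does not show a formula): assuming the exponential form used for ML supertrees, the probability of a gene tree decays with its distance to the species tree. The symbols λ (penalty rate) and Z (normalizing constant) are introduced here for illustration.

```latex
P(G \mid S, \lambda) = \frac{1}{Z(\lambda, S)} \exp\bigl(-\lambda\, d(G, S)\bigr),
\qquad
Z(\lambda, S) = \sum_{G'} \exp\bigl(-\lambda\, d(G', S)\bigr)
```

Under this form, gene trees very different from the species tree (large d) are penalized, as the last bullet asks.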

Page 9: A probabilistic parsimonious model for species tree reconstruction

Quantifying the disagreement

[Figure: reconciliation of a gene tree with a species tree]

● assuming deep coalescence: 1 deep coalescence

● assuming duplications and losses: 1 duplication, 3 losses

● assuming HGT: 1 event

● stochastic error / nonparametric
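
A minimal sketch of how duplications can be counted by reconciliation, assuming rooted binary trees written as nested tuples whose leaves are species names (the talk itself works with unrooted gene trees, so this is a simplification). It uses the standard LCA mapping: a gene-tree node is a duplication when it maps to the same species-tree node as one of its children. The function names are ours, not from the authors' software.

```python
def leaf_sets(tree):
    """Return the species set below every node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(walk(child) for child in node))
        found.append(below)
        return below

    walk(tree)
    return found


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species node as a child."""
    species_nodes = leaf_sets(species_tree)

    def lca(species):
        # Smallest species-tree clade that contains all these species.
        return min((c for c in species_nodes if species <= c), key=len)

    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([node])
        child_sets = [walk(child) for child in node]
        below = frozenset().union(*child_sets)
        if any(lca(child) == lca(below) for child in child_sets):
            duplications += 1
        return below

    walk(gene_tree)
    return duplications


species_tree = (("A", "B"), "C")
print(count_duplications((("A", "B"), "C"), species_tree))  # congruent gene tree
print(count_duplications((("A", "C"), "B"), species_tree))  # conflicting gene tree
```

Counting losses or deep coalescences needs more bookkeeping and is left out of this sketch.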

Page 11: A probabilistic parsimonious model for species tree reconstruction

Quantifying the disagreement – other measures

mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
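
The Robinson-Foulds (RF) distance named in the reference above counts the bipartitions found in one tree but not in the other. A minimal sketch for ordinary single-copy trees (not the mul-tree version), assuming both trees share the same taxa and are written as nested tuples of taxon names; the function names are ours.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples.

    The root position is ignored, so the tree is treated as unrooted.
    """
    taxa = set()
    clades = []

    def walk(node):
        if isinstance(node, str):
            taxa.add(node)
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        clades.append(below)
        return below

    walk(tree)
    all_taxa = frozenset(taxa)
    anchor = min(all_taxa)  # each split is stored as the side without this taxon
    result = set()
    for clade in clades:
        side = all_taxa - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_taxa) - 2:
            result.add(side)
    return result


def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))


print(rf_distance((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))))
```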

Page 12: A probabilistic parsimonious model for species tree reconstruction

Quantifying the disagreement – other measures

de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.

Page 13: A probabilistic parsimonious model for species tree reconstruction

Quantifying the disagreement – other measures

see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1

Page 14: A probabilistic parsimonious model for species tree reconstruction

Quantifying the disagreement – other measures

Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119

Page 19: A probabilistic parsimonious model for species tree reconstruction

Now we have estimates for these

● assuming deepcoal (1 deepcoal): gene tree parsimony

● assuming duplosses (1 dup, 3 losses): gene tree parsimony

● assuming HGT (1 event): (approximate) dSPR

● stochastic error/nonparametric: RF, Hdist

Page 21: A probabilistic parsimonious model for species tree reconstruction

Considering several measures of disagreement:
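
The equation on this slide did not survive extraction. One plausible form, offered as an assumption rather than the authors' exact formula, gives each distance its own penalty and combines them in a single exponential. The indices dup, loss and spr follow the λ names used later in the talk; Z is the normalization constant.

```latex
P(G \mid S, \boldsymbol{\lambda}) = \frac{1}{Z(\boldsymbol{\lambda}, S)}
\exp\Bigl(-\sum_{k} \lambda_k\, d_k(G, S)\Bigr),
\qquad k \in \{\mathrm{dup}, \mathrm{loss}, \mathrm{spr}, \dots\}
```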

Problem: the normalization constant

E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.

Solution: importance sampling estimate of Z(.)

Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420 – 426

Thus we can incorporate, e.g., duplications and losses while accounting for HGT and random errors

Easy to include other distances in the future
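
A toy sketch of the importance-sampling idea for the normalization constant Z(.), under strong simplifying assumptions: the space of candidate gene trees is small enough to list, a single distance is used, and the proposal is uniform. The distances below are hypothetical numbers, and the function names are ours; the real method has to sample from tree space.

```python
import math
import random


def estimate_z(distances, lam, n_samples, rng):
    """Importance-sampling estimate of Z = sum_i exp(-lam * d_i).

    The proposal q is uniform over the candidates, so each sampled
    weight is exp(-lam * d) / q = len(distances) * exp(-lam * d).
    """
    n = len(distances)
    total = 0.0
    for _ in range(n_samples):
        d = distances[rng.randrange(n)]
        total += n * math.exp(-lam * d)
    return total / n_samples


def exact_z(distances, lam):
    """Exact Z by full enumeration, feasible only for toy examples."""
    return sum(math.exp(-lam * d) for d in distances)


# Hypothetical distances d(G, S) for ten candidate gene trees.
toy_distances = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
rng = random.Random(1)
z_hat = estimate_z(toy_distances, lam=0.5, n_samples=20000, rng=rng)
z_true = exact_z(toy_distances, lam=0.5)
print(z_hat, z_true)
```

A proposal that favors trees close to S would lower the variance; the uniform proposal is used here only to keep the sketch short.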

Page 29: A probabilistic parsimonious model for species tree reconstruction

Distribution of gene trees: probabilistic model

[Diagram, built up over slides 23 to 29: D1, ..., Dn; Q1, ..., Qn; gene trees G1, ..., Gn; species tree S; parameters λdup1, ..., λdupn, λloss1, ..., λlossn, λspr1, ..., λsprn, ..., with priors λdupprior, λlossprior, λsprprior, ...; labels: Input (slide 27), Output (slides 28 and 29)]

Importance sampling: so we can use complex, state-of-the-art software for phylogenetic inference

We should not rely on single estimates of gene phylogenies

E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome research 23: 323-330.
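
One possible reading of the diagram as a hierarchical Bayesian model, written out as a sketch. The factorization is our interpretation of the symbols on the slides, not a formula shown in the talk; the term P(D_i | G_i) stands for whatever external phylogenetic software supplies about gene family i.

```latex
P\bigl(S, \{G_i\}, \{\boldsymbol{\lambda}_i\}, \boldsymbol{\lambda}_{\mathrm{prior}} \mid D_1, \dots, D_n\bigr)
\;\propto\;
P(S)\, P(\boldsymbol{\lambda}_{\mathrm{prior}})
\prod_{i=1}^{n} P(D_i \mid G_i)\, P(G_i \mid S, \boldsymbol{\lambda}_i)\,
P(\boldsymbol{\lambda}_i \mid \boldsymbol{\lambda}_{\mathrm{prior}})
```

Here each gene family i has its own penalty vector λ_i (dup, loss, spr, ...), tied together through the shared prior.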

Page 32: A probabilistic parsimonious model for species tree reconstruction

Example: distances between gene families

● 567 single-copy gene trees for 23 species

Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331

● Analysis under a model where only RF, Hdist and dSPR are considered

● Not interested in the data set per se (unreliable)

● Used only as a didactic tool to show how the model works

[Figure panels: RF, Hdist, SPR]

Page 35: A probabilistic parsimonious model for species tree reconstruction

Example: distances between gene families

[Figure panels: RF, Hdist, SPR; each shows posterior samples and the best estimate]

Page 37: A probabilistic parsimonious model for species tree reconstruction

Analysis of simulated data sets

Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765

We use gene trees only, and simulate tree inference error

● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada

● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution
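
To illustrate only the birth-and-death part of such a simulation, here is a toy sketch that tracks the number of loci along a single branch with a Gillespie-style loop. It is not the simulator used for the talk: the rates, the function name and the single-branch setting are assumptions, and the coalescent and sequence evolution steps are left out.

```python
import random


def simulate_locus_count(n0, birth, death, duration, rng):
    """Number of loci after `duration` under a linear birth-death process.

    Each locus duplicates at rate `birth` and is lost at rate `death`.
    """
    n, t = n0, 0.0
    while n > 0:
        total_rate = n * (birth + death)
        if total_rate <= 0.0:
            break  # no events can happen
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > duration:
            break  # the next event would fall after the end of the branch
        if rng.random() < birth / (birth + death):
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n


rng = random.Random(42)
counts = [simulate_locus_count(1, 0.3, 0.2, 5.0, rng) for _ in range(1000)]
print(sum(counts) / len(counts))  # average number of loci at the end of the branch
```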

Page 38: A probabilistic parsimonious model for species tree reconstruction

Analysis of simulated data sets – results

Page 42: A probabilistic parsimonious model for species tree reconstruction

Single copy genes from Drosophila (TreeFam)

● 4591 informative, single-copy gene families

● (TreeFam database has 14250 informative gene families)

Page 45: A probabilistic parsimonious model for species tree reconstruction

Single copy genes from Drosophila (TreeFam)

● 4591 informative, single-copy gene families

Estimated species tree: [figure not preserved]

● Root location uncertain

● Only one unrooted topology

Page 47: A probabilistic parsimonious model for species tree reconstruction

Large gene families from Drosophila (TreeFam)

● 43 gene families with 102 to 295 tips

Best species tree: [figure not preserved; annotated ~100%]

Page 48: A probabilistic parsimonious model for species tree reconstruction

To recap, our model can

● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative than others, and some may carry no signal at all

● Estimate species trees given arbitrary gene families ← they can contain paralogs, missing data, etc.

● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon

● Be fast ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless

The larger, the better, especially for rooting the species tree

Do not assume gene trees are known: embrace ignorance!

Different gene families may be the product of distinct processes

It's parallelized, and all distances can be calculated very fast.

Page 49: A probabilistic parsimonious model for species tree reconstruction

Thank you!

Check out http://darwin.uvigo.es for announcements, code, slides...