Talk presented at the Evolution Meeting 2013 (http://www.evolutionmeeting.org/engine/search/index.php?func=detail&aid=478)
A probabilistic parsimonious model for species tree reconstruction
Leonardo de Oliveira Martins, David Posada
with invaluable help from Klaus Schliep and Diego Mallo
What do we want
● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative than others, or there may be no signal at all
● To estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
● Fast computation ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
Outline
Model of gene family evolution
Parsimonious estimation of disagreement
* reconciliation
* distance between trees
Hierarchical Bayesian model
Examples
* comparing many trees
* simulation
* TreeFam data set
Model for the evolution of gene families
[Figure: data sets D1, D2, ..., Dn, gene trees G1, G2, ..., Gn, and the species tree S; P(G | S) is labelled "distance between G and S"]
Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees
● we may use several such simple explanations
● work with unrooted gene trees
● penalize gene trees very different from species tree
Quantifying the disagreement
[Figure: reconciliation of a gene tree with a species tree]
● assuming deepcoal: 1 deepcoal
● assuming duplosses: 1 dup, 3 losses
● assuming HGT: 1 event
● Stochastic error/nonparametric
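A minimal sketch of how duplications and losses can be counted by mapping each gene tree node to the last common ancestor (LCA) of its species. This is one common way of doing reconciliation for rooted, binary trees, not necessarily the implementation behind the talk. The nested-tuple tree format, the function names, and the three-species example are assumptions; the example trees are not taken from the slide figure.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clade_depths(species_tree):
    """Map every clade of the species tree (a frozenset of species) to its depth."""
    depths = {}

    def visit(node, depth):
        depths[leaves(node)] = depth
        if not isinstance(node, str):
            for child in node:
                visit(child, depth + 1)

    visit(species_tree, 0)
    return depths


def lca(depths, species):
    """Deepest species-tree clade that contains all the given species."""
    return max((c for c in depths if species <= c), key=lambda c: depths[c])


def reconcile(gene_tree, species_tree):
    """Count duplications and losses under LCA mapping.

    Gene tree leaves are labelled with the species they were sampled from.
    """
    depths = clade_depths(species_tree)
    counts = {"dup": 0, "loss": 0}

    def visit(node):
        mapped = lca(depths, leaves(node))
        if isinstance(node, str):
            return mapped
        child_maps = [visit(child) for child in node]
        # A node is a duplication if it maps to the same place as one of its children.
        is_dup = mapped in child_maps
        counts["dup"] += int(is_dup)
        for child_map in child_maps:
            gap = depths[child_map] - depths[mapped]
            counts["loss"] += gap if is_dup else gap - 1
        return mapped

    visit(gene_tree)
    return counts


species = (("A", "B"), "C")
gene = (("A", "C"), "B")  # hypothetical gene tree that disagrees with the species tree
print(reconcile(gene, species))
```

For this hypothetical pair of trees the disagreement is explained by 1 duplication and 3 losses; a gene tree identical to the species tree gives zero for both.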
Quantifying the disagreement: other measures
● mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
● de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.
● see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1
● Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119
Now we have estimates for these
● assuming deepcoal: 1 deepcoal ← gene tree parsimony
● assuming duplosses: 1 dup, 3 losses ← gene tree parsimony
● assuming HGT: 1 event ← (approximate) dSPR
● Stochastic error/nonparametric ← RF, Hdist
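As an illustration of the simplest of these measures, the sketch below computes the Robinson-Foulds (RF) distance between two unrooted trees as the number of nontrivial bipartitions found in one tree but not in the other. The nested-tuple input and the function names are assumptions made for the example; it covers single-copy trees on the same leaf set only, not the mul-tree version.

```python
def leaf_set(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions of an unrooted tree given as a rooted nested tuple.

    Each bipartition is stored as the side that does not contain a fixed
    reference leaf, so the arbitrary root position does not matter.
    """
    all_taxa = leaf_set(tree)
    reference = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if reference in clade else clade
        # Skip trivial splits (a single leaf against all the others).
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))


print(rf_distance((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))))
```

Two four-leaf trees with conflicting internal branches are at distance 2, and rerooting a tree leaves the distance unchanged.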
Considering several measures of disagreement:
Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors
Easy to include other distances in the future
Problem: the normalization constant
E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
Solution: importance sampling estimate of Z(.)
Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420-426
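To make the normalization problem concrete, here is a hedged sketch. It assumes, purely for illustration, that the probability of a gene tree decays exponentially with its distance from the species tree, P(G | S) = exp(-lam * d(G, S)) / Z(lam), so that Z(lam) is a sum over all topologies. The sketch estimates Z from topologies drawn uniformly at random, which is importance sampling with a uniform proposal. The exponential form, the single distance, and the names `lam`, `log_z_estimate` and `num_unrooted_topologies` are assumptions, not the model as presented.

```python
import math


def num_unrooted_topologies(n_taxa):
    """(2n - 5)!!, the number of unrooted binary topologies on n_taxa >= 3 leaves."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def log_z_estimate(distances, lam, n_taxa):
    """Estimate log Z(lam) from the distances d(G_i, S) of uniformly sampled trees G_i.

    Z = sum over all topologies of exp(-lam * d), which equals the number of
    topologies times the average of exp(-lam * d) under uniform sampling.
    """
    log_weights = [-lam * d for d in distances]
    top = max(log_weights)  # subtract the maximum to avoid underflow
    mean_weight = sum(math.exp(w - top) for w in log_weights) / len(log_weights)
    return math.log(num_unrooted_topologies(n_taxa)) + top + math.log(mean_weight)


# Toy check with 4 taxa: the 3 topologies are at RF distance 0, 2 and 2 from S,
# so sampling each exactly once must recover the exact sum.
exact = 1 + 2 * math.exp(-2.0)
print(math.exp(log_z_estimate([0, 2, 2], lam=1.0, n_taxa=4)), exact)
```

With many taxa the space cannot be enumerated, which is why an estimate from a sample is needed at all; a proposal that favours trees close to S would reduce the variance relative to the uniform one used here.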
Distribution of gene trees: probabilistic model
[Figure: hierarchical model with the data sets D1, ..., Dn, the nodes Q1, ..., Qn, the gene trees G1, ..., Gn and the species tree S, together with per-family parameters λdup1, ..., λdupn, λloss1, ..., λlossn, λspr1, ..., λsprn, ... and their priors λdupprior, λlossprior, λsprprior, ...; successive builds mark the input and the output]
Importance sampling
So we can use complex, state-of-the-art software for phylogenetic inference
We should not rely on single estimates of gene phylogenies
E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome research 23: 323-330.
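One way to picture the use of a sample of gene trees instead of a single estimate: given a candidate species tree, each sampled gene tree can be reweighted by how well it agrees with that species tree. The sketch below is only an illustration under loud assumptions: the exponential penalty, the per-distance parameters in `lambdas`, and the function name `gene_tree_weights` are hypothetical and not taken from the talk.

```python
import math


def gene_tree_weights(distance_rows, lambdas):
    """Relative weight of each sampled gene tree under one candidate species tree.

    distance_rows[i][k] is distance k (e.g. duplications, losses, SPR) between
    sampled gene tree i and the species tree; lambdas[k] is its assumed penalty.
    """
    scores = [-sum(lam * d for lam, d in zip(lambdas, row)) for row in distance_rows]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(score - top) for score in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two sampled gene trees for one family: the first agrees with the species tree,
# the second needs 1 duplication, 3 losses and 1 SPR move.
weights = gene_tree_weights([[0, 0, 0], [1, 3, 1]], lambdas=[1.0, 0.5, 2.0])
print(weights)
```

Sampled trees that need fewer events get more weight, so no single gene tree estimate has to be trusted outright.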
Example: distances between gene families
● 567 single-copy gene trees for 23 species
Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327-331
● Analysis under a model where only RF, Hdist and dSPR are considered
● Not interested in the data set per se (unreliable)
● Use it just as a didactic tool to show how the model works
[Figure: panels RF, Hdist and SPR, showing the posterior samples and the best estimate]
Analysis of simulated data sets
Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
We use gene trees only, and simulate tree inference error
● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada
● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution
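For readers unfamiliar with the coalescent part of such simulations, the sketch below simulates how many gene lineages survive to the top of a single species-tree branch; more than one surviving lineage is what allows deep coalescence. It uses the standard Kingman coalescent with time in coalescent units and is not the simulator used for the talk. The function name and arguments are assumptions.

```python
import random


def lineages_leaving_branch(n_lineages, branch_length, rng):
    """Simulate the coalescent inside one branch; return the lineages left at its top.

    With k lineages, the waiting time to the next coalescence is exponential
    with rate k * (k - 1) / 2 (time measured in coalescent units).
    """
    k, remaining = n_lineages, branch_length
    while k > 1:
        wait = rng.expovariate(k * (k - 1) / 2)
        if wait > remaining:
            break  # the branch ends before the next coalescence
        remaining -= wait
        k -= 1
    return k


rng = random.Random(2013)
# Short branches tend to leave several lineages uncoalesced; long ones leave one.
print([lineages_leaving_branch(4, 0.2, rng) for _ in range(10)])
print([lineages_leaving_branch(4, 20.0, rng) for _ in range(10)])
```

A full simulator would repeat this branch by branch up the tree and add the birth and death of loci and sequence evolution on top.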
Analysis of simulated data sets – results
Single copy genes from Drosophila (TreeFam)
● 4591 informative, single-copy gene families
● (TreeFam database has 14250 informative gene families)
Estimated species tree: [figure]
● Root location uncertain
● Only one unrooted topology
Large gene families from Drosophila (TreeFam)
● 43 gene families with 102 to 295 tips
Best species tree: [figure, labelled ~100%]
To recap, our model can
● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative than others, or there may be no signal at all
● Estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
● Be fast ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
The larger, the better, especially for rooting the species tree
Do not assume gene trees are known: embrace ignorance!
Different gene families may be the product of distinct processes
It's parallelized, and all distances can be calculated very fast.
Thank you!
Check out http://darwin.uvigo.es for announcements, code, slides...