The multispecies coalescent: implications for inferring species trees James Degnan 21 February 2008

Outline

1. Background
   -- gene trees vs. species trees
   -- coalescence and incomplete lineage sorting

2. Inferring species trees
   -- Concatenation
   -- Consensus Trees

3. Conclusions

Population Genetics and Phylogenetics

Population genetics: traditionally used to analyze single populations.

Phylogenetics: What is the best way to infer relationships between populations/species?

Graphic by Mark A. Klinger, Carnegie Museum of Natural History, Pittsburgh

Desirable properties of species tree estimators

1. Statistical consistency (sample size = number of genes)

2. Efficiency

3. Robustness to violations of assumptions

Bridging the popgen/phylo divide

“Closer integration of population-genetic factors in phylogenetics, including further insights into gene-tree/species tree, and horizontal gene transfer.” --from Mike Steel’s website, My pick for five directions in phylogenetics that will grow in the next five years (2006).

“Incorporation of explicit models of lineage sorting will be needed for continued development of phylogenetic inference near the species level.” --Maddison and Knowles (2006).

The coalescent process

[Figure: time axis from Past to Present]

One population

Multiple populations/species

[Figure: time axis from Present to Past]

Gene tree in a species tree

Model species tree with gene tree

[Figure: species tree with tips A, B, C, D and an embedded gene tree]

The gene tree is a random variable. The gene tree distribution is parameterized by the species tree topology and internal branch lengths.

How can we compute probabilities of gene trees given species trees?

-General case solved by Degnan and Salter (2005) and implemented in the program COAL, which also allows $n_i \ge 0$ individuals to be sampled in species $i$.

-Under a coalescent model, probabilities for gene trees with three species were derived by Nei (1987): the gene tree matches the species tree with probability $1 - \frac{2}{3}e^{-T}$, where $T$ is the internal branch length in coalescent units.

-Probabilities for the gene tree to match the species tree topology for 4 and 5 species were given by Pamilo and Nei (1988).

-All 30 species tree/gene tree combinations for 4 species were given by Rosenberg (2002).
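The three-species result can be evaluated directly. The sketch below is a minimal illustration, assuming the species tree ((A,B),C) and an internal branch length given in coalescent units; the function name and the taxon labels are hypothetical and do not come from COAL. The matching topology has probability 1 - (2/3)e^(-T), and the remaining probability is split equally between the two discordant topologies.

```python
import math


def three_taxon_gene_tree_probs(branch_length):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    branch_length is the internal branch length in coalescent units.
    """
    if branch_length < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that A and B fail to coalesce in the internal branch.
    no_coalescence = math.exp(-branch_length)
    return {
        "((A,B),C)": 1.0 - (2.0 / 3.0) * no_coalescence,
        "((A,C),B)": no_coalescence / 3.0,
        "((B,C),A)": no_coalescence / 3.0,
    }


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0):
        probs = three_taxon_gene_tree_probs(t)
        print(t, {name: round(p, 4) for name, p in probs.items()})
```

With a branch length of zero all three topologies are equally likely, and the matching topology dominates as the branch grows longer.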

Definition: a coalescent history is a list of the populations in which each coalescent event occurs.

[Figure: species tree with tips A, B, C, D and an embedded gene tree]

This coalescent history: (1,3,3)

Other coalescent histories: (2,3,3), (3,3,3)

Gene tree probabilities

$$\Pr[G \mid S] = \sum_{\text{histories}} \Pr[G, \text{history} \mid S]$$

Gene tree probabilities

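$$\Pr[G = g \mid S] = \sum_{\text{histories}} \Pr[G = g, \text{history} \mid S] = \sum_{\text{histories}} \prod_{b} w_b \, p_{u(b),v(b)}(T_b)$$

-sum over histories: combinatorial enumeration, complexity only known in special cases

-product over $b$: internal branches of $S$

-$w_b$: probability coalescences are consistent with $g$

-$p_{u(b),v(b)}(T_b)$: probability that $u$ lineages coalesce into $v$

-$T_b$: branch length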
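As a rough illustration of the sum over coalescent histories, the sketch below evaluates the probability that u lineages coalesce into v within a branch, using the standard closed-form expression for that quantity from coalescent theory, and then sums the two histories that produce the matching gene tree for three species. The function names are hypothetical, the species tree ((A,B),C) is an assumption made for the example, and the handling of the root population (a weight of 1/3 when three lineages enter it) is spelled out in the comments rather than taken from the slide.

```python
import math


def p_uv(u, v, branch_length):
    """Probability that u lineages coalesce into v lineages within a branch
    of the given length (coalescent units)."""
    if not 1 <= v <= u:
        return 0.0
    total = 0.0
    for k in range(v, u + 1):
        prod = 1.0
        for y in range(k):
            prod *= (v + y) * (u - y) / (u + y)
        coeff = (2 * k - 1) * (-1) ** (k - v) / (
            math.factorial(v) * math.factorial(k - v) * (v + k - 1)
        )
        total += math.exp(-k * (k - 1) * branch_length / 2.0) * coeff * prod
    return total


def prob_matching_three_taxa(branch_length):
    """Sum over the two coalescent histories giving gene tree ((A,B),C)."""
    # History 1: A and B coalesce in the internal branch (2 -> 1 lineages).
    history_1 = p_uv(2, 1, branch_length)
    # History 2: no coalescence in the internal branch (2 -> 2 lineages);
    # in the root, A and B must be the first of 3 equally likely pairs.
    history_2 = p_uv(2, 2, branch_length) * (1.0 / 3.0)
    return history_1 + history_2


if __name__ == "__main__":
    t = 0.7
    print(prob_matching_three_taxa(t), 1 - (2 / 3) * math.exp(-t))
```

The two printed values agree, which recovers the three-species formula of Nei (1987) as a special case of the sum over histories.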

Data from Ebersberger et al. 2007. Mol. Biol. Evol. 24:2266-2276.

Theoretical distribution based on parameters from Rannala and Yang, 2003. Genetics 164:1645-1656.

[Figure labels: t/N = 1.2 and 4.2]

Definition: a gene tree which is more probable than the gene tree matching the species tree is called an anomalous gene tree (Degnan and Rosenberg, 2006).

Theorem 1. For the asymmetric species tree topology with four species and for any species tree topology with more than four species, there exist branch lengths such that at least one gene tree is anomalous (Degnan and Rosenberg, 2006).
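The four-species asymmetric case can be checked numerically. The sketch below is an illustration under stated assumptions: the species tree is (((A,B),C),D), `y` is taken to be the length of the branch ancestral to A and B only, and `x` the length of the branch ancestral to A, B and C (this labeling is an assumption, since the extracted slides do not spell it out). The probabilities are obtained by summing over coalescent histories by hand; the function name is hypothetical.

```python
import math


def four_taxon_probs(x, y):
    """Return (P[matching gene tree], P[gene tree ((A,B),(C,D))]) for the
    species tree (((A,B),C),D), branch lengths in coalescent units.

    Assumed labeling: y is the branch above (A,B), x the branch above ((A,B),C).
    """
    # Probabilities that 2 or 3 lineages coalesce into fewer within a branch.
    g21_y, g22_y = 1 - math.exp(-y), math.exp(-y)
    g21_x, g22_x = 1 - math.exp(-x), math.exp(-x)
    g31_x = 1 - 1.5 * math.exp(-x) + 0.5 * math.exp(-3 * x)
    g32_x = 1.5 * (math.exp(-x) - math.exp(-3 * x))
    g33_x = math.exp(-3 * x)

    # Matching topology (((A,B),C),D): five coalescent histories.
    match = g21_y * (g21_x + g22_x / 3) + g22_y * (
        g31_x / 3 + g32_x / 9 + g33_x / 18
    )
    # Symmetric topology ((A,B),(C,D)): C must reach the root uncoalesced.
    symmetric = g21_y * g22_x / 3 + g22_y * (g32_x / 9 + g33_x / 9)
    return match, symmetric


def is_anomalous(x, y):
    """True if ((A,B),(C,D)) is more probable than the matching gene tree."""
    match, symmetric = four_taxon_probs(x, y)
    return symmetric > match


if __name__ == "__main__":
    for x, y in [(0.1, 0.1), (0.05, 1.0), (1.0, 1.0)]:
        print(x, y, four_taxon_probs(x, y), is_anomalous(x, y))
```

Under this labeling, short branches such as x = y = 0.1 make the symmetric gene tree more probable than the matching one, while long branches do not.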

Is species tree inference consistent in this setting?

1. Concatenation?

2. Consensus?

Species tree inference: concatenation

Species trees are often estimated by concatenating several gene sequences and analyzing them as a single sequence (data from Chen and Li, 2001).

         Gene 1            Gene 2              Gene 3
Human    CTTGAATAATTTTTAC  TAGAGTTTCCTTGTGGTG  CGGTTT
Chimp    CTTCAATAATTTTTAC  TAGAGTTTCCTTGTGGTA  TGGTTT
Gorilla  TTTGAATAATTTTTAC  TAGAGTTTCCTTGTGGTA  TGGTTT
Orang    CTTGAATAATTTTTAT  TAGAGTTTCCTTGTGGTC  CRGTTT
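Concatenation itself is a simple operation. The sketch below is a minimal illustration using the sequences from this slide, assuming the rows of Gene 2 and Gene 3 follow the same taxon order as Gene 1 (Human, Chimp, Gorilla, Orang); the function name and the data layout are hypothetical.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments (dicts of taxon -> sequence) into a single
    sequence per taxon, in the order the genes are listed."""
    for gene in alignments:
        lengths = {len(gene[taxon]) for taxon in taxa}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
    return {taxon: "".join(gene[taxon] for gene in alignments) for taxon in taxa}


TAXA = ["Human", "Chimp", "Gorilla", "Orang"]

# Row order for genes 2 and 3 is assumed to match gene 1.
GENES = [
    dict(zip(TAXA, ["CTTGAATAATTTTTAC", "CTTCAATAATTTTTAC",
                    "TTTGAATAATTTTTAC", "CTTGAATAATTTTTAT"])),
    dict(zip(TAXA, ["TAGAGTTTCCTTGTGGTG", "TAGAGTTTCCTTGTGGTA",
                    "TAGAGTTTCCTTGTGGTA", "CAGAGTTTCCTTGTGGTC"])),
    dict(zip(TAXA, ["CGGTTT", "TGGTTT", "TGGTTT", "CRGTTT"])),
]

if __name__ == "__main__":
    for taxon, sequence in concatenate(GENES, TAXA).items():
        print(f"{taxon:8s}{sequence}")
```

The concatenated alignment is then analyzed as if it were one locus, which discards the information that each gene may have its own tree.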

Concatenation and gene tree discordance

How does concatenation perform when sequences are generated from different topologies?

Species tree: y = 1.0, x = 0.05

[Figure: species tree with branch lengths x and y, and simulated gene trees]

Sequences from the simulated gene trees:

CGGTTTTGGTTATGGTTATAGTTA

CGATTATGATTATAATTTTGAATT

TGCTATTGCTATTGCTATCCCTAT

Concatenated sequence:

CGGTTTTGGTTATGGTTATAGTTA

CGATTATGATTATAATTTTGAATT

TGCTATTGCTATTGCTATCCCTAT

Trees inferred from concatenated sequences (Kubatko and Degnan, 2007)

[Figure: y = 1.0, x = 0.05; x-axis: Number of genes]

Is species tree inference consistent in this setting?

1. Concatenation? No.

2. Consensus?

Consensus (majority-rule)

Types of consensus trees

Greedy: sort clades by their proportions. Accept the most frequently observed clades one at a time, provided each is compatible with the clades already accepted. Continue until the tree is fully resolved.

Majority rule: the consensus tree has all clades that were observed in > 50% of trees.

R*: for each set of 3 taxa, find the most commonly occurring rooted triple, e.g., (AB)C, (AC)B or (BC)A. Build the tree from the most commonly occurring triples.

(AB)D and (CD)B are two rooted triples.

Asymptotic consensus trees

Consensus trees are usually statistics, functions of data like x-bar.

Definition: an asymptotic consensus tree is the tree that is obtained by computing the consensus tree using topology probabilities from the multispecies coalescent model.

Motivation: if there are a large number of independent loci, observed gene tree, clade, and rooted triple proportions should approximate their theoretical probabilities.
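For three species the asymptotic majority-rule consensus tree can be worked out from the single clade probability given by Nei (1987). The sketch below assumes the species tree ((A,B),C) with internal branch length in coalescent units; the function names and the string labels for the trees are hypothetical. The clade {A,B} has probability 1 - (2/3)e^(-T), which exceeds 1/2 only when T > ln(4/3), so shorter branches leave the asymptotic majority-rule tree unresolved.

```python
import math

# Branch length below which clade {A,B} has probability at most 1/2.
UNRESOLVED_THRESHOLD = math.log(4.0 / 3.0)


def clade_ab_probability(branch_length):
    """Probability that a gene tree contains clade {A,B} under ((A,B),C)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-branch_length)


def asymptotic_majority_rule(branch_length):
    """Majority-rule consensus computed from theoretical probabilities."""
    if clade_ab_probability(branch_length) > 0.5:
        return "((A,B),C)"
    return "(A,B,C)"  # unresolved: no clade exceeds 50%


def asymptotic_most_common_triple(branch_length):
    """Most probable rooted triple; (AB)C wins for any positive branch length."""
    p_ab = clade_ab_probability(branch_length)
    p_other = math.exp(-branch_length) / 3.0
    return "((A,B),C)" if p_ab > p_other else "(A,B,C)"


if __name__ == "__main__":
    print("threshold:", UNRESOLVED_THRESHOLD)
    for t in (0.1, 0.2, 0.3, 1.0):
        print(t, asymptotic_majority_rule(t), asymptotic_most_common_triple(t))
```

The contrast between the two functions shows why a rule based on the most common triple can recover the species tree in a region where majority rule stays unresolved.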

[Figure: simulated gene trees and their greedy consensus tree]

Greedy consensus tree

[Figure: simulated gene trees with their greedy and R* consensus trees]

Majority-rule: unresolved zone

Too-greedy zone

Is species tree inference consistent in this setting?

1. Concatenation? No.

2. Consensus? Yes for R*; no for greedy and majority-rule.

Are consensus trees inconsistent estimators of species trees?

Theorem 2. (i) Majority-rule asymptotic consensus trees (MACTs) do not have any clades not on the species tree. (ii) Majority-rule unresolved zones exist for any species tree topology with n ≥ 3 species.

Theorem 3. Greedy asymptotic consensus trees (GACTs) can be misleading estimators of species trees for the 4-species asymmetric tree and for any species tree with n > 4 species.

Theorem 4. R* asymptotic consensus trees (RACTs) always match the species tree.

What about finite samples?

If you sample 10 loci, you could have:

All 10 match the species tree

9 match the species tree, 1 disagrees

8 match the species tree, 2 disagree, etc.

You can consider gene trees as categories and use multinomial probabilities for the probability of your sample

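$$\Pr[c(n_1, \ldots, n_k) = T] = \sum_{\text{samples}} \frac{n!}{n_1! \cdots n_k!} \, p_1^{n_1} \cdots p_k^{n_k} \, I\big(c(n_1, \ldots, n_k) = T\big)$$

Here $n_i$ is the number of sampled loci with gene tree topology $i$, $p_i$ is the probability of that topology, $n = n_1 + \cdots + n_k$, $c$ is the consensus tree computed from the counts, $T$ is a tree, and $I$ is the indicator function.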
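The multinomial calculation can be carried out exactly when there are only a few gene tree categories. The sketch below is an illustration for three species, assuming the species tree ((A,B),C), the three-species topology probabilities of Nei (1987), and a plurality rule in which the consensus equals the species tree only if the matching topology is strictly the most frequent in the sample (ties count as failures). The function name and the tie rule are assumptions made for the example.

```python
import math


def prob_consensus_matches(branch_length, n_loci):
    """Probability that the matching topology is strictly the most frequent
    among n_loci independent gene trees (three species, ((A,B),C))."""
    p_match = 1.0 - (2.0 / 3.0) * math.exp(-branch_length)
    p_other = math.exp(-branch_length) / 3.0
    total = 0.0
    for n1 in range(n_loci + 1):            # loci matching the species tree
        for n2 in range(n_loci - n1 + 1):   # loci with the second topology
            n3 = n_loci - n1 - n2           # loci with the third topology
            if n1 > n2 and n1 > n3:         # indicator: consensus is correct
                coefficient = math.factorial(n_loci) // (
                    math.factorial(n1) * math.factorial(n2) * math.factorial(n3)
                )
                total += coefficient * p_match ** n1 * p_other ** (n2 + n3)
    return total


if __name__ == "__main__":
    for n in (1, 5, 10, 50):
        print(n, prob_consensus_matches(0.5, n))
```

Plotting this probability against the number of loci gives a curve of the kind shown in the finite-sample figures, although those figures use four-species trees and R* consensus rather than this simplified three-species rule.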

R* consensus, y = 0.4, x = 0.6

Conclusion

Coalescent gene tree probabilities can be used to prove or disprove the statistical consistency of species tree estimators.

[Figure: R* consensus, y = x = 0.1; x-axis: Number of genes; y-axis: Probability]
