
Intraspecific phylogenetics and networks

David Posada

University of Vigo, Spain

EPSCoR Workshop on Phylogenetics

University of Hawaii

1-3 August 2005, O’ahu, Hawaii

David Posada – University of Vigo, Spain

2

Phylogenies and genealogies


Phylogeny

Phylogeny: hierarchical relationships among organisms

- Caused by speciation/extinction (organismal level)
- Caused by gene duplication (genetic level)
- Descent with modification is a hierarchy-producing process.

Relationships among species are hierarchical.

-> Phylogenetics


Tokogeny

Tokogeny: non-hierarchical relationships among organisms

- Caused by sexual reproduction (organismal level)
- Caused by recombination (genetic level)

Relationships among individuals within sexual species are non-hierarchical.

Relationships among recombining genes are non-hierarchical.

-> Classic population genetics

Genealogies (I)-(VII)

Genealogy

A genealogy is a representation of the history of a sample of genes, independent of the process of mutation.

In reality we can only estimate those branches marked by mutations, that is, haplotype trees.


Haplotype trees

We cannot infer that A1 is closer to A4 than to A2, for example.

Gene tree

Haplotype tree
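As a minimal sketch of why a haplotype tree carries less information than the gene tree, the snippet below collapses identical sequences into haplotypes. The sequences are made up for illustration; only the labels A1, A2 and A4 echo the slide. Gene copies that share a haplotype become indistinguishable, so their relative order of coalescence cannot be recovered.

```python
from collections import Counter

def collapse_to_haplotypes(sequences):
    """Group identical sequences and return {haplotype: number of copies}."""
    return Counter(sequences)

# Hypothetical sample of five gene copies (sequences invented for illustration).
sample = {"A1": "ACGT", "A2": "ACGT", "A3": "ACGA", "A4": "ACGT", "A5": "TCGA"}

haplotypes = collapse_to_haplotypes(sample.values())
# A1, A2 and A4 carry the same haplotype, so no mutation separates them.
same_as_a1 = sorted(name for name, seq in sample.items() if seq == sample["A1"])
```

Here `same_as_a1` lists the copies that the haplotype tree cannot tell apart from A1.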


Population trees: migration

Haplotype trees and population trees can have different histories due to migration or lineage sorting.


Population trees: lineage sorting

Haplotype trees and population trees can also have different histories due to lineage sorting.


Intraspecific data

Intraspecific data show some particularities [1]:

1. Low divergence.
2. Ancestral haplotypes are very likely to persist in the population.
3. Real polytomies or multifurcations.
4. Recombination generates homoplasy.


Reticulation


Traditional phylogenetics

Network methods can provide a useful alternative to standard phylogenetic methods (maximum likelihood, ML; maximum parsimony, MP; neighbor joining, NJ) for estimating intraspecific phylogenies.


Phylogenetic networks

Phylogenetic network methods are able to

display multifurcations and extant internal

nodes, and explicitly represent phylogenetic

conflict through reticulation.


Network methods (I)

Pyramids [2]
- Hierarchical clustering framework.
- Clades can overlap.
- Reticulations are allowed among tips.
- Implemented in PYRAMIDS.

Reticulograms [3]
- Adds reticulations to an existing binary tree.
- Fit of network to the data (least squares).
- Implemented in T-REX.

Statistical geometry [3,4]
- Average quartet geometry.
- Implemented in STATGEOM and GEOMETRY.

Split decomposition [5]
- Weighted consensus of non-worst quartet splits.
- Implemented in SPLITSTREE, JSPLITS and SPECTRONET.


Network methods (II)

NeighborNet [6]
- Hybrid between split decomposition and NJ.
- Agglomerates pairs of pairs that share a node into a circular split system.
- Implemented in SPLITSTREE and JSPLITS.

Median networks [7]
- Constructed by adding median vectors (consensus of triplets).
- Can be reduced or pruned afterwards using frequency information (reduced median networks, RM).
- Not implemented.

Median-joining networks [8,9]
- Refinement of RM for multistate characters and large data sets.
- Median vectors for optimal triplets are added to a minimum spanning network (MSN).
- Implemented in NETWORK.
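To make "median vectors (consensus of triplets)" concrete, here is a small sketch for binary characters: the median of three haplotypes takes, at each site, the state carried by the majority of the three. The function name and the example haplotypes are illustrative only, and the sketch covers this single step, not the full median network or median-joining procedure.

```python
def median_vector(a, b, c):
    """Per-site majority consensus of three equal-length binary haplotypes."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("haplotypes must have equal length")
    # With two states and three sequences, a majority state always exists.
    return "".join(max(site, key=site.count) for site in zip(a, b, c))

# Hypothetical triplet of binary haplotypes.
triplet = ("0011", "0101", "1001")
median = median_vector(*triplet)  # "0001"
```

In this example the median differs from all three inputs, so it would enter the network as an inferred (unsampled) node.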


Network methods (III)

Statistical parsimony [10]
- Local parsimony connections are made until a network or a set of networks is constructed.
- Implemented in TCS.

Molecular variance parsimony [11]
- The optimal network is the minimum spanning tree (MST) or MSN that minimizes a set of population statistics.
- These statistics are based on haplotype frequency, distance, and geographic distribution.
- Implemented in ARLEQUIN.

Netting [12]
- Closest haplotypes are successively joined.
- In case of homoplasy a new dimension is added to the network.
- Not implemented.


Network methods (IV)

Likelihood [13]
- For directed graphs.
- Needs a good heuristic search.
- Likelihood implemented in PAL, but search not implemented.


Comparison of network methods

| Method | Category | Software | Speed | Input data | Model of evolution | Statistical assessment |
|---|---|---|---|---|---|---|
| Pyramids | Distance | Pyramids | Fast | Distances | Yes | No |
| Statistical geometry | Distance, invariants | Geometry, Statgeom | Fast | Multistate | Yes | Yes |
| Split decomposition | Distance, parsimony | SplitsTree | Fast | Multistate | Yes | Yes |
| Median networks | Distance | None | Slow | Binary | No | No |
| Median-joining networks | Distance | Network | Very fast | Multistate | No | No |
| Statistical parsimony | Distance | TCS | Fast | Multistate | No | Yes |
| NeighborNet | Distance | SplitsTree | Very fast | Multistate | Yes | Yes |
| Molecular variance parsimony | Distance | Arlequin | Fast | Multistate | Yes | Yes |
| Netting | Distance | None | Slow | Multistate | No | No |
| Likelihood network | Likelihood | PAL | Slow | Multistate | Yes | Yes |
| Reticulate phylogeny | Least squares | None | Slow | Distances* | Yes | Yes |
| Reticulogram | Least squares | T-rex | Fast | Distances | No | Yes |

* Distances estimated from gene frequency data.

Network methods can give different results

Network performance

A few empirical studies compare the

networks inferred by different methods.

The absolute or relative performance of

network methods has never been evaluated

through computer simulations.


Statistical parsimony (I)

Calculate the pairwise haplotype distance matrix.

Estimate the parsimony connection limit, defined as the maximum number of differences between two haplotypes that ensures, with a probability >= 0.95, that no superimposed mutations have occurred.

Do not consider connections above this limit.

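$$\hat{P}_j = \prod_{i=1}^{j} (1 - \hat{q}_i)$$

$$\hat{q}_i = \frac{\int_0^1 q_1 \, L(j,m) \, dq_1}{\int_0^1 L(j,m) \, dq_1}$$

$$L(j,m) = (2q_1)^{j-1} (1 - q_1)^{2m+1} \left[ 1 - \frac{q_1}{br} \right] \times \left[ 2 - \frac{q_1 (br+1)}{br} \right]^{j-1} \left\{ 1 - 2q_1 \left[ 1 - \frac{q_1}{br} \right] \right\}$$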
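The connection limit can be explored numerically. The sketch below evaluates the expression for L(j, m) given above with a simple trapezoid rule and multiplies the factors (1 - q_i) to obtain P_j. Several things here are assumptions rather than statements from the slide: `m` and `br` are treated as plain numeric inputs because the slide does not define them, the values 300 and 3.0 are arbitrary, and each q_i is computed from L(i, m), which is one reading of the notation. It is not the TCS implementation.

```python
def likelihood(q, j, m, br):
    """L(j, m) evaluated at q = q_1, following the expression above."""
    inner = 1.0 - q / br
    return ((2.0 * q) ** (j - 1)
            * (1.0 - q) ** (2 * m + 1)
            * inner
            * (2.0 - q * (br + 1.0) / br) ** (j - 1)
            * (1.0 - 2.0 * q * inner))

def integrate(f, steps=20000):
    """Trapezoid rule on the interval [0, 1]."""
    h = 1.0 / steps
    total = 0.5 * (f(0.0) + f(1.0))
    for k in range(1, steps):
        total += f(k * h)
    return total * h

def q_hat(i, m, br):
    """Ratio of integrals defining q_i (assumption: uses L(i, m))."""
    num = integrate(lambda q: q * likelihood(q, i, m, br))
    den = integrate(lambda q: likelihood(q, i, m, br))
    return num / den

def parsimony_probabilities(max_j, m, br):
    """Return [P_1, ..., P_max_j] with P_j = prod over i <= j of (1 - q_i)."""
    probs, p = [], 1.0
    for i in range(1, max_j + 1):
        p *= 1.0 - q_hat(i, m, br)
        probs.append(p)
    return probs

def connection_limit(max_j, m, br, threshold=0.95):
    """Largest j (up to max_j) whose P_j is still >= threshold; 0 if none."""
    limit = 0
    for j, p in enumerate(parsimony_probabilities(max_j, m, br), start=1):
        if p < threshold:
            break
        limit = j
    return limit

# Hypothetical inputs, chosen only to exercise the functions.
probs = parsimony_probabilities(5, m=300, br=3.0)
```

Because every factor (1 - q_i) is below one, P_j shrinks as j grows, which is why a finite connection limit exists.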


Statistical parsimony (II)

Find, for each haplotype, its minimum connection(s) to other haplotypes.

Make the minimum connections, and add missing haplotypes (represented as zeroes or small circles). Never make a connection that implies that a network distance is smaller than the observed distance.
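The first steps (distance matrix, then minimum connections under the limit) can be sketched as follows. This is an illustrative simplification: distances are plain counts of differing sites, the haplotypes and the limit of 2 are invented, and the sketch does not insert missing intermediate haplotypes or resolve ambiguous connections.

```python
from itertools import combinations

def hamming(a, b):
    """Number of sites at which two aligned haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(haps):
    """Symmetric nested dict of pairwise distances between named haplotypes."""
    names = sorted(haps)
    dist = {n: {n: 0} for n in names}
    for u, v in combinations(names, 2):
        dist[u][v] = dist[v][u] = hamming(haps[u], haps[v])
    return dist

def minimum_connections(dist, limit):
    """For each haplotype, its closest haplotypes at a distance <= limit."""
    result = {}
    for u, row in dist.items():
        within = {v: d for v, d in row.items() if v != u and d <= limit}
        if not within:
            result[u] = []  # stays unconnected under this limit
            continue
        best = min(within.values())
        result[u] = sorted(v for v, d in within.items() if d == best)
    return result

# Hypothetical haplotypes and connection limit.
haps = {"H1": "AAAA", "H2": "AAAT", "H3": "AATT", "H4": "CCCC"}
connections = minimum_connections(distance_matrix(haps), limit=2)
```

In this toy data set H4 is four steps from every other haplotype, so it exceeds the limit and would form a separate network.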


Statistical parsimony (III)

If there are several options to make that connection, do it in a way that agrees as much as possible with the distances in the observed distance matrix. So we minimize

$$W = \sum_{i=0}^{h-1} \sum_{j=i+1}^{h} (N_{ij} - D_{ij}) \, P_{D_{ij}}$$

h = number of haplotypes
N_ij = minimum network distance between haplotypes i and j
D_ij = observed distance between haplotypes i and j
P_Dij = probability of no superimposed mutations between two haplotypes differing by D_ij steps

We will select the connecting alternative with the smallest W.
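A small worked example of the W criterion, under stated assumptions: the three haplotypes, the two candidate networks and the probabilities in `p` are all invented for illustration, and the helper names are hypothetical.

```python
def symmetric(pairs):
    """Build a symmetric nested dict from {(u, v): distance}."""
    out = {}
    for (u, v), d in pairs.items():
        out.setdefault(u, {})[v] = d
        out.setdefault(v, {})[u] = d
    return out

def disagreement(network_dist, observed_dist, p_no_superimposed):
    """W: sum over haplotype pairs of (N_ij - D_ij) * P_Dij."""
    names = sorted(observed_dist)
    w = 0.0
    for a, u in enumerate(names):
        for v in names[a + 1:]:
            d = observed_dist[u][v]
            w += (network_dist[u][v] - d) * p_no_superimposed[d]
    return w

# Observed distances between three hypothetical haplotypes.
observed = symmetric({("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 2})
# Two candidate networks, described by their path lengths.
chain = symmetric({("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 2})
detour = symmetric({("A", "B"): 1, ("A", "C"): 2, ("B", "C"): 3})
# Hypothetical probabilities of no superimposed mutations, indexed by distance.
p = {1: 0.99, 2: 0.98}

candidates = {"chain": chain, "detour": detour}
best = min(candidates, key=lambda name: disagreement(candidates[name], observed, p))
```

The chain reproduces every observed distance, so its W is zero and it is preferred over the detour, which stretches the B to C path from one step to three.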


Coalescent theory results

Some results from neutral coalescent theory related to the frequency and distribution of haplotypes are relevant to constructing and interpreting intraspecific phylogenies.

There is a direct relationship between the frequency and the age of a haplotype [14,15]. Haplotypes that have persisted for a long time in the population will tend to show higher frequencies than more recent haplotypes, and new haplotypes will arise from old ones [16]. Also, young haplotypes will tend to stay in the population where they first appeared [17].

We can establish several explicit predictions:


Neutral coalescent theory predictions

1. Older haplotypes tend to have higher frequency.
2. Older haplotypes tend to have wider geographic distribution.
3. Older haplotypes tend to be interior in the network.
4. Older haplotypes tend to be more connected in the network.
5. Singletons tend to connect to non-singletons.
6. Singletons tend to connect to haplotypes in the same population.


Rooting

Rooting networks is difficult.

Assuming neutrality, we can assign outgroup weights to haplotypes in the sample [18] using their absolute frequency (f_i) and the sum of k neighbor frequencies (v_j):


Solving loops (I)

Neutral coalescent predictions can be used to resolve ambiguities or loops.


Solving loops (II)

We can define some objective functions in the statistical parsimony framework:

- Minimize network disagreement

$$W = \sum_{i=0}^{h-1} \sum_{j=i+1}^{h} (N_{ij} - D_{ij}) \, P_{D_{ij}}$$

h = number of haplotypes
N_ij = minimum network distance between haplotypes i and j
D_ij = observed distance between haplotypes i and j
P_Dij = probability of no superimposed mutations between two haplotypes differing by D_ij steps

- Maximize topological/frequency criterion

$$T = \sum_{i=0}^{h} \left[ t \, (1 - p_i) + (1 - t) \, p_i \right], \qquad t = \begin{cases} 1 & \text{if haplotype } i \text{ is a tip} \\ 0 & \text{if haplotype } i \text{ is interior} \end{cases}$$
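The topological/frequency criterion can be sketched in a few lines. The slide does not define p_i; the code below assumes it is the relative frequency of haplotype i, and the frequencies and the two candidate loop resolutions are invented for illustration.

```python
def topology_frequency_score(freqs, tips):
    """T: tips contribute (1 - p_i), interior haplotypes contribute p_i."""
    return sum((1.0 - p) if name in tips else p for name, p in freqs.items())

# Hypothetical relative frequencies (assumed meaning of p_i).
freqs = {"A": 0.6, "B": 0.3, "C": 0.1}

# Two hypothetical ways of breaking a loop, described by which haplotypes end up as tips.
resolutions = {
    "A_interior": {"B", "C"},
    "C_interior": {"A", "B"},
}
scores = {name: topology_frequency_score(freqs, tips) for name, tips in resolutions.items()}
best = max(scores, key=scores.get)
```

The resolution that keeps the common haplotype A interior and the rare ones at the tips scores higher, in line with the coalescent predictions that older, more frequent haplotypes tend to be interior.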

Solving loops (III)

Solving loops (IV)


Network applications

Detect recombination [19,20]

Species delimitation [21,22]

Speciation modes [23]

Population history (nested clade analysis, NCA) [24]

Genotype/phenotype association [25]

Higher-level phylogenetics [26]


Take home

- Network methods can be very appropriate for the representation of intraspecific phylogenies.
- Their current performance is unknown. At least in the absence of recombination they should work well.
- Several programs exist that implement these methods.


Phylogenetic network software (I)

STATGEOM [27] implements statistical geometry. Data: DNA, RNA, protein, binary data. Distribution: C code. URL: http://www.zbit.uni-tuebingen.de/pas/kay_en.htm

SPLITSTREE [28] implements split decomposition. Data: DNA, RNA. Executables: Windows, Unix, Mac. URL: http://www.mathematik.uni-bielefeld.de/~huson/phylogenetics/splitstree.html

JSPLITS is a new version of the SplitsTree program written in Java. It can run on any platform that provides a Java runtime environment. URL: http://www-ab.informatik.uni-tuebingen.de/software/jsplits/welcome_en.html

SPECTRONET implements median networks and other tools. Data: DNA. Executables: Windows. URL: http://awcmee.massey.ac.nz/spectronet/


Phylogenetic network software (II)

NETWORK [29] implements median-joining networks. Data: DNA, RFLPs. Executables: Windows, DOS. URL: http://www.fluxus-engineering.com/sharenet.htm

ARLEQUIN [30] implements molecular variance parsimony. Data: DNA, RNA, microsatellites. Executables: Mac, Windows and Unix. URL: http://lgb.unige.ch/arlequin/

TCS [31] implements statistical parsimony. Data: DNA, RNA, distances. Executables: Mac and Windows. URL: http://inbio.byu.edu/Faculty/kac/crandall_lab/tcs.htm and http://darwin.uvigo.es/

PAL [32] calculates the likelihood of a network. Data: network. Executables: Mac and Windows. URL: http://www.cebl.auckland.ac.nz/pal-project/


References

1. Posada, D. & Crandall, K.A. Intraspecific gene genealogies: trees grafting into networks. Trends in Ecology and Evolution 16, 37-45 (2001).
2. Diday, E. & Bertrand, P. An extension of hierarchical clustering: the pyramidal representation. in Pattern recognition in practice (eds. Gelsema, E.S. & Kanal, L.N.) 411-424 (North-Holland, Amsterdam, 1986).
3. Eigen, M., Winkler-Oswatitsch, R. & Dress, A. Statistical geometry in sequence space: a method of quantitative sequence analysis. Proceedings of the National Academy of Sciences, U.S.A. 85, 5917 (1988).
4. Nieselt-Struwe, K. Graphs in sequence spaces: a review of statistical geometry. Biophysical Chemistry 66, 111-131 (1997).
5. Bandelt, H.-J. & Dress, A.W.M. Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Molecular Phylogenetics and Evolution 1, 242-252 (1992).
6. Bryant, D. & Moulton, V. NeighborNet: an agglomerative method for the construction of planar phylogenetic networks. in Workshop in Algorithms for Bioinformatics (2002).
7. Bandelt, H.-J., Macaulay, V. & Richards, M. Median networks: Speedy construction and greedy reduction, one simulation, and two case studies from human mtDNA. Molecular Phylogenetics and Evolution 16, 8-28 (2000).
8. Bandelt, H.-J., Forster, P. & Röhl, A. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16, 37 (1999).
9. Foulds, L.R., Hendy, M.D. & Penny, D. A graph theoretic approach to the development of minimal phylogenetic trees. Journal of Molecular Evolution 13, 127-149 (1979).
10. Templeton, A.R., Crandall, K.A. & Sing, C.F. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III. Cladogram estimation. Genetics 132, 619-633 (1992).
11. Excoffier, L. & Smouse, P.E. Using allele frequencies and geographic subdivision to reconstruct gene trees within a species: molecular variance parsimony. Genetics 136, 343-359 (1994).
12. Fitch, W.M. Networks and viral evolution. Journal of Molecular Evolution 44, S65-S75 (1997).
13. Strimmer, K. & Moulton, V. Likelihood analysis of phylogenetic networks using directed graphical models. Molecular Biology and Evolution 17, 875-881 (2000).
14. Watterson, G.A. & Guess, H.A. Is the most frequent allele the oldest? Theoretical Population Biology 11, 141-160 (1977).
15. Donnelly, P. & Tavaré, S. The ages of alleles and a coalescent. Advances in Applied Probability 18, 1-19 (1986).
16. Excoffier, L. & Langaney, A. Origin and differentiation of human mitochondrial DNA. American Journal of Human Genetics 44, 73-85 (1989).
17. Watterson, G.A. The genetic divergence of two populations. Theoretical Population Biology 27, 298-317 (1985).
18. Castelloe, J. & Templeton, A.R. Root probabilities for intraspecific gene trees under neutral coalescent theory. Molecular Phylogenetics and Evolution 3, 102-113 (1994).
19. Holmes, E.C., Urwin, R. & Maiden, M.C.J. The influence of recombination on the population structure and evolution of the human pathogen Neisseria meningitidis. Molecular Biology and Evolution 16, 741-749 (1999).
20. Templeton, A.R. et al. Recombinational and mutational hotspots within the human lipoprotein lipase gene. American Journal of Human Genetics 66, 69-83 (2000).
21. Templeton, A.R. Using phylogeographic analyses of gene trees to test species status and processes. Molecular Ecology 10, 779-91 (2001).
22. Shaw, K.L. A nested analysis of song groups and species boundaries in the Hawaiian cricket genus Laupala. Molecular Phylogenetics and Evolution 11, 332-341 (1999).
23. Barraclough, T.G. & Vogler, A.P. Detecting the geographical pattern of speciation from species-level phylogenies. The American Naturalist 155, 419-434 (2000).
24. Gómez-Zurita, J., Petitpierre, E. & Juan, C. Nested cladistic analysis, phylogeography and speciation in the Timarcha goettingensis complex (Coleoptera, Chrysomelidae). Molecular Ecology 9, 557-560 (2000).
25. Sing, C.F., Haviland, M.B., Zerba, K.E. & Templeton, A.R. Application of cladistics to the analysis of genotype-phenotype relationships. European Journal of Epidemiology 8, 3-9 (1992).
26. Crandall, K.A. Intraspecific cladogram estimation: Accuracy at higher levels of divergence. Systematic Biology 43, 222-235 (1994).
27. Nieselt-Struwe, K. STATGEOM. 1.0 edn (Department of Physics, University of Auckland, Auckland, New Zealand, 2000).
28. Huson, D.H. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68-73 (1998).
29. Röhl, A. Network. A program package for phylogenetic networks. (Mathematisches Seminar, Universität Hamburg, Hamburg, Germany, 1997).
30. Schneider, S., Roessli, D. & Excoffier, L. Arlequin: A software for population genetic data analysis. 2.000 edn (Genetics and Biometry Lab, Dept. of Anthropology, University of Geneva, Geneva, 2000).
31. Clement, M., Posada, D. & Crandall, K.A. TCS: a computer program to estimate gene genealogies. Molecular Ecology 9, 1657-9 (2000).
32. Drummond, A. & Strimmer, K. PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17, 662-663 (2001).