Phylogenetic methods

Phylogenetic methods

Thanks to:

Applications of phylogenetic trees

• Evolution studies

• Systematic biology

• Medical research and epidemiology

• Ecology

• Others like grouping languages, medieval manuscripts …

Phylogenetics: basics

• With only 2 sequences, we can simply align them and look at their similarities or differences (how many, where they are in the molecule, and so on).

• More than 2 sequences, they are related to each other as a result of a branching evolutionary history, and they are located at the tips of a tree of species (except if HGT or sampling within species)

• Phylogenies (evolutionary trees) can be inferred from molecular sequences. They can be used to address a wide variety of questions about the course of evolution. They are needed not only to know the relationships of organisms, but to enable us to correctly interpret rates of evolution and the evolution of different parts of different molecules.

• In examining a phylogeny diagram, it is important to look at the branching pattern and not be misled by the order of species names on the page, or the left-right order of branching of lineages.

Terminology

Rooted tree Unrooted tree

Trees: rooted or unrooted?• Rooted trees indicate the direction of evolution, and the ancestor

(more useful)• Unrooted trees only show patterns of relationship (more common)• The root can be placed on a tree by two methods:

• The outgroup method. This amounts to knowing in advance where it is.• A molecular clock. If we assume that rates of change are about the same up all lineages, then the root will be a point which is equidistant from all the tips. It is not necessarily easy to see where this is.

• The molecular clock is the assumption that change occurs at a constant expected rate, so that all tips of the tree are equidistant from the root. This is better the more closely related the species are, but breaks down as their biology becomes more different. It is often a useful approximation.

Advantages of molecular phylogenetic analysis

• Analogous features (share common function, but NOT common ancestry) can be misleading

• DNA sequences more simple to model, we only have the four states A, C, G, T

• DNA samples for sequence analysis easy to prepare

Procedure: Steps of a molecular phylogenetic analysis

1. Decide what sequences to examine

2. Determine the evolutionary distances between the sequences and build distance matrix

3. Phylogenetic tree construction

1. Decide what to examine

• Choose homologous sequences in different species

• Homologous sequences must, by definition, be derived from a common ancestral sequence

• Homology is not similarity

2. Determine the evolutionary distances and build distance matrix

• For molecular data, evolutionary distances can be the observed number of nucleotide differences between the pairs of species.

• Distance matrix: simply a table showing the evolutionary distances between all pairs of sequences in the dataset

2. Determine the evolutionary distances and build

distance matrix - A simple example

1. AGGCCATGAATTAAGAATAA2. AGCCCATGGATAAAGAGTAA3. AGGACATGAATTAAGAATAA4. AAGCCAAGAATTACGAATAA

Distance Matrix

In this example the evolutionary distance is expressed as the number of nucleotide differences for each sequence pair. For example, sequences 1 and 2 are 20 nucleotides in length and have four differences, corresponding to an evolutionary difference of 4/20 = 0.2.

1 2 3 4

1 - 0.2 0.05 0.15

2 - 0.25 0.4

3 - 0.2

4 -

3. Phylogenetic Tree Construction example (UPGMA algorithm)

1. Pick smallest entry Dij

2. Join the two intersecting species and assign branch lengths Dij/2 to each of the nodes

DijBear Raccoon Weasel Seal

Bear - 0.26 0.34 0.29

Raccoon - 0.42 0.44

Weasel - 0.44

Seal -

Bear Raccoon

0.13 0.13

UPMGA (Michener & Sokal 1957)

3. Phylogenetic Tree Construction example (UPGMA algorithm)Dij

Bear Raccoon Weasel Seal

Bear - 0.26 0.34 0.29

Raccoon - 0.42 0.44

Weasel - 0.44

Seal -

3. Compute new distances to the other species using arithmetic means

365.02

44.029.0

2

38.02

42.034.0

2

)(

)(

SRSBBRS

WRWBBRW

DDD

DDD

Bear Raccoon

0.13 0.13


DijBR Weasel Seal

BR - 0.38 0.365

Weasel - 0.44

Seal -

1. Pick smallest entry Dij

2. Join the two intersecting species and assign branch lengths Dij/2 to each of the nodes

Bear Raccoon Seal

0.13

0.1825 0.1825


DijBR Weasel Seal

BR - 0.38 0.365

Weasel - 0.44

Seal -

3. Compute new distances to the other species using arithmetic means

4.03

44.042.034.0

3)(

WSWRWBBRSW

DDDD

Bear Raccoon Seal

0.13

0.1825 0.1825


DijBRS Weasel

BRS - 0.4

Weasel -

1. Pick smallest entry Dij.

2. Join the two intersecting species and assign branch lengths Dij/2 to each of the nodes.

3. Done!

Bear Raccoon Seal Weasel

0.13 0.1825

0.2 0.2

Weakness of UPGMA

• UPGMA assumes a constant molecular clock (i.e. accumulate mutations at the same rate)– All leaves in the same level

• Only constructs rooted trees

23

41 1 4 32

Correct tree UPGMA

Neighbor Joining (Saitou and Nei, 1987)

• Most widely used distance-based method • UPGMA showed that it is not enough to

just pick closest neighbors• Neighbor Joining:

– Principle: Join nodes that are close to each other and far from everything else

– Produces an unrooted tree– Multiple substitutions at the same site

(homoplasy) obscure true distance and make sequences artificially close to one another.

Methods

• When there is conflict among characters as to what phylogenies they support, we need some way of choosing among them a best estimate. Maximum parsimony chooses that phylogeny on which the characters can evolve with the fewest evolutionary events.

Long branch attraction• A phenomenon in phylogeny (esp. parsimony) when rapidly evolving

lineages are inferred to be closely related, regardless of their true evolutionary relationships.

• Esp. a problem with DNA sequence (rather than protein sequence or other more complex trait combinations), b/c there are only 4 nucleotides

• When DNA substitution rates are high, the probability that two lineages will convergently evolve the same nucleotide at the same site increases.

• Parsimony wrongly interprets this similarity as having evolved only once in a common ancestor of the two lineages

• Problem can be minimized by using methods that incorporate differential rates of substitution among lineages (e.g., maximum likelihood) or by breaking up long branches by adding taxa that are related to those with the long branches.

Other methods• Maximum likelihood. The probability of

the data is computed, given the tree and a probabilistic model of evolution. We choose that tree that gives the highest probability to the observed data, among all trees.

Maximum Likelihood

• Search through all trees to find the one with highest probability– Assumed to be “NP-complete“

• For a given tree, we can calculate its likelihood score– Markov model– Dynamic programming

• Very time-consuming

Which method to use?

• Distance based– fast

• Maximum Parsimony– Strong sequence similarity

• Maximum Likelihood– Very slow– Use only for small number of sequences

• Most packages (e.g. Phylip, MEGA) use software for all three methods.

Methods of evaluating trees

• Bootstrap: resample initial data set with one datum removed and replaced with another member

• Jackknife: resample initial distribution with one datum missing and not replaced

• MCMC: complex, but generates random numbers to produce a desired probability distribution with which to compare model

Monte Carlo method

Bayesian phylogenetics

• Newer method • Produces both a tree estimate and measures of uncertainty for the groups on the tree • Similar to ML. • Optimal hypothesis is the one that maximizes the posterior probability, = the likelihood multiplied by the PRIOR PROBABILITY of that hypothesis.• Prior probabilities of different hypotheses convey the scientist’s beliefs before having seen the data.

Rates and causes of molecular evolution

• Different parts of the genome are useful for different problems. Fast evolving sequences are useful for recent events, but become saturated and unrecognizable when comparing more distant relatives. Slow evolving sequences are useful around the base of the tree, but don’t have any variability at all among close relatives.

Different molecular regions, different rates

• DNA distant from genes evolves very quickly (at about one substitution per 108 years),

• Flanking regions upstream and downstream from a gene evolve less quickly than that,

• Introns evolve less quickly than those, though not much less,

• Third positions of codons evolve less quickly than introns,

• First and second positions of codons evolve less quickly than that,

Within a protein:– active sites evolve very slowly,– sites that bind heme, or interact with other proteins evolve a bit faster

but also very slowly,– interior sites evolve less quickly than exterior sites,– substitutions that involve less radical changes of the amino acid (i.e.

that change to a rather similar amino acid) happen more readily.

Of base changes, transitions (A -> G or C -> T) happen several times more readily than transversions (all other changes).

Between protein-coding loci, some (fibrinopeptide, for example) evolve rapidly, some less so (hemoglobins, cytochromes), and some (histones, for example)change very slowly.

Different molecular regions, different rates

Documents

Phylogenetic methods