A brief introduction to phylogenetics

Genetic Distance

Definition: The number of evolutionary events (usually nucleotide substitutions) that have occurred since two sequences diverged from a common ancestor

Simplest distance: p-distance = proportion of sites that are different
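
As a minimal sketch of the p-distance, the function below counts the proportion of aligned sites that differ. The function name and the choice to skip columns containing a gap character ('-') are assumptions made for illustration.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: '-' marks a gap, and any column with a gap is skipped.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


# One difference in eight sites.
print(p_distance("ATTGCGCC", "ATTGCGCT"))
```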

Correcting for ‘multiple substitutions’

[Figure: two example sequences (A T T G C G C C and A T T G C G C T) and a plot with axes labelled "Differences" and "Substitutions"]

Correcting for multiple substitutions

Requires a statistical ‘model’ of how the process of substitution works to correct for

- Differences in the rates of different substitution types (e.g. Jukes and Cantor – all substitutions are treated the same versus Kimura 2-parameter model – distinguishes between transitions and transversions)

- Different frequencies of different nucleotides (e.g. GC content – the HKY model adds nucleotide frequency parameters to the Kimura 2-parameter model)

- Different rates at different sites (often modelled using a distribution – e.g. Gamma distribution – see next)
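
The first of these corrections can be sketched with the standard closed-form distances for the Jukes and Cantor and Kimura 2-parameter models. The function names and the handling of gaps and ambiguity codes (skipped) are assumptions for illustration; P and Q denote the observed proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    transitions = transversions = sites = 0
    for a, b in zip(seq1, seq2):
        if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
            continue  # assumption: gaps and ambiguity codes are ignored
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(p):
    """Jukes-Cantor distance from the p-distance (valid for p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


print(jukes_cantor(0.3))  # larger than the uncorrected 0.3
```

When transitions make up exactly one third of the differences, the two formulas give the same distance, which is one way to see that Jukes and Cantor is a special case of the Kimura model.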

To apply a gamma correction for site-specific rates you need to know the shape parameter of the gamma distribution

Correcting for multiple substitutions (continued…)

Correction for multiple substitutions implies a model of evolution, but some models have many more parameters than others

- Models with few parameters are easy to fit, but may miss some important biology (e.g. there’s typically a big difference between rates of transition and transversion, and it would be dangerous not to model that). Simple models can underfit the data.

- Complex models (many parameters) may be difficult and much slower to estimate. There can also be a danger of over-fitting the data when more parameters are included in a model than are necessary.

(see later…)

Some general points:

- genetic distances can be far greater than 1

- smaller genetic distances are more reliable

- model choice has a bigger impact for distantly related sequences

- normally positions with gaps are ignored (complete deletion)

- IF you know the rate of evolution for a pair of sequences (and if the rate has remained more or less constant) you can estimate the date at which they diverged
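
The last point can be sketched as a one-line calculation, assuming a constant rate along both lineages. The function name and the example numbers are hypothetical; the rate is expressed in substitutions per site per year along a single lineage.

```python
def divergence_time(distance, rate_per_lineage):
    """Years since two sequences diverged, assuming a constant rate.

    distance: genetic distance (substitutions per site) between the pair
    rate_per_lineage: substitutions per site per year along one lineage
    """
    if rate_per_lineage <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate substitutions, so the pair diverges at twice the rate.
    return distance / (2 * rate_per_lineage)


# Hypothetical values: distance 0.02, rate 1e-9 per site per year.
print(divergence_time(0.02, 1e-9))
```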

Phylogenetic tree

Diagram consisting of branches and nodes

Branches indicate relationships between the ‘objects’

Internal branches define partitions of the objects

Rooting the Tree

• In an unrooted tree the direction of evolution is unknown

• The root is the hypothesized ancestor of the sequences in the tree

• The root can either be placed on a branch or at a node

• You should start by viewing an unrooted tree

• Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)

• This always involves assumptions… BEWARE!

Rooting Using an Outgroup

1. The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other

2. It should ideally be as closely related as possible to the rest of the sequences while still satisfying condition 1

The root must be somewhere between the outgroup and the rest (either on the node or in a branch)

Sometimes two trees may look very different but, in fact, differ only in the position of the root

Looking at trees

Two trees are different if one tree specifies at least one partition that is not present in the other
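
The partition test can be sketched in a few lines. Trees are written here as nested tuples of tip names, which is an assumption made for illustration, as are the function names. Each internal branch is stored as an unordered pair of tip sets, so the position of the root does not affect the comparison.

```python
def leaves(tree):
    """Set of tip names below a node in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def partitions(tree):
    """Non-trivial partitions of the tips defined by the internal branches."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            # Unordered pair of sides: the same branch gives the same split
            # whichever side of it the root happens to be on.
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def same_tree(tree1, tree2):
    return partitions(tree1) == partitions(tree2)


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = (("A", "B"), ("C", ("D", "E")))   # same tree, drawn with a different root
t3 = ((("A", "C"), "B"), ("D", "E"))   # groups A with C instead of B
print(same_tree(t1, t2), same_tree(t1, t3))
```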

[Figures: trees with tips labelled A to J, drawn with the tips in different orders; scale bar 0.01]

Phylogenetic Inference

Distance, parsimony and maximum likelihood methods

To infer a tree we need an optimality criterion + an algorithm to search for the best tree given that optimality criterion

Best tree vs. true tree

Types of optimality criteria used to infer phylogeny from sequence

• Distance methods

• Parsimony

• Likelihood

• Others

Distance based methods

Minimum Evolution Principle

“The tree with the smallest sum of branch lengths is the best tree”

[Figure: unrooted tree of four sequences; A and B attach by branches r and s, C and D by branches u and v, and the two pairs are joined by the internal branch t]

$d_{AB} \approx r + s$

$d_{CD} \approx u + v$

$d_{AD} \approx r + t + v$

$d_{BC} \approx s + t + u$

etc.

(r, s, u, v, t are estimated so that these relationships are as close as possible to being correct)

Tree length = u + v + t + r + s

Number of possible unrooted trees from n sequences:

$N_u = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$

e.g. for 20 sequences there are approximately $10^{20}$
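
The count of unrooted bifurcating trees, (2n-5)! / (2^(n-3) (n-3)!), can be checked numerically. The function name below is an assumption for illustration.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n sequences (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (4, 5, 10, 20):
    print(n, unrooted_trees(n))
```

The count grows so quickly that exhaustive evaluation is only practical for a handful of sequences.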

For realistic numbers of sequences it is impossible to consider all possible trees.

Need algorithms that can arrive at the ‘best tree’ without considering all possible trees.

Neighbour joining is a very fast approximation to minimum evolution

Neighbour Joining

[Figure: two diagrams of sequences numbered 1 to 8 illustrating a neighbour-joining step]

Choose the pair that minimizes the length of the resulting tree
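
A minimal sketch of this selection step, assuming the standard neighbour-joining criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), where the pair with the smallest Q is joined. The function name and the example distance matrix are illustrative and are not taken from the slides.

```python
def pick_neighbours(dist):
    """Return ((i, j), Q) for the pair with the smallest Q value.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Hypothetical distances between five sequences.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(pick_neighbours(example))
```

Note that the pair chosen is not simply the closest pair: sequences 3 and 4 have the smallest distance in this matrix, but the criterion also accounts for how far each sequence is from all the others.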

Maximum Parsimony

Occam’s Razor

Entia non sunt multiplicanda praeter necessitatem. (Entities should not be multiplied beyond necessity.)

William of Ockham (c. 1287 to 1347)

The best tree is the one which requires the fewest substitutions

• Check each topology

• Count the minimum number of changes required to explain the data

• Choose the tree with the smallest number of changes

• Usually performs well with closely related sequences – but often performs badly with very distantly related sequences

• With distantly related sequences homoplasy becomes a major problem
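
Counting the minimum number of changes for one topology can be sketched with Fitch's algorithm, which is a standard way to do this for unordered characters; the slides do not name a specific algorithm, so treat this as one possible choice. Trees are nested tuples of tip names and the alignment is a dictionary, both assumptions for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree  # assumes a strictly bifurcating tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: at least one extra change is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum of minimum changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, site)[1]
    return total


alignment = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTA"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, alignment), parsimony_score(tree2, alignment))
```

In this toy alignment the third site costs one change on either tree, so it does not help to choose between them; only the first two sites are informative.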

Informative sites: Not all sites contain information about the tree topology using the parsimony approach

Homoplasy: characters that are similar for reasons other than common ancestry (increasingly a problem as sequences become more divergent)

Methods for searching for the ‘best’ tree without considering all trees

Branch & Bound: A method that does not have to consider all trees but still guarantees finding the ‘best’ tree. Slow for large numbers of sequences.

Heuristic methods (no guarantee of finding the best tree)

- Start with some tree (e.g. the neighbour-joining tree)

- Consider making a random change to the tree

- Make the change if it improves the score of the tree

- Stop making changes when you can find no further improvement

Types of change: NNI (nearest-neighbour interchange) -> SPR (subtree pruning and regrafting) -> TBR (tree bisection and reconnection)

(NNI fastest and least rigorous, TBR slowest and most rigorous)

How confident are we that the tree is correct?

Bootstrap values

Bootstrapping is a statistical technique that can use random resampling of data to determine sampling error for tree topologies

Bootstrapping phylogenies

• Characters are resampled with replacement to create many bootstrap replicate data sets

• Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)

• Agreement among the resulting trees is summarized with a majority-rule consensus tree

• Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
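
The resampling and counting steps can be sketched as follows. The tree-building step is passed in as a function; `closest_pair` below is a deliberately crude, hypothetical stand-in that reports only the single most similar pair of sequences as a group, where a real analysis would return all the groups of a parsimony, distance or ML tree.

```python
import random
from itertools import combinations


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (characters) with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, find_groups, replicates=100, seed=0):
    """Proportion of replicates in which each group is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        for group in find_groups(bootstrap_replicate(alignment, rng)):
            counts[group] = counts.get(group, 0) + 1
    return {group: count / replicates for group, count in counts.items()}


def closest_pair(alignment):
    """Hypothetical stand-in for tree building: the most similar pair only."""
    def differences(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    best = min(combinations(sorted(alignment), 2),
               key=lambda pair: differences(*pair))
    return [frozenset(best)]


toy = {"A": "AAAAAAAA", "B": "AAAAAAAA", "C": "CCCCCCCC"}
print(bootstrap_support(toy, closest_pair))
```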

[Figure: tree with bootstrap proportions (81 to 100) on its branches; scale bar 0.05. Taxa: Pagurus bernhardus, Pagurus acadianus, Ellasochirus tenuimanus, Labidochirus splendescens, Lithodes aequispina, Paralithodes camtschatica, Pagurus pollicaris (NE), Pagurus pollicaris (GU), Pagurus longicarpus (NE), Pagurus longicarpus (GU), Clibanarius vittatus, Coenobita sp., Artemia salina]

Bootstrap - interpretation

• Bootstrapping is a very valuable and widely used technique (it is demanded by some journals)

• BPs indicate how likely a given branch is to remain unchanged if additional data, with the same distribution, became available

• BPs are not the same as confidence intervals. There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a ‘good’ bootstrap value (> 70%, > 80%, > 85%?)

• Some theoretical work indicates that BPs can be a conservative estimate of confidence

Inferring trees using Likelihood

The ‘optimality criterion’

The best tree is the one that makes the data have the highest likelihood

The ML optimality criterion will lead to the correct tree given

- enough data (e.g. long enough sequence alignment)

- the correct model (e.g. Kimura 2 parameter model)

Suppose we have a model of evolution (e.g. Jukes & Cantor) that allows us to work out the probability of each pair of characters, given a particular genetic distance (cf. series of scoring matrices like BLOSUM, PAM etc.)

Example: sequences A C G and G A G

D = 0.3: L = 0.06

D = 0.6: L = 0.6 * 0.6 * 0.4 = 0.144

D = 0.9: L = 0.9 * 0.9 * 0.1 = 0.081

[Figure: plot of Likelihood against Distance]

Genetic Distance using Maximum Likelihood

• Require a model of evolution

• Optimise all parameters of the model

• Each evolutionary ‘event’ has an associated likelihood given an inferred genetic distance

• The likelihood of the sequence-pair is a function of the genetic distance (just the product of the likelihoods of each of the inferred ‘events’ at each sequence position)

• Function is maximized
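
These steps can be sketched for the Jukes and Cantor model, which has no free parameters other than the distance itself. The function names, the simple ternary search and the example counts are assumptions for illustration; under this model the maximum should agree with the closed-form Jukes and Cantor distance.

```python
import math


def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of a sequence pair at distance d under Jukes-Cantor."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay   # a site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay   # a site shows one particular different base
    # 0.25 is the equilibrium frequency of the base in the first sequence.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_distance(n_same, n_diff, low=1e-6, high=5.0, steps=100):
    """Maximise the log-likelihood over d with a ternary search."""
    for _ in range(steps):
        m1 = low + (high - low) / 3
        m2 = high - (high - low) / 3
        if jc_log_likelihood(m1, n_same, n_diff) < jc_log_likelihood(m2, n_same, n_diff):
            low = m1
        else:
            high = m2
    return (low + high) / 2


# Hypothetical pair: 70 identical sites and 30 differing sites.
estimate = ml_distance(70, 30)
closed_form = -0.75 * math.log(1 - 4 * 0.3 / 3)
print(estimate, closed_form)
```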

Phylogenetic trees using Maximum Likelihood

• Require a model of evolution

• Each substitution has an associated likelihood given a branch of a certain length

• A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters

• Optimise over parameters of the model

• Optimise over branch lengths

• Sum the likelihood over all possible sequences at ancestral nodes

• Search for the best tree (using heuristics such as TBR)

Models can be made more parameter rich to increase their realism

• The most common additional parameters are:

– A correction to allow different rates for each type of nucleotide change

– Parameters for equilibrium base frequencies

– A correction for the proportion of sites which are unable to change

– A correction for variable rates at those sites which can change

• The values of the additional parameters will be estimated in the process

Likelihood and the number of parameters

More parameters always lead to a better fit of the data

More parameters always lead to a higher value of the likelihood whether or not the additional parameters are providing a ‘significantly’ better fit to the data

Are the extra parameters justified?

- Likelihood ratio test

Likelihood ratio statistic: $2 \log\left( \dfrac{\text{Maximum Likelihood} \mid H_1}{\text{Maximum Likelihood} \mid H_0} \right)$

Has chi-squared distribution

dof = number of additional parameters
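
A minimal sketch of the test for the simplest case, one additional parameter (for example Jukes and Cantor against Kimura 2-parameter). The function name and the log-likelihood values are hypothetical. With one degree of freedom the chi-squared tail probability has the closed form erfc(sqrt(x / 2)), so no statistics library is needed; other degrees of freedom would need a general chi-squared routine.

```python
import math


def likelihood_ratio_test(log_l0, log_l1):
    """Likelihood ratio test for nested models differing by one parameter.

    log_l0: maximised log-likelihood of the simpler model (H0)
    log_l1: maximised log-likelihood of the more general model (H1)
    """
    statistic = 2 * (log_l1 - log_l0)
    # Chi-squared survival function for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2))
    return statistic, p_value


# Hypothetical maximised log-likelihoods.
statistic, p_value = likelihood_ratio_test(-1005.0, -1000.0)
print(statistic, p_value)
```

A small p-value suggests the extra parameter gives a significantly better fit; a large one suggests the simpler model is adequate.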

One model is nested in another if it is a special case of the more general model

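e.g. the Jukes and Cantor model is nested in the Kimura 2P model: it is the special case in which transitions and transversions occur at the same rate

[Figure: substitution matrices over the nucleotides G, C, T, A for the J-C and K2P models]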

Modeltest

- Uses PAUP

- Tries out many nested models of nucleotide substitution

- Decides how many parameters are justified by the data

GTR does not overfit the data for at least some HIV sequences

Bayesian methods

The ‘optimality criterion’

The best tree is the one that has the highest probability of being the true tree

Likelihood: Choose the tree that makes the data the most likely. Equivalent to maximizing $P(D \mid T)$

Bayesian: Choose the most probable tree (tree with the highest posterior probability). Equivalent to maximizing $P(T \mid D)$

Bayes’ Rule

Probability = (Likelihood × Prior Information) / (Some normalising factors)

Mathematically: $P(T \mid D) = \dfrac{P(D \mid T)\, P(T)}{P(D)}$

T = Tree

D = Data
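
Bayes' rule can be sketched for a small, fixed set of candidate trees. The function name, tree labels and likelihood values are hypothetical. The normalising factor P(D) is the sum of likelihood times prior over all candidate trees.

```python
def posterior(likelihoods, priors):
    """Posterior probability P(T | D) for each candidate tree T."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())  # P(D), the normalising factor
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical likelihoods P(D | T) for three trees, with a flat prior.
likelihoods = {"T1": 0.006, "T2": 0.003, "T3": 0.001}
flat_prior = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}
print(posterior(likelihoods, flat_prior))
```

With a flat prior the ranking of the trees is the same as their ranking by likelihood; the posterior simply rescales the values so that they sum to one.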

Important Terms

Prior probability: the probability of the event before considering the data

Posterior probability: the probability of the event after taking the data into consideration

In molecular phylogenetics the prior is usually ‘flat’ so the max likelihood tree is usually also the max probability tree

So why bother?

1. Because we get the answer as a probability

2. Because this formulation allows us to use another approach to get to the best tree (MCMC – see later)

3. Also allows us to integrate over parameters instead of optimising over parameters

MCMC (Markov Chain Monte Carlo)

Produces a long chain of trees/parameters sampled according to their probability

The number of times the chain visits tree X is proportional to the probability of tree X
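
The idea can be sketched with a toy Metropolis sampler over three labelled trees. This is an illustration under strong simplifying assumptions, not how phylogenetic software proposes trees: the proposal here picks any tree uniformly at random, and the weights stand for unnormalised posterior probabilities (likelihood times prior). All names and numbers are hypothetical.

```python
import random


def metropolis(weights, steps, seed=0):
    """Return the fraction of steps the chain spends on each state.

    weights: unnormalised probability of each state (e.g. each tree)
    """
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = {state: 0 for state in states}
    for _ in range(steps):
        proposal = rng.choice(states)  # symmetric proposal
        # Always accept moves to more probable states; accept moves to less
        # probable states with probability equal to the ratio of the weights.
        if rng.random() < weights[proposal] / weights[current]:
            current = proposal
        visits[current] += 1
    return {state: count / steps for state, count in visits.items()}


frequencies = metropolis({"X": 6.0, "Y": 3.0, "Z": 1.0}, steps=100000)
print(frequencies)
```

Only ratios of weights are used, so the normalising factor P(D) never has to be computed; the visit frequencies approach 0.6, 0.3 and 0.1 as the chain gets longer.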

Burn-in

• Typically the chain will take some time before trees are sampled according to their probability

• Initially the probability of the sampled trees increases with time

• Programmes need to be allowed to run until the probabilities are fluctuating randomly about a constant mean

• Data generated before the chain reaches a steady state are discarded

Bayesian methods can be

- relatively fast

- easily interpretable

- often very accurate

But

- sometimes overestimate confidence

- difficult to be sure of convergence (less of a problem with more recent software versions)

=> difficult to decide how long to run the chain

Software for Bayesian phylogenetics: MrBayes