
Phylogenetic Inference


Page 1: Phylogenetic Inference

Phylogenetic Inference

• Involves an attempt to estimate the evolutionary history of a collection of organisms (taxa) or a family of genes

• Two major components
  – Estimation of the evolutionary tree (branching order)
  – Using estimated trees (phylogenies) as an analytical framework for further evolutionary study

• Traditional role: systematics and classification

Page 2: Phylogenetic Inference

Example 1: Closest living relatives of humans

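[Figure: two alternative trees for humans, chimpanzees, bonobos, gorillas, and orangutans, each drawn against a time axis in millions of years ago (MYA); one axis runs from 15-30 MYA to 0 and the other from 14 MYA to 0. Panel labels: “Pre-molecular view (morphology)” and “Emerging picture from mtDNA, most nuclear genes, DNA/DNA hybridization”.]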

Page 3: Phylogenetic Inference

Example 2: Who are whales related to?

Morphological data suggest that whales are a “sister clade” to extant artiodactylans, but molecular data suggest strongly that whales and hippos are more closely related to each other than hippos are to other artiodactylans

Morphology

Mt and nuclear DNA sequences, SINEs, LINEs

Page 4: Phylogenetic Inference

Other interesting applications
Forensics: transmission of HIV by a Florida dentist

Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people:

[Figure: tree whose tips are two DENTIST sequences, Patients A (two sequences), B, C, D, E, F, and G, and local controls 2, 3 (two sequences), 9, and 35. One group of patients is marked “Yes: the HIV sequences from these patients fall within the clade of HIV sequences found in the dentist”; two “No” labels mark sequences that fall outside it.]

From Ou et al. (1992) and Page & Holmes (1998), redrawn by Caro-Beth Stewart

Page 5: Phylogenetic Inference

Other interesting applications
Studying dynamics of microbial communities:

Sequence 16S rDNA to identify and quantify microbes in soil before and after pesticide exposure (many microbes are previously unknown, so study gene sequences phylogenetically to follow changes in community composition)

Known sequences from database

Novel microbial sequences

Page 6: Phylogenetic Inference

Other interesting applications
Predicting evolution of influenza viruses

Lineages with many mutations in one set of positively selected codons were usually the ones which led to successful strains in subsequent seasons

Page 7: Phylogenetic Inference

Other interesting applications
Predicting functions of uncharacterized genes

Use “character-mapping” to infer functions based on parsimonious reconstructions

Many situations where similarity-based methods are inadequate, e.g.:

Page 8: Phylogenetic Inference

Other interesting applications

• Drug discovery: predicting natural ligands for cell surface receptors that are potential drug targets (e.g., G-protein coupled receptors)

G-protein-coupled receptors are a pharmacologically important protein family with approximately 450 genes identified to date. Pathways involving these receptors are the targets of hundreds of drugs, including antihistamines, neuroleptics, antidepressants, and antihypertensives. The functions of many of these proteins are unknown, and determining ligands and signaling pathways is time-consuming and expensive. This difficulty motivates the search for a computational method which can predict ligand and second messenger with high reliability. Classifying this family of proteins helps us classify drugs, a technique which might be called "evolutionary pharmacology”… A computational method based on evolutionary tree reconstruction and employing an accepted-mutation stepmatrix can predict the ligand selectivities and intracellular signaling pathways of uncharacterized receptors, given only the amino acid sequence of the receptor. This dramatically increases the efficiency of functional characterization of new receptors. (http://www.cis.upenn.edu/~krice/receptor.html)

• Vaccine development—engineer vaccines to confer immunity against multiple virus populations by targeting their inferred common ancestors

Page 9: Phylogenetic Inference

Common Phylogenetic Tree Terminology

[Figure: rooted tree with terminal nodes A, B, C, D, and E, annotated with the terms below.]

• Ancestral node or ROOT of the tree
• Internal nodes or divergence points (represent hypothetical ancestors of the taxa)
• Branches (edges) and lineages
• Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

Page 10: Phylogenetic Inference

The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:

[Figure: three trees for taxa A, B, C, D, and E: a completely unresolved or “star” phylogeny, a partially resolved phylogeny, and a fully resolved, bifurcating phylogeny (binary tree). One node is labeled as a polytomy or multifurcation and another as a bifurcation.]

Page 11: Phylogenetic Inference

C-B Stewart, NHGRI lecture, 12/5/00

Three possible unrooted trees for four taxa (A, B, C, D)

Tree 1: A and B on one side of the internal branch, C and D on the other
Tree 2: A and C on one side, B and D on the other
Tree 3: A and D on one side, B and C on the other

Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is “correct”.

We would like this to be the “true” biological tree, that is, one that accurately represents the evolutionary history of the taxa.

However, we must settle for discovering the optimal tree for the phylogenetic method of choice (no guarantee that optimality = truth).

Page 12: Phylogenetic Inference

The number of unrooted trees increases faster than exponentially with the number of taxa

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for three taxa (A to C), four taxa (A to D), five taxa (A to E), and six taxa (A to F).]

Page 13: Phylogenetic Inference

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.

[Figure: unrooted tree of taxa A, B, C, and D with the root position marked, next to the rooted tree obtained from it.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Page 14: Phylogenetic Inference

Now, try it again with the root at another position:

[Figure: the same unrooted tree of taxa A, B, C, and D with the root marked at a different position, next to the resulting rooted tree.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

Page 15: Phylogenetic Inference

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: unrooted tree 1 (A and B on one side, C and D on the other) with its five branches numbered 1 to 5, surrounded by rooted trees 1a, 1b, 1c, 1d, and 1e, obtained by rooting on branches 1, 2, 3, 4, and 5, respectively.]

These trees show five different evolutionary relationships among the taxa

Page 16: Phylogenetic Inference

All of these rearrangements show the same evolutionary relationships between the taxa

[Figure: rooted tree 1a for taxa A, B, C, and D redrawn several times; the tip order and drawing style differ, but the branching order does not.]

Page 17: Phylogenetic Inference

There are two major ways to root trees:

By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: one tree rooted with an outgroup, and one tree of taxa A, B, C, and D with branch lengths 10, 2, 3, 5, and 2 rooted at its midpoint.]

d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
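The midpoint calculation on this slide can be sketched in a few lines of Python. The slide gives only the five branch lengths and the A-to-D path, so the topology used here (A and B joined at a node `u`, C and D joined at a node `v`, and the assignment of the two length-2 branches to B and C) is an assumption made for illustration, as are the node names.

```python
from itertools import combinations

# Hypothetical unrooted tree consistent with the slide's branch lengths.
# Assumed: A (10) and B (2) attach to node "u"; C (2) and D (5) attach to
# node "v"; the internal branch u-v has length 3.
edges = {
    ("A", "u"): 10, ("B", "u"): 2,
    ("u", "v"): 3,
    ("C", "v"): 2, ("D", "v"): 5,
}

def neighbors(node):
    """Yield (neighbor, branch length) for every branch touching node."""
    for (x, y), length in edges.items():
        if x == node:
            yield y, length
        elif y == node:
            yield x, length

def path(start, goal, seen=()):
    """Return [(node, distance from start), ...] along the start-goal path."""
    if start == goal:
        return [(start, 0)]
    for nxt, length in neighbors(start):
        if nxt in seen:
            continue
        rest = path(nxt, goal, seen + (start,))
        if rest is not None:
            return [(start, 0)] + [(n, d + length) for n, d in rest]
    return None

tips = ["A", "B", "C", "D"]
dist = {(a, b): path(a, b)[-1][1] for a, b in combinations(tips, 2)}

# The two most distant taxa define the path on which the root is placed.
(far_a, far_b), longest = max(dist.items(), key=lambda item: item[1])
midpoint = longest / 2

# Find the branch on that path that contains the midpoint.
route = path(far_a, far_b)
for (n1, d1), (n2, d2) in zip(route, route[1:]):
    if d1 <= midpoint <= d2:
        root_branch = (n1, n2)
        break

print(far_a, far_b, longest, midpoint, root_branch)
```

Under these assumptions the root falls on the terminal branch leading to A, 9 units from the tip.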

Page 18: Phylogenetic Inference

C-B Stewart, NHGRI lecture,12/5/00

# Taxa    # Unrooted trees    x    # Roots    =    # Rooted trees
3         1                        3               3
4         3                        5               15
5         15                       7               105
6         105                      9               945
7         945                      11              10,395
8         10,395                   13              135,135
9         135,135                  15              2,027,025
.         .                        .               .
30        ~8.69 x 10^36            57              ~4.95 x 10^38

(2N - 3)!! = # rooted trees for N taxa

Each unrooted tree theoretically can be rooted anywhere along any of its branches

[Figure: example unrooted trees for four taxa (A to D), five taxa (A to E), and six taxa (A to F).]
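The counts in this table follow directly from the two double-factorial formulas, (2N - 5)!! for unrooted trees and (2N - 3)!! for rooted trees. A minimal Python sketch (the function names are illustrative, not from the lecture) reproduces the rows and checks that unrooted trees times possible root positions equals rooted trees:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted(n_taxa):
    """Number of unrooted bifurcating trees: (2N - 5)!!"""
    return double_factorial(2 * n_taxa - 5)

def n_rooted(n_taxa):
    """Number of rooted bifurcating trees: (2N - 3)!!"""
    return double_factorial(2 * n_taxa - 3)

for n in (3, 4, 5, 6, 7, 8, 9, 30):
    roots = 2 * n - 3  # an unrooted tree for n taxa has 2n - 3 branches
    assert n_unrooted(n) * roots == n_rooted(n)
    print(n, n_unrooted(n), roots, n_rooted(n))
```

Python integers are arbitrary precision, so the 30-taxon row is computed exactly rather than approximated.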

Page 19: Phylogenetic Inference

Types of data used in phylogenetic inference:

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

Example 1 (above the diagonal): Uncorrected “p” distance (= observed percent sequence difference)

Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
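Both kinds of distance on this slide can be recomputed from the five aligned sequences. The sketch below uses the standard Kimura 2-parameter formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitions and transversions; the function names are illustrative only.

```python
import math
from itertools import combinations

seqs = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}

def count_differences(s1, s2):
    """Return (number of transitions, number of transversions)."""
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions, transversions

def p_distance(s1, s2):
    """Uncorrected proportion of sites that differ."""
    ts, tv = count_differences(s1, s2)
    return (ts + tv) / len(s1)

def k2p_distance(s1, s2):
    """Kimura 2-parameter estimate of substitutions per site."""
    ts, tv = count_differences(s1, s2)
    P, Q = ts / len(s1), tv / len(s1)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

for a, b in combinations(seqs, 2):
    print(a, b, round(p_distance(seqs[a], seqs[b]), 2),
          round(k2p_distance(seqs[a], seqs[b]), 2))
```

The corrected distance is always at least as large as the p-distance, and the gap widens as sequences diverge, because multiple substitutions at the same site go unobserved.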

Page 20: Phylogenetic Inference

Similarity vs. Evolutionary Relationship:

Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.

Similar: having likeness or resemblance (an observation)

Related: genetically connected (an historical fact)

Two taxa can be most similar without being most closely related:

[Figure: tree of Taxon A, Taxon B, Taxon C, and Taxon D with branch lengths 1, 1, 1, 6, 3, and 5.]

C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).

Page 21: Phylogenetic Inference

Types of Similarity

Observed similarity between two entities can be due to:

Evolutionary relationship:
  – Shared ancestral characters (‘symplesiomorphies’)
  – Shared derived characters (‘synapomorphies’)

Homoplasy (independent evolution of the same character):
  – Convergent events (in either related or unrelated entities)
  – Parallel events (in related entities)
  – Reversals (in related entities)

[Figure: small trees with C, G, and T character states at the nodes, illustrating these cases.]

Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are ‘ultrametric’).

Page 22: Phylogenetic Inference

METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:

Property 1: d(a, b) ≥ 0                        Non-negativity

Property 2: d(a, b) = d(b, a)                  Symmetry

Property 3: d(a, b) = 0 if and only if a = b   Distinctness

and...

Property 4: d(a, c) ≤ d(a, b) + d(b, c)        Triangle inequality

[Figure: triangle with vertices a, b, and c and side lengths 6, 9, and 5.]

Page 23: Phylogenetic Inference

ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:

Property 5: d(a, b) ≤ maximum [d(a, c), d(b, c)]

This implies that the two largest distances are equal, so that they define an isosceles triangle:

[Figure: isosceles triangle with vertices a, b, and c and side lengths 4, 6, and 6, next to the corresponding tree for a, b, and c (branch labels 2, 2, 2, and 4).]

Similarity = Relationship if the distances are ultrametric!

If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.
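Properties 1 to 5 can be checked mechanically for any distance table. The sketch below is one possible way to do so. It applies the checks to the two triangles drawn on the slides; which side length belongs to which pair of taxa in the 6-9-5 triangle is an assumption, since the figure itself is not recoverable.

```python
from itertools import permutations

def make_table(ab, ac, bc):
    """Build a symmetric distance table for the three taxa a, b, c."""
    return {
        "a": {"a": 0, "b": ab, "c": ac},
        "b": {"a": ab, "b": 0, "c": bc},
        "c": {"a": ac, "b": bc, "c": 0},
    }

def is_metric(d):
    """Properties 1-4: non-negativity, symmetry, distinctness, triangle."""
    taxa = list(d)
    for x in taxa:
        for y in taxa:
            if d[x][y] < 0 or d[x][y] != d[y][x]:
                return False
            if (d[x][y] == 0) != (x == y):
                return False
    return all(d[x][z] <= d[x][y] + d[y][z]
               for x, y, z in permutations(taxa, 3))

def is_ultrametric(d):
    """Property 5 in addition to the four metric properties."""
    return is_metric(d) and all(
        d[x][y] <= max(d[x][z], d[y][z])
        for x, y, z in permutations(list(d), 3)
    )

metric_only = make_table(ab=6, ac=9, bc=5)   # side assignment assumed
clock_like = make_table(ab=4, ac=6, bc=6)    # isosceles: two largest equal

print(is_metric(metric_only), is_ultrametric(metric_only))
print(is_metric(clock_like), is_ultrametric(clock_like))
```

The first triangle satisfies the triangle inequality but fails Property 5 because its two largest sides (9 and 6) differ; the second passes both.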

Page 24: Phylogenetic Inference

General strategy for estimating a phylogeny

1. Get data

2. Select an optimality criterion (e.g., parsimony, least-squares distance, maximum likelihood)

3. Choose a search strategy (e.g., stepwise addition with branch swapping, branch-and-bound)

4. Evaluate optimality criterion for each tree visited during search, always keeping track of best tree(s) found

Page 25: Phylogenetic Inference
Page 26: Phylogenetic Inference

Parsimony (optimality criterion)

• In general: choose the tree requiring the fewest number of (possibly weighted) character-state changes (= steps)

• Assume character independence; can calculate length required by each character and sum over characters to get total tree length

Page 27: Phylogenetic Inference

Parsimony variants used for molecular data

• Fitch parsimony (unordered/nonadditive): Each change counts 1 step, regardless of the nature of this change

• Transversion parsimony: changes between a purine (A or G) and a pyrimidine (C or T) (“transversions”) count 1, changes between two purines or between two pyrimidines (“transitions”) count 0

• Generalized parsimony: User specifies cost of each type of change

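[Figure: the four bases A, C, G, and T connected by two kinds of arrows; the legend assigns 1 step to one kind of change and 3 steps to the other.]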

Page 28: Phylogenetic Inference

Calculating tree lengths under parsimony using “brute force”

• For each character:
  – Consider every possible ancestral state reconstruction
  – Count total cost required for each of these reconstructions
  – Sum over all characters

Page 29: Phylogenetic Inference

[Figure: the four-taxon tree with tip states G, A, C, and C drawn 16 times, once for each assignment of states to its two internal nodes.]

Cost matrices (rows and columns in the order A, C, G, T):

equal =
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0

tv4 =
0 4 1 4
4 0 4 1
1 4 0 4
4 1 4 0

Costs are listed branch by branch in the order: ancestor of (G, A) to G, ancestor of (G, A) to A, branch between the two internal nodes, ancestor of (C, C) to C, ancestor of (C, C) to C.

Ancestor of (G, A)   Ancestor of (C, C)   equal          tv4
A                    A                    1+0+0+1+1=3    1+0+0+4+4=9
C                    A                    1+1+1+1+1=5    4+4+4+4+4=20
G                    A                    0+1+1+1+1=4    0+1+1+4+4=10
T                    A                    1+1+1+1+1=5    4+4+4+4+4=20
A                    C                    1+0+1+0+0=2    1+0+4+0+0=5
C                    C                    1+1+0+0+0=2    4+4+0+0+0=8
G                    C                    0+1+1+0+0=2    0+1+4+0+0=5
T                    C                    1+1+1+0+0=3    4+4+1+0+0=9
A                    G                    1+0+1+1+1=4    1+0+1+4+4=10
C                    G                    1+1+1+1+1=5    4+4+4+4+4=20
G                    G                    0+1+0+1+1=3    0+1+0+4+4=9
T                    G                    1+1+1+1+1=5    4+4+4+4+4=20
A                    T                    1+0+1+1+1=4    1+0+4+1+1=7
C                    T                    1+1+1+1+1=5    4+4+1+1+1=11
G                    T                    0+1+1+1+1=4    0+1+4+1+1=7
T                    T                    1+1+0+1+1=4    4+4+0+1+1=10

Minimum tree length: 2 steps under equal costs, 5 under tv4.
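The brute-force calculation on this slide can be reproduced by looping over all 16 assignments of states to the two internal nodes. This is a sketch for the single four-taxon example only; `n1` denotes the ancestor of tips G and A, `n2` the ancestor of the two C tips, and both names are illustrative.

```python
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}

# "equal": every change costs 1.
EQUAL = {(x, y): 0 if x == y else 1 for x in BASES for y in BASES}

# "tv4": transitions cost 1, transversions cost 4.
TV4 = {
    (x, y): 0 if x == y
    else (1 if (x in PURINES) == (y in PURINES) else 4)
    for x in BASES for y in BASES
}

def tree_length(cost, n1, n2, tips=("G", "A", "C", "C")):
    """Length of the tree ((G, A), (C, C)) for internal states n1 and n2."""
    t1, t2, t3, t4 = tips
    return (cost[n1, t1] + cost[n1, t2]    # n1 to its two tips
            + cost[n1, n2]                 # internal branch
            + cost[n2, t3] + cost[n2, t4]) # n2 to its two tips

def brute_force(cost):
    """Minimum length over every possible ancestral state reconstruction."""
    return min(tree_length(cost, n1, n2)
               for n1, n2 in product(BASES, repeat=2))

print(brute_force(EQUAL), brute_force(TV4))
```

With m internal nodes there are 4^m reconstructions per character, which is why this approach is only practical for tiny trees.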

Page 30: Phylogenetic Inference

Calculating tree lengths using dynamic programming

• Analogous to pairwise alignment: determine implications of each possible state assignment at one level (node) for length at next level (parent node)

Page 31: Phylogenetic Inference

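[Figure: four-taxon tree with tips W = G, Y = A, X = C, and Z = C. Node 1 joins W and Y, node 2 joins X and Z, and node 3 (the root) joins nodes 1 and 2.]

Initial cost vectors at the tips (states in the order A, C, G, T; 0 for the observed state, ∞ otherwise):

W (G):  ∞  ∞  0  ∞
Y (A):  0  ∞  ∞  ∞
X (C):  ∞  0  ∞  ∞
Z (C):  ∞  0  ∞  ∞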

Page 32: Phylogenetic Inference

Each entry is computed with the tv4 cost matrix (transitions cost 1, transversions cost 4); states are in the order A, C, G, T.

Node 1 (children W = G and Y = A):

A: min(∞, ∞, 1, ∞) + min(0, ∞, ∞, ∞) = 1 + 0 = 1
C: min(∞, ∞, 4, ∞) + min(4, ∞, ∞, ∞) = 4 + 4 = 8
G: min(∞, ∞, 0, ∞) + min(1, ∞, ∞, ∞) = 0 + 1 = 1
T: min(∞, ∞, 4, ∞) + min(4, ∞, ∞, ∞) = 4 + 4 = 8

Node 2 (children X = C and Z = C):

A: min(∞, 4, ∞, ∞) + min(∞, 4, ∞, ∞) = 4 + 4 = 8
C: min(∞, 0, ∞, ∞) + min(∞, 0, ∞, ∞) = 0 + 0 = 0
G: min(∞, 4, ∞, ∞) + min(∞, 4, ∞, ∞) = 4 + 4 = 8
T: min(∞, 1, ∞, ∞) + min(∞, 1, ∞, ∞) = 1 + 1 = 2

Node 3 (root; children node 1 = (1, 8, 1, 8) and node 2 = (8, 0, 8, 2)):

A: min(1, 12, 2, 12) + min(8, 4, 9, 6) = 1 + 4 = 5
C: min(5, 8, 5, 9) + min(12, 0, 12, 3) = 5 + 0 = 5
G: min(2, 12, 1, 12) + min(9, 4, 8, 6) = 1 + 4 = 5
T: min(5, 9, 5, 8) + min(12, 1, 12, 2) = 5 + 1 = 6
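The dynamic-programming calculation on this slide can be written as one small function applied from the tips toward the root. The sketch below assumes the tip assignment W = G, Y = A, X = C, Z = C implied by the slide's minima and uses the tv4 cost matrix; the function names are illustrative.

```python
INF = float("inf")
BASES = "ACGT"

# tv4 cost matrix, rows and columns in the order A, C, G, T.
TV4 = [
    [0, 4, 1, 4],
    [4, 0, 4, 1],
    [1, 4, 0, 4],
    [4, 1, 4, 0],
]

def tip_vector(base):
    """Cost 0 for the observed state, infinity for every other state."""
    return [0 if b == base else INF for b in BASES]

def combine(left, right, cost):
    """Cost vector of a parent node from the vectors of its two children.

    For each parent state i, take the cheapest way to reach each child:
    min over child states j of (cost of i -> j) + (child cost for j).
    """
    n = len(BASES)
    return [
        min(cost[i][j] + left[j] for j in range(n))
        + min(cost[i][j] + right[j] for j in range(n))
        for i in range(n)
    ]

node1 = combine(tip_vector("G"), tip_vector("A"), TV4)  # W and Y
node2 = combine(tip_vector("C"), tip_vector("C"), TV4)  # X and Z
root = combine(node1, node2, TV4)                       # node 3

print(node1, node2, root, min(root))
```

The minimum of the root vector is the tree length for this character, and it agrees with the brute-force minimum while visiting each node only once.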

Page 33: Phylogenetic Inference

Faster algorithms for special cases

• Farris (1970) algorithm for ordered characters

• Fitch (1971) algorithm for unordered characters

• Assign “state sets” to terminal taxa based on observed data, and initialize tree length to 0

• Traverse tree from tips to root; for each node consider state sets of two immediate descendants (children)

– If child state sets have a nonempty intersection, new state set equals this intersection

– Otherwise, make new state set equal to the union of the two child state sets, and add 1 to the tree length

Page 34: Phylogenetic Inference

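Example of tree length calculation using Fitch optimization

[Figure: the four-taxon tree with tips W, Y, X, and Z, internal nodes 1 and 2, and root 3, shown at four successive stages. Each label gives a state set and the tree length accumulated so far.]

Start:  tips {G}:0, {A}:0, {C}:0, {C}:0
Node 1: {G} and {A} do not intersect, so the state set is {A,G} and the length becomes 1
Node 2: {C} and {C} intersect, so the state set is {C} and no step is added
Node 3: {A,G} and {C} do not intersect, so the state set is {A,C,G} and the length becomes 2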
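The Fitch pass described above (intersection if nonempty, otherwise union plus one step) fits in a short recursive function. This is a minimal sketch for a single unordered character on a rooted binary tree written as nested tuples; that representation is a choice made here for illustration, not something from the lecture.

```python
def fitch(node):
    """Return (state set, tree length) for the subtree rooted at node.

    A tip is a one-letter string; an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_set, left_len = fitch(left)
    right_set, right_len = fitch(right)
    common = left_set & right_set
    if common:
        # Nonempty intersection: no extra step is needed at this node.
        return common, left_len + right_len
    # Empty intersection: take the union and count one step.
    return left_set | right_set, left_len + right_len + 1

# Tips in the order W = G, Y = A, X = C, Z = C, as in the example slide.
tree = (("G", "A"), ("C", "C"))
states, length = fitch(tree)
print(sorted(states), length)
```

This gives the same length as the equal-cost brute-force search while doing only one set operation per internal node.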

Page 35: Phylogenetic Inference

Searching for trees

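• Generation of all possible trees

1. Generate all 3 trees for first 4 taxa:

[Figure: the 3-taxon tree for A, B, and C, with taxon D added to each of its three branches to give the three possible 4-taxon trees.]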

Page 36: Phylogenetic Inference

Searching for trees

2. Generate all 15 trees for first 5 taxa:

[Figure: one 4-taxon tree for A, B, C, and D, with taxon E added to each of its five branches to give five 5-taxon trees.]

(likewise for each of the other two 4-taxon trees)

Page 37: Phylogenetic Inference

Searching for trees

3. Full search tree:

[Figure: search tree that starts from the 3-taxon tree for A, B, and C, branches to the three 4-taxon trees, and then to all fifteen 5-taxon trees for A, B, C, D, and E.]

Page 38: Phylogenetic Inference

Searching for trees

Branch and bound algorithm:

The branch-and-bound algorithm for exact solution of the problem of finding an optimal parsimony tree. The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current lower bound on the optimal tree length, this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current lower bound is reduced. See text for additional explanation, and circled numbers represent the order in which phylogenetic trees are visited in the search tree.

[Figure: the search tree drawn with the 3-, 4-, and 5-taxon trees for A to E and a tree length at each node (values between 213 and 280), cross-bars on terminated paths, asterisks at lengths 229, 241, and 242, and circled numbers 1 to 34 giving the order in which trees are visited.]

Page 39: Phylogenetic Inference

Searching for trees

Heuristic search methods

A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.

[Figure: the same search tree with only the paths followed by stepwise addition, tree lengths at each node (the final 233-step tree marked with an asterisk), and circled numbers 1 to 16 giving the order of evaluation.]

Page 40: Phylogenetic Inference

Searching for trees

Heuristic search methods continued

Nearest neighbor interchange:

[Figure: a 6-taxon tree with tips 1 to 6.]

All possible NNIs on 6-taxon tree:

[Figure: the six rearranged trees obtained from the starting tree by nearest neighbor interchange.]

Page 41: Phylogenetic Inference

Searching for trees

Heuristic search methods continued

Subtree pruning regrafting:

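[Figure: 6-taxon tree (tips 1 to 6) with three branches marked x, y, and z as pruning points; for each one, the pruned subtree is shown with the candidate regrafting branches on the remaining tree labeled a, b, c, and d.]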

Page 42: Phylogenetic Inference

Searching for trees

Heuristic search methods continued

Trees resulting from SPR:

[Figure: the ten rearranged 6-taxon trees, labeled by pruning point and regrafting branch: x.a, x.b, x.c, x.d, y.a, y.b, z.a, z.b, z.c, and z.d.]

Page 43: Phylogenetic Inference

Searching for trees

Heuristic search methods continued

Tree bisection-reconnection:

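[Figure: 6-taxon tree (tips 1 to 6) with branches x, y, and z marked; the tree bisected at x, with branches of the two subtrees labeled (r, s, t, u, v, w, and x') as candidate reconnection points; an example reconnected tree; and a copy of the tree annotated with reconnection distances 0, 0, 1, 1, 2, and 2.]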

Page 44: Phylogenetic Inference

Searching for trees

Heuristic search methods continued

Tree bisection-reconnection:

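[Figure: the 6-taxon tree bisected at y, with branches of the two subtrees labeled (r, s, v, w, and y') as candidate reconnection points; an example reconnected tree; and a copy of the tree annotated with reconnection distances 0, 0, 1, 1, 1, and 1.]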

Page 45: Phylogenetic Inference

Star-decomposition search

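[Figure: star-decomposition search on five taxa (1 to 5), shown in three stages labeled Step 1, Step 2, and Step 3. The search starts from the unresolved star tree; at each step alternative pairings of lineages are evaluated and one pair is joined, until the tree is resolved.]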

Page 46: Phylogenetic Inference

Other search strategies

• These “hill-climbing” methods work well for up to 20-30 taxa. For larger numbers of taxa, they are highly prone to entrapment in local optima. Therefore, additional strategies may be necessary:
  – Random restart (random trees, stepwise addition with random addition sequences)
  – Other optimization (meta)heuristics: iterated local search (restart after random perturbations); simulated annealing and other stochastic optimization methods
  – Genetic algorithms and other population-based approaches

Page 47: Phylogenetic Inference

Overview of maximum likelihood as used in phylogenetics

• Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution

Likelihood(hypothesis) ∝ Prob(data | hypothesis)

Likelihood(tree, model) = k Prob(observed sequences | tree, model)

[not Prob(tree | data, model)]

Page 50: Phylogenetic Inference

Computing the likelihood of a single tree

Aligned sequences, sites 1 to N, with site j highlighted:

      1        j         N
(1)   C…GGACA…C…GTTTA…C
(2)   C…AGACA…C…CTCTA…C
(3)   C…GGATA…A…GTTAA…C
(4)   C…GGATA…G…CCTAG…C

[Figure: tree with tips (1), (2), (3), and (4) and internal nodes (5) and (6); at site j the tips carry C, C, A, and G.]

Likelihood at site j = Prob(tip states with internal nodes A, A) + Prob(tip states with internal nodes A, C) + … + Prob(tip states with internal nodes T, T)

But use Felsenstein (1981) pruning algorithm

Page 57: Phylogenetic Inference

Computing the likelihood of a single tree

$$L = L_1 L_2 \cdots L_N = \prod_{j=1}^{N} L_j$$

$$\ln L = \ln L_1 + \ln L_2 + \cdots + \ln L_N = \sum_{j=1}^{N} \ln L_j$$
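The single-site likelihood can be computed either by summing over all 16 assignments of states to the two internal nodes or, equivalently, with Felsenstein's pruning algorithm. The sketch below does both for the site pattern C, C, A, G. It assumes a Jukes-Cantor model, places the root at node (6), treats node (5) as the ancestor of tips (1) and (2), and uses arbitrary branch lengths; none of these choices comes from the slides.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(i, j, v):
    """Jukes-Cantor probability of state j after branch length v, given i."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node):
    """Conditional likelihoods: Prob(tip data below node | node state s).

    A tip is a one-letter string; an internal node is a tuple of
    (child, branch length) pairs.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, v in node:
        below = partial(child)
        for s in BASES:
            result[s] *= sum(jc_prob(s, t, v) * below[t] for t in BASES)
    return result

def pruning_likelihood(root):
    """Site likelihood: average the root partials over equal base frequencies."""
    return sum(0.25 * value for value in partial(root).values())

def brute_force_likelihood(v1, v2, v3, v4, v5):
    """Sum over all 16 state assignments to internal nodes (5) and (6)."""
    total = 0.0
    for s5, s6 in product(BASES, repeat=2):
        total += (0.25 * jc_prob(s6, s5, v5)
                  * jc_prob(s5, "C", v1) * jc_prob(s5, "C", v2)
                  * jc_prob(s6, "A", v3) * jc_prob(s6, "G", v4))
    return total

# Hypothetical branch lengths for tips (1)-(4) and the internal branch.
v1, v2, v3, v4, v5 = 0.1, 0.2, 0.3, 0.4, 0.05
node5 = (("C", v1), ("C", v2))
node6 = ((node5, v5), ("A", v3), ("G", v4))

site_L = pruning_likelihood(node6)
print(site_L, brute_force_likelihood(v1, v2, v3, v4, v5), math.log(site_L))
```

The two numbers agree. The advantage of pruning is that its cost grows linearly with the number of nodes, whereas the explicit sum grows as 4 to the power of the number of internal nodes. Summing `math.log(site_L)` over sites gives ln L.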

Page 58: Phylogenetic Inference

Finding the maximum-likelihood tree (in principle)

• Evaluate the likelihood of each possible tree for a given collection of taxa.

• Choose the tree topology which maximizes the likelihood over all possible trees.

Page 60: Phylogenetic Inference

Probability calculations require…

• An explicit model of substitution that specifies change probabilities for a given branch length:

$$
Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}
$$

Jukes-Cantor
Kimura 2-parameter
Hasegawa-Kishino-Yano (HKY)
Felsenstein 1981, 1984
General time-reversible

• An estimate of optimal branch lengths in units of expected amount of change (ν = rate x time)

$$P(\nu) = e^{Q\nu}$$
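The relation between the rate matrix Q and the change probabilities, P(ν) = e^{Qν}, can be illustrated for the simplest model in the list. The sketch below builds a Jukes-Cantor Q (scaled here so that one unit of branch length is one expected substitution per site, which is an assumption of this example), exponentiates it with a truncated Taylor series, and compares the result with the well-known closed-form Jukes-Cantor probabilities.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, v, terms=40):
    """exp(Q v) as the truncated Taylor series sum over k of (Q v)^k / k!."""
    n = len(q)
    qv = [[q[i][j] * v for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]           # current term, starts at I
    for k in range(1, terms):
        term = mat_mul(term, qv)
        term = [[x / k for x in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# Jukes-Cantor: equal base frequencies and a single substitution type.
# Off-diagonal rates of 1/3 make the total rate of leaving a state 1.
JC_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

v = 0.5
P = mat_exp(JC_Q, v)
same = 0.25 + 0.75 * math.exp(-4.0 * v / 3.0)   # closed form, i == j
diff = 0.25 - 0.25 * math.exp(-4.0 * v / 3.0)   # closed form, i != j
print(P[0][0], same, P[0][1], diff)
```

For richer models such as GTR there is no equally simple closed form, and the exponential is normally obtained numerically (for example by eigendecomposition) rather than by a Taylor series; the series is used here only because it needs no external library.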

Page 63: Phylogenetic Inference

A Family of Reversible Substitution Models

[Figure: diagram of nested models, from the most general (GTR) to the most restricted (JC); each arrow is labeled with the restriction that turns one model into the next.]

Models:
GTR (general time-reversible)
TrN (Tamura-Nei)
SYM
HKY85 (Hasegawa-Kishino-Yano), F84 (Felsenstein)
K3ST (Kimura 3-subst. type)
F81 (Felsenstein)
K2P (Kimura 2-parameter)
JC (Jukes-Cantor)

Restrictions:
GTR to TrN: 3 substitution types (transversions, 2 transition classes)
GTR to SYM: Equal base frequencies
TrN to HKY85/F84: 2 substitution types (transitions vs. transversions)
SYM to K3ST: 3 substitution types (transitions, 2 transversion classes)
HKY85/F84 to F81: Single substitution type
HKY85/F84 to K2P: Equal base frequencies
K3ST to K2P: 2 substitution types (transitions vs. transversions)
F81 to JC: Equal base frequencies
K2P to JC: Single substitution type

Page 64: Phylogenetic Inference

E.g., transition probabilities for HKY and F84:

$$
P_{ij}(\nu) =
\begin{cases}
\pi_j + \pi_j\left(\dfrac{1}{\Pi_j} - 1\right)e^{-\mu\nu} + \left(\dfrac{\Pi_j - \pi_j}{\Pi_j}\right)e^{-\mu\nu A} & (i = j) \\[2ex]
\pi_j + \pi_j\left(\dfrac{1}{\Pi_j} - 1\right)e^{-\mu\nu} - \left(\dfrac{\pi_j}{\Pi_j}\right)e^{-\mu\nu A} & (i \neq j,\ \text{transition}) \\[2ex]
\pi_j\left(1 - e^{-\mu\nu}\right) & (i \neq j,\ \text{transversion})
\end{cases}
$$

Page 65: Phylogenetic Inference

The Relevance of Branch Lengths

[Figure: two panels, each showing the tip states C C A A A A A A A A on a tree, with the states A and C marked on the tree.]

Page 67: Phylogenetic Inference

Concerns about statistical properties and suitability of models (assumptions)

Consistency

If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.

Page 69: Phylogenetic Inference

Two levels of maximization

• Nei (1987)
  – “…the likelihood computed in this method is conditional for each topology, so it is not clear whether or not the topology showing the highest likelihood has the highest probability of being the true topology…”

• Yang (1996)
  – “Literally it is a maximum maximum likelihood method … The failure to recognize the complexity of the problem has caused much controversy … Felsenstein (1973, 1978) referred to the regularity conditions of Wald (1949) for a proof of …consistency. These conditions would include the continuity and differentiability of the likelihood function with respect to the topology parameter. These concepts are not defined.”

Page 71: Phylogenetic Inference

“Likelihood” is consistent.

• Two proofs:
  – Chang (1996) in Mathematical Biosciences
  – Rogers (1997) in Systematic Biology

These proofs establish that the probability that the true tree has a higher likelihood than any other possible tree approaches one as the number of sites (characters) increases toward infinity. Chang called his proof a “customized variant of the fundamental consistency result of Wald.”

Page 72: Phylogenetic Inference

When does maximum likelihood work better than parsimony?

Page 73: Phylogenetic Inference

When does maximum likelihood work better than parsimony?

• When you’re in the “Felsenstein Zone” (Felsenstein, 1978)

[Figure: four-taxon tree with taxa A, B, C and D]

Page 74: Phylogenetic Inference

In the Felsenstein Zone

Substitution rates:

     A   C   G   T
A    -   5   6   2
C    5   -   3   8
G    6   3   -   1
T    2   8   1   -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: four-taxon tree with taxa A, B, C and D; branch lengths shown are 0.1, 0.1, 0.1, 0.8 and 0.8]
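As a rough illustration of how the substitution rates and base frequencies above could define a simulation model, the sketch below builds a GTR instantaneous rate matrix from them. It assumes the conventional parameterization (off-diagonal entry = relative rate × frequency of the target base, rescaled to one expected substitution per unit time); the slides do not say how the rates were parameterized, so treat this as one plausible reading. All function and variable names are hypothetical.

```python
BASES = "ACGT"

# Relative substitution rates and base frequencies as listed on the slide.
RATES = {("A", "C"): 5, ("A", "G"): 6, ("A", "T"): 2,
         ("C", "G"): 3, ("C", "T"): 8, ("G", "T"): 1}
FREQS = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}


def gtr_rate_matrix(rates, freqs, normalize=True):
    """Build a GTR rate matrix Q as a nested dict, Q[i][j].

    Assumed convention: Q[i][j] = rate(i, j) * freq(j) for i != j,
    with each diagonal entry set so that the row sums to zero.
    """
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                key = (i, j) if (i, j) in rates else (j, i)
                q[i][j] = rates[key] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    if normalize:
        # Rescale so the expected number of substitutions per unit time is 1.
        mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
        for i in BASES:
            for j in BASES:
                q[i][j] /= mean_rate
    return q


if __name__ == "__main__":
    Q = gtr_rate_matrix(RATES, FREQS)
    for i in BASES:
        print(i, [round(Q[i][j], 3) for j in BASES])
```

Under this convention the matrix is time-reversible: freq(i) × Q[i][j] equals freq(j) × Q[j][i] for every pair of bases.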

Page 75: Phylogenetic Inference

In the Felsenstein Zone

[Figure: proportion correct (0 to 1) versus sequence length (0 to 10,000) for parsimony]

Page 76: Phylogenetic Inference

In the Felsenstein Zone

[Figure: proportion correct (0 to 1) versus sequence length (0 to 10,000) for parsimony and ML-GTR]

Page 77: Phylogenetic Inference

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of taxa 1, 2, 3 and 4, with taxa 1 and 4 drawn at the top and taxa 2 and 3 at the bottom; taxa 2 and 3 both have state A]

Pattern type (states at taxa 1 and 4):
I = Uninformative (constant): 1 = A, 4 = A (zero changes required on any tree)

Page 78: Phylogenetic Inference

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of taxa 1, 2, 3 and 4, with taxa 1 and 4 drawn at the top and taxa 2 and 3 at the bottom; taxa 2 and 3 both have state A]

Pattern type (states at taxa 1 and 4):
I = Uninformative (constant): 1 = A, 4 = A
II = Uninformative: 1 = A, 4 = G (one change required on any tree)

Page 79: Phylogenetic Inference

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of taxa 1, 2, 3 and 4, with taxa 1 and 4 drawn at the top and taxa 2 and 3 at the bottom; taxa 2 and 3 both have state A]

Pattern type (states at taxa 1 and 4):
I = Uninformative (constant): 1 = A, 4 = A
II = Uninformative: 1 = A, 4 = G
III = Uninformative: 1 = C, 4 = G (two changes required on any tree)

Page 80: Phylogenetic Inference

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of taxa 1, 2, 3 and 4, with taxa 1 and 4 drawn at the top and taxa 2 and 3 at the bottom; taxa 2 and 3 both have state A]

Pattern type (states at taxa 1 and 4):
I = Uninformative (constant): 1 = A, 4 = A
II = Uninformative: 1 = A, 4 = G
III = Uninformative: 1 = C, 4 = G
IV = Misinformative: 1 = G, 4 = G (two changes required on true tree)

Page 81: Phylogenetic Inference

The long-branch attraction (LBA) problem

[Figure: alternative tree that groups taxa 1 and 4 (both G) apart from taxa 2 and 3 (both A)]

… but this tree needs only one step
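The step counts quoted for the four pattern types can be checked with a small Fitch-parsimony sketch. Two things here are assumptions rather than statements from the slides: the true tree is taken to pair taxon 1 with 2 and taxon 3 with 4, and the unrooted trees are rooted arbitrarily on the internal branch (which does not change the count). The function and variable names are hypothetical.

```python
def fitch(tree, states):
    """Return (state_set, changes) for a rooted binary tree.

    `tree` is either a taxon label or a 2-tuple of subtrees;
    `states` maps taxon labels to character states.
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, states):
    """Minimum number of changes the site pattern requires on the tree."""
    return fitch(tree, states)[1]


# Assumed true tree, and the alternative that pairs taxa 1 and 4.
TRUE_TREE = ((1, 2), (3, 4))
LBA_TREE = ((1, 4), (2, 3))

PATTERNS = {
    "I": {1: "A", 2: "A", 3: "A", 4: "A"},
    "II": {1: "A", 2: "A", 3: "A", 4: "G"},
    "III": {1: "C", 2: "A", 3: "A", 4: "G"},
    "IV": {1: "G", 2: "A", 3: "A", 4: "G"},
}

if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        print(name,
              parsimony_length(TRUE_TREE, pattern),
              parsimony_length(LBA_TREE, pattern))
```

Only pattern IV distinguishes the two trees, and it favors the wrong one, which is why such sites are labeled misinformative.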

Page 82: Phylogenetic Inference

When do both methods fail?

Page 83: Phylogenetic Inference

When do both methods fail?

• When there is insufficient phylogenetic signal...

[Figure: four-taxon tree with taxa 1, 2, 3 and 4]

Page 84: Phylogenetic Inference

When does parsimony work “better” than maximum likelihood?

Page 85: Phylogenetic Inference

When does parsimony work “better” than maximum likelihood?

• When you’re in the Inverse-Felsenstein (“Farris”) zone (Siddall, 1998)

[Figure: four-taxon tree with taxa A, B, C and D]

Page 86: Phylogenetic Inference

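Siddall (1998) parameter space

[Figure: parameter space with axes pa and pb, each running from 0 to 0.75, for a four-taxon tree whose branches are labeled a, a, b, b and b. Regions marked: both methods do poorly; parsimony has higher accuracy than likelihood; both methods do well]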

Page 87: Phylogenetic Inference

Parsimony vs. likelihood in the Inverse-Felsenstein Zone

[Figure: accuracy (0 to 1) versus sequence length (20 to 100,000) for parsimony and ML/JC; tree branches labeled 15%, 67.5% and 67.5% (expected differences/site)]

Page 88: Phylogenetic Inference

Why does parsimony do so well in the Inverse-Felsenstein zone?

[Figure: example site patterns (states A, C and G) mapped onto four-taxon trees, contrasting a true synapomorphy with apparent synapomorphies that are actually due to misinterpreted homoplasy]

Page 89: Phylogenetic Inference

[Figure: two panels plotted over parameters labeled x and y (also p and q), with axis ticks from 0 to 0.7 and from 0 to 1]

Panel 1: Proportion of parsimony-informative sites for which ancestral states are correctly reconstructed and interpreted as synapomorphies

Panel 2: Proportion of parsimony-informative sites that are interpreted as synapomorphies but are actually misinterpreted homoplasies

Page 90: Phylogenetic Inference

Parsimony vs. likelihood in the Felsenstein Zone

[Figure: accuracy (0 to 1) versus sequence length (20 to 100,000) for parsimony and ML/JC; tree branches labeled 15%, 67.5% and 67.5% (expected differences/site)]

Page 91: Phylogenetic Inference

From the Farris Zone to the Felsenstein Zone

[Figure: series of four-taxon trees with taxa A, B, C and D, grading from the Farris zone to the Felsenstein zone]

External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution

Page 92: Phylogenetic Inference

Simulation results:

[Figure: two panels, Parsimony and Likelihood (ML/JC), each plotting accuracy (0 to 1.0) versus length of internal branch (d), running from 0.05 in the Farris zone through 0 to 0.05 in the Felsenstein zone, with curves for 100 sites, 1,000 sites and 10,000 sites]

Page 93: Phylogenetic Inference

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?

Page 94: Phylogenetic Inference

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?

• Not necessarily (maximum likelihood is pretty robust)

Page 95: Phylogenetic Inference

Returning to earlier example...

Substitution rates:

     A   C   G   T
A    -   5   6   2
C    5   -   3   8
G    6   3   -   1
T    2   8   1   -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: four-taxon tree with taxa A, B, C and D; branch lengths shown are 0.1, 0.1, 0.1, 0.8 and 0.8]

Page 96: Phylogenetic Inference

Performance of ML when its model is violated (one example)

[Figure: vertical axis from 0 to 1 versus sequence length (100 to 10,000) for parsimony, ML-JC, ML-K2P, ML-HKY and ML-GTR]

Page 97: Phylogenetic Inference

Performance of ML when its model is violated (another example)

Modeling among-site rate variation with a gamma distribution...

[Figure: frequency (0 to 0.08) versus rate (0 to 2) for gamma distributions with shape parameter α = 0.5, 2, 50 and 200]

Page 98: Phylogenetic Inference

Performance of ML when its model is violated (another example)

Modeling among-site rate variation with a gamma distribution...

[Figure: frequency (0 to 0.08) versus rate (0 to 2) for gamma distributions with shape parameter α = 0.5, 2, 50 and 200]

…can also estimate a proportion of “invariable” sites (p_inv)
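To make the role of the gamma shape parameter concrete, the sketch below evaluates the density of a gamma distribution of relative rates. It assumes the common convention that the distribution is scaled to have mean 1 (shape α, scale 1/α); the slides do not state the scaling, and the function names are hypothetical. The numerical integration is only a sanity check on that assumption.

```python
import math


def gamma_rate_density(rate, alpha):
    """Density at `rate` of a gamma distribution with shape alpha and mean 1."""
    if rate <= 0:
        return 0.0
    # Work in log space so large alpha (e.g. 200) does not overflow.
    log_density = (alpha * math.log(alpha)
                   + (alpha - 1.0) * math.log(rate)
                   - alpha * rate
                   - math.lgamma(alpha))
    return math.exp(log_density)


def mean_rate(alpha, upper=20.0, steps=100_000):
    """Midpoint-rule estimate of the mean rate (should be close to 1)."""
    width = upper / steps
    total = 0.0
    for i in range(steps):
        rate = (i + 0.5) * width
        total += rate * gamma_rate_density(rate, alpha) * width
    return total


if __name__ == "__main__":
    for alpha in (0.5, 2, 50, 200):
        print(alpha,
              round(gamma_rate_density(1.0, alpha), 3),
              round(mean_rate(alpha), 4))
```

Small α spreads the rates widely, with many sites evolving very slowly and a few very fast, while large α concentrates the rates near 1, approaching the equal-rates case.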

Page 99: Phylogenetic Inference

Performance of ML when its model is violated (another example)

[Figure: grid of panels, one row per tree and one column per simulation condition (α = 0.5, p_inv = 0.5; α = 1.0, p_inv = 0.5; α = 1.0, p_inv = 0.2), each plotting proportion correct (0 to 1) versus sequence length (100 to 100,000) for GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer and parsimony]

Page 100: Phylogenetic Inference

“MODERATE”–Felsenstein zone

α = 1.0, p_inv = 0.5

[Figure: proportion correct (0 to 1) versus sequence length (100 to 100,000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G and parsimony]

Page 101: Phylogenetic Inference

“MODERATE”–Inverse-Felsenstein zone

[Figure: proportion correct (0 to 1) versus sequence length (100 to 100,000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G and parsimony]

Page 102: Phylogenetic Inference

“MODERATE”–Equal branch lengths

[Figure: proportion correct (0 to 1) versus sequence length (100 to 100,000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G and parsimony]

Page 103: Phylogenetic Inference

Performance of ML when its model is violated (another example)

[Figure: grid of panels, one row per tree and one column per simulation condition (α = 0.5, p_inv = 0.5; α = 1.0, p_inv = 0.5; α = 1.0, p_inv = 0.2), each plotting proportion correct (0 to 1) versus sequence length (100 to 100,000) for GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer and parsimony]

Page 104: Phylogenetic Inference

Extension to more taxa...

[Figure: proportion correct (0 to 1) versus sequence length (200 to 10,000) for HKY+I+Γ, HKY+Γ, HKY+I, HKYer and parsimony]

Page 105: Phylogenetic Inference

Distance methods

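[Figure: unrooted four-taxon tree in which A (branch v1) and B (branch v2) are joined by the internal branch v3 to C (branch v4) and D (branch v5)]

"Input" distance matrix:

     A     B     C     D
A    -     dAB   dAC   dAD
B    dBA   -     dBC   dBD
C    dCA   dCB   -     dCD
D    dDA   dDB   dDC   -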

Distances are "additive" if, e.g.:

pAB = v1 + v2 = dAB

pAC = v1 + v3 + v4 = dAC

pAD = v1 + v3 + v5 = dAD

pBC = v2 + v3 + v4 = dBC

pBD = v2 + v3 + v5 = dBD

pCD = v4 + v5 = dCD
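The additivity condition above can be checked mechanically: sum the branch lengths along the path between each pair of taxa and compare with the input distances. The sketch below hard-codes the paths for the four-taxon tree used here; the branch lengths in the usage example are made-up values for illustration, and the function names are hypothetical.

```python
# Branches on the path between each pair of taxa, read off the equations above.
PATHS = {
    ("A", "B"): ("v1", "v2"),
    ("A", "C"): ("v1", "v3", "v4"),
    ("A", "D"): ("v1", "v3", "v5"),
    ("B", "C"): ("v2", "v3", "v4"),
    ("B", "D"): ("v2", "v3", "v5"),
    ("C", "D"): ("v4", "v5"),
}


def path_lengths(branch_lengths):
    """Path-length distance p for every pair, given branch lengths v1..v5."""
    return {pair: sum(branch_lengths[b] for b in branches)
            for pair, branches in PATHS.items()}


def is_additive(distances, branch_lengths, tol=1e-9):
    """True if every input distance equals the corresponding path length."""
    p = path_lengths(branch_lengths)
    return all(abs(p[pair] - distances[pair]) <= tol for pair in PATHS)


if __name__ == "__main__":
    # Hypothetical branch lengths, chosen only to exercise the code.
    v = {"v1": 0.1, "v2": 0.2, "v3": 0.05, "v4": 0.3, "v5": 0.4}
    d = path_lengths(v)            # distances generated from the tree itself
    print(is_additive(d, v))       # additive by construction
    d[("A", "C")] += 0.02          # perturb one distance
    print(is_additive(d, v))       # no longer fits these branch lengths
```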

Page 106: Phylogenetic Inference

Distances in general will not be additive, so choose optimal tree according to one of the following criteria (objective functions):

"Goodness-of-fit": minimize $\sum_{i<j} w_{ij} \, |p_{ij} - d_{ij}|^{r}$

Typically, r = 2 (least-squares) and $w_{ij} = 1/d_{ij}^{2}$ ("Fitch-Margoliash" method)

"Minimum-evolution": minimize $\sum_{k=1}^{\#\text{branches}} v_k$ or $\sum_{k=1}^{\#\text{branches}} |v_k|$
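A minimal sketch of how the two objective functions could be evaluated for one candidate tree follows. It assumes the path lengths p and branch lengths v have already been fitted by some other routine (not shown), uses absolute differences in the goodness-of-fit sum, and takes all names and numbers as hypothetical.

```python
def goodness_of_fit(path_lengths, distances, r=2,
                    weight=lambda dij: 1.0 / dij ** 2):
    """Weighted sum over pairs of |p_ij - d_ij|**r.

    The defaults (r = 2, weight 1/d_ij**2) correspond to the
    Fitch-Margoliash weighting named on the slide.
    """
    return sum(weight(distances[pair])
               * abs(path_lengths[pair] - distances[pair]) ** r
               for pair in distances)


def minimum_evolution(branch_lengths, absolute=False):
    """Sum of branch lengths, optionally of their absolute values."""
    values = branch_lengths.values()
    return sum(abs(x) for x in values) if absolute else sum(values)


if __name__ == "__main__":
    # Made-up numbers for two pairs of taxa, for illustration only.
    d = {("A", "B"): 0.5, ("A", "C"): 1.0}
    p = {("A", "B"): 0.4, ("A", "C"): 1.2}
    print(goodness_of_fit(p, d))
    print(minimum_evolution({"v1": 0.3, "v2": -0.1, "v3": 0.2}))
    print(minimum_evolution({"v1": 0.3, "v2": -0.1, "v3": 0.2}, absolute=True))
```

The two variants of the minimum-evolution score differ only when a fitted branch length is negative, which unconstrained least-squares fitting can produce.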

Page 107: Phylogenetic Inference

Neighbor joining:

A fast approximation to full searching under the minimum-evolution criterion using star-decomposition with iteratively updated branch lengths

Uses the relationship:

dAX = (dAB + dAC - dBC)/2

(etc.)

[Figure: three-taxon star tree with tips A, B and C joined at internal node X]
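The relationship above recovers the length of each branch of a three-taxon star from the three pairwise distances. The sketch below applies it to all three tips; it illustrates this single formula only, not the full neighbor-joining algorithm, and the function name and numbers are hypothetical.

```python
def star_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths from tips A, B and C to the central node X.

    Each length follows the pattern dAX = (dAB + dAC - dBC) / 2.
    """
    d_ax = (d_ab + d_ac - d_bc) / 2.0
    d_bx = (d_ab + d_bc - d_ac) / 2.0
    d_cx = (d_ac + d_bc - d_ab) / 2.0
    return d_ax, d_bx, d_cx


if __name__ == "__main__":
    # Distances generated from branch lengths 1, 2 and 3:
    # dAB = 1 + 2, dAC = 1 + 3, dBC = 2 + 3.
    print(star_branch_lengths(3.0, 4.0, 5.0))
```

With exactly additive distances the formula returns the generating branch lengths; with real data it gives the estimates that neighbor joining updates as the star is decomposed.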

Page 108: Phylogenetic Inference

Bayesian Inference in Phylogenetics

• Uses Bayes formula:

$$\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \;\propto\; \Pr(D \mid \theta)\,\Pr(\theta) \;\propto\; L(\theta)\,\Pr(\theta)$$

• Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters
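For intuition, the normalization in Bayes' formula can be shown on a deliberately simplified case: a handful of candidate trees, each with a single fixed log-likelihood, so the integral over parameter values reduces to a sum over trees. That simplification, the log-likelihood values and all names are assumptions for illustration; real analyses must also integrate over branch lengths and model parameters.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior probability of each hypothesis from log-likelihoods and priors."""
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    largest = max(log_likelihoods.values())
    unnormalized = {h: math.exp(ll - largest) * priors[h]
                    for h, ll in log_likelihoods.items()}
    total = sum(unnormalized.values())  # plays the role of Pr(D), up to a constant
    return {h: u / total for h, u in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical log-likelihoods for the three unrooted four-taxon trees.
    log_l = {"((A,B),(C,D))": -100.0,
             "((A,C),(B,D))": -101.0,
             "((A,D),(B,C))": -103.0}
    flat_prior = {tree: 1.0 / 3.0 for tree in log_l}
    for tree, prob in posterior(log_l, flat_prior).items():
        print(tree, round(prob, 3))
```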

Page 109: Phylogenetic Inference

Bayesian Inference in Phylogenetics

• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
– Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
– E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state θ to another state θ* with probability min(r,1) where

$$r = \frac{\Pr(\theta^{*} \mid D)\,\Pr(\theta \mid \theta^{*})}{\Pr(\theta \mid D)\,\Pr(\theta^{*} \mid \theta)}$$

– Sample chain at regular intervals to approximate posterior distribution
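The acceptance rule above can be sketched on a toy problem. The example below is not a phylogenetic sampler: the "states" are just three labels with made-up unnormalized posterior weights, and the proposal picks one of the other states uniformly. It is meant only to show the accept/reject step and the regular sampling of the chain; all names and numbers are hypothetical.

```python
import random


def metropolis_hastings(weights, n_steps, sample_every=1, seed=0):
    """Sample state indices with probability proportional to `weights`.

    `weights` are unnormalized posterior values; the normalizing constant
    cancels in the acceptance ratio r, so it is never needed.
    """
    rng = random.Random(seed)
    n_states = len(weights)
    state = 0
    samples = []
    for step in range(1, n_steps + 1):
        # Propose one of the other states uniformly at random.
        proposal = rng.choice([s for s in range(n_states) if s != state])
        forward = 1.0 / (n_states - 1)    # Pr(proposal | state)
        backward = 1.0 / (n_states - 1)   # Pr(state | proposal)
        r = (weights[proposal] * backward) / (weights[state] * forward)
        if rng.random() < min(r, 1.0):
            state = proposal
        # Record the current state at regular intervals.
        if step % sample_every == 0:
            samples.append(state)
    return samples


if __name__ == "__main__":
    target = [1.0, 2.0, 7.0]  # made-up weights; normalized: 0.1, 0.2, 0.7
    draws = metropolis_hastings(target, n_steps=50_000, sample_every=5, seed=1)
    for s in range(len(target)):
        print(s, round(draws.count(s) / len(draws), 3))
```

Because this proposal is symmetric, the forward and backward proposal probabilities cancel and r reduces to the ratio of posterior weights; they are kept in the code to mirror the general formula.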

Page 110: Phylogenetic Inference

Bayesian Inference in Phylogenetics

• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
– Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
– E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state to another with probability min(r,1) where