
Phylogeny II: Parsimony, ML, SEMPHY

Phylogenetic Tree

Topology: bifurcating
Leaves: 1…N
Internal nodes: N+1…2N-2

[Figure: a bifurcating tree with a leaf, a branch, and an internal node labeled]

Character Based Methods

We start with a multiple alignment.

Assumptions:
All sequences are homologous
Each position in the alignment is homologous
Positions evolve independently
No gaps

We seek to explain the evolution of each position in the alignment

Parsimony

Character-based method
A way to score trees (but not to build trees!)

Assumptions:
Independence of characters (no interactions)
The best tree is the one in which the fewest changes take place

A Simple Example

What is the parsimony score of the following alignment?

Aardvark   A: CAGGTA
Bison      B: CAGACA
Chimp      C: CGGGTA
Dog        D: TGCACT
Elephant   E: TGCGTA

A Simple Example

Each column is scored separately. Let’s look at the first column:

The minimal tree has one evolutionary change. The first column reads C (A), C (B), C (C), T (D), T (E); a tree that separates {A, B, C} from {D, E} explains the column with a single C→T substitution on the internal branch.

Evaluating Parsimony Scores

How do we compute the Parsimony score for a given tree?

Traditional parsimony: each base change has a cost of 1
Weighted parsimony: each change from a to b is weighted by c(a,b)

Traditional Parsimony

Par(x1,…,xn ; T) = min over labelings of the internal nodes of Σ(u,v)∈T 1{xu ≠ xv}

Example (one column): the leaves are a, g, a; the candidate state sets at the two internal nodes are {a,g} and {a}, and a single change suffices.

•Solved independently for each position
•Linear time solution
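To make the linear-time solution concrete, here is a minimal sketch of the set-based procedure (Fitch's algorithm) for one column, assuming a rooted binary tree written as nested tuples with leaves given by their characters; the names and representation are illustrative, not from the slides.

```python
def fitch(tree):
    """Fitch's algorithm for traditional parsimony on a single column.

    `tree` is a rooted binary tree given as nested tuples, e.g. (("a", "g"), "a").
    Returns (candidate state set at this node, number of changes below it).
    """
    if not isinstance(tree, tuple):              # leaf: its state set is just {character}
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:                                   # children agree: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1   # disagreement costs one change


# The slide's example: leaves a, g, a -> internal sets {a, g} and {a}, one change
print(fitch((("a", "g"), "a")))                  # ({'a'}, 1)
```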

Evaluating Weighted Parsimony

Dynamic programming on the tree

S(i,a) = cost of tree rooted at i if i is labeled by a

Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞

Iteration: if k is a node with children i and j, then
S(k,a) = minb(S(i,b) + c(a,b)) + minb(S(j,b) + c(a,b))

Termination: the cost of the tree is mina S(r,a), where r is the root
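A minimal sketch of this dynamic program (weighted parsimony, in the spirit of Sankoff's algorithm), reusing the nested-tuple tree representation from the previous sketch; the cost function and names are illustrative.

```python
INF = float("inf")
ALPHABET = "ACGT"

def unit_cost(a, b):
    # Traditional parsimony is the special case c(a, b) = 1 for every change.
    return 0 if a == b else 1

def sankoff(tree, cost=unit_cost):
    """Return {a: S(root, a)}: cost of the subtree if its root is labeled a."""
    if not isinstance(tree, tuple):              # leaf labeled by a character
        return {a: (0 if a == tree else INF) for a in ALPHABET}
    left = sankoff(tree[0], cost)
    right = sankoff(tree[1], cost)
    return {a: min(left[b] + cost(a, b) for b in ALPHABET)
             + min(right[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET}

# One rooted version of the simple example's tree, first column: C C C | T T
tree = ((("C", "C"), "C"), ("T", "T"))
print(min(sankoff(tree).values()))               # 1 -- a single C->T change
```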

Cost of Evaluating Parsimony

The score is evaluated at each position independently; the scores are then summed over all positions.

If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk)

By keeping traceback information, we can reconstruct the most parsimonious values at each ancestral node.

Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

Species 1 - A G G G T A A C T G

Species 2 - A C G A T T A T T A

Species 3 - A T A A T T G T C T

Species 4 - A A T G T T G T C G

How many possible unrooted trees?

Maximum Parsimony

How many possible unrooted trees?

For 4 taxa there are 3 possible unrooted trees:

((1,2),(3,4))    ((1,3),(2,4))    ((1,4),(2,3))

Maximum Parsimony

How many substitutions?

Consider a column in which taxa 1 and 2 carry A and taxa 3 and 4 carry G, on the tree ((1,2),(3,4)). Labeling the two internal nodes A and G requires only 1 change; labeling them G and A requires 5 changes. The maximum-parsimony (MP) reconstruction is the labeling with the fewest changes.

Maximum Parsimony

Column 1 is A in all four species, so it contributes 0 changes on each of the three unrooted trees (running scores: 0, 0, 0).

Maximum Parsimony

Column 2 (G, C, T, A) has four different nucleotides, so it requires 3 changes on every tree (running scores: 0 3 on each tree).

Maximum Parsimony

Column 2 in detail: 1 = G, 2 = C, 3 = T, 4 = A. Since all four characters differ, each of the three trees requires 3 substitutions, regardless of how the ancestral nodes are labeled.

Maximum Parsimony

Column 3 (G, G, A, T) has three different nucleotides and requires 2 changes on each tree (running scores: 0 3 2 on each tree).

Maximum Parsimony

Column 4 (G, A, A, G) requires 2 changes on the trees ((1,2),(3,4)) and ((1,3),(2,4)), but only 1 change on ((1,4),(2,3)), which groups the two G's with each other and the two A's with each other (running scores: 0 3 2 2, 0 3 2 2, and 0 3 2 1).

Maximum Parsimony

Column 4 in detail: 1 = G, 2 = A, 3 = A, 4 = G. The trees ((1,2),(3,4)) and ((1,3),(2,4)) each require 2 substitutions; the tree ((1,4),(2,3)) requires only 1, on its internal branch.

Maximum Parsimony

Summing the per-column contributions over all ten positions gives the totals for the three trees:

0 3 2 2 0 1 1 1 1 3 = 14
0 3 2 1 0 1 2 1 2 3 = 15
0 3 2 2 0 1 2 1 2 3 = 16

Maximum Parsimony

1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G

Per-column scores 0 3 2 2 0 1 1 1 1 3, total 14: the most parsimonious of the three trees is ((1,2),(3,4)).

Searching for Trees

#Taxa   #Trees
3       1
4       3
5       15
10      2 x 10^6
50      3 x 10^74
100     2 x 10^182
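The counts in this table follow the double-factorial formula: the number of unrooted binary tree topologies on n labeled taxa is (2n−5)!!. A quick sketch reproducing the table (illustrative code, not from the slides):

```python
def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n labeled taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # product 3 * 5 * ... * (2n-5)
        count *= k
    return count

for n in (3, 4, 5, 10, 50, 100):
    print(n, num_unrooted_trees(n))    # 1, 3, 15, ~2e6, ~3e74, ~2e182
```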

Searching for the Optimal Tree

Exhaustive search: very intensive
Branch and bound: a compromise
Heuristic: fast; usually starts with NJ

Phylogenetic Tree Assumptions

Topology: bifurcating
Leaves: 1…N
Internal nodes: N+1…2N-2
Lengths t = {ti} for each branch

Phylogenetic tree = (Topology, Lengths) = (T, t)

[Figure: a tree with a leaf, a branch, and an internal node labeled]

Probabilistic Methods

The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.

Background probabilities: q(a)
Mutation probabilities: P(a|b, t)

Models for evolutionary mutations:
Jukes Cantor
Kimura 2-parameter model

Such models are used to derive the probabilities.

Jukes Cantor model

A model for mutation rates:

•Mutations occur at a constant rate
•Each nucleotide is equally likely to mutate into any other nucleotide, with rate α

Kimura 2-parameter model

Allows different rates for transitions and transversions.

Mutation Probabilities

The rate matrix R is used to derive the mutation probability matrix S. For a short time interval, S(t) ≈ I + Rt; S is obtained by integration. For Jukes Cantor:

P(a|a, t) = Saa(t) = 1/4 + (3/4) e^(-4αt)
P(g|a, t) = Sga(t) = 1/4 - (1/4) e^(-4αt)

q can be obtained by letting t go to infinity: q(a) = 1/4 for every nucleotide.
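A small sketch of these Jukes Cantor probabilities, with alpha standing for the mutation rate used above (illustrative code, not part of the original slides):

```python
import math

def jukes_cantor(a, b, t, alpha):
    # P(b | a, t) under Jukes-Cantor with per-nucleotide mutation rate alpha.
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

print(jukes_cantor("a", "g", 0.0, 1.0))   # 0.0 -- no time, no change
print(jukes_cantor("a", "a", 1e6, 1.0))   # ~0.25 -- as t -> infinity, q(a) = 1/4
```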

Mutation Probabilities

Both models satisfy the following properties:

Lack of memory:
P(c|a, t + t') = Σb P(b|a, t) P(c|b, t')

Reversibility: there exist stationary probabilities {Pa} such that
Pa P(b|a, t) = Pb P(a|b, t)

[Figure: the four nucleotides A, C, G, T with transitions between them]

Probabilistic Approach

Given q, P, the tree topology, and the branch lengths, we can compute the joint probability of all node assignments. For the example tree with leaves x1, x2, x3, internal nodes x4, x5, and branch lengths t1…t4:

P(x1, x2, x3, x4, x5 | T, t) = q(x5) p(x4|x5, t4) p(x3|x5, t3) p(x1|x4, t1) p(x2|x4, t2)

Computing the Tree Likelihood

We are interested in the probability of the observed data given the tree and branch “lengths”:

P(x1, x2, x3 | T, t) = Σ over x4, x5 of P(x1, x2, x3, x4, x5 | T, t)

Computed by summing over the internal nodes. This can be done efficiently using an upward traversal of the tree.

Tree Likelihood Computation

Define P(Lk|a) = probability of the leaves below node k, given that xk = a

Initialization: for leaves, P(Lk|a) = 1 if xk = a; 0 otherwise

Iteration: if k is a node with children i and j, then
P(Lk|a) = [ Σb P(b|a, ti) P(Li|b) ] · [ Σc P(c|a, tj) P(Lj|c) ]

Termination: the likelihood is
P(x1, x2, x3 | T, t) = Σa P(Lroot|a) q(a)
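A minimal sketch of this upward pass (Felsenstein's pruning algorithm) for a single column, reusing Jukes Cantor as the mutation model and a nested-tuple tree whose internal nodes pair each child with its branch length; all names are illustrative.

```python
import math

ALPHABET = "ACGT"

def p_mut(a, b, t, alpha=1.0):
    # Jukes-Cantor mutation probability P(b | a, t), as derived above.
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def cond_lik(node):
    """Return {a: P(L_k | a)} for the subtree below `node`.

    A leaf is its character; an internal node is ((child_i, t_i), (child_j, t_j)).
    """
    if not isinstance(node, tuple):              # leaf
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    (child_i, t_i), (child_j, t_j) = node
    li, lj = cond_lik(child_i), cond_lik(child_j)
    return {a: sum(p_mut(a, b, t_i) * li[b] for b in ALPHABET)
             * sum(p_mut(a, c, t_j) * lj[c] for c in ALPHABET)
            for a in ALPHABET}

def likelihood(root, q=0.25):
    # Termination: sum over the root state, weighted by the background q(a) = 1/4.
    return sum(p * q for p in cond_lik(root).values())

# The slide's three-leaf example: x4 joins x1, x2; the root x5 joins x4 and x3.
tree = (((("A", 0.1), ("A", 0.2)), 0.3), ("G", 0.4))
print(likelihood(tree))
```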

Maximum Likelihood (ML)

Score each tree by

P(X1,…,Xn | T, t) = Πm P(x1[m],…,xn[m] | T, t)

(assumption of independent positions)

Branch lengths t can be optimized: gradient ascent, EM
We look for the highest-scoring tree: exhaustive search, sampling methods (Metropolis)

Optimal Tree Search

Perform a search over the possible topologies T1, T2, T3, …, Tn. For each candidate topology, a parametric optimization (EM) of the branch lengths is carried out in parameter space; the search is prone to local maxima.

Computational Problem

Such procedures are computationally expensive!
Computing the optimal parameters for each candidate requires a non-trivial optimization step.
Non-negligible computation is spent on every candidate, even a low-scoring one.
In practice, such learning procedures can only consider small sets of candidate structures.

Structural EM

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline: Perform search in (T, t) space. Use EM-like iterations:

E-step: use current solution to compute expected sufficient statistics for all topologies

M-step: select new topology based on these expected sufficient statistics

The Complete-Data Scenario

Suppose we observe H, the ancestral sequences. Then the complete-data log-likelihood decomposes into contributions of individual branches:

l(T, t : D, H) = Σm log P(x1[m],…,x2N-2[m] | T, t) = Σ(i,j)∈T Fi,j(Si,j, ti,j) + const

where Si,j(a, b) = Σm 1{xi[m] = a, xj[m] = b}.

Si,j is a matrix of the number of co-occurrences of each pair (a, b) in the taxa i, j, and F is a linear function of Si,j.

Define: wi,j = max over ti,j of F(Si,j, ti,j)

Find: the topology T that maximizes Σ(i,j)∈T wi,j
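A small sketch of the complete-data statistics and branch weights, assuming the ancestral sequences are observed; the joint pair log-probability log p(a, b | t) is left as a user-supplied function, and the one-dimensional maximization over ti,j is approximated by a grid of candidate lengths (all names illustrative).

```python
import math
from collections import Counter

def cooccurrence_counts(seq_i, seq_j):
    # S_ij(a, b) = number of positions m with x_i[m] = a and x_j[m] = b.
    return Counter(zip(seq_i, seq_j))

def branch_weight(seq_i, seq_j, log_pair_prob, candidate_lengths):
    """w_ij = max over t of F(S_ij, t), with F(S_ij, t) = sum_ab S_ij(a,b) log p(a,b|t)."""
    s_ij = cooccurrence_counts(seq_i, seq_j)
    return max(sum(n * log_pair_prob(a, b, t) for (a, b), n in s_ij.items())
               for t in candidate_lengths)

def jc_log_pair(a, b, t, alpha=1.0):
    # Joint pair log-probability q(a) * P(b | a, t) under Jukes-Cantor.
    decay = math.exp(-4 * alpha * t)
    p = 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay
    return math.log(0.25 * p)

print(branch_weight("ACGTAC", "ACGTTT", jc_log_pair, [0.01, 0.1, 0.5, 1.0]))
```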

Expected Likelihood

Start with a tree (T0, t0). Compute the expected statistics:

E[Si,j(a, b)] = Σm P(Xi[m] = a, Xj[m] = b | D, T0, t0)

Formal justification. Define the expected complete-data log-likelihood:

Q(T, t) = E[ l(T, t : D, H) | D, T0, t0 ] = Σ(i,j)∈T Fi,j(E[Si,j], ti,j) + const

Theorem: l(T, t : D) − l(T0, t0 : D) ≥ Q(T, t) − Q(T0, t0)

Consequence: an improvement in the expected score implies an improvement in the likelihood.

Algorithm Outline

Original tree (T0, t0)

Compute: E[Si,j(a, b) | D, T0, t0]
Unlike standard EM for trees, we compute all possible pairwise statistics. Time: O(N^2 M)

Weights: wi,j = max over ti,j of F(E[Si,j], ti,j)
These pairwise weights are computed for every pair; this stage also computes the branch length for each pair (i, j).

Algorithm Outline

Compute: E[Si,j(a, b) | D, T0, t0]
Weights: wi,j = max over ti,j of F(E[Si,j], ti,j)
Find: T' = argmax over T of Σ(i,j)∈T wi,j

This is a maximum spanning tree problem, solved by a fast greedy procedure. By construction Q(T', t') ≥ Q(T0, t0), and thus l(T', t') ≥ l(T0, t0).
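The argmax over topologies is a maximum spanning tree over the pairwise weights wi,j, which a greedy procedure solves exactly; a minimal Kruskal-style sketch with illustrative names:

```python
def max_spanning_tree(n_nodes, weights):
    """Greedy (Kruskal) maximum spanning tree.

    `weights` maps node pairs (i, j) to w_ij; returns the chosen edges.
    """
    parent = list(range(n_nodes))

    def find(x):                        # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                    # adding the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy example: 4 taxa, weights favouring the path 0-1-2-3
print(max_spanning_tree(4, {(0, 1): 5.0, (1, 2): 4.0, (2, 3): 3.0,
                            (0, 2): 1.0, (1, 3): 0.5, (0, 3): 0.1}))
```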

Algorithm Outline

Compute: E[Si,j(a, b) | D, T0, t0]
Weights: wi,j = max over ti,j of F(E[Si,j], ti,j)
Find: T' = argmax over T of Σ(i,j)∈T wi,j

Construct a bifurcation T1 (“fix tree”): remove redundant nodes and add nodes to break up nodes of large degree. This operation preserves the likelihood: l(T1, t') = l(T', t') ≥ l(T0, t0).

Algorithm Outline

Compute: E[Si,j(a, b) | D, T0, t0]
Weights: wi,j = max over ti,j of F(E[Si,j], ti,j)
Find: T' = argmax over T of Σ(i,j)∈T wi,j
Construct a bifurcation T1

New tree: (T1, t1). Thm: l(T1, t1) ≥ l(T0, t0)

These steps are then repeated until convergence.

Assessing trees: the Bootstrap

Often we don’t trust the tree found as the “correct” one.

Bootstrapping:
Sample (with replacement) n positions from the alignment
Learn the best tree for each sample
Look for tree features which are frequent among these trees

For some models this procedure approximates the tree posterior P(T | X1,…,Xn).
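A minimal sketch of the column-resampling step; `build_tree` stands for any tree-construction routine (NJ, ML, etc.) and is a hypothetical callable, not something defined in the slides.

```python
import random

def bootstrap_trees(alignment, build_tree, n_replicates=100):
    """Resample alignment columns with replacement and rebuild a tree per sample.

    `alignment` is a list of equal-length sequences; `build_tree` is any
    tree-construction routine (hypothetical here).
    """
    n_cols = len(alignment[0])
    trees = []
    for _ in range(n_replicates):
        cols = [random.randrange(n_cols) for _ in range(n_cols)]   # sample with replacement
        sample = ["".join(seq[c] for c in cols) for seq in alignment]
        trees.append(build_tree(sample))
    return trees   # then count how often each split/feature appears across these trees
```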

Recommended