Tree Searching Methods

• Exhaustive search (exact)

• Branch-and-bound search (exact)

• Heuristic search methods (approximate)
  – Stepwise addition
  – Branch swapping
  – Star decomposition

Exhaustive Search

[Figure: fifteen candidate trees, each labeled with its length; the lengths range from 11 to 13.]

Searching for trees

• Generation of all possible trees

1. Generate all 3 trees for the first 4 taxa:

[Figure: the three possible unrooted trees for taxa A, B, C, and D.]

2. Generate all 15 trees for the first 5 taxa:

[Figure: taxon E attached to each of the five branches of one 4-taxon tree, giving five 5-taxon trees (likewise for each of the other two 4-taxon trees).]

3. Full search tree:

[Figure: the full search tree, leading from the initial 3-taxon tree (A, B, C) through the three 4-taxon trees to all fifteen 5-taxon trees.]
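The three generation steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the slides' own implementation: trees are nested tuples, the first taxon is held at the root as a fixed reference point, and the names `insertions`, `all_trees`, and `count_trees` are hypothetical.

```python
def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)  # attach on the branch leading into this node
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def all_trees(taxa):
    """Enumerate all unrooted binary trees by adding one taxon at a time."""
    first, second, third, *rest = taxa
    subtrees = [(second, third)]  # with `first` attached: the single 3-taxon tree
    for taxon in rest:
        subtrees = [new for old in subtrees for new in insertions(old, taxon)]
    return [(first, sub) for sub in subtrees]


def count_trees(n):
    """Number of unrooted binary trees for n >= 3 taxa: 3 * 5 * ... * (2n - 5)."""
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


if __name__ == "__main__":
    for n in (4, 5, 6, 10):
        print(n, count_trees(n))
    print(len(all_trees(list("ABCDE"))))
```

Each added taxon multiplies the number of trees by the number of branches available for attachment, so the count grows very quickly (already 2,027,025 trees for 10 taxa), which is why exhaustive search is practical only for small data sets.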

Branch and bound algorithm:

The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length (the length of the shortest complete tree found so far), this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. Circled numbers represent the order in which phylogenetic trees are visited in the search tree.

[Figure: the branch-and-bound search tree for the hypothetical data set, with a tree length at each node and circled visit-order numbers running from 1 to 34.]
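The branch-and-bound procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the algorithm as implemented in any particular program: tree length is computed with Fitch parsimony on unordered characters, trees are nested tuples with the first taxon held at the root, and the function names and the toy data in the usage note are hypothetical.

```python
def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def fitch(tree, states):
    """Return (state set, number of changes) for one site (Fitch parsimony)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    (lset, lcost), (rset, rcost) = fitch(tree[0], states), fitch(tree[1], states)
    common = lset & rset
    if common:
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1


def tree_length(tree, data):
    """Parsimony length of `tree`, summed over all sites in `data`."""
    n_sites = len(next(iter(data.values())))
    return sum(
        fitch(tree, {taxon: seq[j] for taxon, seq in data.items()})[1]
        for j in range(n_sites)
    )


def branch_and_bound(taxa, data):
    """Return (optimal length, list of all optimal trees)."""
    first, second, third, *rest = taxa
    best = {"length": float("inf"), "trees": []}

    def extend(sub, remaining):
        length = tree_length((first, sub), data)
        if length > best["length"]:
            return  # bound exceeded: adding taxa can never shorten the tree
        if not remaining:
            if length < best["length"]:  # the upper bound is reduced
                best["length"], best["trees"] = length, []
            best["trees"].append((first, sub))
            return
        for new_sub in insertions(sub, remaining[0]):
            extend(new_sub, remaining[1:])

    extend((second, third), rest)
    return best["length"], best["trees"]
```

A call such as `branch_and_bound(list(data), data)` with `data = {"A": "AACCT", "B": "AACGT", "C": "GGCGA", "D": "GGTCA", "E": "GGTCA"}` returns the same minimum length as scoring all 15 trees, because a partial tree is abandoned only when it is already longer than the best complete tree found so far.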

Stepwise Addition (in a nutshell)

[Figure: taxa 1, 2, and 3 joined in a 3-taxon tree, followed by the three trees obtained by attaching taxon 4 at each possible position.]


Searching for trees

Stepwise addition

A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.

[Figure: the stepwise-addition search tree, with a tree length at each node and circled evaluation-order numbers running from 1 to 16; the 233-step tree reached at the end is marked with an asterisk.]
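The greedy search described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: tree length is Fitch parsimony on unordered characters, trees are nested tuples with the first taxon held at the root, only one shortest tree is kept at each step (ties are broken by taking the first one found), and all function names are hypothetical.

```python
def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def fitch(tree, states):
    """Return (state set, number of changes) for one site (Fitch parsimony)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    (lset, lcost), (rset, rcost) = fitch(tree[0], states), fitch(tree[1], states)
    common = lset & rset
    if common:
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1


def tree_length(tree, data):
    """Parsimony length of `tree`, summed over all sites in `data`."""
    n_sites = len(next(iter(data.values())))
    return sum(
        fitch(tree, {taxon: seq[j] for taxon, seq in data.items()})[1]
        for j in range(n_sites)
    )


def stepwise_addition(taxa, data):
    """Greedy search: add taxa in the given order, keeping the best placement."""
    first, second, third, *rest = taxa
    sub = (second, third)
    for taxon in rest:
        # Try every branch of the current tree; keep only the shortest result.
        sub = min(insertions(sub, taxon),
                  key=lambda s: tree_length((first, s), data))
    tree = (first, sub)
    return tree_length(tree, data), tree
```

Because each step commits to the locally best placement, the search evaluates only 3 + 5 + 7 + … trees instead of all of them, and the final tree is not guaranteed to be a global optimum, as the 233-step result in the example illustrates.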

Stepwise Addition Variants

• As Is
  – add taxa in the order found in the matrix

• Closest
  – add the unplaced taxon that requires the smallest increase in tree length

• Furthest
  – add the unplaced taxon that requires the largest increase in tree length

• Simple
  – Farris’s (1970) “simple algorithm” uses a set of pairwise reference distances

• Random
  – a random permutation of the taxa is used to select the order

Branch swapping: Nearest Neighbor Interchange (NNI)

[Figure: a 5-taxon tree (A, B, C, D, E) and the alternative trees obtained by interchanging subtrees across an internal branch.]

Branch swapping: Subtree Pruning and Regrafting (SPR)

[Figure: a 7-taxon tree (A to G) from which a subtree is pruned and then regrafted onto another branch of the remaining tree.]

Branch swapping: Tree Bisection and Reconnection (TBR)

[Figure: a 7-taxon tree (A to G) bisected along one branch into two subtrees, which are then reconnected by joining a branch of one subtree to a branch of the other.]

Reconnection limits in TBR

[Figure: two examples of a 6-taxon tree cut at one branch, with each candidate reconnection labeled by its reconnection distance (0, 1, or 2).]

In PAUP*, use “ReconLim” to set the maximum reconnection distance.

Star-decomposition search

[Figure: star decomposition for five taxa in three steps (Step 1, Step 2, Step 3), starting from a star tree and joining one pair of lineages at each step.]

Overview of maximum likelihood as used in phylogenetics

• Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution

Likelihood(hypothesis) ∝ Prob(data | hypothesis)

Likelihood(tree, model) = k Prob(observed sequences | tree, model)

[not Prob(tree | data, model)]

Computing the likelihood of a single tree

     1       j       N
(1)  C…GGACA…C…GTTTA…C
(2)  C…AGACA…C…CTCTA…C
(3)  C…GGATA…A…GTTAA…C
(4)  C…GGATA…G…CCTAG…C

[Figure: tree with tips (1) to (4) showing states C, C, A, and G at site j, and internal nodes (5) and (6).]

Likelihood at site j = Prob(tips C, C, A, G; internal nodes A, A) + Prob(tips C, C, A, G; internal nodes A, C) + … + Prob(tips C, C, A, G; internal nodes T, T)

But use the Felsenstein (1981) pruning algorithm
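The site likelihood above can be computed either by summing over all 16 assignments of states to the two internal nodes or by the pruning algorithm, which passes conditional likelihoods up the tree. The sketch below compares the two. Its assumptions are not taken from the slide: tips (1) and (2) join at node (5), tips (3) and (4) join at node (6), every branch has the same length `v`, and the Jukes-Cantor model supplies the change probabilities. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(i, j, v):
    """Jukes-Cantor probability of i -> j on a branch of length v (subst./site)."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def brute_force(tips, v):
    """Sum over all 16 state assignments to internal nodes (5) and (6)."""
    t1, t2, t3, t4 = tips
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x5, t1, v) * jc_prob(x5, t2, v)
                  * jc_prob(x5, x6, v)
                  * jc_prob(x6, t3, v) * jc_prob(x6, t4, v))
    return total


def partial(node, v):
    """Conditional likelihood of the data below `node` for each state at `node`."""
    if isinstance(node, str):  # a tip: only the observed state is possible
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child in node:
        below = partial(child, v)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, v) * below[y] for y in BASES)
    return result


def pruning(tips, v):
    """Site likelihood by pruning, rooted at node (5) for convenience."""
    t1, t2, t3, t4 = tips
    root = (t1, t2, (t3, t4))  # node (5): tip 1, tip 2, and node (6)
    at_root = partial(root, v)
    return sum(0.25 * at_root[x] for x in BASES)


if __name__ == "__main__":
    print(brute_force("CCAG", 0.1), pruning("CCAG", 0.1))
```

Both functions give the same value for the pattern C, C, A, G. The brute-force sum grows as 4 raised to the number of internal nodes, whereas the pruning computation grows only linearly with the number of nodes, which is what makes likelihood evaluation feasible on large trees.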

$$L = L_1 L_2 \cdots L_N = \prod_{j=1}^{N} L_j$$

$$\ln L = \ln L_1 + \ln L_2 + \cdots + \ln L_N = \sum_{j=1}^{N} \ln L_j$$

Note: PAUP* reports -ln L, so lower -ln L implies higher likelihood

Finding the maximum-likelihood tree (in principle)

• Evaluate the likelihood of each possible tree for a given collection of taxa.

• Choose the tree topology that maximizes the likelihood over all possible trees.

Probability calculations require…

• An explicit model of substitution that specifies change probabilities for a given branch length

“Instantaneous rate matrix” (Jukes-Cantor; Kimura 2-parameter; Hasegawa-Kishino-Yano (HKY); Felsenstein 1981, 1984; General time-reversible):

$$Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}$$

$$P(\nu) = e^{Q\nu}$$

• An estimate of optimal branch lengths in units of expected amount of change (ν = rate × time)
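The relation P(ν) = e^{Qν} can be checked numerically for the simplest case. The sketch below is an illustration under stated assumptions: it uses a Jukes-Cantor rate matrix scaled so that ν is in expected substitutions per site, a plain-Python matrix exponential (Taylor series with scaling and squaring), and hypothetical function names. It is not how any particular program computes these probabilities.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, v, terms=20, squarings=10):
    """Approximate exp(Q v) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = v / 2 ** squarings
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jc_closed_form(same, v):
    """Closed-form Jukes-Cantor change probability for branch length v."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


# Jukes-Cantor: every off-diagonal rate is alpha; alpha = 1/3 gives a total
# rate of 1, so branch lengths are expected substitutions per site.
ALPHA = 1.0 / 3.0
Q_JC = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    p = mat_exp(Q_JC, 0.5)
    print(p[0][0], jc_closed_form(True, 0.5))
    print(p[0][1], jc_closed_form(False, 0.5))
```

For models with a closed-form solution the exponential need not be computed numerically; the numerical route matters for general rate matrices such as the GTR matrix above.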

For example (rows and columns ordered A, C, G, T; − marks the diagonal entry, which makes each row sum to zero):

Jukes-Cantor (1969)

$$Q = \begin{pmatrix}
- & \alpha & \alpha & \alpha \\
\alpha & - & \alpha & \alpha \\
\alpha & \alpha & - & \alpha \\
\alpha & \alpha & \alpha & -
\end{pmatrix}$$

Kimura (1980) “2-parameter”

$$Q = \begin{pmatrix}
- & \beta & \alpha & \beta \\
\beta & - & \beta & \alpha \\
\alpha & \beta & - & \beta \\
\beta & \alpha & \beta & -
\end{pmatrix}$$

Hasegawa-Kishino-Yano (1985)

$$Q = \begin{pmatrix}
- & \pi_C\beta & \pi_G\alpha & \pi_T\beta \\
\pi_A\beta & - & \pi_G\beta & \pi_T\alpha \\
\pi_A\alpha & \pi_C\beta & - & \pi_T\beta \\
\pi_A\beta & \pi_C\alpha & \pi_G\beta & -
\end{pmatrix}$$

General Time-Reversible

$$Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}$$

E.g., transition probabilities for HKY and F84:

$$P_{ij}(\nu) = \begin{cases}
\pi_j + \pi_j\left(\dfrac{1}{\Pi_j} - 1\right)e^{-\mu\nu} + \left(\dfrac{\Pi_j - \pi_j}{\Pi_j}\right)e^{-\mu\nu A} & (i = j) \\[2ex]
\pi_j + \pi_j\left(\dfrac{1}{\Pi_j} - 1\right)e^{-\mu\nu} - \left(\dfrac{\pi_j}{\Pi_j}\right)e^{-\mu\nu A} & (i \neq j,\ \text{transition}) \\[2ex]
\pi_j\left(1 - e^{-\mu\nu}\right) & (i \neq j,\ \text{transversion})
\end{cases}$$
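The three cases of the formula above translate directly into code. The slide does not define Π_j, A, or μ, so the sketch below fills them in as assumptions: Π_j is taken to be the combined frequency of the purines (A, G) or the pyrimidines (C, T), whichever class contains j; A is taken as 1 + Π_j(κ − 1), a form commonly given for HKY85, with κ a transition/transversion rate ratio; and μ is left as a free rate scale. The function name `hky_prob` is hypothetical.

```python
import math

PURINES = "AG"
PYRIMIDINES = "CT"


def hky_prob(i, j, v, pi, kappa, mu=1.0):
    """Probability of change from base i to base j over branch length v.

    pi    : dict of base frequencies, e.g. {"A": 0.1, "C": 0.2, ...}
    kappa : assumed transition/transversion rate ratio (HKY85 form of A)
    mu    : assumed overall rate scale
    """
    group = PURINES if j in PURINES else PYRIMIDINES
    big_pi = sum(pi[b] for b in group)           # Pi_j (assumed definition)
    a = 1.0 + big_pi * (kappa - 1.0)             # A (assumed HKY85 definition)
    e1 = math.exp(-mu * v)
    e2 = math.exp(-mu * v * a)
    if i == j:
        return (pi[j] + pi[j] * (1.0 / big_pi - 1.0) * e1
                + (big_pi - pi[j]) / big_pi * e2)
    if (i in PURINES) == (j in PURINES):         # transition
        return (pi[j] + pi[j] * (1.0 / big_pi - 1.0) * e1
                - pi[j] / big_pi * e2)
    return pi[j] * (1.0 - e1)                    # transversion


if __name__ == "__main__":
    freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    for i in "ACGT":
        print(i, [round(hky_prob(i, j, 0.3, freqs, 2.0), 4) for j in "ACGT"])
```

Two quick sanity checks follow from the formula itself: each row of probabilities sums to 1 for any branch length, and with equal base frequencies and κ = 1 the expression collapses to the Jukes-Cantor form 1/4 + 3/4 e^{−μν} for i = j.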

A Family of Reversible Substitution Models

Each restriction listed under a model turns it into the simpler model named after the colon:

• GTR (general time-reversible)
  – 3 substitution types (transversions, 2 transition classes): TrN (Tamura-Nei)
  – Equal base frequencies: SYM

• TrN
  – 2 substitution types (transitions vs. transversions): HKY85 (Hasegawa-Kishino-Yano), F84 (Felsenstein)

• SYM
  – 3 substitution types (transitions, 2 transversion classes): K3ST (Kimura 3-subst. type)

• HKY85, F84
  – Single substitution type: F81 (Felsenstein)
  – Equal base frequencies: K2P (Kimura 2-parameter)

• K3ST
  – 2 substitution types (transitions vs. transversions): K2P

• F81
  – Equal base frequencies: JC (Jukes-Cantor)

• K2P
  – Single substitution type: JC

The Relevance of Branch Lengths

[Figure: two trees, each shown with the tip states C C A A A A A A A A.]

When does maximum likelihood work better than parsimony?

• When you’re in the “Felsenstein Zone” (Felsenstein, 1978)

[Figure: four-taxon tree with taxa A, B, C, and D.]

In the Felsenstein Zone

Substitution rates:

      A   C   G   T
  A   -   5   6   2
  C   5   -   3   8
  G   6   3   -   1
  T   2   8   1   -

Base frequencies: A=0.1  C=0.2  G=0.3  T=0.4

[Figure: four-taxon tree (A, B, C, D) with two branches of length 0.8 and three branches of length 0.1.]

In the Felsenstein Zone

[Figure: proportion correct (0 to 1) versus sequence length (0 to 10,000) for parsimony and ML-GTR.]

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of taxa 1, 2, 3, and 4.]

Pattern  Taxon 1  Taxon 2  Taxon 3  Taxon 4  Type                      Changes required
I        A        A        A        A        Uninformative (constant)  zero, on any tree
II       A        A        A        G        Uninformative             one, on any tree
III      C        A        A        G        Uninformative             two, on any tree
IV       G        A        A        G        Misinformative            two, on the true tree

… but this tree needs only one step

[Figure: a tree that groups taxa 1 and 4 (both G) apart from taxa 2 and 3 (both A).]
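The step counts for the four pattern types can be reproduced with a small Fitch-parsimony count. This is a sketch under stated assumptions: the true tree is taken to group taxon 1 with 2 and taxon 3 with 4 (the two long-branch taxa, 1 and 4, are not sisters), and the function and variable names are hypothetical.

```python
def steps(tree, states):
    """Minimum number of changes for one site on a nested-tuple tree (Fitch)."""
    def down(node):
        if not isinstance(node, tuple):
            return {states[node]}, 0
        (a, cost_a), (b, cost_b) = down(node[0]), down(node[1])
        if a & b:
            return a & b, cost_a + cost_b
        return a | b, cost_a + cost_b + 1
    return down(tree)[1]


TRUE_TREE = ((1, 2), (3, 4))   # assumed true phylogeny
LBA_TREE = ((1, 4), (2, 3))    # tree that joins the two long branches

PATTERNS = {
    "I":   {1: "A", 2: "A", 3: "A", 4: "A"},
    "II":  {1: "A", 2: "A", 3: "A", 4: "G"},
    "III": {1: "C", 2: "A", 3: "A", 4: "G"},
    "IV":  {1: "G", 2: "A", 3: "A", 4: "G"},
}

if __name__ == "__main__":
    for name, states in PATTERNS.items():
        print(name, steps(TRUE_TREE, states), steps(LBA_TREE, states))
```

Only pattern IV distinguishes the two trees, and it favors the wrong one: two changes on the assumed true tree versus one on the tree joining taxa 1 and 4. When parallel changes on the two long branches make pattern IV common, parsimony is drawn toward the incorrect grouping.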

Concerns about statistical properties and suitability of models (assumptions)

Consistency

If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.

When do both methods fail?

• When there is insufficient phylogenetic signal...

[Figure: four-taxon tree with taxa 1, 2, 3, and 4.]

When does parsimony work “better” than maximum likelihood?

• When you’re in the Inverse-Felsenstein (“Farris”) zone (Siddall, 1998)

[Figure: four-taxon tree with taxa A, B, C, and D.]

Siddall (1998) parameter space

[Figure: parameter space with axes pa and pb, each running from 0 to 0.75, for a four-taxon tree with branches labeled a and b. Regions are labeled “Both methods do poorly”, “Parsimony has higher accuracy than likelihood”, and “Both methods do well”.]

Parsimony vs. likelihood in the Inverse-Felsenstein Zone

[Figure: accuracy (0 to 1) versus sequence length (20 to 100,000) for parsimony and ML/JC; branches of the tree are labeled 15% and 67.5% (expected differences/site).]

Why does parsimony do so well in the Inverse-Felsenstein zone?

[Figure: character states mapped onto four-taxon trees, contrasting a “True synapomorphy” with “Apparent synapomorphies actually due to misinterpreted homoplasy”.]

Parsimony vs. likelihood in the Felsenstein Zone

[Figure: accuracy (0 to 1) versus sequence length (20 to 100,000) for parsimony and ML/JC; branches of the tree are labeled 15% and 67.5% (expected differences/site).]

From the Farris Zone to the Felsenstein Zone

[Figure: a series of four-taxon trees (A, B, C, D) running from the Farris zone to the Felsenstein zone.]

External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution

Simulation results:

[Figure: two panels, Parsimony and Likelihood (ML/JC), each plotting accuracy (0 to 1.0) against the length of the internal branch d, from 0.05 in the Farris zone through 0 to 0.05 in the Felsenstein zone, for 100, 1,000, and 10,000 sites.]

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?

• Not necessarily (maximum likelihood is pretty robust)

Model used for simulation...

Substitution rates:

      A   C   G   T
  A   -   5   6   2
  C   5   -   3   8
  G   6   3   -   1
  T   2   8   1   -

Base frequencies: A=0.1  C=0.2  G=0.3  T=0.4

[Figure: four-taxon tree (A, B, C, D) with two branches of length 0.8 and three branches of length 0.1.]

Performance of ML when its model is violated (one example)

[Figure: plot over sequence lengths 100 to 10,000 (vertical axis 0 to 1) comparing parsimony, ML-JC, ML-K2P, ML-HKY, and ML-GTR.]

Among-site rate heterogeneity

• Proportion of invariable sites
  – Some sites don’t change due to strong functional or structural constraint (Hasegawa et al., 1985)

• Site-specific rates
  – Different relative rates assumed for pre-assigned subsets of sites

• Gamma-distributed rates
  – Rate variation assumed to follow a gamma distribution with shape parameter α

Lemur  AAGCTTCATAG  TTGCATCATCCA  …TTACATCATCCA
Homo   AAGCTTCACCG  TTGCATCATCCA  …TTACATCCTCAT
Pan    AAGCTTCACCG  TTACGCCATCCA  …TTACATCCTCAT
Goril  AAGCTTCACCG  TTACGCCATCCA  …CCCACGGACTTA
Pongo  AAGCTTCACCG  TTACGCCATCCT  …GCAACCACCCTC
Hylo   AAGCTTTACAG  TTACATTATCCG  …TGCAACCGTCCT
Maca   AAGCTTTTCCG  TTACATTATCCG  …CGCAACCATCCT

equal rates?

Performance of ML when its model is violated (another example)

Modeling among-site rate variation with a gamma distribution...

[Figure: frequency (0 to 0.08) versus rate (0 to 2) for gamma distributions with shape parameter α = 0.5, 2, 50, and 200.]

…can also estimate a proportion of “invariable” sites (pinv)
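The effect of the shape parameter can be explored with a short function for the gamma density. This sketch assumes the convention, common in phylogenetics but not stated on the slide, that the distribution is scaled to have mean 1 (scale = 1/α); the function name is hypothetical. The density is computed on the log scale so that large shape values such as 200 do not overflow.

```python
import math


def gamma_rate_density(r, alpha):
    """Density at rate r > 0 of a gamma distribution with shape alpha, mean 1."""
    log_density = (alpha * math.log(alpha) + (alpha - 1.0) * math.log(r)
                   - alpha * r - math.lgamma(alpha))
    return math.exp(log_density)


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 50.0, 200.0):
        values = [gamma_rate_density(r, alpha) for r in (0.1, 0.5, 1.0, 1.5)]
        print(alpha, [round(x, 4) for x in values])
```

Small shape values (α < 1) put most sites at rates near zero with a long tail of fast sites, which corresponds to strong rate heterogeneity; large values concentrate the density around rate 1, approaching the equal-rates case.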

Performance of ML when its model is violated (another example)

[Figure: grid of nine panels plotting proportion correct (0 to 1) against sequence length (100 to 100,000) for GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer, and parsimony. Columns correspond to α = 0.5, pinv = 0.5; α = 1.0, pinv = 0.5; and α = 1.0, pinv = 0.2. Rows correspond to different trees.]

“MODERATE”–Felsenstein zone

α = 1.0, pinv = 0.5

[Figure: plot over sequence lengths 100 to 100,000 (vertical axis 0 to 1) comparing JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony.]

“MODERATE”–Inverse-Felsenstein zone

[Figure: plot over sequence lengths 100 to 100,000 (vertical axis 0 to 1) comparing JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony.]

Bayesian Inference in Phylogenetics

• Uses Bayes formula:

$$\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \propto \Pr(D \mid \theta)\,\Pr(\theta) \propto L(\theta)\,\Pr(\theta)$$

(θ = tree topology, branch lengths, and substitution-model parameters)

• Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters

• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
  – Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
  – E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state θ to another state θ* with probability min(r, 1), where

$$r = \frac{\Pr(\theta^* \mid D)\,\Pr(\theta \mid \theta^*)}{\Pr(\theta \mid D)\,\Pr(\theta^* \mid \theta)}$$

  – Sample chain at regular intervals to approximate posterior distribution

• MrBayes (by John Huelsenbeck and Fredrik Ronquist) is the most popular Bayesian inference program

A brief intro to Markov chain Monte Carlo (MCMC)

1. Initialize the chain, e.g., by picking a random state X0 (topology, branch lengths, substitution-model parameters) from the assumed prior distribution

2. For each time t, sample a new candidate state Y from some proposal distribution q(.|Xt) (e.g., change branch lengths or topology plus branch lengths)

Calculate acceptance probability

$$\alpha(X, Y) = \min\left(1, \frac{\Pr(Y \mid D)\,q(X \mid Y)}{\Pr(X \mid D)\,q(Y \mid X)}\right) = \min\left(1, \frac{\pi(Y)}{\pi(X)} \times \frac{\Pr(D \mid Y)}{\Pr(D \mid X)} \times \frac{q(X \mid Y)}{q(Y \mid X)}\right)$$

3. If Y is accepted, let Xt+1 = Y; otherwise let Xt+1 = Xt

If the chain is run “long enough”, the stationary distribution of states in the chain will represent a good approximation to the target distribution (in this case, the Bayesian posterior)

[Figure: likelihood plotted against iterations, with the initial “burn in” phase marked, as the chain moves among the topologies AB|CD, AC|BD, and BC|AD.]
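The three steps above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the state space is reduced to the three four-taxon topologies, the unnormalized posterior weights (prior × likelihood) are invented, the proposal picks one of the other two topologies uniformly (so the q terms cancel), and the function name is hypothetical. Real phylogenetic MCMC also proposes branch lengths and model parameters.

```python
import random


def metropolis_hastings(weights, n_steps, burn_in, seed=1):
    """Sample topologies in proportion to their unnormalized posterior weights."""
    rng = random.Random(seed)
    states = list(weights)
    x = rng.choice(states)                      # step 1: initialize the chain
    counts = {s: 0 for s in states}
    for t in range(n_steps):
        # Step 2: propose a candidate from a symmetric proposal distribution.
        y = rng.choice([s for s in states if s != x])
        accept = min(1.0, weights[y] / weights[x])
        # Step 3: move to the candidate if accepted, otherwise stay put.
        if rng.random() < accept:
            x = y
        if t >= burn_in:                        # discard the burn-in samples
            counts[x] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical unnormalized posterior weights for the three topologies.
    toy_weights = {"AB|CD": 6.0, "AC|BD": 3.0, "AD|BC": 1.0}
    print(metropolis_hastings(toy_weights, n_steps=200_000, burn_in=1_000))
```

Only ratios of posterior values enter the acceptance probability, so the normalizing constant Pr(D) never has to be computed. With enough iterations the sampled frequencies approach the normalized weights (0.6, 0.3, and 0.1 for the toy values above).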

Model-based distances

• Can also calculate pairwise distances based on these models

• These distances estimate the number of substitutions per site that have accumulated since the two sequences shared a common ancestor, allowing for superimposed substitutions (“multiple hits”)

• E.g.:
  – Jukes-Cantor distance
  – Kimura 2-parameter distance
  – General maximum-likelihood distances available for other models
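The first two distances in the list have simple closed forms. The sketch below uses the standard formulas, which are not written out on the slide: the Jukes-Cantor distance is −(3/4) ln(1 − 4p/3) for a proportion p of differing sites, and the Kimura 2-parameter distance is −(1/2) ln(1 − 2P − Q) − (1/4) ln(1 − 2Q) for proportions P of transitions and Q of transversions. The function names and example sequences are hypothetical, and the sequences are assumed to be aligned, of equal length, and free of gaps.

```python
import math

PURINES = set("AG")


def jc_distance(s1, s2):
    """Jukes-Cantor distance between two aligned sequences."""
    p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance between two aligned sequences."""
    n = len(s1)
    pairs = list(zip(s1, s2))
    # Transitions: different bases within the same class (purine or pyrimidine).
    ts = sum(a != b and (a in PURINES) == (b in PURINES) for a, b in pairs) / n
    # Transversions: one purine and one pyrimidine.
    tv = sum((a in PURINES) != (b in PURINES) for a, b in pairs) / n
    return (-0.5 * math.log(1.0 - 2.0 * ts - tv)
            - 0.25 * math.log(1.0 - 2.0 * tv))


if __name__ == "__main__":
    x = "AAGCTTCATAGGAGCAACCA"
    y = "AAGCTTCACCGGCGCAGTCA"
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    print(p, jc_distance(x, y), k2p_distance(x, y))
```

Both corrected distances exceed the raw proportion of differing sites, and the gap widens as sequences diverge, reflecting the multiple hits that the raw count misses.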

Distance-based optimality criteria: “Additive trees”

[Figure: unrooted tree with taxa 1 and 2 on one side and taxa 3 and 4 on the other; terminal branches a (to 1), b (to 2), d (to 3), and e (to 4), joined by internal branch c.]

Observed distances:

      1     2     3     4
  1   -
  2   d12   -
  3   d13   d23   -
  4   d14   d24   d34   -

Path-length distances:

p12 = a+b
p13 = a+c+d
p14 = a+c+e
p23 = b+c+d
p24 = b+c+e
p34 = d+e

pij = dij for all i and j if the tree topology is correct and distances are additive
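The path-length definitions above are easy to check numerically. The sketch below builds the six path lengths from the five branch lengths and evaluates the four-point condition, a standard test for additivity that is not named on the slide: of the three pairwise sums, the two largest are equal, and they exceed the smallest by twice the internal branch length. The function names and the example branch lengths are hypothetical.

```python
def path_lengths(a, b, c, d, e):
    """Path-length distances on the tree ((1,2),(3,4)) with internal branch c."""
    return {
        (1, 2): a + b,
        (1, 3): a + c + d,
        (1, 4): a + c + e,
        (2, 3): b + c + d,
        (2, 4): b + c + e,
        (3, 4): d + e,
    }


def four_point_sums(p):
    """The three sums of distances over disjoint taxon pairs, in ascending order."""
    return sorted([
        p[1, 2] + p[3, 4],
        p[1, 3] + p[2, 4],
        p[1, 4] + p[2, 3],
    ])


if __name__ == "__main__":
    p = path_lengths(a=0.1, b=0.2, c=0.05, d=0.3, e=0.4)
    print(p)
    print(four_point_sums(p))
```

For distances that are exactly additive on this topology, the smallest sum is p12 + p34, identifying the split 12|34. Observed distances rarely satisfy the condition exactly, which is why the optimality criteria that follow measure how far a tree's path lengths are from the data.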

Distances in general will not be additive, so choose the optimal tree according to one of the following criteria (objective functions):

"Goodness-of-fit": minimize

$$\sum_{i<j} w_{ij}\,\lvert p_{ij} - d_{ij} \rvert^{r}$$

Typically, r = 2 (least-squares) and w_ij = 1/d_ij^2 ("Fitch-Margoliash" method)

"Minimum-evolution": minimize

$$\sum_{k=1}^{\#\text{branches}} v_k \quad \text{or} \quad \sum_{k=1}^{\#\text{branches}} \lvert v_k \rvert$$

Distance-based optimality criteria: Minimum evolution and least-squares

[Figure: tree for Pongo, Lemur catta, Pan, Homo sapiens, and Gorilla, with branch lengths 0.044, 0.085, 0.286, 0.015, 0.050, 0.045, and 0.050.]

Least-Squares

  dij      pij      SS
  0.39646  0.39021  0.000039
  0.39838  0.39602  0.000006
  0.09506  0.09507  0.000000
  0.37222  0.38084  0.000074
  0.11172  0.11011  0.000003
  0.11431  0.11592  0.000003
  0.37096  0.37096  0.000000
  0.18107  0.18894  0.000062
  0.19399  0.19475  0.000001
  0.18820  0.17958  0.000074
  Total             0.000261

Minimum evolution (ME)

  LS branch lengths
  0.28611
  0.04436
  0.01511
  0.04463
  0.05044
  0.05038
  0.08485
  Total  0.57588
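The two scores in this example can be recomputed from the listed values. The sketch below reads the first numeric column as the observed distances dij and the second as the path lengths pij; this reading is an inference, supported by the fact that the second-column values are sums of the listed branch lengths (for example 0.28611 + 0.04436 + 0.01511 + 0.04463 = 0.39021). The function name `goodness_of_fit` and the variable names are hypothetical.

```python
# Values transcribed from the example: observed distances, path lengths on the
# tree, and the least-squares branch lengths.
observed = [0.39646, 0.39838, 0.09506, 0.37222, 0.11172,
            0.11431, 0.37096, 0.18107, 0.19399, 0.18820]
path = [0.39021, 0.39602, 0.09507, 0.38084, 0.11011,
        0.11592, 0.37096, 0.18894, 0.19475, 0.17958]
branches = [0.28611, 0.04436, 0.01511, 0.04463, 0.05044, 0.05038, 0.08485]


def goodness_of_fit(d, p, r=2, weights=None):
    """Sum over pairs of w_ij * |p_ij - d_ij|^r (unweighted if weights is None)."""
    if weights is None:
        weights = [1.0] * len(d)
    return sum(w * abs(pij - dij) ** r for w, pij, dij in zip(weights, p, d))


ss_unweighted = goodness_of_fit(observed, path)
ss_fitch_margoliash = goodness_of_fit(
    observed, path, weights=[1.0 / dij ** 2 for dij in observed])
me_score = sum(branches)

if __name__ == "__main__":
    print(f"unweighted least squares: {ss_unweighted:.6f}")
    print(f"Fitch-Margoliash weighted: {ss_fitch_margoliash:.6f}")
    print(f"minimum evolution (sum of branch lengths): {me_score:.5f}")
```

The unweighted sum of squares and the sum of branch lengths agree with the totals listed in the example (0.000261 and 0.57588). The Fitch-Margoliash line shows how the same function applies the weighting w_ij = 1/d_ij^2, which down-weights the fit of large distances.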