Niklas Wahlberg University of Turku. Jarno Tuimala Free researcher / Finnish Tax Administration

Introduction to model based methods


Schedule

14.4. Tue  Introduction to models (Jarno)
16.4. Thu  Distance-based methods (Jarno)
17.4. Fri  ML analyses (Jarno)
20.4. Mon  Assessing hypotheses (Jarno)
21.4. Tue  Problems with molecular data (Jarno)
23.4. Thu  Problems with molecular data (Jarno), Phylogenomics
24.4. Fri  Search algorithms, visualization, and other computational aspects (Jarno)


DNA evolves through mutation

With >100 billion bases in GenBank, we are beginning to understand how DNA sequences evolve
Mitochondrial and nuclear genes differ in mutation dynamics
Different genes have their own mutation dynamics

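Hidden evolution in DNA sequences

[Figure: number of changes at a site between Seq 1 and Seq 2. A single observed difference can result from one change (C → A) or from several successive changes (C → G → T → A, 3 changes).]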

Ancestor  GGCGCG
Seq 1     AGCGAG
Seq 2     GCGGAC
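As a rough illustration of hidden evolution, the sketch below counts site differences among the three sequences above. The function name `count_differences` is made up for this example, and the counts between the ancestor and each descendant are only a lower bound on the real number of changes, since multiple hits at the same site are invisible.

```python
def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


ancestor = "GGCGCG"
seq1 = "AGCGAG"
seq2 = "GCGGAC"

# Changes that must have happened on each lineage since the ancestor.
changes_to_seq1 = count_differences(ancestor, seq1)
changes_to_seq2 = count_differences(ancestor, seq2)

# Differences we can actually observe between the two living sequences.
observed = count_differences(seq1, seq2)

print(changes_to_seq1, changes_to_seq2, observed)
```

The two lineages have accumulated at least 2 + 4 = 6 changes since the ancestor, but only 4 differences are observed between Seq 1 and Seq 2.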

[Figure: distance plotted against time. The evolutionary model provides the correction for the difference between the true and the observed distance.]

Modeling evolution

Models incorporate information about the rates at which each nucleotide is replaced by each alternative nucleotide
  ◦ For DNA this can be expressed as a 4 x 4 rate matrix (known as the Q matrix)
Other model parameters may include:
  ◦ Site-by-site rate variation, often modelled as a statistical distribution, for example a gamma distribution

Parameters we are interested in

The mean instantaneous substitution rate (= the general mutation rate + rate of fixation in population)
The relative rates of substitution between each pair of bases
The average frequencies of each base in the dataset
Branch lengths
Topology!

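A general model of sequence evolution

[Figure: the four bases with their frequencies, πA and πG (purines) and πC and πT (pyrimidines), connected by arrows labelled with the twelve relative substitution rates a to l. Changes within the purines or within the pyrimidines are transitions; changes between a purine and a pyrimidine are transversions.]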

Transition / transversion rate

If all substitutions were equally likely, the expected ratio (R) of transitions (P) to transversions (Q) would be about 0.5, because each base can undergo one transition and two transversions:
  ◦ Re = P / Q ~ 0.5
In reality, this is not the case, and the ratio is usually higher.
Some models of sequence evolution take this ratio into account, some don't.
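The expected ratio of 0.5 can be checked by enumerating all possible substitutions. The following is a minimal sketch; the helper names `is_transition` and `count_ts_tv` are assumptions made for this illustration, and the example pair is the Seq 1 / Seq 2 pair from the hidden-evolution slide.

```python
from itertools import permutations

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {x, y}
    return x != y and (pair <= PURINES or pair <= PYRIMIDINES)


def count_ts_tv(seq_a, seq_b):
    """Count transitional and transversional differences between aligned sequences."""
    ts = tv = 0
    for x, y in zip(seq_a, seq_b):
        # Skip identical sites and anything that is not an unambiguous base.
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        if is_transition(x, y):
            ts += 1
        else:
            tv += 1
    return ts, tv


# All 12 possible substitutions between different bases.
all_changes = list(permutations("ACGT", 2))
n_ts = sum(is_transition(x, y) for x, y in all_changes)
n_tv = len(all_changes) - n_ts
expected_ratio = n_ts / n_tv

print(n_ts, n_tv, expected_ratio)
print(count_ts_tv("AGCGAG", "GCGGAC"))
```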


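A general model of molecular evolution

Rows give the original base and columns the replacement base, both in the order A, C, G, T:

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$

μ = mean instantaneous substitution rate
πA = frequency of A
a, b, c, ... l = relative rate of substitution
The product of these three quantities is the rate parameter.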

Time-homogeneous, continuous-time, stationary Markov models

Rate of change from base i to base j is independent of the base that occupied a site prior to i (Markov property)
Substitution rate does not change over time (homogeneity)
Relative frequencies of A, G, C, and T are at equilibrium (stationarity)
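These three assumptions can be checked numerically on a small example. The sketch below is not from the slides: it approximates the transition probabilities P(t) = exp(Qt) with a scaled Taylor series in pure Python, and the rate matrix, the base frequencies `pi` and the exchange rates `s` are arbitrary illustrative values.

```python
def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series on Q t / 2**squarings,
    followed by repeated squaring of the result."""
    n = len(q)
    small = [[q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Illustrative values only: base frequencies and symmetric exchange rates.
pi = [0.1, 0.2, 0.3, 0.4]
s = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
q = [[s[i][j] * pi[j] for j in range(4)] for i in range(4)]
for i in range(4):
    q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero

p_half = transition_probabilities(q, 0.5)
for row in p_half:
    print([round(x, 4) for x in row])
```

With this construction each row of P(t) sums to 1, P(0.3) multiplied by P(0.2) gives P(0.5) (homogeneity together with the Markov property), and the base frequencies stay unchanged when propagated through P(t) (stationarity).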

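Jukes-Cantor model

The Jukes and Cantor model is the simplest model
The JC model is a one-parameter model
  1) it assumes that all bases are equally frequent (p = 0.25)
  2) unless modified it assumes all sites can change and that they do so at the same rate

Rows and columns in the order A, C, G, T:

$$
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
$$

• α = the rate of substitution (e.g. α changes from A to G per unit of time t)
• The rate of substitution for each nucleotide is 3α
• In t steps there will be 3αt changes

[Figure: the four bases A, G, C, T, with every pair of bases connected at the same rate.]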
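Under these assumptions the correction from observed to true distance has a closed form. The sketch below uses the standard Jukes-Cantor results, which are not spelled out on the slide: the probability that a site differs after time t is 3/4 (1 - exp(-4αt)), and the corrected distance is d = -3/4 ln(1 - 4p/3). The function names and the values of `alpha` and `t` are illustrative assumptions.

```python
import math


def jc_difference_probability(alpha, t):
    """Probability that a site shows a different base after time t under JC."""
    return 0.75 * (1.0 - math.exp(-4.0 * alpha * t))


def jc_distance(p):
    """Expected substitutions per site given the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


alpha, t = 0.1, 2.0                      # illustrative values
expected_changes = 3 * alpha * t         # 3αt substitutions per site
p_observed = jc_difference_probability(alpha, t)
d_corrected = jc_distance(p_observed)

print(expected_changes, p_observed, d_corrected)
```

The observed proportion of differences is smaller than the true number of substitutions per site, and the correction recovers 3αt.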

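Kimura model

[Figure: the four bases A, G, C, T; α = transitions, β = transversions.]

The Kimura model has 2 parameters
The K2P model is more realistic, but still
  1) it assumes that all bases are equally frequent (p = 0.25)
  2) unless modified it assumes all sites can change and that they do so at the same rate

Rows and columns in the order A, C, G, T; each diagonal entry (-) is minus the sum of the other entries in its row:

$$
Q = \begin{pmatrix}
- & \beta & \alpha & \beta \\
\beta & - & \beta & \alpha \\
\alpha & \beta & - & \beta \\
\beta & \alpha & \beta & -
\end{pmatrix}
$$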
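For a pair of sequences, the two-parameter model leads to a distance that treats transitions and transversions separately. The sketch below uses Kimura's (1980) standard formula, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), which is not given on the slide; `k2p_distance` and the example proportions are assumptions for illustration.

```python
import math


def k2p_distance(p_ts, q_tv):
    """Kimura two-parameter distance.

    p_ts: proportion of sites with a transitional difference (P)
    q_tv: proportion of sites with a transversional difference (Q)
    """
    w1 = 1.0 - 2.0 * p_ts - q_tv
    w2 = 1.0 - 2.0 * q_tv
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(w1) - 0.25 * math.log(w2)


# Illustrative proportions: 10% transitions, 5% transversions.
d = k2p_distance(0.10, 0.05)
print(round(d, 4))
```

The corrected distance is larger than the observed proportion of differences (P + Q = 0.15).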

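The Hasegawa-Kishino-Yano model

The HKY model takes into account variable base frequencies, but still
  1) unless modified it assumes all sites can change and that they do so at the same rate

Rows and columns in the order A, C, G, T; α = transitions, β = transversions; each diagonal entry (-) is minus the sum of the other entries in its row:

$$
Q = \begin{pmatrix}
- & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & - & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & - & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -
\end{pmatrix}
$$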

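The GTR model
The most general time-reversible model

[Figure: the four bases with frequencies πA, πG, πC, πT, connected by the six reversible relative rates a to f.]

Rows and columns in the order A, C, G, T:

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
\mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}
$$

μ = mean instantaneous substitution rate
πA = frequency of A
a, b, c, ... f = relative rate of substitution
The product of these three quantities is the rate parameter.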
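The GTR matrix is easy to assemble from its parameters. The following is a minimal sketch in which the function name `gtr_rate_matrix`, the dictionary layout and the numerical values are assumptions for illustration. The keys "AC", "AG", "AT", "CG", "CT", "GT" correspond to the relative rates a, b, c, d, e, f in the matrix above.

```python
BASES = "ACGT"


def gtr_rate_matrix(pi, rates, mu=1.0):
    """Build the 4 x 4 GTR rate matrix Q (rows: from, columns: to).

    pi:    dict of base frequencies, e.g. {"A": 0.25, ...}
    rates: dict of relative rates keyed by unordered base pair, e.g. "AG"
    mu:    mean instantaneous substitution rate
    """
    q = []
    for i in BASES:
        row = []
        for j in BASES:
            if i == j:
                row.append(0.0)
            else:
                pair = "".join(sorted(i + j))  # "GA" and "AG" share one rate
                row.append(mu * rates[pair] * pi[j])
        q.append(row)
    for k in range(4):
        q[k][k] = -sum(q[k])  # diagonal makes each row sum to zero
    return q


# Illustrative values: transitions (AG, CT) four times faster than transversions.
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
rates = {"AC": 1.0, "AG": 4.0, "AT": 1.0, "CG": 1.0, "CT": 4.0, "GT": 1.0}
q = gtr_rate_matrix(pi, rates)

for base, row in zip(BASES, q):
    print(base, [round(x, 3) for x in row])
```

Time reversibility shows up as πi qij = πj qji for every pair of bases. Setting all six rates equal and all frequencies to 0.25 gives back the Jukes-Cantor matrix.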

Almost all models used are special cases of one model:
  ◦ The general time-reversible model

The next three slides are from: https://code.google.com/p/jmodeltest2/wiki/TheoreticalBackground

The most commonly used models

ACAGGTGAGGCTCAGCCAATTTGAGCTTTGTCGATAGGT

Hypotheses tested are: F = base frequencies; S = substitution type; I = proportion of invariable sites; G = gamma rates.

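[Figure: relationships among the most commonly used models, arranged by base frequencies and number of substitution types.]

Substitution types    Equal base frequencies    Variable base frequencies
Single                JC                        F81
2                     K2P                       HKY85, F84
3                     K3ST                      TrN
6                     SYM                       GTR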



Models

Model parameters can be:
  ◦ estimated from the data (using a likelihood function)
  ◦ pre-set based upon assumptions about the data (for example that for all sequences all sites change at the same rate and all substitutions are equally likely, e.g. the Jukes and Cantor model)
  ◦ wherever possible avoid assumptions which are violated by the data because they can lead to incorrect trees

Models can be made more parameter rich to increase their realism

The most common additional parameters are:
  ◦ A correction for the proportion of sites which are invariable (parameter I)
  ◦ A correction for variable site rates at those sites which can change (parameter gamma, G)
All models can be supplemented with these parameters (e.g. GTR+I+G, HKY+I+G)

Invariable sites

A gamma distribution can be used to model site rate heterogeneity

α = shape parameter
Computational difficulties in using continuous distribution
  ◦ Gamma distribution computationally costly
  ◦ Most programs use discrete categories

[Figure: gamma distributions of site rates; x-axis: Rate, y-axis: Frequency.]
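To give a feel for what discrete categories look like, the sketch below approximates the rates of equally probable categories by simulation. This is a Monte Carlo stand-in chosen to keep the example free of external libraries, not the exact quantile-based calculation used by phylogenetics programs; the function name, the number of samples and the seed are assumptions.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_samples=100_000, seed=1):
    """Approximate the mean rate of each of n equally probable gamma categories.

    Rates are drawn from a gamma distribution with shape alpha and scale
    1/alpha (mean 1), sorted, cut into equal-sized bins and averaged.
    Samples left over after equal division are ignored.
    """
    rng = random.Random(seed)
    samples = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // n_categories
    means = [sum(samples[k * size:(k + 1) * size]) / size
             for k in range(n_categories)]
    overall = sum(means) / n_categories
    return [m / overall for m in means]  # rescale so the mean rate is exactly 1


# A small shape parameter gives strong rate heterogeneity among sites.
for shape in (0.5, 2.0):
    print(shape, [round(r, 3) for r in discrete_gamma_rates(shape)])
```

With a small α most sites evolve very slowly and a few evolve fast; with a larger α the category rates move closer to 1.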

Difficulties in estimating parameters

The parameters I and G covary!
  ◦ (I + G) can be estimated, but the values of I and G are not easily teased apart
  ◦ Parameter G takes I into account, so I is not needed
Usually though, a certain proportion of sites (estimated from the data) is assumed invariant, and the rest (the varying sites) are allowed to follow rates drawn from the discrete gamma distribution.

Models can be made more parameter rich to increase their realism

But the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates
  ◦ One might have a realistic model but large sampling errors
  ◦ Realism comes at a cost in time and precision!
  ◦ Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
  ◦ In general use the simplest model which fits the data

Choosing your model

When models are nested
  ◦ Likelihood ratio test (LRT)
  ◦ Test statistic: -2*ln(likelihood for model 1 / likelihood for model 2), where model 1 is the simpler model and model 2 the more parameter-rich model
  ◦ Compared to a chi-square distribution with degrees of freedom equal to the difference in the number of free parameters between the two models
When models are not nested
  ◦ Akaike Information Criterion (AIC)
    2k-2ln(likelihood), where k is the number of parameters estimated in the model
    The best model has the lowest AIC
  ◦ Bayesian Information Criterion (BIC)
    Similar to AIC




Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not “too wrong”.

Thus one can obtain a tree using a quick method, such as neighbor-joining, and then estimate parameters on that tree.

These parameters can then be used to calculate the likelihood of the tree.

When the likelihood of the tree has been calculated under all the models to be compared, the best-fitting model can be selected: the one favoured by the likelihood ratio test, or the one with the lowest AIC value.

The final tree is then estimated using this model.

Estimation of likelihood of substitution model parameters

Need to know the likelihood of a model

For both tests, one needs to compute the likelihood of the trees under the models.
For now, assume we know the likelihood of the models we want to compare.

Likelihood ratio test (LRT)

LR = 2*(lnL1-lnL0)
  ◦ lnL1: log-likelihood under the alternative hypothesis (the more parameter-rich model)
  ◦ lnL0: log-likelihood under the null hypothesis (the less parameter-rich model)
LRT statistic approximately follows a chi-square distribution
Degrees of freedom equal to the number of extra parameters in the more complex model

Example

HKY85  -lnL = 1787.08
GTR    -lnL = 1784.82

Then, LR = 2 (1787.08 - 1784.82) = 4.52
degrees of freedom = 4 (GTR adds 4 additional parameters to HKY85)
critical value (P = 0.05) = 9.49

GTR does not fit significantly better!
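The same comparison can be reproduced in a few lines. The sketch below is only an illustration: `chi2_sf` is a hand-written series approximation of the chi-square upper-tail probability, written to avoid external libraries and adequate for moderate values of the statistic, and both function names are assumptions.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of
    freedom, via the series for the regularized lower incomplete gamma function."""
    if x <= 0.0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - lower


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return the LR statistic 2*(lnL1 - lnL0) and its chi-square p-value."""
    lr = 2.0 * (lnl_alt - lnl_null)
    return lr, chi2_sf(lr, extra_params)


# Values from the HKY85 versus GTR example (note the slide reports -lnL).
lr, p_value = likelihood_ratio_test(lnl_null=-1787.08, lnl_alt=-1784.82, extra_params=4)
print(round(lr, 2), round(p_value, 3))
```

The p-value is well above 0.05, in line with the conclusion that GTR does not fit significantly better than HKY85.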

Akaike Information Criterion

A measure of the goodness of fit of a model
  ◦ information lost when model M is used to approximate the process of molecular evolution
  ◦ AIC is an estimate of the expected relative distance between a fitted model, M, and the unknown true mechanism that generated the data
AIC(M) = -2*Log(Likelihood(M)) + 2*K(M)
  ◦ K(M) is the number of estimable parameters of model M
Given a dataset, models can be ranked according to their AIC
The model with the lowest AIC is selected

Bayesian Information Criterion

BIC also takes into account the sample size n
BIC(M) = -2*Log(Likelihood(M)) + K(M)*Log(n)
  ◦ K(M) is the number of estimable parameters of model M and n is the number of characters
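Both criteria are simple to compute once the log-likelihoods are known. The sketch below reuses the HKY85 and GTR log-likelihoods from the LRT example in these slides. Two things are assumptions made for illustration: the parameter counts include only substitution-model parameters (branch lengths are the same for both models and would not change the ranking), and the alignment length `n_characters` is hypothetical.

```python
import math


def aic(lnl, k):
    """Akaike Information Criterion: -2 lnL + 2K."""
    return -2.0 * lnl + 2.0 * k


def bic(lnl, k, n):
    """Bayesian Information Criterion: -2 lnL + K ln(n)."""
    return -2.0 * lnl + k * math.log(n)


# model name -> (log-likelihood, number of substitution-model parameters)
models = {
    "HKY85": (-1787.08, 4),  # ts/tv ratio + 3 free base frequencies (assumed count)
    "GTR": (-1784.82, 8),    # 5 free relative rates + 3 free base frequencies
}
n_characters = 1000  # hypothetical alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
bic_scores = {name: bic(lnl, k, n_characters) for name, (lnl, k) in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(aic_scores, best_by_aic)
print(bic_scores, best_by_bic)
```

Under these assumptions both criteria prefer HKY85, and BIC penalises the extra GTR parameters more heavily than AIC does.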

Output of a model testing program

Kelchner & Thomas 2007, TREE 22:87-94

Latest of the latest!

Model jumping
  ◦ Allow the data to determine which model is the best during the analysis
Only available in MrBayes 3.2

[Figure: models JC, K2P, GTR.]

Partitioned models

A priori separation of characters into different partitions
Each partition analyzed with a different model
In addition to allowing heterogeneity across data subsets in overall rate and in substitution model parameters, several programs also allow the user to unlink topology and branch lengths
“Different data subsets can thus have independent branch lengths or even different topologies.” (Ronquist and Huelsenbeck, 2003:1573)

Protein models

20 amino acids
Models are based largely on empirical aa replacement matrices
Examples: JTT, WAG, MtREV, Blosum62

Models have parameters

Parameters include topology and branch lengths!
How to estimate values for those parameters?
  ◦ Distance methods
  ◦ Maximum likelihood methods
  ◦ Bayesian methods

Optimality Criteria

Objective function (score) that quantifies how well the data fit a tree
Used to evaluate and rank alternative trees
Two logical steps for phylogenetic methods that rely on optimality criteria
  ◦ Definition of optimality criterion
  ◦ Maximization (or minimization) of criterion for alternative trees for their evaluation and ranking
