Introduction to model based methods
Niklas Wahlberg, University of Turku
Jarno Tuimala, Free researcher / Finnish Tax Administration
14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon Assessing hypotheses (Jarno)
21.4. Tue Problems with molecular data (Jarno)
23.4. Thu Problems with molecular data (Jarno) Phylogenomics
24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)
Schedule
With >100 billion bases in GenBank, we are beginning to understand how DNA sequences evolve
Mitochondrial and nuclear genes differ in mutation dynamics
Different genes have their own mutation dynamics
DNA evolves through mutation
[Figure: Seq 1 and Seq 2 with substitutions among C, A, G and T; the number of changes is counted.]
Hidden evolution in DNA sequences
Ancestor GGCGCG
Seq 1 AGCGAG
Seq 2 GCGGAC
[Figure: distance plotted against time. The evolutionary model provides a correction for the difference between the true and the observed distance.]
Models incorporate information about the rates at which each nucleotide is replaced by each alternative nucleotide
◦ For DNA this can be expressed as a 4 x 4 rate matrix (known as the Q matrix)
Other model parameters may include:
◦ Site-by-site rate variation, often modelled as a statistical distribution, for example a gamma distribution
Modeling evolution
The mean instantaneous substitution rate (=the general mutation rate + rate of fixation in population)
The relative rates of substitution between each base pair
The average frequencies of each base in the dataset
Branch lengths
Topology!
Parameters we are interested in
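[Figure: the four bases (purines A and G, pyrimidines C and T) with frequencies πA, πG, πC and πT, connected by twelve directed substitution rates a to l. Changes within the purines or within the pyrimidines are transitions; changes between a purine and a pyrimidine are transversions.]
A general model of sequence evolution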
If all substitutions were equally likely, the expected ratio (R) of transitions (P) to transversions (Q) would be 0.5, because each base has one possible transition and two possible transversions:
◦ Re = P / Q = 0.5
In reality, this is not the case, and the ratio is usually higher.
Some models of sequence evolution take this ratio into account, some don't.
Transition / transversion rate
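The expected ratio of 0.5 can be checked by enumerating the twelve possible changes, and the same classification can be applied to a pair of aligned sequences. The following is a minimal sketch; the function names are illustrative, and sites with gaps or ambiguity codes are simply skipped, which is an assumption rather than a rule stated on the slides.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return {x, y} <= PURINES or {x, y} <= PYRIMIDINES


def count_ts_tv(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical site, gap, or ambiguity code (assumption: skip)
        if is_transition(x, y):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# All 12 ordered changes between different bases, each equally likely.
changes = list(permutations("ACGT", 2))
n_ts = sum(is_transition(x, y) for x, y in changes)
n_tv = len(changes) - n_ts
expected_ratio = n_ts / n_tv

# Seq 1 and Seq 2 from the "Hidden evolution in DNA sequences" slide.
observed = count_ts_tv("AGCGAG", "GCGGAC")
```

With only six sites the observed counts say little about the true ratio; real alignments are needed for a meaningful estimate.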
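\[
Q =
\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
C & \mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
G & \mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
T & \mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{array}
\]
μ = mean instantaneous substitution rate
πA = frequency of A
a, b, c, ... l = relative rates of substitution
The product of the three (for example μaπC) is the rate parameter
A general model of molecular evolution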
Rate of change from base i to base j is independent of the base that occupied a site prior to i (Markov property)
Substitution rate does not change over time (homogeneity)
Relative frequencies of A, G, C, and T are at equilibrium (stationarity)
Time-homogeneous time-continuous stationary Markov models
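Under these assumptions, the probabilities of change along a branch of length t are given by the matrix exponential P(t) = exp(Qt). The sketch below approximates it with a Taylor series plus scaling and squaring, using only the standard library. The function names, the number of series terms and the equal-rates example matrix are assumptions chosen for illustration; phylogenetics programs typically use an eigendecomposition instead.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, terms=20, squarings=10):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    step = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes step^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # exp(A)^2 = exp(2A)
    return result


# Hypothetical example: every substitution occurs at the same rate.
rate = 0.1
equal_rates_q = [[-3 * rate if i == j else rate for j in range(4)]
                 for i in range(4)]
p = transition_probabilities(equal_rates_q, t=2.0)
```

Each row of P(t) sums to 1, and for large t every row approaches the equilibrium base frequencies (stationarity).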
The Jukes and Cantor model is the simplest model
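The JC model is a one-parameter model
1) it assumes that all bases are equally frequent (p=0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate
\[
Q =
\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & -3\alpha & \alpha & \alpha & \alpha \\
C & \alpha & -3\alpha & \alpha & \alpha \\
G & \alpha & \alpha & -3\alpha & \alpha \\
T & \alpha & \alpha & \alpha & -3\alpha
\end{array}
\]
• α = the rate of substitution (α changes from A to G every t)
• The rate of substitution for each nucleotide is 3α
• In t steps there will be 3αt changes
[Figure: the four bases A, G, T and C, with every substitution occurring at the same rate α.]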
Jukes-Cantor model
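The Jukes-Cantor model gives the standard correction for hidden changes: if p is the observed proportion of differing sites, the estimated number of substitutions per site is d = -3/4 ln(1 - 4p/3). A minimal sketch, applied to Seq 1 and Seq 2 from the "Hidden evolution in DNA sequences" slide; the function names are illustrative.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return diffs / len(seq1)


def jc_distance(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("the JC correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p_obs = p_distance("AGCGAG", "GCGGAC")  # 4 of 6 sites differ
d_jc = jc_distance(p_obs)               # larger than p_obs: hidden changes
```

The corrected distance always exceeds the observed one for p > 0, and the gap widens as p approaches the saturation value of 0.75.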
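[Figure: the four bases A, G, T and C, with transitions and transversions occurring at different rates.]
Kimura model
α = transitions, β = transversions
The Kimura model has 2 parameters
The K2P model is more realistic, but still
1) it assumes that all bases are equally frequent (p=0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate
\[
Q =
\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & - & \beta & \alpha & \beta \\
C & \beta & - & \beta & \alpha \\
G & \alpha & \beta & - & \beta \\
T & \beta & \alpha & \beta & -
\end{array}
\]
Diagonal entries (-) are set so that each row sums to zero.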
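For the two-parameter model, the standard distance correction treats the two kinds of difference separately: with P the observed proportion of transitional differences and Q the proportion of transversional differences, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). A minimal sketch; the function name and the input values are hypothetical.

```python
import math


def k2p_distance(p_ts, q_tv):
    """Kimura two-parameter distance from the observed proportions of
    transitional (p_ts) and transversional (q_tv) differences."""
    term1 = 1 - 2 * p_ts - q_tv
    term2 = 1 - 2 * q_tv
    if term1 <= 0 or term2 <= 0:
        raise ValueError("distance undefined: sequences are too divergent")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Hypothetical proportions: 10% transitional, 5% transversional differences.
d_k2p = k2p_distance(0.10, 0.05)
```

When transversional differences are exactly twice as common as transitional ones (Q = 2P), the formula reduces to the Jukes-Cantor correction with p = P + Q.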
The Hasegawa-Kishino-Yano model
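The HKY model takes into account variable base frequencies, but still
1) unless modified it assumes all sites can change and that they do so at the same rate
\[
Q =
\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & - & \pi_C & \pi_G\alpha & \pi_T \\
C & \pi_A & - & \pi_G & \pi_T\alpha \\
G & \pi_A\alpha & \pi_C & - & \pi_T \\
T & \pi_A & \pi_C\alpha & \pi_G & -
\end{array}
\]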
[Figure: the four bases with frequencies πA, πG, πC and πT, connected by six reversible substitution rates a to f.]
The GTR model
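\[
Q =
\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
C & \mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
G & \mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
T & \mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{array}
\]
The most general time-reversible model
μ = mean instantaneous substitution rate
πA = frequency of A
a, b, c, ... f = relative rates of substitution
The product of the three (for example μaπC) is the rate parameter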
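The GTR matrix can be assembled directly from its definition: each off-diagonal entry is μ times the relative rate of the base pair times the frequency of the target base, and each diagonal entry makes its row sum to zero. The sketch below builds Q and checks time reversibility (πi qij = πj qji). The function names and all numeric values are hypothetical.

```python
BASES = "ACGT"


def gtr_rate_matrix(rates, freqs, mu=1.0):
    """Build the 4 x 4 GTR rate matrix Q (rows and columns ordered A, C, G, T).

    rates: relative rates keyed by unordered base pair, e.g. "AC" for a.
    freqs: equilibrium base frequencies keyed by base.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                pair = "".join(sorted(x + y))
                q[i][j] = mu * rates[pair] * freqs[y]
        q[i][i] = -sum(q[i])  # diagonal makes the row sum to zero
    return q


def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all pairs."""
    return all(
        abs(freqs[x] * q[i][j] - freqs[y] * q[j][i]) < tol
        for i, x in enumerate(BASES)
        for j, y in enumerate(BASES)
    )


# Hypothetical values: a = AC, b = AG, c = AT, d = CG, e = CT, f = GT.
rates = {"AC": 1.0, "AG": 4.0, "AT": 1.0, "CG": 1.0, "CT": 4.0, "GT": 1.0}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = gtr_rate_matrix(rates, freqs)
```

The example values use one rate for both transitions (AG, CT) and another for all transversions, which is the HKY-type restriction; setting all six rates to 1 and all frequencies to 0.25 gives a Jukes-Cantor-type matrix.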
Almost all models used are special cases of one model:
◦ The general time reversible model
The next three slides are from: https://code.google.com/p/jmodeltest2/wiki/TheoreticalBackground
The most commonly used models
Hypotheses tested are: F = base frequencies; S = substitution type; I = proportion of invariable sites; G = gamma rates.
Model        Base frequencies   Substitution types
JC           Equal              Single
F81          Variable           Single
K2P          Equal              2
HKY85, F84   Variable           2
K3ST         Equal              3
TrN          Variable           3
SYM          Equal              6
GTR          Variable           6
Model parameters can be:
◦ estimated from the data (using a likelihood function)
◦ pre-set based upon assumptions about the data (for example that for all sequences all sites change at the same rate and all substitutions are equally likely, e.g. the Jukes and Cantor model)
◦ wherever possible avoid assumptions which are violated by the data because they can lead to incorrect trees
Models
The most common additional parameters are:
◦ A correction for the proportion of sites which are invariable (parameter I)
◦ A correction for variable site rates at those sites which can change (parameter gamma, G)
All models can be supplemented with these parameters (e.g. GTR+I+G, HKY+I+G)
Models can be made more parameter rich to increase their realism
Invariable sites
A gamma distribution can be used to model site rate heterogeneity
α = shape parameter
Computational difficulties in using continuous distribution
Most programs use discrete categories
Gamma distribution computationally costly
[Figure: gamma distributions of site rates; x-axis: rate, y-axis: frequency.]
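The idea of discrete rate categories can be illustrated by sampling: draw many rates from a gamma distribution with shape α and mean 1, split the sorted draws into equally probable categories, and use the mean of each category as its rate. This Monte Carlo sketch is only an illustration under those assumptions; programs normally compute the category rates analytically, and the function name, sample size and seed are arbitrary choices.

```python
import random


def discrete_gamma_rates(alpha, categories=4, samples=100_000, seed=1):
    """Approximate mean rates of equally probable gamma rate categories."""
    rng = random.Random(seed)
    # Shape alpha with scale 1/alpha gives a distribution with mean 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(samples))
    size = samples // categories
    rates = [sum(draws[k * size:(k + 1) * size]) / size
             for k in range(categories)]
    # Rescale so that the average rate over categories is exactly 1.
    mean = sum(rates) / categories
    return [r / mean for r in rates]


# A small shape parameter means strong rate heterogeneity among sites.
rates = discrete_gamma_rates(alpha=0.5)
```

With a small α most categories have rates well below 1 and one category has a much higher rate; with a large α all category rates move towards 1.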
The parameters I and G covary!
◦ (I + G) can be estimated, but the values of I and G are not easily teased apart
◦ Parameter G takes I into account, I not needed
Usually though, a certain proportion of sites (estimated from the data) is assumed invariant, and the rest (the varying sites) are allowed to follow rates drawn from the discrete gamma distribution.
Difficulties in estimating parameters
But the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates
◦ One might have a realistic model but large sampling errors
◦ Realism comes at a cost in time and precision!
◦ Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
◦ In general use the simplest model which fits the data
Models can be made more parameter rich to increase their realism
When models are nested
◦ Likelihood ratio test (LRT)
◦ Test statistic: -2*ln(likelihood for model 1 / likelihood for model 2), with model 1 nested in model 2
◦ Compared to a chi-square distribution with degrees of freedom equal to the difference in the number of free parameters
When models are not nested
◦ Akaike Information Criterion (AIC): 2k-2ln(likelihood), where k is the number of parameters estimated in the model
◦ The best model has the lowest AIC
◦ Bayesian Information Criterion (BIC): similar to AIC
Choosing your model
Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not “too wrong”.
Thus one can obtain a tree using a quick method, such as neighbor-joining, and then estimate parameters on that tree.
These parameters can then be used to calculate the likelihood of the tree.
When the likelihood of the tree has been calculated under all the models to be compared, the model giving the highest likelihood or the lowest AIC value can be selected.
The final tree is then estimated using this model.
Estimation of likelihood of substitution model parameters
For both tests, one needs to compute the likelihood of the trees under the models.
For now, assume we know the likelihood of the models we want to compare.
Need to know the likelihood of a model
LR = 2*(lnL1-lnL0)
◦ lnL1: alternative hypothesis, the more parameter-rich model
◦ lnL0: null hypothesis, the less parameter-rich model
LRT statistic approximately follows a chi-square distribution
Degrees of freedom equal to the number of extra parameters in the more complex model
Likelihood ratio test (LRT)
HKY85 -lnL = 1787.08
GTR -lnL = 1784.82
Then, LR = 2 (1787.08 - 1784.82) = 4.52
degrees of freedom = 4 (GTR adds 4 additional parameters to HKY85)
critical value (P = 0.05) = 9.49
GTR does not fit significantly better!
Example
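The same comparison can be scripted. The sketch below computes the LRT statistic from the two log-likelihoods and a chi-square P value using the closed-form survival function that exists for even degrees of freedom. Restricting to even degrees of freedom is a simplification made here to stay within the standard library, and the function names are illustrative.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """LR = 2 * (lnL1 - lnL0), with lnL1 from the more parameter-rich model."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees
    of freedom: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Values from the example: -lnL = 1787.08 (HKY85) and 1784.82 (GTR).
lr = lrt_statistic(lnl_null=-1787.08, lnl_alt=-1784.82)
p_value = chi2_sf_even_df(lr, df=4)
gtr_significantly_better = p_value < 0.05
```

The statistic is well below the critical value of 9.49, so the P value is far above 0.05 and the simpler HKY85 model is retained.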
A measure of the goodness of fit of a model
◦ information lost when model M is used to approximate the process of molecular evolution
◦ AIC is an estimate of the expected relative distance between a fitted model, M, and the unknown true mechanism that generated the data
AIC(M) = -2*Log(Likelihood(M)) + 2*K(M)
◦ K(M) is the number of estimable parameters of model M
Given a dataset, models can be ranked according to their AIC
The model with the lowest AIC is selected
Akaike Information Criterion
BIC also takes into account the sample size n
BIC(M) = -2*Log(Likelihood(M)) + K(M)*Log(n)
◦ K(M) is the number of estimable parameters of model M and n is the number of characters
Bayesian Information Criterion
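Both criteria are simple to compute once the log-likelihoods are known. The sketch below ranks the HKY85 and GTR models from the LRT example. Two assumptions are made for illustration: K counts only the substitution-model parameters (4 for HKY85, 8 for GTR, since branch lengths on a shared tree add the same number to both models), and the alignment length of 500 characters is hypothetical.

```python
import math


def aic(lnl, k):
    """AIC(M) = -2 * lnL + 2 * K."""
    return -2.0 * lnl + 2.0 * k


def bic(lnl, k, n):
    """BIC(M) = -2 * lnL + K * ln(n), with n the number of characters."""
    return -2.0 * lnl + k * math.log(n)


# (lnL, K) per model; K excludes branch lengths (assumption, see above).
models = {"HKY85": (-1787.08, 4), "GTR": (-1784.82, 8)}
n_sites = 500  # hypothetical alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

Because ln(n) exceeds 2 for any alignment longer than seven characters, BIC penalizes extra parameters more heavily than AIC and tends to favor simpler models.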
Output of a model testing program
Kelchner & Thomas 2007, TREE 22:87-94
Latest of the latest!
Model jumping
◦ Allow the data to determine which model fits best during the analysis
Only available in MrBayes 3.2
[Figure: model jumping among JC, K2P and GTR.]
A priori separation of characters into different partitions
Each partition analyzed with a different model
In addition to allowing heterogeneity across data subsets in overall rate and in substitution model parameters, several programs also allow the user to unlink topology and branch lengths
“Different data subsets can thus have independent branch lengths or even different topologies.” (Ronquist and Huelsenbeck, 2003:1573)
Partitioned models
20 amino acids
Models are based largely on empirical aa replacement matrices
Examples: JTT, WAG, MtREV, Blosum62
Protein models
Parameters include topology and branch lengths!
How to estimate values for those parameters?
◦ Distance methods
◦ Maximum likelihood methods
◦ Bayesian methods
Models have parameters
Objective function (score) that quantifies how well the data fit a tree
Used to evaluate and rank alternative trees
Two logical steps for phylogenetic methods that rely on optimality criteria
◦ Definition of optimality criterion
◦ Maximization (or minimization) of criterion for alternative trees for their evaluation and ranking
Optimality Criteria