Parallel Bayesian Phylogenetic Inference Xizhou Feng Directed by Dr. Duncan Buell Department of...
- Slide 1
- Parallel Bayesian Phylogenetic Inference Xizhou Feng Directed
by Dr. Duncan Buell Department of Computer Science and Engineering
University of South Carolina, Columbia 2002-10-11
- Slide 2
- Topics Background Bayesian phylogenetic inference and MCMC
Serial implementation Parallel implementation Numerical result
Future research
- Slide 3
- Background. Darwin: species are related through a history of
common descent, and the history can be organized as a tree
structure (phylogeny). Modern species are placed on the leaf nodes;
ancient species are placed on the internal nodes. The time of
divergence is described by the length of the branches. A clade is a
group of organisms whose members share homologous features derived
from a common ancestor.
- Slide 4
- Phylogenetic tree (figure labels): clade; branch; branch length;
leaf node (current species); internal node (ancestral species).
- Slide 5
- Applications. The phylogenetic tree is fundamental to understanding
evolution and diversity: it is a principle for organizing biological
data and is central to organism comparison. Practical examples:
resolving the quarrel over bacteria-to-human gene transfers (Nature
2001); tracing routes of infectious disease transmission; identifying
new pathogens; phylogenetic distribution of biochemical pathways.
- Slide 6
- Use DNA data for phylogenetic inference
- Slide 7
- Objectives of phylogenetic inference (figure: input, output). Major
objectives include: estimate the tree topology, estimate the branch
lengths, and describe the credibility of the result.
- Slide 8
- Phylogenetic inference methods. Algorithmic methods: define a
sequence of specific steps that lead to the determination of a tree,
e.g. UPGMA (unweighted pair group method using arithmetic averages)
and Neighbor-Joining. Optimality criterion-based methods: 1) define a
criterion; 2) search for the tree with the best value. Examples:
maximum parsimony (minimize the total tree length), maximum
likelihood (maximize the likelihood), maximum posterior probability
(the tree with the highest probability of being the true tree).
- Slide 9
- Commonly used phylogeny methods (diagram labels). Data set: distance
matrix; character data. Algorithm: algorithmic method; optimization
method. Methods: UPGMA, Neighbor-joining, Fitch-Margoliash, maximum
parsimony, maximum likelihood, Bayesian methods; some are marked as
statistically supported. Search strategy: exact search (exhaustive,
branch & bound); greedy search (stepwise addition, global
arrangement, star decomposition); divide & conquer (DCM, HGT,
quartet); stochastic search (GA, SA, MCMC).
- Slide 10
- Aspects of phylogenetic methods. Accuracy: is the constructed tree
the true tree? If not, what percentage of the edges is wrong?
Complexity: Neighbor-Joining is $O(n^3)$; maximum parsimony is
provably NP-hard; maximum likelihood is conjectured NP-hard.
Scalability: good for small trees, but how about large trees?
Robustness: if the model, the assumptions, or the data are not
exactly correct, how is the result affected? Convergence rate: how
long a sequence is needed to recover the true tree? Statistical
support: with what probability is the computed tree the true tree?
- Slide 11
- The computational challenge: compute the tree of life. Source:
http://www.npaci.edu/envision/v16.3/hillis.html. More than 1.7
million known species. The number of trees increases exponentially
as new species are added. The complexity of evolution. Data
collection & computational system.
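To make the growth in the number of trees concrete, here is a minimal sketch that counts unrooted binary tree topologies with the standard formula $(2n-5)!!$ for $n \ge 3$ taxa. The function name `num_unrooted_trees` is a hypothetical helper, not something from the slides.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted binary tree topologies for n_taxa labeled leaves.

    Uses the double factorial (2n-5)!! = 3 * 5 * ... * (2n-5), valid for n >= 3.
    """
    if n_taxa < 3:
        return 1
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Already at 10 taxa the count exceeds two million, which is why exhaustive search is limited to very small trees.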
- Slide 12
- Topics Background Bayesian phylogenetic inference and MCMC
Serial implementation Parallel implementation Numerical result
Future research
- Slide 13
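- Bayesian Inference-1. Both the observed data and the parameters of
the models are random variables. Set up the joint distribution. When
the data $D$ are known, Bayes' theorem gives:
$P(T, \tau, \theta \mid D) = \frac{P(D \mid T, \tau, \theta)\, P(T, \tau, \theta)}{P(D)}$,
where $P(T, \tau, \theta \mid D)$ is the posterior probability,
$P(D \mid T, \tau, \theta)$ is the likelihood, $P(T, \tau, \theta)$ is
the prior probability, and $P(D)$ is the unconditional probability of
the data. $T$ is the topology, $\tau$ the branch lengths, and $\theta$
the parameters of the models.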
- Slide 14
- Bayesian Inference-2. $P(T \mid D)$ can be interpreted as the
probability that the tree is correct. We need to do at least two
things: approximate the posterior probability distribution, and
evaluate the integral for $P(T \mid D)$. Both can be done via the
Markov chain Monte Carlo method. Having the posterior probability
distribution, we can compute the marginal probability of $T$ as:
$P(T \mid D) = \int\!\!\int P(T, \tau, \theta \mid D)\, d\tau\, d\theta$,
integrating over the branch lengths $\tau$ and the model parameters $\theta$.
- Slide 15
- Markov chain Monte Carlo (MCMC). The basic idea of MCMC is to
construct a Markov chain that has the parameters as its state space
and whose stationary distribution is the posterior probability
distribution of the parameters; simulate the chain; and treat the
realization as a sample from the posterior probability
distribution. MCMC = sampling + continued search.
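As a small illustration of treating the chain's realization as a posterior sample, the sketch below estimates the posterior probability of each topology by its frequency among the sampled states. It assumes each sampled topology is stored as a canonical string (the Newick-like strings here are made up for the example), and `topology_posterior` is a hypothetical helper name.

```python
from collections import Counter


def topology_posterior(sampled_topologies, burn_in=0):
    """Estimate P(T | D) as the relative frequency of each sampled topology.

    sampled_topologies: one canonical topology string per MCMC generation.
    burn_in: number of initial samples to discard before counting.
    """
    kept = sampled_topologies[burn_in:]
    counts = Counter(kept)
    total = len(kept)
    return {topology: n / total for topology, n in counts.items()}


# Toy chain output (illustrative values only, not real results).
chain = ["((A,B),C,D)", "((A,B),C,D)", "((A,C),B,D)", "((A,B),C,D)"]
estimate = topology_posterior(chain)
print(estimate)
```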
- Slide 16
- Markov chain. A Markov chain is a sequence of random variables
$\{X_0, X_1, X_2, \ldots\}$ whose transition kernel $T(X_t, X_{t+1})$ is
determined only by the value of $X_t$ ($t \ge 0$). Stationary
distribution: $\pi(x') = \sum_x \pi(x)\, T(x, x')$ is invariant. Ergodic
property: $p_n(x)$ converges to $\pi(x)$ as $n \to \infty$. A homogeneous
Markov chain is ergodic if $\min_{x, x'} T(x, x') / \pi(x') > 0$.
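The ergodic property can be checked numerically on a tiny example. The sketch below, with a made-up two-state transition matrix, repeatedly applies $p_{n+1}(x') = \sum_x p_n(x)\, T(x, x')$ and compares the result with the stationary distribution, which for this particular matrix is $(5/6, 1/6)$.

```python
def step(p, T):
    """One step of the chain: p_next[y] = sum over x of p[x] * T[x][y]."""
    n = len(T)
    return [sum(p[x] * T[x][y] for x in range(n)) for y in range(n)]


def iterate(p0, T, n_steps):
    """Apply the transition kernel n_steps times to the distribution p0."""
    p = list(p0)
    for _ in range(n_steps):
        p = step(p, T)
    return p


# Illustrative transition matrix; each row sums to 1.
T = [[0.9, 0.1],
     [0.5, 0.5]]
p_n = iterate([1.0, 0.0], T, 200)
print(p_n)
```

Starting from `[0.0, 1.0]` instead gives the same limit, which is the point of the ergodic property: the starting state is forgotten.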
- Slide 17
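- Metropolis-Hastings algorithm-1. The cornerstone of all MCMC methods:
Metropolis (1953); Hastings proposed a generalized version in 1970.
The key point is how to define the acceptance probability.
Metropolis: $r(x, x') = \min\left(1, \frac{\pi(x')}{\pi(x)}\right)$.
Hastings: $r(x, x') = \min\left(1, \frac{\pi(x')\, T(x', x)}{\pi(x)\, T(x, x')}\right)$.
The proposal probability $T(x, x')$ can be of any form, such
that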
- Slide 18
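- Metropolis-Hastings algorithm-2. 1. Initialize $x_0$, set $t = 0$.
2. Repeat: 1) sample $x'$ from $T(x_t, x')$; 2) draw $U \sim \mathrm{uniform}[0, 1]$;
3) update: $x_{t+1} = x'$ if $U < r(x_t, x')$, otherwise $x_{t+1} = x_t$, where $r$ is the acceptance probability.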
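The loop above can be sketched in a few lines. This is a minimal illustration, not the implementation from the talk: it assumes a one-dimensional state, works with log densities, and uses the Hastings acceptance ratio $\pi(x')\,T(x', x) / (\pi(x)\,T(x, x'))$. The target (a standard normal) and the random-walk proposal are placeholders chosen only so the example runs.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_proposal, x0, n_steps, seed=0):
    """Generic Metropolis-Hastings sampler.

    log_target(x): log of the (unnormalized) target density pi(x).
    propose(x, rng): draws a candidate x' from T(x, .).
    log_proposal(a, b): log T(a, b), the log density of proposing b from a.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        y = propose(x, rng)
        # log of pi(y) T(y, x) / (pi(x) T(x, y))
        log_r = (log_target(y) + log_proposal(y, x)) - (log_target(x) + log_proposal(x, y))
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = y  # accept the candidate
        samples.append(x)  # on rejection the old state is repeated
    return samples


def log_std_normal(x):
    return -0.5 * x * x


def random_walk(x, rng):
    return x + rng.uniform(-1.0, 1.0)


def symmetric_log_proposal(a, b):
    return 0.0  # symmetric proposal: the T terms cancel (Metropolis case)


samples = metropolis_hastings(log_std_normal, random_walk, symmetric_log_proposal, 0.0, 20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

With a symmetric proposal the ratio reduces to the original Metropolis rule; an asymmetric proposal only needs a different `log_proposal`.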
- Slide 19
- Problems of the MH algorithm & improvements. Problems: the mixing
rate is slow when steps are small (low movement) or large (low
acceptance); the chain may stop at local optima; the dimension of
the state space may vary. Improvements: Metropolis-coupled MCMC;
multipoint MCMC; population-based MCMC; reversible-jump MCMC.
- Slide 20
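- Metropolis-coupled MCMC (Geyer 1991). Run several MCMC chains with
different distributions $\pi_i(x)$ ($i = 1..m$) in parallel. $\pi_1(x)$
is used for sampling; $\pi_i(x)$ ($i = 2..m$) are used to improve mixing.
For example: $\pi_i(x) = \pi(x)^{1/(1 + \lambda (i-1))}$. After each
iteration, attempt to swap the states of two chains $i$ and $j$ using a
Metropolis-Hastings step with acceptance probability
$\min\left(1, \frac{\pi_i(x_j)\, \pi_j(x_i)}{\pi_i(x_i)\, \pi_j(x_j)}\right)$, where $x_i$ and $x_j$ are the current states of the two chains.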
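A minimal serial sketch of Metropolis-coupled MCMC follows, assuming heated chains of the form $\pi_i(x) = \pi(x)^{\beta_i}$ with $\beta_i = 1/(1 + \lambda(i-1))$ and a random-walk update inside each chain. The bimodal one-dimensional target, the step size, and the value of `lam` are illustrative assumptions, not settings from the talk. With this heating scheme the swap ratio simplifies to $\exp\big((\beta_i - \beta_j)(\log\pi(x_j) - \log\pi(x_i))\big)$.

```python
import math
import random


def log_bimodal(x):
    """Log density (unnormalized) of two unit-variance normals centered at -4 and +4."""
    a = -0.5 * (x - 4.0) ** 2
    b = -0.5 * (x + 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def mc3(log_target, x0, n_steps, n_chains=4, lam=1.0, step=1.0, seed=0):
    """Metropolis-coupled MCMC; returns the samples of the cold chain (index 0)."""
    rng = random.Random(seed)
    betas = [1.0 / (1.0 + lam * i) for i in range(n_chains)]  # beta = 1 is the cold chain
    states = [x0] * n_chains
    cold_samples = []
    for _ in range(n_steps):
        # Within-chain Metropolis update against the heated target pi(x)^beta.
        for i, beta in enumerate(betas):
            y = states[i] + rng.uniform(-step, step)
            log_r = beta * (log_target(y) - log_target(states[i]))
            if log_r >= 0 or rng.random() < math.exp(log_r):
                states[i] = y
        # Attempt to swap the states of two randomly chosen chains.
        i, j = rng.sample(range(n_chains), 2)
        log_r = (betas[i] - betas[j]) * (log_target(states[j]) - log_target(states[i]))
        if log_r >= 0 or rng.random() < math.exp(log_r):
            states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples


cold = mc3(log_bimodal, x0=4.0, n_steps=5000)
print(min(cold), max(cold))
```

The heated chains cross the low-probability region between the two modes far more easily than the cold chain, and the swaps pass those moves down to the chain that is actually sampled.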
- Slide 21
- Illustration of Metropolis-coupled MCMC (figure): four chains with
$\pi_1(x)$, $T_1 = 0$; $\pi_2(x)$, $T_2 = 2$; $\pi_3(x)$, $T_3 = 4$; $\pi_4(x)$, $T_4 = 8$.
Metropolis-coupled MCMC is also called parallel tempering.
- Slide 22
- Multiple Try Metropolis (Liu et al 2000). Figure: from $x_t$, sample
$y_1, y_2, y_3, y_4$ from $T(x_t, \cdot)$; choose $y = y_i$; sample
$x_1^*, x_2^*, x_3^*, x_4^*$ from $T(y, \cdot)$; accept $y$ or keep $x_t$ as
$x_{t+1}$ using an M-H step.
- Slide 23
- Population-based MCMC. Metropolis-coupled MCMC uses only minimal
interaction between multiple chains; why not more active
interaction? Evolutionary Monte Carlo (Liang et al 2000): combines a
genetic algorithm with MCMC; used to simulate protein folding.
Conjugate Gradient Monte Carlo (Liu et al 2000): uses local
optimization for adaptation; an improvement of ADS (Adaptive
Direction Sampling).
- Slide 24
- Topics Background Bayesian phylogenetic inference and MCMC
Serial implementation Choose the evolutionary model Compute the
likelihood Design proposal mechanisms Parallel implementation
Numerical result Future research
- Slide 25
- DNA substitution rate matrix (figure: A and G are purines, C and T
are pyrimidines; substitutions are labeled as transitions or
transversions). Considering the inference of unrooted trees and the
computational complications, some simplified models are used (see
next slide).
- Slide 26
- GTR-family of substitution models. GTR: general time-reversible
model, corresponding to a symmetric rate matrix. Models shown: GTR,
TN93, HKY85, F84, F81, JC69, K2P, K3ST, SYM. Restrictions that
derive the simpler models (arrow labels in the figure): equal base
frequencies; three substitution types (1 transversion, 2
transitions); two substitution types (transition vs. transversion);
single substitution type.
- Slide 27
- More complex models. Substitution rates vary across sites:
invariable-sites models + gamma distribution. Correlation in the
rates at adjacent sites. Codon models: 61×61 instantaneous rate
matrix. Secondary structure models.
- Slide 28
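- Compute the conditional probability along a branch. Given the
substitution rate matrix $Q$, how do we compute $p(b \mid a, t)$, the
probability that $a$ is substituted by $b$ after time $t$?
$P(t) = e^{Qt} = U\, \mathrm{diag}(e^{\lambda_1 t}, \ldots, e^{\lambda_4 t})\, U^{-1}$, where the $\lambda_i$ are the eigenvalues of $Q$, the columns of $U$ are the corresponding eigenvectors, and $p(b \mid a, t)$ is the $(a, b)$ entry of $P(t)$.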
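As a concrete check of $P(t) = e^{Qt}$, the sketch below uses the JC69 model, whose rate matrix has eigenvalues $0$ and $-4\alpha$ (three times), so the spectral form collapses to a closed formula. To stay within the standard library, the general matrix exponential is computed with a scaled Taylor series plus repeated squaring instead of a numerical eigendecomposition; that substitution, the helper names, and the values of `alpha` and `t` are assumptions made for the example.

```python
import math


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def expm(Q, t, squarings=10, terms=12):
    """Approximate exp(Q t) by a Taylor series on Q t / 2^s, then square s times."""
    n = len(Q)
    scale = t / (2 ** squarings)
    A = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, A)]  # A^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P


def jc69_rate_matrix(alpha):
    """JC69: every substitution has rate alpha; each row sums to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def jc69_closed_form(alpha, t):
    """Spectral solution for JC69, using the non-zero eigenvalue -4 * alpha."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


alpha, t = 0.1, 2.0
P_series = expm(jc69_rate_matrix(alpha), t)
P_exact = jc69_closed_form(alpha, t)
max_diff = max(abs(P_series[i][j] - P_exact[i][j]) for i in range(4) for j in range(4))
print(max_diff)
```

For richer models such as HKY85 or GTR no such simple closed form is used in practice, and a numerical eigendecomposition of $Q$ takes the place of `jc69_closed_form`.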
- Slide 29
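- Likelihood of a phylogenetic tree for one site (figure: tree with
nodes $x_1, x_2, x_3, x_4, x_5$ and branch lengths $t_1, t_2, t_3, t_4$). Two cases are shown: when $x_4$, $x_5$ are known, and when $x_4$, $x_5$ are unknown.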
- Slide 30
- Likelihood calculation (Felsenstein 1981). Given a rooted tree
with n leaf nodes (species), and each leaf node is represented by a
sequence x i with length N, the likelihood of a rooted tree is
represented as:
- Slide 31
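- Likelihood calculation-2. Felsenstein's algorithm for likelihood
calculation (1981). Initiation: set $k = 2n - 1$. Recursion: compute
$P(L_k \mid a)$ for all $a$ as follows. If $k$ is a leaf node: set
$P(L_k \mid a) = 1$ if $a = x_u^k$; set $P(L_k \mid a) = 0$ if $a \ne x_u^k$.
If $k$ is not a leaf node: compute $P(L_i \mid a)$ and $P(L_j \mid a)$ for all
$a$ at its children nodes $i$, $j$, and set
$P(L_k \mid a) = \sum_{b, c} p(b \mid a, t_i)\, P(L_i \mid b)\, p(c \mid a, t_j)\, P(L_j \mid c)$.
Termination: the likelihood at site $u$ is $\sum_a q_a\, P(L_{2n-1} \mid a)$,
where $q_a$ is the frequency of base $a$. Note: algorithm modified from
Durbin et al (1998).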
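The recursion can be sketched recursively for a single site. This is an illustrative version under stated assumptions: the tree is a nested tuple `(left, t_left, right, t_right)` with leaf names as strings, branch probabilities come from the JC69 model with rate `alpha`, and base frequencies default to 1/4. None of these representation choices come from the slides.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t, alpha=1.0):
    """p(b | a, t) under JC69 (assumed model for this sketch)."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional_likelihoods(node, site):
    """Return {a: P(L_k | a)} for the subtree rooted at node.

    node: a leaf name (str) or a tuple (left, t_left, right, t_right).
    site: dict mapping each leaf name to its observed base at this site.
    """
    if isinstance(node, str):  # leaf: 1 for the observed base, 0 otherwise
        return {a: 1.0 if a == site[node] else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    L_i = conditional_likelihoods(left, site)
    L_j = conditional_likelihoods(right, site)
    return {
        a: sum(jc69_prob(a, b, t_left) * L_i[b] for b in BASES)
        * sum(jc69_prob(a, c, t_right) * L_j[c] for c in BASES)
        for a in BASES
    }


def site_likelihood(tree, site, freqs=None):
    """Termination step: sum over root states weighted by base frequencies."""
    freqs = freqs or {a: 0.25 for a in BASES}
    L_root = conditional_likelihoods(tree, site)
    return sum(freqs[a] * L_root[a] for a in BASES)


tree = (("x1", 0.1, "x2", 0.2), 0.05, "x3", 0.3)
print(site_likelihood(tree, {"x1": "A", "x2": "A", "x3": "G"}))
```

The double sum over $b$ and $c$ factorizes into a product of two single sums, which is what the dictionary comprehension computes. For a full alignment, the per-site likelihoods would be multiplied (or their logs summed) over sites.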
- Slide 32
- Likelihood calculation-3. The likelihood calculation requires
filling an N × M × S × R table. N: number of sequences; M: number of
sites; S: number of character states; R: number of rate categories.
(Figure: table indexed by taxa 1..n, sites 1..m, states A, C, G, T,
and rate categories 1..r, with leaf entries of 1.0 and 0.0.)
- Slide 33
- Local update of the likelihood. If we change the topology and
branch lengths of the tree only locally, we only need to refresh the
table at the affected nodes. In the following example, only the nodes
shown in red need to change their conditional likelihoods. (Figure:
original tree; proposed tree.)
- Slide 34
- Proposal mechanism for trees. Stochastic nearest-neighbor
interchange (NNI): Larget et al (1999), Huelsenbeck (2000). (Figure
labels before the move: d, c, v, u, a, m, y, x; after the move: c, d,
m*, y*, x*, u, a, v.) (1) Choose a backbone. (2) Change m and y randomly.
- Slide 35
- Proposal mechanisms for parameters. Independent parameter, e.g.
transition/transversion ratio k (figure: axis from Min to Max
showing k, k+, k-, and the proposed value k*). A set of parameters
constrained to sum to a constant, e.g. base frequency distribution:
draw a sample from the Dirichlet distribution. Larget et al.
(1999).
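Both kinds of parameter proposal can be sketched briefly. The first function assumes the figure describes a sliding window: draw k* uniformly within a half-width `delta` of the current k and reflect it back into [Min, Max]. The second draws new base frequencies from a Dirichlet distribution centered on the current ones. The window reading, the reflection rule, the `concentration` parameter, and the function names are all assumptions for illustration.

```python
import random


def sliding_window_proposal(k, delta, lo, hi, rng):
    """Propose k* uniformly in [k - delta, k + delta], reflected into [lo, hi]."""
    k_star = k + rng.uniform(-delta, delta)
    while k_star < lo or k_star > hi:
        if k_star < lo:
            k_star = 2.0 * lo - k_star  # reflect at the lower bound
        if k_star > hi:
            k_star = 2.0 * hi - k_star  # reflect at the upper bound
    return k_star


def dirichlet_proposal(freqs, concentration, rng):
    """Draw new frequencies from Dirichlet(concentration * freqs).

    Uses the standard construction: normalize independent gamma draws.
    All entries of freqs must be strictly positive.
    """
    draws = [rng.gammavariate(concentration * f, 1.0) for f in freqs]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(1)
kappa_star = sliding_window_proposal(2.0, 0.5, 0.0, 10.0, rng)
freqs_star = dirichlet_proposal([0.25, 0.25, 0.25, 0.25], 100.0, rng)
print(kappa_star, freqs_star)
```

The reflected window is symmetric, so it adds nothing to the acceptance ratio. The Dirichlet proposal is not symmetric, so the Hastings ratio of forward and reverse proposal densities has to be included when it is used.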
- Slide 36
- Bayesian phylogenetic inference (flowchart labels): phylogenetic
tree; DNA data; evolutionary model; likelihood; prior probability;
posterior probability; MCMC; starting tree; proposal; a sequence of
samples; inference; approximate the distribution.
- Slide 37
- Topics Background Bayesian phylogenetic inference and MCMC
Serial implementation Parallel implementation Challenges of serial
computation Difficulty: MCMC is a serial algorithm Multiple chains
need to be synchronized Choose appropriate grid topology
Synchronize using random number Numerical result Future
research
- Slide 38
- Computational challenge. Computing the global likelihood needs
$O(NMRS^2)$ multiplications. Locally updating the topology & branch
lengths needs $O(MRS^2 \log N)$. Updating a model parameter needs
$O(NMRS^2)$. A local update needs all required data in memory. Given
N=1000 species, each sequence with M=5000 sites, R=5 rate categories,
and the DNA nucleotide model S=4, running 5 chains each with a length
of 100 million generations needs ~400 days (assuming 1% global
updates, 99% local updates) and O(N×M×R×S×L×2×2×8) ~ 32 gigabytes of
memory. So until more advanced algorithms are developed, parallel
computation is the direct solution. Using 32 processors with 1
gigabyte of memory each, we can compute the problem in ~2 weeks.
- Slide 39
- Characteristics of good parallel algorithms. Balancing workload.
Concurrency: identify, manage, and granularity. Reducing
communication: communication-to-computation ratio; frequency, volume,
balance. Reducing extra work: computing assignment; redundant
work.
- Slide 40
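- Single-chain MCMC algorithm (flowchart). 1. Generate the initial
state $S^{(0)}$; set $S^{(t)} = S^{(0)}$, $t = 0$. 2. Propose a new state $S'$.
3. Evaluate $S'$. 4. Compute $R$ and $U$. 5. If $U < R$, set $S^{(t+1)} = S'$;
otherwise set $S^{(t+1)} = S^{(t)}$. 6. Set $t = t + 1$. 7. If $t >$ max
generation, end; otherwise return to step 2.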
- Slide 41
- Multiple-chain MCMC algorithm (flowchart). Chains #1 to #4: each
chain $i$ generates $S_i^{(0)}$ and sets $t = 0$; each chain $i$ then
proposes & updates $S_i^{(t)}$. Choose two chains to swap. Compute R
and U. U