Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene
Alexandros Stamatakis
Swiss Federal Institute of Technology Lausanne (EPFL), School of Computer & Communication Sciences
Laboratory for Computational Biology and Bioinformatics, Lausanne, Switzerland
& Swiss Institute of Bioinformatics
[email protected]/~stamatak
Alexandros Stamatakis, October 2007
The Missing Part
Data Assembly, Tree Inference (?), Analysis
IBM BlueGene/L supercomputer
Rapid Bootstrapping, Bootstopping Criterion
The Big Hardware Problem
1980 to 2007:
• CPU Speed: 40% p.a.
• Memory Speed: 9% p.a.
... and why this concerns Bioinformatics
1980 to 2007:
• CPU Speed: 40% p.a.
• Memory Speed: 9% p.a.
• Sequence Data
Application of HPC techniques will become much more important
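To make the gap concrete, the two annual rates can be compounded over the 1980 to 2007 span. This is a minimal sketch that assumes both rates stay constant over the whole period, which is a simplification; the function name is illustrative only.

```python
def compound_growth(rate_per_year: float, years: int) -> float:
    """Total growth factor after compounding a constant annual rate."""
    return (1.0 + rate_per_year) ** years


YEARS = 2007 - 1980  # 27 years, the span shown on the slide

cpu_factor = compound_growth(0.40, YEARS)     # CPU speed: 40% p.a.
memory_factor = compound_growth(0.09, YEARS)  # memory speed: 9% p.a.
gap = cpu_factor / memory_factor              # how far memory falls behind

print(f"CPU speed grew by a factor of about {cpu_factor:,.0f}")
print(f"Memory speed grew by a factor of about {memory_factor:,.1f}")
print(f"The CPU/memory gap widened by a factor of about {gap:,.0f}")
```

Under these assumptions the CPU factor is in the thousands while the memory factor is roughly ten, which is why memory access patterns matter so much for likelihood codes.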
Cache Hierarchy
Outline
● Introduction
● Computation of Phylogenies
● Maximum Likelihood
● Web & Grid Services
● Three Steps Towards the Tree of Life
● Parallelism on IBM BlueGene/L
● Rapid Bootstrapping
● A Bootstopping criterion
● Related Projects
● Outlook
Phylogenetics
Input: “good” multiple alignment
Output: unrooted binary tree
Various methods for phylogenetic inference:
• Neighbour Joining (fast & simple)
• Maximum Parsimony (relatively fast & simple)
• Maximum Likelihood (complex & slow)
• Bayesian Methods (complex & slower)
ML & Bayesian: explicit model choice
Complex methods & models are required to reconstruct large & complicated trees!
Focus of this talk is on Maximum Likelihood!
The real reason for working on ML: ......
Challenges for Phyloinformatics
• Holy grail: “Tree of Life”
• What is a good alignment in a phylogenetic context?
• Simultaneous alignment and tree building
• Improve/extend models ... but thereby the size of computable trees decreases!
• More HPC awareness
• Exploit multi-core architectures
• Amount of available data grows at a higher rate than algorithms are getting faster
The algorithmic problem
The number of trees explodes!
BANG !
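The explosion can be quantified with the standard count of unrooted binary tree topologies on n labelled leaves, (2n-5)!! = 3 · 5 · ... · (2n-5). The sketch below evaluates that formula; the function name is illustrative and not taken from the talk.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of distinct unrooted binary topologies with n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # multiplies 3 * 5 * ... * (2n - 5)
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(f"{n:>3} taxa: {num_unrooted_binary_trees(n):.3e} topologies")
```

Exhaustive enumeration is therefore hopeless beyond a handful of sequences, which is why heuristic tree searches are needed.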
Maximum Likelihood
Alignment: Seq1, Seq2, Seq3, Seq4; length: m
Substitution model: matrix with rows and columns A, C, G, T
Prior probabilities (empirical base frequencies): πA, πC, πG, πT
[Figure: unrooted tree connecting Seq 1, Seq 2, Seq 3 and Seq 4 with branch lengths b1, b2, b3, b4, b5]
Virtual root: vr
At vr: one probability vector P(A), P(C), P(G), P(T) for each of the m alignment columns
Lots of floating point operations!
Optimize branch lengths
Optimize model parameters
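The likelihood computation sketched on these slides can be written out for the four-sequence case. The example below is a minimal sketch under several assumptions that are not in the talk: it uses the Jukes-Cantor model with equal base frequencies instead of a general substitution model, it assumes the topology ((Seq1,Seq2),(Seq3,Seq4)) with b1 to b4 as leaf branches and b5 as the inner branch, and all function names are illustrative.

```python
import math

STATES = "ACGT"


def jc69(t):
    """4x4 Jukes-Cantor transition matrix for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if x == y else diff for y in range(4)] for x in range(4)]


def tip(base):
    """Likelihood vector of a leaf: 1 for the observed base, 0 otherwise."""
    return [1.0 if s == base else 0.0 for s in STATES]


def propagate(vec, t):
    """Push a likelihood vector across a branch of length t."""
    p = jc69(t)
    return [sum(p[x][y] * vec[y] for y in range(4)) for x in range(4)]


def inner(q, t_q, r, t_r):
    """Likelihood vector of an inner node computed from its two children."""
    a, b = propagate(q, t_q), propagate(r, t_r)
    return [a[x] * b[x] for x in range(4)]


def site_likelihood(column, branches, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Likelihood of one alignment column (Seq1..Seq4)."""
    b1, b2, b3, b4, b5 = branches
    left = inner(tip(column[0]), b1, tip(column[1]), b2)
    right = inner(tip(column[2]), b3, tip(column[3]), b4)
    across = propagate(right, b5)  # virtual root sits at the left end of b5
    return sum(freqs[x] * left[x] * across[x] for x in range(4))


def log_likelihood(alignment, branches):
    """Sum of per-column log likelihoods over all m columns."""
    return sum(math.log(site_likelihood(col, branches)) for col in zip(*alignment))


alignment = ["ACGT", "ACGA", "ATGT", "ATGA"]  # toy data, m = 4
branches = (0.1, 0.1, 0.1, 0.1, 0.05)
print(log_likelihood(alignment, branches))
```

Every column needs the same small matrix-vector products, so the cost grows with both the number of sequences and the alignment length m; branch length and model parameter optimization repeat this evaluation many times.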
Maximum Likelihood
Goal: Obtain the topology with the maximum likelihood value
Problem I: Number of possible topologies is exponential in the number of sequences n
Problem II: Computation of the likelihood function is expensive
Problem III: High score accuracy is probably required
Problem IV: High memory consumption
Solution:
• New Algorithms
• New Models
• High Performance Computing
RAxML: Randomized Axelerated Maximum Likelihood
Web & Grid Services
• RAxML Web-Server at San Diego Supercomputer Center via www.phylo.org (CIPRES project)
• Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics: phylobench.vital-it.ch/raxml-bb/
• Includes novel search algorithm with 1 order of magnitude run-time improvement
• Since Sept 3, about 700 jobs from 130 IPs
• Extension to SwissGrid planned
• Novel algorithm with Bootstopping to be integrated into CIPRES portal soon
• RAxML integration into Distributed European Infrastructure for Supercomputing Applications (www.deisa.org) started 10 days ago
• Integration into Debian medical distribution
RAxML Black Box
Why are Black Boxes useful?
Coarse-Grained Parallelism: MPI Version of RAxML
[Figure: a master process and worker processes on a PC cluster, connected by an interconnection network; jobs labelled B-0 to B-4]
Levels of Parallelism
• Embarrassing Parallelism: MPI, CORBA, Grid Technologies
• Inference Parallelism: MPI, algorithm-dependent
• Loop-Level Parallelism: OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect
Loop Level Parallelism
[Figure: tree with the vectors P, Q and R and the virtual root]
P[i] = f(Q[i], R[i])
This operation uses ≥ 90% of total execution time!
→ simple fine-grained parallelization
The real reason for assuming independent evolution among sites: ......
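Because P[i] depends only on Q[i] and R[i], the columns can be split into blocks that are processed independently. The sketch below shows that decomposition with a thread pool. It is an illustration of the idea only: `f` is a stand-in element-wise product rather than the real likelihood kernel, all names are hypothetical, and Python threads demonstrate the data decomposition rather than an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def f(q, r):
    """Stand-in per-site kernel: element-wise product of two 4-state vectors."""
    return [qa * ra for qa, ra in zip(q, r)]


def compute_chunk(Q, R, start, stop):
    """One worker handles a contiguous block of alignment columns."""
    return [f(Q[i], R[i]) for i in range(start, stop)]


def parallel_P(Q, R, workers=4):
    """Compute P[i] = f(Q[i], R[i]) for all i, split across workers."""
    m = len(Q)
    bounds = [(w * m // workers, (w + 1) * m // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(lambda se: compute_chunk(Q, R, *se), bounds))
    return [p for chunk in chunks for p in chunk]  # concatenate in column order


# Toy likelihood vectors for m = 10 columns.
Q = [[0.1 * (i + 1)] * 4 for i in range(10)]
R = [[0.05 * (i + 2)] * 4 for i in range(10)]
P = parallel_P(Q, R, workers=3)
print(len(P), "columns computed")
```

No communication is needed inside the loop; only the final reduction (summing per-column log likelihoods) requires the workers to exchange data.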
Fine-Grained Parallelism: OpenMP version of RAxML
HPC for ML (Bayesian)
Proof of Concept & Programming Techniques:
• RAxML on a Graphics Processing Unit
• RAxML on the IBM CELL & Playstation
Production Level Implementations:
• RAxML with OpenMP
• RAxML with MPI
• RAxML on BlueGene
• Multi-Core Architectures
A good excuse to buy one
RAxML-BlueGene
• Many slow processors: 1024 in one rack
• 512 MB or 1 GB of main memory per node
• But: high performance network
• Challenges:
  • Distribute tree data structure among CPUs
  • Exploit fast collective communication network
• For optimal efficiency: loop-level + embarrassing parallelism, i.e. hybrid parallelism with MPI
• Test & Production Run Data:
  • With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs
  • With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs
To be presented at IEEE/ACM 2007 Supercomputing Conference.
Largest ML analysis to date in terms of memory footprint
Loop-Level Parallelism on BlueGene
[Plot: 50 Seqs x 23,385 bp; superlinear speedup]
[Plot: 250 Seqs x 403,581 bp]
Embarrassing Parallelism
[Figure: four independent groups, each with one master (M) and three workers (W)]
Confidence Values
• Tree without node confidence values is mostly useless
• Problem: Confidence value calculation is major computational obstacle
• We can compute large trees but not analyse them: compute ≠ analyse!
• Current Slow Methods:
  • Sampling with Bayesian methods
  • Non-parametric Bootstrapping
A Tree with Confidence Values
Joint work with Marc Gottschling, Charite Hospital, Berlin
Bootstrapping
[Figure: original alignment → perturbation → compute tree, once for each perturbed alignment]
This needs to be done 100-1000 times
Embarrassingly Parallel!
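The perturbation step is the standard non-parametric bootstrap: draw m alignment columns with replacement. A compact way to store a replicate is a vector of per-column weights that sums to m. The sketch below assumes this weight-vector representation; the function name and the fixed seed are illustrative choices.

```python
import random
from collections import Counter


def bootstrap_weights(m, rng):
    """Draw m columns with replacement; return how often each column was drawn."""
    counts = Counter(rng.randrange(m) for _ in range(m))
    return [counts[i] for i in range(m)]


rng = random.Random(42)  # fixed seed only to make the example reproducible
m = 14                   # alignment length of the toy example
replicates = [bootstrap_weights(m, rng) for _ in range(3)]
for weights in replicates:
    print("".join(map(str, weights)), "sum =", sum(weights))
```

Each replicate is independent of the others, so the tree searches on different replicates can run on different processors without any communication.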
Two Questions
• How to compute Bootstraps faster?
• How many Bootstrap replicates do we need?
Current Work: Rapid Bootstrapping Algorithm
• Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences
• Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values: 0.95-0.99 (except one dataset at 0.91), average 0.97
• Weighted topological distance < 6%, average 4%
• Program acceleration: factor 8-20, average ≈ 15
• Acceleration by one order of magnitude
• Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!
• Allows for a sufficiently large number of Bootstrap replicates
Quick & Dirty Bootstrap
Modify Algorithm
Computational Experiments
iterate
Rapid Bootstrap
Original alignment (column weights): 11111111111111
Bootstrap replicates (column weights):
  01102211111111
  10111102220111
  11111110112021
Compute Starting Tree
Optimize Model Params & Branch Lengths
Use Starting Tree & Model Params to compute RELL scores:
  01102211111111   -110
  10111102220111   -105
  11111110112021   -100
Sort by RELL:
  11111110112021   -100
  10111102220111   -105
  01102211111111   -110
T0: Thorough Search
T1: Quick Search on T0
T2: Quick Search on T1
Sequential dependency is bad for parallelism
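The RELL scoring and sorting step can be sketched as follows. RELL (resampling of estimated log likelihoods) scores a replicate as the weighted sum of the per-column log likelihoods of a fixed tree, so no tree search is needed. The per-column values below are hypothetical numbers invented for the illustration, so the resulting scores will not match the -110, -105, -100 shown on the slide; only the three weight vectors come from the talk.

```python
def rell_score(site_lnl, weights):
    """RELL score of a replicate: weighted sum of per-column log likelihoods."""
    return sum(w * lnl for w, lnl in zip(weights, site_lnl))


# Hypothetical per-column log likelihoods of the starting tree (14 columns).
site_lnl = [-7.1, -6.8, -7.4, -6.5, -8.0, -7.7, -6.9,
            -7.2, -7.0, -6.6, -7.9, -7.3, -6.7, -7.5]

# Weight vectors of the three replicates shown on the slide.
replicates = ["01102211111111", "10111102220111", "11111110112021"]

scored = [(rell_score(site_lnl, [int(c) for c in r]), r) for r in replicates]
scored.sort(reverse=True)  # best (least negative) score first
for score, r in scored:
    print(r, round(score, 1))
```

After sorting, the slides chain the searches: a thorough search yields T0, and each later quick search starts from the tree found for the previous replicate. That chain is what makes the procedure fast and also what creates the sequential dependency.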
Scalability of Rapid Bootstrap
Some datasets are harder than others
ML-Scores: Garli, RAxML, PHYML, 715 Sequences
Correlation 125 Taxa: 0.91
Support Value Distribution
Bootstrap Likelihood Values, 125 x 19,436
10,000 replicates, only 195 non-trivial bipartitions
3,491 rbcL sequences: Rapid versus Standard BS
Correlation: 0.98
7,764 DNA: Best Tree
7,764 DNA: All Bipartitions
775 x 3,838 AA
New Opportunities
• Assess impact of alignment method on tree and support values
• Test Bootstrap of the Bootstrap (double Bootstrap) procedures
• Devise and empirically verify Bootstopping criteria
Bootstrap of the Bootstrap: 140 AA (Efron et al., PNAS 1996)
Bootstrap of the Bootstrap: 3,491 rbcL
Bootstopping
Rapid Bootstrapping makes it possible to assess Bootstopping criteria as follows:
1. Compute a high number of BS replicates (10,000)
2. Devise a topology-based bootstopping criterion and apply it to these 10,000 replicates
3. Compare support values induced by the bootstopped trees (say 300 replicates) with those from all 10,000 replicates
We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences
Bootstopping Criterion
Every 50, 100, 150, ... replicates do a test:
• Say we have N BS trees
• Do the following 100 times:
  • Randomly split up this set of N trees into 2 equal sets S1, S2 of size N/2
  • Compute the bipartition support vectors for S1 and S2
  • Compute Pearson correlation of the support vectors
• Return average of the 100 Pearson correlations
• If average > 0.99, stop
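The test above can be sketched directly. This is a minimal illustration under stated assumptions: each tree is represented as a set of bipartitions in any hashable encoding (strings here), the bipartition extraction itself is left out, and the function names as well as the handling of a constant support vector are choices made for this example.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # undefined correlation is treated as "not converged"
    return cov / (vx * vy) ** 0.5


def support_vector(trees, bipartitions):
    """Fraction of trees that contain each bipartition."""
    return [sum(b in t for t in trees) / len(trees) for b in bipartitions]


def bootstop(trees, rng, permutations=100, threshold=0.99):
    """Return (stop?, average correlation) for the current set of BS trees."""
    if len(trees) < 2:
        raise ValueError("need at least two trees")
    bipartitions = list(set().union(*trees))
    half = len(trees) // 2
    correlations = []
    for _ in range(permutations):
        shuffled = trees[:]
        rng.shuffle(shuffled)  # random split into two halves S1, S2
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        correlations.append(pearson(support_vector(s1, bipartitions),
                                    support_vector(s2, bipartitions)))
    average = mean(correlations)
    return average > threshold, average


# Toy input: 200 trees over 5 taxa, encoded as sets of bipartition strings.
trees = [{"AB|CDE"} | ({"CD|ABE"} if i % 5 < 3 else {"DE|ABC"}) for i in range(200)]
stop, average = bootstop(trees, random.Random(1))
print("stop:", stop, "average correlation:", round(average, 4))
```

In a bootstrapping run this function would be called after every 50 new replicates, and the analysis would end the first time it reports `stop`.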
Result Overview
• Bootstopped after 100-400 replicates (avg 213)
• Correlation on best tree, bootstopped versus 10,000 replicates: > 0.99 (avg 0.995)
• Correlation of all bipartitions: > 0.995 (avg 0.997)
Bootstopping Best 140 AA
Bootstopping Best 404 DNA (Multi-Gene)
Bootstopping Best 994 DNA
Bootstopping All 994 DNA
Bootstopping Best 1,908 DNA
Bootstopping Best 2,554 DNA
8,864 Bacteria under GTR+Γ and GTR+CAT
[Plot: log likelihood score under Γ versus execution time; annotations: 14 days, 7 days]
Putting the Pieces together
• BlueGene: Can handle huge datasets
• Use CAT approximation on BlueGene
  • Further speedup of factor 3.5
  • Memory footprint reduction factor 4
• Integrate rapid Bootstrap into BlueGene version
  • Additional speedup ≈ 15
• Mechanisms available to accelerate BlueGene version by factor 50-60
• Integrate Bootstopping into BlueGene
Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!
Host-Parasite Co-Evolution
Hosts (e.g. Mammals), Parasites (e.g. Lice)
[Figure: host tree and parasite tree linked by a 0/1 adjacency matrix; 8 parasites, 6 hosts]
Co-Evolution Hypothesis
Statistical Test
What can HPC do for Bioinformatics? Axelerated Parafit
• “Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003
• AxParafit (Axelerated Parafit):
  • Statistical test of hypotheses of host-parasite co-evolution
  • C porting, optimization, BLAS integration
  • Speedup up to factor 67
  • Master-Worker MPI-parallelization
• Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks
• Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html
• SwissGrid-based Web-Server planned
AxParafit: Sequential Performance
AxParafit: Parallel Performance
The ML Benchmark: A Current Community Project
• Standardized way required to test ML search programs
• Web-Server with real-world alignments and performance data at Swiss Institute of Bioinformatics
• Many developers of popular ML programs involved:
  • Stephane Guindon (PHYML), Montpellier
  • Simon Wheelan (LeaPhy), Manchester
  • Bui Quang Minh (IQPNNI), Vienna
  • Derrick Zwickl (GARLI), Virginia
  • Thomas Keane (dprML), Cambridge
• Byproduct: SPEC-like CPU benchmark for phylogenetics
• Follow-up: (planned) ML competition at major conference with industrial sponsor
A Current Problem: Handling Multi-Gene Alignments
[Figure: alignment of Sequence 1 to Sequence 5 spanning Gene 1 and Gene 2]
Missing Data ≠ Gap Data
A Multi-Gene Model
LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
Challenge: devise efficient data structures for this
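The two-gene sum generalizes to any number of genes. The formula below is a sketch in notation that does not appear on the slides: it assumes the alignment is split into k partitions D_1, ..., D_k (one per gene, here Red and Yellow) that evolve independently given the tree T. Whether branch lengths and model parameters are shared or estimated per gene is a modelling choice the formula leaves open.

```latex
\log L(T) \;=\; \sum_{g=1}^{k} \log L(T \mid D_g)
```

With k = 2, D_1 = Red and D_2 = Yellow this is exactly the expression on the slide.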
Why are Individual Branches per Gene a Challenge?
Outlook
• Tree of Life
• What is a good alignment in a phylogenetic context?
• Simultaneous alignment and tree building
• More HPC & memory-aware programming
• Multi-core architectures
• Models for “gappy” multi-gene alignments
Acknowledgements
BlueGene Project
Michael Ott, TUM
Srinivas Aluru, Jaroslaw Zola, Iowa State
Dan Janies, Andrew Johnson, Ohio State
IBM CELL & Playstation
Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech
Christos Antonopoulos, Univ. of Thessaly
Bootstopping
Bernard Moret, Masoud Alipour, EPFL
Olaf Bininda-Emonds, Univ. Jena
RAxML Web-Server
Jacques Rougemont, SIB
Terri Liebowitz, SDSC
AxParafit/AxPcoords
Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen
Datasets for Studies
Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)
Thank you for your Attention !
Lake Geneva, Switzerland