Crunching Huge Phylogenies. A. Stamatakis

Crunching Huge Phylogenies:A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM

BlueGene

Alexandros Stamatakis

Swiss Federal Institute of Technology Lausanne (EPFL)School of Computer & Communication Sciences

Laboratory for Computational Biology and BioinformaticsLausanne, Switzerland

&Swiss Institute of Bioinformatics

[email protected]/~stamatak

Alexandros Stamatakis, October 2007

The Missing Part

Data Assembly Tree AnalysisInference ?


The Missing Part

Data Assembly Tree Analysis


IBM BlueGene/Lsupercomputer


Rapid BootstrappingBootstopping Criterion


The Big Hardware Problem

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.


... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.


Sequence Data


... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.


Sequence Data

Application of HPC techniques will become much more important


Cache Hierarchy


Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook


Phylogenetics Input: “good” multiple Alignment Output: unrooted binary tree Various methods for phylogenetic

inference Neighbour Joining (fast & simple) Maximum Parsimony (relatively fast &

simple) Maximum Likelihood (complex & slow) Bayesian Methods (complex & slower)





ML & Bayesian: explicit model choice





Complex Methods & Models required to reconstruct large & complicated trees !

Focus of this talk is on Maximum Likelihood!





The real reason for working on ML: ......


Challenges for Phyloinformatics

Holy grail: “Tree of Life” What is a good alignment in a

phylogenetic context? Simultaneous alignment and tree building Improve/extend models ... but thereby size

of computable trees decreases! More HPC awareness Exploit multi-core architectures Amount of available data grows at a

higher rate than algorithms are getting faster


The algorithmic problem


The number of trees


The number of trees


The number of trees


The number of trees explodes!

BANG !







Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m


Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel


Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T


Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T




Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T




Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4




Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T




Seq 1Seq 1


Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

virtual root: vrvirtual root: vr


Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T




Seq 1Seq 1


Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

mm

vrvr




Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T




Seq 1Seq 1


Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

mm

vrvr

Lots of floating point operations!


Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T




Seq 1Seq 1


Seq 3Seq 3

optimize branch lengthsoptimize branch lengths




Maximum Likelihood


AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T




Seq 1Seq 1


Seq 3Seq 3

optimize model parametersoptimize model parameters




Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing


Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

RAxML Randomized Axelerated

Maximum Likelihood


Web & Grid Services RAxML Web-Server at San Diego Supercomputing

Center via www.phylo.org (CIPRES project) Web-Server at Vital-IT unit of Swiss Institute of

Bioinformatics phylobench.vital-it.ch/raxml-bb/ Includes novel search algorithm with 1 order of

magnitude run-time improvement Since Sept 3, about 700 jobs from 130 Ips Extension to SwissGrid planned Novel algorithm with Bootstopping to be

integrated into CIPRES portal soon RAxML integration into Distributed European

Infrastructure for Supercomputing Applications www.deisa.org started 10 days ago

Integration into Debian medical distribution


RAxML Black Box


RAxML Black Box

Why are Black Boxesuseful?







Levels of Parallelism

Embarrassing Parallelism

MPI, CORBA, Grid Technologies


Coarse-Grained Parallelism: MPI Version of RAxML

Master Process

Worker Processes

B-0B-1 B-3

B-2

B-4

PC-CLUSTER

InterconnectionNetwork




Inference Parallelism


MPI, algorithm-dependent




Inference Parallelism

Loop-Level Parallelism


MPI, algorithm-dependent

OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene,Clusters with fast Interconnect


Loop Level Parallelism

P

QR

P[i] = f(Q[i], R[i])

virtual root



P

QR

P[i] = f(Q[i], R[i])

virtual root

This operation uses ≥ 90% of total execution time !



P

QR

P[i] = f(Q[i], R[i])

virtual root

This operation uses ≥ 90% of total execution time !→ simple fine-grained parallelization



P

QR

virtual root



P

QR

virtual root



P

QR

virtual root



P

QR

virtual rootThe real reason for assuming independent evolution among sites: ......


Fine-Grained Parallelism:OpenMP version of RAxML


Fine-Grained Parallelism:OpenMP version of RAxML


HPC for ML (Bayesian) Proof of Concept & Programming

Techniques: RAxML on a Graphics Processing Unit RAxML on the IBM CELL & Playstation

Production Level Implementations: RAxML with OpenMP RaxML with MPI RAxML on BlueGene Multi-Core Architectures


HPC for ML (Bayesian) Proof of Concept & Programming

Techniques: RAxML on a Graphics Processing Unit RAxML on the IBM CELL & Playstation

Production Level Implementations: RAxML with OpenMP RaxML with MPI RAxML on BlueGene Multi-Core Architectures

A good excuse to buy one


RAxML-BlueGene Many slow processors: 1024 in one rack 512 MB or 1GB of main memory per node But: high performance network Challenges:

Distribute tree data structure among CPUs Exploit fast collective communication network

For optimal efficiency: loop-level + embarrassing parallelism hybrid parallelism with MPI

Test & Production Run Data With Olaf Bininda-Emonds, Jena: 2,182

mammalian sequences x 51,000 base pairs With Dan Janies, Ohio State: 270 Human

Haplotype Map sequences x 500,000 base pairs








To be presented at IEEE/ACM 2007 Supercomputing Conference.








Largest ML analysis to date in terms of memory footprint


Loop-Level Parallelism on BlueGene


50 Seqs x 23,385 bp


50 Seqs x 23,385 bp

Superlinear Speedup


250 Seqs x 403,581 bp



M

W

W

W

M

WW

W M W

W W

M

WW

W







Confidence Values Tree without node confidence

values is mostly useless Problem:

Confidence value calculation is major computational obstacle

We can compute large trees but not analyse them: compute ≠analyse !

Current Slow Methods Sampling with Bayesian methods Non-parametric Bootstrapping


A Tree with Confidence Values

Joint work with Marc Gottschling, Charite Hospital, Berlin


BootstrappingOriginal Alignment

perturbation

compute tree compute tree compute tree


BootstrappingOriginal Alignment

perturbation

compute tree compute tree compute tree

This needs to be done 100-1000 timesEmbarrassingly Parallel !


Two Questions How to compute Bootstraps faster? How many Bootstrap replicates do we

need?


Current Work:Rapid Bootstrapping Algorithm

Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences

Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values 0.95-0.99 (except one dataset at 0.91), average 0.97

Weighted topological distance < 6%, average 4% Program Acceleration: 8-20, average ≈ 15

Acceleration by one order of magnitude Full ML analysis (100BS + ML search) of datasets of

up to 5,000 sequences within less than 5 days on your desktop!

Allows for a sufficiently large number of Bootstrap replicates


Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments


Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

iterate


Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021


Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Compute Starting TreeCompute Starting Tree


Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Optimize Model Params &Optimize Model Params &Branch LengthsBranch Lengths


1111111111111111111111111111

0110221111111101102211111111 -110 -1101011110222011110111102220111 -105 -1051111111011202111111110112021 -100 -100

Rapid BootstrapUse Starting Tree &Use Starting Tree &

Model Params to compute Model Params to compute RELL scoresRELL scores


1111111111111111111111111111

0110221111111101102211111111 -110 -1101011110222011110111102220111 -105 -1051111111011202111111110112021 -100 -100

Rapid BootstrapUse Starting Tree &Use Starting Tree &

Model Params to compute Model Params to compute RELL scoresRELL scores

Sort by RELLSort by RELL


1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search: Thorough Search


1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search : Thorough Search

TT11: Quick Search on T: Quick Search on T00


1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap





1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap




sequential dependency is bad for parallelism


Scalability of Rapid Bootstrap



Some datasets are harder than others




ML-Scores: Garli, RAxML, PHYML 715 Sequences


Correlation 125 Taxa: 0.91


Support Value Distribution


Bootstrap Likelihood Values125 x 19,436

10,000 replicates only 195 non-trivial bipartitions


Bootstrap Likelihood Values125 x 19,436


3,491 rBCL sequencesRapid versus Standard BS

Correlation: 0.98


7,764 DNA Best Tree


7,764 DNA All Bipartitions


775 x 3,838 AA


New Opportunities

Assess Impact of Alignment Method on tree and support values

Test Bootstrap of the Bootstrap (double Bootstrap) procedures

Devise and empirically verify Bootstopping criteria


Bootstrap of the Bootstrap140 AA (Efron et al PNAS 1996)


Bootstrap of the Bootstrap3,491 rBCL


Bootstopping Rapid Bootstrapping allows to assess

Bootstopping criteria as follows1. Compute a high number of BS replicates (10,000)2. Devise topology-based bootstopping criterion and

apply it to these 10,000 replicates3. Compare support values induced by bootstopped

trees (say 300 replicates) with 10,000 replicates

We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences


Bootstopping Criterion

Every 50, 100, 150, ... replicates do a test: Say we have N BS trees Do the following 100 times:

Randomly split up this set of N trees into 2 equal sets S1, S2, of size N/2

Compute the bipartition support vectors for S1 and S2

Compute Pearson correlation of the support vectors

return average of the 100 Pearson correlations if average > 0.99 stop


Result Overview

Bootstopped between 100-400 (avg 213)

Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)

Correlation of all bipartitions > 0.995 (avg 0.997)


Bootstopping Best 140 AA


Bootstopping Best 404 DNA (Multi-Gene)


Bootstopping Best 994 DNA


Bootstopping All 994 DNA


Bootstopping Best 1,908 DNA


Bootstopping Best 2,554 DNA


Putting the Pieces together Blue-Gene: Can handle huge datasets

Use Cat approximation on BlueGene Further speedup of factor 3.5 Memory footprint reduction factor 4


8,864 Bacteria under GTR+Γand GTR+CAT

Log Likelihood Score under Γ

14 days14 days7 days7 days

Execution Time


Putting the Pieces together Blue-Gene: Can handle huge datasets

Use Cat approximation on BlueGene Further speedup of factor 3.5 Memory footprint reduction factor 4

Integrate rapid Bootstrap into BlueGene version

Additional speedup ≈ 15 Mechanisms available to accelerate

BlueGene version by factor 50-60 Integrate Bootstopping into BlueGene

Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!







Host-Parasite Co-Evolution

Hosts (eg Mammals) Parasites (eg Lice)



Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis



Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Statistical Test


What can HPC do forBioinformatics?Axelerated Parafit

“Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003

AxParafit (Axelerated Parafit) Statistical test of hypotheses of host-parasite co-

evolution C porting, optimization, BLAS integration Speedup up to factor 67 Master-Worker MPI-parallelization

Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks

Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html

SwissGrid-based Web-Server planned


AxParafit: Sequential Performance


AxParafit: Parallel Performance


The ML Benchmark:A Current Community Project

Standardized way required to test ML search programs Web-Server with real-world alignments and performance data

at Swiss Institute of Bioinformatics Many developers of popular ML programs involved

Stephane Guindon (PHYML) Montpellier Simon Wheelan (LeaPhy) Manchester Bui Quang Minh (IQPNNI) Vienna Derrick Zwickl (GARLI) Virginia Thomas Keane (dprML) Cambridge

Byproduct: SPEC-like CPU benchmark for phylogenetics Follow-up: (planned) ML competition at major conference with

industrial sponsor


A Current Problem:Handling Multi-Gene Alignments

Gene 1 Gene 2

Sequence 1

Sequence 5

Missing Data ≠ Gap Data


A Multi-Gene Model


A Multi-Gene Model


A Multi-Gene Model


A Multi-Gene Model

LogLH (T) = LogLh (T|Red)


LogLH (T) = LogLh (T|Red) +LogLH(T|Yellow)

A Multi-Gene Model


LogLH (T) = LogLh (T|Red) +LogLH(T|Yellow)

A Multi-Gene ModelChallenge: devise efficient data structures for this


Why are Individual Branches per Gene a Challenge?


Why are Individual Branches per Gene a Challenge?


Outlook


Outlook

Tree of Life What is a good alignment in a

phylogenetic context? Simultaneous alignment and tree building More HPC & memory-aware programming Multi-core architectures Models for “gappy” multi-gene alignments


Acknowledgements BlueGene Project

Michael Ott, TUM

Srinivas Aluru, Jaroslaw Zola, Iowa State

Dan Janies, Andrew Johnson, Ohio State

IBM CELL & Playstation

Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech

Christos Antonopoulos, Univ. of Thessaly

Bootstopping

Bernard Moret, Masoud Alipour, EPFL

Olaf Bininda-Emonds, Univ. Jena

RAxML Web-Server

Jacques Rougemont, SIB

Terri Liebowitz, SDSC

AxParafit/AxPcoords

Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen

Datasets for Studies

Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)


Thank you for your Attention !

Lake Geneva, Switzerland