
HPC in Bioinformatics and Genomics

Daniel Kahn, Clément Rezvoy and Frédéric Vivien

Lyon 1 University & INRIA HELIX team

LIP-ENS & INRIA GRAAL team


D. Kahn, HPC in Bioinformatics and Genomics IBM Campus Day

Moore’s law in genomics

- Exponential increase
- Doubling time ~20 months


New high-throughput technologies

- Pyrosequencing (Roche 454 GS FLX)
  - 100-400 Mb per run (1 day)
  - Long reads (up to 400 bp)
  - ~15 Gb raw data
- Illumina Genome Analyzer
  - 1,500 Mb per run (3 days)
  - Short reads (35 bp)
  - ~1 Tb raw data
- Applied Biosystems SOLiD sequencer
  - 3,000 Mb per run (5 days)
  - Short reads (35 bp)
  - ~15 Tb raw data


Uses of high throughput sequencing

- Population genomics
  - For instance, the 1000 Genomes Project
- Individual sequencing
- Metagenomics
  - Comprehensive appraisal of microbial communities and gene repertoires in various environments
- Phylogenomics
  - Resolving the history of genes and species
- …
- Each of these is a computing challenge in its own right


Large scale protein sequence analysis

- All vs. all
- The challenge of protein modularity
  - Most proteins are combinatorial arrangements of conserved modules (domains)

[Figure: modular proteins shown as examples: LuxR, GerE, FixJ, OmpR, Spo0A, NtrC, NifA]


The ProDom project

- Need for an automated process in order to allow for comprehensive analysis
- Automatically decompose proteins into domains and cluster domain families, using MKDOM2
- Generate multiple alignments and trees for all families
- Automatically generate mutually consistent representations for all proteins


Resolving combinatorial proteins


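The MKDOM2 program

[Flowchart of one iteration. At the i-th iteration, a query is taken from the database (DB) and tested for internal repeats (yes / no), which determines the query passed to PSI-BLAST against the DB. The search ends in one of three outcomes: no match, matches, or repeat matches. DB changes: remove newly found domains, split modified sequences, sort by size. The updated DB is the input to the (i+1)-th iteration.]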
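The loop in the flowchart can be illustrated with a toy sketch. This is not MKDOM2 itself: exact substring matching stands in for PSI-BLAST, internal repeat detection is reduced to counting repeated hits inside one sequence, and the choice of the shortest remaining sequence as the query is an assumption suggested by the "sort by size" step. The function name `greedy_domain_families` is hypothetical.

```python
def greedy_domain_families(sequences):
    """Toy greedy decomposition in the spirit of the flowchart (not MKDOM2)."""
    # Sort by size; empty strings are dropped so every query is non-empty.
    db = sorted((s for s in sequences if s), key=len)
    families = []
    while db:
        query = db.pop(0)                 # assumption: shortest sequence is the query
        family = [query]
        remaining = []
        for seq in db:
            # Stand-in for PSI-BLAST: exact occurrences of the query in seq.
            parts = seq.split(query)
            hits = len(parts) - 1         # more than 1 means a repeat match
            if hits == 0:
                remaining.append(seq)     # no match: sequence is left unchanged
                continue
            family.extend([query] * hits)             # remove newly found domains
            remaining.extend(p for p in parts if p)   # split modified sequences
        families.append(family)
        db = sorted(remaining, key=len)   # sort by size before the next iteration
    return families
```

For example, `greedy_domain_families(["AAA", "AAABBB", "BBBCC"])` returns `[["AAA", "AAA"], ["BBB", "BBB"], ["CC"]]`: each pass removes one family and leaves shorter fragments for later passes, which is why every iteration depends on the database state left by the previous one.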


Drawbacks of sequential MKDOM2

- Greedy algorithm
- Scales quadratically
- Data follow Moore’s law

→ no longer tractable!
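A back-of-the-envelope derivation shows how quadratic scaling and exponential data growth combine. It assumes that the ~20-month doubling time quoted earlier applies to the input data and that run time grows exactly as the square of the data size. The symbols $N$ (data size), $T$ (sequential run time) and $t$ (time in months) are introduced here for illustration only.

```latex
N(t) = N_0 \, 2^{t/20}
\qquad\Longrightarrow\qquad
T(t) \propto N(t)^2 = N_0^2 \, 2^{t/10}
```

Under these assumptions the sequential run time doubles every 10 months, twice as fast as the data. Even if single-processor speed also doubled every 20 months, the wall-clock time would still grow as $2^{t/10} / 2^{t/20} = 2^{t/20}$.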


Parallelization of MKDOM2

- Parallelization of the main loop
- Distribute sequences for independent family construction
- Difficulties:
  - Heterogeneous run times for the main loop
  - Possible dependencies between families

→ Precalculate an all vs. all comparison in order to select independent queries
→ Send batches of independent sequences before worker nodes become idle
→ Verify family independence a posteriori
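The selection and verification steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the all vs. all comparison is available as a list of sequence-ID pairs, and it treats two queries as independent when neither they nor any of their hits overlap. All function names are hypothetical.

```python
from collections import defaultdict


def build_neighbours(hits):
    """Turn all vs. all hit pairs into a symmetric adjacency map."""
    neighbours = defaultdict(set)
    for a, b in hits:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def select_independent_batch(candidates, neighbours, batch_size):
    """Greedily pick queries whose hit neighbourhoods do not overlap."""
    batch, claimed = [], set()
    for query in candidates:
        footprint = {query} | neighbours.get(query, set())
        if footprint & claimed:
            continue                  # may interact with an already chosen query
        batch.append(query)
        claimed |= footprint
        if len(batch) == batch_size:
            break
    return batch


def conflicting_families(families):
    """A posteriori check: pairs of family indices that share a sequence ID."""
    owner, conflicts = {}, []
    for idx, members in enumerate(families):
        for seq_id in members:
            if seq_id in owner and owner[seq_id] != idx:
                conflicts.append((owner[seq_id], idx))
            else:
                owner[seq_id] = idx
    return conflicts
```

With hits `[("s1", "s2"), ("s2", "s3"), ("s4", "s5")]` and candidates `s1` to `s6`, a batch of three is `["s1", "s4", "s6"]`. The precomputed comparison is only a filter, so the a posteriori check remains necessary: `conflicting_families([["s1", "s2"], ["s2", "s3"], ["s4"]])` reports `[(0, 1)]`.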


Speed-up on medium scale test set

- 32 archaeal genomes
- 21.5 M amino acids


Large-scale test set

- 263 genomes
- 950,216 protein sequences
- 339 M amino acids
- Run on GRID’5000 (150 nodes)
- Half of the data set processed in only 20 hours


Database crunching


Increasing query sizes


Variable sizes of domain families


Heterogeneous run times

- ~1000-fold range


Large result queue


… yet efficient node usage

- 86% processor usage
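A small simulation shows how a usage figure of this kind can be computed when task run times are heterogeneous. It is a sketch under stated assumptions: "processor usage" is taken to mean busy time divided by available time up to the end of the run, and each task goes to whichever worker becomes idle first. The function name `simulate_dynamic_dispatch` and the task durations are hypothetical and are not the measurements behind the 86% figure.

```python
import heapq


def simulate_dynamic_dispatch(durations, n_workers):
    """Give each task, in order, to the worker that becomes idle first.

    Returns (makespan, usage), where usage is total busy time divided by
    n_workers * makespan.
    """
    free_at = [0.0] * n_workers          # time at which each worker is idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)   # earliest idle worker takes the task
        heapq.heappush(free_at, start + d)
    makespan = max(free_at)
    if makespan == 0:
        return 0.0, 0.0                  # nothing was run
    usage = sum(durations) / (n_workers * makespan)
    return makespan, usage
```

With two workers, `simulate_dynamic_dispatch([4, 1, 1, 1, 1], 2)` gives a makespan of 4.0 and a usage of 1.0, whereas the same tasks in the order `[1, 1, 1, 1, 4]` give a makespan of 6.0 and a usage of about 0.67. A long task arriving late leaves the other workers idle, which is the effect that heterogeneous run times have on usage.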


Full-scale protein domain analysis

- Needs to be scaled up 7-fold to process all of UniProt today!
- Will require stable MPI usage of ~1000 processors over the grid
- Appropriate infrastructure not yet identified
- Another program, MPI_MKDOM3, is envisioned to make full use of the precalculated all vs. all comparison

… required in order to keep coping with Moore’s law


INRA Toulouse: Emmanuel COURCELLE, Daniel KAHN

Lyon 1 University, INRIA HELIX project: Aurélie LAUGRAUD, Lauranne DUQUENNE, Daniel KAHN

LIP-ENS Lyon: Clément REZVOY, Frédéric VIVIEN

Support: PRABI, EU (EMBRACE & IMPACT), IN2P3, GRID’5000
