Luciano Milanesi 2007/12/19, Napoli 1 HPC and GRID challenges in Bioinformatics. Milanesi Luciano National Research Council Institute of Biomedical Technologies, Milan, Italy [email protected]


Luciano Milanesi 2007/12/19, Napoli 1

HPC and GRID challenges in Bioinformatics.

Milanesi Luciano, National Research Council, Institute of Biomedical Technologies, Milan, Italy [email protected]


Introduction

• The potential of new biological and biomedical technological platforms in connection with HPC and GRID technology will be particularly useful to deal with the increasing amount, complexity, and heterogeneity of biological and biomedical data.

• Bioinformatics applications for eHealth have become an ideal research area where computer scientists can apply and further develop new intelligent computation methods, in both experimental and theoretical cases.

The European Bioinformatics initiative based on infrastructure created by the EGEE and BioinfoGRID and related projects will be illustrated.


Introduction: Post-genomic

• “Post-genomic” focuses on the new tools and new methodologies emerging from the knowledge of genome sequences.

• Production and use of DNA microarrays and analysis of the transcriptome, proteome and metabolome are the different topics developed in this class.


The human organism:

• ~ 3 billion nucleotides
• ~ 30,000 genes coding for
• ~ 100,000-300,000 transcripts
• ~ 1-2 million proteins
• ~ 60 trillion cells of
• ~ 300 cell types in
• ~ 14,000 distinguishable morphological structures


ICT and Genomics

• A key development in the computational world has been the arrival of de novo design algorithms that use all the spatial information available within the target to design novel drugs.

• Coupling these algorithms to the rapidly growing body of information from structural genomics, together with new ICT technology (e.g. HPC, GRID, Web Services, etc.), provides a powerful new possibility for exploring design against a broad spectrum of genomics targets, including more challenging techniques such as protein–protein interactions, docking, molecular dynamics, systems biology, gene networks, etc.


System Biology for Health


EGEE Related EU projects

EUGRIDGRID

ISSeG

BEinGRID

DILIGENT: A DIgital Library Infrastructure on Grid ENabled Technology

EUIndia


BioinfoGRID


BioinfoGRID Project


• The BIOINFOGRID project proposes to combine the Bioinformatics services and applications for molecular biology users with the Grid Infrastructure created by the EGEE and EGEE-II projects.

• The BIOINFOGRID initiative plans to perform research in genomics, transcriptomics, proteomics and molecular dynamics applications based on GRID technology.


Genomics applications in GRID

• GRID analysis of genomic databases: integration of precomputed data, gene identification, differentiation of pseudogenes, comparative genome analysis, etc.

• Functional protein analysis in GRID: apply functional protein domain annotation to large protein families using GRID resources and related databases.


Bioinformatics Applications

• CSTminer
Goal: compare the entire human genome against the entire genomes of other animals (mouse, dog, etc.)

First test: human against mouse
Challenge dimension:
– 850 million BLAST comparisons (~ 2 sec of CPU for each comparison)
– More than 50 CPU years needed
– More than 65,000 jobs submitted
– Up to 2 million comparisons per hour
– 22 different farms used
– More than 900 different hosts used
– 2 months of run on the INFN-Grid infrastructure

Second test: some human genes against many animals
Challenge dimension:
– 1.7 million comparisons
– More than 900 CPU hours needed
– < 1 day on the INFN-Grid infrastructure
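The CPU figures quoted for the two CSTminer tests can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check, assuming exactly 2 CPU seconds per comparison and a 365-day year; the function and variable names are illustrative and not part of CSTminer.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR


def cpu_seconds(comparisons, seconds_each=2.0):
    """Total CPU time, assuming a fixed cost per BLAST comparison."""
    return comparisons * seconds_each


# First test: 850 million comparisons, expressed in CPU years.
first_test_cpu_years = cpu_seconds(850_000_000) / SECONDS_PER_YEAR

# Second test: 1.7 million comparisons, expressed in CPU hours.
second_test_cpu_hours = cpu_seconds(1_700_000) / SECONDS_PER_HOUR

# Lower bound on wall-clock time if the peak rate of 2 million
# comparisons per hour had been sustained for the whole first test.
first_test_days_at_peak = 850_000_000 / 2_000_000 / 24

print(f"first test:  {first_test_cpu_years:.1f} CPU years")
print(f"second test: {second_test_cpu_hours:.0f} CPU hours")
print(f"first test at peak rate: {first_test_days_at_peak:.1f} days")
```

Under these assumptions the first test comes to roughly 54 CPU years and the second to roughly 944 CPU hours, consistent with the "more than 50 CPU years" and "more than 900 CPU hours" quoted above. The peak rate would finish the first test in under three weeks, so the two-month run implies an average throughput well below the peak.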


Proteomics Applications in GRID

• Protein surface calculation: the grid will be used to compute the volumetric description of the protein, obtaining a precise representation of the corresponding surface.


Transcriptomics applications

• Computational GRIDs to analyse transcriptomics data

Description
• To develop algorithmic tools for gene expression data analysis in GRID: evaluate the computational tools for extracting biologically significant information from gene expression data.

• Algorithms will focus on clustering steady state and time series gene expression data, multiple testing and meta analysis of different microarray experiments from different groups, and identification of transcription sites.


Transcriptomics applications

Data analysis tools specific to bioinformatics allow the GRID user to store and search genetics data, with direct access to the data files stored on the Data Storage Elements of GRID servers.

Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.

Scientific instruments and experiments provide huge amounts of data from microarrays.


Influenza A Neuraminidase

• Grid-enabled High-throughput in-silico Screening against Influenza A Neuraminidase

• Encouraged by the success of the first EGEE biomedical data challenge against malaria (WISDOM), the second data challenge battling avian flu was kicked off in April 2006 to identify new drugs for the potential variants of the Influenza A virus.

• Mobilizing thousands of CPUs on the Grid, the six-week high-throughput screening activity consumed over 100 CPU years of computing power.

• This project demonstrated that a world-wide Grid infrastructure can efficiently deploy large-scale virtual screening to speed up the drug design process.


E-infrastructure shared between Europe and Latin America

Identification of Applications in EELA

• EELA Biomedical Applications Fall into Three Categories
– Bioinformatics Applications: BLAST in Grids. Phylogeny.
– Computational Biochemical Processes: Wide in-Silico Docking on Malaria (WISDOM).
– Biomedical Models: GEANT4 Application for Tomographic Emission (GATE)


ACGT Project


EuChinaGRID

• Facility for the prediction of the three dimensional structure of “never born proteins”

[Figure: workflow in the Euchina Virtual Organization (EGEE), linking the User Interface / Portal, the Computing Element and the Storage Element. Steps: 1. Submit sequence; 2. Transfer application; 3. Store protein (PDB coordinates); 4. Visualize.]


Grid added value for international collaboration on neglected diseases

• Grids offer unprecedented opportunities for sharing information and resources world wide

Grids are unique tools for:
– Collecting and sharing information (Epidemiology, Genomics)
– Networking experts
– Mobilizing resources routinely or in emergency (vaccine & drug discovery)


Molecular applications in GRID

Aim: to run docking and Molecular Dynamics simulations, which usually take a very long time to complete.

Description
• Wide In Silico Docking On Malaria initiative WISDOM-II: this project performs docking and molecular dynamics simulations on the GRID platform to discover new targets for neglected diseases. Analysis can be performed notably using the data generated by the WISDOM application on the EGEE infrastructure.


Grid impact on drug discovery workflow down to drug delivery (1/2)

• Grids provide the necessary tools and data to identify new biological targets
– Bioinformatics services (database replication, workflow…)
– Resources for CPU intensive tasks such as genomics comparative analysis, inverse docking…

• Grids provide the resources to speed up lead discovery
– Large scale in silico docking to identify potentially promising compounds
– Molecular dynamics computations to refine virtual screening and further assess selected compounds


Grid impact on drug discovery workflow down to drug delivery (2/2)

• Grids provide environments for epidemiology
– Federation of databases to collect data in endemic areas to study a disease and to evaluate the impact of vaccines and vector control measures
– Resources for data analysis and mathematical modelling

• Grids provide the services needed for clinical trials
– Federation of databases to collect data in the centres participating in the clinical trials

• Grids provide the tools to monitor drug delivery
– Federation of databases to monitor drug delivery


Virtual screening process by docking

Docking: predict how small molecules bind to a receptor of known 3D structure

[Figure: Starting compound database + Starting target structure model → DOCKING → Predicted binding models → Post-analysis → Compounds for assay]

There are successful examples
– rapid
– cost effective…

But there are limitations
– CPU and storage needed


Grid-enabled high throughput virtual screening by docking

[Figure: a few target structures and millions of chemical compounds are fed to the docking software.]

• 1 to 30 min per docking
• A few MB per output
• 100 CPU years, 1 TB

• Large scale deployment on grid infrastructure

• Challenges:
– Speed up the process
– Manage the data
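The figures above (1 to 30 minutes per docking, 100 CPU years in total) can be related to each other with a short calculation. This is only an illustrative sketch: the function names are made up for this example, and the 1,000-CPU pool is a hypothetical value, not a number from the slides.

```python
MINUTES_PER_CPU_YEAR = 365 * 24 * 60


def dockings_in_budget(cpu_years, minutes_per_docking):
    """How many dockings fit in a CPU budget at a fixed cost per docking."""
    return int(cpu_years * MINUTES_PER_CPU_YEAR // minutes_per_docking)


def wall_clock_days(cpu_years, n_cpus):
    """Ideal wall-clock time if the work is spread evenly over n_cpus."""
    return cpu_years * 365 / n_cpus


slowest = dockings_in_budget(100, 30)   # every docking takes 30 minutes
fastest = dockings_in_budget(100, 1)    # every docking takes 1 minute
days_on_1000_cpus = wall_clock_days(100, 1000)  # hypothetical pool size

print(f"dockings in 100 CPU years: {slowest:,} to {fastest:,}")
print(f"ideal wall-clock time on 1000 CPUs: {days_on_1000_cpus} days")
```

Under these assumptions a 100 CPU-year budget covers somewhere between a couple of million and a few tens of millions of dockings, which is the scale implied by "millions of chemical compounds" against "a few target structures". The wall-clock estimate ignores queueing, failures and data transfer, so it is a lower bound.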


WISDOM-II, second large scale docking deployment against malaria

Malaria target / Involved in / Biology partners

• Tubulin from Plasmodium/plant/mammal. Involved in: parasite cell replication. Biology partner: CEA, Acamba project, France
• DHFR from Plasmodium falciparum. Involved in: parasite DNA synthesis. Biology partner: U. of Modena, Italy
• DHFR from Plasmodium vivax. Involved in: parasite DNA synthesis. Biology partners: U. of Los Andes, Venezuela; U. of Modena, Italy
• GST from Plasmodium falciparum. Involved in: parasite detoxification. Biology partner: U. of Pretoria, South Africa


Grid infrastructures and projects contributing to WISDOM-II

[Figure: contributors to WISDOM-II. Legend categories: European grid infrastructure; regional/national grid infrastructure; European grid project. Names shown in the figure: EGEE, EELA, EUMedGrid, EUChinaGrid, Auvergrid, TWGrid, EMBRACE, BioinfoGrid, SHARE.]


Filtering process

1,000,000 chemical compounds
→ Sorting based on scoring in different parameter sets; consensus scoring
10,000 compounds selected
→ Based on key interactions
1,000 compounds
→ Key interactions, binding modes, descriptors, knowledge of active site
100 compounds
→ MD
50 compounds to be tested in experimental lab

Credit: V. Kasam, Fraunhofer Institute
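The first filtering stage combines scores from different parameter sets into a consensus. The slides do not say which consensus rule was used, so the sketch below assumes a simple rank-sum rule: each compound is ranked within every parameter set and the compounds with the lowest total rank are kept. All compound names and score values are invented for illustration.

```python
def rank_positions(scores):
    """Map compound -> rank within one score table (0 = best, lower score = better)."""
    ordered = sorted(scores, key=scores.get)
    return {compound: rank for rank, compound in enumerate(ordered)}


def consensus_select(score_tables, keep):
    """Keep the `keep` compounds with the lowest summed rank over all tables.

    Assumes every table scores the same set of compounds.
    """
    totals = {}
    for table in score_tables:
        for compound, rank in rank_positions(table).items():
            totals[compound] = totals.get(compound, 0) + rank
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda c: (totals[c], c))[:keep]


# Hypothetical docking scores from two parameter sets (lower is better).
set_a = {"cpd1": -9.1, "cpd2": -7.4, "cpd3": -8.8, "cpd4": -5.0}
set_b = {"cpd1": -8.4, "cpd2": -8.5, "cpd3": -8.2, "cpd4": -4.9}

selected = consensus_select([set_a, set_b], keep=2)
print(selected)
```

Rank-based consensus avoids mixing score values that are not on a common scale across parameter sets. The later stages of the funnel (key interactions, binding modes, MD) rely on structural knowledge and are not captured by this sketch.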


A grid for neglected diseases

Use the grid technology to foster research and development on malaria and other neglected diseases

• Univ. Los Andes: Biological targets, Malaria biology
• LPC Clermont-Ferrand: Biomedical grid
• SCAI Fraunhofer: Knowledge extraction, Chemoinformatics
• Univ. Modena: Biological targets, Molecular Dynamics
• ITB CNR: Bioinformatics, Molecular modelling
• Univ. Pretoria: Bioinformatics, Malaria biology
• Academia Sinica: Grid user interface
• HealthGrid: Biomedical grid, Dissemination
• CEA, Acamba project: Biological targets, Chemogenomics
• BioinfoGRID: Bioinformatics Grid

Contacts also established with WHO, Microsoft, TATRC, Argonne, SDSC, SERONO, NOVARTIS, Sanofi-Aventis, and hospitals in sub-Saharan Africa.


The Cell Cycle

• Cell Cycle:
– repeated sequence of events which leads to the division of a mother cell into daughter cells
– biological process frequently studied in correlation to tumour disease
– it is considered a valuable target for drug discovery in the context of cancer and neurodegenerative disease


Systems Biology Approach

• Systems biology studies how biological functions emerge from the protein-protein interactions in living systems;

• The complexity of this biological process lies in the high number of genes and networks of protein interactions involved;

• The quantification of the behavior of each cell cycle component has a crucial role in understanding the complex mechanism of cell cycle regulation.


System Biology: Cell Cycle


Simulation Section

2D plot: image exported in png using GnuPlot

The simulation of a single ODE system describing a cell cycle model
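The cell cycle model itself is not given in the slides. As a minimal sketch of what simulating a single ODE system involves, the example below integrates a two-variable Lotka-Volterra oscillator with a fixed-step fourth-order Runge-Kutta scheme. The oscillator is purely a stand-in chosen because it has a known conserved quantity; the parameter values, function names and output format are all assumptions.

```python
import math


def derivatives(state, a=1.0, b=0.5, c=1.0, d=0.5):
    """Right-hand side of the stand-in oscillator (not a real cell cycle model)."""
    x, y = state
    return (a * x - b * x * y, -c * y + d * x * y)


def rk4_step(f, state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (p + 2 * q + 2 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )


def simulate(f, state, dt, steps):
    """Return a list of (time, x, y) samples, including the initial state."""
    trajectory = [(0.0, *state)]
    for i in range(1, steps + 1):
        state = rk4_step(f, state, dt)
        trajectory.append((i * dt, *state))
    return trajectory


def invariant(x, y, a=1.0, b=0.5, c=1.0, d=0.5):
    """Quantity conserved by the exact Lotka-Volterra dynamics."""
    return d * x - c * math.log(x) + b * y - a * math.log(y)


def to_columns(trajectory):
    """Whitespace-separated columns, a layout gnuplot can read directly."""
    return "\n".join(f"{t:.2f} {x:.6f} {y:.6f}" for t, x, y in trajectory)


trajectory = simulate(derivatives, (1.0, 1.0), dt=0.01, steps=2000)
drift = abs(invariant(*trajectory[-1][1:]) - invariant(*trajectory[0][1:]))
print(to_columns(trajectory[:3]))
print(f"invariant drift over the run: {drift:.2e}")
```

If the column output is saved to a file, a gnuplot session along the lines of `set terminal png`, `set output "plot.png"` and `plot "data.dat" using 1:2 with lines` would produce a 2D PNG plot of the kind the slide mentions; the file names here are placeholders.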


Genetic Diseases

High-throughput techniques (e.g. DNA microarrays) to screen the whole genome

Low reliability

Validation through Tissue Microarray

Tissue Microarray in GRID


Genes and proteins detection

Tissue Microarray in GRID


[Figure: a User Interface (UI) and an AMGA server dispatch work to several GRID nodes, each with Computing Elements (CE) and Storage Elements (SE), where the elaboration runs.]

Example: edge detection on every TMA on GRID having “age”>80 AND “gender”=F AND “disease”=colon cancer

Tissue Microarray in GRID
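The example query selects tissue microarray (TMA) images by metadata before the image processing is dispatched to grid nodes. The sketch below mimics that selection on an in-memory list and then spreads the selected images over nodes in round-robin order. It does not use the AMGA client API; the record layout, file names and node names are hypothetical.

```python
def matches(record):
    """The example predicate: age > 80 AND gender = F AND disease = colon cancer."""
    return (
        record["age"] > 80
        and record["gender"] == "F"
        and record["disease"] == "colon cancer"
    )


def assign_round_robin(items, nodes):
    """Distribute items over nodes in turn; returns node -> list of items."""
    plan = {node: [] for node in nodes}
    for index, item in enumerate(items):
        plan[nodes[index % len(nodes)]].append(item)
    return plan


# Hypothetical metadata entries, one per TMA image stored on the grid.
catalogue = [
    {"file": "tma_001.tif", "age": 84, "gender": "F", "disease": "colon cancer"},
    {"file": "tma_002.tif", "age": 67, "gender": "F", "disease": "colon cancer"},
    {"file": "tma_003.tif", "age": 91, "gender": "M", "disease": "colon cancer"},
    {"file": "tma_004.tif", "age": 88, "gender": "F", "disease": "colon cancer"},
    {"file": "tma_005.tif", "age": 82, "gender": "F", "disease": "breast cancer"},
    {"file": "tma_006.tif", "age": 95, "gender": "F", "disease": "colon cancer"},
]

selected = [record["file"] for record in catalogue if matches(record)]
plan = assign_round_robin(selected, ["node_a", "node_b"])

for node, files in plan.items():
    print(node, files)
```

In a real deployment the selection would be evaluated by the metadata server and the edge detection would run as grid jobs close to the Storage Elements holding the images; round-robin is just the simplest placement policy to show.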


Deployment of BLAST in Grid

• A large fraction of the biological data produced is publicly available on web or FTP sites; the data can be downloaded as “flat files”.

• A procedure has been set up to
– Check the remote site for an updated version of the DBs
– Automatically download the data
– Register the file in a grid catalogue (LFC)
– Create a DB index for its use with BLAST (using the Grid)
– Register the index file(s) in the grid catalogue (LFC)
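The five steps of the procedure can be sketched as a small driver that calls one function per step. The real implementation is not shown in the slides, so every step is passed in as a callable and the demonstration uses fake ones; `UpdateSteps`, `update_database` and the file names are hypothetical, and no actual FTP, LFC or BLAST commands are invoked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UpdateSteps:
    """One callable per step of the procedure; all are supplied by the caller."""
    remote_timestamp: Callable[[str], float]   # check the remote site
    download: Callable[[str], str]             # returns the downloaded file name
    register: Callable[[str], None]            # register a file in the catalogue
    build_index: Callable[[str], List[str]]    # returns the index file names


def update_database(name: str, last_seen: Dict[str, float], steps: UpdateSteps) -> bool:
    """Run the procedure for one database; return True if an update happened."""
    stamp = steps.remote_timestamp(name)
    if stamp <= last_seen.get(name, float("-inf")):
        return False                      # nothing newer on the remote site
    local_file = steps.download(name)
    steps.register(local_file)
    for index_file in steps.build_index(local_file):
        steps.register(index_file)
    last_seen[name] = stamp
    return True


# Demonstration with fake steps that only record what was called.
log = []


def fake_download(name):
    log.append(("download", name))
    return f"{name}.fasta"


def fake_index(path):
    log.append(("index", path))
    return [f"{path}.idx"]


fake_steps = UpdateSteps(
    remote_timestamp=lambda name: 100.0,
    download=fake_download,
    register=lambda path: log.append(("register", path)),
    build_index=fake_index,
)

seen = {}
first_run = update_database("exampledb", seen, fake_steps)
second_run = update_database("exampledb", seen, fake_steps)
print(first_run, second_run, log)
```

Keeping each step behind a callable lets the same driver be exercised without a grid, and makes the ordering (data file registered before its indexes) explicit.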


Biological Database handling

• The Automatic Updater (AU) constantly monitors FTP sites, looking for the newest version of each database.
– When a new timestamp is detected on an FTP site, the newest version is automatically downloaded and replaces the older version on the grid.
– Before the older version is cleared, an xdelta patch is computed so that the old version can be regenerated starting from the new one.
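Storing a patch that regenerates the old version from the new one is a reverse delta. xdelta works on binary data; the sketch below only illustrates the idea on lists of text lines, using Python's `difflib` as a stand-in rather than xdelta itself. The patch format and the sample records are invented for this example.

```python
from difflib import SequenceMatcher


def reverse_patch(new_lines, old_lines):
    """Build instructions that rebuild old_lines when applied to new_lines."""
    patch = []
    matcher = SequenceMatcher(a=new_lines, b=old_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared content: store only a reference into the new version.
            patch.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Content that exists only in the old version must be stored.
            patch.append(("data", old_lines[j1:j2]))
        # "delete": lines present only in the new version, nothing to store.
    return patch


def apply_patch(new_lines, patch):
    """Regenerate the old version from the new version and the patch."""
    restored = []
    for op in patch:
        if op[0] == "copy":
            restored.extend(new_lines[op[1]:op[2]])
        else:
            restored.extend(op[1])
    return restored


old_version = [">seq1", "ACGT", ">seq2", "GGCC"]
new_version = [">seq1", "ACGT", ">seq2", "GGCA", ">seq3", "TTAA"]

patch = reverse_patch(new_version, old_version)
restored = apply_patch(new_version, patch)
print(patch)
print(restored == old_version)
```

Because unchanged content is stored as references into the new version, the patch stays small when successive releases of a database differ only slightly, which is what makes keeping old versions affordable.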


Biological Database handling

• This data management software dynamically replicates each database according to its usage, balancing the number of replicas (and hence performance) against the disk space occupied.

• It relies on a statistical analysis of database usage by the grid jobs, working on data acquired after each job execution: grid queue times, database set-up times and overall job computation time.

• We face complex data challenges, performing both the parsing of the output results and the storage of the data in the database directly from the GRID.


Results

• To make this software rapidly accessible, a user interface has been developed.

• It is used to submit jobs to the grid infrastructure, to visualize the results in a clear form and to hide the complexity of the distributed platform.


Results

• The main feature of the portal is the possibility to hide completely the JDL script layer for grid job submission.

• While it is still possible to submit a simple job to the grid by writing one's own JDL script, the idea is to hide this process and make grid use more user-friendly for the bioinformatics community.


Results

• The interfaces to application jobs are automatically generated by the conversion of XML files that describe both the end user parameters and the structure of the JDL scripts that have to be automatically generated to submit the jobs.
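As an illustration of generating a JDL script from an XML application description, the sketch below parses a small XML document and emits a job description. The XML schema, the application, its parameters and the file names are all hypothetical, since the portal's actual format is not shown in the slides. The emitted attributes (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`) are standard JDL attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical application description; not the portal's real schema.
APP_XML = """
<application name="blast">
  <executable>run_blast.sh</executable>
  <parameter name="database" label="Database" default="nr"/>
  <parameter name="evalue" label="E-value" default="1e-5"/>
  <input>query.fasta</input>
  <output>result.txt</output>
</application>
"""


def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def build_jdl(xml_text, user_values):
    """Turn the XML description plus the user's form values into JDL text."""
    app = ET.fromstring(xml_text)
    executable = app.findtext("executable")

    args = []
    for param in app.findall("parameter"):
        name = param.get("name")
        # Use the value typed by the user, else the default from the XML.
        value = user_values.get(name, param.get("default"))
        args.append(f"--{name} {value}")
    arguments = " ".join(args)

    inputs = [executable] + [e.text for e in app.findall("input")]
    outputs = ["std.out", "std.err"] + [e.text for e in app.findall("output")]

    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {jdl_list(inputs)};",
        f"OutputSandbox = {jdl_list(outputs)};",
    ]
    return "\n".join(lines)


jdl_text = build_jdl(APP_XML, {"evalue": "1e-10"})
print(jdl_text)
```

The same XML description could also drive the generation of the web form (one field per `parameter` element), so that the user only ever sees the form and never the JDL.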


Results

• A selection can be made among different databases against which to perform the analysis: all these databases are updated automatically.

• The figure reports a summary of the submitted application jobs, with information about the analysis software, the global computation status and the user interface used for submission.


Italian Bioinformatics Networks

Milano LITBIO- Laboratory of Interdisciplinary Technologies in BIOinformatics

Bari LIBI- Laboratory for International BIoinformatics

Napoli LAB GTP- LABoratory for the development of Bioinformatics tools and their integration with Genomics, Transcriptomics and Proteomics data.

Bari LBBM - Bioinformatics Laboratory for the Molecular Biodiversity

30 Research Nodes


CNR-BIOINFORMATICS Networks

National Research Council

CNR-Bioinformatics project

24 CNR Research Nodes


Italian PON GRID based Networks


Virtual Physiological Human

• Concept basis: the International Physiome Project, www.physiome.org

• Computational frameworks and ICT-based tools for multiscale models of the human anatomy, physiology and pathology

• Libraries of data and toolbox for simulation and visualisation

Patient-specific model from biosignals and images, including molecular images

Loukianos Gatzouli, ICT for Health


Acknowledgments

• BioinfoGRID http://www.bioinfogrid.eu

• EGEE: Enabling Grids for E-sciencE project http://www.eu.egee.org

• EELA: e-Infrastructure between Europe and Latin America project http://www.eu-eela.org/index.htm

• Euchinagrid: Interconnection & Interoperability of Grids between Europe & China project http://www.euchinagrid.org/

• FIRB-MIUR LITBIO: Laboratory for Interdisciplinary Technologies in Bioinformatics http://www.litbio.org