Luciano Milanesi 2007/12/19, Napoli 1
HPC and GRID challenges in Bioinformatics.
Luciano Milanesi, National Research Council, Institute of Biomedical Technologies, Milan, Italy [email protected]
Introduction
• The potential of new biological and biomedical technological platforms in connection with HPC and GRID technology will be particularly useful to deal with the increasing amount, complexity, and heterogeneity of biological and biomedical data.
• Bioinformatics applications for eHealth have become an ideal research area where computer scientists can apply and further develop new intelligent computation methods, in both experimental and theoretical cases.
• The European bioinformatics initiatives built on the infrastructure created by EGEE, BioinfoGRID and related projects will be illustrated.
Introduction: Post-genomic
• “Post-genomic” focuses on the new tools and new methodologies emerging from the knowledge of genome sequences.
• Production and use of DNA microarrays and analysis of the transcriptome, proteome and metabolome are the different topics developed in this class.
The human organism:
• ~3 billion nucleotides
• ~30,000 genes coding for
• ~100,000-300,000 transcripts
• ~1-2 million proteins
• ~60 trillion cells of
• ~300 cell types in
• ~14,000 distinguishable morphological structures
ICT and Genomics
• A key development in the computational world has been the arrival of de novo design algorithms that use all available spatial information to be found within the target to design novel drugs.
• Coupling these algorithms to the rapidly growing body of information from structural genomics, together with new ICT technologies (e.g. HPC, GRID, Web Services, etc.), provides a powerful new possibility for exploring design against a broad spectrum of genomics targets, including more challenging techniques such as:
– protein–protein interactions, docking, molecular dynamics, systems biology, gene networks, etc.
System Biology for Health
EGEE Related EU projects
EUGRIDGRID
ISSeG
BEinGRID
DILIGENT: A DIgital Library Infrastructure on Grid ENabled Technology
EUIndia
BioinfoGRID
BioinfoGRID Project
• The BIOINFOGRID project proposes to combine the bioinformatics services and applications for molecular biology users with the Grid infrastructure created by the EGEE and EGEE-II projects.
• The BIOINFOGRID initiative plans to perform research on genomics, transcriptomics, proteomics and molecular dynamics applications based on GRID technology.
Genomics applications in GRID
• GRID analysis of genomic databases: integration of precomputed data, gene identification, differentiation of pseudogenes, comparative genome analysis, etc.
• Functional protein analysis in GRID: apply functional protein domain annotation to large protein families, using the GRID and related databases.
Bioinformatics Applications
• CSTminer
Goal: compare the entire human genome against the entire genomes of several animals (mouse, dog, etc.)
First test: human against mouse
Challenge dimension:
– 850 million BLAST comparisons (~2 s of CPU for each comparison)
– More than 50 CPU years needed
– More than 65,000 jobs submitted
– Up to 2 million comparisons per hour
– 22 different farms used
– More than 900 different hosts used
– 2 months of running on the INFN-Grid infrastructure
Second test: some human genes against many animals
Challenge dimension:
– 1.7 million comparisons
– More than 900 CPU hours needed
– < 1 day on the INFN-Grid infrastructure
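The CPU figures quoted for the two CSTminer tests can be checked with simple arithmetic. The sketch below only uses the numbers on the slide (comparison counts, ~2 s per comparison, peak rate of 2 million comparisons per hour); the function and variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365


def cpu_hours(comparisons: int, seconds_each: float = 2.0) -> float:
    """Total CPU time in hours for a batch of BLAST comparisons."""
    return comparisons * seconds_each / SECONDS_PER_HOUR


first = cpu_hours(850_000_000)   # first test: human against mouse
second = cpu_hours(1_700_000)    # second test: some human genes against many animals

# CPUs that must be busy at the same time to sustain the quoted peak rate
peak_cpus = 2_000_000 * 2.0 / SECONDS_PER_HOUR

print(f"first test:  {first / HOURS_PER_YEAR:.1f} CPU years")   # 53.9 CPU years
print(f"second test: {second:.0f} CPU hours")                   # 944 CPU hours
print(f"peak load:   about {peak_cpus:.0f} CPUs in parallel")   # about 1111 CPUs
```

Both results agree with the slide: more than 50 CPU years for the first test and more than 900 CPU hours for the second.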
Proteomics Applications in GRID
• Protein surface calculation: the grid will be used to compute the volumetric description of the protein, yielding a precise representation of the corresponding surface.
Transcriptomics applications
• Computational GRIDs to analyse transcriptomics data
Description
• To provide algorithmic tools for gene expression data analysis in GRID: evaluate the computational tools for extracting biologically significant information from gene expression data.
• Algorithms will focus on clustering steady-state and time-series gene expression data, multiple testing and meta-analysis of different microarray experiments from different groups, and identification of transcription sites.
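The slide lists multiple testing among the planned algorithms without naming a method. As one illustration, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, a common way to control the false discovery rate when thousands of genes are tested at once. The choice of method, the function name and the p-values are assumptions, not taken from the slides.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max
    return sorted(order[:k_max])


# Hypothetical per-gene p-values from a differential expression test
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))  # [0, 1]
```

Each gene is independent of the others in this procedure only through the final ranking, so the per-gene p-values could be computed in separate grid jobs and merged afterwards.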
Transcriptomics applications
• Data analysis tools specific to bioinformatics allow the GRID user to store and search genetic data, with direct access to the data files stored on the Storage Elements of GRID servers.
• Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.
• Scientific instruments and experiments provide huge amounts of microarray data.
Influenza A Neuraminidase
• Grid-enabled High-throughput in-silico Screening against Influenza A Neuraminidase
• Encouraged by the success of the first EGEE biomedical data challenge against malaria (WISDOM), the second data challenge battling avian flu was kicked off in April 2006 to identify new drugs for the potential variants of the Influenza A virus.
• Mobilizing thousands of CPUs on the Grid, the six-week high-throughput screening activity consumed over 100 CPU years of computing power.
• This project demonstrated that a world-wide Grid infrastructure can efficiently deploy large-scale virtual screening to speed up the drug design process.
E-infrastructure shared between Europe and Latin America
Identification of Applications in EELA
• EELA biomedical applications fall into three categories:
– Bioinformatics applications: BLAST in Grids; Phylogeny.
– Computational biochemical processes: Wide In Silico Docking on Malaria (WISDOM).
– Biomedical models: GEANT4 Application for Tomographic Emission (GATE).
ACGT Project
EuChinaGRID
• Facility for the prediction of the three dimensional structure of “never born proteins”
Figure: workflow on the Euchina Virtual Organization (EGEE) connecting the User Interface/Portal, the Computing Element and the Storage Element in four steps: 1. Submit sequence; 2. Transfer application; 3. Store protein; 4. Visualize (PDB).
Grid added value for international collaboration on neglected diseases
• Grids offer unprecedented opportunities for sharing information and resources world wide
• Grids are unique tools for:
– Collecting and sharing information (epidemiology, genomics)
– Networking experts
– Mobilizing resources routinely or in emergency (vaccine & drug discovery)
Molecular applications in GRID
Aim: to run docking and molecular dynamics simulations, which usually take a very long time to complete.
Description
• Wide In Silico Docking On Malaria initiative (WISDOM-II): this project performs docking and molecular dynamics simulations on the GRID platform to discover new targets for neglected diseases. Analysis can be performed notably using the data generated by the WISDOM application on the EGEE infrastructure.
Grid impact on drug discovery workflow down to drug delivery (1/2)
• Grids provide the necessary tools and data to identify new biological targets
– Bioinformatics services (database replication, workflow…)
– Resources for CPU intensive tasks such as genomics comparative analysis, inverse docking…
• Grids provide the resources to speed up lead discovery
– Large scale in silico docking to identify potentially promising compounds
– Molecular dynamics computations to refine virtual screening and further assess selected compounds
Grid impact on drug discovery workflow down to drug delivery (2/2)
• Grids provide environments for epidemiology
– Federation of databases to collect data in endemic areas, to study a disease and to evaluate the impact of vaccines and vector control measures
– Resources for data analysis and mathematical modelling
• Grids provide the services needed for clinical trials
– Federation of databases to collect data in the centres participating in the clinical trials
• Grids provide the tools to monitor drug delivery
– Federation of databases to monitor drug delivery
Virtual screening process by docking
Docking: predict how small molecules bind to a receptor of known 3D structure
Figure: starting compound database + starting target structure model → DOCKING → predicted binding models → post-analysis → compounds for assay
There are successful examples:
– rapid,
– cost effective…
But there are limitations:
– CPU and storage needed
Grid-enabled high throughput virtual screening by docking
Figure: millions of chemical compounds are docked against a few target structures with docking software
• 1 to 30 min per docking
• A few MB per output
• 100 CPU years, 1 TB
• Large scale deployment on grid infrastructure
• Challenges:
– Speed up the process
– Manage the data
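To see how the per-docking cost scales up to the quoted 100 CPU years, the following back-of-the-envelope sketch multiplies compounds, targets and minutes per docking. The 1 to 30 minute range comes from the slide; the concrete counts (one million compounds, four targets, 1,000 CPUs) are assumptions chosen for illustration only.

```python
MINUTES_PER_YEAR = 60 * 24 * 365


def docking_cpu_years(n_compounds, n_targets, minutes_per_docking):
    """CPU years needed to dock every compound against every target."""
    return n_compounds * n_targets * minutes_per_docking / MINUTES_PER_YEAR


def wall_clock_days(cpu_years, n_cpus):
    """Elapsed days if the work is spread evenly over n_cpus (ideal speed-up)."""
    return cpu_years * 365 / n_cpus


# Assumed campaign size: 1,000,000 compounds against 4 targets
low = docking_cpu_years(1_000_000, 4, 1)    # fastest case, 1 min per docking
high = docking_cpu_years(1_000_000, 4, 30)  # slowest case, 30 min per docking

print(f"{low:.1f} to {high:.1f} CPU years")
print(f"100 CPU years on 1,000 CPUs: {wall_clock_days(100, 1000):.1f} days")
```

Under these assumptions the 100 CPU years quoted on the slide falls inside the estimated range, and a grid of about a thousand CPUs brings the elapsed time down to weeks, which is why the slide names speed-up and data management as the two challenges.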
WISDOM-II, second large scale docking deployment against malaria
Malaria target / Involved in / Biology partners:
• Tubulin from Plasmodium/plant/mammal / Parasite cell replication / CEA, Acamba project, France
• DHFR from Plasmodium falciparum / Parasite DNA synthesis / U. of Modena, Italia
• DHFR from Plasmodium vivax / Parasite DNA synthesis / U. of Los Andes, Venezuela; U. of Modena, Italia
• GST from Plasmodium falciparum / Parasite detoxification / U. of Pretoria, South Africa
Grid infrastructures and projects contributing to WISDOM-II
Figure legend: European grid infrastructure; European grid project; regional/national grid infrastructure
Contributors shown: EGEE, EELA, EUMedGrid, EUChinaGrid, Auvergrid, TWGrid, EMBRACE, BioinfoGrid, SHARE
Filtering process
1,000,000 chemical compounds
→ Sorting based on scoring in different parameter sets; consensus scoring
10,000 compounds selected
→ Based on key interactions
1,000 compounds
→ Key interactions, binding modes, descriptors, knowledge of active site
100 compounds
→ MD
50 compounds to be tested in experimental lab
Credit: V. Kasam, Fraunhofer Institute
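The first filtering step combines scores from different parameter sets into a consensus. The slides do not say which consensus rule was used; the sketch below assumes a simple rank-sum scheme (each parameter set ranks all compounds, ranks are added, the best fraction is kept). Compound ids, scores and function names are hypothetical, and lower scores are taken to mean better binding.

```python
def consensus_rank(scores_by_set):
    """Order compound ids by the sum of their ranks across all parameter sets.

    scores_by_set: list of dicts {compound_id: score}, lower score = better.
    """
    compounds = list(scores_by_set[0])
    rank_sum = {c: 0 for c in compounds}
    for scores in scores_by_set:
        ordered = sorted(compounds, key=lambda c: scores[c])
        for rank, c in enumerate(ordered, start=1):
            rank_sum[c] += rank
    # Ties are broken by compound id so the result is deterministic
    return sorted(compounds, key=lambda c: (rank_sum[c], c))


def keep_top(ranked, fraction):
    """Keep the best fraction of a ranked list (at least one compound)."""
    return ranked[:max(1, int(len(ranked) * fraction))]


set_a = {"c1": -9.1, "c2": -7.4, "c3": -8.8, "c4": -5.0}
set_b = {"c1": -8.4, "c2": -8.0, "c3": -8.5, "c4": -4.1}
set_c = {"c1": -7.0, "c2": -6.0, "c3": -7.5, "c4": -6.5}

ranked = consensus_rank([set_a, set_b, set_c])
print(ranked)                 # ['c3', 'c1', 'c2', 'c4']
print(keep_top(ranked, 0.5))  # ['c3', 'c1']
```

With `fraction=0.01` the same call would mirror the first step of the funnel, from 1,000,000 compounds down to 10,000.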
A grid for neglected diseases
Use grid technology to foster research and development on malaria and other neglected diseases
• Univ. Los Andes: Biological targets, Malaria biology
• LPC Clermont-Ferrand: Biomedical grid
• SCAI Fraunhofer: Knowledge extraction, Chemoinformatics
• Univ. Modena: Biological targets, Molecular Dynamics
• ITB CNR: Bioinformatics, Molecular modelling
• Univ. Pretoria: Bioinformatics, Malaria biology
• Academia Sinica: Grid user interface
• HealthGrid: Biomedical grid, Dissemination
• CEA, Acamba project: Biological targets, Chemogenomics
• BioinfoGRID: Bioinformatics Grid
Contacts also established with WHO, Microsoft, TATRC, Argonne, SDSC, SERONO, NOVARTIS, Sanofi-Aventis, hospitals in sub-Saharan Africa
The Cell Cycle
• Cell Cycle:
– Repeated sequence of events which leads to the division of a mother cell into daughter cells
– Biological process frequently studied in connection with tumour disease
– It is considered a valuable target for drug discovery in the context of cancer and neurodegenerative disease
Systems Biology Approach
• Systems biology studies how biological functions emerge from the protein-protein interactions in living systems;
• The complexity of this biological process lies in the high number of genes and networks of protein interactions involved;
• Quantifying the behaviour of each cell cycle component has a crucial role in understanding the complex mechanism of cell cycle regulation.
System Biology: Cell Cycle
Simulation Section
2D plot: image exported as PNG using Gnuplot
The simulation of a single ODE system describing a cell cycle model
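The slide does not show which ODE system or solver was used. As a stand-in, the sketch below integrates a small two-variable oscillator (the Brusselator, a textbook chemical oscillator, not a cell cycle model) with a classical fourth-order Runge-Kutta step, and formats the trajectory as whitespace-separated rows that Gnuplot can read. Model, parameters, step size and all function names are assumptions.

```python
def brusselator(state, a=1.0, b=3.0):
    """Right-hand side of a toy two-variable oscillator (stand-in model)."""
    x, y = state
    return (a + x * x * y - (b + 1.0) * x, b * x - x * x * y)


def rk4_step(f, state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (p + 2.0 * q + 2.0 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )


def simulate(f, state, dt, steps):
    """Return a list of (time, x, y) tuples, starting at time 0."""
    trajectory = [(0.0, *state)]
    for n in range(1, steps + 1):
        state = rk4_step(f, state, dt)
        trajectory.append((n * dt, *state))
    return trajectory


trajectory = simulate(brusselator, (1.0, 1.0), dt=0.01, steps=5000)

# One "time x y" row per line, ready to be written to a data file
rows = "\n".join(f"{t:.2f} {x:.6f} {y:.6f}" for t, x, y in trajectory)
```

Saved to a file, these rows can be drawn in Gnuplot with a command such as `plot 'cycle.dat' using 1:2 with lines` (the file name is hypothetical). Since each simulation is a single independent run, a parameter scan maps naturally onto one grid job per parameter set.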
Genetic Diseases
High throughput techniques (e.g. DNA microarray) to screen the whole genome
Low reliability
Validation through Tissue Microarray
Tissue Microarray in GRID
Genes and proteins detection
Tissue Microarray in GRID
Tissue Microarray in GRID
Figure: a UI, an AMGA server, and three GRID nodes, each with a CE and an SE, running the elaboration
Example: edge detection on every TMA on GRID having “age” > 80 AND “gender” = F AND “disease” = colon cancer
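The example query on this slide has two parts: selecting images by metadata, then running edge detection on each selected image. The sketch below mimics both in plain Python. The in-memory records stand in for the metadata catalogue (they do not use the AMGA API), the record fields and ids are hypothetical, and the Sobel operator is an assumed choice since the slides do not name the edge detector.

```python
def select_tma(records):
    """Ids of images matching: age > 80 AND gender = F AND disease = colon cancer."""
    return [
        r["id"] for r in records
        if r["age"] > 80 and r["gender"] == "F" and r["disease"] == "colon cancer"
    ]


def sobel_magnitude(img):
    """Gradient magnitude of a grayscale image given as a list of rows.

    Border pixels are left at 0 because the 3x3 kernel does not fit there.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


records = [
    {"id": "tma-01", "age": 84, "gender": "F", "disease": "colon cancer"},
    {"id": "tma-02", "age": 79, "gender": "F", "disease": "colon cancer"},
    {"id": "tma-03", "age": 91, "gender": "M", "disease": "colon cancer"},
    {"id": "tma-04", "age": 88, "gender": "F", "disease": "breast cancer"},
]
selected = select_tma(records)
print(selected)  # ['tma-01']

# Tiny test image with a vertical step edge between columns 2 and 3
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
edges = sobel_magnitude(image)
print(edges[2])  # [0.0, 0.0, 36.0, 36.0, 0.0, 0.0]
```

On the grid, the metadata selection would run once against the catalogue, while the edge detection of each selected image could be submitted as a separate job to the node that stores it.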
Deployment of BLAST in Grid
• A large fraction of the biological data produced is publicly available on web or ftp sites – data can be downloaded as “flat files”.
• A procedure has been set up to:
– Check the remote site for an updated version of the DBs
– Automatically download the data
– Register the file in a grid catalogue (LFC)
– Create a DB index for its use with BLAST (using the Grid)
– Register the index file(s) in the grid catalogue (LFC)
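The five steps of this procedure can be sketched as a single function. Everything concrete here is an assumption made for illustration: the `GridCatalogue` class is an in-memory stand-in and not the LFC client API, and the `download` and `build_index` callables are placeholders for the real transfer and BLAST indexing jobs.

```python
from dataclasses import dataclass, field


@dataclass
class GridCatalogue:
    """In-memory stand-in for a grid file catalogue (hypothetical interface)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, timestamp):
        self.entries[logical_name] = timestamp


def update_database(name, remote_timestamp, catalogue, download, build_index):
    """Refresh one database if the remote copy is newer; return True if updated."""
    # 1. Check the remote site for an updated version
    known = catalogue.entries.get(name)
    if known is not None and known >= remote_timestamp:
        return False
    # 2. Download the flat file
    flat_file = download(name)
    # 3. Register the file in the catalogue
    catalogue.register(name, remote_timestamp)
    # 4. Create the index used by BLAST
    index_files = build_index(flat_file)
    # 5. Register the index file(s) in the catalogue
    for index_name in index_files:
        catalogue.register(index_name, remote_timestamp)
    return True


catalogue = GridCatalogue()
fake_download = lambda name: f"{name}.fasta"   # placeholder for the real transfer
fake_index = lambda path: [f"{path}.idx"]      # placeholder for the indexing job

first_run = update_database("uniprot", 20071201, catalogue, fake_download, fake_index)
second_run = update_database("uniprot", 20071201, catalogue, fake_download, fake_index)
print(first_run, second_run)  # True False
```

Passing the download and indexing steps in as functions keeps the control flow testable without any network or grid access.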
Biological Database handling
• The Automatic Updater (AU) constantly monitors FTP sites, looking for the newest version of each database
– When a new timestamp is detected on an FTP site, the newest version is automatically downloaded and replaces the older version on the grid
– Before the older version is cleared, an xdelta patch is computed so that the old version can be regenerated from the new one.
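The idea of keeping only the newest database plus a patch that rebuilds the previous one (a reverse delta) can be illustrated in a few lines. The sketch below uses Python's standard `difflib` module, not xdelta, and its patch format is made up for the example; it only shows the principle that the old version is recoverable from the new version plus a list of copy and literal operations.

```python
from difflib import SequenceMatcher


def make_reverse_patch(new: bytes, old: bytes):
    """Describe `old` as copies from `new` plus literal data missing from `new`."""
    ops = []
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse new[i1:i2]
        elif j2 > j1:
            ops.append(("data", old[j1:j2]))   # bytes that exist only in old
        # "delete": bytes only present in new, nothing to store
    return ops


def apply_patch(new: bytes, ops) -> bytes:
    """Rebuild the old version from the new version and the reverse patch."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += new[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old_db = b">seq1\nACGTACGT\n>seq2\nTTGACA\n"
new_db = b">seq1\nACGTACGT\n>seq2\nTTGACATT\n>seq3\nGGCC\n"

patch = make_reverse_patch(new_db, old_db)
assert apply_patch(new_db, patch) == old_db
```

After the patch is stored, the old flat file can be deleted from the grid, which is the disk-saving step the slide describes.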
Biological Database handling
• This data management software dynamically replicates each database according to its usage, balancing the number of replicas (and therefore performance) against the occupied disk space.
• It relies on statistical analysis of database usage by the grid jobs, working on data acquired after each job execution: grid queue times, database set-up times and overall job computation time.
• We face complex data challenges in both parsing the output results and storing the data in the database directly from the GRID.
Results
• To make this software readily accessible, a user interface has been developed.
• It is used to submit jobs to the grid infrastructure, to present the results in a clear form and to hide the complexity of the distributed platform.
Results
• The main feature of the portal is that it can completely hide the JDL script layer used for grid job submission.
• While it is still possible to submit a simple job to the grid by writing one's own JDL script, the idea is to hide this process to make grid use more user-friendly for the bioinformatics community.
Results
• The interfaces to application jobs are automatically generated by converting XML files that describe both the end-user parameters and the structure of the JDL scripts that have to be generated to submit the jobs.
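A minimal sketch of this XML-to-JDL conversion is given below. The XML layout (`application`, `param`, `input` elements and their attributes), the script name and the command-line flag style are invented for the example, since the slides do not show the portal's actual schema. The generated attributes (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`) are standard JDL job attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical application description: one entry per user-facing parameter
APP_XML = """
<application name="blast" executable="run_blast.sh">
  <param name="database" label="Database" default="nr"/>
  <param name="evalue" label="E-value" default="1e-5"/>
  <input file="query.fasta"/>
</application>
"""


def build_jdl(xml_text, user_values):
    """Turn the XML description plus the user's form values into JDL text."""
    app = ET.fromstring(xml_text)
    executable = app.get("executable")
    # Use the value entered by the user, falling back to the XML default
    args = []
    for param in app.findall("param"):
        name = param.get("name")
        value = user_values.get(name, param.get("default"))
        args.append(f"--{name} {value}")
    arguments = " ".join(args)
    # Files shipped with the job: the script itself plus declared inputs
    sandbox = [executable] + [i.get("file") for i in app.findall("input")]
    quoted = ", ".join(f'"{name}"' for name in sandbox)
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {{{quoted}}};",
        'OutputSandbox = {"std.out", "std.err"};',
    ]
    return "\n".join(lines)


jdl = build_jdl(APP_XML, {"evalue": "1e-10"})
print(jdl)
```

The same `param` elements could also drive the generated web form (label, default value), so a new application only needs a new XML file rather than new portal code.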
Results
• A selection can be made among different databases against which to perform the analysis: all these databases are updated automatically.
• The figure shows a summary of the submitted application jobs, with information about the analysis software, the global computation status and the user interface used for submission.
Italian Bioinformatics Networks
• Milano: LITBIO - Laboratory of Interdisciplinary Technologies in BIOinformatics
• Bari: LIBI - Laboratory for International BIoinformatics
• Napoli: LAB GTP - LABoratory for the development of Bioinformatics tools and their integration with Genomics, Transcriptomics and Proteomics data
• Bari: LBBM - Bioinformatics Laboratory for the Molecular Biodiversity
30 Research Nodes
CNR-BIOINFORMATICS Networks
National Research Council
CNR-Bioinformatics project
24 CNR Research Nodes
Italian PON GRID based Networks
Virtual Physiological Human
• Concept basis: the International Physiome Project, www.physiome.org
• Computational frameworks and ICT-based tools for multiscale models of human anatomy, physiology and pathology
• Libraries of data and toolboxes for simulation and visualisation
• Patient-specific models from biosignals and images, including molecular images
Loukianos Gatzouli, ICT for Health
Acknowledgments
• BioinfoGRID http://www.bioinfogrid.eu
• EGEE: Enabling Grids for E-sciencE project http://www.eu.egee.org
• EELA: e-Infrastructure between Europe and Latin America project http://www.eu-eela.org/index.htm
• EUChinaGRID: Interconnection & Interoperability of Grids between Europe & China project http://www.euchinagrid.org/
• FIRB-MIUR LITBIO: Laboratory for Interdisciplinary Technologies in Bioinformatics http://www.litbio.org