EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Gergely Sipos
MTA SZTAKI, Laboratory of Parallel and Distributed Systems
www.lpds.sztaki.hu
Life sciences applications on the EGEE Grid
The EGEE Project
• Aim of EGEE: “to establish a seamless European Grid infrastructure for the support of the European Research Area (ERA)”
• EGEE
– 1 April 2004 – 31 March 2006
– 71 partners in 27 countries, federated in regional Grids
• EGEE-II
– 1 April 2006 – 30 April 2008
– Expanded consortium
• EGEE-III
– 1 May 2008 – 30 April 2010
– Transition to sustainable model
Life sciences cluster in EGEE
Life sciences is one of the strategic communities for EGEE
• Life sciences cluster in EGEE:
– To increase the impact of EGEE on this community
– To drive the development of the EGEE services
– To develop domain specific, high level services
– Main topics: Drug discovery, Medical imaging, Bioinformatics
Biomed Virtual Organization
Size of the infrastructure today:
• > 250 sites in 48 countries
• > 68 000 CPU cores
• ~ 20 PB disk + tape MSS
• > 150 000 jobs/day
• > 9000 registered users
Out of which, Biomed VO:
• > 100 sites in 30 countries
• ~ 17 000 CPU
• > 150 registered users
Life sciences applications
[Figure: layered architecture. Applications level: Applications, Domain-specific services. Production grid infrastructure level: EGEE middleware services, Communication layer (GEANT, Internet...), Resources.]
Application example 1: WISDOM
[Figure: layered architecture for WISDOM. Applications level: WISDOM, AMGA metadata catalog, DIANE grid job scheduler, GAP user interface module. Production grid infrastructure level: Biomed Virtual Organization and EGEE middleware services, Communication layer (GEANT, Internet...), Resources.]
WISDOM In silico Drug Discovery
• WISDOM: http://wisdom.healthgrid.org/
• Goal: find new drugs for neglected and emerging diseases
– Neglected diseases lack R&D
– Emerging diseases require very rapid response time
• Need for an optimized environment
– To achieve production in a limited time
– To optimize performance
• Method: grid-enabled virtual docking
– Cheaper than in vitro tests
– Faster than in vitro tests
High throughput virtual docking
[Figure: drug discovery pipeline, from compounds to drug]
• Chemical compounds: Chembridge – 500,000; Drug like – 500,000
• Targets (enzymes): Plasmepsin II (1lee, 1lf2, 1lf3), Plasmepsin IV (1ls5)
• Millions of chemical compounds available in laboratories
• High Throughput Screening: 1-10 $/compound, nearly impossible
• Molecular docking (FlexX, Autodock): ~80 CPU years, 1 TB data
• Computational data challenge: ~6 weeks on ~1000/1600 computers
• Hits screening using assays performed on living cells
• Leads, then Clinical testing, then Drug
• Sources and tools:
– Chemical compounds: ZINC
– Molecular docking: FlexX, Autodock
– Target structures: PDB
– Grid infrastructure: EGEE
Computing model & workflow
[Figure: computing model and workflow]
• Simulation jobs run on the EGEE Grid
• Simulation results stored on the EGEE Grid
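The workflow above can be pictured as splitting a large docking campaign into independent grid jobs. The following is a minimal sketch of that idea, not the WISDOM implementation: the function names, the compound identifiers and the chunk size are hypothetical, and only the target identifiers (1lee, 1lf2) come from the slides.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def plan_docking_jobs(
    targets: Sequence[str],
    compounds: Sequence[str],
    compounds_per_job: int,
) -> List[Tuple[str, List[str]]]:
    """Return one (target, compound_chunk) work unit per grid job.

    Every compound is docked against every target, so the number of
    jobs is len(targets) * ceil(len(compounds) / compounds_per_job).
    """
    return [
        (target, chunk)
        for target in targets
        for chunk in chunked(compounds, compounds_per_job)
    ]


# Hypothetical toy input: 2 targets, 7 compounds, 3 compounds per job.
example_jobs = plan_docking_jobs(
    ["1lee", "1lf2"],
    [f"compound-{i:03d}" for i in range(7)],
    compounds_per_job=3,
)
for target, chunk in example_jobs:
    print(target, chunk)
```

Because each work unit is independent, jobs of this kind can be submitted to any free resource and their results stored back on the grid as they finish.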
Efficiency
Estimated duration on 1 CPU: 88.3 years
Duration on EGEE: 6 weeks
Cumulative number of Grid jobs: 54,000
Maximum number of concurrent CPUs used: 2,000
Approximated throughput: 2 sec/docking
• Second data challenge for avian flu drug analysis
– 8 targets against 300,000 compounds (2,400,000 simulations)
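The figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch that assumes the table and the avian flu data challenge describe the same run (the slide does not say so explicitly) and that a year has 365.25 days.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length

single_cpu_years = 88.3      # estimated duration on 1 CPU (from the slide)
grid_weeks = 6               # duration on EGEE (from the slide)
max_concurrent_cpus = 2_000  # from the slide
simulations = 8 * 300_000    # 8 targets x 300,000 compounds

grid_days = grid_weeks * 7
grid_seconds = grid_days * SECONDS_PER_DAY

# How many times faster the grid run was than a single CPU.
speedup = single_cpu_years * DAYS_PER_YEAR / grid_days

# Wall-clock time between two finished dockings, averaged over the run.
seconds_per_docking = grid_seconds / simulations

# Average CPU time consumed by a single docking.
cpu_seconds_per_docking = (
    single_cpu_years * DAYS_PER_YEAR * SECONDS_PER_DAY / simulations
)

# Speedup relative to the peak number of CPUs used.
fraction_of_peak = speedup / max_concurrent_cpus

print(f"speedup: {speedup:.0f}x")
print(f"wall-clock throughput: {seconds_per_docking:.2f} s/docking")
print(f"CPU time per docking: {cpu_seconds_per_docking / 60:.1f} min")
print(f"speedup as a fraction of peak CPUs: {fraction_of_peak:.0%}")
```

Under these assumptions the speedup comes out at roughly 770 and the wall-clock throughput at about 1.5 seconds per docking, which is in line with the approximate 2 sec/docking quoted above.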
Statistics of deployment
• First Data Challenge: July 1st - August 15th 2005
– Target: malaria
– 80 CPU years
– 1 TB of data produced
– 1700 CPUs used in parallel
– 1st large scale docking on world-wide e-infrastructure
• Second Data Challenge: April 15th - June 30th 2006
– Target: avian flu
– 100 CPU years
– 800 GB of data produced
– 1700 CPUs used in parallel
– Infrastructure was configured in 45 days
• Third Data Challenge: October 1st - December 15th 2006
– Target: malaria
– 400 CPU years
– 1.6 TB of data produced
– Up to 5000 CPUs used in parallel
– Very high docking throughput: > 100,000 compounds per hour
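One way to read these statistics is to compare the average number of busy CPUs with the peak number used in parallel. The sketch below is only an estimate: it assumes all CPU years were consumed between the stated start and end dates (inclusive) and that a year has 365.25 days.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length

# (name, start, end, CPU years, peak CPUs) as listed on the slide.
challenges = [
    ("First (malaria)", date(2005, 7, 1), date(2005, 8, 15), 80, 1700),
    ("Second (avian flu)", date(2006, 4, 15), date(2006, 6, 30), 100, 1700),
    ("Third (malaria)", date(2006, 10, 1), date(2006, 12, 15), 400, 5000),
]


def average_cpus(start: date, end: date, cpu_years: float) -> float:
    """Average number of CPUs busy over the period, end date inclusive."""
    days = (end - start).days + 1
    return cpu_years * DAYS_PER_YEAR / days


for name, start, end, cpu_years, peak in challenges:
    avg = average_cpus(start, end, cpu_years)
    print(f"{name}: ~{avg:.0f} CPUs on average, {avg / peak:.0%} of peak")
```

The ratio of average to peak gives a rough sense of how steadily the grid was loaded during each challenge; it says nothing about job failures or resubmissions, which the slides do not quantify.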
Application example 2: Bronze standard
[Figure: layered architecture for the Bronze standard. Applications level: Bronze standard workflow, MOTEUR workflow manager. Production grid infrastructure level: Biomed Virtual Organization and EGEE middleware services, Communication layer (GEANT, Internet...), Resources.]
Scientific challenge
• Medical image registration is the process by which two images acquired independently are registered into a common frame.
[Figure: unregistered and registered image pair; objects O1 and O2 related by transformation T]
• Registration accuracy is critical for many image analysis procedures
• Bronze Standard is a statistical procedure to estimate the performance of registration algorithms
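To make the statistical idea concrete, here is a deliberately simplified sketch, not the procedure used in the Bronze standard workflow. It assumes that each algorithm returns only a translation vector for one image pair, takes the component-wise mean of all estimates as the reference, and scores each algorithm by its distance from that reference. The algorithm names are those shown in the workflow; the numeric values are invented for illustration, and rotations and multiple image pairs are ignored.

```python
import math
from statistics import fmean
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, ...]


def consensus(estimates: Sequence[Vector]) -> Vector:
    """Component-wise mean of the translation estimates."""
    return tuple(fmean(axis) for axis in zip(*estimates))


def deviations(estimates_by_algorithm: Dict[str, Vector]) -> Dict[str, float]:
    """Euclidean distance of each algorithm's estimate from the consensus."""
    reference = consensus(list(estimates_by_algorithm.values()))
    return {
        name: math.dist(estimate, reference)
        for name, estimate in estimates_by_algorithm.items()
    }


# Hypothetical translation estimates (in mm) for a single image pair.
estimates = {
    "CrestLines": (10.2, -3.1, 0.4),
    "Baladin": (10.0, -3.0, 0.5),
    "Yasmina": (9.7, -2.8, 0.7),
    "PFRegister": (10.1, -3.3, 0.4),
}

for name, error in sorted(deviations(estimates).items(), key=lambda kv: kv[1]):
    print(f"{name}: {error:.3f} mm from consensus")
```

The more algorithms and image pairs contribute to the reference, the more trustworthy it becomes, which is why the evaluation involves many registrations and benefits from grid execution.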
Implementation on EGEE
[Figure: Bronze standard workflow. Inputs: A, B, Params, MethodToTest. Services: GetFromEGEE, FormatConv, CrestLines, PFMatchICP, PFRegister, Yasmina, Baladin, MultiTransfoTest, WriteResults. Outputs: Accuracy Translation, Accuracy Rotation.]
• ~100 image pairs
• ~800 EGEE jobs
Application example 3: Bioinformatics Grid Portal
[Figure: layered architecture for the Bioinformatics Grid Portal. Applications level: Bioinformatics Grid Portal. Production grid infrastructure level: Biomed Virtual Organization and EGEE middleware services, Communication layer (GEANT, Internet...), Resources.]
GPSA: Bioinformatics Grid Portal
• Scientific objectives
– Protein sequence analysis
– Analyse data from high-throughput biology: genome projects, structural biology, ...
• Tools
– Web interface: NPS@
– Protein databases are stored on grid storage as flat files: SWISS-PROT, SP-TrEMBL, NRL_3D, PATTINPROT, ...
– Legacy bioinformatics applications: FASTA, BLAST, PSI-BLAST, SSEARCH, ...
• Contact
– http://npsa-pbil.ibcp.fr/
– [email protected]
How to get involved with EGEE
• More information on EGEE:
– http://www.eu-egee.org
– Life Sciences cluster: http://technical.eu-egee.org/index.php?id=258
– Coordinator of life sciences cluster: Vincent BRETON ([email protected])
• To get your own application ported to EGEE:
– Support team: http://www.lpds.sztaki.hu/gasuc
• To get access to the Biomed Virtual Organization:
– Obtain a certificate from NIIF CA: http://www.ca.niif.hu/
– Register with the Virtual Organization: https://voms.cnaf.infn.it:8443/voms/bio/webui/request/user/create
– Access the grid from P-GRADE Portal, Bioinformatics Grid Portal, etc.
• EGEE User Forum, Catania, Italy, 2-6 March 2009:
– http://indico.cern.ch/conferenceDisplay.py?confId=40435