The Francis Crick Institute
Creating a Research Computing Platform for the Science of Human Health
David Fergusson
Introduction
Challenges for biomed "big data" science
Distributed data sets
Distributed computing resources
Separate authentication/authorization mechanisms
Researchers want to combine and synthesise data
How do we do this?
Example
Dr David Fergusson, Head of Scientific Computing, Francis Crick Institute
Challenges of providing shared platforms for staff from existing institutes:
– CRUK London Research Institute
– National Institute for Medical Research
Compute and data requirements for 1,250 scientists working in biomed
– In a central London building
Direction of travel towards more and wider collaboration, requirement for controlled sharing of sensitive data
Photo credit: Francis Crick Institute
Addressing the problem in the short term
SafeShare – shared secure authorisation/authentication
Shared Data Centre(s) – avoid costly/insecure moving of data
eMedlab – collaborative science/shared operations model
UK e-Infrastructure
A new bottom-up approach
[Diagram: UK e-Infrastructure landscape – a people's national eInfrastructure spanning medical bioinformatics (MRC £120M), administrative data research (ESRC £64M), business and local government, and secure data services]
What has worked?
Consolidation through collaboration
Swansea: One system supporting Farr Wales, ADRC Wales, MRC CLIMB, Dementia Platform UK
Scotland: EPCC supporting Farr Scotland and ADRC Scotland, leveraging expertise from Archer, UK-RDF
Leeds: ARC supporting Farr HeRC, Leeds Med Bio, Consumer Data RC
Slough DC: eMedLab, Imperial Med Bio, KCL bio cluster
Jisc network: Safe Share
Jisc SafeShare
John Chapman, Deputy head, information security, Jisc
The safe share project
About Jisc » Assent
Assent: a single, unifying technology that enables you to effectively manage and control access to a wide range of web and non-web services and applications. These include cloud infrastructures, high-performance computing, grid computing, and commonly deployed services such as email, file store, remote access and instant messaging.
About Jisc » Safe Share
Safe Share:
– Providing and building services on encrypted VPN infrastructure between organisations
– Enhanced confidentiality and integrity requirements per ISO 27001
– Requirement to move electronic health data securely and support research collaboration
– Working with biomedical researchers at the Farr Institute, the MRC Medical Bioinformatics initiative and the ESRC Administrative Data Centres
Drivers
• Requirement for connectivity to move and access electronic health data securely
• Challenge to give public confidence that data is appropriately protected
• Provide economies of scale in secure connectivity
• Jisc management and funding of £960k to pilot potential solutions with the aim of developing a service in 2016/17
Partners
University of Bristol
Cardiff University
University of Leeds
Swansea University
University of Edinburgh
UCL
Francis Crick Institute
University of Oxford
University of Southampton
University of Manchester
St Andrews University
The Farr Institute
The MRC Medical Bioinformatics initiative
The Administrative Data Research Network
Authentication, Authorisation and Accounting Infrastructure (AAAI)
Use cases:
• HeRC, N8 HPC – access between facilities using home institution credentials
• eMedLab – partners will be able to use a common AAAI to access this new system (for analysis of, for instance, human genome data, medical images, and clinical, psychological and social data)
• Swansea University Health Informatics Group – investigating Moonshot as an authentication mechanism to allow use of home institution credentials
• University of Oxford: to enable researchers to use home institution credentials for authentication to request access to datasets for studies e.g. into dementia
Example "service slice": Farr
[Diagram: institution LANs, each with a Safe Share router at the edge, connecting over Janet, the internet or other networks to the Safe Share core and on to the Farr trusted environments]
UK Academic Shared Data Centre
Shared data centre
£900K investment from HEFCE
Anchor tenants:
– Francis Crick Institute
– King's College London
– London School of Economics
– Queen Mary University of London
– Wellcome Trust Sanger Institute
– University College London
Potential cost-saving/resource benefits
Jisc Shared Datacentre is already a cost saving
The eMedLab award, and the need for quick spend, gave impetus to UCL, KCL, QMUL, Sanger, LSE and Crick to identify off-site datacentre hosting (Slough)
– Anchor tenants get a price reduction based on the volume of space used
Procurement led by Jisc
Datacentre connected to the Janet network (Jisc investment)
Improved PUE: Slough 1.25 cf ~2 for an HEI datacentre (UCL save ~£2M p.a.)
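The PUE difference alone accounts for savings of this order. A rough sketch of the arithmetic, with an assumed IT load and electricity price (neither figure is from the actual procurement):

```python
# Illustrative comparison of annual energy cost at two PUE values.
# PUE = total facility energy / IT equipment energy, so the facility
# draws (IT load x PUE) continuously. The load and tariff are assumptions.

def annual_energy_cost(it_load_kw, pue, price_per_kwh):
    """Total facility energy cost over one year."""
    hours_per_year = 24 * 365
    return it_load_kw * pue * hours_per_year * price_per_kwh

IT_LOAD_KW = 2000        # assumed steady IT load
PRICE_PER_KWH = 0.12     # assumed electricity price (GBP)

cost_hei = annual_energy_cost(IT_LOAD_KW, 2.0, PRICE_PER_KWH)      # typical HEI machine room
cost_slough = annual_energy_cost(IT_LOAD_KW, 1.25, PRICE_PER_KWH)  # shared Slough datacentre

saving = cost_hei - cost_slough
print(f"Annual saving: £{saving:,.0f}")
```

With these assumed inputs the energy saving comes out around £1.6M a year, the same order as the ~£2M p.a. quoted above once other hosting costs are included.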
Datacentre Connection Topology
[Diagram: datacentre connection topology, including N3/PSN health network connectivity]
eMedLab
Collaborative science
Shared Operations
Objectives - Flexibility
• To help generate new insights and clinical outcomes by combining data from diverse sources and disciplines
• Bring computing workloads to the data, minimising the need for costly data movements
• To allow customised use of resources
• To enable innovative ways of working collaboratively
• To allow a distributed support model
Institutional Collaboration
Support team
eMedLab academy
• Training via CDFs and courses
• Promote collaborations via "Labs"
eMedLab infrastructure
• Shared computer cluster
• Integrate and exchange heterogeneous data
• Methods and insights across diseases
Hardware Overview
What is eMedLab?
eMedLab is a hub:
• 6+1 partners
• 3 data types: electronic health records, genomic, images
• 3 expertises: clinician scientists, analytics, basic science
• 3 disease areas: rare, cancer, cardio
• >6M patients
Distributed/federated support (what has worked / savings)
eMedLab Ops team (shared team)
Knowledge sharing/transfer
(inc. developing UK industrial capacity – OCF/OpenStack)
Project Model
Many projects, same challenges
– Information governance
– Secure data transfer
– User management
– AAAI
– Working with Janet to explore how to support most/all projects
Cultural Barriers and Challenges
Finance – government funding with a spend window of only one year
+ Mitigated by use of efficient procurement teams and framework agreements
+ Working closely with vendors to ensure tight time targets are met
– Drain on (unfunded) project management and finance team resources
Regulatory challenge
+ Mitigated by clear policies and governance, supported by training
+ Changing EU data protection legislation
– Risk of bad PR and/or data leaks
People
+ Everyone is open, collaborative and generous with time and knowledge
eMedLab production service projects
• UCL & WTSI – Enabling Collaborative Medical Genomics Analysis Using Arvados – Javier Herrero
• Crick, KCL & UCL – A scalable and flexible collaborative eMedLab cancer genomics cluster to share large-scale datasets and computational resources – Peter van Loo
• UCL, QMUL & Farr – Creating and exploiting a research datamart using i2b2 and novel data-driven methods – Spiros Denaxas
• LSHTM & QMUL – An evaluation of a genomic analysis tools VM on eMedLab, applied to infectious disease projects at the LSHTM using data from EBI and Sanger, & Genetic Analysis of UK Biobank Data – Taane Clark & Helen Warren
• UCL & ICH – The HIGH-5 Programme: high-definition, in-depth phenotyping at GOSH, plus related projects – Phil Beales, Hywel Williams & Chela James
eMedLab enables projects
eMedLab brings data and expertise together across diseases
(potential)
• Mechanisms of cancer diversity and genome instability
• Better understanding of biomarkers
• DARWIN Clinical Trial to target clonal drivers
Cancer evolution and heterogeneity (Swanton & Van Loo)
• Cancers evolve heterogeneously
• Diverse driver mutations and instability mechanisms
• TracerX: track lung cancer evolution
• Data: genomes, MRI, molecular pathology
• Who: clinicians, statisticians, evolutionary biologists
CAMP: Crick Analysis and Data Management Platform
David Fergusson, Bruno Silva, Adam Hufman, Luke Reimbach
CAMP
• Fast data ingest
• Batch and cloud automated analysis
• Fast parallel shared storage
• Advanced cloud storage
• Tiered storage
• Archive
• DR storage and compute
• Post-processing and buffer
• Instruments
• Analysis of data independent of location, platform and OS
Data file profiles
[Chart: number of files by file size – on the order of 1,000,000 files, with peaks in the 64 KB – 1 MB and 500 MB – 4 GB ranges]
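A profile like the one above can be gathered with a short script that walks the filesystem and bins files into coarse size bands. The bands below mirror the two observed peaks; the exact boundaries are illustrative:

```python
# Count files in coarse size bands under a directory tree.
# Bands are (label, exclusive upper bound in bytes), in ascending order.

import os
from collections import Counter

BANDS = [
    ("< 64 KB", 64 * 1024),
    ("64 KB - 1 MB", 1024 * 1024),
    ("1 MB - 500 MB", 500 * 1024 * 1024),
    ("500 MB - 4 GB", 4 * 1024**3),
    (">= 4 GB", float("inf")),
]

def band_for(size_bytes):
    """Return the label of the first band whose upper bound exceeds the size."""
    for label, upper in BANDS:
        if size_bytes < upper:
            return label
    return BANDS[-1][0]

def profile(root):
    """Walk `root` and return a Counter of band label -> number of files."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or unreadable; skip it
            counts[band_for(size)] += 1
    return counts
```

Calling `profile("/path/to/project")` returns the band counts for that tree; the path is of course whatever area is being surveyed.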
Future Problems
Metadata
Collecting metadata
– Metadata collection has to be automated; manual systems are too labour-intensive
– Open shared metadata formats – currently many are proprietary (GA4GH, etc.)
– Tools for managing metadata
Using metadata
– Object storage
– Non-tree-based search
– Natural language search
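As a sketch of what automated collection can look like, the snippet below derives basic metadata from the filesystem and file content at ingest time, with no manual entry. The field names are illustrative, not drawn from any standard (GA4GH or otherwise):

```python
# Derive a minimal metadata record for a file automatically at ingest.
# The content checksum doubles as a stable identifier, useful later for
# deduplication and for detecting silent corruption.

import hashlib
import json
import os
from datetime import datetime, timezone

def collect_metadata(path):
    stat = os.stat(path)
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            sha256.update(chunk)
    return {
        "path": os.path.abspath(path),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "sha256": sha256.hexdigest(),
        "suffix": os.path.splitext(path)[1],  # crude file-type hint
    }

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".fastq") as tmp:
        tmp.write(b"@read1\nACGT\n")
    print(json.dumps(collect_metadata(tmp.name), indent=2))
```

A real pipeline would also parse format-specific headers (instrument, sample, acquisition settings) and push the record into a searchable store rather than printing it.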
Staging data between multiple infrastructures
Data infrastructures do not have a shared understanding of data location. Ideally, compute moves to the data.
– Much biomed data (medical imaging, genomics) is becoming too large to move efficiently
– Published resource information would allow jobs to be moved to the data location
– A "data resource broker"
– Increasingly, multiple data sets need to be synthesised to create new knowledge
– Compute jobs need to span multiple data locations
– (Shared physical data locations and virtual secure networks between data resources are a beginning)
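A minimal sketch of the "data resource broker" idea, assuming sites publish which datasets they hold and jobs are then placed where the data already lives; all site and dataset names here are invented:

```python
# Toy data resource broker: sites advertise dataset holdings, and the
# broker places a job at a site that already holds every required dataset,
# avoiding a bulk data transfer. Returning None signals the harder case:
# the job must span sites or data must be staged.

class DataResourceBroker:
    def __init__(self):
        self.locations = {}  # dataset id -> set of site names

    def publish(self, site, dataset):
        """A site advertises that it holds a copy of a dataset."""
        self.locations.setdefault(dataset, set()).add(site)

    def place_job(self, datasets):
        """Pick one site holding all required datasets, or None."""
        candidates = None
        for ds in datasets:
            sites = self.locations.get(ds, set())
            candidates = sites if candidates is None else candidates & sites
        return min(candidates) if candidates else None  # deterministic pick

broker = DataResourceBroker()
broker.publish("slough-dc", "genomes-2015")
broker.publish("slough-dc", "mri-cohort")
broker.publish("edinburgh", "genomes-2015")

print(broker.place_job(["genomes-2015", "mri-cohort"]))  # -> slough-dc
print(broker.place_job(["genomes-2015", "ehr-linked"]))  # -> None
```

A production broker would also weigh queue depth, cost and data-access permissions, but the core routing decision is this set intersection over published locations.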
Sharing data seamlessly between infrastructures
Researchers and their work need to be able to move seamlessly between infrastructures.
– Currently many proprietary barriers
– Different virtualisation technologies
– Is containerisation the answer?
– Also relies on shared AAAI
Two types of data/analysis
Currently the biomed data we deal with splits into two main categories:
1. Moderate numbers of large files
– Typically image data, but also annotated genomic sequence data
– Files typically 10s of MB – 10s of GB
– Generally 1,000s – 10,000s of files
2. Very large numbers of small files
– Typically sequencing data
– Files typically MB down to KB in size
– Very large numbers of files (~2–300,000,000)
Data Pipelines
Typically have data in three states:
Active analysis
Available for re-analysis
Archive
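The three states above can be sketched as a simple age-based placement policy. The thresholds are assumptions; a real policy would also weigh project status, storage cost and likelihood of re-use:

```python
# Assign each dataset to one of the three pipeline states based on how
# long ago it was last accessed. Window lengths are illustrative only.

def tier_for(days_since_last_access, active_window=30, reanalysis_window=365):
    if days_since_last_access <= active_window:
        return "active analysis"            # fast parallel storage
    if days_since_last_access <= reanalysis_window:
        return "available for re-analysis"  # cheaper online tier
    return "archive"                        # tape or cold object store

for days in (3, 90, 800):
    print(days, "->", tier_for(days))
```

In practice a scheduled job would run this over the metadata catalogue and trigger migrations between storage tiers, rather than acting on individual lookups.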
Storage tiers
Data volumes
Instruments producing large-scale data
– 100s of TB per sample for new EMs
Distributed production of data
– Shared and international instruments
– Sequencing everywhere
Data through-put in analysis
For biomed research, the ability to move data through a pipeline rapidly, and to run many pipelines in parallel, is essential.
– "Balanced" infrastructures, where data IO, network and compute speeds match
– Being able to support concurrent analyses of widely differing profiles on an infrastructure
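Why balance matters can be shown with back-of-envelope arithmetic: a streaming pipeline sustains only the rate of its slowest stage, so over-provisioning the others is wasted. The stage figures below are illustrative, not measurements from any real system:

```python
# Sustained throughput of a streaming pipeline is capped by its slowest
# stage (storage IO, network or compute). Rates below are assumptions.

def pipeline_rate(stage_rates_gbps):
    """Sustained end-to-end rate = minimum over the pipeline's stages."""
    return min(stage_rates_gbps.values())

stages = {"storage_io": 40.0, "network": 10.0, "compute": 25.0}  # Gbit/s
rate = pipeline_rate(stages)
bottleneck = min(stages, key=stages.get)
print(f"Sustained rate: {rate} Gbit/s, limited by {bottleneck}")

# Time to push 100 TB through this pipeline (1 TB = 8,000 Gbit):
terabytes = 100
seconds = terabytes * 8_000 / rate
print(f"{terabytes} TB takes ~{seconds / 3600:.1f} hours")
```

Here the 10 Gbit/s network caps the whole pipeline at about 22 hours per 100 TB, however fast the storage and compute are; balancing means removing that mismatch.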
Data segmentation in analysis
Image data
– 1000s of medium-sized files (MB – GB)
– Video
– 3D
– Data may be inflated immediately (recently a 50 TB capture -> 150 TB)
– Identification of objects, modelling
Sequencing data
– Millions of small files
– Genomic analysis can be 1000s of medium-sized files
– Data inflates with analysis and annotation
– Finding interconnections and distant relations
Data management efficiency
Compression
– Image data
– Sequence data – CRAM
Managing duplication
Data management tools – where is a data tool suite?
– Monitoring
– Viewing
– Manipulating
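One piece of such a tool suite, managing duplication, can be sketched by grouping files whose content hashes match. At the file counts above, a real tool would shortlist candidates by size first rather than hashing everything, but the core idea is this:

```python
# Find groups of byte-identical files under a directory tree by grouping
# on a SHA-256 content hash; duplicates can then be linked or dropped.

import hashlib
import os
from collections import defaultdict

def sha256_of(path, chunk=1 << 20):
    """Hash a file's contents in 1 MB chunks (avoids loading large files)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    """Return {hash: [paths]} for every hash seen on more than one file."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                by_hash[sha256_of(path)].append(path)
            except OSError:
                continue  # unreadable file; skip
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    import tempfile
    root = tempfile.mkdtemp()
    for name, data in [("a.bam", b"x" * 10), ("copy.bam", b"x" * 10), ("b.bam", b"y")]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(data)
    print(find_duplicates(root))
```

The same hashes, captured once at ingest into the metadata catalogue, would make this a lookup rather than a full re-scan.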
Thank you