The Francis Crick Institute Creating a Research Computing Platform for the Science of Human Health David Fergusson


The Francis Crick Institute

Creating a Research Computing Platform for the Science of Human Health

David Fergusson


Introduction


Challenges for biomed “big data” science

Distributed data sets

Distributed computing resources

Separate authentication/authorization mechanisms

Researchers want to combine and synthesise data

How do we do this?


Example

Dr David Fergusson, Head of Scientific Computing, Francis Crick Institute

Challenges of providing shared platforms for staff from existing institutes
– CRUK London Research Institute
– National Institute for Medical Research

Compute and data requirements for 1,250 scientists working in biomed
– In a central London building

Direction of travel towards more and wider collaboration, requirement for controlled sharing of sensitive data


Photo credit: Francis Crick Institute


Addressing the problem in the short term

SafeShare – shared secure authorisation/authentication

Shared Data Centre(s) – avoid costly/insecure moving of data

eMedlab – collaborative science/shared operations model


UK e-Infrastructure

A new bottom up approach


[Diagram: a bottom-up national e-infrastructure – labels include “People’s National eInfrastructure”, Uganda, Medical Bioinformatics (MRC £120M), business and local government (ESRC £64M), and SECURE connectivity.]


What has worked?

Consolidation through collaboration

Swansea: One system supporting Farr Wales, ADRC Wales, MRC CLIMB, Dementia Platform UK

Scotland: EPCC supporting Farr Scotland and ADRC Scotland, leveraging expertise from Archer, UK-RDF

Leeds: ARC supporting Farr HeRC, Leeds Med Bio, Consumer Data RC

Slough DC: eMedLab, Imperial Med Bio, KCL bio cluster

Jisc network: Safe Share


Jisc Safe Share


John Chapman, Deputy head, information security, Jisc

The safe share project


About Jisc » Assent

Assent:

Single, unifying technology that enables you to effectively manage and control access to a wide range of web and non-web services and applications. These include cloud infrastructures, high-performance computing, grid computing and commonly deployed services such as email, file store, remote access and instant messaging.


About Jisc » Safe Share

Safe Share:

• Providing and building services on encrypted VPN infrastructure between organisations
• Enhanced confidentiality and integrity requirements per ISO 27001
• Requirement to move electronic health data securely and support research collaboration
• Working with biomedical researchers at the Farr Institute, the MRC Medical Bioinformatics initiative and the ESRC Administrative Data Centres


The safe share project


Drivers

• Requirement for connectivity to move and access electronic health data securely

• Challenge to give public confidence that data is appropriately protected

• Provide economies of scale in secure connectivity


• Jisc management and funding of £960k to pilot potential solutions with the aim of developing a service in 2016/17


Partners


University of Bristol

Cardiff University

University of Leeds

Swansea University

University of Edinburgh

UCL

Francis Crick Institute

University of Oxford

University of Southampton

University of Manchester

St Andrews University

The Farr Institute The MRC Medical Bioinformatics initiative

The Administrative Data Research Network



The safe share project


Authentication, Authorisation and Accounting Infrastructure (AAAI)

Use cases:
• HeRC, N8 HPC – access between facilities using home institution credentials

• eMedLab – partners will be able to use a common AAAI to access this new system (for analysis of, for instance, human genome data, medical images, and clinical, psychological and social data)

• Swansea University Health Informatics Group – investigating Moonshot as an authentication mechanism to allow use of home institution credentials

• University of Oxford: to enable researchers to use home institution credentials for authentication to request access to datasets for studies e.g. into dementia


The safe share project


Example “service slice”: Farr

[Diagram: each institution LAN connects through a Safe Share router at its edge, across Janet, the internet or another network, to the Safe Share core and on to the Farr trusted environments.]



UK Academic Shared Data Centre


Shared data centre

£900K investment from HEFCE

Anchor tenants:
– Francis Crick Institute
– King’s College London
– London School of Economics
– Queen Mary University of London
– Wellcome Trust Sanger Institute
– University College London


Potential cost-saving/resource benefits

The Jisc Shared Datacentre is already a cost saving. The eMedLab award, and the need for quick spend, gave impetus to UCL, KCL, QMUL, Sanger, LSE and the Crick to identify off-site datacentre hosting (Slough).
– Anchor tenants get a price reduction based on the volume of space used
– Procurement led by Jisc
– Datacentre connected to the Janet network (Jisc investment)
– Improved PUE: Slough 1.25 cf. ~2 for an HEI datacentre (UCL saves ~£2M p.a.)
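The PUE comparison above can be turned into a rough worked example. PUE is total facility energy divided by IT equipment energy, so facility energy = IT energy × PUE. The IT load and electricity price below are invented for illustration, not Crick or UCL figures.

```python
# Illustrative PUE comparison (assumed numbers, not actual site figures).
# PUE = total facility energy / IT equipment energy.

def annual_facility_energy_kwh(it_load_kw: float, pue: float) -> float:
    """Total facility energy per year for a given IT load and PUE."""
    hours_per_year = 24 * 365
    return it_load_kw * pue * hours_per_year

it_load_kw = 1000.0      # assumed 1 MW of IT equipment
price_per_kwh = 0.12     # assumed electricity price in GBP

cost_hei = annual_facility_energy_kwh(it_load_kw, 2.0) * price_per_kwh
cost_slough = annual_facility_energy_kwh(it_load_kw, 1.25) * price_per_kwh
print(f"Annual saving at 1 MW IT load: £{cost_hei - cost_slough:,.0f}")
```

At larger IT loads the gap scales linearly, which is consistent with savings of the order quoted for a university-scale estate.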


Datacentre Connection Topology

[Diagram: datacentre connection topology, including N3/PSNH/PSN links.]


eMedLab

Collaborative science

Shared Operation


Objectives - Flexibility

• To help generate new insights and clinical outcomes by combining data from diverse sources and disciplines

• Bring computing workloads to the data, minimising the need for costly data movements

• To allow customised use of resources
• To enable innovative ways of working collaboratively
• To allow a distributed support model


Institutional Collaboration


Support team

eMedLab academy
• Training via CDFs and courses
• Promote collaborations via “Labs”

eMedLab infrastructure

• Shared computer cluster
• Integrate and exchange heterogeneous data
• Methods and insights across diseases


Hardware Overview


eMedLab is a hub

6+1 partners

3 data types: electronic health records, genomic, images

3 areas of expertise: clinician scientists, analytics, basic science

3 disease areas: rare, cancer, cardio

>6M patients


What is eMedLab?


Distributed/federated support (what has worked, savings)

[Diagram: the eMedLab Ops team (a shared team) supports each partner institution, with knowledge sharing/transfer between them (including developing UK industrial capacity – OCF/OpenStack).]


Project Model


Many projects, same challenges

• Information governance
• Secure data transfer
• User management
• AAAI

Working with Janet to explore how to support most/all projects


Cultural barriers and challenges

Finance – government funding with a spend window of only one year
+ Mitigated by use of efficient procurement teams and framework agreements
+ Working closely with vendors to ensure tight time targets are met
– Drain on (unfunded) project management and finance team resources

Regulatory challenge
+ Mitigated by clear policies and governance, supported by training
+ Changing EU data protection legislation
– Risk of bad PR and/or data leaks

People
+ Everyone is open, collaborative, generous with time and knowledge


eMedLab production service – projects

• UCL & WTSI – Enabling Collaborative Medical Genomics Analysis Using Arvados – Javier Herrero

• Crick KCL UCL - A scalable and flexible collaborative eMedLab cancer genomics cluster to share large-scale datasets and computational resources – Peter van Loo

• UCL QMUL Farr - Creating and exploiting research datamart using i2b2 and novel data-driven methods - Spiros Denaxas

• LSHTM & QMUL – An evaluation of a genomic analysis tools VM on eMedLab, applied to infectious disease projects at the LSHTM using data from EBI and Sanger, & Genetic Analysis of UK Biobank Data – Taane Clark & Helen Warren

• UCL & ICH - The HIGH-5 Programme - High definition, in-depth phenotyping at GOSH, plus related projects - Phil Beales & Hywel Williams & Chela James


eMedLab enables projects

eMedLab brings data and expertise together across diseases.

Cancer evolution and heterogeneity (Swanton & Van Loo)
• Cancers evolve heterogeneously
• Diverse driver mutations and instability mechanisms
• TracerX: track lung cancer evolution
• Data: genomes, MRI, molecular pathology
• Who: clinicians, statisticians, evolutionary biologists

Potential outcomes:
• Mechanisms of cancer diversity and genome instability
• Better understanding of biomarkers
• DARWIN clinical trial to target clonal drivers


CAMP – Crick Analysis and Data Management Platform

David Fergusson, Bruno Silva, Adam Hufman, Luke Reimbach


CAMP

• Fast data ingest
• Batch and cloud automated analysis
• Fast parallel shared storage
• Advanced cloud storage
• Tiered storage
• Archive
• DR storage and compute
• Post-processing and buffer
• Instruments

Analysis of data independent of location, platform, OS


[Diagram: simplified flow – data ingest → shared storage → analysis and cloud → advanced cloud storage.]


Data file profiles

[Chart: numbers of files against file size, with populations around 64 KB–1 MB and 500 MB–4 GB, and counts up to ~1,000,000.]


Future Problems


Metadata

Collecting metadata
– Metadata collection has to be automated – manual systems are too labour-intensive
– Open, shared metadata formats (GA4GH, etc.) – currently many are proprietary
– Tools for managing metadata

Using metadata
– Object storage
– Non-tree-based search
– Natural language search
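One way to automate the collection step is a harvester that walks a directory tree and emits one record per file. This is a minimal sketch only: the field names are illustrative, not a GA4GH or Crick schema, and a real system would stream checksums in chunks rather than read whole files into memory.

```python
# Minimal automated metadata harvester: one JSON record per file
# (path, size, modified time, content checksum). Illustrative schema.
import hashlib
import json
import os
import time

def harvest(root: str) -> list:
    records = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            st = os.stat(path)
            records.append({
                "path": path,
                "bytes": st.st_size,
                "modified": time.strftime("%Y-%m-%dT%H:%M:%S",
                                          time.gmtime(st.st_mtime)),
                "sha256": digest,
            })
    return records

if __name__ == "__main__":
    # Print the first part of the catalogue for the current directory.
    print(json.dumps(harvest("."), indent=2)[:500])
```

Records like these can then be loaded into an object store or search index, which is what makes the non-tree-based and natural-language search mentioned above possible.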


Staging data between multiple infrastructures

Data infrastructures do not have a shared understanding of data location. Ideally, compute moves to the data.
– Much biomed data (medical imaging, genomics) is becoming too large to move efficiently
– Published resource information would allow jobs to be moved to the data location
– A “data resource broker”
– Increasingly, multiple data sets need to be synthesised to create new knowledge
– Compute jobs need to span multiple data locations
– (Shared physical data locations and virtual secure networks between data resources are a beginning)
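A “data resource broker” of the kind suggested above might behave roughly like this sketch: given published information about where datasets live and how big they are, it sends the job to the site holding the largest share of the data and estimates the transfer cost for whatever must still move. All site names, sizes and thresholds here are invented for illustration.

```python
# Hypothetical data resource broker: pick the compute site that minimises
# data movement. datasets maps name -> (site, size_tb).

def choose_site(datasets: dict,
                link_gbps: float = 10.0,
                move_threshold_hours: float = 1.0) -> str:
    """Run the job where most of the data already is; warn about the rest."""
    by_site = {}
    for site, size_tb in datasets.values():
        by_site[site] = by_site.get(site, 0.0) + size_tb
    best = max(by_site, key=by_site.get)
    # Estimate transfer time for data held elsewhere (1 TB = 8000 Gbit).
    remote_tb = sum(tb for s, tb in datasets.values() if s != best)
    hours = remote_tb * 8000 / link_gbps / 3600
    if hours > move_threshold_hours:
        print(f"warning: {remote_tb:.1f} TB (~{hours:.1f} h) must still move to {best}")
    return best

# Example: 40 TB of genomes at one site, 5 TB of images at another.
site = choose_site({"genomes": ("eMedLab", 40.0), "images": ("Crick", 5.0)})
print("run job at:", site)
```

Even this toy version shows why published resource information matters: without knowing sizes and locations up front, the broker cannot make the comparison at all.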


Sharing data seamlessly between infrastructures

Researchers and work need to be able to move seamlessly between infrastructures.

Currently many proprietary barriers.

Different virtualisation technologies.

Is containerisation the answer?

Also relies on shared AAAI


Two types of data/analysis

Currently the biomed data we deal with splits into two main categories:

1. Moderate numbers of large files
– Typically image data, but also annotated genomic sequence data
– Files typically 10s of MB – 10s of GB
– Generally 1,000s – 10,000s of files

2. Very large numbers of small files
– Typically sequencing data
– Files typically MB down to KB sizes
– Very large numbers of files, ~2 - 300,000,000
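A storage planner could classify a dataset into one of these two profiles from its file-size distribution. A minimal sketch, using a 10 MB cut-off as an assumed threshold derived from the rough figures above:

```python
# Classify a dataset as imaging-style (few large files) or
# sequencing-style (many small files). Threshold is an assumption.
MB = 1024 ** 2
GB = 1024 ** 3

def profile(sizes: list) -> str:
    """Decide which of the two workload profiles a set of file sizes fits."""
    small = sum(1 for s in sizes if s < 10 * MB)
    large = len(sizes) - small
    # Heuristic: whichever population dominates by count wins.
    return "many-small-files" if small > large else "few-large-files"

print(profile([4 * GB, 500 * MB, 2 * GB]))       # imaging-style dataset
print(profile([64 * 1024] * 1000 + [1 * GB]))    # sequencing-style dataset
```

In practice the two profiles stress storage very differently: the first is bound by streaming bandwidth, the second by metadata operations per second, which is why they often end up on different tiers.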


Data Pipelines

Typically have data in three states:

Active analysis

Available for re-analysis

Archive
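These three states can be made explicit in tooling, for example as a lifecycle with an age-based demotion policy. The age thresholds below are assumptions for illustration, not Crick policy.

```python
# The three pipeline states as an explicit lifecycle, with a simple
# last-access-age policy for demoting data between them.
from enum import Enum

class DataState(Enum):
    ACTIVE = "active analysis"
    REANALYSIS = "available for re-analysis"
    ARCHIVE = "archive"

def tier_for(days_since_last_access: int) -> DataState:
    """Demote data that has not been touched recently (assumed thresholds)."""
    if days_since_last_access <= 30:
        return DataState.ACTIVE
    if days_since_last_access <= 365:
        return DataState.REANALYSIS
    return DataState.ARCHIVE

print(tier_for(3).value)     # recently used data stays in active analysis
print(tier_for(1000).value)  # untouched for years -> archive
```

Mapping each state to a physical storage tier (fast parallel, capacity, tape/object) is then a policy decision rather than an ad hoc one.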


Storage tiers


Data volumes

Instruments producing large-scale data
– 100s of TB per sample for new EMs

Distributed production of data
– Shared and international instruments
– Sequencing everywhere


Data throughput in analysis

For biomed research, the ability to move data through a pipeline rapidly, and to run many pipelines in parallel, is essential.
• “Balanced” infrastructures, where data IO, network and compute speeds match
• Being able to support concurrent analyses of widely differing profiles on one infrastructure
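The “balanced” point can be stated simply: a pipeline runs no faster than its slowest stage, so over-provisioning one component relative to the others is wasted spend. A minimal sketch with illustrative numbers:

```python
# Effective pipeline throughput is capped by the bottleneck stage.

def pipeline_throughput_gbps(io_gbps: float, net_gbps: float,
                             compute_gbps: float) -> float:
    """Data rate through a pipeline = the minimum of its stage rates."""
    return min(io_gbps, net_gbps, compute_gbps)

# A 100 Gbps network is wasted if storage can only feed 20 Gbps:
print(pipeline_throughput_gbps(20, 100, 40))  # bottlenecked at 20
```

The same relation explains the second bullet: concurrent analyses with different profiles shift which stage is the bottleneck, so a shared infrastructure has to be reasonably balanced across all three.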


Data segmentation in analysis

Image data
– 1000s of medium-sized files (MB – GB)
– Video
– 3D
– Data may be inflated immediately (recently a 50 TB capture became 150 TB)
– Identification of objects, modelling

Sequencing data
– Millions of small files
– Genomic analysis can be 1000s of medium-sized files
– Data inflates with analysis and annotation
– Finding interconnections and distant relations


Data management efficiency

Compression
– Image data
– Sequence data – CRAM

Managing duplication

Data management tools – where is a data tool suite?
– Monitoring
– Viewing
– Manipulating
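For managing duplication, the basic building block is content hashing: files whose digests match are byte-identical candidates for deduplication. A minimal sketch (a real tool would compare sizes first and hash large files in chunks rather than holding contents in memory):

```python
# Find byte-identical files by hashing their contents.
import hashlib
from collections import defaultdict

def find_duplicates(blobs: dict) -> list:
    """Group names whose contents hash identically (blobs maps name -> bytes)."""
    groups = defaultdict(list)
    for name, data in blobs.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [names for names in groups.values() if len(names) > 1]

dups = find_duplicates({"a.fastq": b"ACGT", "b.fastq": b"ACGT", "c.fastq": b"TTTT"})
print(dups)  # [['a.fastq', 'b.fastq']]
```

Combined with the monitoring and viewing tools asked for above, a report like this is the starting point for reclaiming duplicated storage.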


Thank you.