Genomes, Clouds, and Organization eMedLab Workshop, London May, 2016 Chris Dwan Director, Research Computing [email protected] @fdmts

2016 05 sanger


Page 1: 2016 05 sanger

Genomes, Clouds, and Organization

eMedLab Workshop, London
May, 2016

Chris Dwan
Director, Research Computing

[email protected] @fdmts

Page 2: 2016 05 sanger

Conclusions

• In order to take full advantage of cloud technologies, we need to change not just what we do, but also how we do it.

• Organizations need to fundamentally rethink how they engage with technology and technologists in order to remain relevant.

• The groups who get good at collaboration in this new world will lead the next decade of biomedical science.

Page 3: 2016 05 sanger

• The Broad Institute is a non-profit biomedical research institute founded in 2004

• Fifty core faculty members, from MIT and Harvard, plus hundreds of associate members.

• ~1000 directly affiliated personnel

• ~2,400+ associated researchers

Programs and Initiatives: focused on specific disease or biology areas

Cancer, Genome Biology, Cell Circuits, Psychiatric Disease, Metabolism, Medical and Population Genetics, Infectious Disease, Epigenomics

Platforms: focused technological innovation and application

Genomics, Data Sciences, Therapeutics, Imaging, Metabolite Profiling, Proteomics, Genetic Perturbation

The Broad Institute

Page 4: 2016 05 sanger

• The Broad Institute is a non-profit biomedical research institute founded in 2004

• Fifty core faculty members and hundreds of associate members from MIT and Harvard

• ~1000 research and administrative personnel, plus ~2,400+ associated researchers

• ~1.4 x 10^6 genotyped samples

Programs and Initiatives: focused on specific disease or biology areas

Cancer, Genome Biology, Cell Circuits, Psychiatric Disease, Metabolism, Medical and Population Genetics, Infectious Disease, Epigenomics

Platforms: focused technological innovation and application

Genomics, Data Sciences, Therapeutics, Imaging, Metabolite Profiling, Proteomics, Genetic Perturbation

The Broad Institute

“This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”

Page 5: 2016 05 sanger

People @ Broad

Page 6: 2016 05 sanger
Page 7: 2016 05 sanger

WGS / day: ~120 140 .. (plus other products)
Data generation: ~0.5PB/mo (200 MB/s)
Network: ~1.6Gb/sec

This is not going to slow down any time soon.
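The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a 30-day month, neither of which the slide states; the function names are illustrative only.

```python
# Cross-check of the slide's data-rate figures.
# Assumptions (not stated on the slide): decimal units and a 30-day month.

PB = 10**15                              # bytes in a petabyte (decimal)
MB = 10**6                               # bytes in a megabyte (decimal)
SECONDS_PER_MONTH = 30 * 24 * 60 * 60    # 30-day month


def sustained_mb_per_s(pb_per_month: float) -> float:
    """Average byte rate, in MB/s, needed to write pb_per_month every month."""
    return pb_per_month * PB / SECONDS_PER_MONTH / MB


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert a byte rate in MB/s to a line rate in Gb/s (8 bits per byte)."""
    return mb_per_s * 8 / 1000


rate = sustained_mb_per_s(0.5)
print(f"0.5 PB/month is about {rate:.0f} MB/s")
print(f"200 MB/s is {mb_per_s_to_gbit_per_s(200):.1f} Gb/s")
```

Under these assumptions 0.5 PB/month works out to roughly 193 MB/s, consistent with the slide's ~200 MB/s, and 200 MB/s corresponds to 1.6 Gb/s on the wire.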

Page 8: 2016 05 sanger

WGS / day: ~120 140 …
Data storage: ~200 MB/s (0.5PB/mo)
Network: ~1.6Gb/sec

This is not going to slow down any time soon.

Colocated File Storage: ~30P
Colocated HPC: ~14k cores
Colocated Object Storage Capacity: ~5P

Public cloud data: ~7P
Public cloud cores: ~15k cores steady state

Internal network: 10Gb/sec
External network: 100Gb/sec

Page 9: 2016 05 sanger

Base pairs vs. Samples

Page 10: 2016 05 sanger

The future is already here – it’s just not very well distributed

William Gibson

Page 11: 2016 05 sanger

A lot of technology has happened since we were all worried about “data tsunamis” in 2007.

Page 12: 2016 05 sanger

Amazon’s innovation

2002: All sharing of data, provisioning of services, configuration of infrastructure – everything is via programmatic call (API)

APIs must be written to be called by external customers.

Anyone who does not do this will be fired, have a nice day.

2004: Amazon launches a product with which I can provision servers and storage as easily as I buy books.

Page 13: 2016 05 sanger

Cloudbursting (Aug, 2015)

50,000+ cores used for ~2 hours
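To put the burst in perspective, here is a rough sketch of the core-hours involved and how long the same work would occupy a fixed cluster. The 14,000-core comparison uses the colocated HPC size quoted elsewhere in this deck; perfect scaling is an assumption, and the function names are illustrative.

```python
# Core-hours consumed by the burst, and the wall-clock time the same work
# would take on a fixed-size cluster. Assumes the work scales perfectly,
# which real pipelines rarely do.

def core_hours(cores: int, hours: float) -> float:
    """Total core-hours for a job using `cores` cores for `hours` hours."""
    return cores * hours


def hours_on_fixed_cluster(total_core_hours: float, cluster_cores: int) -> float:
    """Wall-clock hours to deliver total_core_hours on cluster_cores cores."""
    return total_core_hours / cluster_cores


burst = core_hours(50_000, 2)
print(f"burst: {burst:,.0f} core-hours")
print(f"same work on 14,000 cores: {hours_on_fixed_cluster(burst, 14_000):.1f} hours")
```

With these inputs the burst is 100,000 core-hours, or a little over seven hours with the whole 14,000-core cluster dedicated to it.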

Page 14: 2016 05 sanger

Data Storage (May 2016)

Page 15: 2016 05 sanger
Page 16: 2016 05 sanger

Avere (June 2015): A cloud gateway for files.

• Data uploaded: 4 PB and counting

• Compression and client side encryption in-line (push-button)

• Simple enough that we’re out in front of the computational capabilities ($$)

Diagram: the Broad Data Center (physical data store, physical Avere cluster, physical compute hosts) linked to Google Cloud Services (cloud bucket, virtual Avere cluster, virtual compute hosts). Labels on the diagram: “Free” and “Expensive”.

Page 17: 2016 05 sanger

Liberation from the location of metal

The billing API is the best way to get usage information out of Google’s cloud offerings.

Eight Exabytes Free
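As an illustration of using billing data as a usage report, the sketch below totals cost per project and per service. The record layout and every value in it are made up for the example; the field names are hypothetical and are not the schema of any real billing export.

```python
from collections import defaultdict

# Hypothetical billing line items. Field names and amounts are invented for
# illustration and do NOT reflect any vendor's actual export format.
LINE_ITEMS = [
    {"project": "genome-pipeline", "service": "storage", "cost_usd": 1200.0},
    {"project": "genome-pipeline", "service": "network egress", "cost_usd": 300.0},
    {"project": "cloud-pilot", "service": "compute", "cost_usd": 450.0},
    {"project": "cloud-pilot", "service": "storage", "cost_usd": 50.0},
]


def total_cost_by(items, key):
    """Sum cost_usd over items, grouped by the value found under `key`."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key]] += item["cost_usd"]
    return dict(totals)


print(total_cost_by(LINE_ITEMS, "project"))
print(total_cost_by(LINE_ITEMS, "service"))
```

Grouping the same records by different keys answers both "who is spending" and "what is it being spent on" without any separate usage-metering system.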

Page 18: 2016 05 sanger

File based storage: The Information Limits

• Single namespace filers hit real-world limits at:
  – ~5PB (restriping times, operational hotspots, MTBF headaches)
  – ~10^9 files: Directories must either be wider or deeper than human brains can handle.

• Filesystem paths are presumed to persist forever
  – Forests of symbolic links
  – “Charlotte’s web”

• Access semantics are fundamentally inadequate.
  – We need complex, dynamic, context sensitive semantics including consent for research use.
  – File hierarchies will never scale to a federated world.
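The "wider or deeper" trade-off for roughly a billion files can be made concrete. The sketch below assumes an idealized, perfectly balanced hierarchy in which every directory holds the same number of entries; real trees are far less tidy, and the function name is illustrative.

```python
# How deep must a perfectly balanced directory tree be to hold a given
# number of files, if every directory holds `fanout` entries?
# Idealized model: real hierarchies are never this evenly filled.

def levels_needed(total_files: int, fanout: int) -> int:
    """Smallest depth d such that fanout**d >= total_files."""
    depth = 0
    capacity = 1
    while capacity < total_files:
        capacity *= fanout
        depth += 1
    return depth


for fanout in (100, 1_000, 100_000):
    print(f"{fanout:>7} entries per directory -> {levels_needed(10**9, fanout)} levels")
```

Even in this best case, a billion files needs five levels at 100 entries per directory, or directories with 100,000 entries each to get down to two levels. Neither shape is something a person can browse.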

Page 19: 2016 05 sanger

3rd Party Companies Fill Cloud Feature Gaps

Cloudhealth dashboard atop the billing API

Storage $$

Network $$

Page 20: 2016 05 sanger

Direct storage cost

Two kinds of network egress

Data’s trip to the cloud should be one-way.
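A minimal sketch of why the trip should be one-way: a monthly bill with a storage term and an egress term. The prices are placeholders chosen for round numbers, not quotes from any vendor, and the 4,000 TB volume simply echoes the upload figure mentioned earlier in the deck.

```python
# Monthly cloud bill as storage plus egress.
# All prices are PLACEHOLDERS for illustration, not real vendor pricing.

STORAGE_USD_PER_TB_MONTH = 20.0   # hypothetical
EGRESS_USD_PER_TB = 100.0         # hypothetical


def monthly_cost(stored_tb: float, egress_tb: float) -> float:
    """Storage charge for data at rest plus egress charge for data pulled out."""
    return stored_tb * STORAGE_USD_PER_TB_MONTH + egress_tb * EGRESS_USD_PER_TB


one_way = monthly_cost(stored_tb=4_000, egress_tb=0)
round_trip = monthly_cost(stored_tb=4_000, egress_tb=400)  # 10% pulled back out
print(f"one-way:    ${one_way:,.0f} per month")
print(f"10% egress: ${round_trip:,.0f} per month")
```

With these placeholder prices, pulling just a tenth of the data back out each month adds half again to the bill, which is the shape of the problem regardless of the exact rates.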

Page 21: 2016 05 sanger

Genomes on the Cloud (April 2016)

Testing the genome analysis pipeline

“Go-live”

Page 22: 2016 05 sanger

“To be without method is deplorable, but to depend entirely on method is worse.”

The Mustard Seed Garden Manual of Painting, 1679

Page 23: 2016 05 sanger

Most laboratory and clinical work

Consumer of analysis

User of GUI and visual tools

A Technology Engagement Spectrum

“Users”

Page 24: 2016 05 sanger

Most laboratory and clinical work

Consumer of analysis

User of GUI and visual tools

Author of scripts and workflows for personal use

Author of scripts and command line tools for use by others

A Technology Engagement Spectrum

“Users”

Well served by traditional “research computing”

Page 25: 2016 05 sanger

Most laboratory and clinical work

Manager of compute infrastructure for use by others.

Consumer of analysis

User of GUI and visual tools

Author of scripts and workflows for personal use

Author of scripts and command line tools for use by others

Manager of compute infrastructure for personal use

A Technology Engagement Spectrum

“Users”

“Shadow IT”

Well served by traditional “research computing”

Page 26: 2016 05 sanger

Most laboratory and clinical work

Manager of compute infrastructure for use by others.

Consumer of analysis

User of GUI and visual tools

Author of scripts and workflows for personal use

Author of scripts and command line tools for use by others

Manager of compute infrastructure for personal use

A Technology Engagement Spectrum

“Users”

“Shadow IT”

Well served by traditional “research computing”

To The Cloud!

Page 27: 2016 05 sanger

Most laboratory and clinical work

Manager of compute infrastructure for use by others.

Consumer of analysis

User of GUI and visual tools

Author of scripts and workflows for personal use

Author of scripts and command line tools for use by others

Manager of compute infrastructure for personal use

A Technology Engagement Spectrum

“Users”

“Shadow IT”

Well served by traditional “research computing”

To The Cloud!

To The Other Cloud!

Page 28: 2016 05 sanger

Most laboratory and clinical work

Manager of compute infrastructure for use by others.

Consumer of analysis

User of GUI and visual tools

Author of scripts and workflows for personal use

Author of scripts and command line tools for use by others

Manager of compute infrastructure for personal use

A Technology Engagement Spectrum

“Users”

“Shadow IT”

Well served by traditional “research computing”

To The Cloud!

To The Other Cloud!

Already happily off-prem, PaaS, etc.

Page 29: 2016 05 sanger

Most laboratory and clinical work

Manager of compute infrastructure for use by others.

Consumer of analysis

User of GUI and visual tools

Author of scripts and workflows for personal use

Author of scripts and command line tools for use by others

Manager of compute infrastructure for personal use

Tool Building

Training / Access

Shifting how we engage with technology

A Technology Engagement Spectrum

“Users”

“Shadow IT”

Well served by traditional “research computing”

Page 30: 2016 05 sanger

What does “cloud” mean to me?

• Engineering and Design Approach:
  – All infrastructure and technology choices are seamlessly available, as necessary, to every project and product.

• Integrative Organizing Principle*
  – Technologists directly engaged and accessible
  – Shared accountability for business / project goals.

Organizations who fail to integrate in this way will be routed around.

*DevOps

Page 31: 2016 05 sanger

Product (revenue generation)

User Services (workstations, laptops, printers)

Run the Business (HR, Finance, …)

IT / Infrastructure

Internal Service Catalog

Business Priorities

A traditional IT organization, splitting infrastructure and technical architecture away from business priorities

Page 32: 2016 05 sanger

Product (increased connection with architectural and infrastructure design)

User Services (workstations, laptops, printers)

Run the Business (HR, Finance, …)

Infrastructure

Business Priorities

Internal Service Catalog

DevOps (direct engagement w/ teams through entire product lifecycle)

The beginnings of a DevOps transition, characterized by teams named “DevOps,” that serve particular projects

Page 33: 2016 05 sanger

Business units dive into infrastructure as they need, partnering with technologists to achieve business goals

A mature DevOps IT organization composed of the same staff, working in a fundamentally different way.

Business Priorities

Page 34: 2016 05 sanger

Clouds open new possibilities for IT Services

Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using public clouds
Responsibility: CIO

Page 35: 2016 05 sanger

Clouds open new possibilities for IT Services

Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using public clouds
Responsibility: CIO

Cancer Genome Analysis
Connectivity Map

Billing Support:
• IT provides coordination between internal cost objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User

Page 36: 2016 05 sanger

Governance remains critical

$$ !!

Page 37: 2016 05 sanger

Clouds open new possibilities for IT Services

Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using public clouds
Responsibility: CIO

Cancer Genome Analysis
Connectivity Map

Billing Support:
• IT provides coordination between internal cost objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User

Cloud / Hybrid Model
• Granular shared services
• VPN used to expose selected services to particular projects
Responsibility: Project / Service Lead

BITS DevOps DSDE Dev Cloud Pilot

API API API

Page 38: 2016 05 sanger
Page 39: 2016 05 sanger

The Cloud Future (where we are going)

• We are not so special:
  – Dozens to hundreds of businesses have multiple exabytes of data.
  – Health care / life sciences is playing catch-up.

• Objects, not files:
  – Engineer like an MMORPG* designer.
  – Do not copy files. Access APIs.
  – Avere gets around this by turning objects back into files.

• Cloud aware access patterns:
  – Data egress is expensive.
  – Do computing adjacent to the data.
  – Figure out a cost model to support this world.

• Everybody will not use the same cloud vendor:
  – If we want to collaborate at scale, we need to stop thinking in terms of single, monolithic solutions.

*Massively Multiplayer Online Role Playing Game
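One possible starting point for such a cost model is to compare running an analysis next to the data against moving the data out and running it elsewhere. This is a sketch under stated assumptions: every price below is a placeholder, the names are hypothetical, and real models would also need storage, staff, and time-to-result terms.

```python
# Compare "compute where the data lives" with "move the data, then compute".
# All rates are PLACEHOLDERS for illustration, not real pricing.

EGRESS_USD_PER_TB = 100.0          # hypothetical cost to move data out
CLOUD_USD_PER_CORE_HOUR = 0.05     # hypothetical
ONPREM_USD_PER_CORE_HOUR = 0.02    # hypothetical


def cost_in_cloud(core_hours: float) -> float:
    """Run the analysis adjacent to the data: compute charge only."""
    return core_hours * CLOUD_USD_PER_CORE_HOUR


def cost_after_egress(dataset_tb: float, core_hours: float) -> float:
    """Move the dataset out first, then compute on cheaper local cores."""
    return dataset_tb * EGRESS_USD_PER_TB + core_hours * ONPREM_USD_PER_CORE_HOUR


def cheaper_location(dataset_tb: float, core_hours: float) -> str:
    """Return 'cloud' or 'on-prem', whichever is cheaper under this model."""
    in_cloud = cost_in_cloud(core_hours)
    moved = cost_after_egress(dataset_tb, core_hours)
    return "cloud" if in_cloud <= moved else "on-prem"


print(cheaper_location(dataset_tb=100, core_hours=10_000))
```

With these placeholder rates, even much cheaper local cores do not pay for the egress of a 100 TB dataset. The crossover point depends entirely on the ratio of data moved to compute performed.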

Page 40: 2016 05 sanger

Funding for specific analysis

Funding allocated by headcount, team, or department

Unfunded

Cost / scale of analysis

Large

Trivial

Moderate

Ongoing unfunded support burden

Fixed capacity on shared use systems.

Hard choices, limitations

Ad-hoc / opportunistic use

Elastic capacity on shared use systems

Moonshots

Lost opportunity

Distinct funding models

Page 41: 2016 05 sanger

You move towards and become like that which you think about.

Page 42: 2016 05 sanger

The Big Data Healthcare Feeding Frenzy

• “If we sequence X new patients with condition Y every year, the sequencing data alone will take up ALL THE EXABYTES”*

• The data storage and analysis needs of precision / personalized / genomic medicine are not unreasonable by comparison with major, data driven industries (100s of Exabytes over the next decade).

• We can compensate by being thoughtful about what data we store, how we store it, and how we share it.

* If you multiply a number by a sufficiently large number the product is a large number.
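The footnote's point is easy to reproduce. The sketch below multiplies patients per year by years by bytes per patient; the 100 GB per-patient footprint and the patient counts are illustrative assumptions, not figures from this talk.

```python
# "Multiply a number by a sufficiently large number": projected storage for a
# sequencing program. The per-patient footprint is an ASSUMPTION chosen for
# round numbers, not a measured value.

GB = 10**9
EB = 10**18


def total_exabytes(patients_per_year: int, years: int, bytes_per_patient: int) -> float:
    """Cumulative storage, in exabytes, if nothing is ever deleted."""
    return patients_per_year * years * bytes_per_patient / EB


print(total_exabytes(1_000_000, 10, 100 * GB))
```

Under these assumptions a million patients a year for a decade comes to about one exabyte, which is large but modest beside the 100s of exabytes the slide attributes to major data driven industries.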

Page 43: 2016 05 sanger

… people who had nothing to do with the design and execution of the study …

... use another group’s data for their own ends …

… even use the data to try to disprove what the original investigators had posited…

… some researchers have characterized as “research parasites”

Fear, Uncertainty, and Doubt

Page 44: 2016 05 sanger

What we need

• Incentive structures that reward making data accessible and useful
  – All indicators except the benefit of the patient lead to suboptimal behavior
  – This will require courage.

• National / global scale data repositories, standards, and toolkits
  – Death to walled gardens, monolithic systems, and GUIs.
  – Life to APIs built for a global community (cf. Amazon, 2002)

• Open, fearless conversation about data protection vs. appropriate use
  – Genomic data is inherently personally identifiable and should be treated as such
  – “Appropriate usage” goes well beyond legal conformity

Page 45: 2016 05 sanger

Standards are needed for genomic data

“The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.”

Regulatory Issues
Ethical Issues
Technical Issues

Page 46: 2016 05 sanger

This stuff is important

We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, not in an indefinite future, but this year.

We also have an opportunity to waste vast amounts of money (very rapidly) and still not really help anybody.

I would like to work together with you to build a better future.

[email protected]

Page 47: 2016 05 sanger

Conclusions

• In order to take full advantage of cloud technologies, we need to change not just what we do, but also how we do it.

• Organizations need to fundamentally rethink how they engage with technology and technologists in order to remain relevant.

• The groups who get good at collaboration in this new world will lead the next decade of biomedical science.

Page 48: 2016 05 sanger

Thank You