High-Performance Networking Use Cases in Life Sciences


DESCRIPTION

Big data has arrived in life science research and has driven the need for optimized high-performance networks in these research environments. Petabytes of data transfer, storage, and analytics are now routine because data is being produced cheaply and rapidly, at unprecedented rates, in academic, commercial, and clinical laboratories. These data flows are complicated by the mix of high-frequency mouse flows and high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that use both common and disparate compute resources, the need for high-performance in-flight data encryption to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development, from an I/O standpoint, throughout the industry. This presentation covers representative advanced networking use cases from life sciences research, the challenges they present in networking environments, some solutions being deployed in both small and large institutions, and an overview of a few of the problems that remain unresolved to date.


High-Performance Networking Use Cases in Life Sciences

2014 Internet2 Technology Exchange; Indianapolis, IN. Slides available at http://www.slideshare.net/arieberman

Who am I?


Director of Government Services, Principal Investigator

I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics

I’m an HPC/Infrastructure geek - 15 years

I help enable science!

I’m Ari


BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments

‣ 11+ years bridging the “gap” between science, IT & high performance computing

‣ Our wide-ranging work is what gets us invited to speak at events like this ...

BioTeam: What do we do?

[Build slide: diagram bridging Laboratory Knowledge to a Converged Solution]

Our domain coverage: Mostly work in Life Sciences

• Government
• Universities
• Big pharma
• Biotech
• Private institutes
• Diagnostic startups
• Oil and Gas
• Geospatial
• Hollywood Animation
• Law Enforcement


OK, so why am I here talking to you?

We have a unique perspective across much of life sciences

We’ve noticed a few things

‣ Big Data has arrived in Life Sciences

‣ Data is being generated at unprecedented rates

‣ Research and Biomedical Orgs were caught off guard

‣ IT running to catch up, limited budgets

‣ Money is tight, Orgs reluctant to invest in Bio-IT


25% of all Life Scientists will require HPC in 2015!


Big Picture / Meta Issue

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

‣ IT not a part of the conversation, running to catch up

Science progressing way faster than IT can refresh/change

The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)


It’s a risky time to be doing Bio-IT

What are the drivers in Bio-IT today?

Genomics: Next Generation Sequencing (NGS)

It’s like the hard drive of life


The big deal about DNA

‣ DNA is the template of life

‣ DNA is read --> RNA

‣ RNA is read --> Proteins

‣ Proteins are the functional machinery that make life possible

‣ Understanding the template = understanding basis for disease

How does NGS work? Sequencing by Synthesis


How does NGS work? Reference assembly and variant calling

The Human Genome: Gateway to personalized medicine

‣ 3.2 Gbp

‣ 23 chromosomes

‣ ~21,000 genes

‣ Over 55M known variations


The Problem... and why NGS is the primary driver

‣ Sequencers are now relatively cheap and fast

‣ Some can generate a human genome in 18 hours, for $2,000

‣ Everyone is doing it

‣ Can generate 3TB of data in that time

‣ First genome took 13 years and $2.7B to complete

‣ We know of 10 organizations: 100,000 genomes over 5 years


That’s 14PB of data, folks
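For a rough sanity check, here is the arithmetic behind those figures as a minimal Python sketch. Decimal units are assumed, as is the reading that the 100,000-genome figure applies to each of the 10 organizations; the deck states neither.

# Back-of-envelope math behind the slide's figures (assumptions noted above)
TB_BITS = 8e12                           # bits in one decimal terabyte

# Raw sequencer output: 3 TB in 18 hours
rate_mbps = 3 * TB_BITS / (18 * 3600) / 1e6
print(f"sustained sequencer output: ~{rate_mbps:.0f} Mb/s")          # ~370 Mb/s

# Stored total: 10 orgs x 100,000 genomes adding up to 14 PB
genomes = 10 * 100_000
per_genome_gb = 14e6 / genomes           # 14 PB = 14e6 decimal GB
print(f"implied retained data: ~{per_genome_gb:.0f} GB per genome")  # ~14 GB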


Other Methodologies Not Far Behind

High-throughput Imaging

‣ Robotics screening millions of compounds on live cells 24/7
  • Not as much data as genomics in volume, but just as complex
  • Data volumes in the 10's of TB/week

‣ Confocal Imaging
  • Scanning 100's of tissue sections/week, each with 10's of scans, each with 20-40 layers and multiple fluorescent channels
  • Data volumes in the 1's-10's of TB/week


High-res medical imaging

High-power, dense-detector MRI scanners in use 24/7 at large research hospitals

‣ Creating 3D models of brains, comparing large datasets

‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from a supercomputer in the OR (cool stuff)

‣ Also generates 10’s of TB/week


This is a huge problem

‣ Causing a literal deluge of data, in the 10’s of Petabytes

‣ NIH generating 1.5PB of data/month

‣ First real case in life science where 100Gb networking might really be needed

‣ But, not enough storage or compute
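As a back-of-envelope check (a minimal sketch assuming decimal units and a 30-day month), 1.5PB/month is already a multi-gigabit sustained average; real science traffic is bursty, so peaks run far higher, which is where the case for 100Gb links comes from.

# Average network rate implied by 1.5 PB/month of new data
bits_per_month = 1.5 * 8e15              # decimal petabytes -> bits
seconds_per_month = 30 * 86400
print(f"~{bits_per_month / seconds_per_month / 1e9:.1f} Gb/s sustained")  # ~4.6 Gb/s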


And, just to make things more complicated

File & Data Types: We have them all

‣ Massive text files

‣ Massive binary files

‣ Flatfile ‘databases’

‣ Spreadsheets everywhere

‣ Directories w/ 6 million files

‣ Large files: 600GB+

‣ Small files: 30kb or smaller

What to do with all that data? Why, giant meta-analyses, of course

‣ Typical problem across all of big data: how do you use it?

‣ In life sciences: no real standards for data formats

‣ Data scattered all over, despite push for Data Commons

‣ Not always accessible

‣ Combining the data, even if you have it all, is a real challenge

A Compounding Problem: Scientists don’t like to share (really!)

‣ The fear:
  • If someone sees data before it is published, they might steal it and publish it themselves (getting scooped)

‣ Causes:
  • Long time to publication
  • Outdated methods of assigning scientific credit
  • Not properly incentivized


A Problem for Data Commons: Sharing required

‣ Data piling up (scientists are hoarders)

‣ Bad network infrastructures

‣ Few central analytics platforms

‣ Wild-west file formats/algorithms

‣ No sharing


Hyperscale analytics will only work if the data is accessible!

Clear issue for Networking: Every kind of flow imaginable

‣ Mouse —> Elephant

‣ Typical problem: firewalls not designed for this

‣ Potentially massive amount of constant data movement

‣ How are people handling all of this?


Use Cases in Life Sciences


Getting Data out of the Laboratory

Laboratories not Integrated: Usually very little IT infrastructure in labs

‣ Tons of data-generating equipment going in now

‣ Can generate 15GB of data in 50 hours

‣ Others can generate 64GB/day

‣ Labs are not designed to transmit data; lucky if wired for Ethernet

Getting data out: OK, so write data over Ethernet to a network drive…

‣ Sounds good, 64GB in 24 hours ~= 6Mb/s

‣ Problem: desktop class ethernet adaptors

‣ No error checking, no retries, no MD5, no local buffer

‣ If network goes, whole run is lost
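None of the missing safeguards are exotic. Below is a minimal sketch of what "checksum, copy, verify, retry" looks like in Python; the function names, digest choice (SHA-256), and retry policy are illustrative, not anything a particular instrument vendor ships.

import hashlib
import shutil
import time
from pathlib import Path

def file_digest(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def copy_verified(src: Path, dst: Path, retries: int = 3) -> None:
    """Copy src to dst, re-read both sides, and retry on digest mismatch."""
    want = file_digest(src)
    for attempt in range(1, retries + 1):
        shutil.copyfile(src, dst)
        if file_digest(dst) == want:
            return
        time.sleep(2 ** attempt)         # brief backoff before retrying
    raise IOError(f"checksum mismatch after {retries} attempts: {src}")

Buffer the run locally first and push with something of this shape, and a network blip costs a retry instead of the whole run.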


Getting data out: Scientists have to get creative, but not in a good way

‣ Usually ends up going to local workstation

‣ Go buy the cheapest disks they can

‣ Carry it somewhere, transfer the data to a workstation

‣ Put the disk in a drawer under a sink (really)

‣ Works if lab only does one or two runs/month, fails if more


Lab data transit not huge!

Unless you’re dealing with a bigger lab with lots of equipment, or a core facility

‣ Fast networking not required, 100Mb OK

‣ Just GOOD networking

‣ …for now (more later)


Successful models: Some generalized network models that have successfully solved the problem

‣ Most of it is protocol and topology

‣ Quality of Service (QoS)

‣ Appropriate segmentation (L2 and/or L3)

‣ MPLS paths

‣ Intermediate protocols (e.g., Aspera FASP)

‣ One way or another, guarantee transfer


Storing the Data

Storage: a networking problem

As storage needs increase, the need to transmit the data goes up too

‣ Networking will quickly replace storage as #1 headache in Bio-IT

‣ Petascale storage is useless without high-performance networking

‣ Most enterprise networks won’t cut it


Storage: an Org Problem

Most single laboratories don’t have an immediate need for peta-scale storage

‣ BUT - labs need to be peta-capable

‣ Can’t predict how much or what kind of equipment

‣ Have to build for an indeterminate future

‣ Does it make sense for each lab to buy its own storage?
  • Probably not; it doesn’t scale well financially

Storage: an Org Problem

Orgs that don’t invest will find themselves in a mess of storage support

‣ This is when the storage problem becomes a networking problem

‣ Scientists need to share, collaborate

‣ A lab with 100TB of data needs to share it with offsite or onsite scientists

‣ Also: backups and disaster recovery; data is the new commodity


Storage: a networking problem

Without high-performance networking, petascale anything is useless

‣ Traditional enterprise networks don’t cut it

‣ Large single-stream flows get squashed through firewalls and IDS

‣ Centralized: 10’s of PBs

‣ Distributed: 100’s of PBs
  • Likely a lot of duplication

‣ Network becomes key

‣ Cloud use makes this an even bigger problem


Storage: options!

‣ There are a ton of options for storage
  • Local: small and large
  • Institutional: mostly large
  • Distributed institutional: distributed NAS (GPFS over WAN), object store networks, iRODS
  • Public clouds: block and object storage

‣ All require high-performance networking

‣ Anything external requires awesome external connection


Storage networking: solutions

External connections that make petascale storage useful to scientists

‣ OC-192
  • Works for large institutions willing to make the investment
  • Cost prohibitive: $200-$300k/month
  • Start-up cost of at least $1-2M for border equipment

‣ Internet2 10/100Gb hybrid ports
  • Much better cost, fewer routing options
  • $200k/year

‣ Google Fiber, AT&T GigaPower?

Storage networking: solutions

Internal networking is more critical than external for petascale storage

‣ Infrastructure must be able to support the inevitable 1PB transit
  • Disaster recovery
  • High availability
  • Backup

‣ Need at least 10Gb
  • Probably a dedicated 10Gb per >1PB storage facility: 40Gb minimum —> 1Tb backbone

‣ 1Gb will not cut it for that data size (see the math below)
  • ~97 days to transmit at saturation
  • 10Gb: ~9.7 days
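Those transfer-time figures check out at line rate. A minimal sketch, assuming decimal units and zero protocol overhead; the slide's slightly longer ~97- and ~9.7-day numbers presumably fold in real-world overhead:

# Days needed to move 1 PB at various line rates (saturation, no overhead)
def days_to_move(petabytes: float, gbps: float) -> float:
    bits = petabytes * 8e15              # decimal PB -> bits
    return bits / (gbps * 1e9) / 86400

for gbps in (1, 10, 40, 100):
    print(f"{gbps:>3} Gb/s: {days_to_move(1, gbps):6.1f} days")
# 1 Gb/s -> ~92.6 days; 10 Gb/s -> ~9.3 days; 100 Gb/s -> under a day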


Storage networking: solutions

And now, the real problem: topology and logical design

‣ Need a scaling internal topology

‣ One core switch doing all routing and packet transit == bad

‣ More advanced designs needed

‣ Also: prioritize performance over security
  • Nearly impossible for most orgs

‣ Most implemented option: Science DMZ


Science DMZ: not for everything

Sensitive data have policies and compliance issues; breaking them can be illegal

‣ Need a logical topology flexible enough for security AND performance

‣ Best example: the ISP model
  • Collapsed PE/CE on a single router at the edge
  • OSPF routing at the edge, fast label switching on dual 100Gb cores
  • VRF for network segments
  • MPLS for fast transit and bandwidth guarantees

‣ Side benefit: trusted and untrusted Science DMZs


Analyzing the data

Compute == Answers!

The pinnacle of data transit, the reason we store it in the first place

‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc.

‣ Mostly a datacenter issue

‣ Unless…
  • Storage not centralized or co-located: data gets duplicated unless you have a killer network
  • New methods: data doesn’t move, compute moves to the data


Use Case: Get data to the cluster

Assumes the use of a central high-performance storage system

‣ Easier problem within the same datacenter

‣ Large data needs a large pipe

‣ Output of the storage device needs to be fast
  • Needs to drive data to/from all compute nodes simultaneously

‣ Large clusters: big problem
  • Need parallel filesystems: GPFS, Lustre

Internal network especially important: Use of local disk in newer clusters

‣ Implementation of storage/analytics systems for Big Data/HDFS

‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools

‣ Now storage can be both internal and external

‣ I/O throughput is critical

Application characteristics

‣ Mostly single process apps

‣ Some SMP/threaded apps performance-bound by I/O and/or RAM

‣ Lots of Perl/Python/R

‣ Hundreds of apps, codes & toolkits

‣ 1TB - 2TB RAM “High Memory” nodes becoming essential

‣ MPI is rare
  • Well-written MPI is even rarer

‣ Few MPI apps actually benefit from expensive low-latency interconnects*
  • *Chemistry, modeling, and structure work are the exception

Life Science very I/O bound: Genomics especially

‣ Sync time for data often takes longer than the job itself

‣ Have to load up to 300GB into memory for a 1-minute process (see the sketch below)

‣ Do this thousands of times

‣ Largely due to bad programming and improperly configured systems
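A toy model makes the imbalance concrete. The 10Gb/s path to storage below is an assumed figure for illustration; even at that speed, staging 300GB takes four minutes to feed one minute of compute:

# Fraction of wall time spent on I/O for a stage-then-compute job
def io_fraction(stage_gb: float, storage_gbps: float, compute_s: float) -> float:
    stage_s = stage_gb * 8 / storage_gbps      # seconds to move the data
    return stage_s / (stage_s + compute_s)

stage_s = 300 * 8 / 10                         # 240 s to load 300 GB at 10 Gb/s
frac = io_fraction(stage_gb=300, storage_gbps=10, compute_s=60)
print(f"staging {stage_s:.0f}s vs compute 60s -> {frac:.0%} of wall time is I/O")  # 80%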


Cluster networking solutions

Interconnects between the nodes, and the cluster’s connection to the main network, are critical

‣ Optimal cluster networks: fat-tree and torus topologies
  • All layer 2, internally

‣ Most keep oversubscription to 1:4, depending on usage (see the sketch below)

‣ Top-level switches connect at high speed to the datacenter network
  • Newest are multiple 10Gb or 40Gb
  • InfiniBand internal networks: Mellanox ConnectX-3, with Ethernet- and IB-capable switch ports
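Oversubscription at a switch tier is just downstream bandwidth over upstream bandwidth. A minimal sketch with illustrative port counts (the 48x10GbE leaf with 4x40GbE uplinks is an assumption, not from the deck):

# Oversubscription ratio for a hypothetical leaf switch
down_gbps = 48 * 10          # toward compute nodes
up_gbps = 4 * 40             # toward the spine/core
print(f"oversubscription {down_gbps / up_gbps:.0f}:1")   # 3:1, within the 1:4 guideline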


Sharing the data: Collaboration

Collaboration: Fundamental to science

‣ Now that data production is reaching petascale, collaboration is getting harder

‣ Projects are getting more complex, more data is being generated, and it takes more people to work on the science

‣ Journal authorships: common to see 40+ authors now

‣ Clearly a networking problem at its core

‣ Let’s face it, doing this right is expensive!

Data Movement & Data Sharing: The gist of collaborative data sharing in life sciences

‣ Peta-scale data movement needs
  • Within an organization
  • To/from collaborators
  • To/from suppliers
  • To/from public data repos

‣ Peta-scale data sharing needs
  • Collaborators and partners may be all over the world


Most common high-speed network: FedEx

We Have Both Ingest Problems: Physical & Network

‣ Significant physical ingest occurring in Life Science
  • Standard media: naked SATA drives shipped via FedEx

‣ Cliché example:
  • 30 genomes outsourced means 30 drives will soon be sitting in your mail pile

‣ Organizations often use similar methods to freight data between buildings and among geographic sites (see the estimate below)
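FedEx keeps winning because its raw bandwidth is enormous; the pain is all in the handling. A quick estimate, with drive count, capacity, and transit time as illustrative assumptions:

# Effective bandwidth of shipping disks ("never underestimate the bandwidth
# of a station wagon full of tapes")
drives, tb_each = 30, 4                  # 30 naked SATA drives, 4 TB each
transit_days = 2                         # two-day shipping
bits = drives * tb_each * 8e12
gbps = bits / (transit_days * 86400) / 1e9
print(f"~{gbps:.1f} Gb/s effective")     # ~5.6 Gb/s, before the ingest labor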


Physical Ingest Is Just Plain Nasty

‣ Easy to talk about in theory

‣ Seems “easy” to scientists and even IT at first glance

‣ Really really nasty in practice

• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data


Collaboration Solutions: Science DMZ, making it easier to collaborate

[Image source: “The Science DMZ: Introduction & Architecture” -- ESnet]

Collaboration Solutions: Internet2, making data accessible and affordable

‣ Internet2 is bringing Research and Education together
  • High-speed, clean networking at its core
  • Novel and advanced uses of SDN
  • Subsidized rates: national high-performance networking made affordable

‣ AL2S: quickly establish national networks at high speed

‣ Combined with a Science DMZ: a platform for collaboration


Collaboration Solutions: The push for cloud

Most use Amazon Web Services; Google Cloud not far behind

‣ Many Orgs are pushing for cloud

‣ Unsupported scientists end up using cloud

‣ It’s fast, flexible, affordable, if done right

‣ Great place for large public datasets to live

‣ Has existing high(ish)-performance networking

‣ If done wrong, way more expensive than local compute

‣ Biggest problem: getting data to it!

Collaboration Solutions: Hybrid HPC, also known as hybrid clouds

‣ Relatively new idea
  • Small local footprint
  • Large, dynamic, scalable, orchestrated public cloud component

‣ DevOps is key to making this work

‣ High-speed network to the public cloud required

‣ Software interface layer acting as the mediator between local and public resources

‣ Good for tight budgets; has to be done right to work

‣ Not many working examples yet

Data Commons: Central storage of knowledge with compute

‣ Common structure for data storage and indexing (a cloud?)

‣ Associated compute for analytics

‣ Development platform for application development (PaaS)

‣ Make discovery more possible


An Example of Progress

USDA: Agricultural Research Service

A huge government agency trying to make agriculture better in every way

‣ Researchers doing amazing work on how crops and animals can be better farmed

‣ Lower environmental impacts

‣ Better economic returns

‣ How to optimize how agriculture functions in the US

‣ But, there’s a problem…

They’re doing all the things!

Every kind of high-throughput research discussed here, and more, on a massive scale


Just to list a few…

‣ Genomics (a lot of de novo assembly)

‣ Large-scale imaging
  • LIDAR
  • Satellite

‣ Simulations

‣ Climatology

‣ Remote sensing

‣ Farm equipment sensors (IoT)

Their current network


• Upgrading to DS3
• Still a lot of T1
• Won’t cut it for science

The new initiative

Build a Science DMZ: SciNet, on an Internet2 AL2S backbone


SciNet to feature compute: Hybrid HPC, storage, and virtualization environment


What’s the Big Picture?

Problems getting solved: Utilizing scientific computing to enable discovery

[Build slide: diagram from Laboratory Knowledge to Converged Infrastructure]


The meta issue

‣ Individual technologies and their general successful use are fine

‣ Unless they all work together as a unified solution, it all means nothing

‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure

[Hyper-]convergence: It’s what we do

[Build slide: diagram from Laboratory Knowledge to a Converged Solution]

Convergence: People matter too

[Diagram: Laboratory Knowledge and Converged Solution]

“The network IS the computer” - John Gage, Sun Microsystems

Universal Truth

‣ Convergence is not possible without networking

‣ Also not possible without GOOD networking

‣ Life Sciences is learning lessons that physics and astronomy learned 5-10 years ago

‣ Biggest problem is Org acceptance and investment in personnel and equipment

‣ Next-Gen biomedical research advancing too quickly: must invest now


End. Thanks! Slides at http://www.slideshare.net/arieberman
