Transcript
Page 1: High-Performance Networking Use Cases in Life Sciences

High-Performance NetworkingUse Cases in Life Sciences

1

2014 Internet2 Technology Exchange; Indianapolis, INSlides available at http://www.slideshare.net/arieberman

Page 2: High-Performance Networking Use Cases in Life Sciences

Who am I?

2

Director of Government Services, Principal Investigator

I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics

I’m an HPC/Infrastructure geek - 15 years

I help enable science!

I’m Ari

Page 3: High-Performance Networking Use Cases in Life Sciences

3

BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments

‣ 11+ years bridging the “gap” between science, IT & high performance computing

‣ Our wide-ranging work is what gets us invited to speak at events like this ...

Page 4: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Page 5: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Page 6: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Page 7: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Page 8: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Page 9: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Page 10: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Converged Solution

Page 11: High-Performance Networking Use Cases in Life Sciences

What do we do?BioTeam

4

Laboratory Knowledge

Converged Solution

Page 12: High-Performance Networking Use Cases in Life Sciences

Mostly work in Life SciencesOur domain coverage

• Government • Universities • Big pharma • Biotech • Private institutes • Diagnostic startups • Oil and Gas • Geospatial • Hollywood Animation • Law Enforcement

5

Page 13: High-Performance Networking Use Cases in Life Sciences

6

OK, so why am I here talking to you?

Page 14: High-Performance Networking Use Cases in Life Sciences

We have a unique perspective across much of life sciences

We’ve noticed a few things

‣ Big Data has arrived in Life Sciences

‣ Data is being generated at unprecedented rates

‣ Research and Biomedical Orgs were caught off guard

‣ IT running to catch up, limited budgets

‣ Money is tight, Orgs reluctant to invest in Bio-IT

7

25% of all Life Scientists will require HPC in 2015!

Page 15: High-Performance Networking Use Cases in Life Sciences

8

Big Picture / Meta Issue

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

‣ IT not a part of the conversation, running to catch up

Page 16: High-Performance Networking Use Cases in Life Sciences

Science progressing way faster than IT can refresh/change

The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• Bench science is changing month-to-month ... • ... while our IT infrastructure only gets refreshed every

2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)

9

Page 17: High-Performance Networking Use Cases in Life Sciences

10

It’s a risky time to be doing Bio-IT

11

What are the drivers in Bio-IT today?

Page 18: High-Performance Networking Use Cases in Life Sciences

11

Genomics: Next Generation Sequencing (NGS)

Page 19: High-Performance Networking Use Cases in Life Sciences

It’s like the hard drive of life

12

The big deal about DNA

‣ DNA is the template of life

‣ DNA is read --> RNA

‣ RNA is read --> Proteins

‣ Proteins are the functional machinery that make life possible

‣ Understanding the template = understanding basis for disease

Page 20: High-Performance Networking Use Cases in Life Sciences

Sequencing by SynthesisHow does NGS work?

13

Page 21: High-Performance Networking Use Cases in Life Sciences

Reference assembly, variant callingHow does NGS work?

14

Page 22: High-Performance Networking Use Cases in Life Sciences

Reference assembly, variant callingHow does NGS work?

14

Page 23: High-Performance Networking Use Cases in Life Sciences

Reference assembly, variant callingHow does NGS work?

14

Page 24: High-Performance Networking Use Cases in Life Sciences

Gateway to personalized medicineThe Human Genome

‣ 3.2 Gbp

‣ 23 chromosomes

‣ ~21,000 genes

‣ Over 55M known variations

15

Page 25: High-Performance Networking Use Cases in Life Sciences

...and why NGS is the primary driver

16

The Problem...

‣ Sequencers are now relatively cheap and fast

‣ Some can generate a human genome in 18 hours, for $2,000

‣ Everyone is doing it

‣ Can generate 3TB of data in that time

‣ First genome took 13 years and $2.7B to complete

‣ Know of 10 organizations: 100,000 genomes over 5 years

Page 26: High-Performance Networking Use Cases in Life Sciences

...and why NGS is the primary driver

16

The Problem...

‣ Sequencers are now relatively cheap and fast

‣ Some can generate a human genome in 18 hours, for $2,000

‣ Everyone is doing it

‣ Can generate 3TB of data in that time

‣ First genome took 13 years and $2.7B to complete

‣ Know of 10 organizations: 100,000 genomes over 5 years

That’s 14PB of data, folks

Page 27: High-Performance Networking Use Cases in Life Sciences

17

Other Methodologies Not Far Behind

Page 28: High-Performance Networking Use Cases in Life Sciences

High-throughput Imaging

‣ Robotics screening millions of compounds on live cells 24/7 • Not as much data as genomics in

volume, but just as complex • Data volumes in the 10’s TB/week

‣ Confocal Imaging • Scanning 100’s of tissue sections/

week, each with 10’s of scans, each with 20-40 layers and multiple florescent channels

• Data volumes in the 1’s - 10’s TB/week

18

Page 29: High-Performance Networking Use Cases in Life Sciences

High-power, dense detector MRI scanners in use 24/7 at large research hospitals

High-res medical imaging

‣ Creating 3D models of brains, comparing large datasets

‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from supercomputer in the OR (cool stuff)

‣ Also generates 10’s of TB/week

19

Page 30: High-Performance Networking Use Cases in Life Sciences

20

This is a huge problem

‣ Causing a literal deluge of data, in the 10’s of Petabytes

‣ NIH generating 1.5PB of data/month

‣ First real case in life science where 100Gb networking might really be needed

‣ But, not enough storage or compute

Page 31: High-Performance Networking Use Cases in Life Sciences

21

And, just to make things more complicated

Page 32: High-Performance Networking Use Cases in Life Sciences

We have them allFile & Data Types

‣ Massive text files

‣ Massive binary files

‣ Flatfile ‘databases’

‣ Spreadsheets everywhere

‣ Directories w/ 6 million files

‣ Large files: 600GB+

‣ Small files: 30kb or smaller22

Page 33: High-Performance Networking Use Cases in Life Sciences

Why, giant meta-analyses, of course

23

What to do with all that data?

‣ Typical problem across all of big data: how do you use it?

‣ In life sciences: no real standards of data formats

‣ Data scattered all over, despite push for Data Commons

‣ Not always accessible

‣ Combining the data if you have it all is a real challenge

Page 34: High-Performance Networking Use Cases in Life Sciences

Scientists don’t like to share (really!)A Compounding Problem...

‣ The fear: • if someone sees data before it

is published, they might steal it and publish it themselves (getting scooped)

‣ Causes: • Long time to publication • Outdated methods of

assigning scientific credit • Not properly incentivized

24

Page 35: High-Performance Networking Use Cases in Life Sciences

Sharing requiredA Problem for Data Commons

‣ Data piling up (scientists are hoarders)

‣ Bad network infrastructures

‣ Few central analytics platforms

‣ Wild-west file formats/algorithms

‣ No sharing25

Page 36: High-Performance Networking Use Cases in Life Sciences

Sharing requiredA Problem for Data Commons

‣ Data piling up (scientists are hoarders)

‣ Bad network infrastructures

‣ Few central analytics platforms

‣ Wild-west file formats/algorithms

‣ No sharing25

Hyperscale analytics will only work if the data is accessible!

Page 37: High-Performance Networking Use Cases in Life Sciences

Every kind of flow imaginableClear issue for Networking

‣ Mouse —> Elephant

‣ Typical problem: firewalls not designed for this

‣ Potentially massive amount of constant data movement

‣ How are people handling all of this?

26

Page 38: High-Performance Networking Use Cases in Life Sciences

27

Use Cases in Life Sciences

Page 39: High-Performance Networking Use Cases in Life Sciences

28

Getting Data out of the Laboratory

Page 40: High-Performance Networking Use Cases in Life Sciences

Usually very little IT infrastructure in labsLaboratories not Integrated

‣ Tons of data generating equipment going in now

‣ Can generate 15GB of data in 50 hours

‣ Others can generate 64GB/day

‣ Labs are not designed to transmit data, lucky if wired for ethernet

29

Page 41: High-Performance Networking Use Cases in Life Sciences

Usually very little IT infrastructure in labsLaboratories not Integrated

‣ Tons of data generating equipment going in now

‣ Can generate 15GB of data in 50 hours

‣ Others can generate 64GB/day

‣ Labs are not designed to transmit data, lucky if wired for ethernet

29

Page 42: High-Performance Networking Use Cases in Life Sciences

Usually very little IT infrastructure in labsLaboratories not Integrated

‣ Tons of data generating equipment going in now

‣ Can generate 15GB of data in 50 hours

‣ Others can generate 64GB/day

‣ Labs are not designed to transmit data, lucky if wired for ethernet

29

Page 43: High-Performance Networking Use Cases in Life Sciences

OK, so write data over ethernet to network drive…Getting data out

‣ Sounds good, 64GB in 24 hours ~= 6Mb/s

‣ Problem: desktop class ethernet adaptors

‣ No error checking, no retries, no MD5, no local buffer

‣ If network goes, whole run is lost

30

Page 44: High-Performance Networking Use Cases in Life Sciences

Scientists have to get creative, but not in a good wayGetting data out

‣ Usually ends up going to local workstation

‣ Go buy the cheapest disks they can

‣ Carry it somewhere, transfer the data to a workstation

‣ Put the disk in a drawer under a sink (really)

‣ Works if lab only does one or two runs/month, fails if more

31

Page 45: High-Performance Networking Use Cases in Life Sciences

Unless you’re dealing with a bigger lab with lots of equipment, or a core facility

Lab data transit not huge!

‣ Fast networking not required, 100Mb OK

‣ Just GOOD networking

‣ ….for now (more later)

32

Page 46: High-Performance Networking Use Cases in Life Sciences

Some generalized network models that have successfully solved the problem

Successful models

‣ Most of it is protocol and topology

‣ Quality of Service (QoS)

‣ Appropriate segmentation (L2 and/or L3)

‣ MPLS paths

‣ Intermediate protocols (i.e., Aspera FASP)

‣ One way or another, guarantee transfer

33

Page 47: High-Performance Networking Use Cases in Life Sciences

34

Storing the Data

Page 48: High-Performance Networking Use Cases in Life Sciences

As storage needs increase, the need to transmit it goes up too

Storage: a networking problem

‣ Networking will quickly replace storage as #1 headache in Bio-IT

‣ Petascale storage is useless without high-performance networking

‣ Most enterprise networks won’t cut it

35

Page 49: High-Performance Networking Use Cases in Life Sciences

Most single laboratories don’t have an immediate need for peta-scale storage

Storage: an Org Problem

‣ BUT - labs need to be peta-capable

‣ Can’t predict how much or what kind of equipment

‣ Have to build for an indeterminate future

‣ Does it make sense for each lab to buy own storage? • Probably not, doesn’t scale well

financially36

Page 50: High-Performance Networking Use Cases in Life Sciences

Orgs that don’t invest will find themselves in a mess of storage support

Storage: an Org Problem

‣ This is when the storage problem becomes a networking problem

‣ Scientists need to share, collaborate

‣ Lab with 100TB of data, needs to share with offsite or onsite scientist

‣ Also: backups and disaster recovery: data is the new commodity

37

Page 51: High-Performance Networking Use Cases in Life Sciences

Without high-performance networking, petascale anything is useless

Storage: a networking problem

‣ Traditional enterprise networks don’t cut it

‣ Large single-stream flows get squashed through firewalls and IDS

‣ Centralized: 10’s of PBs

‣ Distributed: 100’s of PBs • Likely a lot of duplication

‣ Network becomes key

‣ Cloud use makes this an even bigger problem

38

Page 52: High-Performance Networking Use Cases in Life Sciences

Storage: options!

‣ There are a ton of options for storage • Local: small and large • Institutional: mostly large • Distributed Institutional: distributed NAS

(GPFS over WAN), Object store networks, iRODS

• Public clouds: block and object storage

‣ All require high-performance networking

‣ Anything external requires awesome external connection

39

Page 53: High-Performance Networking Use Cases in Life Sciences

External connections that make petascale storage useful to scientists

Storage networking: solutions

‣ OC-192 • Works for large institutions willing to

make investment • Cost prohibitive: $200-$300k/month • Start-up cost of at least $1-2M for

border equipment

‣ Internet2 10/100Gb Hybrid ports • Much better cost, fewer routing

options • $200k/year

‣ Google Fiber, AT&T Gigapower?40

Page 54: High-Performance Networking Use Cases in Life Sciences

Internal networking more critical than external for petascale storage

Storage networking: solutions

‣ Infrastructure must be able to support the inevitable 1PB transit • Disaster recovery • High-availability • Backup

‣ Need at least 10Gb • Probably dedicated 10Gb per >1PB

storage facility: 40Gb min —> 1Tb backbone

‣ 1Gb will not cut it for that data size • ~97 days to transmit at saturation • 10Gb: ~9.7 days

41

Page 55: High-Performance Networking Use Cases in Life Sciences

And now, the real problem: topology and logical design

Storage networking: solutions

‣ Need a scaling internal topology

‣ One core switch doing all routing and packet transit == bad

‣ More advanced designs needed

‣ Also: prioritize performance over security • Nearly impossible for most orgs

‣ Most implemented option: Science DMZ

42

Page 56: High-Performance Networking Use Cases in Life Sciences

Sensitive data have policies and compliance issues, breaking them can be illegal

Science DMZ: not for everything

‣ Need logical topology flexible enough for security AND performance

‣ Best example: ISP model • Collapsed PE/CE on single router at edge • OSPF routing at edge, fast label

switching on dual 100Gb cores • VRF for network segments • MPLS for fast transit and bandwidth

guarantees

‣ Side benefit: trusted and untrusted Science DMZ

43

Page 57: High-Performance Networking Use Cases in Life Sciences

44

Analyzing the data

Page 58: High-Performance Networking Use Cases in Life Sciences

The pinnacle of data transit, the reason we store it in the first place

Compute == Answers!

‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc.

‣ Mostly a datacenter issue

‣ Unless… • Storage not centralized or co-

located: data duplicated unless have a killer network

• New methods: data doesn’t move, compute moves to data

45

Page 59: High-Performance Networking Use Cases in Life Sciences

Assumes the use of central high-performance storage system

Use Case: Get data to cluster

‣ Easier problem within the same datacenter

‣ Large data needs large pipe

‣ Output of storage device needs to be fast • Needs to drive data to/from all

compute nodes simultaneously

‣ Large clusters: big problem • Needs parallel filesystems:

GPFS, Lustre46

Page 60: High-Performance Networking Use Cases in Life Sciences

Use of local disk in newer clustersInternal network esp. important

‣ Implementation of storage/analytics systems for Big Data/HDFS

‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools

‣ Now storage can be both internal and external

‣ I/O throughput is critical47

Page 61: High-Performance Networking Use Cases in Life Sciences

Application characteristics

‣ Mostly single process apps

‣ Some SMP/threaded apps performance bound by IO and/or RAM

‣ Lots of Perl/Python/R

‣ Hundreds of apps, codes & toolkits

‣ 1TB - 2TB RAM “High Memory” nodes becoming essential

‣ MPI is rare • Well written MPI is even rarer

‣ Few MPI apps actually benefit from expensive low-latency interconnects* • *Chemistry, modeling and structure work is

the exception48

Page 62: High-Performance Networking Use Cases in Life Sciences

Genomics especiallyLife Science very I/O bound

‣ Sync time for data often takes longer than the job itself

‣ Have to load up to 300GB into memory, for 1min process

‣ Do this thousands of times

‣ Largely due to bad programming and improperly configured systems

49

Page 63: High-Performance Networking Use Cases in Life Sciences

Interconnects between the nodes and the cluster’s connection to the main network critical

Cluster networking Solutions

‣ Optimal cluster networks: fat tree and torus topologies • All layer 2, internally ‣ Most keep subscription to 1:4,

depending on usage

‣ Top-level switches connect at high speed to datacenter network • Newest are multiple 10Gb or 40Gb • Infiniband internal networks:

Mellanox ConnectX3 - ethernet and IB capable switch ports

50

Page 64: High-Performance Networking Use Cases in Life Sciences

51

Sharing the data: Collaboration

Page 65: High-Performance Networking Use Cases in Life Sciences

Fundamental to scienceCollaboration

‣ Now that data production is reaching petascale, collaboration is getting harder

‣ Projects are getting more complex, more data is being generated, takes more people to work on the science

‣ Journal authorships: common to see 40+ authors now

‣ Clearly a networking problem at its core

‣ Let’s face it, doing this right is expensive!52

Page 66: High-Performance Networking Use Cases in Life Sciences

The gist of collaborative data sharing in life sciencesData Movement & Data Sharing

‣ Peta-scale data movement needs • Within an organization • To/from collaborators • To/from suppliers • To/from public data repos

‣ Peta-scale data sharing needs • Collaborators and partners may

be all over the world53

Page 67: High-Performance Networking Use Cases in Life Sciences

54

Most common high-speed network: FedEx

Page 68: High-Performance Networking Use Cases in Life Sciences

Physical & NetworkWe Have Both Ingest Problems

‣ Significant physical ingest occurring in Life Science • Standard media: naked SATA drives

shipped via Fedex

‣ Cliche example: • 30 genomes outsourced means 30

drives will soon be sitting in your mail pile

‣ Organizations often use similar methods to freight data between buildings and among geographic sites

55

Page 69: High-Performance Networking Use Cases in Life Sciences

Physical Ingest Just Plain Nasty

‣ Easy to talk about in theory

‣ Seems “easy” to scientists and even IT at first glance

‣ Really really nasty in practice

• Incredibly time consuming • Significant operational burden • Easy to do badly / lose data

56

Page 70: High-Performance Networking Use Cases in Life Sciences

Science DMZ: making it easier to collaborateCollaboration Solutions

57Image source: “The Science DMZ: Introduction & Architecture” -- esnet

Page 71: High-Performance Networking Use Cases in Life Sciences

Internet2: making data accessible and affordableCollaboration Solutions

‣ Internet2 is bringing Research and Education together • High-speed, clean networking at its

core • Novel and advanced uses of SDN • Subsidized rates: national high-

performance networking affordable

‣ AL2S: quickly establish national networks at high-speed

‣ Combined with Science DMZ: platform for collaboration

58

Page 72: High-Performance Networking Use Cases in Life Sciences

Push for Cloud use: Most use Amazon Web Services, Google Cloud not far behind

Collaboration Solutions

‣ Many Orgs are pushing for cloud

‣ Unsupported scientists end up using cloud

‣ It’s fast, flexible, affordable, if done right

‣ Great place for large public datasets to live

‣ Has existing high(ish)-performance networking

‣ If done wrong, way more expensive than local compute

‣ Biggest problem: getting data to it!59

Page 73: High-Performance Networking Use Cases in Life Sciences

Hybrid HPC: Also known as hybrid cloudsCollaboration Solutions

‣ Relatively new idea • small local footprint • large, dynamic, scalable, orchestrated

public cloud component

‣ DevOps is key to making this work

‣ High-speed network to public cloud required

‣ Software interface layer acting as the mediator between local and public resources

‣ Good for tight budgets, has to be done right to work

‣ Not many working examples yet60

Page 74: High-Performance Networking Use Cases in Life Sciences

Central storage of knowledge with computeData Commons

‣ Common structure for data storage and indexing (a cloud?)

‣ Associated compute for analytics

‣ Development platform for application development (PaaS)

‣ Make discovery more possible

61

Page 75: High-Performance Networking Use Cases in Life Sciences

62

An Example of Progress

Page 76: High-Performance Networking Use Cases in Life Sciences

Huge Government Agency trying to make agriculture better in every way

USDA: Agricultural Research Service

‣ Researchers doing amazing research on how crops and animals can be better farmed

‣ Lower environmental impacts

‣ Better economic returns

‣ How to optimize how agriculture functions in the US

‣ But, there’s a problem…63

Page 77: High-Performance Networking Use Cases in Life Sciences

Every kind of high-throughput research talked about they are doing, and more, and on a massive scale

They’re doing all the things!

64

Page 78: High-Performance Networking Use Cases in Life Sciences

Just to list a few…

‣ Genomics (a lot of de novo assembly)

‣ Large scale imaging • LIDAR • Satellite

‣ Simulations

‣ Climatology

‣ Remote sensing

‣ Farm equipment sensors (IoT)65

Page 79: High-Performance Networking Use Cases in Life Sciences

Their current network

66

• Upgrading to DS3 • Still a lot of T1 • Won’t cut it for

science

Page 80: High-Performance Networking Use Cases in Life Sciences

Build a Science DMZ: SciNet, on an Internet2 AL2S Backbone

The new initiative

67

Page 81: High-Performance Networking Use Cases in Life Sciences

Hybrid HPC, Storage, Virtualization environmentSciNet to feature compute

68

Page 82: High-Performance Networking Use Cases in Life Sciences

69

What’s the Big Picture?

Page 83: High-Performance Networking Use Cases in Life Sciences

Utilizing scientific computing to enable discoveryProblems getting solved

70

Laboratory Knowledge

Page 84: High-Performance Networking Use Cases in Life Sciences

Utilizing scientific computing to enable discoveryProblems getting solved

70

Laboratory Knowledge

Page 85: High-Performance Networking Use Cases in Life Sciences

Utilizing scientific computing to enable discoveryProblems getting solved

70

Laboratory Knowledge

Page 86: High-Performance Networking Use Cases in Life Sciences

Utilizing scientific computing to enable discoveryProblems getting solved

70

Laboratory Knowledge

Page 87: High-Performance Networking Use Cases in Life Sciences

Utilizing scientific computing to enable discoveryProblems getting solved

70

Laboratory Knowledge

Page 88: High-Performance Networking Use Cases in Life Sciences

Utilizing scientific computing to enable discoveryProblems getting solved

70

Laboratory Knowledge

Page 89: High-Performance Networking Use Cases in Life Sciences

Converged Infrastructure

71

The meta issue

‣ Individual technologies and their general successful use are fine

‣ Unless they all work together as a unified solution, it all means nothing

‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure

Page 90: High-Performance Networking Use Cases in Life Sciences

It’s what we do[Hyper-]convergence

72

Laboratory Knowledge

Page 91: High-Performance Networking Use Cases in Life Sciences

It’s what we do[Hyper-]convergence

72

Laboratory Knowledge

Converged Solution

Page 92: High-Performance Networking Use Cases in Life Sciences

It’s what we do[Hyper-]convergence

72

Laboratory Knowledge

Converged Solution

Page 93: High-Performance Networking Use Cases in Life Sciences

People matter tooConvergence

73

Laboratory Knowledge

Converged Solution

Page 94: High-Performance Networking Use Cases in Life Sciences

“The network IS the computer” - John Gage, Sun Microsystems

Universal Truth

‣ Convergence is not possible without networking

‣ Also not possible without GOOD networking

‣ Life Sciences is learning lessons learned by physics and astronomy 5-10 years ago

‣ Biggest problem is Org acceptance and investment in personnel and equipment

‣ Next-Gen biomedical research advancing too quickly: must invest now

74

Page 95: High-Performance Networking Use Cases in Life Sciences

75

end; Thanks! slides at http://www.slideshare.net/arieberman


Recommended