High-Performance Networking Use Cases in Life Sciences


DESCRIPTION

Big data has arrived in life science research and has driven the need for optimized high-performance networks in these research environments. Petabytes of data transfer, storage, and analytics are now routine because data is being produced cheaply and rapidly, at unprecedented rates, in academic, commercial, and clinical laboratories. These data flows are complicated by the mix of high-frequency mouse flows and high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that use both common and disparate compute resources, the need for high-performance in-flight data encryption to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development, from an I/O standpoint, throughout the industry. This presentation covers representative advanced networking use cases from life sciences research, the challenges they present in networking environments, some solutions being deployed in both small and large institutions, and an overview of a few of the problems that remain unresolved to date.


High-Performance Networking Use Cases in Life Sciences

2014 Internet2 Technology Exchange; Indianapolis, IN. Slides available at http://www.slideshare.net/arieberman

Who am I?


Director of Government Services, Principal Investigator

I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics

I’m an HPC/Infrastructure geek - 15 years

I help enable science!

I’m Ari


BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments

‣ 11+ years bridging the “gap” between science, IT & high performance computing

‣ Our wide-ranging work is what gets us invited to speak at events like this ...

BioTeam: What do we do?

[Build slide: diagram bridging Laboratory Knowledge to a Converged Solution]

Our domain coverage: Mostly work in Life Sciences

• Government
• Universities
• Big pharma
• Biotech
• Private institutes
• Diagnostic startups
• Oil and Gas
• Geospatial
• Hollywood Animation
• Law Enforcement


OK, so why am I here talking to you?

We have a unique perspective across much of life sciences

We’ve noticed a few things

‣ Big Data has arrived in Life Sciences

‣ Data is being generated at unprecedented rates

‣ Research and Biomedical Orgs were caught off guard

‣ IT running to catch up, limited budgets

‣ Money is tight, Orgs reluctant to invest in Bio-IT


25% of all Life Scientists will require HPC in 2015!


Big Picture / Meta Issue

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

‣ IT not a part of the conversation, running to catch up

Science progressing way faster than IT can refresh/change

The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)


It’s a risky time to be doing Bio-IT

What are the drivers in Bio-IT today?

Genomics: Next Generation Sequencing (NGS)

It’s like the hard drive of life


The big deal about DNA

‣ DNA is the template of life

‣ DNA is read --> RNA

‣ RNA is read --> Proteins

‣ Proteins are the functional machinery that make life possible

‣ Understanding the template = understanding basis for disease

How does NGS work? Sequencing by Synthesis


How does NGS work? Reference assembly and variant calling

The Human Genome: Gateway to personalized medicine

‣ 3.2 Gbp

‣ 23 chromosomes

‣ ~21,000 genes

‣ Over 55M known variations


The Problem... and why NGS is the primary driver

‣ Sequencers are now relatively cheap and fast

‣ Some can generate a human genome in 18 hours, for $2,000

‣ Everyone is doing it

‣ Can generate 3TB of data in that time

‣ First genome took 13 years and $2.7B to complete

‣ We know of 10 organizations: 100,000 genomes over 5 years


That’s 14PB of data, folks
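For a rough sanity check, here is the arithmetic behind those figures as a minimal Python sketch. Decimal units are assumed, as is the reading that the 100,000-genome figure applies to each of the 10 organizations; the deck states neither.

# Back-of-envelope math behind the slide's figures (assumptions noted above)
TB_BITS = 8e12                           # bits in one decimal terabyte

# Raw sequencer output: 3 TB in 18 hours
rate_mbps = 3 * TB_BITS / (18 * 3600) / 1e6
print(f"sustained sequencer output: ~{rate_mbps:.0f} Mb/s")          # ~370 Mb/s

# Stored total: 10 orgs x 100,000 genomes adding up to 14 PB
genomes = 10 * 100_000
per_genome_gb = 14e6 / genomes           # 14 PB = 14e6 decimal GB
print(f"implied retained data: ~{per_genome_gb:.0f} GB per genome")  # ~14 GB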


Other Methodologies Not Far Behind

High-throughput Imaging

‣ Robotics screening millions of compounds on live cells 24/7
  • Not as much data as genomics in volume, but just as complex
  • Data volumes in the 10's of TB/week

‣ Confocal Imaging
  • Scanning 100's of tissue sections/week, each with 10's of scans, each with 20-40 layers and multiple fluorescent channels
  • Data volumes in the 1's-10's of TB/week


High-res medical imaging

High-power, dense-detector MRI scanners in use 24/7 at large research hospitals

‣ Creating 3D models of brains, comparing large datasets

‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from a supercomputer in the OR (cool stuff)

‣ Also generates 10’s of TB/week


This is a huge problem

‣ Causing a literal deluge of data, in the 10’s of Petabytes

‣ NIH generating 1.5PB of data/month

‣ First real case in life science where 100Gb networking might really be needed

‣ But, not enough storage or compute
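As a back-of-envelope check (a minimal sketch assuming decimal units and a 30-day month), 1.5PB/month is already a multi-gigabit sustained average; real science traffic is bursty, so peaks run far higher, which is where the case for 100Gb links comes from.

# Average network rate implied by 1.5 PB/month of new data
bits_per_month = 1.5 * 8e15              # decimal petabytes -> bits
seconds_per_month = 30 * 86400
print(f"~{bits_per_month / seconds_per_month / 1e9:.1f} Gb/s sustained")  # ~4.6 Gb/s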


And, just to make things more complicated

File & Data Types: We have them all

‣ Massive text files

‣ Massive binary files

‣ Flatfile ‘databases’

‣ Spreadsheets everywhere

‣ Directories w/ 6 million files

‣ Large files: 600GB+

‣ Small files: 30kb or smaller

What to do with all that data? Why, giant meta-analyses, of course

‣ Typical problem across all of big data: how do you use it?

‣ In life sciences: no real standards for data formats

‣ Data scattered all over, despite push for Data Commons

‣ Not always accessible

‣ Combining the data, even if you have it all, is a real challenge

A Compounding Problem: Scientists don’t like to share (really!)

‣ The fear:
  • If someone sees data before it is published, they might steal it and publish it themselves (getting scooped)

‣ Causes:
  • Long time to publication
  • Outdated methods of assigning scientific credit
  • Not properly incentivized


A Problem for Data Commons: Sharing required

‣ Data piling up (scientists are hoarders)

‣ Bad network infrastructures

‣ Few central analytics platforms

‣ Wild-west file formats/algorithms

‣ No sharing


Hyperscale analytics will only work if the data is accessible!

Clear issue for Networking: Every kind of flow imaginable

‣ Mouse —> Elephant

‣ Typical problem: firewalls not designed for this

‣ Potentially massive amount of constant data movement

‣ How are people handling all of this?


Use Cases in Life Sciences


Getting Data out of the Laboratory

Laboratories not Integrated: Usually very little IT infrastructure in labs

‣ Tons of data-generating equipment going in now

‣ Can generate 15GB of data in 50 hours

‣ Others can generate 64GB/day

‣ Labs are not designed to transmit data; lucky if wired for Ethernet

Getting data out: OK, so write data over Ethernet to a network drive…

‣ Sounds good, 64GB in 24 hours ~= 6Mb/s

‣ Problem: desktop class ethernet adaptors

‣ No error checking, no retries, no MD5, no local buffer

‣ If network goes, whole run is lost
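None of the missing safeguards are exotic. Below is a minimal sketch of what "checksum, copy, verify, retry" looks like in Python; the function names, digest choice (SHA-256), and retry policy are illustrative, not anything a particular instrument vendor ships.

import hashlib
import shutil
import time
from pathlib import Path

def file_digest(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def copy_verified(src: Path, dst: Path, retries: int = 3) -> None:
    """Copy src to dst, re-read both sides, and retry on digest mismatch."""
    want = file_digest(src)
    for attempt in range(1, retries + 1):
        shutil.copyfile(src, dst)
        if file_digest(dst) == want:
            return
        time.sleep(2 ** attempt)         # brief backoff before retrying
    raise IOError(f"checksum mismatch after {retries} attempts: {src}")

Buffer the run locally first and push with something of this shape, and a network blip costs a retry instead of the whole run.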


Getting data out: Scientists have to get creative, but not in a good way

‣ Usually ends up going to local workstation

‣ Go buy the cheapest disks they can

‣ Carry it somewhere, transfer the data to a workstation

‣ Put the disk in a drawer under a sink (really)

‣ Works if lab only does one or two runs/month, fails if more


Lab data transit not huge!

Unless you’re dealing with a bigger lab with lots of equipment, or a core facility

‣ Fast networking not required, 100Mb OK

‣ Just GOOD networking

‣ …for now (more later)


Successful models: Some generalized network models that have successfully solved the problem

‣ Most of it is protocol and topology

‣ Quality of Service (QoS)

‣ Appropriate segmentation (L2 and/or L3)

‣ MPLS paths

‣ Intermediate protocols (e.g., Aspera FASP)

‣ One way or another, guarantee transfer


Storing the Data

Storage: a networking problem

As storage needs increase, the need to transmit the data goes up too

‣ Networking will quickly replace storage as #1 headache in Bio-IT

‣ Petascale storage is useless without high-performance networking

‣ Most enterprise networks won’t cut it


Storage: an Org Problem

Most single laboratories don’t have an immediate need for peta-scale storage

‣ BUT - labs need to be peta-capable

‣ Can’t predict how much or what kind of equipment

‣ Have to build for an indeterminate future

‣ Does it make sense for each lab to buy its own storage?
  • Probably not; it doesn’t scale well financially

Storage: an Org Problem

Orgs that don’t invest will find themselves in a mess of storage support

‣ This is when the storage problem becomes a networking problem

‣ Scientists need to share, collaborate

‣ A lab with 100TB of data needs to share it with offsite or onsite scientists

‣ Also: backups and disaster recovery; data is the new commodity


Storage: a networking problem

Without high-performance networking, petascale anything is useless

‣ Traditional enterprise networks don’t cut it

‣ Large single-stream flows get squashed through firewalls and IDS

‣ Centralized: 10’s of PBs

‣ Distributed: 100’s of PBs
  • Likely a lot of duplication

‣ Network becomes key

‣ Cloud use makes this an even bigger problem


Storage: options!

‣ There are a ton of options for storage
  • Local: small and large
  • Institutional: mostly large
  • Distributed institutional: distributed NAS (GPFS over WAN), object store networks, iRODS
  • Public clouds: block and object storage

‣ All require high-performance networking

‣ Anything external requires awesome external connection


Storage networking: solutions

External connections that make petascale storage useful to scientists

‣ OC-192
  • Works for large institutions willing to make the investment
  • Cost prohibitive: $200-$300k/month
  • Start-up cost of at least $1-2M for border equipment

‣ Internet2 10/100Gb hybrid ports
  • Much better cost, fewer routing options
  • $200k/year

‣ Google Fiber, AT&T GigaPower?

Storage networking: solutions

Internal networking is more critical than external for petascale storage

‣ Infrastructure must be able to support the inevitable 1PB transit
  • Disaster recovery
  • High availability
  • Backup

‣ Need at least 10Gb
  • Probably a dedicated 10Gb per >1PB storage facility: 40Gb minimum —> 1Tb backbone

‣ 1Gb will not cut it for that data size (see the math below)
  • ~97 days to transmit at saturation
  • 10Gb: ~9.7 days
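Those transfer-time figures check out at line rate. A minimal sketch, assuming decimal units and zero protocol overhead; the slide's slightly longer ~97- and ~9.7-day numbers presumably fold in real-world overhead:

# Days needed to move 1 PB at various line rates (saturation, no overhead)
def days_to_move(petabytes: float, gbps: float) -> float:
    bits = petabytes * 8e15              # decimal PB -> bits
    return bits / (gbps * 1e9) / 86400

for gbps in (1, 10, 40, 100):
    print(f"{gbps:>3} Gb/s: {days_to_move(1, gbps):6.1f} days")
# 1 Gb/s -> ~92.6 days; 10 Gb/s -> ~9.3 days; 100 Gb/s -> under a day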


Storage networking: solutions

And now, the real problem: topology and logical design

‣ Need a scaling internal topology

‣ One core switch doing all routing and packet transit == bad

‣ More advanced designs needed

‣ Also: prioritize performance over security
  • Nearly impossible for most orgs

‣ Most implemented option: Science DMZ


Science DMZ: not for everything

Sensitive data have policies and compliance issues; breaking them can be illegal

‣ Need a logical topology flexible enough for security AND performance

‣ Best example: the ISP model
  • Collapsed PE/CE on a single router at the edge
  • OSPF routing at the edge, fast label switching on dual 100Gb cores
  • VRF for network segments
  • MPLS for fast transit and bandwidth guarantees

‣ Side benefit: trusted and untrusted Science DMZs


Analyzing the data

Compute == Answers!

The pinnacle of data transit, the reason we store it in the first place

‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc.

‣ Mostly a datacenter issue

‣ Unless…
  • Storage not centralized or co-located: data gets duplicated unless you have a killer network
  • New methods: data doesn’t move, compute moves to the data


Use Case: Get data to the cluster

Assumes the use of a central high-performance storage system

‣ Easier problem within the same datacenter

‣ Large data needs a large pipe

‣ Output of the storage device needs to be fast
  • Needs to drive data to/from all compute nodes simultaneously

‣ Large clusters: big problem
  • Need parallel filesystems: GPFS, Lustre

Internal network especially important: Use of local disk in newer clusters

‣ Implementation of storage/analytics systems for Big Data/HDFS

‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools

‣ Now storage can be both internal and external

‣ I/O throughput is critical

Application characteristics

‣ Mostly single process apps

‣ Some SMP/threaded apps performance-bound by I/O and/or RAM

‣ Lots of Perl/Python/R

‣ Hundreds of apps, codes & toolkits

‣ 1TB - 2TB RAM “High Memory” nodes becoming essential

‣ MPI is rare
  • Well-written MPI is even rarer

‣ Few MPI apps actually benefit from expensive low-latency interconnects*
  • *Chemistry, modeling, and structure work are the exception

Life Science very I/O bound: Genomics especially

‣ Sync time for data often takes longer than the job itself

‣ Have to load up to 300GB into memory for a 1-minute process (see the sketch below)

‣ Do this thousands of times

‣ Largely due to bad programming and improperly configured systems
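A toy model makes the imbalance concrete. The 10Gb/s path to storage below is an assumed figure for illustration; even at that speed, staging 300GB takes four minutes to feed one minute of compute:

# Fraction of wall time spent on I/O for a stage-then-compute job
def io_fraction(stage_gb: float, storage_gbps: float, compute_s: float) -> float:
    stage_s = stage_gb * 8 / storage_gbps      # seconds to move the data
    return stage_s / (stage_s + compute_s)

stage_s = 300 * 8 / 10                         # 240 s to load 300 GB at 10 Gb/s
frac = io_fraction(stage_gb=300, storage_gbps=10, compute_s=60)
print(f"staging {stage_s:.0f}s vs compute 60s -> {frac:.0%} of wall time is I/O")  # 80%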


Cluster networking solutions

Interconnects between the nodes, and the cluster’s connection to the main network, are critical

‣ Optimal cluster networks: fat-tree and torus topologies
  • All layer 2, internally

‣ Most keep oversubscription to 1:4, depending on usage (see the sketch below)

‣ Top-level switches connect at high speed to the datacenter network
  • Newest are multiple 10Gb or 40Gb
  • InfiniBand internal networks: Mellanox ConnectX-3, with Ethernet- and IB-capable switch ports
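Oversubscription at a switch tier is just downstream bandwidth over upstream bandwidth. A minimal sketch with illustrative port counts (the 48x10GbE leaf with 4x40GbE uplinks is an assumption, not from the deck):

# Oversubscription ratio for a hypothetical leaf switch
down_gbps = 48 * 10          # toward compute nodes
up_gbps = 4 * 40             # toward the spine/core
print(f"oversubscription {down_gbps / up_gbps:.0f}:1")   # 3:1, within the 1:4 guideline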


Sharing the data: Collaboration

Collaboration: Fundamental to science

‣ Now that data production is reaching petascale, collaboration is getting harder

‣ Projects are getting more complex, more data is being generated, and it takes more people to work on the science

‣ Journal authorships: common to see 40+ authors now

‣ Clearly a networking problem at its core

‣ Let’s face it, doing this right is expensive!

Data Movement & Data Sharing: The gist of collaborative data sharing in life sciences

‣ Peta-scale data movement needs
  • Within an organization
  • To/from collaborators
  • To/from suppliers
  • To/from public data repos

‣ Peta-scale data sharing needs
  • Collaborators and partners may be all over the world


Most common high-speed network: FedEx

We Have Both Ingest Problems: Physical & Network

‣ Significant physical ingest occurring in Life Science
  • Standard media: naked SATA drives shipped via FedEx

‣ Cliché example:
  • 30 genomes outsourced means 30 drives will soon be sitting in your mail pile

‣ Organizations often use similar methods to freight data between buildings and among geographic sites (see the estimate below)
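FedEx keeps winning because its raw bandwidth is enormous; the pain is all in the handling. A quick estimate, with drive count, capacity, and transit time as illustrative assumptions:

# Effective bandwidth of shipping disks ("never underestimate the bandwidth
# of a station wagon full of tapes")
drives, tb_each = 30, 4                  # 30 naked SATA drives, 4 TB each
transit_days = 2                         # two-day shipping
bits = drives * tb_each * 8e12
gbps = bits / (transit_days * 86400) / 1e9
print(f"~{gbps:.1f} Gb/s effective")     # ~5.6 Gb/s, before the ingest labor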


Physical Ingest Is Just Plain Nasty

‣ Easy to talk about in theory

‣ Seems “easy” to scientists and even IT at first glance

‣ Really really nasty in practice

• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data


Collaboration Solutions: Science DMZ, making it easier to collaborate

[Image source: “The Science DMZ: Introduction & Architecture” -- ESnet]

Collaboration Solutions: Internet2, making data accessible and affordable

‣ Internet2 is bringing Research and Education together
  • High-speed, clean networking at its core
  • Novel and advanced uses of SDN
  • Subsidized rates: national high-performance networking made affordable

‣ AL2S: quickly establish national networks at high speed

‣ Combined with a Science DMZ: a platform for collaboration


Collaboration Solutions: The push for cloud

Most use Amazon Web Services; Google Cloud not far behind

‣ Many Orgs are pushing for cloud

‣ Unsupported scientists end up using cloud

‣ It’s fast, flexible, affordable, if done right

‣ Great place for large public datasets to live

‣ Has existing high(ish)-performance networking

‣ If done wrong, way more expensive than local compute

‣ Biggest problem: getting data to it!

Collaboration Solutions: Hybrid HPC, also known as hybrid clouds

‣ Relatively new idea
  • Small local footprint
  • Large, dynamic, scalable, orchestrated public cloud component

‣ DevOps is key to making this work

‣ High-speed network to the public cloud required

‣ Software interface layer acting as the mediator between local and public resources

‣ Good for tight budgets; has to be done right to work

‣ Not many working examples yet

Data Commons: Central storage of knowledge with compute

‣ Common structure for data storage and indexing (a cloud?)

‣ Associated compute for analytics

‣ Development platform for application development (PaaS)

‣ Make discovery more possible


An Example of Progress

USDA: Agricultural Research Service

A huge government agency trying to make agriculture better in every way

‣ Researchers doing amazing work on how crops and animals can be better farmed

‣ Lower environmental impacts

‣ Better economic returns

‣ How to optimize how agriculture functions in the US

‣ But, there’s a problem…

They’re doing all the things!

Every kind of high-throughput research discussed here, and more, on a massive scale


Just to list a few…

‣ Genomics (a lot of de novo assembly)

‣ Large-scale imaging
  • LIDAR
  • Satellite

‣ Simulations

‣ Climatology

‣ Remote sensing

‣ Farm equipment sensors (IoT)

Their current network


• Upgrading to DS3
• Still a lot of T1
• Won’t cut it for science

The new initiative

Build a Science DMZ: SciNet, on an Internet2 AL2S backbone


SciNet to feature compute: Hybrid HPC, storage, and virtualization environment


What’s the Big Picture?

Problems getting solved: Utilizing scientific computing to enable discovery

[Build slide: diagram from Laboratory Knowledge to Converged Infrastructure]


The meta issue

‣ Individual technologies and their general successful use are fine

‣ Unless they all work together as a unified solution, it all means nothing

‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure

[Hyper-]convergence: It’s what we do

[Build slide: diagram from Laboratory Knowledge to a Converged Solution]

Convergence: People matter too

[Diagram: Laboratory Knowledge and Converged Solution]

“The network IS the computer” - John Gage, Sun Microsystems

Universal Truth

‣ Convergence is not possible without networking

‣ Also not possible without GOOD networking

‣ Life Sciences is learning lessons that physics and astronomy learned 5-10 years ago

‣ Biggest problem is Org acceptance and investment in personnel and equipment

‣ Next-Gen biomedical research advancing too quickly: must invest now


End. Thanks! Slides at http://www.slideshare.net/arieberman
