DESCRIPTION
Big data has arrived in the life science research domain and has driven the need for optimized high-performance networks in these research environments. Petabyte-scale data transfer, storage and analytics are now a reality because data is being produced cheaply and rapidly, at unprecedented rates, in academic, commercial and clinical laboratories. These data flows are complicated by the mix of high-frequency mouse flows and high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that use both common and disparate compute resources, the need for high-performance in-flight data encryption to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development, from an I/O standpoint, throughout the industry. This presentation covers representative advanced networking use cases from life sciences research, the challenges they present in networking environments, some solutions being deployed in both small and large institutions, and an overview of a few of the problems that remain unresolved to date.
High-Performance Networking Use Cases in Life Sciences
1
2014 Internet2 Technology Exchange; Indianapolis, IN
Slides available at http://www.slideshare.net/arieberman
Who am I?
2
Director of Government Services, Principal Investigator
I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics
I’m an HPC/Infrastructure geek - 15 years
I help enable science!
I’m Ari
3
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments
‣ 11+ years bridging the “gap” between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
BioTeam
What do we do?
4
Laboratory Knowledge
Converged Solution
Our domain coverage
Mostly work in Life Sciences
• Government
• Universities
• Big pharma
• Biotech
• Private institutes
• Diagnostic startups
• Oil and Gas
• Geospatial
• Hollywood Animation
• Law Enforcement
5
6
OK, so why am I here talking to you?
We have a unique perspective across much of life sciences
We’ve noticed a few things
‣ Big Data has arrived in Life Sciences
‣ Data is being generated at unprecedented rates
‣ Research and Biomedical Orgs were caught off guard
‣ IT running to catch up, limited budgets
‣ Money is tight, Orgs reluctant to invest in Bio-IT
7
25% of all Life Scientists will require HPC in 2015!
8
Big Picture / Meta Issue
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
‣ IT not a part of the conversation, running to catch up
Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
9
10
It’s a risky time to be doing Bio-IT
11
What are the drivers in Bio-IT today?
Genomics: Next Generation Sequencing (NGS)
It’s like the hard drive of life
12
The big deal about DNA
‣ DNA is the template of life
‣ DNA is read --> RNA
‣ RNA is read --> Proteins
‣ Proteins are the functional machinery that make life possible
‣ Understanding the template = understanding basis for disease
How does NGS work?
Sequencing by Synthesis
13
How does NGS work?
Reference assembly, variant calling
14
The Human Genome
Gateway to personalized medicine
‣ 3.2 Gbp
‣ 23 chromosomes
‣ ~21,000 genes
‣ Over 55M known variations
15
...and why NGS is the primary driver
16
The Problem...
‣ Sequencers are now relatively cheap and fast
‣ Some can generate a human genome in 18 hours, for $2,000
‣ Everyone is doing it
‣ Can generate 3TB of data in that time
‣ First genome took 13 years and $2.7B to complete
‣ We know of 10 organizations planning 100,000 genomes over 5 years
That’s 14PB of data, folks
17
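The 14PB punchline above can be sanity-checked with quick arithmetic. The slide does not state a per-genome size, so the figure below is an assumption chosen to be consistent with the total; a sketch, not sourced data.

```python
# Back-of-the-envelope check of the 14PB figure. The slide gives no
# per-genome size; ~14 GB of compressed sequence data per genome is an
# assumed value that makes the totals consistent under one reading of
# the slide (10 orgs x 100,000 genomes each).
orgs = 10
genomes_per_org = 100_000
gb_per_genome = 14                      # assumption, not from the slide
total_gb = orgs * genomes_per_org * gb_per_genome
total_pb = total_gb / 1_000_000         # decimal GB -> PB
print(f"{total_pb:.0f} PB")             # 14 PB
```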
Other Methodologies Not Far Behind
High-throughput Imaging
‣ Robotics screening millions of compounds on live cells 24/7
  • Not as much data as genomics in volume, but just as complex
  • Data volumes in the 10’s TB/week
‣ Confocal Imaging
  • Scanning 100’s of tissue sections/week, each with 10’s of scans, each with 20-40 layers and multiple fluorescent channels
  • Data volumes in the 1’s - 10’s TB/week
18
High-power, dense detector MRI scanners in use 24/7 at large research hospitals
High-res medical imaging
‣ Creating 3D models of brains, comparing large datasets
‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from supercomputer in the OR (cool stuff)
‣ Also generates 10’s of TB/week
19
20
This is a huge problem
‣ Causing a deluge of data, in the 10’s of petabytes
‣ NIH generating 1.5PB of data/month
‣ First case in life science where 100Gb networking might really be needed
‣ But, not enough storage or compute
21
And, just to make things more complicated
File & Data Types
We have them all
‣ Massive text files
‣ Massive binary files
‣ Flatfile ‘databases’
‣ Spreadsheets everywhere
‣ Directories w/ 6 million files
‣ Large files: 600GB+
‣ Small files: 30kb or smaller
22
Why, giant meta-analyses, of course
23
What to do with all that data?
‣ Typical problem across all of big data: how do you use it?
‣ In life sciences: no real standards of data formats
‣ Data scattered all over, despite push for Data Commons
‣ Not always accessible
‣ Combining the data if you have it all is a real challenge
A Compounding Problem...
Scientists don’t like to share (really!)
‣ The fear:
  • If someone sees data before it is published, they might steal it and publish it themselves (getting scooped)
‣ Causes:
  • Long time to publication
  • Outdated methods of assigning scientific credit
  • Not properly incentivized
24
A Problem for Data Commons
Sharing required
‣ Data piling up (scientists are hoarders)
‣ Bad network infrastructures
‣ Few central analytics platforms
‣ Wild-west file formats/algorithms
‣ No sharing
25
Hyperscale analytics will only work if the data is accessible!
Clear issue for Networking
Every kind of flow imaginable
‣ Mouse —> Elephant
‣ Typical problem: firewalls not designed for this
‣ Potentially massive amount of constant data movement
‣ How are people handling all of this?
26
27
Use Cases in Life Sciences
28
Getting Data out of the Laboratory
Laboratories not Integrated
Usually very little IT infrastructure in labs
‣ Tons of data generating equipment going in now
‣ Can generate 15GB of data in 50 hours
‣ Others can generate 64GB/day
‣ Labs are not designed to transmit data, lucky if wired for ethernet
29
Getting data out
OK, so write data over ethernet to network drive…
‣ Sounds good, 64GB in 24 hours ~= 6Mb/s
‣ Problem: desktop class ethernet adaptors
‣ No error checking, no retries, no MD5, no local buffer
‣ If network goes, whole run is lost
30
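The missing safeguards called out above (checksums, retries) are straightforward to add in software. A minimal sketch, not any instrument vendor's actual tooling; the function names are invented for illustration:

```python
# Sketch: copy an instrument run file to network storage with an MD5
# verify and simple retries -- the safeguards the desktop-class transfer
# path on this slide lacks. Assumes POSIX-style file paths.
import hashlib
import shutil
from pathlib import Path

def md5sum(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file in 1MB chunks so a 600GB run never has to fit in RAM."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def safe_copy(src: Path, dst: Path, retries: int = 3) -> None:
    """Copy src to dst; re-copy until checksums match or retries run out."""
    for attempt in range(1, retries + 1):
        shutil.copyfile(src, dst)
        if md5sum(src) == md5sum(dst):
            return
        print(f"checksum mismatch, retry {attempt}/{retries}")
    raise IOError(f"could not verify {src} after {retries} attempts")
```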
Getting data out
Scientists have to get creative, but not in a good way
‣ Usually ends up going to local workstation
‣ Go buy the cheapest disks they can
‣ Carry it somewhere, transfer the data to a workstation
‣ Put the disk in a drawer under a sink (really)
‣ Works if lab only does one or two runs/month, fails if more
31
Unless you’re dealing with a bigger lab with lots of equipment, or a core facility
Lab data transit not huge!
‣ Fast networking not required, 100Mb OK
‣ Just GOOD networking
‣ ….for now (more later)
32
Some generalized network models that have successfully solved the problem
Successful models
‣ Most of it is protocol and topology
‣ Quality of Service (QoS)
‣ Appropriate segmentation (L2 and/or L3)
‣ MPLS paths
‣ Intermediate protocols (e.g., Aspera FASP)
‣ One way or another, guarantee transfer
33
34
Storing the Data
As storage needs increase, the need to transmit that data goes up too
Storage: a networking problem
‣ Networking will quickly replace storage as #1 headache in Bio-IT
‣ Petascale storage is useless without high-performance networking
‣ Most enterprise networks won’t cut it
35
Most single laboratories don’t have an immediate need for peta-scale storage
Storage: an Org Problem
‣ BUT - labs need to be peta-capable
‣ Can’t predict how much or what kind of equipment
‣ Have to build for an indeterminate future
‣ Does it make sense for each lab to buy its own storage?
  • Probably not; doesn’t scale well financially
36
Orgs that don’t invest will find themselves in a mess of storage support
Storage: an Org Problem
‣ This is when the storage problem becomes a networking problem
‣ Scientists need to share, collaborate
‣ Lab with 100TB of data, needs to share with offsite or onsite scientist
‣ Also: backups and disaster recovery: data is the new commodity
37
Without high-performance networking, petascale anything is useless
Storage: a networking problem
‣ Traditional enterprise networks don’t cut it
‣ Large single-stream flows get squashed through firewalls and IDS
‣ Centralized: 10’s of PBs
‣ Distributed: 100’s of PBs
  • Likely a lot of duplication
‣ Network becomes key
‣ Cloud use makes this an even bigger problem
38
Storage: options!
‣ There are a ton of options for storage
  • Local: small and large
  • Institutional: mostly large
  • Distributed Institutional: distributed NAS (GPFS over WAN), object store networks, iRODS
  • Public clouds: block and object storage
‣ All require high-performance networking
‣ Anything external requires an awesome external connection
39
External connections that make petascale storage useful to scientists
Storage networking: solutions
‣ OC-192
  • Works for large institutions willing to make the investment
  • Cost prohibitive: $200-$300k/month
  • Start-up cost of at least $1-2M for border equipment
‣ Internet2 10/100Gb Hybrid ports
  • Much better cost, fewer routing options
  • $200k/year
‣ Google Fiber, AT&T Gigapower?
40
Internal networking more critical than external for petascale storage
Storage networking: solutions
‣ Infrastructure must be able to support the inevitable 1PB transit
  • Disaster recovery
  • High-availability
  • Backup
‣ Need at least 10Gb
  • Probably dedicated 10Gb per >1PB storage facility: 40Gb min —> 1Tb backbone
‣ 1Gb will not cut it for that data size
  • ~97 days to transmit at saturation
  • 10Gb: ~9.7 days
41
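The ~97-day and ~9.7-day figures above are straight bandwidth arithmetic. The snippet below reproduces it at pure line rate; exact results shift a few percent with the byte/bit conventions and protocol overhead assumed, which the slide doesn't state.

```python
# Time to move 1 PB at a given line rate, ignoring protocol overhead.
def transfer_days(size_bytes: float, link_bps: float) -> float:
    # bytes -> bits, divide by link speed, convert seconds -> days
    return size_bytes * 8 / link_bps / 86_400

PETABYTE = 1e15                                   # decimal petabyte
print(f"1 Gb/s:  {transfer_days(PETABYTE, 1e9):.1f} days")   # ~92.6
print(f"10 Gb/s: {transfer_days(PETABYTE, 1e10):.1f} days")  # ~9.3
```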
And now, the real problem: topology and logical design
Storage networking: solutions
‣ Need a scaling internal topology
‣ One core switch doing all routing and packet transit == bad
‣ More advanced designs needed
‣ Also: prioritize performance over security
  • Nearly impossible for most orgs
‣ Most implemented option: Science DMZ
42
Sensitive data have policies and compliance issues; breaking them can be illegal
Science DMZ: not for everything
‣ Need logical topology flexible enough for security AND performance
‣ Best example: ISP model
  • Collapsed PE/CE on single router at edge
  • OSPF routing at edge, fast label switching on dual 100Gb cores
  • VRF for network segments
  • MPLS for fast transit and bandwidth guarantees
‣ Side benefit: trusted and untrusted Science DMZ
43
44
Analyzing the data
The pinnacle of data transit, the reason we store it in the first place
Compute == Answers!
‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc.
‣ Mostly a datacenter issue
‣ Unless…
  • Storage not centralized or co-located: data duplicated unless you have a killer network
  • New methods: data doesn’t move, compute moves to data
45
Assumes the use of central high-performance storage system
Use Case: Get data to cluster
‣ Easier problem within the same datacenter
‣ Large data needs large pipe
‣ Output of storage device needs to be fast
  • Needs to drive data to/from all compute nodes simultaneously
‣ Large clusters: big problem
  • Needs parallel filesystems: GPFS, Lustre
46
Internal network esp. important
Use of local disk in newer clusters
‣ Implementation of storage/analytics systems for Big Data/HDFS
‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools
‣ Now storage can be both internal and external
‣ I/O throughput is critical
47
Application characteristics
‣ Mostly single process apps
‣ Some SMP/threaded apps performance bound by IO and/or RAM
‣ Lots of Perl/Python/R
‣ Hundreds of apps, codes & toolkits
‣ 1TB - 2TB RAM “High Memory” nodes becoming essential
‣ MPI is rare
  • Well-written MPI is even rarer
‣ Few MPI apps actually benefit from expensive low-latency interconnects*
  • *Chemistry, modeling and structure work is the exception
48
Life Science very I/O bound
Genomics especially
‣ Sync time for data often takes longer than the job itself
‣ Have to load up to 300GB into memory for a 1-minute process
‣ Do this thousands of times
‣ Largely due to bad programming and improperly configured systems
49
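One common fix for the pattern above (loading hundreds of GB just to run a short computation) is to stream the file in fixed-size chunks so memory stays flat. An illustrative sketch only; the function and its task are invented, not a real genomics tool:

```python
# Stream a raw sequence file in fixed-size chunks instead of loading
# the whole (possibly 300GB) file into RAM. Memory use stays at one
# chunk regardless of file size, and reads can overlap with compute.
def count_gc_streaming(path, chunk_bytes=64 * 1024 * 1024):
    """Return the G/C base fraction of a raw byte stream of bases."""
    gc = total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):     # one chunk in RAM at a time
            gc += chunk.count(b"G") + chunk.count(b"C")
            total += len(chunk)
    return gc / total if total else 0.0
```

Counting bytes is insensitive to chunk boundaries, so the streamed answer matches the load-it-all answer exactly; many per-record computations can be restructured the same way.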
Interconnects between the nodes and the cluster’s connection to the main network critical
Cluster networking Solutions
‣ Optimal cluster networks: fat tree and torus topologies
  • All layer 2, internally
‣ Most keep the subscription ratio to 1:4, depending on usage
‣ Top-level switches connect at high speed to datacenter network
  • Newest are multiple 10Gb or 40Gb
  • Infiniband internal networks: Mellanox ConnectX-3 - ethernet and IB capable switch ports
50
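The 1:4 subscription ratio above is just the ratio of host-facing bandwidth to uplink bandwidth at an edge switch. The port counts below are hypothetical examples, not taken from the slide:

```python
# Oversubscription ratio at an edge/leaf switch: how much bandwidth
# the attached hosts can offer versus what the uplinks can carry.
def oversubscription(host_ports, host_gbps, uplink_ports, uplink_gbps):
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical: 48 x 10Gb node-facing ports, 3 x 40Gb uplinks -> 1:4
ratio = oversubscription(48, 10, 3, 40)
print(f"{ratio:.1f}:1 oversubscribed")   # 4.0:1
```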
51
Sharing the data: Collaboration
Collaboration
Fundamental to science
‣ Now that data production is reaching petascale, collaboration is getting harder
‣ Projects are getting more complex, more data is being generated, takes more people to work on the science
‣ Journal authorships: common to see 40+ authors now
‣ Clearly a networking problem at its core
‣ Let’s face it, doing this right is expensive!
52
Data Movement & Data Sharing
The gist of collaborative data sharing in life sciences
‣ Peta-scale data movement needs
  • Within an organization
  • To/from collaborators
  • To/from suppliers
  • To/from public data repos
‣ Peta-scale data sharing needs
  • Collaborators and partners may be all over the world
53
54
Most common high-speed network: FedEx
We Have Both Ingest Problems
Physical & Network
‣ Significant physical ingest occurring in Life Science
  • Standard media: naked SATA drives shipped via FedEx
‣ Cliche example:
  • 30 genomes outsourced means 30 drives will soon be sitting in your mail pile
‣ Organizations often use similar methods to freight data between buildings and among geographic sites
55
Physical Ingest Just Plain Nasty
‣ Easy to talk about in theory
‣ Seems “easy” to scientists and even IT at first glance
‣ Really really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
56
Collaboration Solutions
Science DMZ: making it easier to collaborate
57
Image source: “The Science DMZ: Introduction & Architecture” -- ESnet
Collaboration Solutions
Internet2: making data accessible and affordable
‣ Internet2 is bringing Research and Education together
  • High-speed, clean networking at its core
  • Novel and advanced uses of SDN
  • Subsidized rates: national high-performance networking affordable
‣ AL2S: quickly establish national networks at high-speed
‣ Combined with Science DMZ: platform for collaboration
58
Push for Cloud use: Most use Amazon Web Services, Google Cloud not far behind
Collaboration Solutions
‣ Many Orgs are pushing for cloud
‣ Unsupported scientists end up using cloud
‣ It’s fast, flexible, affordable, if done right
‣ Great place for large public datasets to live
‣ Has existing high(ish)-performance networking
‣ If done wrong, way more expensive than local compute
‣ Biggest problem: getting data to it!
59
Collaboration Solutions
Hybrid HPC: also known as hybrid clouds
‣ Relatively new idea
  • Small local footprint
  • Large, dynamic, scalable, orchestrated public cloud component
‣ DevOps is key to making this work
‣ High-speed network to public cloud required
‣ Software interface layer acting as the mediator between local and public resources
‣ Good for tight budgets, has to be done right to work
‣ Not many working examples yet
60
Data Commons
Central storage of knowledge with compute
‣ Common structure for data storage and indexing (a cloud?)
‣ Associated compute for analytics
‣ Development platform for application development (PaaS)
‣ Make discovery more possible
61
62
An Example of Progress
Huge Government Agency trying to make agriculture better in every way
USDA: Agricultural Research Service
‣ Researchers doing amazing work on how crops and animals can be better farmed
‣ Lower environmental impacts
‣ Better economic returns
‣ How to optimize how agriculture functions in the US
‣ But, there’s a problem…63
They are doing every kind of high-throughput research covered here, and more, on a massive scale
They’re doing all the things!
64
Just to list a few…
‣ Genomics (a lot of de novo assembly)
‣ Large scale imaging
  • LIDAR
  • Satellite
‣ Simulations
‣ Climatology
‣ Remote sensing
‣ Farm equipment sensors (IoT)
65
Their current network
66
• Upgrading to DS3
• Still a lot of T1
• Won’t cut it for science
Build a Science DMZ: SciNet, on an Internet2 AL2S Backbone
The new initiative
67
SciNet to feature compute
Hybrid HPC, Storage, Virtualization environment
68
69
What’s the Big Picture?
Problems getting solved
Utilizing scientific computing to enable discovery
70
Laboratory Knowledge
Converged Infrastructure
71
The meta issue
‣ Individual technologies and their general successful use are fine
‣ Unless they all work together as a unified solution, it all means nothing
‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure
[Hyper-]convergence
It’s what we do
72
Laboratory Knowledge
Converged Solution
Convergence
People matter too
73
Laboratory Knowledge
Converged Solution
“The network IS the computer” - John Gage, Sun Microsystems
Universal Truth
‣ Convergence is not possible without networking
‣ Also not possible without GOOD networking
‣ Life Sciences is learning the lessons physics and astronomy learned 5-10 years ago
‣ Biggest problem is Org acceptance and investment in personnel and equipment
‣ Next-Gen biomedical research advancing too quickly: must invest now
74