
BioTeam Trends from the Trenches - NIH, April 2014


NIH Workshop on Advanced Networking for Data-Intensive Biomedical Research presentation - April 9, 2014


Page 1: BioTeam Trends from the Trenches - NIH, April 2014


Life Science HPC & Informatics: Trends from the trenches

April 2014


Page 2: BioTeam Trends from the Trenches - NIH, April 2014

Who, What, Why ...


BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ 12+ years bridging the “gap” between science, IT & high performance computing

‣ Our wide-ranging work is what gets us invited to speak at events like this ...


Page 3: BioTeam Trends from the Trenches - NIH, April 2014

Active at NIH since 2008


BioTeam & NIH

‣ Our primary goal: make science easier for researchers at NIH via scientific computing

‣ Recently involved in many projects:

• NIH-Wide HPC Assessment
• NIAID HPC Assessment
• NIMH Bioinformatics Assessment
• NCATS IT/Informatics Assessment
• NIH Network Modernization Project


Page 4: BioTeam Trends from the Trenches - NIH, April 2014

Topic 1: Scariest thing first ...

The biggest meta-issue facing life science informatics


Page 5: BioTeam Trends from the Trenches - NIH, April 2014


It’s a risky time to be doing Bio-IT


Page 6: BioTeam Trends from the Trenches - NIH, April 2014


Big Picture / Meta Issue

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

• Example: CCD sensor upgrade on that confocal microscopy rig just doubled storage requirements

• Example: The 2D ultrasound imager is now a 3D imager

• Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs

‣ For the above examples, do you think IT was informed in advance?


Page 7: BioTeam Trends from the Trenches - NIH, April 2014

Science progressing way faster than IT can refresh/change

The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)


Page 8: BioTeam Trends from the Trenches - NIH, April 2014

The Central Problem Is ...

‣ The easy period is over

‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary

‣ That does not work anymore; real solutions are required


Page 9: BioTeam Trends from the Trenches - NIH, April 2014


The new normal for informatics


Page 10: BioTeam Trends from the Trenches - NIH, April 2014

And a related problem ...

‣ It has never been easier or cheaper to acquire vast amounts of data

‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity

‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers

• ... ideally without punching holes in your firewall or consuming all available internet bandwidth


Page 11: BioTeam Trends from the Trenches - NIH, April 2014

If we get it wrong ...

‣ Lost opportunity

‣ Missing capability

‣ Frustrated & very vocal scientific staff

‣ Slowed pace of scientific discovery

‣ Problems in recruiting, retention, publication & product development


Page 12: BioTeam Trends from the Trenches - NIH, April 2014


Basic Bio/IT Landscape


Page 13: BioTeam Trends from the Trenches - NIH, April 2014

Compute-related design patterns largely static


Core Compute

‣ Linux compute clusters are still the baseline compute platform

‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers

‣ Compute is not hard. It’s a commodity that is easy to acquire & deploy in 2014
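To illustrate how routine this has become, here is a minimal sketch of submitting an analysis job to a Slurm scheduler from Python. The sbatch CLI and its flags are standard Slurm; the script name and resource numbers are illustrative assumptions:

```python
# Minimal sketch: submit a batch job to a Slurm cluster from Python.
# Assumes the sbatch CLI is on PATH; script name and resources are illustrative.
import subprocess

def submit_job(script_path, cores=8, mem_gb=32):
    """Submit a batch script and return the Slurm job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable",               # --parsable prints just the job ID
         "--cpus-per-task={}".format(cores),
         "--mem={}G".format(mem_gb),
         script_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print("Submitted Slurm job", submit_job("align_genomes.sh"))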


Page 14: BioTeam Trends from the Trenches - NIH, April 2014

We have them all

File & Data Types

‣ Massive text files

‣ Massive binary files

‣ Flatfile ‘databases’

‣ Spreadsheets everywhere

‣ Directories w/ 6 million files

‣ Large files: 600GB+

‣ Small files: 30 KB or smaller


Page 15: BioTeam Trends from the Trenches - NIH, April 2014

Application characteristics

‣ Mostly SMP/threaded apps, performance bound by IO and/or RAM

‣ Hundreds of apps, codes & toolkits

‣ 1TB - 2TB RAM “High Memory” nodes becoming essential

‣ Lots of Perl/Python/R

‣ MPI is rare
• Well-written MPI is even rarer

‣ Few MPI apps actually benefit from expensive low-latency interconnects*

• *Chemistry, modeling and structure work is the exception
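Since so little of this work benefits from MPI, the dominant pattern is a simple fan-out of serial jobs across cores. A minimal sketch of that embarrassingly parallel style, with hypothetical input file names:

```python
# Minimal sketch of the dominant non-MPI pattern: fan serial, IO-bound work
# out across local cores. Input file names are hypothetical.
from multiprocessing import Pool

def analyze(sample_path):
    # Stand-in for a real IO-bound step (alignment, variant calling, parsing ...)
    with open(sample_path, "rb") as handle:
        return sample_path, len(handle.read())

if __name__ == "__main__":
    samples = ["sample_{}.fastq".format(i) for i in range(16)]
    with Pool(processes=8) as pool:
        for path, size in pool.imap_unordered(analyze, samples):
            print(path, size)
```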


Page 16: BioTeam Trends from the Trenches - NIH, April 2014

Storage & Data Management

‣ LifeSci core requirement:

• Shared, simultaneous read/write access across many instruments, desktops & HPC silos

‣ NAS = “easiest” option
• Scale Out NAS products are the mainstream standard

‣ Parallel & Distributed storage for edge cases and large organizations with known performance needs

• Becoming much more common: GPFS has taken hold in LifeSci


Page 17: BioTeam Trends from the Trenches - NIH, April 2014


Storage & Data Management

‣ Storage & data mgmt. is the #1 infrastructure headache in life science environments

‣ Most labs need “peta capable” storage due to unpredictable future

• Only a small % will actually hit 1PB
• Often forced to trade away performance in order to obtain capacity

‣ Object stores, ZFS and commodity “Nexentastor-style” methods are making significant inroads


Page 18: BioTeam Trends from the Trenches - NIH, April 2014


Data Movement & Data Sharing

‣ Peta-scale data movement needs

• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos

‣ Peta-scale data sharing needs
• Collaborators and partners may be all over the world


Page 19: BioTeam Trends from the Trenches - NIH, April 2014


Networking

‣ Major 2014 focus

‣ May surpass storage as our #1 infrastructure headache

‣ Why?
• Petascale storage meaningless if you can’t access/move it
• 10-Gig, 40-Gig and 100-Gig networking will force significant changes elsewhere in the ‘bio-IT’ infrastructure


Page 20: BioTeam Trends from the Trenches - NIH, April 2014

Physical & Network


We Have Both Ingest Problems

‣ Significant physical ingest occurring in Life Science

• Standard media: naked SATA drives shipped via FedEx

‣ Cliché example:
• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile

‣ Organizations often use similar methods to freight data between buildings and among geographic sites


Page 21: BioTeam Trends from the Trenches - NIH, April 2014


Physical Ingest Just Plain Nasty

‣ Most common high-speed network: FedEx

‣ Easy to talk about in theory

‣ Seems “easy” to scientists and even IT at first glance

‣ Really, really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data


Page 22: BioTeam Trends from the Trenches - NIH, April 2014

And huge need for fast(er) research networks!


Huge Need For Network Ingest

1. Public data repositories have petabytes of useful data

2. Collaborators still need to swap data in serious ways

3. Amazon becoming an important repo of public and private data sources

4. Many vendors now “deliver” to the cloud
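As a concrete example of item 3, here is a minimal sketch of pulling from a public dataset hosted on Amazon S3. It assumes the boto3/botocore libraries; the bucket and object key are illustrative:

```python
# Minimal sketch: fetch a file from a public S3-hosted data repository.
# Assumes boto3/botocore are installed; bucket and key are illustrative.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# An unsigned client can read public buckets without AWS credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file("1000genomes", "README.analysis_history", "analysis_history.txt")
```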


Page 23: BioTeam Trends from the Trenches - NIH, April 2014


It all boils down to this ...


Page 24: BioTeam Trends from the Trenches - NIH, April 2014


Life Science In One Slide:

‣ Huge compute needs but not intractable and generally solved via Linux HPC farms. Most of our workloads are serial/batch in nature

‣ Ludicrous rate of innovation in the lab drives a similar rate of change for our software and tool environment

‣ With science changing faster than IT, emphasis is on agility and flexibility - we’ll trade performance for some measure of future-proofing

‣ Buried in data. Getting worse. Individual scientists can generate petascale data streams.

‣ We have all of the Information Lifecycle problems: Storing, Curating, Managing, Sharing, Ingesting and Moving


Page 25: BioTeam Trends from the Trenches - NIH, April 2014


Trends: DevOps & Org Charts


Page 26: BioTeam Trends from the Trenches - NIH, April 2014


The social contract between scientist and IT is changing forever


Page 27: BioTeam Trends from the Trenches - NIH, April 2014


You can blame “the cloud” for this


Page 28: BioTeam Trends from the Trenches - NIH, April 2014


DevOps & Scriptable Everything

‣ On (real) clouds, EVERYTHING has an API

‣ If it’s got an API you can automate and orchestrate it

‣ “scriptable infrastructure” is now a reality

‣ Driving capabilities that we will need in 2014 and beyond
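What “scriptable infrastructure” looks like in practice: a hedged sketch that provisions a compute node through the AWS EC2 API via boto3. The AMI ID and instance type are placeholders, not recommendations:

```python
# Minimal sketch of scriptable infrastructure: provision a compute node via API.
# Assumes boto3 plus AWS credentials; AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",      # placeholder AMI
    InstanceType="c4.8xlarge",   # illustrative compute-heavy node
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "project", "Value": "hpc-burst"}],
    }],
)
print("Launched:", instances[0].id)
```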


Page 29: BioTeam Trends from the Trenches - NIH, April 2014


DevOps & Scriptable Everything

‣ Incredible innovation in the past few years

‣ Driven mainly by companies with massive internet ‘fleets’ to manage

‣ ... but the benefits trickle down to us little people


Page 30: BioTeam Trends from the Trenches - NIH, April 2014

... and conquer the enterprise


DevOps will enable hybrid HPC

‣ Cloud automation/orchestration methods have been trickling down into our local infrastructures

‣ Driving significant impact on careers, job descriptions and org charts

‣ These methods are necessary for emerging hybrid cloud models for HPC/sharing


Page 31: BioTeam Trends from the Trenches - NIH, April 2014

2014: Continue to blur the lines between all these roles


Scientist/SysAdmin/Programmer

‣ IT jobs, roles and responsibilities are going to change significantly

‣ SysAdmins must learn to program in order to harness automation tools

‣ Programmers & Scientists can now self-provision and control sophisticated IT resources


Page 32: BioTeam Trends from the Trenches - NIH, April 2014

2014: Continue to blur the lines between all these roles


Scientist/SysAdmin/Programmer

‣ My take on the future ...
• SysAdmins (Windows & Linux) who can’t code will have career issues
• Far more control is going into the hands of the research end user
• IT support roles will radically change -- no longer owners or gatekeepers

‣ IT will “own” policies, procedures, reference patterns, identity mgmt, security & best practices

‣ Research will control the “what”, “when” and “how big”


Page 33: BioTeam Trends from the Trenches - NIH, April 2014

Research needing more and more compute


IT Orgs are Changing as well...

‣ 25% of researchers will need HPC this year

‣ 75% will need high-volume storage

‣ IT evolved from administrative need

• Science started grabbing resources

• IT either adapted or was replaced


Page 34: BioTeam Trends from the Trenches - NIH, April 2014

Research needing more and more compute


IT Orgs are Changing as well...

‣ Three types of adaptations:
• IT evolved to include research IT support
• IT split into research IT and corporate IT
• IT became primarily a research org -> run by a CSIO

‣ Orgs with scientific missions need adaptive IT with a stake in research projects -> restrictions kill science


Page 35: BioTeam Trends from the Trenches - NIH, April 2014


Trends: Compute


Page 36: BioTeam Trends from the Trenches - NIH, April 2014


Compute:

‣ Kind of boring. Solved problem in 2014

‣ Compute power is a commodity

• Inexpensive relative to other costs

• Far less vendor differentiation than storage

• Easy to acquire; easy to deploy


Page 37: BioTeam Trends from the Trenches - NIH, April 2014

Defensive hedge against Big Data / HDFS


Compute: Local Disk is Back

‣ We’ve started to see organizations move away from blade servers and 1U pizza box enclosures for HPC

‣ The “new normal” may be 4U enclosures with massive local disk capacity - spindle bays not occupied, just available

‣ Why? Hadoop & Big Data

‣ This is a defensive hedge against future HDFS or similar requirements

• Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play.

‣ Hardcore Hadoop rigs sometimes operate at a 1:1 ratio between core count and disk count


Page 38: BioTeam Trends from the Trenches - NIH, April 2014

New and refreshed HPC systems running many node types


Compute: Huge trend in ‘diversity’

‣ Accelerated trend since at least 2012 ...
• HPC compute resources no longer homogeneous; many types and flavors now deployed in single HPC stacks

‣ Newer clusters mix-and-match to cover the known use cases:

• GPU nodes for compute
• GPU nodes for visualization
• Large memory nodes (512GB+)
• Very large memory nodes (1TB+)
• ‘Fat’ nodes with many CPU cores
• ‘Thin’ nodes with super-fast CPUs
• Analytic nodes with SSD, FusionIO, flash or large local disk for ‘big data’ tasks


Page 39: BioTeam Trends from the Trenches - NIH, April 2014

GPUs, Coprocessors & FPGAs


Compute: Hardware Acceleration

‣ Specialized hardware acceleration has its place but will not take over the world

• “... the activation energy required for a scientist to use this stuff is generally quite high ...”

‣ GPU, Phi and FPGA best used in large-scale pipelines or as a specific solution to a singular pain point


Page 40: BioTeam Trends from the Trenches - NIH, April 2014

Also known as hybrid clouds

Emerging Trend: Hybrid HPC

‣ Relatively new idea

• small local footprint
• large, dynamic, scalable, orchestrated public cloud component

‣ DevOps is key to making this work

‣ High-speed network to public cloud required

‣ Software interface layer acting as the mediator between local and public resources (see the sketch after this list)

‣ Good for tight budgets, has to be done right to work

‣ Not many working examples yet
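A hedged sketch of what that mediator layer might look like: a dispatcher that keeps small jobs on the local cluster and bursts larger ones to the public cloud. The core threshold and the cloud helper are invented for illustration:

```python
# Hypothetical sketch of the hybrid-HPC mediator: small jobs stay local,
# large jobs burst to the public cloud. Threshold and helper are invented.
import subprocess

LOCAL_FREE_CORES = 64  # a real system would query the local scheduler

def dispatch(script, cores_needed):
    if cores_needed <= LOCAL_FREE_CORES:
        subprocess.run(["sbatch", script], check=True)  # local Slurm submit
        return "local"
    launch_cloud_workers(script, cores_needed)  # hypothetical cloud burst
    return "cloud"

def launch_cloud_workers(script, cores):
    # Placeholder: in practice this would call cloud orchestration APIs
    print("Would provision {} cloud cores and stage {}".format(cores, script))

if __name__ == "__main__":
    print(dispatch("variant_call.sh", 512))
```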


Page 41: BioTeam Trends from the Trenches - NIH, April 2014


Trends: Network


Page 42: BioTeam Trends from the Trenches - NIH, April 2014


Network: Speed @ Core and Edge

‣ Huge potential pain point

‣ May surpass storage as our #1 infrastructure headache

‣ Petascale data is useless if you can’t move it or access it fast enough

‣ Don’t be smug about 10 Gigabit - folks need to start thinking *now* about 40 and even 100 Gigabit Ethernet


Page 43: BioTeam Trends from the Trenches - NIH, April 2014


Network: Speed @ Core and Edge

‣ Remember 2004 when research storage requirements started to dwarf what the enterprise was using?

‣ Same thing is happening now for networking

‣ Research core, edge and top-of-rack networking speeds may exceed what the rest of the organization has standardized on


Page 44: BioTeam Trends from the Trenches - NIH, April 2014

Massive data movement needs are driving innovation

NIH Tackling this now!

‣ Currently installing 100Gb research network

‣ Will tackle the petascale data movement head on

• NIH gaining ground on 1PB/month

• Collaboration, core compute, data commons, external data sources

• Science DMZ!


Page 45: BioTeam Trends from the Trenches - NIH, April 2014

Network: ‘ScienceDMZ’

‣ “ScienceDMZ” concept is real and necessary

‣ BioTeam will be building them in 2014 and beyond

‣ Central premise:
• Legacy firewall, network and security methods architected for “many small data flows” use cases
• Not built to handle smaller #s of massive data flows
• Also very hard to deploy ‘traditional’ security gear on 10 Gigabit and faster networks

‣ More details, background & documents at http://fasterdata.es.net/science-dmz/

[Diagram: Science DMZ traffic separation over 10GE links - background traffic or competing bursts vs. DTN traffic with wire-speed bursts]


Page 46: BioTeam Trends from the Trenches - NIH, April 2014

Network: ‘ScienceDMZ’

‣ Start thinking/discussing this sooner rather than later

‣ Just like “the cloud” this may fundamentally change internal operations and technology

‣ Will also require conscious buy-in and support from senior network, security and risk management professionals

• ... these talks take time. Best to plan ahead


Page 47: BioTeam Trends from the Trenches - NIH, April 2014

Network: ‘ScienceDMZ’

‣ A Science DMZ has 3 required components:

1. Very fast “low-friction” network links and paths with security policy and enforcement specific to scientific workflows

2. Dedicated, high-performance data transfer nodes (“DTNs”) highly optimized for high-speed data xfer

3. Dedicated network performance/measurement nodes
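For component 3, a minimal sketch of the sort of scheduled throughput check a measurement node might run. It assumes the iperf3 CLI and a reachable test server (the hostname is illustrative); in real Science DMZ deployments, ESnet points to perfSONAR for this role:

```python
# Minimal sketch of a throughput check for a measurement node.
# Assumes the iperf3 CLI and a reachable iperf3 server (hostname illustrative);
# perfSONAR is the tool ESnet actually recommends for this role.
import json
import subprocess

def measure_throughput(server):
    """Run a 10-second iperf3 test and return achieved Gbit/s."""
    out = subprocess.run(
        ["iperf3", "-c", server, "-t", "10", "-J"],  # -J = JSON output
        check=True, capture_output=True, text=True,
    ).stdout
    bps = json.loads(out)["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9

if __name__ == "__main__":
    print("{:.1f} Gbit/s".format(measure_throughput("dtn.example.org")))
```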


Page 48: BioTeam Trends from the Trenches - NIH, April 2014


Simple Science DMZ:

Image source: “The Science DMZ: Introduction & Architecture” -- ESnet


Page 49: BioTeam Trends from the Trenches - NIH, April 2014

More hype than useful reality at the moment


Network: SDN Hype vs. Reality

‣ Software Defined Networking (“SDN”) is the new buzzword

‣ It will become pervasive and will change how we build and architect things

‣ But ...

‣ Not hugely practical at the moment for most environments
• We need far more than APIs that control port forwarding behavior on switches
• More time needed for all of the related technologies and methods to coalesce into something broadly useful and usable


Page 50: BioTeam Trends from the Trenches - NIH, April 2014


Trends: Storage


Page 51: BioTeam Trends from the Trenches - NIH, April 2014


Storage

‣ Still the biggest expense, biggest headache and scariest systems to design in modern life science informatics environments

‣ Many of the pain points we’ve talked about for years are still in place:

• Explosive growth forcing capacity-over-performance tradeoffs
• Lots of monolithic single tiers of storage
• Critical need to actively manage data through its full life cycle (just storing data is not enough ...)
• Need for post-POSIX solutions such as iRODS and other metadata-aware data repositories
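To make “metadata-aware” concrete, a minimal sketch that tags a file registered in iRODS with searchable metadata via the standard imeta icommand. The path and attribute/value pairs are illustrative:

```python
# Minimal sketch: attach searchable metadata (AVUs) to an iRODS data object.
# Assumes the icommands are installed and an iRODS session is active;
# the path and attribute/value pairs are illustrative.
import subprocess

def tag(data_object, attribute, value):
    # 'imeta add -d <object> <attr> <value>' adds an AVU to a data object
    subprocess.run(["imeta", "add", "-d", data_object, attribute, value],
                   check=True)

tag("/tempZone/home/rods/run42/sample.bam", "instrument", "HiSeq2500")
tag("/tempZone/home/rods/run42/sample.bam", "project", "NIH-pilot")
```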


Page 52: BioTeam Trends from the Trenches - NIH, April 2014


Storage Trends

‣ The large but monolithic storage platforms we’ve built up over the years are no longer sufficient

• Do you know how many people are running a single large scale-out NAS or parallel filesystem? Most of us!

‣ Tiered storage is now an absolute requirement
• At a minimum we need an active storage tier plus something far cheaper/deeper for cold files (see the sketch below)

‣ Expect the tiers to involve multiple vendors, products and technologies

• The Tier 1 storage vendors tend to have unacceptable pricing for their “all in one” tiered data management solutions
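A minimal sketch of that two-tier minimum: sweep files untouched for some cutoff period from the active tier to a cheap/deep cold tier. The paths and the 180-day cutoff are illustrative, and a production version would verify copies before moving anything:

```python
# Minimal sketch of a two-tier policy: move files untouched for 180+ days
# from the active tier to a cheaper cold tier. Paths/cutoff are illustrative.
import os
import shutil
import time

ACTIVE, COLD = "/data/active", "/data/cold"
CUTOFF = time.time() - 180 * 86400  # 180 days, in seconds

for root, _dirs, files in os.walk(ACTIVE):
    for name in files:
        src = os.path.join(root, name)
        if os.path.getatime(src) < CUTOFF:  # last access older than cutoff
            dst = os.path.join(COLD, os.path.relpath(src, ACTIVE))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
```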


Page 53: BioTeam Trends from the Trenches - NIH, April 2014

The Tier 1 storage vendors may be too expensive ...


Storage: Disruptive stuff ahead

‣ BioTeam has built 1 Petabyte ZFS-based storage pools from commodity whitebox hardware for about $100,000

‣ Infinidat “IZbox” provides 1 Petabyte of usable NAS as a turnkey appliance for roughly $375,000

• Both of these would be a nice, cost-effective archive or “cold” tier for less-active file and data storage

• Solutions like these cost far, far less than what Tier 1 storage vendors would charge for a petabyte of usable storage

• ... of course they come with their own risks and operational burden. This is an area where proper research and due diligence are essential

‣ Companies like Avere Systems are producing boxes that unify disparate storage tiers and link them to cloud and object stores

• This is a route to unifying “tier 1” storage with the “cheap & deep” storage


Page 54: BioTeam Trends from the Trenches - NIH, April 2014


The road ahead ...


Page 55: BioTeam Trends from the Trenches - NIH, April 2014

Some final thoughts


Future Trends and Patterns

‣ Data generation outpacing technology

‣ Cheap/easy laboratory assays taking over

• Researchers largely don’t know what to do with it all

• Holding on to the data until someone figures it out

• This will cause some interesting headaches for IT

• Huge need for real “Big Data” applications to be developed


Page 56: BioTeam Trends from the Trenches - NIH, April 2014

Some final thoughts

Future Trends and Patterns

‣ Unless there’s an investment in ultra-high-speed networking, we need to change how we think about analysis

‣ Data commons are setting a precedent

• Need to minimize the movement of data

• Include compute power and analysis platform with data commons

‣ Move the analysis to the data, don’t move the data

• Requires sharing / large core institutional resources


Page 57: BioTeam Trends from the Trenches - NIH, April 2014

Some final thoughts


Future Trends & Patterns

‣ Compute continues to become easier

‣ Data movement (physical & network) gets harder

‣ Cost of storage will be dwarfed by “cost of managing stored data”

‣ We can see end-of-life for our current IT architecture and design patterns; new patterns will start to appear over the next 2-5 years

‣ We’ve got a new headache to worry about ...


Page 58: BioTeam Trends from the Trenches - NIH, April 2014

A new challenge ...


Future Trends & Patterns

‣ Responsible sharing of clinical and genomic data will be the grand challenge of the post-Human Genome Project era

‣ We HAVE to get it right

‣ The ‘Global Alliance’ whitepaper cosigned by 70+ organizations is a must-read:

• Short link to whitepaper: http://biote.am/9j
• Long link: https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
• NIH will be critical in making this work for the world


Page 59: BioTeam Trends from the Trenches - NIH, April 2014


end; Thanks!
