Instrument ALL the things: Studying data-intensive workflows in the clowd.
C. Titus Brown, Michigan State University (PyCon 2014; see blog post)


Page 1: 2014 pycon-talk

Instrument ALL the things: Studying data-intensive workflows in the clowd.

C. Titus Brown, Michigan State University

(See blog post)

Page 2: 2014 pycon-talk

A few upfront definitions

Big Data, n: whatever is still inconvenient to compute on.

Data scientist, n: a statistician who lives in San Francisco.

Professor, n: someone who writes grants to fund people who do the work (cf. Fernando Perez)

I am a professor (not a data scientist) who writes grants so that others can do data-intensive biology.

Page 3: 2014 pycon-talk

This talk is dedicated to Terry Peppers

“Titus, I no longer understand what you actually do…”

“Daddy, what do you do at work!?”

Page 4: 2014 pycon-talk

I assemble puzzles for a living.

Well, ok, I strategize about solving multi-dimensional puzzles with billions of pieces and no box.

Page 5: 2014 pycon-talk

Three bioinformatic strategies in use

• Greedy: “if the piece sorta fits…”

• N²: “Do these two pieces match? How about this next one?”

• The Dutch approach.

Page 6: 2014 pycon-talk

The Dutch Solution (De Bruijn assembly)

Find similarities within puzzle pieces

Page 7: 2014 pycon-talk

The Dutch Solution

Algorithmically, De Bruijn assembly:

• Is linear in time with the number of pieces. (Way better than N²!)

• Is linear in memory with the volume of data. (This is due to errors in the digitization process.)
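A minimal sketch of the idea (a Python toy, not khmer's implementation): shred each read into k-mers and link overlapping (k-1)-mers, so building the graph is one linear pass over the pieces. And each error in a read mints up to k novel k-mers, which is exactly why memory grows with data volume.

```python
# Toy De Bruijn graph construction: one linear pass over the reads.
from collections import defaultdict

def build_debruijn(reads, k=5):
    graph = defaultdict(set)  # (k-1)-mer prefix -> set of (k-1)-mer suffixes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # consecutive k-mers overlap by k-1
    return graph

for node, nexts in sorted(build_debruijn(["ATGGCGTGCA", "GCGTGCAATG"]).items()):
    print(node, "->", sorted(nexts))
```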

Page 8: 2014 pycon-talk

Practical memory measurements

[Figure: GB of RAM used by Velvet vs. data volume; measurements by Adina Howe. (About $500 of data.)]

Page 9: 2014 pycon-talk

Our research challenges –

1. It costs only $10k and one week to generate enough sequence data that no commodity computer (and few supercomputers) can assemble it.

2. Hundreds to thousands of such data sets are being generated each year.

Page 10: 2014 pycon-talk

Our research challenges – (Solved)

Page 11: 2014 pycon-talk

Our research (i) - CS

• Streaming lossy compression approach that discards pieces we’ve seen before.

• Low-memory probabilistic data structures. (…see PyCon 2013 talk)

=> RAM now scales better: O(I), where I << N. (I is sample-dependent, but typically I < N/20.)
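To make that concrete, here is a toy version of the streaming idea; the names (CountMin, normalize) and parameters are illustrative inventions, not khmer's actual API. A fixed-size count-min sketch approximates k-mer counts, and a read is kept only while its estimated coverage is still low, which is what lets memory stay roughly O(I) rather than O(N).

```python
# Toy streaming "discard what we've seen" filter (illustrative only; the
# names and parameters here are invented, not khmer's actual API).
import hashlib

class CountMin:
    """Tiny count-min sketch: fixed memory, approximate k-mer counts."""
    def __init__(self, num_tables=4, table_size=10007):
        self.tables = [[0] * table_size for _ in range(num_tables)]

    def _slots(self, kmer):
        for seed, table in enumerate(self.tables):
            digest = hashlib.md5(f"{seed}:{kmer}".encode()).hexdigest()
            yield table, int(digest, 16) % len(table)

    def add(self, kmer):
        for table, idx in self._slots(kmer):
            table[idx] += 1

    def count(self, kmer):
        return min(table[idx] for table, idx in self._slots(kmer))

def normalize(reads, k=5, cutoff=20):
    """Yield a read only while its estimated median coverage is < cutoff."""
    sketch = CountMin()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        coverage = sorted(sketch.count(km) for km in kmers)
        if coverage[len(coverage) // 2] < cutoff:  # median k-mer count
            for km in kmers:
                sketch.add(km)
            yield read  # still novel: keep it
        # else: we've seen this region plenty; discard (lossy compression)

reads = ["ATGGCGTGCA"] * 50
print(len(list(normalize(reads))))  # -> 20: the first 20 copies survive
```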

Page 12: 2014 pycon-talk

Our research (ii) - approach

• Open source, open data, open science, and reproducible computational research:
  – GitHub
  – Automated testing, CI, & literate reSTing (sketched after this list)
  – Blogging, Twitter
  – IPython Notebook for data analysis and figures

• Protocols for assembling in the cloud.
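The "literate reSTing" trick is worth a sketch: the protocols are reST tutorials, and a script pulls the literal shell blocks back out so the same document doubles as an acceptance test. This toy extractor is my own illustration of the idea; the real script is literate-resting (see the ged-lab repos).

```python
# Toy reST literal-block extractor, in the spirit of literate-resting
# (the real script lives under github.com/ged-lab/; this is just the idea).
def extract_commands(rst_text):
    commands, in_block = [], False
    for line in rst_text.splitlines():
        if line.rstrip().endswith("::"):
            in_block = True                    # literal block follows '::'
        elif in_block and line.startswith("   "):
            commands.append(line.strip())      # indented line = code to run
        elif in_block and line.strip():
            in_block = False                   # dedented text ends the block
    return commands

tutorial = """\
Install the dependencies::

   pip install khmer
   pip install screed

Then run the assembler.
"""
print(extract_commands(tutorial))  # ['pip install khmer', 'pip install screed']
```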

Page 13: 2014 pycon-talk

Molgula oculata

Molgula occulta

Molgula oculata

Real solutions, tackling squishy biology!

Elijah Lowe & Billie Swalla

Page 14: 2014 pycon-talk

Doing things right => #awesomesauce

Page 15: 2014 pycon-talk

Benchmarking strategy

• Rent a bunch of cloud VMs from Amazon and Rackspace.

• Extract commands from tutorials using literate-resting.

• Use ‘sar’ (from the sysstat package) to sample CPU, RAM, and disk I/O (see the sketch below).
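A minimal version of that harness might look like the following. The sar flags (-u CPU, -r memory, -b I/O) are standard sysstat options, but the wrapper itself and the velveth command line are illustrative assumptions, not the lab's actual tooling.

```python
# Sketch: sample system counters with sar while a pipeline step runs.
import subprocess

def benchmark(cmd, logfile="sar.log", interval=5):
    with open(logfile, "w") as log:
        # -u: CPU, -r: memory, -b: disk I/O; one sample every `interval` seconds
        sampler = subprocess.Popen(["sar", "-u", "-r", "-b", str(interval)],
                                   stdout=log, stderr=subprocess.DEVNULL)
        try:
            subprocess.run(cmd, shell=True, check=True)  # the step under test
        finally:
            sampler.terminate()  # stop sampling when the step finishes

benchmark("velveth assembly.d 31 -fasta reads.fa")  # hypothetical pipeline step
```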

Page 16: 2014 pycon-talk

Benchmarking output

[Figure: data subset; AWS m1.xlarge.]

Page 17: 2014 pycon-talk

Each protocol has many steps

[Figure: data subset; AWS m1.xlarge.]

Page 18: 2014 pycon-talk

Most interested in RAM-intensive bit

[Figure: data subset; AWS m1.xlarge.]

Page 19: 2014 pycon-talk

Most interested in RAM-intensive bit

[Figure: complete data; AWS m1.xlarge.]

Page 20: 2014 pycon-talk

Observation #1: Rackspace is faster

machine          data disk      working disk    hours   cost
---------------  -------------  --------------  -----   ------
rackspace-15gb   200 GB         100 GB          34.9    $23.70
m2.xlarge        EBS            ephemeral       44.7    $18.34
m1.xlarge        EBS            ephemeral       45.5    $21.82
m1.xlarge        EBS, max IOPS  ephemeral       49.1    $23.56
m1.xlarge        EBS, max IOPS  EBS, max IOPS   52.5    $25.20
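A quick derived check on this table (my arithmetic, not from the slides): dividing cost by wall-clock hours gives each run's implied hourly rate.

```python
# Implied hourly rate for each run in the table above (cost / hours).
runs = [
    ("rackspace-15gb",           34.9, 23.70),
    ("m2.xlarge",                44.7, 18.34),
    ("m1.xlarge",                45.5, 21.82),
    ("m1.xlarge, max-IOPS EBS",  49.1, 23.56),
    ("m1.xlarge, all EBS",       52.5, 25.20),
]
for name, hours, cost in runs:
    print(f"{name:<26} ${cost / hours:.2f}/hour")
```

All three m1.xlarge runs come out at the same ~$0.48/hour, a decent sanity check on the numbers; Rackspace costs more per hour but finishes soonest, while m2.xlarge is the cheapest end to end.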

Page 21: 2014 pycon-talk

Surprise #1: AWS ephemeral storage is FASTER

(Same table as the previous slide.)

Page 22: 2014 pycon-talk

Observation #2: NUMA costs

Same task done with varying memory sizes.


Page 24: 2014 pycon-talk

Can’t we just use a faster computer?

• Demo data on m1.xlarge: 2,789 s
• Demo data on m3.xlarge: 1,970 s – 30% faster!

(Why? The m3.xlarge has 2x 40 GB SSD drives and 40% faster cores.)
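The 30% figure is just the ratio of the two wall-clock times:

```python
m1, m3 = 2789, 1970                    # seconds, demo data (from the slide)
print(f"{(m1 - m3) / m1:.0%} faster")  # -> 29% faster, i.e. ~30%
```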

Great! Let’s try it out!

Page 25: 2014 pycon-talk

Observation #3: multifaceted problem!

• Full data on m1.xlarge: 45.5 h
• Full data on m3.xlarge: ran out of disk space.

We need about 200 GB to run the full pipeline.

You can have fast disk or lots of disk but not both, for the moment.

Page 26: 2014 pycon-talk

Future directions

1. Invest in cache-local data structures and algorithms.

2. Invest in streaming/in-memory approaches.

3. It is not clear (to me) that straight code optimization or infrastructure engineering is a worthwhile investment.

Page 27: 2014 pycon-talk

Frequently Offered Solutions

1. You should, like, totally multithread that. (See: McDonald & Brown, POSA.)

2. Hadoop will just crush that workload, dude.(Unlikely to be cost-effective.)

3. Have you tried <my proprietary Big Data technology stack>?

(Thatz Not Science)

Page 28: 2014 pycon-talk

Optimization vs scaling

• Linear time/memory improvements would not have addressed our core problem. (2 years, 20x improvement, 100x increase in data.)

• The puzzle problem is a graph problem with big data, no locality, and little compute. Not friendly.

• We need(ed) to scale our algorithms.

• Can now run on a single chassis, in ~15 GB of RAM.
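A quick worked version of that first bullet, using the slide's own numbers:

```python
# Why a constant-factor win loses to algorithmic scaling here.
data_growth = 100   # data grew 100x over two years (slide's number)
speedup     = 20    # two years of linear improvements: 20x (slide's number)

print(data_growth / speedup)  # 5.0 -> an O(N) approach is still 5x worse off
# Moving to O(I), with I typically < N/20 (Page 11), is what brought the
# job back to a single chassis in ~15 GB of RAM.
```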

Page 29: 2014 pycon-talk

Optimization vs scaling --

Page 30: 2014 pycon-talk

Scaling can be more important!

Page 31: 2014 pycon-talk

What are we losing by focusing our engineering on pleasantly parallel problems?

• Hadoop is fundamentally not that interesting.

• Research is about the 100x.

• Scaling new problems, evaluating/creating new data structures and algorithms, etc.

Page 32: 2014 pycon-talk

(From my PyCon 2011 talk.)

Theme: Life’s too short to tackle the easy problems – come to academia!

Page 33: 2014 pycon-talk

Thanks!

• Leigh Sheneman, for starting the benchmarking project.

• Labbies: Michael R. Crusoe, Luiz Irber, Likit Preeyanon, Camille Scott, and Qingpeng Zhang.

Page 34: 2014 pycon-talk

Thanks!

• github.com/ged-lab/
  – khmer – core project
  – khmer-protocols – tutorials / acceptance tests
  – literate-resting – script to pull out code from reST tutorials

• Blog post at: http://ivory.idyll.org/blog/2014-pycon.html

• Michael R. Crusoe, Likit Preeyanon, Camille Scott, and Qingpeng Zhang are here at PyCon.

…note, you can probably afford to buy them off me :)

Page 35: 2014 pycon-talk

Different computational strategies for k-mer counting, revealed!

[Figure: khmer-counting paper pipeline; Qingpeng Zhang.]
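As a point of contrast with the probabilistic counting sketched earlier, the simplest strategy is exact counting in a hash table; a minimal illustration (my example, not the paper's pipeline):

```python
# Exact k-mer counting: the simplest strategy, and the memory-hungriest,
# since RAM grows with the number of distinct k-mers.
from collections import Counter

def count_kmers(reads, k=31):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

print(count_kmers(["ATGGCGTGCAATGGCGTGCA"], k=8).most_common(3))
```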