Everything Comes in 3’s Angel Pizarro Director, ITMAT Bioinformatics Facility University of Pennsylvania School of Medicine

DESCRIPTION

A talk given at BioIT World conference 2010 Cloud Computing Workshop

Page 1: Everything comes in 3's

Everything Comes in 3’s

Angel Pizarro

Director, ITMAT Bioinformatics Facility

University of Pennsylvania School of Medicine

Page 2: Everything comes in 3's

Outline

• This talk looks at the practical aspects of Cloud Computing
– We will be diving into specific examples

• 3 pillars of systems design

• 3 storage implementations

• 3 areas of bioinformatics – And how they are affected by clouds

• 3 interesting internal projects

There are 2 hard problems in computer science: caching, naming, and off-by-1 errors

Page 3: Everything comes in 3's

Pillars of Systems Design

1. Provisioning
– API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
– Not discussing further, since this is the WHOLE POINT of cloud computing.

2. Configuration
– How to get a system up to the point you can do something with it

3. Command and Control
– How to tell the system what to do

Page 4: Everything comes in 3's

System Configuration with Chef

• Automatic installation of packages, service configuration and initialization

• Specifications use a real programming language with known behavior

• Brings the system to the specified state idempotently: repeated runs make no further changes

• http://opscode.com/chef/

http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
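The idea of converging a system to a declared state can be sketched in a few lines. This is a minimal Python illustration, not Chef's actual API: `SystemState`, `ensure_package`, `ensure_service_running`, and `converge` are hypothetical names, and the "machine" is only an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    """Hypothetical in-memory stand-in for a real machine."""
    packages: set = field(default_factory=set)
    running: set = field(default_factory=set)


def ensure_package(state: SystemState, name: str) -> bool:
    """Install `name` only if it is missing. Returns True when a change was made."""
    if name in state.packages:
        return False
    state.packages.add(name)
    return True


def ensure_service_running(state: SystemState, name: str) -> bool:
    """Start `name` only if it is not already running."""
    if name in state.running:
        return False
    state.running.add(name)
    return True


def converge(state: SystemState) -> int:
    """Apply the declared configuration; return how many changes were needed."""
    changes = [
        ensure_package(state, "rsync"),
        ensure_package(state, "heartbeat"),
        ensure_service_running(state, "heartbeat"),
    ]
    return sum(changes)


machine = SystemState()
first_run = converge(machine)   # fresh machine: every step makes a change
second_run = converge(machine)  # already converged: nothing left to do
print(first_run, second_run)
```

Each step checks the current state before acting, which is why the second run reports no changes.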

Page 5: Everything comes in 3's

Chef Recipes & Cookbooks

• The specification for installing and configuring a system component

• Able to support more than one platform
• Has access to system-wide information
– hostname, IP addr, RAM, # processors, etc.

• Contain templates, documentation, static files & assets

• Can define dependencies on other recipes
• Executed in order, execution stops at first failure
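The last two points (dependencies, then ordered execution that halts on the first failure) can be illustrated with a small Python sketch. It is not how Chef is implemented; the recipe names, the `depends`/`action` dictionary layout, and the helper functions are all made up for illustration.

```python
def resolve_order(recipes, run_list):
    """Expand a run list so each recipe's dependencies come before it."""
    ordered = []

    def visit(name, stack=()):
        if name in stack:
            raise ValueError(f"dependency cycle at {name}")
        if name in ordered:
            return
        for dep in recipes[name]["depends"]:
            visit(dep, stack + (name,))
        ordered.append(name)

    for name in run_list:
        visit(name)
    return ordered


def run(recipes, run_list):
    """Execute recipes in order; stop at the first one that raises."""
    completed = []
    for name in resolve_order(recipes, run_list):
        try:
            recipes[name]["action"]()
        except Exception as err:
            return completed, f"{name} failed: {err}"
        completed.append(name)
    return completed, None


def fail():
    raise RuntimeError("package not found")


# Hypothetical recipes: names, dependencies, and actions are illustrative only.
recipes = {
    "rsync": {"depends": [], "action": lambda: None},
    "heartbeat": {"depends": ["rsync"], "action": fail},
    "webserver": {"depends": ["heartbeat"], "action": lambda: None},
}

completed, error = run(recipes, ["webserver"])
print(completed, error)
```

Here `rsync` completes, `heartbeat` fails, and `webserver` is never attempted.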

Page 6: Everything comes in 3's

Simple Recipe : Rsync

• Install rsync to the system
• Metadata file states what platforms are supported
• Note that Chef is a Linux-centric system
• BUT, the WikiWiki is MessyMessy
– Look at Chef Solo & Resources

Page 7: Everything comes in 3's

More Complex Recipe: Heartbeat

• Installs heartbeat package

• Registers the service and specifies that it can be restarted and provides a status message

• Finally it starts the service

Page 8: Everything comes in 3's

Command and Control

• Traditional grid computing
– QSUB – SGE, PBS, Torque
– Usually requires tightly coupled and static systems
– Shared file systems, firewalls, user accounts, shared exe & lib locations
– Best for capability processes (e.g. MPI)

• Map-Reduce is the new hotness
– Best for data-parallel processes
– Assumes loosely coupled non-static components
– Job staging is a critical component

Page 9: Everything comes in 3's

Map Reduce in a Nutshell

• Algorithm pioneered by Google for distributed data analysis
– Data-parallel analysis fits well into this model
– Split data, work on each part in parallel, then merge results

• Hadoop, Disco, CloudCrowd, …
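The split / work-in-parallel / merge pattern can be sketched with the Python standard library alone. This is a toy illustration, not Hadoop or any of the frameworks named above: the reads are made-up data, the function names are hypothetical, and threads on one machine stand in for separate worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Split: deal the records into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_chunk(chunk):
    """Map: count the bases seen in one chunk, independently of the others."""
    counts = Counter()
    for seq in chunk:
        counts.update(seq)
    return counts


def reduce_counts(partials):
    """Reduce: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


reads = ["ACGT", "AACC", "GGTT", "ACAC", "TTTT"]  # made-up example data
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_chunk, split(reads, 3)))
base_counts = reduce_counts(partials)
print(dict(base_counts))
```

Because each chunk is processed without reference to the others, the map step can run on as many workers as there are chunks.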

Page 10: Everything comes in 3's

Serial Execution of Proteomics Search

Page 11: Everything comes in 3's

Parallel Proteomics Search

Page 12: Everything comes in 3's

Roll-Your-Own MR on AWS

• Define small scripts to
– Split a FASTA file
– Run a BLAT search
– The first script defines the inputs of the second

• Submit the input FASTA to S3
• Start a master node as the central communication hub
• Start slave nodes, configured to ask for work from master and save results back to S3
• Press “Play”
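A rough idea of what the splitting script might look like, and how its output defines the inputs of the search jobs, is sketched below in Python. The talk's actual scripts are not reproduced here; the function names, chunk file names, the `BlatSearch` job name, and the `genome.2bit` database are placeholders, and no BLAT process is actually launched.

```python
def read_fasta(text):
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def split_fasta(text, records_per_chunk):
    """Return FASTA strings, each holding at most records_per_chunk records."""
    records = read_fasta(text)
    chunks = []
    for start in range(0, len(records), records_per_chunk):
        part = records[start:start + records_per_chunk]
        chunks.append("".join(f">{h}\n{s}\n" for h, s in part))
    return chunks


def blat_job(chunk_name, database="genome.2bit"):
    """Describe the follow-up search job for one chunk (all names are placeholders)."""
    return {"class": "BlatSearch", "args": [chunk_name, database, chunk_name + ".psl"]}


example = ">read1\nACGT\nACGT\n>read2\nGGCC\n>read3\nTTAA\n"
chunks = split_fasta(example, records_per_chunk=2)
jobs = [blat_job(f"chunk_{i}.fa") for i in range(len(chunks))]
print(len(chunks), jobs[0])
```

In a real run each chunk would be written to S3 and each job description handed to the master's queue.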

Page 13: Everything comes in 3's

Workflow of Distributed BLAT

[Diagram: a PC, S3, one Master node and four Slave nodes. Labeled steps: Upload inputs; Boot master & slaves; Submit the BLAT job; Download results.]

Initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go.

Page 14: Everything comes in 3's

Master Node => Resque

• Background job processing framework developed by GitHub

• Jobs attached to a class from your application, stored as JSON

• Uses REDIS key-value store

• Simple front end for viewing job queue status and failed jobs

Resque can invoke any class that has a class method “perform()”

http://github.com/defunkt/resque
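Resque itself is a Ruby library backed by Redis; the Python sketch below only mimics the pattern described above (a job stored as JSON naming a class, and a worker that calls that class's `perform`). The in-memory `deque` stands in for Redis, and `BlatSearch`, `enqueue`, and `work_one` are hypothetical names, not Resque's API.

```python
import json
from collections import deque

queue = deque()  # stand-in for a Redis list; real Resque stores jobs in Redis
results = []


class BlatSearch:
    """Hypothetical job class; the worker only needs its `perform` class method."""

    @classmethod
    def perform(cls, query, database):
        results.append(f"searched {query} against {database}")


JOB_CLASSES = {"BlatSearch": BlatSearch}  # name -> class lookup used by the worker


def enqueue(job_class, *args):
    """Serialize the job as JSON: the class name plus its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, look up its class by name, and call perform with the args."""
    if not queue:
        return False
    payload = json.loads(queue.popleft())
    JOB_CLASSES[payload["class"]].perform(*payload["args"])
    return True


enqueue(BlatSearch, "chunk_0.fa", "genome.2bit")
enqueue(BlatSearch, "chunk_1.fa", "genome.2bit")
while work_one():
    pass
print(results)
```

Storing only a class name and plain arguments keeps the queue contents language-neutral and easy to inspect.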

Page 15: Everything comes in 3's

The scripts

Page 16: Everything comes in 3's

Storage in the Cloud : S3

• Permanent storage for your data

• Pay as you go for transmission and holding

• Eliminates backups
• Pretty good CDN
– Able to hook into better CDN SLA via CloudFront

• Can be slow at times
– Reports of 10 second delay, but average is 300ms response

[Diagram: Your Data stored in S3]

Page 17: Everything comes in 3's

S3 Costs

Usage rates                            Usage example
$0.15 per GB per month (storage)       1,690 GB
$0.10 per GB transferred IN            100 GB IN per month
$0.15 per GB transferred OUT           100 GB OUT per month
$0.01 per 1,000 PUT/POST requests      1,000,000 requests
$0.01 per 10,000 GET requests          1,000,000 requests

Total: $289.50 per month

$0.17 per GB per month

$2.06 per GB per year

$3,474.00 per 1690 GB per year
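The totals on this slide follow directly from the listed rates and usage. A short Python check, using only the slide's 2010 figures (the variable names are mine):

```python
# Rates and usage figures as listed on the slide (2010 prices).
storage_gb = 1690
storage_rate = 0.15                                  # $ per GB per month
transfer_in_gb, transfer_in_rate = 100, 0.10         # $ per GB
transfer_out_gb, transfer_out_rate = 100, 0.15       # $ per GB
put_requests, put_rate = 1_000_000, 0.01 / 1_000     # $ per request
get_requests, get_rate = 1_000_000, 0.01 / 10_000    # $ per request

monthly = (
    storage_gb * storage_rate
    + transfer_in_gb * transfer_in_rate
    + transfer_out_gb * transfer_out_rate
    + put_requests * put_rate
    + get_requests * get_rate
)
yearly = monthly * 12

print(f"per month:        ${monthly:,.2f}")
print(f"per GB per month: ${monthly / storage_gb:.2f}")
print(f"per GB per year:  ${yearly / storage_gb:.2f}")
print(f"per year:         ${yearly:,.2f}")
```

Storage dominates the bill: $253.50 of the $289.50 monthly total is the per-GB holding charge.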

Page 18: Everything comes in 3's

Storage 2: Distributed FS on EC2

• Hadoop HDFS, Gigaspaces, etc.

• Network latency may be an issue for traditional DFSs
– Gluster, GPFS, etc.

• Tighter integration with execution framework, better performance?

[Diagram: Your Data spread across the disks of multiple EC2 nodes]

Page 19: Everything comes in 3's

DFS on EC2 m1.xlarge Costs

Initial cost
$2,800.00     3-yr reserved instance fee

Usage costs
$0.24         per hour
24            hours / day
365           days / yr
3             yrs

$9,107.20     Total 3 yr cost
$3,035.73     cost 1690 GB per year*
$1.80         cost per GB per year*

* Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
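The reserved-instance figures can be reproduced as the one-time fee plus the hourly rate applied around the clock for three years. A small Python check using the slide's numbers (the function and variable names are mine):

```python
def reserved_cost(upfront_fee, hourly_rate, years):
    """Total cost of running one reserved instance 24 hours a day."""
    hours = 24 * 365 * years
    return upfront_fee + hourly_rate * hours


total_3yr = reserved_cost(upfront_fee=2800.00, hourly_rate=0.24, years=3)
per_year = total_3yr / 3
disk_gb = 1690  # storage figure the slide uses for one m1.xlarge
per_gb_year = per_year / disk_gb

print(f"${total_3yr:,.2f} total, ${per_year:,.2f} per year, ${per_gb_year:.2f} per GB per year")
```

As the footnote says, this covers a single unreplicated copy; keeping two or three replicas would multiply the per-GB figure accordingly.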

Page 20: Everything comes in 3's

Storage 3: Memory Grids

• “RAM is the new Disk”
• Application level RAM clustering
– Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces

• Performance for capability jobs?

[Diagram: Your Data held in the RAM of multiple EC2 nodes]

* There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads

Page 21: Everything comes in 3's

Memory Grid Cost

Initial cost
$9,800.00     3-yr reserved instance fee

Usage costs
$0.84         per hour
24            hours / day
365           days / yr
3             yrs

$31,875.20    Total 3 yr cost
$10,625.07    cost per yr
$155.34       cost per GB per year
$262,519.92   cost 1690 GB per yr

Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
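To see how far apart the three storage options are, the memory-grid numbers can be recomputed and compared with the per-GB figures from the earlier cost slides. One assumption is needed: the slide does not state the RAM per node, so the sketch uses 68.4 GB, the value implied by dividing $10,625.07 by $155.34.

```python
ram_gb_per_node = 68.4  # assumption: implied by $10,625.07 / $155.34 on the slide

total_3yr = 9800.00 + 0.84 * 24 * 365 * 3
per_year = total_3yr / 3
per_gb_year = per_year / ram_gb_per_node
cost_1690_gb = per_gb_year * 1690

# Per-GB-per-year figures quoted on the S3 and DFS cost slides.
s3_per_gb_year = 2.06
dfs_per_gb_year = 1.80

print(f"RAM: ${per_gb_year:,.2f} per GB per year, ${cost_1690_gb:,.2f} for 1690 GB")
print(f"about {per_gb_year / s3_per_gb_year:.0f}x S3 and {per_gb_year / dfs_per_gb_year:.0f}x DFS per GB")
```

Under that assumption, RAM costs roughly 75 to 86 times more per GB than the disk-based options.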

Page 22: Everything comes in 3's

Cloud Influence on Bioinformatics

• Computational Biology
– Algorithms will need to account for large I/O latency
– Statistical tests will need to account for incomplete information, or incremental results

• Software Engineering
– Built-for-the-cloud algorithms are popping up
    • CloudBurst is a featured example in AWS EMR!

• Application to Life Sciences
– Deploy ready-made images for use
    • Cycle Computing, ViPDAC, others soon to follow

Page 23: Everything comes in 3's

Algorithms need to be I/O centric

• Incur a slightly higher computational burden to reduce I/O across non-optimal networks

P. Balaji, W. Feng, H. Lin 2008

Page 24: Everything comes in 3's

Some Internal Projects

• Resource Manager
– Service for on-demand provisioning and release of EC2 nodes
– Utilizes Chef to define and apply roles (compute node, DB server, etc)
– Terminates idle compute nodes at 52 minutes

• Workflow Manager
– Defines and executes data analysis workflows
– Relies on RM to provision nodes
– Once appropriate worker nodes are available, acts as the central work queue

• RUM
– RNA-Seq Ultimate Mapper
– Map Reduce RNA-Seq analysis pipeline
– Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
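The slide does not explain the 52-minute rule. One plausible reading is that EC2 instances were billed by the full hour at the time, so an idle node is kept until late in the hour already paid for and released just before the next hour starts. A minimal Python sketch of that reading, with hypothetical node names and helper functions:

```python
TERMINATE_AT_MINUTE = 52  # from the slide; minutes into the current billed hour


def minutes_into_billing_hour(uptime_minutes):
    """Minutes elapsed in the instance-hour currently being paid for."""
    return uptime_minutes % 60


def should_terminate(uptime_minutes, is_idle):
    """Terminate an idle node only near the end of an hour that is already paid for."""
    return is_idle and minutes_into_billing_hour(uptime_minutes) >= TERMINATE_AT_MINUTE


# Hypothetical nodes: (name, uptime in minutes, idle?)
nodes = [
    ("worker-1", 30, True),
    ("worker-2", 53, True),
    ("worker-3", 115, False),
    ("worker-4", 172, True),
]
to_terminate = [name for name, uptime, idle in nodes if should_terminate(uptime, idle)]
print(to_terminate)
```

An idle node early in its hour stays available for new work at no extra cost, and a busy node is never terminated regardless of the clock.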

Page 25: Everything comes in 3's

Bowtie Alone

[Pie chart: Mapping Efficiency. Legend: Mapped, Ambiguous, Unmapped. Values shown: 74%, 8%, 18%]

[Pie chart: Mapping Breakdown. Legend: Unique Paired, Unique Single, Ambiguous. Values shown: 38.0%, 37.0%, 25.0%]

Page 26: Everything comes in 3's

RUM (Bowtie + BLAT + processing)

[Pie chart: Mapping Breakdown. Legend: Unique Paired, Unique Single, Ambiguous. Values shown: 70%, 16%, 14%]

[Pie chart: Mapping Efficiency. Legend: Mapped, Unmapped, Mapped Ambiguously. Values shown: 81%, 4%, 15%]

Significantly increases the confidence of your data

Page 27: Everything comes in 3's

RUM Costs

• Computational cost ~$100 - $200
– 6-8 hours per lane on m2.4xlarge ($2.40 / hour)

• Cost of reagents ~= $10,000

1% of total

Page 28: Everything comes in 3's

Acknowledgements

• Garret FitzGerald
• Ian Blair
• John Hogenesch
• Greg Grant
• Tilo Grosser

• NIH & UPENN for support

• My Team
– David Austin
– Andrew Brader
– Weichen Wu

Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s