Keynote given at BOSC, 2010. Does the hype surrounding clouds match the reality? Can we use them to solve the problems of provisioning IT services to support next-generation sequencing?
Clouds: All fluff and no substance?
Wellcome Trust Sanger Institute
[email_address]
Outline
- Background
- Cloud: Where are we at?
- Good Fit: Web services
- Bad Fit: HPTC compute
- Better fit...?
- Data management
- Collaboration
- Grids
The Sanger Institute
- Funded by the Wellcome Trust.
- 2nd largest research charity in the world.
- ~700 employees.
- Based at the Hinxton Genome Campus, Cambridge, UK.
- Large-scale genomic research.
- Sequenced 1/3 of the human genome (largest single contributor).
- We have active cancer, malaria, pathogen and genomic variation / human health studies.
- All data is made publicly available.
- Websites, FTP, direct database access, programmatic APIs.
DNA sequencing
Economic Trends:
- The cost of sequencing halves every 12 months.
- The Human Genome Project: 13 years, 23 labs, $500 million.
- A human genome today: 3 days, 1 machine, $10,000.
- Large centres are now doing studies with 1,000s and 10,000s of genomes.
- Changes in sequencing technology are going to continue this trend.
- Next-next generation sequencers are on their way; a $500 genome is probable within 5 years.
The scary graph (chart of sequencing output over time; annotations: instrument upgrades, peak yearly capillary sequencing)
Managing Growth
- We have exponential growth in storage and compute.
- Storage / compute doubles every 12 months.
- A gigabase of sequence is roughly a gigabyte of storage.
- 16 bytes per base for sequence data.
- Intermediate analysis typically needs 10x the disk space of the raw data.
- Moore's law will not save us.
- Transistor/disk density: T_d = 18 months. Sequencing cost: T_d = 12 months.
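A rough illustration of why those two doubling times diverge; the six-year horizon is just an example, not from the slides:

    # Compare growth over 6 years for the two doubling times quoted above.
    years = 6
    density_gain = 2 ** (years * 12 / 18)     # transistor/disk density, Td = 18 months
    sequencing_gain = 2 ** (years * 12 / 12)  # sequencing capacity per dollar, Td = 12 months
    print(density_gain, sequencing_gain)      # ~16x vs 64x: storage/compute falls behind ~4x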
Cloud: Where are we at?
What is cloud?
- Informatician's view:
- On demand, virtual machines.
- Root access, total ownership.
- Upper management view:
- Free compute we can use to solve all of the hard problems
thrown up by new sequencing.
- (8 cents/hour is almost free, right...?)
- Twatter/friendface use it, so it must be good.
Hype Cycle (graph, annotated: Awesome! / Just works... / Lost in the clouds... / Victory!)
Where are we?
Where are we?
- We currently have three areas of activity:
Ensembl
- Ensembl is a system for genome annotation.
- Data visualisation (Web Presence):
- www.ensembl.org
- Provides web / programmatic interfaces to genomic data.
- 10k visitors / 126k page views per day.
- Compute Pipeline (HPTC Workload):
- Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
- Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
- Software is Open Source (Apache license). Data is free for download.
- We have done cloud experiments with both the website and the pipeline.
Ensembl Website
Web Presence
- Ensembl has a worldwide audience.
- Historically, website performance was not great.
- Pages were quite heavyweight, not properly cached, etc.
- The web team spent a long time re-designing the code to make it more streamlined.
- Greatly improved performance.
- Coding can only get you so far.
- If we want the website to be responsive, we need low latency.
- A canna' change the laws of physics.
- We need a set of geographically dispersed mirrors.
uswest.ensembl.org
- Traditional mirror: real machines in a co-lo facility in California.
- Hardware was initially configured on site.
- 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management, etc.
- Shipped to the co-lo for installation.
- Sent a person to California for 3 weeks; spent 1 week getting stuff into/out of customs.
- Additional infrastructure work.
- Incredibly time consuming.
- We really don't want to end up having to send someone on a plane to the US to fix things.
Usage
- A Geo-IP database points people to the nearest mirror.
- US-West currently takes ~1/3 of total Ensembl web traffic.
- Latency down from XXXms to XXms.
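A minimal sketch of the Geo-IP idea, assuming the MaxMind geoip2 Python library and a hypothetical continent-to-mirror map; Ensembl's actual routing logic is not shown in the talk:

    # Route a client to the nearest Ensembl mirror by continent (illustrative only).
    import geoip2.database

    MIRRORS = {"NA": "uswest.ensembl.org",   # hypothetical continent -> mirror map
               "EU": "www.ensembl.org"}

    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

    def nearest_mirror(client_ip, default="www.ensembl.org"):
        continent = reader.country(client_ip).continent.code
        return MIRRORS.get(continent, default)

    print(nearest_mirror("128.32.0.1"))  # a US address resolves to uswest.ensembl.org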
Usage
What has this got to do with clouds?
useast.ensembl.org
- We want an east coast US mirror to complement our west coast mirror.
- Built the mirror in AWS.
- Initially a proof of concept / test-bed for virtual co-location.
- Plan for production real soon now.
57. Building a mirror on AWS
- No physical hardware.
- Work can start as soon as we enter our credit card
numbers...
- Some software development / sysadmin work needed.
- Preparation of OS images, software stack configuration.
- The west-coast mirror was built as an extension of the Sanger internal network via VPN; the AWS images are built as standalone systems.
- Significant amount of tuning required.
- Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1TB).
- Lots of people run Apache/MySQL on AWS, so there is a good amount of best practice available.
Does it work?
Is it cost effective?
- Lots of misleading cost statements are made about cloud.
- "Our analysis only cost $500." "$0.085 / hr."
- What are we comparing against?
- Doing the analysis once? Continually? Buying a $2,000 server? Leasing a $2,000 server for 3 years? Using $150 of time at your local supercomputing facility? Buying $2,000 of server but having to build a $1M datacentre to put it in?
- Requires the dreaded Total Cost of Ownership (TCO) calculation:
- hardware + power + cooling + facilities + admin/developers, etc.
Let's do it anyway...
- Comparing costs to the co-lo is simpler.
- Power and cooling costs are all included.
- Admin costs are the same, so we can ignore them.
- The same people are responsible for both.
- Cost for the co-location facility:
- $120,000 hardware + $51,000/yr colo = $91,000 per year (over a 3-year hardware lifetime).
- Cost for AWS:
- Result: estimated 16% cost saving.
- A good saving, but it is not free!
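A quick back-of-the-envelope check of the co-lo figure above; the AWS total is not broken out on the slide, so the second number is only what a 16% saving would imply:

    # Co-lo cost: hardware amortised over 3 years plus yearly co-lo fees.
    hardware, colo_per_year, lifetime_years = 120_000, 51_000, 3
    colo_yearly = hardware / lifetime_years + colo_per_year
    print(colo_yearly)               # 91000.0 per year
    print(colo_yearly * (1 - 0.16))  # ~76440 per year implied for AWS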
Additional Benefits
- No need to deal with real hardware.
- Faster implementation: no need to ship servers or deal with US customs.
- Free hardware upgrades.
- As faster machines become available we can take advantage of them immediately.
- No need to get tin decommissioned / re-installed at the co-lo.
- Website + code are packaged together.
- Can be conveniently given away to end users in a ready-to-run config.
- Simplifies configuration for other users wanting to run Ensembl sites.
- Configuring an Ensembl site is non-trivial for non-informaticians: CVS, MySQL setup, Apache configuration, etc.
Added benefits
Downsides
- Packaging OS images and code did take longer than expected.
- Most of the web-code refactoring to make it mirror-ready had been done for the initial real co-lo.
- This needs to be re-done for every Ensembl release.
- It is now part of the Ensembl software release process.
- Management overhead does not necessarily go down.
Going forward
- Expect mirror to go live later this year.
- Far-east Amazon availability zone is also of interest.
- Virtual Co-location concept will be useful for a number of
other projects.
- Disaster recovery.
- E.g. replicate critical databases / storage into AWS.
Hype Cycle: Web services
Ensembl Pipeline
- HPTC element of Ensembl.
- Takes raw genomes and lays annotation on top.
Compute Pipeline
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG
GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA
TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA
TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC
AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC
TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA
ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG
AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC
TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG
AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG
AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA
GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT
ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
Raw Sequence -> Something useful
Example annotation
Gene Finding: DNA, HMM prediction, alignment with known proteins, alignment with fragments recovered in vivo, alignment with other genes and other species.
Compute Pipeline
- Architecture:
- An OO Perl pipeline manager. Core algorithms are C. 200 auxiliary binaries.
- Workflow:
- The investigator describes the analysis at a high level.
- The pipeline manager splits the analysis into parallel chunks, sorts out the dependencies, and then submits jobs to a DRM (see the sketch after this list).
- Pipeline state and results are stored in a MySQL database.
- The workflow is embarrassingly parallel.
- Integer, not floating point. 64-bit memory addressing is nice, but not required.
- 64-bit file access is required.
- Single-threaded jobs. Very IO intensive.
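A schematic sketch of the split-and-submit pattern described above; it is not the actual Ensembl pipeline code, and the chunk size, queue name, and run_analysis.pl wrapper are illustrative assumptions:

    # Split an input set into chunks and submit one LSF job per chunk.
    import subprocess

    CHUNK_SIZE = 1000        # sequences per job (illustrative)
    QUEUE = "normal"         # assumed LSF queue name

    def submit_chunks(seq_ids):
        chunks = [seq_ids[i:i + CHUNK_SIZE] for i in range(0, len(seq_ids), CHUNK_SIZE)]
        for n, chunk in enumerate(chunks):
            cmd = ["bsub", "-q", QUEUE, "-J", "annotate_%d" % n,
                   "run_analysis.pl", "--ids", ",".join(chunk)]
            subprocess.run(cmd, check=True)   # hand the job to the DRM

    submit_chunks(["seq%d" % i for i in range(5000)])   # submits 5 jobs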
Running the pipeline in practice
- Requires a significant amount of domain knowledge.
- The software install is complicated.
- Lots of Perl modules and dependencies. Apache wrangling if you want to run a website.
- Needs a well-tuned compute cluster.
- The pipeline takes ~500 CPU days for a moderate genome.
- Ensembl chewed up 160k CPU days last year.
- The code is IO bound in a number of places. You typically need a high-performance filesystem.
- Lustre, GPFS, Isilon, Ibrix, etc.
- Needs a large MySQL database.
- 100GB-TB MySQL instances, with a very high query load generated from the cluster.
Why Cloud?
- Provides a good example for testing the HPTC capabilities of the cloud.
Why Cloud?
- Proof of concept:
- Is HPTC even possible on cloud infrastructures?
- Coping with the big increase in data:
- Will we be able to provision new machines / datacentre space to keep up?
- What happens if we need to out-source our compute?
- Can we be in a position to shift peaks of demand to cloud facilities?
Expanding markets
- There are going to be lots of new genomes that need annotating.
- Sequencers are moving into small labs and clinical settings.
- Limited informatics / systems experience.
- Typically postdocs/PhD students who have a real job to do.
- They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so.
- We have already done all the hard work of installing the software and tuning it.
- Can we package up the pipeline and put it in the cloud?
- Goal: the end user should simply be able to upload their data, insert their credit-card number, and press GO.
Porting HPTC code to the cloud
- Software stack / machine image:
- Creating images with the software is reasonably straightforward. No big surprises.
- Queuing system:
- The pipeline requires a queueing system (LSF/SGE).
- Getting them to run took a lot of fiddling.
- Machines need to find each other once they are inside the cloud.
- Building an automated, self-discovering cluster takes real work (one possible approach is sketched after this list).
- Hopefully others can re-use it.
- MySQL databases:
- Lots of best practice on how to do that on EC2.
- But it took time, even for experienced systems people.
- (You will not be firing your system administrators just yet!)
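A minimal sketch of one way nodes could find each other inside EC2, using an S3 object as a rendezvous point; this is an illustrative pattern (the bucket and key names are hypothetical), not the cluster-building code described in the talk:

    # Head node publishes its private IP to S3; worker nodes poll for it.
    import time, urllib.request
    import boto3

    BUCKET, KEY = "my-cluster-bucket", "head-node-ip"   # hypothetical names
    s3 = boto3.client("s3")

    def register_head():
        # The EC2 instance metadata service tells a node its own private IP.
        ip = urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/local-ipv4").read().decode()
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=ip)

    def find_head(retries=30):
        for _ in range(retries):
            try:
                return s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode()
            except s3.exceptions.NoSuchKey:
                time.sleep(10)   # head node has not registered yet
        raise RuntimeError("head node never appeared")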
The big problem...
- Data:
- Moving data into the cloud is hard.
- Doing stuff with data once it is in the cloud is also hard.
- If you look closely, most successful cloud projects have small amounts of data (10-100 Mbytes).
Moving data is hard
- Tools:
- Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks.
- WAN tools: gridFTP / FDT / Aspera.
- Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link; a quick arithmetic check follows after this list):
- Cambridge -> EC2 East coast: 12 Mbytes/s (96 Mbits/s); 23 hours to move 1 TB.
- Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s); 11 hours to move 1 TB.
- What speed should we get?
- Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
- Are our disks fast enough?
- Do you have fast enough disks at each end to keep the network full?
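The arithmetic behind the transfer times quoted above, taking 1 TB as 10^12 bytes:

    # Hours needed to move 1 TB at the measured gridFTP/FDT rates.
    TB = 1e12   # bytes
    for dest, mbytes_per_s in [("EC2 East coast", 12), ("EC2 Dublin", 25)]:
        hours = TB / (mbytes_per_s * 1e6) / 3600
        print(dest, round(hours))   # ~23 and ~11 hours respectively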
Networking
- How do we improve data transfers across the public internet?
- The CERN approach: don't. Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.
- Our collaborations are different.
- We have relatively short-lived and fluid collaborations (1-2 years, many institutions).
- As more labs get sequencers, our potential collaborators also increase.
- We need good connectivity to everywhere.
Moving data in the cloud
- Compute nodes need to be able to see the data.
- No viable global filesystems on EC2.
- NFS has poor scaling at the best of times.
- EC2 has poor inter-node networking: with >8 NFS clients, everything stops.
- The cloud way: store data in S3 (a minimal access sketch follows after this list).
- A web-based object store.
- Get, put, delete objects.
- Not POSIX.
- Code needs re-writing / forking.
- Limitations: cannot store objects > 5GB.
- Nasty hacks:
- Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3.
- Interesting performance, and you are paying by the hour...
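A minimal sketch of the get/put/delete object model, using the boto3 Python library with a hypothetical bucket name; the point is that S3 is an object API, not a POSIX filesystem, so existing file-based code cannot use it directly:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "ensembl-pipeline-data"   # hypothetical bucket

    # put: upload a whole object (no seek, no append, no partial writes)
    with open("chunk-0001.fa", "rb") as fh:
        s3.put_object(Bucket=BUCKET, Key="chunk-0001.fa", Body=fh)

    # get: read the whole object back
    data = s3.get_object(Bucket=BUCKET, Key="chunk-0001.fa")["Body"].read()

    # delete
    s3.delete_object(Bucket=BUCKET, Key="chunk-0001.fa")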
Compute architecture (diagram): a fat network with CPUs sharing a POSIX global filesystem and a batch scheduler, vs. a thin network with CPUs using local storage in front of a hadoop/S3 data store.
Elephant in the room
Why not use map-reduce?
- Re-writing apps to use S3 or hadoop/HDFS is a real hurdle.
- Nobody wants to re-write existing applications.
- They already work on our compute farm.
- Not an issue for new apps.
- But hadoop apps do not exist in isolation.
- The barrier to entry seems much lower for filesystems.
- We have a lot of non-expert users (this is a good thing).
- Am I being a reactionary old fart?
- 15 years ago, clusters of PCs were not real supercomputers... then Beowulf took over the world.
- Big difference: porting applications between the two
architectures was easy.
- Will the market provide traditional compute clusters in the
cloud?
Hype cycle: HPTC
Where are we?
- You cannot take an existing data-rich HPTC app and expect it to
work.
- IO architectures are too different.
- There is some re-factoring going on for the Ensembl pipeline.
- Currently on a case-by-case basis, for the less data-intensive parts.
Shared data archives
Past Collaborations (diagram): data flows from several sequencing centres into a single sequencing centre + DCC.
Future Collaborations (diagram): collaborations are short term (18 months-3 years); multiple sequencing centres sharing data via federated access.
International Cancer Genome Project
- Many cancer mutations are rare.
- Low signal-to-noise ratio.
- How do we find the rare but important mutations?
- Sequence lots of cancer genomes.
- International Cancer Genome Project.
- Consortia of sequencing and cancer research centres in 10
countries.
- Aim of the consortium:
- Complete genomic analysis of 50 different tumor types (50,000 genomes).
Genomics Data (diagram): data size per genome, ranging from unstructured flat files used by sequencing informatics specialists to structured databases used by clinical researchers and non-informaticians: intensities / raw data (2TB), sequence + quality data (500 GB), alignments (200 GB), variation data (1GB), individual features (3MB).
Sharing Unstructured data
- Large data volumes, flat files.
- Federated access.
- Data is not going to be in one place.
- A single institute will have data distributed for DR / worldwide access.
- Some parts of the data may be on cloud stores.
- Controlled access.
- Many archives will be public. Some will have patient-identifiable data. Plan for it now.
Dark Archives
- Storing data in an archive is not particularly useful.
- You need to be able to access the data and do something useful
with it.
- Data in current archives is dark.
- You can put/get data, but cannot compute across it.
- Is data in an inaccessible archive really useful?
Last week's bombshell
- We want to run our pipeline across 100TB of data currently in the EGA/SRA.
- We will need to de-stage the data to Sanger, and then run the compute.
- An extra 0.5 PB of storage, 1,000 cores of compute, a 3-month lead time, ~$1.5M capex.
Cloud / Computable archives
- Can we move the compute to the data?
- Upload the workload onto VMs.
- Put the VMs on compute that is attached to the data.
(diagram: VMs placed on CPUs that sit next to the data stores)
Practical Hurdles
Where does it live?
- Most of us are funded to hold data, not to fund everyone else's compute costs too.
- We now need to budget for raw compute power as well as disk.
- Implement virtualisation infrastructure, billing, etc.
- Are you legally allowed to charge?
- Who underwrites it if nobody actually uses your service?
- This strongly implies the data has to be held on a commercial provider.
- Amazon etc. already have billing infrastructures; why not use them?
- Directly exposed to costs.
- Is the service cost effective?
Identity management
- Which identity management system to use for controlled access?
- Culture shock.
- Lots of solutions:
- OpenID, Shibboleth (ASPIS), Globus/X.509, etc.
- What features are important?
- How much security? Single sign-on? Delegated authentication?
- Finding consensus will be hard.
Networking:
- We still need to get data in.
- Fixing the internet is not going to be cost effective for
us.
- Fixing the internet may be cost effective for big cloud
providers.
- It is core to their business model.
- All we need to do is get data into Amazon, and then everyone else can get the data from there.
- Do we invest in fast links to Amazon?
- It changes the business dynamic: we have effectively tied ourselves to a single provider.
Summary
Acknowledgements
- Phil Butcher
- ISG Team: James Beal, Gen-Tao Chiang, Pete Clapham, Simon Kelley
- 1k Genomes Project: Thomas Keane, Jim Stalker
- Cancer Genome Project: Adam Butler, John Teague
Backup