Keynote given at BOSC, 2010. Does the hype surrounding clouds match the reality? Can we use them to solve the problems of provisioning IT services to support next-generation sequencing?
Clouds: All fluff and no substance?
Wellcome Trust Sanger Institute
[email_address]
Outline
- Background
- Cloud: Where are we at?
- Good Fit: Web services
- Bad Fit: HPTC compute
- Better fit...?
- Data management
- Collaboration
- Grids
The Sanger Institute
- Funded by the Wellcome Trust.
- 2nd largest research charity in the world.
- ~700 employees.
- Based at the Hinxton Genome Campus, Cambridge, UK.
- Large-scale genomic research.
- Sequenced 1/3 of the human genome (largest single contributor).
- We have active cancer, malaria, pathogen and genomic variation / human health studies.
- All data is made publicly available.
- Websites, FTP, direct database access, programmatic APIs.
DNA sequencing
Economic Trends:
- The cost of sequencing halves every 12 months.
- The Human Genome Project: 13 years, 23 labs, $500 million.
- A human genome today: 3 days, 1 machine, $10,000.
- Large centres are now doing studies with 1,000s and 10,000s of genomes.
- Changes in sequencing technology are going to continue this trend.
- Next-next generation sequencers are on their way; a $500 genome is probable within 5 years.
The scary graph (chart of sequencing output over time; annotations: instrument upgrades, peak yearly capillary sequencing)
Managing Growth
- We have exponential growth in storage and compute.
- Storage / compute doubles every 12 months.
- A gigabase of sequence is roughly a gigabyte of storage.
- 16 bytes per base for sequence data.
- Intermediate analysis typically needs 10x the disk space of the raw data.
- Moore's law will not save us.
- Transistor/disk density: T_d = 18 months. Sequencing cost: T_d = 12 months.
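A rough illustration of why those two doubling times diverge; the six-year horizon is just an example, not from the slides:

    # Compare growth over 6 years for the two doubling times quoted above.
    years = 6
    density_gain = 2 ** (years * 12 / 18)     # transistor/disk density, Td = 18 months
    sequencing_gain = 2 ** (years * 12 / 12)  # sequencing capacity per dollar, Td = 12 months
    print(density_gain, sequencing_gain)      # ~16x vs 64x: storage/compute falls behind ~4x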
Cloud: Where are we at?
What is cloud?
- Informatician's view:
- On demand, virtual machines.
- Root access, total ownership.
- Upper management view:
- Free compute we can use to solve all of the hard problems
thrown up by new sequencing.
- (8 cents/hour is almost free, right...?)
- Twatter/friendface use it, so it must be good.
Hype Cycle (graph, annotated: Awesome! / Just works... / Lost in the clouds... / Victory!)
Where are we?
Where are we?
- We currently have three areas of activity:
Ensembl
- Ensembl is a system for genome annotation.
- Data visualisation (Web Presence):
- www.ensembl.org
- Provides web / programmatic interfaces to genomic data.
- 10k visitors / 126k page views per day.
- Compute Pipeline (HPTC Workload):
- Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
- Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
- Software is Open Source (Apache license). Data is free for download.
- We have done cloud experiments with both the website and the pipeline.
Ensembl Website
Web Presence
- Ensembl has a worldwide audience.
- Historically, website performance was not great.
- Pages were quite heavyweight, not properly cached, etc.
- The web team spent a long time re-designing the code to make it more streamlined.
- Greatly improved performance.
- Coding can only get you so far.
- If we want the website to be responsive, we need low latency.
- A canna' change the laws of physics.
- We need a set of geographically dispersed mirrors.
uswest.ensembl.org
- Traditional mirror: real machines in a co-lo facility in California.
- Hardware was initially configured on site.
- 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management, etc.
- Shipped to the co-lo for installation.
- Sent a person to California for 3 weeks; spent 1 week getting stuff into/out of customs.
- Additional infrastructure work.
- Incredibly time consuming.
- We really don't want to end up having to send someone on a plane to the US to fix things.
Usage
- A Geo-IP database points people to the nearest mirror.
- US-West currently takes ~1/3 of total Ensembl web traffic.
- Latency down from XXXms to XXms.
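A minimal sketch of the Geo-IP idea, assuming the MaxMind geoip2 Python library and a hypothetical continent-to-mirror map; Ensembl's actual routing logic is not shown in the talk:

    # Route a client to the nearest Ensembl mirror by continent (illustrative only).
    import geoip2.database

    MIRRORS = {"NA": "uswest.ensembl.org",   # hypothetical continent -> mirror map
               "EU": "www.ensembl.org"}

    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

    def nearest_mirror(client_ip, default="www.ensembl.org"):
        continent = reader.country(client_ip).continent.code
        return MIRRORS.get(continent, default)

    print(nearest_mirror("128.32.0.1"))  # a US address resolves to uswest.ensembl.org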
Usage
What has this got to do with clouds?
useast.ensembl.org
- We want an east coast US mirror to complement our west coast mirror.
- Built the mirror in AWS.
- Initially a proof of concept / test-bed for virtual co-location.
- Plan for production real soon now.
57. Building a mirror on AWS
- No physical hardware.
- Work can start as soon as we enter our credit card
numbers...
- Some software development / sysadmin work needed.
- Preparation of OS images, software stack configuration.
- The west-coast mirror was built as an extension of the Sanger internal network via VPN; the AWS images are built as standalone systems.
- Significant amount of tuning required.
- Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1TB).
- Lots of people run Apache/MySQL on AWS, so there is a good amount of best practice available.
Does it work?
Is it cost effective?
- Lots of misleading cost statements are made about cloud.
- "Our analysis only cost $500." "$0.085 / hr."
- What are we comparing against?
- Doing the analysis once? Continually? Buying a $2,000 server? Leasing a $2,000 server for 3 years? Using $150 of time at your local supercomputing facility? Buying $2,000 of server but having to build a $1M datacentre to put it in?
- Requires the dreaded Total Cost of Ownership (TCO) calculation:
- hardware + power + cooling + facilities + admin/developers, etc.
Let's do it anyway...
- Comparing costs to the co-lo is simpler.
- Power and cooling costs are all included.
- Admin costs are the same, so we can ignore them.
- The same people are responsible for both.
- Cost for the co-location facility:
- $120,000 hardware + $51,000/yr colo = $91,000 per year (over a 3-year hardware lifetime).
- Cost for AWS:
- Result: estimated 16% cost saving.
- A good saving, but it is not free!
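A quick back-of-the-envelope check of the co-lo figure above; the AWS total is not broken out on the slide, so the second number is only what a 16% saving would imply:

    # Co-lo cost: hardware amortised over 3 years plus yearly co-lo fees.
    hardware, colo_per_year, lifetime_years = 120_000, 51_000, 3
    colo_yearly = hardware / lifetime_years + colo_per_year
    print(colo_yearly)               # 91000.0 per year
    print(colo_yearly * (1 - 0.16))  # ~76440 per year implied for AWS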
Additional Benefits
- No need to deal with real hardware.
- Faster implementation: no need to ship servers or deal with US customs.
- Free hardware upgrades.
- As faster machines become available we can take advantage of them immediately.
- No need to get tin decommissioned / re-installed at the co-lo.
- Website + code are packaged together.
- Can be conveniently given away to end users in a ready-to-run config.
- Simplifies configuration for other users wanting to run Ensembl sites.
- Configuring an Ensembl site is non-trivial for non-informaticians: CVS, MySQL setup, Apache configuration, etc.
Added benefits
Downsides
- Packaging OS images and code did take longer than expected.
- Most of the web-code refactoring to make it mirror-ready had been done for the initial real co-lo.
- This needs to be re-done for every Ensembl release.
- It is now part of the Ensembl software release process.
- Management overhead does not necessarily go down.
Going forward
- Expect mirror to go live later this year.
- Far-east Amazon availability zone is also of interest.
- Virtual Co-location concept will be useful for a number of
other projects.
- Disaster recovery.
- E.g. replicate critical databases / storage into AWS.
Hype Cycle: Web services
Ensembl Pipeline
- HPTC element of Ensembl.
- Takes raw genomes and lays annotation on top.
Compute Pipeline
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG
GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA
TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA
TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC
AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC
TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA
ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG
AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC
TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG
AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG
AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA
GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT
ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
Raw Sequence -> Something useful
Example annotation
Gene Finding: DNA, HMM prediction, alignment with known proteins, alignment with fragments recovered in vivo, alignment with other genes and other species.
Compute Pipeline
- Architecture:
- An OO Perl pipeline manager. Core algorithms are C. 200 auxiliary binaries.
- Workflow:
- The investigator describes the analysis at a high level.
- The pipeline manager splits the analysis into parallel chunks, sorts out the dependencies, and then submits jobs to a DRM (see the sketch after this list).
- Pipeline state and results are stored in a MySQL database.
- The workflow is embarrassingly parallel.
- Integer, not floating point. 64-bit memory addressing is nice, but not required.
- 64-bit file access is required.
- Single-threaded jobs. Very IO intensive.
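A schematic sketch of the split-and-submit pattern described above; it is not the actual Ensembl pipeline code, and the chunk size, queue name, and run_analysis.pl wrapper are illustrative assumptions:

    # Split an input set into chunks and submit one LSF job per chunk.
    import subprocess

    CHUNK_SIZE = 1000        # sequences per job (illustrative)
    QUEUE = "normal"         # assumed LSF queue name

    def submit_chunks(seq_ids):
        chunks = [seq_ids[i:i + CHUNK_SIZE] for i in range(0, len(seq_ids), CHUNK_SIZE)]
        for n, chunk in enumerate(chunks):
            cmd = ["bsub", "-q", QUEUE, "-J", "annotate_%d" % n,
                   "run_analysis.pl", "--ids", ",".join(chunk)]
            subprocess.run(cmd, check=True)   # hand the job to the DRM

    submit_chunks(["seq%d" % i for i in range(5000)])   # submits 5 jobs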
Running the pipeline in practice
- Requires a significant amount of domain knowledge.
- The software install is complicated.
- Lots of Perl modules and dependencies. Apache wrangling if you want to run a website.
- Needs a well-tuned compute cluster.
- The pipeline takes ~500 CPU days for a moderate genome.
- Ensembl chewed up 160k CPU days last year.
- The code is IO bound in a number of places. You typically need a high-performance filesystem.
- Lustre, GPFS, Isilon, Ibrix, etc.
- Needs a large MySQL database.
- 100GB-TB MySQL instances, with a very high query load generated from the cluster.
Why Cloud?
- Provides a good example for testing the HPTC capabilities of the cloud.
Why Cloud?
- Proof of concept:
- Is HPTC even possible on cloud infrastructures?
- Coping with the big increase in data:
- Will we be able to provision new machines / datacentre space to keep up?
- What happens if we need to out-source our compute?
- Can we be in a position to shift peaks of demand to cloud facilities?
Expanding markets
- There are going to be lots of new genomes that need annotating.
- Sequencers are moving into small labs and clinical settings.
- Limited informatics / systems experience.
- Typically postdocs/PhD students who have a real job to do.
- They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so.
- We have already done all the hard work of installing the software and tuning it.
- Can we package up the pipeline and put it in the cloud?
- Goal: the end user should simply be able to upload their data, insert their credit-card number, and press GO.
Porting HPTC code to the cloud
- Software stack / machine image:
- Creating images with the software is reasonably straightforward. No big surprises.
- Queuing system:
- The pipeline requires a queueing system (LSF/SGE).
- Getting them to run took a lot of fiddling.
- Machines need to find each other once they are inside the cloud.
- Building an automated, self-discovering cluster takes real work (one possible approach is sketched after this list).
- Hopefully others can re-use it.
- MySQL databases:
- Lots of best practice on how to do that on EC2.
- But it took time, even for experienced systems people.
- (You will not be firing your system administrators just yet!)
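A minimal sketch of one way nodes could find each other inside EC2, using an S3 object as a rendezvous point; this is an illustrative pattern (the bucket and key names are hypothetical), not the cluster-building code described in the talk:

    # Head node publishes its private IP to S3; worker nodes poll for it.
    import time, urllib.request
    import boto3

    BUCKET, KEY = "my-cluster-bucket", "head-node-ip"   # hypothetical names
    s3 = boto3.client("s3")

    def register_head():
        # The EC2 instance metadata service tells a node its own private IP.
        ip = urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/local-ipv4").read().decode()
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=ip)

    def find_head(retries=30):
        for _ in range(retries):
            try:
                return s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode()
            except s3.exceptions.NoSuchKey:
                time.sleep(10)   # head node has not registered yet
        raise RuntimeError("head node never appeared")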
The big problem...
- Data:
- Moving data into the cloud is hard.
- Doing stuff with data once it is in the cloud is also hard.
- If you look closely, most successful cloud projects have small amounts of data (10-100 Mbytes).
Moving data is hard
- Tools:
- Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks.
- WAN tools: gridFTP / FDT / Aspera.
- Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link; a quick arithmetic check follows after this list):
- Cambridge -> EC2 East coast: 12 Mbytes/s (96 Mbits/s); 23 hours to move 1 TB.
- Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s); 11 hours to move 1 TB.
- What speed should we get?
- Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
- Are our disks fast enough?
- Do you have fast enough disks at each end to keep the network full?
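The arithmetic behind the transfer times quoted above, taking 1 TB as 10^12 bytes:

    # Hours needed to move 1 TB at the measured gridFTP/FDT rates.
    TB = 1e12   # bytes
    for dest, mbytes_per_s in [("EC2 East coast", 12), ("EC2 Dublin", 25)]:
        hours = TB / (mbytes_per_s * 1e6) / 3600
        print(dest, round(hours))   # ~23 and ~11 hours respectively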
Networking
- How do we improve data transfers across the public internet?
- The CERN approach: don't. Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.
- Our collaborations are different.
- We have relatively short-lived and fluid collaborations (1-2 years, many institutions).
- As more labs get sequencers, our potential collaborators also increase.
- We need good connectivity to everywhere.
Moving data in the cloud
- Compute nodes need to be able to see the data.
- No viable global filesystems on EC2.
- NFS has poor scaling at the best of times.
- EC2 has poor inter-node networking: with >8 NFS clients, everything stops.
- The cloud way: store data in S3 (a minimal access sketch follows after this list).
- A web-based object store.
- Get, put, delete objects.
- Not POSIX.
- Code needs re-writing / forking.
- Limitations: cannot store objects > 5GB.
- Nasty hacks:
- Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3.
- Interesting performance, and you are paying by the hour...
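A minimal sketch of the get/put/delete object model, using the boto3 Python library with a hypothetical bucket name; the point is that S3 is an object API, not a POSIX filesystem, so existing file-based code cannot use it directly:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "ensembl-pipeline-data"   # hypothetical bucket

    # put: upload a whole object (no seek, no append, no partial writes)
    with open("chunk-0001.fa", "rb") as fh:
        s3.put_object(Bucket=BUCKET, Key="chunk-0001.fa", Body=fh)

    # get: read the whole object back
    data = s3.get_object(Bucket=BUCKET, Key="chunk-0001.fa")["Body"].read()

    # delete
    s3.delete_object(Bucket=BUCKET, Key="chunk-0001.fa")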
Compute architecture (diagram): a fat network with CPUs sharing a POSIX global filesystem and a batch scheduler, vs. a thin network with CPUs using local storage in front of a hadoop/S3 data store.
Elephant in the room
Why not use map-reduce?
- Re-writing apps to use S3 or hadoop/HDFS is a real hurdle.
- Nobody wants to re-write existing applications.
- They already work on our compute farm.
- Not an issue for new apps.
- But hadoop apps do not exist in isolation.
- The barrier to entry seems much lower for filesystems.
- We have a lot of non-expert users (this is a good thing).
- Am I being a reactionary old fart?
- 15 years ago, clusters of PCs were not real supercomputers... then Beowulf took over the world.
- Big difference: porting applications between the two
architectures was easy.
- Will the market provide traditional compute clusters in the
cloud?
Hype cycle: HPTC
Where are we?
- You cannot take an existing data-rich HPTC app and expect it to
work.
- IO architectures are too different.
- There is some re-factoring going on for the Ensembl pipeline.
- Currently on a case-by-case basis, for the less data-intensive parts.
Shared data archives
Past Collaborations (diagram): data flows from several sequencing centres into a single sequencing centre + DCC.
Future Collaborations (diagram): collaborations are short term (18 months-3 years); multiple sequencing centres sharing data via federated access.
International Cancer Genome Project
- Many cancer mutations are rare.
- Low signal-to-noise ratio.
- How do we find the rare but important mutations?
- Sequence lots of cancer genomes.
- International Cancer Genome Project.
- Consortia of sequencing and cancer research centres in 10
countries.
- Aim of the consortium:
- Complete genomic analysis of 50 different tumor types (50,000 genomes).
Genomics Data (diagram): data size per genome, ranging from unstructured flat files used by sequencing informatics specialists to structured databases used by clinical researchers and non-informaticians: intensities / raw data (2TB), sequence + quality data (500 GB), alignments (200 GB), variation data (1GB), individual features (3MB).
Sharing Unstructured data
- Large data volumes, flat files.
- Federated access.
- Data is not going to be in one place.
- A single institute will have data distributed for DR / worldwide access.
- Some parts of the data may be on cloud stores.
- Controlled access.
- Many archives will be public. Some will have patient-identifiable data. Plan for it now.
Dark Archives
- Storing data in an archive is not particularly useful.
- You need to be able to access the data and do something useful
with it.
- Data in current archives is dark.
- You can put/get data, but cannot compute across it.
- Is data in an inaccessible archive really useful?
Last week's bombshell
- We want to run our pipeline across 100TB of data currently in the EGA/SRA.
- We will need to de-stage the data to Sanger, and then run the compute.
- An extra 0.5 PB of storage, 1,000 cores of compute, a 3-month lead time, ~$1.5M capex.
Cloud / Computable archives
- Can we move the compute to the data?
- Upload the workload onto VMs.
- Put the VMs on compute that is attached to the data.
(diagram: VMs placed on CPUs that sit next to the data stores)
Practical Hurdles
Where does it live?
- Most of us are funded to hold data, not to fund everyone else's compute costs too.
- We now need to budget for raw compute power as well as disk.
- Implement virtualisation infrastructure, billing, etc.
- Are you legally allowed to charge?
- Who underwrites it if nobody actually uses your service?
- This strongly implies the data has to be held on a commercial provider.
- Amazon etc. already have billing infrastructures; why not use them?
- Directly exposed to costs.
- Is the service cost effective?
Identity management
- Which identity management system to use for controlled access?
- Culture shock.
- Lots of solutions:
- OpenID, Shibboleth (ASPIS), Globus/X.509, etc.
- What features are important?
- How much security? Single sign-on? Delegated authentication?
- Finding consensus will be hard.
Networking:
- We still need to get data in.
- Fixing the internet is not going to be cost effective for
us.
- Fixing the internet may be cost effective for big cloud
providers.
- It is core to their business model.
- All we need to do is get data into Amazon, and then everyone else can get the data from there.
- Do we invest in fast links to Amazon?
- It changes the business dynamic: we have effectively tied ourselves to a single provider.
Summary
Acknowledgements
- Phil Butcher
- ISG Team: James Beal, Gen-Tao Chiang, Pete Clapham, Simon Kelley
- 1k Genomes Project: Thomas Keane, Jim Stalker
- Cancer Genome Project: Adam Butler, John Teague
Backup