The Pulse of Cloud Computing
with Bioinformatics as an example
Nuwan Goonasekera†, Enis Afgan*
† University of Melbourne, Melbourne Bioinformatics, Australia
* Johns Hopkins University, Taylor Lab, USA
@ University of Colombo
Feb 2017
Overview
• The key characteristics of Cloud Computing
• Using Cloud Computing for bioinformatics
Source: http://dilbert.com/strips/comic/2012-05-25/
Data center use before cloud computing
source: http://www.rackspace.com/knowledge_center/whitepaper/revolution-not-evolution-how-cloud-computing-differs-from-traditional-it-and-why-it
Cloud Computing: A Definition
• NIST definition: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
» National Institute of Standards and Technology (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)
The Cloud Model
Deployment Models: Private, Community, Public, Hybrid
Delivery Models:
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
Essential Characteristics:
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
Delivery Models
source: http://www.businessinsider.com.au/10-most-important-in-cloud-computing-2013-4?op=1#a-word-about-clouds-1
Infrastructure-as-a-Service (IaaS)
• Amazon Web Services (Market leader)
• Rackspace Cloud
• NeCTAR/OpenStack Research Cloud
• Joyent Cloud
• GoGrid
• FlexiScale
Public PaaS Examples
| Cloud Name | Language and Developer Tools | Programming Models Supported by Provider | Target Applications and Storage Options |
|---|---|---|---|
| Google App Engine | Python, Java, Go, PHP + JVM languages (Scala, Groovy, JRuby) | MapReduce, Web, DataStore, Storage and other APIs | Web applications and BigTable storage |
| Salesforce.com’s Force.com | Apex, Eclipse-based IDE, web-based wizard | Workflow, Excel-like formula, web programming | Business applications such as CRM |
| Microsoft Azure | .NET, Visual Studio, Azure tools | Unrestricted model | Enterprise and web apps |
| Amazon Elastic MapReduce | Hive, Pig, Java, Ruby etc. | MapReduce | Data processing and e-commerce |
| Aneka | .NET, stand-alone SDK | Threads, task, MapReduce | .NET enterprise applications, HPC |
Public SaaS examples
• Gmail
• SharePoint
• Salesforce.com CRM
• OnLive
• Gaikai
• Microsoft Office 365
• Some definitions include services that do not require payment, e.g. ad-supported sites
Things we find most interesting
• Accessibility
• Infrastructure as code
• Elasticity
• Programming models that fit the cloud
Accessibility
● Global availability via public clouds
● On-demand self-service
● A platform for democratisation of computing
● Access is enabled via point-and-click interfaces (blends with the Internet)
Elasticity
• Rapidly expand and shrink based on demand
• “Infinite” scaling
• Cost-driven architecture
• Ties in with infrastructure-as-code
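As a minimal sketch of "expand and shrink based on demand", the Python snippet below picks a worker count from the current job queue length. The function name, the jobs-per-worker ratio, and the min/max bounds are illustrative assumptions, not part of any cloud provider's API.

```python
import math


def desired_workers(queued_jobs, jobs_per_worker=4, min_workers=1, max_workers=20):
    """Return how many workers to run for the current queue length.

    All parameters are hypothetical tuning knobs chosen for illustration.
    """
    if queued_jobs < 0:
        raise ValueError("queued_jobs must be non-negative")
    # Enough workers to cover the queue, rounded up.
    needed = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp: keep a minimum footprint and cap spending at max_workers.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for queue in (0, 3, 10, 500):
        print(queue, "queued ->", desired_workers(queue), "workers")
```

The upper bound is where the "cost-driven architecture" point shows up: scaling is only "infinite" up to what the budget allows.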
Programming models that fit the cloud
• Fault-tolerant models
• Massively scalable
• Distributed algorithms
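One way to picture these properties together is a MapReduce-style job: independent map tasks that can run anywhere, a retry when a task fails, and a reduce step that merges partial results. The sketch below is a single-process toy under those assumptions; the function names are hypothetical and the "failure" is simulated with a RuntimeError rather than a real node loss.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count bases in one chunk, independently of all other chunks."""
    return Counter(chunk)


def run_with_retry(task, arg, attempts=3):
    """Re-run a failed task; a stand-in for rescheduling work on another node."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except RuntimeError:
            # Give up only after the final attempt has failed.
            if attempt == attempts - 1:
                raise


def count_bases(chunks, task=map_chunk):
    """Run the map step per chunk, then reduce the partial counts into one total."""
    partials = [run_with_retry(task, chunk) for chunk in chunks]
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    print(count_bases(["ACGT", "AACC"]))
```

Because each chunk is processed without reference to the others, adding machines only changes where `map_chunk` runs, not the result.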
Bioinformatics
A multi-disciplinary science using computers for acquiring, managing and analyzing biological data.
It is a data-driven science.
It is a tool for genomics research.
[Diagram labels: Biology, Medicine, Math & Physics, Computer Science, Bioinformatics]
Genomics
Oxford dictionaries
“The branch of molecular biology concerned with the structure, function, evolution, and mapping of genomes.”
Where are the genes and other interesting pieces?
How do sequences change over evolutionary time?
What does all the DNA do?
What are the physical shapes of the genome and its products?
Genomics: contrast with biology and genetics
| | Biology and genetics | Genomics |
|---|---|---|
| scope | Targeted studies of one or a few genes | Studies considering all genes in a genome |
| technology | Targeted, low-throughput experiments | Global, high-throughput experiments |
| hard part | Clever experimental design, painstaking experimentation | Tons of data, uncertainty, computation |
* Everything on this slide is a generalization
Where is genomics used?
Basic science
● What is the DNA sequence of the genome?
● Where are the genes?
● What does all the DNA in the genome do?
● How did history shape our ethnicities and populations?
Medicine
● What’s the difference between DNA in a tumor vs DNA in healthy tissue?
● Can genomic data help predict what drugs might be appropriate for:
  ○ a particular cancer patient?
  ○ a particular genetic disorder?
● Can genomic data help us predict what flu strains will prevail next year?
Genome
Oxford dictionaries
“The complete set of genes or genetic material present in a cell or organism.”
“Blueprint” or “recipe” of life.
Self-copying store of read-only information about how to develop and maintain an organism.
Where do genomes live?
Nearly all of the trillions of cells in a person have the same genomic DNA in the nucleus.
Picture source: https://publications.nigms.nih.gov/insidethecell/preface.html
Genome
How do we obtain genome data? Sequencing!
The first methods, called Sanger sequencing, were developed in the mid-1970s.
The international Human Genome Project, launched in 1990, took 13 years to sequence the human genome.
In the 2000s, massively parallel Next Generation Sequencers (NGS) were developed that could sequence a human genome in days at a much lower cost.
Today, nanopore sequencers are emerging, offering real-time sequencing.
There are many public data repositories with free access to data (e.g., TCGA, 1000 genomes, GenBank).
Genome variation
Two unrelated humans have genomes that are ~99.8% similar by sequence. There are about 3-4 million differences. Most are small, e.g. Single Nucleotide Polymorphisms (SNPs).
Human and chimpanzee genomes are about 96% similar.
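To make "similar by sequence" concrete, here is a toy Python sketch that finds single-base differences between two sequences and reports percent identity. It assumes the sequences are already aligned and of equal length; real genome comparison needs an alignment step first, and the function names and example sequences are made up for illustration.

```python
def snp_positions(seq_a, seq_b):
    """Return 0-based positions where two equal-length, pre-aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which the two sequences carry the same base."""
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = len(snp_positions(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)


if __name__ == "__main__":
    person_1 = "ACGTACGTAC"  # made-up 10-base example
    person_2 = "ACGTACCTAC"  # differs from person_1 at one position
    print(snp_positions(person_1, person_2))
    print(percent_identity(person_1, person_2))
```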
Making sense of the data through data manipulations
Apply data transformations to extract useful information
This is not always a well-defined process
This is typically done with existing tools, or by developing one’s own
Tools can be chained into workflows
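The idea of chaining tools into a workflow can be sketched in a few lines of Python: each "tool" is a function from data to data, and the workflow feeds the output of one step into the next. The three tools below are hypothetical toys, not real bioinformatics software, and this is not how any particular workflow system is implemented.

```python
from functools import reduce


def to_upper(reads):
    """Toy tool: normalise reads to upper case."""
    return [r.upper() for r in reads]


def drop_unknown(reads):
    """Toy tool: discard reads that contain the unknown base 'N'."""
    return [r for r in reads if "N" not in r]


def gc_content(reads):
    """Toy tool: fraction of G/C bases in each (non-empty) read."""
    return [(r.count("G") + r.count("C")) / len(r) for r in reads]


def run_workflow(data, steps):
    """Apply each step in order, passing the previous output along."""
    return reduce(lambda current, step: step(current), steps, data)


if __name__ == "__main__":
    raw_reads = ["acgt", "ggNc", "GGCC"]
    print(run_workflow(raw_reads, [to_upper, drop_unknown, gc_content]))
```

Keeping each tool independent of the others is what lets a workflow be rearranged, or a single step swapped out, without touching the rest.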
Real-world infrastructure requirements
Some computers + reliable persistent data storage + bioinformatics tools + reference data + workflow system
[Figure labels: Raw data, Results, Indexed genomes; data sizes: 100-1000's GB, few GB, 10-100's GB; timeline: Aug, Sep, Oct, Nov, ...]
Galaxy: a data analysis and integration tool
A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
Three ways to use Galaxy
1. Download and run locally
2. Public website (http://usegalaxy.org)
3. Run on the Cloud
Bringing cloud resources to genomics
Cloud resources need to be provisioned and configured for use in genomics.
CloudMan: a cloud manager that orchestrates all of the steps required to provision, manage, and share a compute platform on a cloud infrastructure, all through a web browser.
Manage it programmatically
Create a new CloudMan compute cluster
Manage an existing CloudMan instance
Architectural stack
CloudLaunch.usegalaxy.org
CLOUD APPS
CloudBridge
CloudMan
CloudBridge: cloudbridge.readthedocs.org, github.com/gvlproject/cloudbridge
CloudLaunch: beta.launch.usegalaxy.org, github.com/galaxyproject/cloudlaunch-ui, github.com/galaxyproject/cloudlaunch
CloudMan: wiki.galaxyproject.org/CloudMan, github.com/galaxyproject/cloudman
Everything talked about here is an effort from a large community!
Come talk to us; get involved.