The Pulse of Cloud Computing with Bioinformatics as an example

Nuwan Goonasekera†, Enis Afgan*

† University of Melbourne, Melbourne Bioinformatics, Australia
* Johns Hopkins University, Taylor Lab, USA

@ University of Colombo, Feb 2017



The answer to everything?

Overview

• The key characteristics of Cloud Computing
• Using Cloud Computing for bioinformatics

Source: http://dilbert.com/strips/comic/2012-05-25/

A modern data-center

Source: http://www.businessinsider.com/google-data-centers-2014-10?op=1

Data center use before cloud computing

source: http://www.rackspace.com/knowledge_center/whitepaper/revolution-not-evolution-how-cloud-computing-differs-from-traditional-it-and-why-it

Cloud Computing: A Definition

• NIST definition: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

» National Institute of Standards and Technology (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)

The Cloud Model

Deployment Models

• Private
• Community
• Public
• Hybrid

Delivery Models

• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)

Essential Characteristics

• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

Delivery Models

source: http://www.businessinsider.com.au/10-most-important-in-cloud-computing-2013-4?op=1#a-word-about-clouds-1

Infrastructure-as-a-Service (IaaS)

• Amazon Web Services (market leader)
• Rackspace Cloud
• NeCTAR/OpenStack Research Cloud
• Joyent Cloud
• GoGrid
• FlexiScale

Public PaaS Examples

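Google App Engine

• Language and developer tools: Python, Java, Go, PHP + JVM languages (Scala, Groovy, JRuby)
• Programming models supported by provider: MapReduce, Web, DataStore, Storage and other APIs
• Target applications and storage options: Web applications and BigTable storage

Salesforce.com’s Force.com

• Language and developer tools: Apex, Eclipse-based IDE, web-based wizard
• Programming models supported by provider: Workflow, Excel-like formula, web programming
• Target applications and storage options: Business applications such as CRM

Microsoft Azure

• Language and developer tools: .NET, Visual Studio, Azure tools
• Programming models supported by provider: Unrestricted model
• Target applications and storage options: Enterprise and web apps

Amazon Elastic MapReduce

• Language and developer tools: Hive, Pig, Java, Ruby etc.
• Programming models supported by provider: MapReduce
• Target applications and storage options: Data processing and e-commerce

Aneka

• Language and developer tools: .NET, stand-alone SDK
• Programming models supported by provider: Threads, task, MapReduce
• Target applications and storage options: .NET enterprise applications, HPC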

Public SaaS examples

• Gmail
• SharePoint
• Salesforce.com CRM
• OnLive
• Gaikai
• Microsoft Office 365
• Some definitions also include services that do not require payment, e.g. ad-supported sites.

Things we find most interesting

• Accessibility
• Infrastructure as code
• Elasticity
• Programming models that fit the cloud

Accessibility

● Global availability via public clouds
● On-demand self-service
● A platform for democratisation of computing
● Access is enabled via point-and-click interfaces (blends with the Internet)

Infrastructure as Code

• Programmable
• Captures knowledge
• DevOps
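To make "programmable" and "captures knowledge" concrete, here is a minimal sketch of the idea: the desired infrastructure is written down as data, and a small function works out what must change to reach it. The `ServerSpec` schema, the group names, and the flavor strings are all hypothetical and do not correspond to any real provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Desired state for one group of identical servers (hypothetical schema)."""
    name: str
    flavor: str
    count: int


def plan(desired, running):
    """Return (name, delta) actions that turn `running` into `desired`.

    `running` maps a group name to the number of servers currently up.
    A positive delta means "start this many", a negative delta means "stop".
    """
    wanted = {spec.name: spec.count for spec in desired}
    actions = []
    for name in sorted(set(wanted) | set(running)):
        delta = wanted.get(name, 0) - running.get(name, 0)
        if delta != 0:
            actions.append((name, delta))
    return actions


# The description of the cluster lives in version control like any other code.
# Group names and flavors below are illustrative assumptions.
cluster = [
    ServerSpec("head", "m1.large", 1),
    ServerSpec("worker", "m1.medium", 4),
]
```

Calling `plan(cluster, {"head": 1, "worker": 1})` yields the actions needed to bring the workers up to four. Because the description is plain data, the same file can rebuild the environment later or on another cloud.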

Elasticity

• Rapidly expand and shrink based on demand
• “Infinite” scaling
• Cost-driven architecture
• Ties in with infrastructure-as-code
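A rough sketch of what expanding and shrinking on demand can look like in code, assuming a job queue drives the decision. The function names, the default limits, and the price used in the usage note are illustrative assumptions, not figures from any provider.

```python
import math


def workers_needed(queued_jobs, jobs_per_worker, min_workers=0, max_workers=20):
    """Number of workers to run for the current queue length.

    The result is clamped to [min_workers, max_workers]; the upper bound is
    what keeps "infinite" scaling from becoming an unbounded bill.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


def hourly_cost(workers, price_per_worker_hour):
    """Cost of running `workers` machines for one hour."""
    return workers * price_per_worker_hour
```

With nine queued jobs and four jobs per worker, `workers_needed(9, 4)` asks for three workers; at an assumed price of 0.5 per worker-hour that is `hourly_cost(3, 0.5)`, and the cluster can drop back to zero when the queue empties.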

Programming models that fit the cloud

• Fault-tolerant models
• Massively scalable
• Distributed algorithms
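As one illustration of such a model, the sketch below follows the MapReduce pattern on a toy bioinformatics task: counting k-mers (length-k substrings) across sequencing reads. Each read is an independent task, so a failed task can simply be re-run. Everything here runs sequentially in one process; a real framework would distribute the map tasks across machines. The helper names are hypothetical.

```python
from collections import Counter
from functools import reduce


def count_kmers(read, k=3):
    """Map step: count every length-k substring of one read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def run_with_retry(task, arg, attempts=3):
    """Re-run a task that raised RuntimeError, up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == attempts - 1:
                raise


def map_reduce(reads, task=count_kmers):
    """Apply `task` to every read, then merge the partial counts."""
    partials = [run_with_retry(task, read) for read in reads]
    # Reduce step: Counter addition sums the counts key by key.
    return reduce(lambda a, b: a + b, partials, Counter())
```

Because the map tasks share no state, retrying one of them cannot corrupt the others, which is what makes this style tolerant of individual worker failures.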

Cloud computing is a valuable resource - but what do we use it for?

Bioinformatics

A multi-disciplinary science using computers for acquiring, managing and analyzing biological data.

It is a data-driven science.

It is a tool for genomics research.

[Diagram: bioinformatics at the intersection of biology, medicine, math & physics, and computer science]

Genomics

Oxford dictionaries

“The branch of molecular biology concerned with the structure, function, evolution, and mapping of genomes.”

Where are the genes and other interesting pieces?

How do sequences change over evolutionary time?

What does all the DNA do?

What are the physical shapes of the genome and its products?

Genomics: contrast with biology and genetics

Biology and genetics

• Scope: targeted studies of one or a few genes
• Technology: targeted, low-throughput experiments
• Hard part: clever experimental design, painstaking experimentation

Genomics

• Scope: studies considering all genes in a genome
• Technology: global, high-throughput experiments
• Hard part: tons of data, uncertainty, computation

* Everything on this slide is a generalization

Where is genomics used?

Basic science

● What is the DNA sequence of the genome?
● Where are the genes?
● What does all the DNA in the genome do?
● How did history shape our ethnicities and populations?

Medicine

● What’s the difference between DNA in a tumor vs DNA in healthy tissue?
● Can genomic data help predict what drugs might be appropriate for:
  ○ a particular cancer patient?
  ○ a particular genetic disorder?
● Can genomic data help us predict what flu strains will prevail next year?

Genome

Oxford dictionaries

“The complete set of genes or genetic material present in a cell or organism.”

“Blueprint” or “recipe” of life.

Self-copying store of read-only information about how to develop and maintain an organism.

Where do genomes live?

All the trillions of cells in a person have the same genomic DNA in the nucleus.

Picture source: https://publications.nigms.nih.gov/insidethecell/preface.html

Genome

How do we obtain genome data? Sequencing!

The first methods, called Sanger sequencing, were developed in the mid-1970s.

Launched in 1990, the international Human Genome Project took 13 years to sequence the human genome.

In the 2000s, massively parallel Next Generation Sequencers (NGS) were developed that could sequence a human genome in days at a much lower cost.

Today, nanopore sequencers are emerging, offering real-time sequencing.

There are many public data repositories with free access to data (e.g., TCGA, 1000 genomes, GenBank).

Genome variation

Two unrelated humans have genomes that are ~99.8% similar by sequence. There are about 3-4 million differences. Most are small, e.g. Single Nucleotide Polymorphisms (SNPs).

Human and chimpanzee genomes are about 96% similar.
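To show what "similar by sequence" and a single-nucleotide difference mean in practice, here is a toy sketch that compares two already-aligned sequences position by position. The function names and the sequences in the usage note are made up for illustration. Real variant calling works on sequencing reads aligned to a reference and also has to handle insertions, deletions, and sequencing errors.

```python
def snp_positions(seq_a, seq_b):
    """Return the 0-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which the two sequences agree."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = len(snp_positions(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)
```

For the toy pair `"ACGTACGT"` and `"ACGAACGT"`, `snp_positions` reports one difference at position 3 and `percent_identity` returns 87.5. Whole genomes apply the same idea over billions of positions.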

Making sense of the data through data manipulations

Apply data transformations to extract useful information

This is not always a well-defined process

This is typically done with existing tools, or by developing one’s own

Tools can be chained into workflows
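Chaining tools into a workflow can be sketched as passing the output of each step to the next. The three steps below are toy stand-ins written for this example, not real bioinformatics tools, and the runner is a simplified assumption of what a workflow system does.

```python
from functools import reduce


def trim(reads, min_length=4):
    """Drop reads that are too short to be useful."""
    return [r for r in reads if len(r) >= min_length]


def to_upper(reads):
    """Normalise base letters to upper case."""
    return [r.upper() for r in reads]


def gc_content(reads):
    """Fraction of G and C bases in each read."""
    return [(r.count("G") + r.count("C")) / len(r) for r in reads]


def run_workflow(steps, data):
    """Feed `data` through each step in order and return the final result."""
    return reduce(lambda current, step: step(current), steps, data)


# A workflow is just an ordered list of tools.
workflow = [trim, to_upper, gc_content]
```

Running `run_workflow(workflow, ["acgt", "ac", "GGCC"])` drops the short read, normalises case, and reports the GC fraction of the two remaining reads. Because the workflow is an ordinary list, it can be saved, shared, and re-run on new data.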

What does all of this have to do with Cloud Computing?

omicsmaps.com

World’s clouds

bit.ly/worldclouds

Typical genomics flow

[Diagram: raw data (100s to 1000s of GB) and external reference data feed into data analysis, which produces results (a few GB)]

Real-world infrastructure requirements

Some computers + reliable persistent data storage + bioinformatics tools + reference data + workflow system

[Diagram: raw data (100s to 1000s of GB), indexed genomes (10s to 100s of GB), results (a few GB); time axis: Aug, Sep, Oct, Nov, ...]

Galaxy: accessible analysis system

A data analysis and integration tool

A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

Open source software that makes integrating your own tools and data and customizing for your own site simple

Three ways to use Galaxy

1. Download and run locally

2. Public website (http://usegalaxy.org)

3. Run on the Cloud

Bringing cloud resources to genomics

Cloud resources need to be provisioned and configured for use in genomics.

A Cloud Manager that orchestrates all of the steps required to provision, manage, and share a compute platform on a cloud infrastructure, all through a web browser.

Accessibility

Get started at https://launch.usegalaxy.org/

Elasticity

Manage it programmatically

Create a new CloudMan compute cluster

Manage an existing CloudMan instance

How is it all achieved?

Impact?

http://www.citeulike.org/group/16008/tag/usecloud

Acknowledgments

Everything discussed here is the work of a large community!

Come talk to us; get involved.

[email protected] or [email protected]