Cloud Computing Technologies for Genomic Big Data Analysis Fabrício A. B. Silva, Alberto Davila FIOCRUZ {fabs,davila}@fiocruz.br

DESCRIPTION

Talk by Fabricio Silva at the 1st Symposium of Big Data and Public Health, 2013

Page 1: Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis

Cloud Computing Technologies for Genomic Big Data Analysis

Fabrício A. B. Silva, Alberto Davila

FIOCRUZ

{fabs,davila}@fiocruz.br

Page 2

Big Data – A Definition

“Big data is a term used to describe information assemblages that make conventional data, or database, processing problematic due to any combination of their size (volume), frequency of update (velocity), or diversity (variety)”

Hay SI, George DB, Moyes CL, Brownstein JS (2013) Big Data Opportunities for Global Infectious Disease Surveillance. PLoS Med 10(4): e1001413. doi:10.1371/journal.pmed.1001413

Page 3

The Data Deluge

“In the last five years, more scientific data has been generated than in the entire history of mankind. You can imagine what’s going to happen in the next five.”

Winston Hide, associate professor of bioinformatics, Harvard School of Public Health.

The promise of big data. HSPH News, Spring/Summer 2012

Page 4

Example: GenBank

http://www.ncbi.nlm.nih.gov/genbank/statistics Accessed on Oct 22, 2013

Page 5

DNA Sequencing Evolution

Stein, L. D. (2010). The case for cloud computing in genome informatics. Genome Biol, 11(5), 207.

Page 6

Interesting Facts...

• The cost of sequencing a human genome fell from US$1 million in 2007 to US$1 thousand in 2012

• A human genome has 3 billion bp, or roughly 100 GB of raw data

• NCI’s million genomes project: 1 million TB, or 1,000 petabytes, or 1 exabyte

O'Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data’, Hadoop and cloud computing in genomics. Journal of Biomedical Informatics.

Page 7

The Processing Bottleneck

Software  | Number of Cores | Start         | Finish        | Processing Time | File sizes
Flash     | 24              | 9/12/13 22:48 | 9/12/13 22:48 | 0:00:53         | 2 files: 237 MB and 238 MB
Velveth   | 1               | 9/12/13 22:50 | 9/12/13 22:52 | 0:01:39         | 3 files: 100 MB, 166 MB and 165 MB
Velvetg   | 1               | 9/12/13 22:54 | 9/12/13 22:59 | 0:04:53         | 2 files: 250 MB and 75 MB
Mira      | 24              | 9/12/13 23:11 | 9/12/13 23:32 | 0:21:21         | 2 files: 69 MB and 6 MB
Glimmer3  | 1               | 9/12/13 23:40 | 9/12/13 23:40 | 0:00:40         | 2 files: 6 MB and 1.4 MB
Blastx    | 24              | 9/12/13 23:46 | 9/13/13 9:23  | 9:36:15         | Against RefSeq (17,411,217 entries)

Pipeline processed at the Computational and Systems Biology Lab, Bioinformatics Platform, Instituto Oswaldo Cruz, FIOCRUZ. Input data size: 500 MB
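
To make the bottleneck concrete, the sketch below sums the Processing Time column of the table above and reports each step's share. The function names are illustrative only. On these figures Blastx accounts for roughly 95% of the summed processing time (idle gaps between steps are not counted).

```python
# Processing times copied from the Processing Time column (h:mm:ss).
TIMES = {
    "Flash": "0:00:53",
    "Velveth": "0:01:39",
    "Velvetg": "0:04:53",
    "Mira": "0:21:21",
    "Glimmer3": "0:00:40",
    "Blastx": "9:36:15",
}


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def time_shares(times: dict) -> dict:
    """Return each step's fraction of the summed processing time."""
    secs = {name: to_seconds(t) for name, t in times.items()}
    total = sum(secs.values())
    return {name: s / total for name, s in secs.items()}


if __name__ == "__main__":
    # Print the steps from most to least expensive.
    for name, share in sorted(time_shares(TIMES).items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} {share:6.1%}")
```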

Page 8

NGS: Expect Much More Data

Page 9

What Then?

Page 10

Cloud Computing: a Definition

• “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”

NIST – Available at http://www.nist.gov/itl/cloud/upload/cloud-def-v15.pdf

Page 11

Cloud Computing: Advantages

• Flexibility
  – Use of virtualization technology

• Scalability
  – Large number of nodes connected at local-network speed

• Availability/Accessibility
  – Even small labs can harness the power of the cloud

Page 12

Cloud Scalability: Example

Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., & Nolan, G. P. (2011). Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology. Nature Reviews Genetics, 12(3), 224.

Page 13

Cloud Computing: Challenges

• Bandwidth Limits
  – Large data sets need to be moved to the cloud

• Security/Privacy Issues
  – Limited control over remote storage

• Expertise
  – Adapting new applications to the cloud still requires some technical expertise
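
As a rough illustration of the bandwidth limit, the sketch below estimates how long an upload takes at a given sustained link speed. The function name and the 100 Mbit/s link speed are assumptions chosen for illustration; the 100 GB figure is the raw-data size of one human genome quoted earlier in the talk. Under these assumptions a single genome takes a little over two hours to move, before any protocol overhead.

```python
def transfer_hours(size_gb: float, link_mbit_s: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) over a link
    that sustains link_mbit_s megabits per second, ignoring overhead."""
    bits = size_gb * 8e9                 # 1 GB = 8e9 bits
    seconds = bits / (link_mbit_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # 100 GB over an assumed 100 Mbit/s link.
    print(f"{transfer_hours(100, 100):.1f} hours")
```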

Page 14

MapReduce

• MapReduce/Hadoop
  – MapReduce: Parallel distributed framework invented by Google for processing large data sets
  – Data and computations are spread over thousands of computers, processing petabytes of data each day
  – Hadoop is the leading open-source implementation
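
The MapReduce model can be sketched in a few lines, here counting k-mers across sequencing reads. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop code; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence over a list of reads."""
    pairs = chain.from_iterable(map_kmers(read, k) for read in reads)
    return dict(reduce_counts(key, vals) for key, vals in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"]))
```

In a real cluster the map calls run on many nodes at once and the framework performs the shuffle over the network; the per-key independence of the reduce step is what lets it scale.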

Page 15

MapReduce

• MapReduce/Hadoop: Advantages
  – Scalable, Efficient, Reliable
  – Easy to program
  – Runs on commodity computers

• MapReduce/Hadoop: Challenges
  – Redesigning, retooling applications

Page 16

Cloud Computing in Genomics

• Crossbow
  – Scalable software pipeline for whole genome resequencing analysis over Hadoop

• CloudBurst
  – Highly sensitive short read mapping over Hadoop

• Myrna
  – Tool for calculating differential gene expression in large RNA-seq datasets over Hadoop

Page 17

Cloud Computing in Genomics

• Contrail
  – De novo assembly of large genomes over Hadoop

• CloudBlast
  – Scalable BLAST over Hadoop

• Quake
  – DNA sequence error detection and correction in sequence reads over Hadoop

Page 18

Cloud Computing in Genomics

• More examples of Hadoop-based apps:
  – CloudAligner
  – BlastReduce
  – CloudBrush
  – GATK
  – Nephele
  – BlueSNP
  – Etc.

Page 19

Crossbow: Hadoop Streaming

Langmead, B., Schatz, M. C., Lin, J., Pop, M., & Salzberg, S. L. (2009). Searching for SNPs with cloud computing. Genome Biol, 10(11), R134.

Page 20

Crossbow: Hadoop Streaming

1. Map (Bowtie): many sequencing reads are mapped to the reference genome in parallel.

2. Shuffle: the sequence alignments are aggregated so that all alignments on the same chromosome or locus are grouped together and sorted by position.

3. Reduce/Scan (SOAPsnp): the sorted alignments are scanned to identify SNPs (Single Nucleotide Polymorphism) within each region.
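
The three steps above can be sketched on toy data. This is not Crossbow, Bowtie or SOAPsnp: the reference sequences, the naive best-match aligner and the majority-vote SNP rule are all stand-ins invented for illustration, and positions are 0-based.

```python
from collections import Counter, defaultdict

# Toy reference genome (hypothetical data).
REFERENCE = {"chr1": "ACGTACGTTAGC", "chr2": "TTGACCAGTCAA"}


def map_read(read):
    """Map: place one read at its best ungapped position (aligner stand-in)."""
    best = None
    for chrom, seq in REFERENCE.items():
        for pos in range(len(seq) - len(read) + 1):
            window = seq[pos:pos + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if best is None or mismatches < best[0]:
                best = (mismatches, chrom, pos)
    _, chrom, pos = best
    return chrom, (pos, read)


def shuffle(pairs):
    """Shuffle: group alignments by chromosome and sort them by position."""
    groups = defaultdict(list)
    for chrom, alignment in pairs:
        groups[chrom].append(alignment)
    return {chrom: sorted(alns) for chrom, alns in groups.items()}


def reduce_snps(chrom, alignments, min_depth=2):
    """Reduce/scan: report positions where the most common read base
    differs from the reference and is seen at least min_depth times."""
    pileup = defaultdict(Counter)
    for pos, read in alignments:
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, depth = pileup[pos].most_common(1)[0]
        ref_base = REFERENCE[chrom][pos]
        if base != ref_base and depth >= min_depth:
            snps.append((chrom, pos, ref_base, base))
    return snps


def call_snps(reads):
    """Run map, shuffle and reduce in sequence over a list of reads."""
    grouped = shuffle(map_read(read) for read in reads)
    return [snp for chrom in sorted(grouped)
            for snp in reduce_snps(chrom, grouped[chrom])]


if __name__ == "__main__":
    # Both reads carry a C where chr1 has a G at position 6.
    print(call_snps(["GTACCT", "ACCTTA"]))
```

Because each chromosome group is reduced independently, the scan step can run in parallel across regions, which is the property the shuffle step exists to provide.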

Page 21

Cloud-enabled Technologies

• Apache HBase
  – Open source, non-relational, distributed database modeled after Google's BigTable. It runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop

Page 22

Cloud-enabled Technologies

• Apache Cassandra
  – Linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure, with support for replication across multiple datacenters

• Google's Pregel/Apache Giraph
  – Iterative graph processing system built for high scalability

Page 23

Cloud-enabled Technologies

• Apache Hive
  – Data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets

• Apache Pig
  – High-level language for expressing data analysis programs, coupled with evaluation infrastructure over Hadoop

Page 24

Parallel Patterns for the Cloud

• Stream-oriented
  – Farm
  – Farm with feedback
  – Pipeline

• Data-parallel
  – Map
  – Reduce
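
A minimal sketch of two of the stream-oriented patterns, using only the Python standard library. The helper names (`farm`, `pipeline`, `gc_content`) are hypothetical. Threads are used to show the structure; CPU-bound work in CPython would normally use processes or separate machines instead, and error handling is omitted.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the stream


def gc_content(seq):
    """Fraction of G and C bases in a sequence (example workload)."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def farm(worker, items, workers=4):
    """Farm: independent items are handed to a pool of identical workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, items))  # results keep input order


def pipeline(stages, items):
    """Pipeline: one thread per stage, connected by FIFO queues, so
    different items can be in different stages at the same time."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def run_stage(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the end marker downstream
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=run_stage, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(farm(gc_content, ["ACGT", "GGCC", "ATAT"]))
    print(pipeline([str.upper, gc_content], ["acgt", "ggcc"]))
```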

Page 25

Pipeline Pattern: Stingray@Galaxy

Page 26

Multiple Parallel Patterns

Aldinucci, Marco, et al. Parallel stochastic systems biology in the cloud. Briefings in Bioinformatics (2013).

Page 27

But... our group does not have the expertise to develop its own cloud applications...

Can we still use the Cloud/MapReduce for genomic processing?

Page 28

Galaxy CloudMan

Page 29

Cloudgene

Schönherr, S. et al. (2012). Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds. BMC bioinformatics, 13(1), 200.

Page 30

What's Next?

• Beyond Hadoop
  – Adoption of new technologies/parallel patterns for genomic data analysis in the cloud

• Scalable Data Storage
  – High Availability/Support for replication
  – Preliminary work on HBase by Intel

• Private/Hybrid/Corporate Clouds
  – Privacy/security issues
  – Data tenancy

Page 31

Thank You!!!

Acknowledgements: Nelson Kotowski, Rodrigo Jardim (FIOCRUZ)