19
ADAM: Fast, Scalable Genome Analysis Frank Austin Nothaft AMPLab, University of California, Berkeley with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Aruha, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson https://github.com/bigdatagenomics Wednesday, February 12, 14

Strata Big Data Science Talk on ADAM

Embed Size (px)

Citation preview

Page 1: Strata Big Data Science Talk on ADAM

ADAM: Fast, Scalable Genome Analysis

Frank Austin NothaftAMPLab, University of California, Berkeley

with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Aruha, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave

Patterson

https://github.com/bigdatagenomics

Wednesday, February 12, 14

Page 2: Strata Big Data Science Talk on ADAM

Problem

• Whole genome files are large

• Biological systems are complex

• Population analysis requires petabytes of data

• Analysis time is often a matter of life and death

Wednesday, February 12, 14

Page 3: Strata Big Data Science Talk on ADAM

Shredded Book AnalogyDickens accidentally shreds the first printing of A Tale of Two Cities

Text printed on 5 long spools

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Slide credit to Michael Schatz http://schatzlab.cshl.edu/Wednesday, February 12, 14

Page 4: Strata Big Data Science Talk on ADAM

Shredded Book AnalogyDickens accidentally shreds the first printing of A Tale of Two Cities

Text printed on 5 long spools

It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …

It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,

It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …

It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …

It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …

Slide credit to Michael Schatz http://schatzlab.cshl.edu/Wednesday, February 12, 14

Page 5: Strata Big Data Science Talk on ADAM

Shredded Book AnalogyDickens accidentally shreds the first printing of A Tale of Two Cities

Text printed on 5 long spools

• How can he reconstruct the text?– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments– The short fragments from every copy are mixed together– Some fragments are identical

It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …

It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,

It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …

It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …

It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …

Slide credit to Michael Schatz http://schatzlab.cshl.edu/Wednesday, February 12, 14

Page 6: Strata Big Data Science Talk on ADAM

Shredded Book AnalogyDickens accidentally shreds the first printing of A Tale of Two Cities

Text printed on 5 long spools

• How can he reconstruct the text?– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments– The short fragments from every copy are mixed together– Some fragments are identical

It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …

It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,

It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …

It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …

It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …

Slide credit to Michael Schatz http://schatzlab.cshl.edu/Wednesday, February 12, 14

Page 7: Strata Big Data Science Talk on ADAM

What is ADAM?

• File formats: columnar file format that allows efficient parallel access to genomes

• API: interface for transforming, analyzing, and querying genomic data

• CLI: a handy toolkit for quickly processing genomes

Wednesday, February 12, 14

Page 8: Strata Big Data Science Talk on ADAM

The Broad Institute “Best Practices” Pipeline

SNAPADAM Avocado

MLBaseGraphX

SiRen CAGE

Wednesday, February 12, 14

Page 9: Strata Big Data Science Talk on ADAM

Design Goals

• Develop processing pipeline that enables efficient, scalable use of cluster/cloud

• Provide data format that has efficient parallel/distributed access across platforms

• Enhance semantics of data and allow more flexible data access patterns

Wednesday, February 12, 14

Page 10: Strata Big Data Science Talk on ADAM

Whole GenomeData Sizes

Input PipelineStage

Output

SNAP 1GB Fasta150GB Fastq

Alignment 250GB BAM

ADAM 250GB BAM Pre-processing

200GB ADAM

Avocado 200GB ADAM

Variant Calling 10MB ADAM

Variants found at about 1 in 1,000 loci

Wednesday, February 12, 14

Page 11: Strata Big Data Science Talk on ADAM

ADAM Stack

Physical

File/Block

Record/Split

‣Commodity Hardware‣Cloud Systems - Amazon, GCE, Azure

‣Hadoop Distributed Filesystem‣Local Filesystem

‣Schema-driven records w/ Apache Avro‣Store and retrieve records using Parquet‣Read BAM Files using Hadoop-BAM

RDD

‣Transform records using Apache Spark‣Query with SQL using Shark‣Graph processing with GraphX‣Machine learning using MLBase

Wednesday, February 12, 14

Page 12: Strata Big Data Science Talk on ADAM

Implementation

• Work accelerated at end of September

• 15K lines of Scala code

• 100% Apache-licensed open-source

• Code contributions from Mt. Sinai, GenomeBridge, The Broad Institute

• Proof of concept implementations at The Broad Institute, Duke, Harvard, UCSC

• Global Alliance looking at ADAM as potential standard

Commits

Wednesday, February 12, 14

Page 13: Strata Big Data Science Talk on ADAM

AWS EC2 Performance

Picard

ADAM 1 node

ADAM 32 nodes

ADAM 100 nodes

1 10 100 1000 10000

28

70

1222

20

32

536

1064

Minutes (log scale)

Sort Mark Duplicates

NA12878 Whole Genome234GB as BAM, 229 GB as ADAM

2x

33x

53x

Wednesday, February 12, 14

Page 14: Strata Big Data Science Talk on ADAM

Mt. Sinai Performance

Picard

ADAM 16 nodes

ADAM 32 nodes

ADAM 64 nodes

ADAM 82 nodes

1 10 100 1000 10000

14

15

34

50

1222

8

11

17

29

1064

Minutes (log scale)

Sort Mark Duplicates

NA12878 Whole Genome234GB as BAM, 229 GB as ADAM

36x

62x

96x

133x

Wednesday, February 12, 14

Page 15: Strata Big Data Science Talk on ADAM

Scalability

HG00096 à 16GBNA12878 à 234GB

Wednesday, February 12, 14

Page 16: Strata Big Data Science Talk on ADAM

Acknowledgements

• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Timothy Danford, Carl Yeksigian

Wednesday, February 12, 14

Page 17: Strata Big Data Science Talk on ADAM

Acknowledgements

This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA

XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Apple, Inc.,

Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric,

Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and

Yahoo!.

Wednesday, February 12, 14

Page 18: Strata Big Data Science Talk on ADAM

“Before seeing your data, I would not think sorting could be done that fast.”

- research scientist at the Broad Institute

Wednesday, February 12, 14

Page 19: Strata Big Data Science Talk on ADAM

Call for contributions

• As an open source project, we welcome contributions

• We maintain a list of open enhancements at our Github issue tracker

• Enhancements tagged with “Pick me up!” don’t require a genomics background

• Github: https://github.com/bigdatagenomics/

Wednesday, February 12, 14