Strata Big Data Science Talk on ADAM

ADAM: Fast, Scalable Genome Analysis

Frank Austin Nothaft
AMPLab, University of California, Berkeley

with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson

https://github.com/bigdatagenomics

Wednesday, February 12, 2014



Problem

• Whole genome files are large

• Biological systems are complex

• Population analysis requires petabytes of data

• Analysis time is often a matter of life and death


Shredded Book Analogy

Dickens accidentally shreds the first printing of A Tale of Two Cities

Text printed on 5 long spools

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …





Slide credit to Michael Schatz http://schatzlab.cshl.edu/




It was the best of | times, it was the worst | of times, it was the | age of wisdom, it was | the age of foolishness, …

It was the best | of times, it was the | worst of times, it was | the age of wisdom, it | was the age of foolishness, …

It was the | best of times, it was | the worst of times, it | was the age of wisdom, | it was the age of | foolishness, …

It was | the best of times, it | was the worst of times, | it was the age of | wisdom, it was the age | of foolishness, …

It | was the best of times, | it was the worst of | times, it was the age | of wisdom, it was the | age of foolishness, …






• How can he reconstruct the text?
  – 5 copies x 138,656 words / 5 words per fragment = 138k fragments
  – The short fragments from every copy are mixed together
  – Some fragments are identical
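The fragment arithmetic can be checked with a small sketch. It reproduces the slide's estimate and shreds five copies of the opening sentence at different offsets to show that identical fragments do turn up. The `shred` helper and the offsets 0 to 4 are illustrative assumptions, not part of the original talk.

```python
from collections import Counter

def shred(words, fragment_len, offset):
    """Cut one copy into fragments of `fragment_len` words.

    The first cut falls after `offset` words, so each copy can be cut
    at different positions. (Hypothetical helper for illustration.)
    """
    pieces = [words[:offset]] if offset else []
    pieces += [words[i:i + fragment_len]
               for i in range(offset, len(words), fragment_len)]
    return [" ".join(piece) for piece in pieces]

# Estimate from the slide: copies x words / words per fragment.
estimated_fragments = 5 * 138_656 // 5

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,").split()

# Five copies, each shredded with a different starting offset, then mixed.
pile = [frag for offset in range(5) for frag in shred(opening, 5, offset)]

# Fragments that occur more than once in the mixed pile.
repeated = {frag for frag, count in Counter(pile).items() if count > 1}

print(estimated_fragments)
print(sorted(repeated))
```

Repeated fragments are what make reassembly ambiguous: a fragment such as "of times, it was the" fits at more than one place in the text.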




















What is ADAM?

• File formats: columnar file format that allows efficient parallel access to genomes

• API: interface for transforming, analyzing, and querying genomic data

• CLI: a handy toolkit for quickly processing genomes
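As a rough illustration of why a columnar layout helps, the sketch below pivots a few row-oriented records into per-field columns, so a query that needs only one field touches a single list. The record fields (`name`, `contig`, `start`, `sequence`) and the `to_columns` helper are assumptions made for this example; they are not ADAM's schema or API.

```python
# Row-oriented layout: each record keeps all of its fields together.
# Field names are hypothetical, chosen only for this sketch.
records = [
    {"name": "read1", "contig": "chr1", "start": 100, "sequence": "ACGT"},
    {"name": "read2", "contig": "chr1", "start": 250, "sequence": "TTGA"},
    {"name": "read3", "contig": "chr2", "start": 40, "sequence": "GGCA"},
]

def to_columns(rows):
    """Pivot a list of records into one list per field (columnar layout)."""
    columns = {field: [] for field in rows[0]}
    for row in rows:
        for field, value in row.items():
            columns[field].append(value)
    return columns

columns = to_columns(records)

# A query over start positions reads one column and skips the sequences.
starts = columns["start"]
print(starts)
```

Because each column can also be split into independent chunks, different workers can scan different chunks of the same column in parallel.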


The Broad Institute “Best Practices” Pipeline

SNAP, ADAM, Avocado

MLBase, GraphX

SiRen, CAGE


Design Goals

• Develop processing pipeline that enables efficient, scalable use of cluster/cloud

• Provide data format that has efficient parallel/distributed access across platforms

• Enhance semantics of data and allow more flexible data access patterns


Whole Genome Data Sizes

Tool      Pipeline Stage    Input                     Output
SNAP      Alignment         1GB Fasta, 150GB Fastq    250GB BAM
ADAM      Pre-processing    250GB BAM                 200GB ADAM
Avocado   Variant Calling   200GB ADAM                10MB ADAM

Variants found at about 1 in 1,000 loci
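A back-of-the-envelope sketch of these numbers follows. The genome length of roughly 3 billion loci is an assumption that does not appear on the slide; the file sizes and the 1-in-1,000 rate are taken from it.

```python
# Assumption (not on the slide): a human genome has roughly 3 billion loci.
GENOME_LOCI = 3_000_000_000

# From the slide: variants are found at about 1 in 1,000 loci.
expected_variants = GENOME_LOCI // 1000

# From the slide: pre-processing turns a 250GB BAM into a 200GB ADAM file.
bam_gb, adam_gb = 250, 200
size_reduction = (bam_gb - adam_gb) / bam_gb

print(f"about {expected_variants:,} variant sites")
print(f"ADAM file is {size_reduction:.0%} smaller than the BAM")
```

Under that assumption, variant calling reduces hundreds of gigabytes of reads to a few million sites, which is why the final output is so small.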


ADAM Stack

Physical
‣ Commodity Hardware
‣ Cloud Systems - Amazon, GCE, Azure

File/Block
‣ Hadoop Distributed Filesystem
‣ Local Filesystem

Record/Split
‣ Schema-driven records w/ Apache Avro
‣ Store and retrieve records using Parquet
‣ Read BAM Files using Hadoop-BAM

RDD
‣ Transform records using Apache Spark
‣ Query with SQL using Shark
‣ Graph processing with GraphX
‣ Machine learning using MLBase
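To make "schema-driven records" and "transform records" concrete, here is a minimal single-machine sketch in plain Python of the two transformations benchmarked in this talk, sort and mark duplicates. It is not ADAM's implementation or API: the `Read` fields are hypothetical, and the duplicate rule (same contig, start and strand; keep the read with the highest mapping quality) is a deliberate simplification.

```python
from dataclasses import dataclass, replace
from itertools import groupby

@dataclass(frozen=True)
class Read:
    """A fixed-schema read record (hypothetical fields for this sketch)."""
    name: str
    contig: str
    start: int
    reverse_strand: bool
    mapq: int
    duplicate: bool = False

def sort_reads(reads):
    """Order reads by reference position: contig first, then start."""
    return sorted(reads, key=lambda r: (r.contig, r.start))

def mark_duplicates(reads):
    """Flag all but the best read at each (contig, start, strand).

    Simplified rule assumed for illustration; production tools use
    more careful position and quality criteria.
    """
    def position(r):
        return (r.contig, r.start, r.reverse_strand)

    marked = []
    for _, group in groupby(sorted(reads, key=position), key=position):
        group = list(group)
        best = max(group, key=lambda r: r.mapq)
        marked.extend(replace(r, duplicate=(r is not best)) for r in group)
    return marked

reads = [
    Read("r1", "chr1", 100, False, 60),
    Read("r2", "chr1", 100, False, 30),
    Read("r3", "chr1", 50, False, 60),
    Read("r4", "chr2", 10, True, 40),
]

sorted_names = [r.name for r in sort_reads(reads)]
duplicate_names = [r.name for r in mark_duplicates(reads) if r.duplicate]
print(sorted_names, duplicate_names)
```

Both steps group or order records by a key, which is the kind of operation a cluster framework can spread across many nodes.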


Implementation

• Work accelerated at end of September

• 15K lines of Scala code

• 100% Apache-licensed open-source

• Code contributions from Mt. Sinai, GenomeBridge, The Broad Institute

• Proof of concept implementations at The Broad Institute, Duke, Harvard, UCSC

• Global Alliance looking at ADAM as potential standard

Commits


AWS EC2 Performance

NA12878 Whole Genome: 234GB as BAM, 229GB as ADAM

Runtime in minutes

                  Sort    Mark Duplicates    Sort speedup vs. Picard
Picard            1064    1222
ADAM 1 node       536     -                  2x
ADAM 32 nodes     32      70                 33x
ADAM 100 nodes    20      28                 53x


Mt. Sinai Performance

NA12878 Whole Genome: 234GB as BAM, 229GB as ADAM

Runtime in minutes

                  Sort    Mark Duplicates    Sort speedup vs. Picard
Picard            1064    1222
ADAM 16 nodes     29      50                 36x
ADAM 32 nodes     17      34                 62x
ADAM 64 nodes     11      15                 96x
ADAM 82 nodes     8       14                 133x
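The speedup labels on this slide can be reproduced by dividing the 1,064-minute Picard run by the 29, 17, 11 and 8 minute ADAM runs. The sketch below does that arithmetic; pairing those runtimes with 16, 32, 64 and 82 nodes is read off the chart order and should be treated as an assumption.

```python
# Runtimes in minutes, taken from the slide. The pairing of each runtime
# with a node count is inferred from the chart order (assumption).
picard_minutes = 1064
adam_minutes = {16: 29, 32: 17, 64: 11, 82: 8}

# Speedup relative to Picard for each cluster size.
speedups = {nodes: picard_minutes / minutes
            for nodes, minutes in adam_minutes.items()}

for nodes, speedup in speedups.items():
    print(f"ADAM {nodes} nodes: {speedup:.1f}x faster than Picard")
```

Truncating each ratio to a whole number gives the 36x, 62x, 96x and 133x shown on the slide.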


Scalability

HG00096 → 16GB
NA12878 → 234GB


Acknowledgements

• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Timothy Danford, Carl Yeksigian


Acknowledgements

This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Apple, Inc., Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!.


“Before seeing your data, I would not think sorting could be done that fast.”

- research scientist at the Broad Institute


http://www.broadinstitute.org/


Call for contributions

• As an open source project, we welcome contributions

• We maintain a list of open enhancements in our GitHub issue tracker

• Enhancements tagged with “Pick me up!” don’t require a genomics background

• GitHub: https://github.com/bigdatagenomics/


https://github.com/bigdatagenomics/adam

