Transcript
Page 1: Genomics Is Not Special: Towards Data Intensive Biology

Genomics Is Not Special

Uri Laserson // [email protected] // 13 November 2014

Toward Data-Intensive Biology

Page 2

© 2014 Cloudera, Inc. All rights reserved.

http://omicsmaps.com/

>25 Pbp / year

Page 3

Carr and Church, Nat. Biotech. 27: 1151 (2009)

Page 4

For every “-ome” there’s a “-seq”

Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq

http://liorpachter.wordpress.com/seq/

Page 5

Page 6

Based on IMGT/LIGM release 201111-6

Page 7

Page 8

Page 9

Developer/computational efficiency is becoming paramount

Genome Biology 12: 125 (2011)

Page 10

Software and data management practices have been around since the 1970s

• Version control/reproducibility

• Testing/automation/integration

• Databases and data formats

• API design

• Lots (most?) of big data innovation happening in industry

Page 11

Example query

For each variant that is
• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation, using samples in the Framingham Heart Study

CHR  POS       REF  ALT  POP           MAF    POLYPHEN
7    12289237  A    G    Plain         0.01   possibly damaging
7    12289237  A    G    Star-bellied  0.03   possibly damaging
12   2288332   T    C    Plain         0.003  probably damaging
12   2288332   T    C    Star-bellied  0.09   probably damaging

Page 12

Available data

Data set                 Format            Size
Population genotypes     VCF               10-100s of billions
DNase HS sites (ENCODE)  narrowPeak (BED)  <1 million
dbSNP                    CSV               10s of millions
Sample phenotypes        JSON              thousands

Page 13

Why text data is a bad idea

• Text is highly inefficient
  • Compresses poorly
  • Values must be parsed
• Text is semi-structured at best
  • Flexible schemas make parsing difficult
  • Difficult to make assumptions about data structure
• Text poorly separates the roles of delimiters and data
  • Requires escaping of control characters
  • (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used)
• But still almost always better than Excel
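The RS/FS point above can be shown in a few lines. This is a toy illustration of a format that needs no escaping, not a real serialization library:

```python
# Using ASCII's built-in record separator (RS = 0x1E) and field
# separator (FS = 0x1F) instead of newline/comma, so commas, pipes,
# and tabs inside values never need escaping.
RS, FS = "\x1e", "\x1f"

def encode(records):
    # records: list of lists of string fields
    return RS.join(FS.join(fields) for fields in records)

def decode(blob):
    return [rec.split(FS) for rec in blob.split(RS)]

rows = [["chr7", "1228,9237", "A|G"],   # commas/pipes in data are harmless
        ["chr12", "2288332", "T,C"]]
assert decode(encode(rows)) == rows
```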

Page 14

Some reasons VCF in particular is bad

• Number of records (variants) grows with new variants, rather than new genotypes
  • difficult to write data
  • adding a sample requires a rewrite of the entire file
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:
  • catalogue of variation
  • repository of actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
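The last bullet is worth a concrete example: the same indel can be spelled with different REF/ALT padding. A minimal trimming normalizer (a hypothetical helper, not part of any VCF library) shows the two spellings collapsing to one:

```python
# Two VCF encodings of the same 1-bp deletion normalize identically.
def normalize(pos, ref, alt):
    # strip shared trailing bases, then shared leading bases
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

# "CTT>CT" and "CT>C" at the same position describe the same deletion:
assert normalize(100, "CTT", "CT") == normalize(100, "CT", "C")
```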

Page 15

Manually executing query in Python

import json

class IntervalTree(object):
    def update(self, feature):
        pass  # ...implement tree update
    def overlaps(self, feature):
        return True  # ...implement overlap test

dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
    for line in ip:
        feature = parse_feature(line)  # parse BED/narrowPeak fields
        dnase_sites.update(feature)

samples = {}
with open('path/to/samples.json', 'r') as ip:
    for line in ip:
        sample = json.loads(line)
        if is_framingham(sample):
            samples[sample['name']] = sample

dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
    for line in ip:
        snp = tuple(line.rstrip('\n').split(',')[:3])  # CSV: split on commas
        dbsnp.add(snp)

Page 16

Additional metadata must fit in memory

(same code as on the previous slide)

Page 17

Can only read from POSIX filesystem

(same code as on the previous slide)

Page 18

Manually executing query in Python

genotype_data = {}
reader = vcf.Reader(filename='path/to/genotypes.vcf')
for variant in reader:
    if (dnase_sites.overlaps(variant) and is_deleterious(variant)
            and not in_dbsnp(variant)):
        for call in variant.samples:
            if call.sample in samples:
                pop = samples[call.sample]['population']
                genotype_data.setdefault((variant, pop), []).append(call)

mafs = {}
for (variant, pop) in genotype_data:
    mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
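The slide leaves compute_maf undefined. A minimal sketch, under the assumption that each call is a genotype string like "0/1" or "1|0" (real call objects would first need their genotype extracted):

```python
# Minor allele frequency over a list of genotype strings.
def compute_maf(calls):
    alleles = []
    for gt in calls:
        alleles.extend(gt.replace("|", "/").split("/"))
    alleles = [a for a in alleles if a != "."]  # drop missing calls
    if not alleles:
        return None
    alt = sum(1 for a in alleles if a != "0")
    freq = alt / len(alleles)
    return min(freq, 1 - freq)  # minor allele is whichever is rarer

assert compute_maf(["0/1", "0|0", "1/1"]) == 0.5
```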

Page 19

Genotype data may be split across files

(same code as on the previous slide)

Page 20

Manually executing query in Python

• If the file is gzipped, it cannot be split without decompressing (use Snappy)
• Reading files requires access to a POSIX-style file system
• Probably want to split the VCF file into pieces to parallelize
  • Requires manual scatter-gather
  • Samples may be scattered among multiple VCF files (difficult to append to VCF)
• Manually implementing a broadcast join
  • Build side must fit into memory
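The hand-rolled broadcast join the slide describes can be sketched as follows; broadcast_join and the sample/call records are illustrative, not from the talk:

```python
# Broadcast join: the small ("build") side is loaded into a dict in
# memory, then the large ("probe") side is streamed past it.
def broadcast_join(small_rows, big_rows, key):
    build = {}                      # build side must fit in memory
    for r in small_rows:
        build.setdefault(r[key], []).append(r)
    for r in big_rows:              # probe side is streamed row by row
        for match in build.get(r[key], []):
            yield {**r, **match}

samples = [{"sample": "NA1", "population": "Plain"}]
calls = [{"sample": "NA1", "gt": "0/1"}, {"sample": "NA2", "gt": "1/1"}]
result = list(broadcast_join(samples, calls, "sample"))
assert result == [{"sample": "NA1", "gt": "0/1", "population": "Plain"}]
```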

Page 21

Manually executing query in Python on HPC

$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py

Page 22

Manually executing query in Python on HPC

(same commands as on the previous slide)

• How to serialize intermediate output?
• Manually specify requested resources
• Manually split and merge
• Babysit and check for errors/failures

Page 23

HPC separates compute from storage

HPC is about compute. Hadoop is about data.

Storage infrastructure
• Proprietary, distributed file system
• Expensive

Compute cluster
• High-perf, reliable hardware
• Expensive

A big network pipe ($$$) connects the two.

User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)

Page 24

HPC is lower-level than Hadoop

• HPC only exposes job scheduling
• Parallelization typically through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually

Page 25

HPC uses the file system as a database and the text file as a lowest common denominator

• All tools assume flat files with POSIX semantics

• Sharing data/collaboration involves copying large files

• The Broad’s joint caller with 25k genomes hits file-handle limits

• Files always streamed over network (HPC architecture)

Page 26

HPC uses job scheduler as workflow tool

• Submitting jobs to a scheduler is low level
• Workflow engines/execution models provide high-level execution graphs with built-in fault tolerance
  • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive

Page 27

Prepping data for local analysis in R/Python

• Manual script to prepare CSV file for working locally

• Same issues as above

• Requires working set of data to fit into memory of a single machine

• Visualization

Page 28

Domain-specific tools (e.g., PLINK/Seq)

$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp

one of a limited set of specific, useful tasks

(yet another) custom query specification

Page 29

Domain-specific tools (e.g., PLINK/Seq)

• Works great if your problem fits into the pre-designed computations

• Only works if your problem fits into the pre-designed computations

• How to do stats by subpopulation?
  • Probably possible, but need to learn new notation
• Must do work to get the data in to begin with

• Not obviously parallelizable for performance on large data sets

• Built on SQLite underneath

Page 30

RDBMS and SQL (e.g., MySQL)

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
  ON g.sample = s.sample
INNER JOIN dnase d
  ON g.chr = d.chr
  AND g.pos >= d.start
  AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr
  AND g.pos = p.pos
  AND g.ref = p.ref
  AND g.alt = p.alt
WHERE
  s.study = "framingham" AND
  p.pos IS NULL AND
  g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop

Page 31

RDBMS and SQL (e.g., MySQL)

• Feature-rich and very mature

• Highly optimized and allows indexing

• Declarative (and abstracted) language for data

• Hassle to get data in; data end up formatted one way

• No clear scalability story

• SQL-only

Page 32

Problems with old way

• Expensive

• No fault-tolerance

• No horizontal scalability

• Poor separation of data modeling and storage formats
  • File format proliferation

• Inefficient text formats

Page 33

Page 34

Indexing the web

• The web is huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build a search index of keywords to pages
• Do it in real time!

Page 35

Page 36

Databases in 1999

• Buy a really big machine

• Install expensive DBMS on it

• Point your workload at it

• Hope it doesn’t fail

• Ambitious: buy another big machine as backup

Page 37

Page 38

Database limitations

• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data

Page 39

Page 40

Google does something different

• Designed their own storage and processing infrastructure
  • Google File System (GFS) and MapReduce (MR)
  • Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years
  • Still used internally to this day (millions of jobs)

Page 41

Google benevolent enough to publish

GFS paper (2003); MapReduce paper (2004)

Page 42

Birth of Hadoop at Yahoo!

• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR

• 2006: Spun out as Apache Hadoop

• Named after Doug’s son’s yellow stuffed elephant

Page 43

Open-source proliferation

Google           Open-source     Function
GFS              HDFS            Distributed file system
MapReduce        MapReduce       Batch distributed data processing
Bigtable         HBase           Distributed DB/key-value store
Protobuf/Stubby  Thrift or Avro  Data serialization/RPC
Pregel           Giraph          Distributed graph processing
Dremel/F1        Impala          Scalable interactive SQL (MPP)
FlumeJava        Crunch          Abstracted data pipelines on Hadoop

Page 44

Hadoop provides:

• Data centralization on HDFS
  • No rewriting data for each tool/application
  • Data-local execution to avoid moving terabytes
• High-level execution engines
  • SQL (Impala, Hive)
  • Relational algebra (Spark, MapReduce)
  • Bulk synchronous parallel (GraphX)
  • Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC

Page 45

Hadoop provides serialization/RPC formats (Avro)

• Specify schemas/services in user-friendly IDLs

• Code-generation to multiple languages (wire-compatible/portable)

• Compact, binary formats

• Support for schema evolution

• Like binary JSON

record Feature {
  union { null, string } featureId = null;
  union { null, string } featureType = null;  // e.g., DNase HS
  union { null, string } source = null;       // e.g., BED, GFF file
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, Strand } strand = null;
  union { null, double } value = null;
  array<Dbxref> dbxrefs = [];
  array<string> parentIds = [];
  map<string> attributes = {};
}
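The schema-evolution bullet can be made concrete: because every field is nullable with a default, a reader with a newer schema can still consume old records by filling in defaults. This sketch models the idea with plain dicts, not a real Avro library; NEW_SCHEMA_DEFAULTS and read_with_schema are hypothetical names:

```python
# Defaults for a (newer) reader schema; missing fields get these values.
NEW_SCHEMA_DEFAULTS = {"featureId": None, "featureType": None,
                       "start": None, "end": None, "value": None}

def read_with_schema(record, defaults):
    # record's own fields win; anything the writer didn't know about
    # falls back to the reader schema's default.
    return {**defaults, **record}

old_record = {"featureId": "f1", "start": 100}  # written before "value" existed
evolved = read_with_schema(old_record, NEW_SCHEMA_DEFAULTS)
assert evolved["value"] is None and evolved["start"] == 100
```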

Page 46

APIs instead of file formats

• Service-oriented architectures (SOA) ensure stable contracts

• Allows for implementation changes with new technologies

• Software community has lots of experience with SOA, along with mature tools

• Can be implemented in language-independent fashion

Page 47

Current file format hairball

Page 48

API-oriented architecture

Page 49

Hadoop provides columnar storage (Parquet)

• Designed for general data storage
• Columnar format
  • read fewer bytes
  • compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary encoding
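A toy run-length encoder shows why columnar layouts compress so well: within one column the values are homogeneous and often repetitive (think of a sorted chromosome column), so runs collapse:

```python
# Toy run-length encoding over a single column of values.
def rle(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

chrom_column = ["7", "7", "7", "12", "12"]
assert rle(chrom_column) == [("7", 3), ("12", 2)]
```

Row-oriented storage interleaves heterogeneous fields, so runs like this rarely form.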

Page 50

Hadoop provides columnar storage (Parquet)

Page 51

Hadoop provides columnar storage (Parquet)

Vertical partitioning (projection pushdown) + horizontal partitioning (predicate pushdown) = read only the data you need!
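The two pushdowns can be modeled in a few lines. This is a toy column store, not Parquet's actual layout; scan and the chunk structure are illustrative assumptions. Each column is a list of chunks, and the predicate column's chunks carry min/max stats:

```python
# Projection pushdown: touch only the requested columns.
# Predicate pushdown: skip whole chunks whose min/max rule out the filter.
def scan(columns, want, col, lo, hi):
    out = []
    for i, chunk in enumerate(columns[col]):
        if chunk["max"] < lo or chunk["min"] > hi:
            continue                      # whole chunk skipped via stats
        for j, v in enumerate(chunk["vals"]):
            if lo <= v <= hi:             # row passes the predicate
                out.append({c: columns[c][i]["vals"][j] for c in want})
    return out

columns = {
    "pos": [{"min": 1, "max": 10, "vals": [1, 5, 10]},
            {"min": 100, "max": 200, "vals": [100, 150, 200]}],
    "ref": [{"vals": ["A", "C", "G"]}, {"vals": ["T", "A", "C"]}],
}
# Only the second "pos" chunk is read; only "ref" values are materialized.
assert scan(columns, ["ref"], "pos", 120, 999) == [{"ref": "A"}, {"ref": "C"}]
```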

Page 52

Hadoop provides abstractions for data processing

• HDFS (scalable, distributed storage)
• YARN (resource management)
• Execution engines on YARN: MapReduce, Impala (SQL), Solr (search), Spark
• Applications on top: ADAM, quince, guacamole, ...
• bdg-formats (Avro/Parquet) as the shared storage layer

Page 53

Hadoop examples: filesystem

[laserson@bottou01-10g ~]$ hadoop fs -ls /user/laserson
Found 16 items
drwx------   - laserson laserson  0 2014-11-12 16:00 .Trash
drwxr-xr-x   - laserson laserson  0 2014-11-12 00:29 .sparkStaging
drwx------   - laserson laserson  0 2014-06-07 13:27 .staging
drwxr-xr-x   - laserson laserson  0 2014-10-30 14:15 1kg
drwxr-xr-x   - laserson laserson  0 2014-05-08 17:29 bigml
drwxr-xr-x   - laserson laserson  0 2014-10-30 14:14 book
drwxrwxr-x   - laserson laserson  0 2014-06-16 12:59 editing
drwxr-xr-x   - laserson laserson  0 2014-06-06 13:49 gdelt
-rw-r--r--   3 laserson laserson  0 2014-10-27 16:24 hg19_text
drwxr-xr-x   - laserson laserson  0 2014-06-12 19:53 madlibport
drwxr-xr-x   - laserson laserson  0 2014-03-20 18:09 rock-health-python
drwxr-xr-x   - laserson laserson  0 2014-05-15 13:25 test-udf
drwxr-xr-x   - laserson laserson  0 2014-08-21 17:58 test_pymc
drwxr-xr-x   - laserson laserson  0 2014-10-27 22:25 tmp
drwxr-xr-x   - laserson laserson  0 2014-10-07 20:30 udf-scratch
drwxr-xr-x   - laserson laserson  0 2014-03-02 13:50 udfs

Page 54

Hadoop examples: batch MapReduce job

hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \
  com.cloudera.science.vcf2parquet.VCFtoParquetDriver \
  hdfs:///path/to/variants.vcf \
  hdfs:///path/to/output.parquet

Page 55

Hadoop examples: interactive Spark shell

[laserson@bottou01-10g ~]$ spark-shell --master yarn
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
[...]

scala>

Page 56

Hadoop examples: interactive Spark shell

def inDbSnp(g: Genotype): Boolean = ???        // membership test, elided on the slide
def isDeleterious(g: Genotype): Boolean = ???  // e.g., inspect g.getPolyPhen

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")

val filteredRDD = genotypesRDD
  .filter(g => !inDbSnp(g))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))
  .saveAsNewAPIHadoopFile("path/to/output")

Page 57

Page 58

Genomics ETL

.fastq -> [short read alignment] -> .bam -> [genotype calling] -> .vcf -> [analysis] -> .bed/.gtf/etc.

Page 59

Hadoop variant store architecture

• Clients: Impala shell (SQL), REST API, JDBC
• SQL queries go to the Impala engine (backed by the Hive metastore); result sets return to the client
• ETL converts .vcf into .parquet for storage

Page 60

Data denormalization

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

• Amortize join cost up-front
• Replace joins with predicates (allowing predicate pushdown)
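The denormalization idea above can be sketched in miniature: pay the sample/variant join once at load time, so queries become plain column predicates (which a columnar engine can push down). The records here are illustrative:

```python
# Join once at load time, store the flat (denormalized) rows.
variants = [{"chr": "20", "pos": 14370, "ref": "G", "alt": "A"}]
calls = [("NA00001", "0|0"), ("NA00002", "1|0")]
samples = {"NA00001": "Plain", "NA00002": "Star-bellied"}

flat = [{**v, "sample": s, "gt": gt, "population": samples[s]}
        for v in variants for (s, gt) in calls]

# "MAF by subpopulation" now needs no join at query time,
# only a filter/group-by over columns of the flat table:
assert {row["population"] for row in flat} == {"Plain", "Star-bellied"}
```

The cost is storage (sample fields repeat on every genotype row), which is exactly what columnar compression like RLE and dictionary encoding absorbs.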

Page 61

Hadoop solution characteristics

• Data stored as Parquet columnar format for performance and compression

• Impala/Hive metastore provide unified, flexible data model

• Impala implements RDBMS-style operations (by experts in distributed systems)

• Spark offers flexible relational algebra operators (and in-memory computing)

• Built-in fault tolerance for computations and horizontal scalability

Page 62

Example variant-filtering query

• “Give me all SNPs that are:
  • on chromosome 16
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples”
• On the full 1000 Genomes data set
  • ~37 billion genotypes
  • 14-node cluster
  • query completes in several seconds

SELECT cosmic as snp_id,
       vcf_chrom as chr,
       vcf_pos as pos,
       sample_id as sample,
       vcf_call_gt as genotype,
       sample_affection as phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE COSMIC IS NOT NULL AND
      dbSNP IS NULL AND
      sample_study = "breast_cancer" AND
      VCF_CHROM = "16";


Page 63

Other queries/use cases

• All-vs-all eQTL integrated with ENCODE
  • >120 billion p-values
  • “Top 20 eQTLs for 5 genes of interest”: interactive
  • “Find all cis-eQTLs”: several minutes
• Population genetics queries (e.g., backend for PLINK)
• Interval arithmetic on large ENCODE data sets
• Duke CHGV
  • ATAV DSL for preparing data for GWAS
  • Week-long queries now take a few hours by parallelizing on Spark

Page 64

Computational biologists are reinventing the wheel

• e.g., CRAM (columnar storage)

• e.g., workflow managers (Galaxy)

• e.g., GATK (scatter-gather)

Page 65

Large-scale data analysis has been solved*

• Cheaper in terms of hardware

• Easier in terms of productivity

• Built-in horizontal scaling

• Built-in fault tolerance

• Layered abstractions for data modeling

• Hadoop!

Page 66

Science on Hadoop

• ADAM project for genomics on Spark
  • http://bdgenomics.org/
• Guacamole for somatic variation on Spark
  • https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark
  • http://thefreemanlab.com/thunder/
• Quince for variant store on Impala
  • currently barebones, but with examples
  • https://github.com/laserson/quince

Page 67

Suggestions/resources

• Everyone should learn Python
  • (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)
  • GitHub enables easy collaboration
  • See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:
  • Show people you know how to code: put your projects on GitHub
  • You should feel lucky if others will start using your code

Page 68

Page 69

Acknowledgements

• Cloudera
  • Sandy Ryza (Spark development)
  • Nong Li (Impala)
  • Skye Wanderman-Milne (Impala)
• Impala genomics collaborators
  • Kiran Mukhyala
  • Slaton Lipscomb
• ADAM project
  • Matt Massie
  • Frank Nothaft
  • Timothy Danford
• Mount Sinai School of Medicine
  • Jeff Hammerbacher (+ lab)
• Duke CHGV
  • Jonathan Keebler

Page 70

Thank you.

