The ENCODE DCC Eurie L. Hong, Ph.D. • Project Manager, ENCODE DCC PI: J. Michael Cherry, Ph.D. Department of Genetics • Stanford University School of Medicine https://www.encodeproject.org/

Implementation of GPU-based bioinformatic tools at the ENCODE DCC


DESCRIPTION

An overview of the assays performed and distributed by the ENCODE DCC, and a summary of the uniform processing pipelines being implemented by the ENCODE Consortium. We also discuss how using GPUs affects the running time of the ChIP-seq pipeline.


Page 1: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

The ENCODE DCC

Eurie L. Hong, Ph.D. • Project Manager, ENCODE DCC

PI: J. Michael Cherry, Ph.D.

Department of Genetics • Stanford University School of Medicine

https://www.encodeproject.org/

Page 2: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

What is the ENCODE Consortium?

Image credit: NHGRI

Page 3: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Role of the Data Coordination Center

Role: Data generation
• Perform assays
• Perform analyses
• Validate data
• Submit data files
• Submit metadata

Role: Data organization
• Define submission process
• Data processing & validation
• Data file storage
• Metadata curation

Role: Data access
• Web-based searches
• Data downloads

[Diagram labels: Production labs, Analysis groups, Data files, Metadata, DCC, ENCODE portal (DCC), Genome Browser, Integrative websites, Scientific community]

Page 4: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

DCC goals for implementation

Transparency of methods
• How was the experiment performed?
• What software was used to analyze the data?

Reproducibility of results
• What files were used?
• What software and parameters were used for the pipelines?

Interoperability with other genomic projects
• Can the pipeline software we use be used by other projects?
• Can the metadata allow easy integration with other data?

Page 5: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Data volume: diversity of assays

Modified from PLoS Biol 9: e1001046, 2011 (M. Pazin)

Approximately 30 different assays

Page 6: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Data volume: number of assays

(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)

Page 7: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Transparency & reproducibility: Capture the experimental design

Experiment
  Biological replicate 1
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
  Biological replicate 2
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)

Control experiment
  Biological replicate 1
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)

Processed files (bam) → Software & pipelines → Processed file (peak calls)
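The experimental design on this slide can be read as a small tree of records in which every processed file points back to the files and software that produced it. The following is a minimal sketch of that idea, not the ENCODE portal's actual schema: all class names, field names, file names, and software labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str                    # hypothetical file name
    file_format: str             # e.g. "fastq", "bam", "peak calls"
    derived_from: List["DataFile"] = field(default_factory=list)
    software: str = ""           # software or pipeline that produced the file


@dataclass
class TechnicalReplicate:
    number: int
    files: List[DataFile] = field(default_factory=list)


@dataclass
class BiologicalReplicate:
    number: int
    technical_replicates: List[TechnicalReplicate] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    biological_replicates: List[BiologicalReplicate] = field(default_factory=list)
    controls: List["Experiment"] = field(default_factory=list)


def raw_sources(f: DataFile) -> List[str]:
    """Walk back through derived_from links to the raw files a file came from."""
    if not f.derived_from:
        return [f.name]
    names: List[str] = []
    for parent in f.derived_from:
        names.extend(raw_sources(parent))
    return names


# Hypothetical example: one replicate and one control feeding a peak call.
rep_fastq = DataFile("rep1.fastq", "fastq")
rep_bam = DataFile("rep1.bam", "bam", [rep_fastq], "mapping step (hypothetical)")
ctl_fastq = DataFile("control.fastq", "fastq")
ctl_bam = DataFile("control.bam", "bam", [ctl_fastq], "mapping step (hypothetical)")
peaks = DataFile("peaks.bed", "peak calls", [rep_bam, ctl_bam],
                 "peak caller (hypothetical)")

print(raw_sources(peaks))
```

Recording `derived_from` and `software` on each file is what lets the questions "What files were used?" and "What software was used?" be answered for any processed file.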

Page 8: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Data interoperability: uniform processing pipelines

(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)

Page 9: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Processing of TF ChIP-seq assays

Input
• FASTQ (SE/PE): Replicates, Controls

Steps
• Map Reads, Filter, Pool
• Subsample Pseudoreplicates
• Call Peaks
• IDR
• Signal Tracks

Outputs
• BAM: Replicates, Pooled Reps, Controls
• BAM: 2 Pseudoreplicates per replicate, 2 Pseudoreplicates per pool
• peak: Replicates, Pseudoreplicates, Pools
• peak: IDR-thresholded Peak Calls
• bigWig: Replicates, Pooled Replicates

Specification document (Anshul Kundaje): https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit?usp=sharing
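The pooling and pseudoreplicate steps can be illustrated with a short sketch. It assumes that a pseudoreplicate pair is formed by randomly splitting a set of reads into two roughly equal halves, which is what the "2 Pseudoreplicates per replicate / per pool" outputs suggest; the specification document remains the authoritative description. Reads are represented here as plain strings, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def pool(*replicates: Sequence[str]) -> List[str]:
    """Concatenate the reads of several replicates into one pooled set."""
    pooled: List[str] = []
    for rep in replicates:
        pooled.extend(rep)
    return pooled


def make_pseudoreplicates(reads: Sequence[str],
                          seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the reads and split them into two roughly equal halves."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical read identifiers for two replicates.
rep1 = [f"rep1_read{i}" for i in range(6)]
rep2 = [f"rep2_read{i}" for i in range(4)]

rep1_pr1, rep1_pr2 = make_pseudoreplicates(rep1)          # per replicate
pooled = pool(rep1, rep2)
pool_pr1, pool_pr2 = make_pseudoreplicates(pooled)        # per pool
```

Fixing the seed keeps the split reproducible, which matters if the same pseudoreplicates must be regenerated later.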

Page 10: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Relative CPU time for ChIP-seq (original)

[Chart: relative CPU time per step for a typical transcription factor ChIP-seq experiment. Steps: Map, Signal Tracks, Subsample, Call Peaks, IDR. Callouts: IDR, Peak Calling.]

IDR can take much longer if there are many regions, as in a typical histone ChIP.

Nikhil Podduturi

Page 11: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Data volume: TF ChIP-seq

(includes mouse & human)

Page 12: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Performance Comparison: IDR analysis CPU (re-engineered) vs GPU

[Bar chart: Clock Time (Seconds), Log10 scale, axis from 1 to 10000. CPU: 60 min. NVIDIA GPU: 30 sec.]

~120x Speed Increase

Nikhil Podduturi
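The ~120x figure follows directly from the two clock times reported on this slide. A short check of the arithmetic, using only those two numbers:

```python
cpu_seconds = 60 * 60   # CPU (re-engineered): 60 min
gpu_seconds = 30        # NVIDIA GPU: 30 sec

speedup = cpu_seconds / gpu_seconds
print(f"{speedup:.0f}x")  # prints "120x"
```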

Page 13: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Impact on use for data processing

Re-engineered
• improved stability
• tests!
• ability to run on CPU or GPU

Faster processing
• recalculation of entire data corpus against new genome build
• allow determination of data-based thresholds and cut-offs

Public availability
• Can be run on GPU instances available at AWS
• GPU implementation of IDR: https://github.com/ENCODE-DCC/idr-GPU
• TF ChIP-seq: https://github.com/ENCODE-DCC/tf_chipseq
• Others available: https://github.com/ENCODE-DCC

Page 14: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

Next Steps

Data validation
• GPU vs CPU results

Pipeline release
• Integration into ChIP-seq pipeline
• Deployment via AWS instances and at DNAnexus

Adapt additional software components
• SPP: https://github.com/nikhilRP/spp-GPU
• Hotspots: https://github.com/nikhilRP/hotspot-GPU
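One way to approach the "GPU vs CPU results" validation step is to compare the two sets of numeric outputs within a floating-point tolerance, since GPU and CPU arithmetic rarely agree bit for bit. This is only a sketch of that general idea, not the DCC's validation procedure: the function name, the flat list-of-floats input format, and the tolerance values are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def find_mismatches(cpu: Sequence[float], gpu: Sequence[float],
                    rel_tol: float = 1e-6,
                    abs_tol: float = 1e-9) -> List[Tuple[int, float, float]]:
    """Return (index, cpu_value, gpu_value) for values that differ beyond tolerance."""
    if len(cpu) != len(gpu):
        raise ValueError("CPU and GPU outputs have different lengths")
    return [
        (i, c, g)
        for i, (c, g) in enumerate(zip(cpu, gpu))
        if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical per-peak scores from the two implementations.
cpu_scores = [0.10, 0.20, 0.50]
gpu_scores = [0.10, 0.2000000001, 0.60]

for index, c, g in find_mismatches(cpu_scores, gpu_scores):
    print(f"peak {index}: CPU={c} GPU={g}")
```

The tolerances would need to be chosen to suit the statistic being compared; the defaults here are placeholders.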

Page 15: Implementation of GPU-based bioinformatic tools at the ENCODE DCC


ENCODE DCC

Nikhil Podduturi, Laurence Rowe, Forrest Tanaka

Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, Seth Strattan

Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz

Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho

@encodedcc [email protected]

Data Wranglers

Software engineers

QA, sysadmins, admin, biocurator assistant

https://github.com/ENCODE-DCC/

The ENCODE DCC is funded by NHGRI Grant U41HG006992

Page 16: Implementation of GPU-based bioinformatic tools at the ENCODE DCC

ENCODE Uniform Processing Pipeline Work

DNAnexus (PaaS): Brett Hannigan, Andrey Kislyuk, Mike Lin, Singer Ma, Ohad Rodeh

NVIDIA Corporation: NVIDIA Academic Hardware donation program, donation of two Kepler K40 GPUs; NVIDIA’s NVBIO framework

Ben Hitz, Seth Strattan, Nikhil Podduturi

ChIP-seq against transcription factors: Anshul Kundaje

ChIP-seq against histone marks: Anshul Kundaje

RNA-seq: ENCODE RNA working group

Whole genome bisulfite sequencing: Junko Tsuji, Zhiping Weng

DNase-seq: Alvin Qin, Shirley Liu