Cluster-based SNP Calling on Large Scale Genome Sequencing Data
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2014, Chicago, IL
What is SNP?
• Stands for Single-Nucleotide Polymorphism
• A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species
• Essential for medical research and for developing personalized medicine
• A single SNP may cause a Mendelian disease
*Adapted from Wikipedia
Motivation
• Sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
• Big data problem
  – The 1000 Human Genome Project has already produced 200 TB of data
  – Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
General Idea of SNP Calling Algorithms
[Figure: reads from two alignment files (Alignment File-1 and Alignment File-2) aligned against the reference A G C G T A C C at locations 1-8; candidate SNP locations are marked ✓ or ✖.]

Two main observations:
• In order to detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.
• The existence of an SNP at one location is independent of other locations.
Parallel SNP Calling
How do we distribute data among nodes?

[Figure: three ways to divide the genome files among four processors: location-based, sample-based, and checkerboard. Divisions that split samples across processors require communication among processes.]
Challenges
• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C, and T characters
• I/O contention
• High overhead of random access to a particular region

[Figure: read coverage varies across locations (e.g., 1x, 3x, 4x, 8x), illustrating coverage variance.]
Histogram Showing Coverage Variance
• Chromosome: 1
• Locations: 1-200M
• Number of samples: 256
• Interval size: 1M
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
Proposed Scheduling Schemes
• Dynamic Scheduling
• Static Scheduling
• Combined Scheduling

Each scheduling scheme uses location-based data division: the genome is divided into regions, and each task is responsible for one region.
Dynamic Scheduling
• Master & worker approach
• Tasks are assigned dynamically
• Two types of data chunks are used
  – Big chunk: covers B locations
  – Small chunk: covers S locations
  – B > S
• Big chunks are assigned first, then small chunks are assigned

[Figure: two alignment files (Alignment File-1, Alignment File-2) divided into big chunks of size B followed by small chunks.]
Static Scheduling
• Pre-processing step
  – We count the number of alignments for each region and generate a histogram
• Estimated cost
  – We use an estimation function and our histogram for data partitioning:

    Cost(k) = TR + TP · Σ_{l ∈ k} N(l)

  – k: histogram interval
  – TR: cost of accessing/reading the region
  – TP: cost of processing an alignment
  – N(l): number of alignments in location l
  – Each task is responsible for regions having the same estimated cost
• Tasks are scheduled statically; no master-worker approach

[Figure: two alignment files (Alignment File-1, Alignment File-2) partitioned into regions of equal estimated cost.]
Combined Scheduling
• Combination of static and dynamic scheduling
• We use small and big chunks as in dynamic scheduling
• The sizes of the chunks are determined according to the histogram
• Master-worker approach

[Figure: two alignment files (Alignment File-1, Alignment File-2) divided into histogram-derived big and small chunks.]
Parameters of Scheduling Schemes
• Our proposed scheduling schemes have user-defined parameters
  – Dynamic scheduling
    • Length of big and small chunks
  – Static scheduling
    • Histogram interval size
    • Estimation function parameters
  – Combined scheduling
    • All parameters for dynamic and static scheduling
• All parameters can be determined with an offline training phase
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
Experiments
• Local cluster; each node has 2 quad-core 2.53 GHz Xeon(R) processors and 12 GB RAM
• We obtained genomes of 256 samples from the 1000 Human Genome Project
• The data is replicated to all local disks unless noted otherwise
• Parallel implementation:
  – We implemented VarScan in the C programming language
    • We also modified VarScan so that BAM files can be read directly
  – Used the MPI library for parallelization
Experiments: Scalability
Scheduling Scheme    Scalability
Basic                8.4x
Dynamic              10.9x
Static               19.7x
Combined             23.5x

(First 192M locations of Chr. 1)
Experiments: Data Size Impact
128 cores are allocated
Experiments: I/O Contention Impact
128 cores are allocated

Scheduling Scheme    I/O Contention Impact (sec)
Basic                174
Dynamic              229
Static               251
Combined             220
Comparison with Hadoop
- First 192M locations of Chr. 2 in 512 samples are analyzed
- Lower (dark) portions of the bars show pre-processing time
Scheduling With Replication
• Data-intensive processing motivates new schemes
• Replicate each chunk a fixed/variable number of times
• Dynamic scheduling while processing only local chunks
• Interesting new tradeoffs
• Under submission
Other Work
• PAGE: a MapReduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)
• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language
PAGE vs. State-of-the-Art
• A middleware system
  – Specific to parallel genetic data processing
  – Allows parallelization of a variety of genetic algorithms
  – Able to work with different popular genetic data formats
  – Allows use of existing programs
Conclusion
• We have developed a methodology for parallel identification of variants in large-scale genome sequencing data
• Coverage variance and I/O contention are the two main problems
• We proposed three scheduling schemes
• Combined scheduling gives the best results
• Our approach has good speedup and outperforms Hadoop