Parallel computing
in bioinformatics
Dr Torsten Seemann
Ideal world
● A single computer with
o one really fast processor
o huge amount of really fast memory
● Compromise #1: a single computer with
o lots of processors
o huge memory fast enough for all processors
● Compromise #2: a bunch of computers with
o lots of fast processors on each node
o lots of memory on each node
o really fast, low latency inter-node communication
The real world
● None of these exist :-(
● Computer nodes
o Good: CPU & RAM on the increase
o Bad: CPU is competing for RAM
● Node:Node communication
o Good: getting faster
o Bad: latency gets worse with more nodes
Types of parallelism
● Cluster
o distribute workload across networked computers
● SMP
o symmetric multiprocessing
o use multiple cores on a single computer
o (we’ll ignore NUMA)
● SIMD
o single instruction, multiple data
o same machine code instruction on vector of values
o (we’ll ignore MIMD, GPU)
Clusters
● Can be ad hoc
o bunch of PCs over Ethernet (Beowulf)
● Cluster specific
o high density, fast interconnect (Blade)
● Highly specialised
o high density, low power, very fast interconnect, low latency, many switches (e.g. IBM BlueGene)
Using clusters
Break task into subtasks:
● Independent tasks
○ “pleasantly parallel” is a good situation!
○ Submit these to cluster queue
○ Combine results
● Dependent tasks
○ Need to communicate during run
○ Various ways to do this (more later)
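The split, submit, combine pattern for independent tasks can be sketched in a few lines of shell. This is a toy under stated assumptions: `process_chunk` is a hypothetical stand-in for the real per-subtask program, and each subtask runs as a local background job where a real cluster would use its scheduler's submit command.

```shell
#!/usr/bin/env bash
# Scatter/gather sketch for independent ("pleasantly parallel") subtasks.

# Hypothetical stand-in for the real work: counts the records in one chunk.
process_chunk() {
    wc -l < "$1"
}

workdir=$(mktemp -d)
seq 1 1000 > "$workdir/input.txt"        # toy input: 1000 records

# Scatter: split into chunks of 250 lines (chunk.aa, chunk.ab, ...)
split -l 250 "$workdir/input.txt" "$workdir/chunk."

# Run the subtasks. On a cluster, this is where each chunk would be
# submitted to the queue instead of started as a background job.
for chunk in "$workdir"/chunk.*; do
    process_chunk "$chunk" > "$chunk.out" &
done
wait                                     # block until every subtask is done

# Gather: combine the partial results into one answer
total=$(cat "$workdir"/chunk.*.out | awk '{ s += $1 } END { print s }')
echo "total records: $total"

rm -rf "$workdir"
```

The gather step only works this simply because the subtasks never need to talk to each other; dependent tasks need one of the communication methods covered later.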
SMP
SMP: symmetric multiprocessing
Use multiple cores on one node:
● Simple case
○ run multiple subtasks, one per core
● Multi-threading
○ use tools that support multiple cores
■ BWA, bowtie, samtools 0.18+
○ use languages that support native threading
■ Java
■ C, C++, Perl, Haskell - with standard libraries
■ Python has issues here
Using SMP
● POSIX threads
o standard “C” Unix interface
o a library of functions
● OpenMP
o standard API supported by C, C++ and Fortran compilers
o functions and #pragmas to help compiler parallelize
● Unix Shell
o use job control and ‘&’ and ‘wait’
o Makefiles, GNU parallel, pipelines (more later)
● Use tools that do this natively for you
SMP communication
● Sometimes threads need to talk
o Just like cluster nodes need to talk
● IPC
o Inter-Process Communication
● Methods
o files, time-stamped “touch files”
o pipes, sockets, message passing
o shared memory
o semaphores
o signals
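One of the methods listed above, a pipe, can be tried directly from the shell. The sketch below uses a named pipe (FIFO) so that two otherwise unrelated processes can exchange data; the file name `channel` and the toy producer and consumer are illustrative assumptions, not part of the original slides.

```shell
#!/usr/bin/env bash
# IPC sketch: two processes talk through a named pipe (FIFO).

dir=$(mktemp -d)
mkfifo "$dir/channel"

# Producer: runs in the background and writes the numbers 1..5 into the pipe.
seq 1 5 > "$dir/channel" &

# Consumer: reads from the pipe as the data arrives and adds it up.
sum=$(awk '{ s += $1 } END { print s }' < "$dir/channel")

wait                    # make sure the producer has exited
echo "sum: $sum"
rm -rf "$dir"
```

No data is stored on disk here: the FIFO is only a rendezvous point, and the writer blocks until a reader is attached.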
SIMD
Machine code 101
● CPUs run “machine code” instructions:
    load  R0, [years]    # put var in reg
    mul   R0, 365        # mult by 365
    add   R0, 1          # add 1
    store [days], R0     # put reg in mem
● Each instruction does one atomic operation
○ to change one piece of data
■ memory location (RAM variable - slow)
■ register (CPU variable - fast)
● Example
○ vector dot product: x ∙ y = Σ_{i=1..|x|} x_i × y_i
● Pseudo-code
    var x, y : integer[8]
    var sum : integer
    sum := 0;
    for i in 0..7:
        sum := sum + x[i] * y[i]
● Operations
○ 1 + 8 * 3 = 25 ops
Vector operations
● Vector registers and instructions
○ assume 8-element operations (actually common!)
● SIMD
    load  V0, [x]    # put x[] in vec register
    load  V1, [y]    # same for y[]
    mult  V0, V1     # vector multiply!
    vsum  R7, V0     # vec sum into scalar reg
● Operations
○ 1 + 1 + 1 + 1 = 4 ops
SIMD Instruction Sets
● Specialised since 1970s
○ MASPAR
○ Connection Machine
○ Cray super-scalar
○ DEC Alpha MVI
● Consumer grade
○ Intel MMX (integer) / AMD 3DNow! (floating point) [x86]
○ Intel SSE, SSE2, SSE3, SSE4.x (floating point) [x86]
○ IBM Altivec (both) [BlueGene,POWER]
● GPUs also, but they do MIMD too.
Using SIMD
● Not accessible from scripting languages
o they are too many layers away from machine code
● Some libraries exploit it
o NumPy (uses some SSE in CoreFunc)
o GSL - GNU Scientific Library
o BLAS - Linear algebra
● Find the tools that use it
o HMMER (profile:sequence alignment)
o FASTA 35+, SWIFT (full local/global/semi alignment)
o BWA, Bowtie (short read alignment)
Automatic SIMD vectorization
● Some compilers can recognise patterns that can be converted into SIMD instructions
○ Simple loops
○ Array operations
○ Data copying
● Re-compile your C/C++ code
○ GCC (GNU C Compiler)
■ gcc -march=native -O3
○ ICC (Intel C Compiler)
■ vectorization is automatic
Using SMP
Spawn multiple jobs
# run 23 alignments, 1 core per chromosome
for CHR in $(seq 1 1 23); do
    bwa mem $CHR.fasta reads.fq.gz \
        1> $CHR.sam 2> $CHR.err &
done
# wait until all background jobs finish
wait
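The loop above starts all 23 jobs at once, which oversubscribes a machine with fewer than 23 free cores. A minimal sketch of one way to cap the number of simultaneous jobs is `xargs -P`. Here `align_one` is a hypothetical stand-in for the `bwa mem` command so the example runs anywhere, and the limit of 4 is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Run 23 jobs, but never more than 4 at a time.

outdir=$(mktemp -d)
export outdir

# Hypothetical stand-in for "bwa mem $CHR.fasta reads.fq.gz > $CHR.sam".
align_one() {
    echo "aligned chromosome $1" > "$outdir/$1.sam"
}
export -f align_one      # make the function visible to the child bash

# -P 4 : keep at most 4 jobs running; -I{} : substitute one input line per job
seq 1 23 | xargs -P 4 -I{} bash -c 'align_one "$1"' _ {}

n_done=$(ls "$outdir" | wc -l | tr -d ' ')
echo "finished jobs: $n_done"
rm -rf "$outdir"
```

`xargs` returns only once every job has finished, so no separate `wait` is needed.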
Use a Makefile
% ls
1.fasta 2.fasta 3.fasta
% vi Makefile
all: 1.sam 2.sam 3.sam
%.sam: %.fasta reads.fq.gz
	bwa mem $< reads.fq.gz > $@
% make -j 8 # use 8 cores
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
GNU Parallel
% ls
1.fasta 2.fasta 3.fasta
% parallel -j 8 \
"bwa mem {} reads.fq.gz > {.}.sam" \
::: *.fasta
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
{} replaced by each *.fasta in turn
{.} is {} but with file extension removed
Underused multi-threaded tools
● pigz
○ parallel gzip
○ if you have fast disks, scales to 64 cores easily
○ compression scales better than decompression
○ command line option: --processes=16 or -p 16
● pbzip2
○ parallel bzip2
● sort
○ yes, good ol’ Unix sort!
○ command line option: --parallel=16
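As a quick check of the `sort` option, the sketch below sorts a toy input with up to 4 threads. It assumes GNU sort (the `--parallel` option is a GNU coreutils feature); the thread count and the 64M buffer size are arbitrary illustrative values.

```shell
#!/usr/bin/env bash
# Sort 100000 numbers that arrive in reverse order, using up to 4 threads.
# --parallel caps the number of sort threads; -S sets the memory buffer size.

sorted=$(mktemp)
seq 100000 -1 1 | sort -n --parallel=4 -S 64M > "$sorted"

smallest=$(head -n 1 "$sorted")
largest=$(tail -n 1 "$sorted")
echo "smallest: $smallest, largest: $largest"
rm -f "$sorted"
```

On an input this small the extra threads make little difference; the option pays off on multi-gigabyte files.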
Dedicated pipeline system
● Ruffus / Rubra
● BPIPE
● Nesoni
.... and so many more
.......... and so many more still coming!
Implicit Unix SMP
Pipes
● When you pipe two commands together
○ two separate processes are started: A and B
○ a “pipe” connects A:stdout to B:stdin (A | B)
● Example
○ frequency distribution of initial 4-mers in English
cat /usr/dict/words |   # already sorted
  cut -c 1-4 |          # first 4 characters
  tr 'A-Z' 'a-z' |      # canonicalize to lc
  uniq -c |             # count adjacent dupes
  sort -n -r |          # most freq first
  head -n 10            # top 10
Pipes (result)
428 over
410 inte
300 comp
272 unde
262 cons
261 tran
248 cont
211 disc
197 comm
171 fore
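One caveat with this pipeline: `uniq -c` only counts duplicates on adjacent lines, so it relies on the word list still being grouped after lowercasing. Whether that holds depends on how the dictionary file is ordered, which is an assumption. The sketch below uses a made-up five-word list to show the difference an explicit `sort` makes.

```shell
#!/usr/bin/env bash
# Toy word list (made up for illustration), deliberately not sorted.
words=(Overdue interest overall Interim overcast)

# First 4 characters of each word, lower-cased.
prefixes() {
    printf '%s\n' "${words[@]}" | cut -c 1-4 | tr 'A-Z' 'a-z'
}

# Without sort: equal prefixes are not adjacent, so every count is 1.
no_sort=$(prefixes | uniq -c | sort -n -r | head -n 1 | awk '{ print $1 }')

# With sort: equal prefixes are grouped first, so the counts are right.
with_sort=$(prefixes | sort | uniq -c | sort -n -r | head -n 1 | awk '{ print $1, $2 }')

echo "top count without sort: $no_sort"
echo "top entry with sort:    $with_sort"
```

Adding `sort` before `uniq -c` costs one more process in the pipe, and it runs concurrently with the others.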
Sub-shells
● Use case:
○ software alignerX only accepts .fastq files
○ you have compressed .fastq.gz files
○ your disk is slow and has no space left
● Sub-shells to the rescue!
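# what alignerX expects: uncompressed files on disk
alignerX ref.fa R1.fq R2.fq
# what we run instead: decompress on the fly, nothing extra on disk
alignerX ref.fa <(zcat R1.fq.gz) \
                <(zcat R2.fq.gz)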
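Bash calls the `<( ... )` construct process substitution: the shell runs the command and hands the program a file name connected to its output. The sketch below shows the mechanism on a toy file; `count_lines` is a hypothetical stand-in for alignerX, chosen only because it insists on a file name argument.

```shell
#!/usr/bin/env bash
# Process substitution sketch: the program receives a file name
# (something like /dev/fd/63) whose contents are the command's output.

dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$dir/R1.fq.gz"

# Hypothetical stand-in for alignerX: it only accepts a file name argument.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# The compressed file is decompressed on the fly; no .fq file is written.
n_lines=$(count_lines <(gzip -dc "$dir/R1.fq.gz"))
echo "lines read: $n_lines"
rm -rf "$dir"
```

This only works for programs that read their input once from start to end; a tool that seeks within the file or opens it twice will fail on such a stream.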
Sub shells + Pipes
● Use case:
○ software alignerX only accepts .fasta files
○ you have compressed .fastq.gz files
● Sub-shells can be pipes too!
alignerX ref.fa \
  <(zcat R1.fq.gz | paste - - - - | cut -f 1,2 \
    | sed 's/^@/>/' | tr "\t" "\n") \
  <(zcat R2.fq.gz | paste - - - - | cut -f 1,2 \
    | sed 's/^@/>/' | tr "\t" "\n")
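The FASTQ to FASTA conversion inside those sub-shells can be tested on its own. The sketch below wraps the same four commands in a function (the name `fastq_to_fasta` is just for illustration) and runs it on two made-up reads. It assumes strict 4-line FASTQ records, that is, no wrapped sequence lines.

```shell
#!/usr/bin/env bash
# FASTQ to FASTA with standard Unix tools, on two toy reads.

fastq_to_fasta() {
    paste - - - - |        # join each 4-line record into one tab-separated line
    cut -f 1,2 |           # keep header and sequence, drop "+" and qualities
    sed 's/^@/>/' |        # FASTQ header marker to FASTA header marker
    tr '\t' '\n'           # back to one field per line
}

fasta=$(printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@@@@\n' | fastq_to_fasta)
echo "$fasta"
```

The second read has a quality string made of `@` characters. It is left alone because `paste` has already moved it into the fourth column, so `sed` only ever sees `@` at the start of a header.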
Nested sub shells
HC SVNT DRACONES
(here be dragons)
Putting it all together
Making BAMs
● Align FASTQ to reference
o bwa mem ref R1.fq.gz R2.fq.gz > SAM
● Convert to BAM
o samtools view SAM > BAM
● Sort BAM
o samtools sort BAM > SORTBAM
● Remove dupes
o samtools rmdup SORTBAM > DEDUPBAM
Making BAMs
Look mum! No intermediate files! Fewer idle CPUs!
% bwa mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
    | samtools view -@ 16 -S -b -u -T ref.fa - \
    | samtools sort -@ 16 -m 1G -o - \
    | samtools rmdup - out.bam
-t 16    16 threads for bwa
-@ 16    16 threads for samtools 0.18+
-m 1G    1 GB RAM per thread for RAM sorting
-u       pipe an uncompressed BAM
-o       use stdout instead of writing to a file
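A practical caveat with long pipes like this one: by default the shell reports only the exit status of the last command, so a crash in an early stage can go unnoticed. The sketch below demonstrates bash's `pipefail` option with stand-in commands. `false` plays a failing aligner and `cat` plays the downstream tools; neither is part of the original pipeline.

```shell
#!/usr/bin/env bash
# "false" stands in for an early stage that crashes;
# "cat" stands in for the later stages, which accept empty input without complaint.

# Default behaviour: the pipeline's status is that of the last command.
if false | cat; then status_default=0; else status_default=$?; fi

# With pipefail: the pipeline fails if any stage fails.
set -o pipefail
if false | cat; then status_pipefail=0; else status_pipefail=$?; fi
set +o pipefail

echo "default: $status_default, pipefail: $status_pipefail"
```

Setting `pipefail` at the top of a pipeline script is a cheap way to avoid silently producing a truncated or empty BAM.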
Conclusions
● The “cluster” level
○ we are pretty good at that now
● The “SIMD” level
○ too low level, depend on others to exploit
○ thankfully many of our key tools already use it
● The “SMP” level
○ our pipelines still have single-threaded bottlenecks
○ always check if your tool has --threads option
○ exploit pipes and sub-shells wherever possible
○ and use GNU Parallel - it’s awesome (and Perl)