Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
CISC 879 : Software Support for Multicore Architectures
Presentation by: Keyur MalaviyaDept of Computer & Information Sciences
University of Delaware
Exploring the Viability of the Cell BroadbandEngine for Bioinformatics ApplicationsAuthors: Vipin Sachdeva, Michael Kistler, Evan Speight
and Tzy-Hwa Kathy Tzeng
CISC 879 : Software Support for Multicore Architectures
Overview• Problem:
Genomic data/computational requirements growingGeneral-purpose processors cannot handle this
• Approach:Parallelize & port existing applications to multicoreUse multicore architecture like Cell (PS3)Steps: Pick applications, Perform profiling
Make necessary code changes for porting Publish results
• Goal:Validate performance gain on PS3
CISC 879 : Software Support for Multicore Architectures
Bioinformatics Quick Intro• Terminology:
Gemones ≈ DNA strand ≈ Proteins• A typical problem:
Machine learning, Prediction, Data miningSupport vector based, HMM, Clustered
• Few examples:Determine biological functions of proteinsUnderstand biochemical pathwaysAssemble strings to make Genome
• Approach:Compare sequence data with known genomes
CISC 879 : Software Support for Multicore Architectures
Experimental Setup
Two applications selected
1) Sequence alignment : FASTA (ssearch34)Smith-Waterman algorithm [ O(nm) ]Pairwise alignment of gene sequencesUses dynamic programming algorithm
2) Homology detection: ClustalW(clustalw)
Multiple sequence alignment
CISC 879 : Software Support for Multicore Architectures
Sequence alignment basics• Pairwise alignment: Most commonly
performed tasks in bioinformatics• To align two sequences:i. Alignment score Matrix (e.g: Blossom, PAM)ii. Comparing them, assign scoresiii. Insert gaps in one or both sequencesiv. Traceback Dynamic programming table to
Produce an optimal score
CISC 879 : Software Support for Multicore Architectures
Sequence alignment basics• To align two sequences:• Alignment score Matrix (e.g: Blossom, PAM)• Comparing them, assign scores• Insert gaps in one or both sequences• Traceback & produce an optimal score
Scoring table
Aligned Sequences:
Two sequences:
CISC 879 : Software Support for Multicore Architectures
ClustalW basics• In three steps:
i. All sequences are compared pairwise (Smith-Waterman algorithm)
ii. Create a hierarchy for alignment (guide tree) bycluster analysis (distance matrix) for each pair
iii. Progressive add one sequence according to theguide tree to get multiple sequence alignment
CISC 879 : Software Support for Multicore Architectures
Guided tree orPhylogenetic tree
ClustalW: Multiple seqalignment
…..
CISC 879 : Software Support for Multicore Architectures
Applications’ characteristics
• Embarrassingly parallel computation
• Small critical time-consuming code size
• Regular memory accesses
• Vectorized code
CISC 879 : Software Support for Multicore Architectures
Perform profiling (gprof)o More than half the
execution time istaken by a singlefunction:
o FASTA: dropgswo ClustalW: forward
passo HMMER: P7Viterbi
Figure: Execution profile from gprof forthree applications & the BLASTapp.
CISC 879 : Software Support for Multicore Architectures
Porting to Cell Architecture• Altivec: is a PowerPC from AIM alliance
• Altivec SSE implementations:o dropgsw for FASTA and P7Viterbi for
HMMER
• IBM Life Sciences:o Vectorized kernel for forward_pass of
ClustalW
• Ported with few modifications to CELL
CISC 879 : Software Support for Multicore Architectures
Porting to Cell Architectureo Advantage on CELL:
Used 9 cores: PPU also as a processing element
• Compilation of programs:For CELL: XLC v8.1 with -O3PowerPC G5: -O3 -mcpu=G5 -mtune=G5Opteron & Woodcrest: -O3
CISC 879 : Software Support for Multicore Architectures
FASTA (Smith-waterman)on CELL
• FASTA package includes Altivec-enabled Smith-Waterman
• Smith_waterman_altivec_word kernal was portedwith simple changes to CELL
• Altivec APIs: vec_max, vec_subs were written forSPUs
• Pairwise alignment of 8 pairs of sequences, usingone SPU for each pairwise alignment
• Limitation in current implementation:Size of both sequences <= SPU local store (256 KB)Sequence size <= 2048 characters
CISC 879 : Software Support for Multicore Architectures
Long sequencecomparisons:
• To do genome-wide or long sequencecomparisons:
o Implement pipelined approach among SPUso Each SPU performs Smith-Waterman alignment for a
block, notifies the next SPU through a mailboxmessage
o Later SPU uses boundary results of previous SPU forits own block computation
• Future research: Support of bigger sequences on the Cell
CISC 879 : Software Support for Multicore Architectures
Performance of Smith-Waterman
Alignmentexecution timefor differentprocessors:
1: Sequencelength = 10242: Sequencelength = 2048
CISC 879 : Software Support for Multicore Architectures
Porting ClustalW on CELL• pairalign function: All-to-all pairwise
comparisons for ‘n’ sequences performsn(n−1)/2 alignments
• Takes 60%-80% of the execution time
• pairalign function is made of 4 functions
• 1) Forward_pass computes the maximumscore and is the most time-consuming step ofpairalign
CISC 879 : Software Support for Multicore Architectures
Application’s Architecture• Input to ClustalW:o ‘n’ sequences in a query sequence fileo n(n−1)/2 computations
• Mfc DMA in/out from 16-byte boundaries:Pack all sequences in a single arrayEach sequence begin at a multiple of 16-bytes
CISC 879 : Software Support for Multicore Architectures
Application’s Architecture• PPU creates threads, passes max sequence size
• SPUs wait for PPU to send a message to pull in thecontext data & begin computation
• Work distribution Round-robin strategy: o Each SPU is assigned a number from 0 to 7o if ‘i mod 8=k’ SPU ‘k’ compares seq no. ‘i’ against all
sequences ‘i+1’ to ‘n’[ i=9 => k=0 ] [i=15 => k=7] [ so on…]
CISC 879 : Software Support for Multicore Architectures
Issues in porting ClustalWand bottlenecks
• IBM Life Sciences vectorized version of forward_pass
• Altivec APIs: vec_max, vec_adds written for SPUs
• SPUs don’t support 16-bit: Altivec used vectorstatus & control register to detect overflow on it
• SPU use of 32-bit (int) lowers vector computingefficiency (only 4 values can be packed in a vector)
• SPU has only vector registers: Reading Alignmentmatrix score – a scalar operation suffers on SPU
CISC 879 : Software Support for Multicore Architectures
Issues in porting ClustalWand bottlenecks
• SPUs only have static branch prediction:o Fails on a branch with multiple conditions oro More loop variables handling boundary caseso Such branches are difficult to predict for the SPU
• Solution:o Make branch depend on a single loop variableo Break inner alignment loop into several loopso Explicit handling of boundary cases
• 2X performance gains
CISC 879 : Software Support for Multicore Architectures
Code changes behavior
Improvement of performance of ClustalW alignment functionwith different code changes.Best implementation: Using integer datatypes with no branches
CISC 879 : Software Support for Multicore Architectures
Performance offorward_pass (ClustalW)
Comparison of Cell Performance with other processors for onlyalignment function with simple round-robin strategy
Two inputs from BioPerfsuite:1) 1290.seq has 66sequences of averagelength 10822) 6000.seq has 318sequences of averagelength 1043
NOTE: Opteron and theWoodcrest performanceis non-vectorized
CISC 879 : Software Support for Multicore Architectures
Performance of ClustalWComparison of Cell Performance with other processors for totaltime of execution
Two inputs from BioPerfsuite:1) 1290.seq has 66sequences of averagelength 10822) 6000.seq has 318sequences of averagelength 1043
NOTE: Opteron and theWoodcrest performanceis non-vectorized
CISC 879 : Software Support for Multicore Architectures
Computing the finalalignment
• Performance gain in Forward_pass• Final step executing on PPU is much slower
compared to other superscalar processors
• Two approaches to this problem:Execute more code on the SPUs, orUse Cell as an accelerator, along with a superscalarprocessor
• Future work:Increasing PPU performance (RoadRunner project isexploring such hybrid architectures)
CISC 879 : Software Support for Multicore Architectures
PPU penalty:Cell performance is marginally better in overall execution time dueto performance of the PPU
Forward_pass: Cell (8 SPUs) 7.5 times to Power G5Overall: Cell (8 SPUs) 1.2 times to Power G5
Forward_pass
CISC 879 : Software Support for Multicore Architectures
Current progress in porting• Many bioinformatics applications are being ported
to the Cell processor
• 2500 playstations used to parallelize gene-findingand sequence alignment softwares
• Protein folding on distributed computing and GPUs
• Charm++ runtime system, used for NAMDsimulations
• FASTA, ClustalW, and HMMER
• …
CISC 879 : Software Support for Multicore Architectures
Conclusions &Future Work
• Cell’s total power consumption < half of asuperscalar processor
• Cell is price & power-efficient platform for futurebioinformatics computing
• Future Work:o Removing limitations & increasing optimizationo Porting HMMER to Cell processoro Handling size that exceeds 256 KBo Use of partitioning input among 8 SPUso Porting protein docking, RNA interference, medical
imaging and few more applications
CISC 879 : Software Support for Multicore Architectures
QUESTIONS