
Direct approaches to exploit many-core architecture in bioinformatics

Future Generation Computer Systems 29 (2013) 15–26

Francisco J. Esteban a,1, David Díaz b,2, Pilar Hernández c,3, Juan A. Caballero d,4, Gabriel Dorado e,5,6, Sergio Gálvez b,∗,6

a Servicio de Informática, Edificio Ramón y Cajal, Campus Rabanales, Universidad de Córdoba, 14071 Córdoba, Spain
b Dep. Lenguajes y Ciencias de la Computación, ETSI Informática, Campus de Teatinos, Universidad de Málaga, Bulevar Louis Pasteur 35, 29071 Málaga, Spain
c Instituto de Agricultura Sostenible (IAS-CSIC), Alameda del Obispo s/n, 14080 Córdoba, Spain
d Dep. Estadística, Campus Rabanales C2-20N, Universidad de Córdoba, 14071 Córdoba, Spain
e Dep. Bioquímica y Biología Molecular, Campus Rabanales C6-1-E17, Campus de Excelencia Internacional Agroalimentario, Universidad de Córdoba, 14071 Córdoba, Spain

Article info

Article history:
Received 23 January 2012
Received in revised form 15 March 2012
Accepted 22 March 2012
Available online 5 April 2012

Keywords:
System-on-chip
Multiprocessor system
Code migration
Domain-specific architecture
Programming methodology
Sequence alignment and assembly

Abstract

Current trends in computer programming look for solutions in the challenging task of porting and optimizing existing algorithms to many-core architectures with tens of Central Processing Units (CPUs). Yet, the lack of standardized general-purpose parallel programming and porting methodologies represents the main bottleneck on these developments. We have focused on bioinformatics applied to genomics in general and the so-called "Next-Generation" Sequencing (NGS) in particular, in order to study the viability and cost of porting and optimizing well-known algorithms to a many-core architecture. Three different approaches have been tackled in order to implement existing algorithms on the Tile64, a microprocessor containing 64 CPUs, each of them capable of executing an independent Linux operating system: (i) implementation of the Needleman–Wunsch/Smith–Waterman pairwise aligner from scratch; (ii) direct translation of the Message Passing Interface (MPI) C++ ABySS assembly algorithm with changes on the communication layer; and (iii) migration of the ClustalW tool, parallelizing only the most time-consuming stage. The performance-gain/development-cost tradeoffs indicate that the Tile64 microprocessor has the potential to increase the performance of bioinformatics in an unprecedented way for a standalone Personal Computer (PC). Yet, the effective exploitation of these parallel implementations requires a detailed understanding of the peculiar many-core characteristics when migrating previous non-parallel source codes.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Nowadays, high-performance processing cannot be understood without new Chip Multiprocessors (CMPs), which are being actively developed. Amongst such chips are the Graphics Processing Units (GPUs) with hundreds of cores [1] and the Sony–IBM–Toshiba Cell Broadband Engine Architecture (CBEA) [2], which allow us to render complex animations and also provide enough computing power to perform other calculus-intensive tasks [3]. This line is exploited by supercomputing blade systems like the IBM BladeCenter server platform [4] and the Nvidia Tesla [5]. Yet, such products require detailed programming methodologies and architecture optimizations, which are much more complex than those of single Central Processing Units (CPUs) [6,7]. A comparison between these different programming approaches and an attempt to automate such processes can be found in [8]. Other companies are working on CMP and multithreaded many-core microprocessors, like the Intel Tera-scale Processor with 80 cores [9], the Single-chip Cloud Computer (SCC) [10] with 48 cores, the "Knights Ferry" [11] with 32 cores, the Xeon E7 with 10 cores and two threads per core [12], the Sun Microsystems UltraSPARC T2 Plus with eight cores and eight threads per core [13], the Adapteva Epiphany IV (64 cores) [14] and the Tilera Tile64 microprocessor, as explained below.

∗ Corresponding author. Tel.: +34 952133312; fax: +34 952131397.
E-mail addresses: [email protected] (F.J. Esteban), [email protected] (D. Díaz), [email protected] (P. Hernández), [email protected] (J.A. Caballero), [email protected] (G. Dorado), [email protected], [email protected] (S. Gálvez).
1 Tel.: +34 957213005; fax: +34 957218116.
2 Tel.: +34 952133312; fax: +34 952131397.
3 Tel.: +34 957499277; fax: +34 957499252.
4 Tel.: +34 957211068; fax: +34 957218116.
5 Tel.: +34 957218689; fax: +34 957218592.
6 Authors who contributed to the project leadership.

0167-739X/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.future.2012.03.018

The TilExpress-20G cards include a many-core Tile64 microprocessor with 64 tiles (cores) at 866 MHz, 8 GB of RAM and two 10 Gb Ethernet ports. Though they were initially designed for networking applications, video encoding [15] and streaming broadcasts (thanks to their high communication bandwidth and scalability), we have already demonstrated the usefulness of such an architecture for bioinformatics [16,17]. We are applying such developments to the quality control, traceability and fraud prevention of olive oil [18], as well as to other genomics approaches of our research group on Agri-Food Biotechnology.

Other works have focused on fine-tuning intensive-processing bioinformatics algorithms to platforms like GPUs [19–21] or the above-mentioned CBEA [22], obtaining better performances than with single-CPU implementations, as a result of their parallelization factor. The main difference between these developments and the present work is that Tile64 actually contains many CPUs (in the sense that each one of them is able to execute a standalone operating system). In contrast, the so-called many-core GPU contains very restricted processing units, unable to execute a whole complex algorithm on their own. Thus, many of the bioinformatics algorithms ported to the GPU architectures show extraordinary speed-ups but, due to the restricted resources, usually only work well with some specific data, sharply decreasing their performance when applied to a wider range of input data. For instance, in the case of sequence alignments, only short sequences (typically, peptides) are allowed, precluding their application for larger projects involving nucleic acid data (e.g., genomics). On the other hand, the CBEA integrates two kinds of microprocessors: one Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs); whereas Tile64 contains a homogeneous matrix of 8 × 8 tiles. This architecture offers the opportunity to evaluate the effort that must be dedicated to obtain a working parallelized algorithm on such a system, and the performance achieved according to the development and code-migration approach used, as explained below. Indeed, this work focuses more on evaluating the possible migration approaches, and less on getting the maximum possible performance (which in any case may require a more time-consuming approach).

In order to further evaluate the suitability of the Tile64 architecture in the field of bioinformatics, we have tested three algorithms commonly used by life-science researchers: (i) pairwise sequence alignment; (ii) multiple sequence alignment; and (iii) de novo genome assembly. Taking into account the Tile64 software development characteristics [16], three different approaches regarding porting efforts have been considered; namely: (i) an implementation from scratch of a dynamic programming algorithm like the Fast Linear-Space Alignment (FastLSA) [23], with a parallel strategy for pairwise alignments; (ii) a half-way solution, where a customized communication layer replaces the Message Passing Interface (MPI) in the Assembly By Short Sequences (ABySS) algorithm [24]; and (iii) a slightly modified implementation of a parallel multiple sequence aligner (ClustalW) [25]. Each proposal is discussed in terms of the trade-offs between the development effort and the achieved performance. This strategy allows us to identify several mandatory principles for efficient many-core developments.

On the other hand, the term many-core is used here as a synonym of many-core CPU, and should not be confused with the many-core General-Purpose GPU (GPGPU). To avoid confusion, the manufacturers of many-core CPUs have coined a new term: tile. This term relates to the geometry that can be visually observed in the die of the chip; it is reminiscent of how a tile contributes to complete a mosaic. The shape of the die provides a preliminary idea of how different many-core CPUs are from GPUs. Table 1 shows the main differences between these technologies from a programming and performance point of view. The last row of the table visually highlights these differences by comparing die shots from these platforms. This may help to understand that the parallelization strategies are indeed drastically different.

Thus, GPGPU cores are specialized in executing small threads rather than huge tasks (such as a whole operating system), and this simplicity allows an easier integration of hundreds of these computing elements in a single die; whereas the many-core CPUs have just tens of tiles nowadays. Therefore, the strategy when porting algorithms to GPUs is to generate as many execution threads as possible, in order to maximize the number of cores working at a given time.

These large differences between many-core CPUs and GPUs make any comparison of the performance and capabilities of such architectures very difficult. Actually, they should be considered as complementary rather than competitors, as noted in Table 1. Tilera's approach is not the only one following this new tile-based computing paradigm. Intel has joined the bandwagon of this technology with the Single-chip Cloud Computer (SCC) [10], a platform with hardware-based message-passing capabilities and a whole operating system running in each tile. Nevertheless, Intel's platform is nowadays in an academic and research stage, Tilera's platform being the only one commercially available. Another example is the Adapteva Epiphany IV chip, which should be available by late 2012.

2. The Tile64 microprocessor

The Tilera microprocessors are available as standalone chips or in System-on-Chip (SoC) many-core Peripheral Component Interconnect express (PCIe) cards. Each of the 64 tiles integrated into a single Tile64 microprocessor die is capable of running a full operating system, as previously indicated, managing several independent threads [26]. Each tile is a 32-bit Reduced Instruction Set Computing (RISC) machine (with no floating point instructions) running at 500–866 MHz, with an exclusive 8 kB L1 cache and 64 kB and 4 MB L2 and L3 caches, respectively (all the L2 caches are shared between all the tiles in the so-called L3 cache). In the best scenario, each tile can execute three 32-bit instructions with an immediate 16-bit operand per clock cycle. Running at 866 MHz, the theoretical maximum performance may be calculated as Eq. (1) shows, where IPS stands for Instructions Per Second, and IPC is the number of 32-bit Instructions Per Cycle allowed by the RISC pipelines:

IPS = 64 µp · 3 IPC · 866 MHz = 0.166 TeraIPS. (1)

Yet, this computing power is not completely available to the programmer, because several tiles must be reserved or shared for internal card-host-card communication tasks (see below). These shared tiles should not be used in parallel algorithms, due to the prohibitive delays that they may introduce in random tiles. So, in practice, the maximum number of effective tiles available to the programmer is 60 (see [17]). Our study has been developed and tested on Dell Precision T5400 Personal Computers (PCs) with an Intel Quad Core Xeon 2.0 GHz microprocessor and 8 GB of DDR2 memory in quad-channel, executing CentOS 5.3. The Tile64 microprocessor is boarded on a TilExpress-20G card with 8 GB of RAM. Such storage can be used as a Solid State Disk (SSD), as well as main memory, with empirically tested limits of 7.8 GB for SSD, 2.8 GB for local memory and 1.9 GB for shared memory [17]. Additionally, the Tilera cards work independently of the host CPU.

The algorithm developments using the Tile64 microprocessors can be carried out by means of a Tilera-modified Eclipse [27] environment named the Multicore Development Environment (MDE), which produces cross-compiled code to be later deployed in the microprocessors. This Integrated Development Environment (IDE) allows us to program in C and C++ and to use proprietary compilers, which follow the same syntax as the well-known gcc and g++ (without support for an extended assembler). Since each microprocessor core runs a Linux operating system, any

Table 1
Differences between many-core CPU, many-core GPU and multicore architectures. Each row contrasts the three platforms (many-core CPU | many-core GPU | regular multicore).
Source: Die shots reprinted by permission of Intel corp., Tilera corp., Nvidia corp. and Oracle corp.

Many-core CPU: Tilera Tile64, Intel SCC... | Many-core GPU: Nvidia GT200, GTX 250... | Regular multicore platform: Intel i7, Sun UltraSPARC T7
Well-defined cores/tiles in the die | Lightweight cores (stream processors) | Well-defined cores in the die; optional elements like shared cache shown
Each tile is designed for a single thread | Originally designed for high-throughput graphics processing: each SP is a shader | Operating System can distribute processes/threads across cores
Designed for general-purpose programming: each tile is a RISC/x86 CPU | Each core is capable of running several threads | Designed for general-purpose programming
Each tile runs independently as a CPU; each one capable of running an entire OS | Code must be divided into "blocks"; each block is assigned to an SM | Each core runs independently as a CPU, optionally holding several independent threads; programs can be designed to use several cores
Any programming language | Parallel programming language: CUDA | Any programming language; many software pieces available
High-performance network-on-chip inter-tile communications / hardware-implemented message-passing capabilities | Communications by shared memory (only 16 kB); global memory is a low-performance alternative | Not designed for inter-CPU communication
Cache memory in each tile + shared cache between all of them | Caches per instruction, texture, L1 per SM, L2 per TPC and L3 globally shared | L1 and L2 cache in each core; L3 shared cache
Medium tile clock rate | Moderate stream processor clock rate | High core clock rate
[Die shots: Tilera Tile64 and Intel SCC | Nvidia GeForce GTX 250 | Sun UltraSPARC T1 and Intel Core i7]

existing source code written in GNU C/C++ has a great chance to run unmodified (on a single tile), just by recompiling it. The Tilera compilers do not generate parallel code automatically, so any scheme to take advantage of the many tiles must be manually developed.

In order to exploit parallelism, Tilera provides an Application Programming Interface (API) called iLib, which contains the primitives (low-level functions and procedures) for distributing execution tasks across the tiles, as well as for communicating such executions throughout the internal network that interconnects them, known as iMesh (which stands for "intelligent mesh"). The iMesh is a low-latency network-on-chip that allows each tile to communicate with any other, independently of their relative positions in the chip matrix. iLib provides proprietary functions to launch processes on a specific tile, to execute the same code on several tiles and to communicate and synchronize the processes being executed on different tiles. The communication scheme can be based on common concepts like channels or message passing, but no standard API is followed; there is no MPI, Open Multi-Processing (OpenMP) or Parallel Virtual Machine (PVM) implemented. Thus, a Tile64 microprocessor can be seen as a grid of resources (the tiles), each one capable of running a single thread or task from a program. Alternatively, it can be modeled as a cluster of independent pico-computers connected to each other through a high-throughput communication mesh. Our developments follow this latter approach.
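
To make the controller/worker model above concrete, the following minimal sketch mimics an iLib-style message-passing exchange between a controller tile and several workers. The msg_send/msg_receive helpers are our own in-process stand-ins, loosely modeled on iLib's message-passing primitives (whose exact names and signatures differ), so the sketch compiles and runs on any host; on the card, each rank would be a separate tile.

```cpp
#include <cstdio>
#include <cstring>
#include <deque>
#include <map>
#include <utility>
#include <vector>

// In-process stub of an iLib-like mailbox: (destination, tag) -> queued payloads.
static std::map<std::pair<int, int>, std::deque<std::vector<char>>> mailbox;

static void msg_send(int dst, int tag, const void* buf, size_t len) {
    const char* p = static_cast<const char*>(buf);
    mailbox[{dst, tag}].emplace_back(p, p + len);
}
static void msg_receive(int dst, int tag, void* buf, size_t len) {
    auto& q = mailbox[{dst, tag}];
    std::memcpy(buf, q.front().data(), len);
    q.pop_front();
}

struct Job { int row, col; };  // a k x k submatrix to compute

int main() {
    const int workers = 4;     // ranks 1..4 are workers; rank 0 is the controller
    // Controller: enqueue one job per worker (message passing).
    for (int w = 1; w <= workers; ++w) {
        Job j{0, w - 1};
        msg_send(w, /*tag=*/1, &j, sizeof j);
    }
    // Workers: receive a job, "compute" it, notify completion (the sink).
    for (int w = 1; w <= workers; ++w) {
        Job j;
        msg_receive(w, /*tag=*/1, &j, sizeof j);
        int done = j.row * 1000 + j.col;   // placeholder for real work
        msg_send(0, /*tag=*/2, &done, sizeof done);
    }
    // Controller: collect completion notices from all workers.
    for (int w = 1; w <= workers; ++w) {
        int done;
        msg_receive(0, /*tag=*/2, &done, sizeof done);
        std::printf("job result %d collected\n", done);
    }
    return 0;
}
```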

3. Porting applications to the many-core architecture

Migrating existing applications to a new platform or architecture is a common task in software engineering. Several approaches have been proposed over the years [28] in order to accomplish this mission with the best results and the least possible effort, both in scientific [29] and management applications [30]. Bioinformatics is not an exception, especially when new ways of parallel execution are explored or new architectures (like GPGPU) are used [31]. Indeed, new programming methodologies and design patterns are being studied to allow easily migrating applications to multi- and many-core microprocessors, GPGPU [32,33] and even hybrid architectures of PC-clusters [34]. Among the migration strategies are those which use Parallel Programming Languages (PPLs), like the Open Computing Language (OpenCL) [35], available in the Compute Unified Device Architecture (CUDA) [36,37] and Intel Parallel Studio (IPS) [38]. All of these approaches take advantage of built-in commands and tools to help the programmer identify the most suitable code segments to be optimized by a parallel execution. However, in our case, none of them may be exploited, due to the lack of similar development tools for Tile64, and because only a few architectures are available for such languages.

The minor-effort approach to obtain a parallel code is to use a compiler capable of parallelizing the code for a specific architecture. Indeed, this is one of the most intricate and researched fields in computer science, being one of the compilers' classic challenges [39]. However, in our case, as the tile-gcc in the MDE version 1.3.5 (and the newer 2.0) still does not support this optimization to generate parallel code, it is the programmer's task to manually parallelize the execution, distributing it among the tiles. So, the possibilities to migrate source codes are restricted to those manually developed in C or C++, with a trade-off between required effort and expected performance.

The present work reveals which capabilities of the low-level Tile64 architecture are being fully used, exposing the bottlenecks and their impact on the overall performance. Additionally, this know-how enables us to empirically exploit the more relevant Tile64 features to obtain the maximum performance with the minimum possible effort, and to set aside the features that are irrelevant for parallelization purposes. Eventually, such knowledge allows deriving useful guidelines for future high-performance parallel developments, some of which are well known in current approaches. In the present work, three common-sense approaches have been adopted in this restricted scenario, using three different bioinformatics algorithms on the TilExpress-20G cards: (i) entire re-design and implementation of an algorithm, taking into account the new platform characteristics (such a technique has been widely used previously; e.g., in [29,40], albeit being time-consuming for the programmer); (ii) the adaptation and redesign of the parallel strategy only in the bottlenecks of the original algorithm, as shown, for instance, in [41,42], being an intermediate approach; and (iii) the fastest and easiest approach of directly porting an algorithm and trying later on to optimize the parallel behavior and performance, as done, e.g., by [37,43]. The TilExpress-20G performance greatly diverges from one approach to another when compared to a Dell T5400 original execution, as described in the following sections.

4. Development from scratch

The algorithm selected in this first approach is the FastLSA [23] pairwise global aligner. From a bioinformatics point of view, the goal of an alignment algorithm is to identify similar and discrepant regions of DNA, RNA or peptide (e.g., protein) sequences. The alignment is called pairwise if the goal is to find the best match between just two sequences, being named multiple if more than two sequences are involved in such alignment. The FastLSA is a variant of the Needleman–Wunsch (NW) algorithm [44], with Gotoh's modifications [45] and linear-space complexity. The interest in this algorithm is related to its linear spatial complexity, instead of quadratic as in the NW. The FastLSA is an optimal aligner (it returns the best alignment from a mathematical point of view), which follows the same dynamic programming strategy as NW, but does not need to store the full Dynamic Programming Matrix (DPM). Thanks to this characteristic, the FastLSA is able to align longer sequences.

4.1. Overview

A pairwise aligner tries to match the input sequences without permuting any of their elements, but inserting gaps instead. In this case, given two sequences, A of length n (query; usually the sequence obtained by the researcher) and B of length m (subject; usually the sequence obtained from a database), the Needleman–Wunsch algorithm uses a dynamic programming approach to maximize scores in an n × m matrix with all possible base changes (mutations or polymorphisms, from a biological point of view). They include base substitutions like transitions (purine–purine or pyrimidine–pyrimidine) and transversions (purine–pyrimidine, or vice versa), as well as frameshift mutations, including insertions and deletions (of the query) known as "indels" (appearance of gaps in the subject or the query sequence, respectively). Each position of the matrix stores the best score to align the subsequence of A with the subsequence of B up to that point. This score is calculated via a reference table called the substitution matrix, and the gap insertion, deletion and extension penalty costs, used by Gotoh's affine-gap approach.

The computation of the matrix begins by assigning an initial value to its first row and column. The rest of the elements in the matrix are calculated from their upper, left and upper-left neighbors, taking the maximum value from the following possibilities: (i) inserting a gap in the A sequence, so that the element from B will be paired with it; the score in this case is taken from the upper cell, plus the insertion penalty; (ii) inserting a gap in the B sequence, so that the element from A will be paired with it; the score in this case is taken from the left cell, plus the deletion penalty; and (iii) the residues of the two sequences are matched in that position; the score in this case is taken from the upper-left cell, plus the match value in the substitution table.
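
Written out, the recurrence just described (without affine gaps) is the classical Needleman–Wunsch formula below, where H_{i,j} is the best score aligning the first i residues of A with the first j residues of B, s(a_i, b_j) is the substitution-matrix score, and g is the (negative) gap penalty; the notation is ours, not taken from the paper:

```latex
H_{i,j} = \max \begin{cases}
H_{i-1,j} + g & \text{(gap paired with } a_i\text{)}\\
H_{i,j-1} + g & \text{(gap paired with } b_j\text{)}\\
H_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch)}
\end{cases}
```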

Gotoh's affine-gap scheme improves the quality of the alignment from the biological point of view (which is relevant, for instance, for the construction of phylogenetic trees or dendrograms), using three values in each cell. The algorithm behavior is otherwise exactly the same, the difference being the memory usage. Once the entire matrix has been calculated in this forward stage, the alignment is obtained by moving backwards from the bottom-right corner, following the path described in the first stage and generating the alignment from the operations associated with each movement: replacement (substitution), insertion or deletion.
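
The three values per cell mentioned above are usually written as the following recurrences in Gotoh's formulation, with a gap-open penalty g_o and a gap-extension penalty g_e (again, our notation, added for clarity):

```latex
\begin{aligned}
E_{i,j} &= \max\left(E_{i,j-1} - g_e,\; H_{i,j-1} - g_o - g_e\right)\\
F_{i,j} &= \max\left(F_{i-1,j} - g_e,\; H_{i-1,j} - g_o - g_e\right)\\
H_{i,j} &= \max\left(H_{i-1,j-1} + s(a_i,b_j),\; E_{i,j},\; F_{i,j}\right)
\end{aligned}
```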

With the advent of multi-core and many-core technologies, several approaches have been proposed to parallelize this algorithm, like Hirschberg's [46]. We have used the FastLSA [23], which proposes to save some temporary values from the forward phase in a cache. Thus, only a few submatrices must be recomputed (the ones involved in the optimal path) when the backward stage begins. Besides, the memory requirements are lower with this approach, so very long sequences can be aligned, and the forward stage may be parallelized by simultaneously calculating many submatrices.

4.2. Implementation

The source code of the original FastLSA was not available, and the best version we could obtain was an MPI-based implementation specifically designed for the Intel C++ compiler, called the Parallel Linear Space Algorithm (PLSA) [47]. So, the first step was to develop a wholly new implementation for the Tile64 platform, adapting the communication between sub-processes to the available resources in the iLib API. The FastLSA algorithm requires generating the grid cache, which is a matrix of rows and columns available to all the participant tiles, which compute submatrices in parallel. In order to coordinate this work, a task-scheduling strategy was designed, with a controller tile distributing queued jobs among the remaining tiles, which act as workers. Due to the fact that each tile runs a complete and independent Linux operating system, different programs can be written for the controller and the worker tasks, each one running with its own code and data memory.

Pairwise aligners need two stages: the first one is a forward process that computes the full DPM; afterwards, the second one obtains the alignment following the optimal path described by the DPM. Thus, the workers accept two different job types: forward and backward ones. The backward stage takes linear time, in contrast with the quadratic time required by the DPM generation stage.

Fig. 1. MC64-NW/SW performance compared to other implementations.

In MC64-NW/SW, the controller stores and encapsulates the grid cache, and its main task is to distribute pending jobs among the idling workers in the queue. The cache and job sizes are determined by the sequence lengths (n and m) and a configurable k parameter, which divides the n × m matrix into blocks of k × k cells. The underlying idea of this grid cache is being able to align very long sequences which, otherwise, would take a huge quantity of memory. When adjusting the k parameter, instead of storing the n × m × (cell size) bytes of the full DPM in memory, only 2/k × n × m × (cell size) bytes are stored. However, in order to obtain the alignment by traversing back the DPM in the backward stage, the k × k submatrices in the optimal alignment path must be recalculated.

The grid cache calculus in the forward stage can be distributed among all workers following a wavefront approach: a certain submatrix is ready to be generated when the immediately upper and left ones have already been calculated. Thus, exploiting the full parallelism will depend on n, m, k and the number of available tiles, reaching the top when there are as many available jobs as available workers. When a forward job is finished, the results are stored in the grid and, when the initial row and column are available to start a new job, it is added to the pending job list.
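
The dependency rule of the wavefront can be captured in a few lines. The sketch below (our illustration, not the MC64-NW/SW source) counts, for each k × k block, how many of its upper/left neighbors are still pending, and releases a block to the job queue as soon as that count reaches zero:

```cpp
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

// Minimal sketch of the wavefront job scheduling described above.
// A block (i, j) becomes ready once its upper (i-1, j) and left
// (i, j-1) neighbors are done. Names and structure are illustrative.
int main() {
    const int rows = 4, cols = 6;           // (n/k) x (m/k) blocks
    std::vector<std::vector<int>> deps(rows, std::vector<int>(cols));
    std::queue<std::pair<int, int>> ready;

    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            deps[i][j] = (i > 0) + (j > 0); // unmet dependencies
    ready.push({0, 0});                     // the top-left block has none

    while (!ready.empty()) {                // controller main loop
        auto [i, j] = ready.front(); ready.pop();
        std::printf("dispatch block (%d,%d)\n", i, j); // send to a worker
        // On completion, release the right and lower neighbors.
        if (j + 1 < cols && --deps[i][j + 1] == 0) ready.push({i, j + 1});
        if (i + 1 < rows && --deps[i + 1][j] == 0) ready.push({i + 1, j});
    }
    return 0;
}
```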

The workers wait for the controller to send them jobs. When a job is received, the worker calculates the k × k dynamic programming submatrix using the alignment core operation, taking into consideration the affine gap penalties, as stated in the overview above. In the first stage, each worker needs only 2 × k × (cell size) bytes of local memory. But in the second stage, a complete k × k × (cell size) memory storage is required, as the full DPM submatrix needs to be stored in memory to traverse it back and get the alignment.

This last stage has not been parallelized because the alignment path cannot be predicted. Therefore, only one worker executes it, processing a maximum of n/k + m/k blocks in the worst-case scenario [17]. Thus, it is important to determine the optimal k value in order to fine-tune the parallelism, taking into consideration the worker tile cache sizes and the time needed to have all the tiles working due to the wavefront growth. This value can be empirically determined [16] and will be relevant in the last algorithm of this paper.

Three mechanisms are available in the Tile64 microprocessor to communicate different tiles: (i) message passing is used when the controller sends an outstanding job to an available worker; (ii) communication channels (arranged as data sinks) are used by the workers to notify a job completion to the controller; and (iii) shared memory is used to temporarily store the results of computation tasks, so they become available to both the controller and the workers. During the development process, a progressive refinement method was used to optimize the performance in execution time, and to allow processing of larger sequences, taking into account the Tile64 hardware and software features and limitations [17].

4.3. Performance and highlights

The first version of MC64-NW (our parallel FastLSA for Tile64) generated impressive results, due to its high parallelization factor. However, several aspects proved to be very inefficient in the Tile64 microprocessor, so optimization adjustments were needed in order to further improve performance [17]. Since the access to shared memory was very slow, we focused on working with local memory instead, with an important positive impact on the execution time. In this sense, v1.1 copied the substitution matrix to the local memory, whereas v1.2 allocated the two subsequences of the job to process in the local memory, providing a performance increment of 600% (see Fig. 1). Having confirmed the benefits of using the local memory instead of the shared one, v1.3 was further optimized to use the local memory to achieve a new performance gain. In addition, the scores stored in the DPM were changed to the long data type in v1.4. Finally, minor changes were made to prepare MC64-NW to use affine gaps (v2.0, not shown). Affine gaps were implemented in v2.1 to obtain more accurate results from a biological point of view. In these versions, the controller was unable to manage very long sequence alignments because of an addressing space of 32 bits (4 GB). To overcome this problem and to take full advantage of the card's 8 GB of RAM, v2.2 was developed to dump full computed grid rows to SSD files and to package local data. Finally, the MC64-NW was also enhanced with a new functionality implementing the Smith–Waterman (SW) algorithm for local alignment. Thus, the initialization was altered, generating the local alignment from a partial backwards stage (which starts from the maximum-value cell to the first zero-value cell reached). Additionally, the maximum global value must be saved temporarily, and the DPM only stores positive values. Hence, the user can carry out both global and local alignments using the same algorithm. However, none of these changes alter the core of the algorithm, which has been renamed as MC64-NW/SW in v3.0. All the versions used an amount of local memory smaller than the 32 kB of cache data available, and the controller never behaved as a bottleneck in the whole process; in addition, the position of the controller tile in the 8 × 8 grid geometry (centered or not) had no impact on the communication times.

Fig. 2. MC64-NW/SW version history and performance.

Fig. 1 shows a comparison of time performance between MC64-NW/SW and several main alignment algorithms with different sequence lengths. The Hirschberg (CMMAligner) algorithm is a "divide-and-conquer" approach from the National Center for Biotechnology Information (NCBI) C++ Toolkit, whose space requirement is linear instead of quadratic, but at the cost of doubling the execution time. On the other hand, the Java–FastLSA algorithm is a multithreaded FastLSA reference implementation that cannot align a pair of sequences longer than 500 kb. Along with PLSA, both have been executed on the reference PC mentioned above, taking advantage of its four x86 cores.

The last improvements in the MC64-NW/SW allow achieving up to 897,408 Mega Cell Updates Per Second (MegaCUPS; MCUPS) in the forward stage, when aligning the longest sequences. Fig. 2 shows the performance obtained with each version [17]. As far as we know, the MC64-NW/SW is the fastest pairwise alignment algorithm for relatively large sequences on a standalone PC. That has been achieved after acquiring a comprehensive knowledge of the strong and weak points of the Tile64 characteristics and the algorithm internals, in order to run empirically driven refinements.

Therefore, as expected, if development time is invested in the platform, better results can be achieved. In this case, the time dedicated to this task was about 420 h (about three working months).

5. Migration with changes only in the communication layer

In this approach, the popular ABySS algorithm [24] was chosen to be migrated to the Tile64 platform with the least possible changes. This is a de novo parallel sequence assembler which uses the MPI [48]. As Tilera does not provide any kind of MPI implementation, we have developed an ad hoc MPI-like middleware to migrate the open source code of ABySS to the Tile64 platform. This middleware satisfies the minimal requirements to execute ABySS.

5.1. Overview

The so-called "Next-Generation" Sequencing (NGS) technologies allow cheaper genomic sequencing due to the massive parallelization and miniaturization of the biochemical reactions and equipment used. The classical Sanger sequencing methodology generates short sequence reads of up to about 1000 bases (b) in each reaction. The NGS technologies, like those from Illumina, Applied Biosystems and Roche [49,50], produce huge amounts of sequencing reaction data (reads), albeit with even shorter read lengths, ranging from about 25 to 800 b. To reconstruct the original DNA genome, these small pieces must be arranged in the correct order, considering that any two ordered samples may not be strictly adjacent but overlapping to some extent.

To reach such a goal, a high sequencing redundancy must be generated. As an example, the human haploid genome is made of three billion base pairs (3 Gbp), and the wheat genome has an astonishing 17 Gbp. Briefly, the huge genomic DNA from thousands of cells is purified and broken into small pieces at random, which are then massively and redundantly sequenced. Since the double-stranded DNA (dsDNA) has direct and reverse strands (5'-phosphate to 3'-OH), a maze of overlapping reads of both strands is generated that must be solved to reconstruct the original genome. The situation is further complicated because genomes may have large amounts of homopolymeric and repetitive DNA that may cause sequencing errors, being difficult or even impossible to sequence and sort in some instances, mainly for short read lengths, unless genome-dedicated integrated experimental approaches are developed [51]. Thus, the assembly of such reads can be a daunting task. Indeed, all large genomes sequenced to date (including the human one) have many gaps yet to be filled.

Though well-known algorithms like CAP3 [52], EULER-SR [53] or the set of Velvet tools [54] focus on the problem of assembling large genomes from small sequencing reads, ABySS is one of the most popular, allowing a high degree of parallelization. Yet, manually parallelizing the ABySS from scratch would be a harsh task due to its source code intricacy, making it a good candidate for a straightforward migration instead. The ABySS algorithm has been implemented in C++ and is freely available on the Internet (http://www.bcgsc.ca/platform/bioinfo/software/abyss). The algorithm is divided into six main stages:

1. The samples are partitioned, and all possible k-mers and their reverse complements are calculated and distributed amongst all the participant cores. A k-mer is a subsequence of k adjacent residues (with k defined by the user) that forms an exact match in two or more overlapping samples.

2. The set of possible adjacencies between the k-mers is generated and translated into a De Bruijn graph [55].

3. The sequencing read errors and experimental noise are corrected by trimming short branches with a length less than k.

4. The loops or "bubbles" produced by repeated read errors or single-nucleotide allelic differences are deleted by removing one of their branches.

5. The resulting "clean" graph is analyzed for contig-extension ambiguities, and the remaining connected nodes are concatenated. This produces independent sequences that overlap by no more than k − 1 nucleotide residues.

6. These sequences are aligned, an empirical fragment-size distribution is calculated and, finally, the contigs with coherent and unambiguous distances are joined to build larger consensus contigs.

Thus, the parallelization of the algorithm distributes the De Bruijn graph across all the participant nodes/tiles (the drawback of this strategy is that adjacent k-mers may not reside in the same node). In addition, the set of k-mers is partitioned by a hash function, and each subset is sent to a different node/tile.
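
The following sketch (ours, not ABySS's actual code) illustrates stage 1 and the hash-based partitioning just mentioned: every k-mer and its reverse complement are extracted from a read, and a canonical form is hashed to pick a destination tile, so that a k-mer and its reverse complement always land on the same tile:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Illustrative sketch of k-mer extraction and hash partitioning.
static char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        default:  return 'C';   // 'G'
    }
}

static std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) c = complement(c);
    return rc;
}

int main() {
    const std::string read = "ACGTACGGTC"; // toy sample
    const std::size_t k = 5;
    const int tiles = 60;                   // worker tiles available

    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        std::string kmer = read.substr(i, k);
        std::string rc   = reverse_complement(kmer);
        // Hash the lexicographically smaller of the two, so a k-mer
        // and its reverse complement map to the same tile.
        const std::string& canon = (kmer < rc) ? kmer : rc;
        int tile = std::hash<std::string>{}(canon) % tiles;
        std::cout << kmer << " / " << rc << " -> tile " << tile << '\n';
    }
    return 0;
}
```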

5.2. Implementation

Our development started with the analysis of the original ABySS code, which is a fine-tuned, well-designed and structured set of classes, where all parallel tasks are managed by a single C++ class named NetworkSequenceCollection. The communication management is transparent to this class by using a CommLayer object which, actually, encapsulates all the MPI calls. Indeed, the use of the MPI in the original algorithm is rather simple, and there are only three kinds of communications: (i) each node opens a single reception polling task in non-blocking mode to receive any data message coming from any other node; the MPI_ANY_SOURCE and MPI_ANY_TAG constants are used for this purpose, and the data messages are sent in blocking mode; (ii) the control messages are exchanged through a synchronous blocking mode; and (iii) the barrier mechanism provided by the MPI is used to synchronize the execution. Additionally, due to the Tile64 microprocessor restrictions, we checked the code for the use of floating point instructions. This kind of instruction was used only in a logging function, so any software simulation should not affect our implementation performance.

When moving this scenario to Tile64, the high-performance approach was to use the library provided by Tilera (i.e., iLib). It contains equivalent functions for almost every MPI call used by ABySS, with three remarkable exceptions: (i) the functionality to listen for communications from any source and any tag is not supported; (ii) there is only one "blocking send" method in iLib, so there is no distinction between the MPI "synchronous blocking send" and "common blocking send"; indeed, the iLib "blocking send" function corresponds to the MPI "synchronous blocking send", where the sender tile is not released until the receiver actually gets the data; and (iii) there are no MPI "reduce" functions implemented, which could perform global operations with values provided from every tile. Taking into account these software architectural differences and the particular use of the MPI by the ABySS, our porting has been based on the following design criteria:

1. The lack of MPI_ANY_SOURCE (i.e., the need to listen to any source) was simulated by opening a receive sink in every tile, listening to the other tiles.

2. Following the MPI specification, the sender is released in a "common blocking send" when the message has been saved in a temporary buffer, regardless of whether it has actually been delivered to the receiver or not. To achieve this behavior, we have developed a middleware in which all the messages of this kind converge into a dedicated tile that stores and routes them to their destination, including extra tags, so that the sender tile can be identified by the receiver. This method allowed us to maintain the same schema in the overall algorithm (blocking send/non-blocking receive), by freeing the sending tile immediately after the router tile receives the message, which can later complete the transmission to the destination tile. On the other hand, each regular tile was required to add a task to listen for messages coming from the router tile.

3. The functionality of a basic reduce function was achieved with a two-phase message-passing strategy, in which the values to reduce are collected in the first stage by a specific tile and, after performing the required operation on them, the final value is transmitted to all the participants in the reduction. The reduce functions are only used at the end of each phase in which the algorithm is divided, so another more scalable and efficient barrier algorithm, such as the well-known binary tree-barrier [56], would not improve performance in this particular scenario.

In short, the three above conditions imply losing one tile to administrative tasks and at least another to interconnection ones. The remaining tiles (excluding those reserved by the system) act as workers with the mentioned functions. From this point, the rest of the functions provided by iLib are similar enough to their equivalents in MPI, allowing a straightforward use of them in CommLayer. Additionally, the barrier mechanism provided by iLib worked particularly well in this situation.
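
The store-and-forward role of the router tile in design criterion 2 above can be sketched as follows (our illustration; the actual middleware uses iLib calls instead of the printf placeholder):

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>
#include <utility>
#include <vector>

// Illustrative sketch of the router tile: it buffers "common blocking
// send" messages so senders are released immediately, and forwards
// them later, tagging each one with its original sender.
struct Msg {
    int src;                    // original sender tile (the extra tag)
    int dst;                    // final destination tile
    std::vector<char> payload;
};

class RouterTile {
    std::deque<Msg> buffer;     // store-and-forward queue
public:
    // Called when any worker sends to the router; returning here is
    // what "releases" the sender before actual delivery.
    void accept(int src, int dst, const void* data, std::size_t len) {
        const char* p = static_cast<const char*>(data);
        Msg m{src, dst, std::vector<char>(p, p + len)};
        buffer.push_back(std::move(m));
    }
    // Router main loop body: drain the queue toward the destinations.
    void pump() {
        while (!buffer.empty()) {
            const Msg& m = buffer.front();
            // Placeholder for an iLib send to tile m.dst; the receiver
            // reads m.src to learn who originated the message.
            std::printf("route %zu bytes: tile %d -> tile %d\n",
                        m.payload.size(), m.src, m.dst);
            buffer.pop_front();
        }
    }
};

int main() {
    RouterTile router;
    const char hello[] = "k-mer batch";
    router.accept(/*src=*/3, /*dst=*/7, hello, sizeof hello);
    router.pump();
    return 0;
}
```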

Once this Tile64 implementation was validated (i.e., it produced the same results as the original implementation on the reference PC with the same parameters), we found that the heavy communication load inherent to the ABySS parallelization overflowed the router tile limits in early tests, so a further refinement was required to avoid this saturation. Moreover, the profiling tests given by the MDE revealed that the processing of the communication messages took more time in the algorithm than the processing of the assembling tasks. Our straightforward solution has been to dedicate several tiles to the router functions, so the only change needed in the code of a regular tile is to open a new receiving task for each router tile and to alternate its sending router in a round-robin way. Regardless of these procedural changes, the original bioinformatics algorithm remained unchanged, so we can maintain our qualification of this approach as "direct translation". The benchmarking was carried out after such optimizations, as described below.

5.3. Performance and highlights

We have tested our development with the same input as described in the simulated-data section of [24]; i.e., with the human chromosome 22 (GenBank Accession NC_000022.9), available from the Internet at http://www.ncbi.nlm.nih.gov/nuccore/NC_000022.9. This chromosome has been sliced into subsequences of 200 b, and the synthetic samples have been generated by taking the first 36 nucleotide residues and the complement of the last 36 residues of each of these 200 b sequences. Therefore, the algorithm was tested in ideal conditions, with a coverage of 72-fold (i.e., the redundancy of the sequencing reads). The quality of the migration is measured in execution time, as well as in the sequence length that can be processed. This must be evaluated in the context of the hardware used; namely, the number of tiles or cores involved in the assembling work and the available memory. Both parameters affect the performance (time required for the execution) and the amount of data which can be managed, respectively.
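
For illustration, the sample generation just described can be sketched as below (our code, not the one used for the benchmarks). Note that the text says "the complement of the last 36 residues"; the sketch assumes the reverse complement, as is conventional for a read from the opposite strand:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Illustrative sketch of the synthetic-sample generation: slice the
// chromosome into 200 b fragments and emit the first 36 residues plus
// the reverse complement of the last 36 (assumption, see lead-in).
static std::string reverse_complement(std::string s) {
    std::reverse(s.begin(), s.end());
    for (char& c : s)
        c = (c == 'A') ? 'T' : (c == 'T') ? 'A' : (c == 'C') ? 'G' : 'C';
    return s;
}

std::vector<std::string> make_samples(const std::string& chromosome) {
    const std::size_t frag = 200, read = 36;
    std::vector<std::string> samples;
    for (std::size_t pos = 0; pos + frag <= chromosome.size(); pos += frag) {
        std::string f = chromosome.substr(pos, frag);
        samples.push_back(f.substr(0, read));                         // 5' read
        samples.push_back(reverse_complement(f.substr(frag - read))); // 3' read
    }
    return samples;
}

int main() {
    // Toy input; the real test used human chromosome 22 (NC_000022.9).
    std::string chr(600, 'A');
    for (const auto& s : make_samples(chr)) std::cout << s << '\n';
    return 0;
}
```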

Table 2 shows the execution times of our Tile64 ABySS implementation with different numbers of samples against the reference (original ABySS MPI execution) on the Quad Core/8 GB memory workstations previously described. All the executions have been launched with a k-mer of 36, in order to match the way we got the samples, and with all the other ABySS parameters fixed, so that all phases are executed. The rightmost two columns show the number of tiles dedicated to the assembling work and the number of router tiles. Under these conditions, we can see that the execution times are correlated with the number of processed samples, and that our straightforward migration always falls far behind the original algorithm.

The TilExpress-20G restricts the number of samples to assemble to two million. The workstation showed a similar limit, because of its similar amount of RAM memory. The algorithm significantly decreased its performance beyond this threshold, due to the fact that the high memory requirements force the operating system to use the swapping mechanism intensively. Fig. 3 shows the reference execution time (leftmost bar) against several MC64-ABySS executions, where the number of tiles varied from five to 56 (some empirical tests showed a similar behavior when using all available tiles – 63 – and when observing the limit of 60 effective tiles mentioned in Section 2; not shown) and the worker/router tile ratio varied from 4/1 to 24/32. A fixed number of samples (100,000) was used in all executions, with a k-mer of 36, and the rest of the parameters remaining unchanged. Each bar shows the time that a single tile dedicated to communication tasks (bottom segment) and to real assembling tasks (top segment). As the real assembling time decreased when more tiles were involved in the execution, the communication time increased, even reaching a point at which the overall performance was reduced if the worker/router tile ratio was not well balanced. In this scenario, more tiles should be dedicated to the communication tasks than to the assembling work, as previously indicated.

Table 2
Comparison of execution times and tile usage of Tile64-ABySS versus the original algorithm.

Samples   | MPI execution time (s) | Tile64 execution time (s) (slowdown) | Working tiles | Router tiles
1,000     | 1.17                   | 6.93 (5.9×)                          | 31            | 32
10,000    | 4.46                   | 19.09 (4.3×)                         | 47            | 16
100,000   | 41.78                  | 167.62 (4.0×)                        | 31            | 32
200,000   | 79.61                  | 287.18 (3.6×)                        | 31            | 32
500,000   | 198.90                 | 711.57 (3.6×)                        | 47            | 16
1,000,000 | 493.70                 | 1,676.81 (3.3×)                      | 47            | 16

Fig. 3. ABySS execution times when working with 100,000 samples, and their distribution between communication and real assembling time. Values in parentheses show the ratio (processing tiles/communication tiles).

Given these results, a second independent implementation was developed, by means of an MPI-like micro-library programmed ad hoc, so the original ABySS code could remain untouched. In this new library, the router tiles remained, but using a communication approach based on iLib channels instead of message passing. In this strategy, a tile is reserved to play the role of communications controller and to centralize all MPI communications: everything is registered and stored in the controller buffer, so blocking and non-blocking operations can be implemented, as well as reductions. Yet, the results were very close to those of the first implementation (not shown).

Finally, in order to cover the sizes generated by the current DNA sequencers, new tests were carried out with different k-mer values, after generating new synthetic samples suitable to be assembled with such new k-mer sizes. In these new tests, it was observed that when the k-mer size was decreased, the execution time in the cards slightly approached that of the x86 approach, but without beating it. Nevertheless, the ABySS accuracy decreased with lower k-mer values, rendering such performance gain useless. On the other hand, the inverse effect was found in the range of higher k-mers. This behavior was due to the fact that the lower the k-mer value, the smaller the subset of k-mers that must be distributed to each tile and the lower the iMesh overload.

Therefore, the straightforward migration of the ABySS algorithm to the Tile64 microprocessor (with changes only in the communication layer) was not effective, as it spends more time on communication tasks than on parallelized tasks. Because of this, the higher latencies of the iMesh compared with those of the internal Intel Xeon cores rendered this implementation mostly inefficient. The positive side of this approach was the lesser time dedicated to the migration process. Thus, the first running version took as little as 50 development hours (about seven working days). Obviously, different results could have been obtained if the ABySS algorithm had been rewritten from scratch for the Tile64 architecture, but that was out of the scope of this work.

6. Direct porting with some optimizations

The algorithm chosen in this approach was ClustalW, one of the most widely used multiple sequence aligners. The goal of a multiple alignment is to find similarities and differences in a set of sequences, so that evolutionary relationships can be established, including the generation of phylogenetic trees. Likewise, the polymorphisms in the sequences can be identified, which can be useful, for instance, to design specific molecular markers for DNA, RNA or peptide fingerprinting.

6.1. Overview

ClustalW has been the most widely used multiple sequence alignment (MSA) tool throughout the last years [57]. Its source code has been ported to many platforms, entirely rewritten in C++ to give it a modern interface [25], and also slightly parallelized using the MPI [58]. Other MSA algorithms, like the heuristic PSI-BLAST [59], have been parallelized as well using different strategies, like grid computing [60]. The algorithm is divided into three main stages:

1. First of all, all sequences are pairwise aligned against each other, in a round-robin fashion. Once the alignments are done, a matrix representing the degree of divergence between each pair of sequences is built by means of scoring or weighting (the "W" stands for this) the quality of the pairwise alignments.

2. In the second stage, a guide tree is obtained from the distance matrix, using a clustering algorithm like the Neighbor-Joining (NJ) [61] or the Unweighted Pair Group Method with Arithmetic mean (UPGMA) [62] methods. This tree classifies all the sequences based on their similarities.

3. Then, the last stage (the progressive alignment) follows the tree from the deeper nodes to the root in the alignment order. This stage uses an approach similar to that of the pairwise alignment, but implementing adaptive cost penalties and working with the profiles of the previous alignments (the block of the already aligned sequences in the branch), instead of the raw sequences.

As evaluated by [63], the first and last stages are the most computationally demanding. Additionally, any pairwise alignment method can be used in the first stage, either heuristic or fully based on dynamic programming, with an obvious advantage in accuracy in the case of the latter.

6.2. Implementation

We followed a minimal-effort approach to achieve a fast performance improvement in this algorithm, parallelizing only the most time-consuming step; in this case, the pairwise alignment stage. We have used the open source code from the ClustalW-MPI implementation [58] in order to improve the global performance, replacing the first stage of pairwise alignments with our MC64-NW/SW optimal aligner. Thus, several calls to the pairwise aligner were carried out in order to compute the matrix for the guide tree, using a mid-layer to communicate the so-called MC64-ClustalW with the MC64-NW/SW.

Fig. 4. ClustalW first-stage performance when aligning three to 10 sequences of approximately 100 kb in length.

Fig. 5. ClustalW first-stage performance when aligning 10 sequences of 50–200 kb.

In order to accomplish this wrapper strategy, the algorithm creates a sequential script on the fly to launch all the alignments using the MC64-NW/SW. The algorithm obtains the sequences and the scoring matrix from the ClustalW internal variables, and uploads them to the Tilera executable program through the script, using the tile-monitor management tool. The original ClustalW function traverses each resulting pairwise alignment to determine its score and fills in the matrix for the guide tree. No changes were needed in the MC64-NW/SW code, whereas the ClustalW implementation only required a few modifications in the pairalign.c file to create and integrate the wrapper communication layer.
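
Schematically, the wrapper behaves like the loop below (our illustration, not the MC64-ClustalW source). run_mc64_nwsw() is a hypothetical stand-in for the generated script plus the tile-monitor invocation, and the identity-to-distance conversion is shown only as a plausible scoring choice:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for launching MC64-NW/SW on the card and
// parsing the alignment score; here it returns a dummy identity value.
static double run_mc64_nwsw(const std::string& a, const std::string& b) {
    return 0.5 + 0.001 * (double)(a.size() % 7) * (double)(b.size() % 3);
}

int main() {
    std::vector<std::string> seqs = {"ACGT...", "ACGA...", "TCGT..."};
    const std::size_t n = seqs.size();
    std::vector<std::vector<double>> dist(n, std::vector<double>(n, 0.0));

    for (std::size_t i = 0; i < n; ++i)       // round-robin over pairs
        for (std::size_t j = i + 1; j < n; ++j) {
            double identity = run_mc64_nwsw(seqs[i], seqs[j]);
            dist[i][j] = dist[j][i] = 1.0 - identity;  // divergence
        }

    for (std::size_t i = 0; i < n; ++i, std::puts(""))
        for (std::size_t j = 0; j < n; ++j) std::printf("%.3f ", dist[i][j]);
    return 0;
}
```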

The second and last stages of ClustalW were also executed, but parallelizing only the last one with the MPI, as in [58], in order to improve the overall performance of the web service that we have deployed (http://www.sicuma.uma.es/manycore/index.jsp?seccion=m64cl/run).

6.3. Performance and highlights

The first stage of the algorithm was completed in 2325 s (38.75 min), instead of the 6631 s (1.84 h) taken by the MPI ClustalW on an Intel Xeon quad-core microprocessor, when aligning 10 human herpes virus genomes (about 100 kb each). In this situation, our algorithm took 64.93% less time than the x86 one (1 − 2325/6631 ≈ 0.6493). Fig. 4 shows the speedup gained when aligning from three to 10 human herpes virus genomes. Fig. 5 shows the same performance differences when aligning 10 virus genomes of 50 kb (Acidianus filamentous virus), 100 and 200 kb (human herpes virus) and 150 kb (T4 bacteriophage). In both figures, the right axis represents the relative percentage of gain shown by the bars, whereas the left axis is the absolute execution time indicated by the MPI and MC-64 lines. The relative percentage gain is indicated above the bars.

The results demonstrate the usefulness of the Tile64 microprocessor when parallelizing independent tasks. An average speedup of 60% was obtained using this approach versus the non-parallelized version of the first stage (the slowest part of ClustalW, as [63] notes). This strategy required a minor effort (reusing a previously optimized algorithm), with a development time of just 30 h, and it is also a good example of how software technologies may build on previously developed components.

7. Discussion

The Tile64 microprocessors of the Tilera cards are the first true many-core general-purpose chips commercially available. Such an architecture has a clear potential for bioinformatics, as previously reported [16]. Obviously, though, the practical usefulness of the Tile64 parallelization also depends on the particular algorithm used, as demonstrated in the present work. In general, and not surprisingly, the code needs to be thoroughly optimized to improve the global performance, as with other chips. But, unlike architectures such as CUDA, the algorithms may be rewritten at a higher level of abstraction, as in a cluster of standalone pico-computers connected by an ultra-fast network with shared resources. Indeed, the Tile64 microprocessor can deliver very fast response times, previously unreachable by a standalone PC, for algorithms that can be divided into independent parallel tasks. However, when compared to an x86 multi-core system, the Tile64 many-core has a lower memory-per-core ratio and more shared resources that can produce contention and latency. On the other hand, Tile64 has more cores (64) than the current x86 alternatives (10).

Even the widely used MPI shows that an efficient implementation for one platform may yield a very low performance when directly migrated to another. In fact, the MPI makes it easier to develop specific parallel programs, but not general ones. This is because any high-performance MPI approach has to take into account the internal characteristics of the architecture it is deployed to. In other words, what is useful in a certain architecture may be useless or even counterproductive in others. Thus, any many-core MPI implementation needs to consider some key features, such as limited cache, shared memory usage, inter-core communication methods and intrinsic core limits. Otherwise, some resources may not be optimally exploited (e.g., idle tiles, a misused in-memory file system, wasted integer-calculus power, etc.).


Table 3
Algorithm implementation trade-offs for the three migrations (NW-SW, ABySS and ClustalW), comparing development effort, performance and overall trade-off. Solid squares represent the relative amount for each case, indicated in light gray (better), gray (medium) and dark (worse).

Additionally, some resources may be overloaded (e.g., the internal network, shared memory, the reduced number of tiles, software floating-point operations, etc.), so an unbalanced execution may cause a poor overall performance. This scenario is even worse on GPU/CUDA, where any attempt to adapt an algorithm (e.g., the ABySS) would have required a greater degree of modification, due, for instance, to the lack of inter-processor MPI capabilities [64]. The Tilera architecture is not oriented to executing as many low-load threads as possible inside a code block, but to executing entire programs, or even operating systems, across a set of tens of cores, providing this set with high-speed communication systems, both similar to other message-passing implementations and through other approaches like shared memory and interconnection channels. However, there are many high-throughput developments for CUDA platforms. These algorithms have been parallelized exclusively for the GPU approach with a higher development effort, and many of them have specific limitations, and thus can only work with some data sets. For instance, the CUDA-SW++ [65] pairwise aligner cannot deal with sequences larger than 59 kb, nor with scores that overflow values higher than 65,535. Although [66] claims to reach 100 GigaCUPS (GCUPS), this only occurs in the best possible scenario, when using only one byte per cell; i.e., the maximum score cannot exceed 255, in order to take advantage of the Supplemental Streaming ''Single Instruction, Multiple Data'' (SIMD) Extensions 3 (SSSE3) instructions. The MC64-NW/SW uses four bytes, being able to align sequences of several Mb with scores up to 2^32 − 1, yet preventing the use of the SIMD instruction set available in the RISC tiles.
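The following stand-alone snippet merely illustrates the arithmetic behind this cell-width/score-range tradeoff, assuming a hypothetical match reward of +2; it is not code from any of the aligners discussed.

    #include <cstdint>
    #include <cstdio>

    // The maximum alignment score an aligner can represent is bounded
    // by the width of its dynamic-programming cells. Wider cells avoid
    // overflow on long, highly similar sequences, at the cost of more
    // memory traffic and (on Tile64) losing packed-SIMD opportunities.
    int main() {
        const unsigned long long reward = 2; // hypothetical match reward
        std::printf("1-byte cells saturate after %llu matched residues\n",
                    (unsigned long long)UINT8_MAX / reward);   // 127
        std::printf("2-byte cells saturate after %llu matched residues\n",
                    (unsigned long long)UINT16_MAX / reward);  // 32767
        std::printf("4-byte cells allow about %llu matched residues, "
                    "enough for sequences of several Mb\n",
                    (unsigned long long)UINT32_MAX / reward);  // ~2.1e9
        return 0;
    }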

Therefore, and not surprisingly, to fully exploit and optimize the performance of the Tile64 architecture and similar many-core platforms, a rewrite of the algorithms from scratch may be needed. Such a strategy may not represent the best yield/effort approach in all instances, since it may require a significant effort and time to redesign the best possible parallel strategy, taking into account the particular many-core specifications in each case. Yet, if carried out, it should usually produce a more refined and efficient bioinformatics tool, and therefore one more useful for life-science researchers. As an example, Table 3 summarizes the efforts, performance and trade-offs for the migration of the three algorithms used in the present work.

8. Conclusions and future prospects

The ideal scenario when working with a new parallel bioinformatics platform (hardware) would be not only obtaining good results in terms of performance, but also getting it up and running as soon as possible, which implies the ability to execute existing programs with as few modifications as possible. The knowledge and experience empirically gained with the straightforward algorithm migrations presented in this paper allow us to draw the following conclusions about the use of many-core architectures, like Tile64, for bioinformatics applications: (i) the full capabilities of the many-core architectures can be exploited using a tightly coupled cluster of pico-computers; (ii) algorithm porting may provide a significant speedup over an existing implementation if time-consuming blocks of code can be identified and parallelized in accordance with the pico-computer (tile) architecture; (iii) obviously, the best approach may be to optimize the algorithm performance taking into account both its peculiarities and the many-core platform characteristics; (iv) to achieve a good performance in high-level functionalities (MC64-ClustalW) within an otherwise acceptable development timeframe, the availability of a set of high-throughput low-level functionalities is usually required (MC64-NW/SW); (v) the best approach is usually not a direct port of an already parallelized implementation, as may be the case for multi-core chips where middleware solutions (e.g., MPI) are usually well established and optimized, so a major effort must be dedicated to developing this kind of standard middleware from scratch; and (vi) a shallow understanding of the specific behavior of the algorithm may be enough to achieve an acceptable performance (MC64-ClustalW).

Nowadays, some emerging many-core CPU technologies are used only in academic and research environments. However, it has been shown that several computing fields could take advantage of such new hardware if it becomes available with an attractive performance/price ratio. This could lead to a major spreading of many-core CPU architectures and give birth to the programming tools required to fully exploit this new computing paradigm. Thus, programming should no longer be designed for just one or a few tiles, but for hundreds, thousands [67] or even millions of them [68]. In this sense, parallel development tools and methodologies are urgently needed to allow effective programming on many-core architectures. The results of current and future developments, as well as the experience and skills achieved with them, should allow sketching the main guidelines of such new tools and methodologies. Furthermore, a set of standards and methodologies is required not only on the programmer side, but also for the manufacturers, to produce really useful and convenient solutions for the potential customers. This way, it should be possible to exploit emerging new many-core architectures such as Tile64, fueling a new massive many-core algorithm-porting era with an affordable effort and cost. As an example, technologies like OpenCL (which facilitates the developer's parallelization work) and multi-platform parallel libraries (like the MPI) must be considered in the context of many-core technologies, in order to allow such an evolution to happen.

Acknowledgments

The authors thank Tilera (http://www.tilera.com) for providing hardware and software tools. This work was supported by ''Ministerio de Ciencia e Innovación'' (MICINN grants AGL2010-17316, BIO2009-07443 and BIO2011-15237); ''Consejería de Agricultura y Pesca'' of ''Junta de Andalucía'' (041/C/2007, 75/C/2009 & 56/C/2010); ''Grupo PAI'' (AGR-248); and ''Universidad de Córdoba'' (''Ayuda a Grupos''), Spain.

Appendix. Supplementary data

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.future.2012.03.018.

References

[1] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, T.J. Purcell, A survey of general-purpose computation on graphics hardware, presented at Eurographics, State of the Art Reports, 2005.

[2] J.A. Kahle, M.N. Day, H.P. Hofstee, C.R. Johns, T.R. Maeurer, D. Shippy, Introduction to the Cell multiprocessor, IBM Journal of Research and Development 49 (2005).

[3] T. Chen, R. Raghavan, J.N. Dale, E. Iwata, Cell broadband engine architecture and its first implementation: a performance view, IBM Journal of Research and Development 51 (2007) 559–572.


[4] A.K. Nanda, J.R. Moulic, R.E. Hanson, G. Goldrian, M.N. Day, B.D. D'Amora, S. Kesavarapu, Cell/B.E. blades: building blocks for scalable, real-time, interactive, and digital media servers, IBM Journal of Research and Development 51 (2007) 573–582.

[5] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, NVIDIA Tesla: a unified graphics and computing architecture, IEEE Micro 28 (2008) 39–55.

[6] A. Ruiz, N. Guil, M. Ujaldon, Recognition of circular patterns on GPUs: performance analysis and contributions, Journal of Parallel and Distributed Computing 68 (2008) 1329–1338.

[7] J. Kurzak, W. Alvaro, J. Dongarra, Optimizing matrix multiplication for a short-vector SIMD architecture — CELL processor, Parallel Computing 35 (2009) 138–150.

[8] T.H. Beach, N.J. Avis, An intelligent semi-automatic application porting system for application accelerators, presented at the Proceedings of the Combined Workshops on UnConventional High Performance Computing Workshop Plus Memory Access Workshop, Ischia, Italy, 2009.

[9] T.G. Mattson, R.V.D. Wijngaart, M. Frumkin, Programming the Intel 80-core network-on-a-chip Terascale processor, presented at the Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Austin, Texas, 2008.

[10] Intel (2010, 2010-10-31). The SCC Platform Overview. http://techresearch.intel.com/spaw2/uploads/files/SCC-Overview.pdf.

[11] Intel (2010, 2010-10-31). Intel's Teraflops Research Chip. http://download.intel.com/pressroom/kits/Teraflops/Teraflops_Research_Chip_Overview.pdf.

[12] Intel (2011, 2011-04-26). Intel Xeon Processor E7 family delivers record-breaking performance, new security, reliability and energy efficiency features. http://newsroom.intel.com/community/intel_newsroom/blog/2011/04/05/performance-reliability-security-intel-xeon-processor-formula-for-mission-critical-computing.

[13] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Saha, D. Sheahan, L. Spracklen, A. Wynn, UltraSPARC T2: a highly-threaded, power-efficient SPARC SOC, in: Asian Solid-State Circuits Conference, ASSCC'07, 2007, pp. 22–25.

[14] Adapteva (2011, 2011/10/4). Epiphany Multicore IP. http://www.adapteva.com/index.php?option=com_content&view=article&id=72&Itemid=79.

[15] W. Flohr, Implementation of an MPEG Codec on the Tilera 64 Processor, Department of Electrical and Systems Engineering, Washington University, St. Louis, 2008.

[16] S. Gálvez, D. Díaz, P. Hernández, F.J. Esteban, J.A. Caballero, G. Dorado, Next-generation bioinformatics: using many-core processor architecture to develop a web service for sequence alignment, Bioinformatics 26 (2010) 683–686.

[17] D. Díaz, F.J. Esteban, P. Hernández, J.A. Caballero, G. Dorado, S. Gálvez, Parallelizing and optimizing a bioinformatics pairwise sequence alignment algorithm for many-core architecture, Parallel Computing 37 (2011) 244–259.

[18] G. Besnard, P. Hernandez, B. Khadari, G. Dorado, V. Savolainen, Genomic profiling of plastid DNA variation in the Mediterranean olive tree, BMC Plant Biology 11 (2011).

[19] C. Trapnell, M.C. Schatz, Optimizing data intensive GPGPU computations for DNA sequence alignment, Parallel Computing 35 (2009) 429–440.

[20] S.A. Manavski, G. Valle, CUDA compatible GPU cards as efficient hardware accelerators for Smith–Waterman sequence alignment, BMC Bioinformatics 9 (Suppl 2) (2008) S10.

[21] L. Ligowski, W. Rudnicki, An efficient implementation of Smith–Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases, presented at the Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009.

[22] M.S. Farrar (2010, 2010-08-31). Optimizing Smith–Waterman for the cell broadband engine. http://farrar.michael.googlepages.com/SW-CellBE.pdf.

[23] A. Driga, P. Lu, J. Schaeffer, D. Szafron, K. Charter, I. Parsons, FastLSA: a fast, linear-space, parallel and sequential algorithm for sequence alignment, Algorithmica 45 (2006) 337–375.

[24] J.T. Simpson, K. Wong, S.D. Jackman, J.E. Schein, S.J.M. Jones, I. Birol, ABySS: a parallel assembler for short read sequence data, Genome Research 19 (2009) 1117–1123.

[25] M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, H. McWilliam, F. Valentin, I.M. Wallace, A. Wilm, R. Lopez, J.D. Thompson, T.J. Gibson, D.G. Higgins, Clustal W and Clustal X version 2.0, Bioinformatics 23 (2007) 2947–2948.

[26] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, J. Zook, TILE64 processor: a 64-core SoC with mesh interconnect, presented at the Solid-State Circuits Conference, ISSCC 2008, Digest of Technical Papers, IEEE International, 2008.

[27] J. des Rivieres, J. Wiegand, Eclipse: a platform for integrating development tools, IBM Systems Journal 43 (2004) 371–383.

[28] F. Fleurey, E. Breton, B. Baudry, A. Nicolas, J.M. Jezequel, Model-driven engineering for software migration in a large industrial context, Model Driven Engineering Languages and Systems, Proceedings 4735 (2007) 482–497.

[29] H. Zaidi, C. Labbe, C. Morel, Implementation of an environment for Monte Carlo simulation of fully 3-D positron tomography on a high-performance parallel platform, Parallel Computing 24 (1998) 1523–1536.

[30] J.V. Harrison, A. Berglas, I. Peake, Legacy 4GL application migration via knowledge-based software engineering technology: a case study, presented at the Proceedings of the Australian Software Engineering Conference, 1997.

[31] M. Charalambous, P. Trancoso, A. Stamatakis, Initial experiences porting a bioinformatics application to a graphics processor, in: Proceedings of the 10th Panhellenic Conference on Informatics, PCI 2005, 2005, pp. 415–425.

[32] K. Keutzer, T. Mattson (2010). A design pattern language for engineering (parallel) software. http://parlab.eecs.berkeley.edu/wiki/_media/patterns/opl-new_with_appendix-20091014.pdf.

[33] R. Membarth, F. Hannig, J. Teich, M. Körner, W. Eckert, Comparison of parallelization frameworks for shared memory multi-core architectures, in: Embedded World Conference, Nuremberg, Germany, 2010.

[34] B. Schmidt, H. Schröder, M. Schimmler, A hybrid architecture for bioinformatics, Future Generation Computer Systems 18 (2002) 855–862.

[35] The Khronos Group (2010, 2010/12/28). OpenCL — the open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl.

[36] S. Ryoo, C.I. Rodrigues, S.S. Baghsorkhi, S.S. Stone, D.B. Kirk, W.M.W. Hwu, Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008.

[37] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (2008) 13–27.

[38] Intel (2010, 14/8/2010). Intel Parallel Studio. http://software.intel.com/en-us/intel-parallel-studio-home.

[39] F.E. Allen, J. Cocke, A catalogue of optimizing transformations, in: Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, NJ, 1972, pp. 1–30.

[40] S. White, A. Alund, V.S. Sunderam, Performance of the NAS Parallel Benchmarks on PVM-based networks, Journal of Parallel and Distributed Computing 26 (1995) 61–71.

[41] V. Sachdeva, M. Kistler, E. Speight, T.H.K. Tzeng, Exploring the viability of the Cell Broadband Engine for bioinformatics applications, Parallel Computing 34 (2008) 616–626.

[42] R. Eigenmann, Toward a methodology of optimizing programs for high-performance computers, presented at the International Conference on Supercomputing, 1993.

[43] C. Dieterich, H. Wang, K. Rateitschak, A. Krause, M. Vingron, Annotating regulatory DNA based on man–mouse genomic comparison, Bioinformatics 18 (2002) S84–S90.

[44] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48 (1970) 443–453.

[45] O. Gotoh, An improved algorithm for matching biological sequences, Journal of Molecular Biology 162 (1982) 705–708.

[46] D.S. Hirschberg, A linear space algorithm for computing maximal common subsequences, Communications of the ACM 18 (1975) 341–343.

[47] X. Ye, D. Fan, W. Lin, A fast linear-space sequence alignment algorithm with dynamic parallelization framework, presented at the Proceedings of the 2009 Ninth IEEE International Conference on Computer and Information Technology, Volume 02, 2009.

[48] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, 1994.

[49] S. Dames, J. Durtschi, K. Geiersbach, J. Stephens, K.V. Voelkerding, Comparison of the Illumina Genome Analyzer and Roche 454 GS FLX for resequencing of hypertrophic cardiomyopathy-associated genes, Journal of Biomolecular Techniques 21 (2010) 73–80.

[50] T. Holtegaard, D. Oglesbee, S. Middha, Y. Asmann, E. Klee, T. Themeau, A. McDonald, B. Eckloff, E. Wieben, W.E. Highsmith, Evaluation of two next generation sequencing platforms, Illumina Genome Analyzer and Roche 454, for sequencing the entire mitochondrial genome, Journal of Molecular Diagnostics 11 (2009) 617.

[51] P. Hernandez, M. Martis, G. Dorado, M. Pfeifer, S. Gálvez, S. Schaaf, N. Jouve, H. Šimková, M. Valárik, J. Doležel, K.F.X. Mayer, Next-generation sequencing and syntenic integration of flow-sorted arms of wheat chromosome 4A exposes the chromosome structure and gene content, The Plant Journal 69 (2012) 377–386.

[52] X. Huang, A. Madan, CAP3: a DNA sequence assembly program, Genome Research 9 (1999) 868–877.

[53] M.J. Chaisson, D. Brinza, P.A. Pevzner, De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19 (2009) 336–346.

[54] D.R. Zerbino, E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Research 18 (2008) 821–829.

[55] N.G. De Bruijn, A combinatorial problem, Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen 49 (1946) 758–764.

[56] B.D. Lubachevsky, Synchronization barrier and related tools for shared memory parallel programming, International Journal of Parallel Programming 19 (1990) 225–250.

[57] J.D. Thompson, D.G. Higgins, T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research 22 (1994) 4673–4680.


[58] K.-B. Li, ClustalW-MPI: ClustalW analysis using distributed and parallel computing, Bioinformatics 19 (2003) 1585–1586.

[59] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research 25 (1997) 3389–3402.

[60] M. Mirto, S. Fiore, I. Epicoco, M. Cafaro, S. Mocavero, E. Blasi, G. Aloisio, A bioinformatics grid alignment toolkit, Future Generation Computer Systems 24 (2008) 752–762.

[61] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution 4 (1987) 406–425.

[62] P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy. The Principles and Practice of Numerical Classification, 1973.

[63] D. Mikhailov, H. Cofer, R. Gomperts, Performance optimization of ClustalW: parallel ClustalW, HT Clustal and MULTICLUSTAL, in: White Papers, Silicon Graphics, Mountain View, CA, 2001.

[64] J.C. Phillips, J.E. Stone, K. Schulten, Adapting a message-driven parallel application to GPU-accelerated clusters, presented at the Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Austin, Texas, 2008.

[65] Y. Liu, B. Schmidt, D.L. Maskell, CUDASW++2.0: enhanced Smith–Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions, BMC Research Notes 3 (2010) 93.

[66] T. Rognes, Faster Smith–Waterman database searches with inter-sequence SIMD parallelisation, BMC Bioinformatics 12 (2011) 221.

[67] A. Ghuloum (2008, 2010/12/27). Research@Intel — Unwelcome Advice. http://blogs.intel.com/research/2008/06/unwelcome_advice.php.

[68] P. Balaji, D. Buntinas, D. Goodell, W. Gropp, T. Hoefler, S. Kumar, E. Lusk, R. Thakur, J.L. Träff, MPI on millions of cores, Parallel Processing Letters 21 (2011) 45–60.

Francisco J. Esteban is an Analyst at the Informatics Service of Cordoba University (Spain). His research interests include network protocols, grid computing, parallel programming and bioinformatics. He earned his degree in Telecommunications Engineering at Madrid Polytechnic University (Spain) in 1991.

David Díaz is a graduate student at the Department of Languages and Computer Science of Malaga University (Spain). His research interests include parallel programming, bioinformatics application development and optimization, bioinformatics service integration and computer architectures. He earned his BS degree in Technical Engineering in Computer Systems at Malaga University in 2008.

Pilar Hernández is a Tenured Scientist at the Institute for Sustainable Agriculture (IAS) of the Spanish Council for Scientific Research (CSIC) at Cordoba (Spain). Her research interests include exploring the possibilities of parallel computing in the analysis of next-generation sequencing data. She is currently a member of the editorial boards of the journals 'Hereditas' and 'International Journal of Plant Genomics'. She is also a member of the Coordinating Committee of the International Wheat Genome Sequencing Consortium (IWGSC; www.wheatgenome.org) and the Coordination Committee of the European Triticeae Genomics Initiative (ETGI; http://www.etgi.org). She received her Agricultural Engineering degree (1993) and her Ph.D. (1998) from Cordoba University.

Juan A. Caballero is a Professor of Statistics at the Department of Statistics and Operational Research of Cordoba University (Spain), where he is Vice-chancellor for Information and Communications Technologies (ICT). His research and teaching interests include statistical simulation and distribution-free methods. He is an elected member of the Conference of Chancellors of Spanish Universities (CRUE; http://www.crue.org). He received his Ph.D. degree in Mathematics in 1991 from Granada University (Spain).

Gabriel Dorado is a Tenured Full Professor at the Department of Biochemistry and Molecular Biology of Cordoba University (Spain). He received the degrees of Bachelor of Science (1983), Master of Science (1983) and Ph.D. (1986) in Biology from Cordoba University. His research interests focus on improving university teaching, as well as on using molecular biology and bioinformatics tools to address biotechnology challenges. He is the leader of three teams: two of them to improve lecturing, and one for biotechnology. His current h-index is 18, with 103 entries indexed by the Web of Knowledge (Thomson Reuters). He is editor of three books (one published in 2009 about biotechnology and two in 2012 about lecturing and molecular biology).

Sergio Gálvez is a Professor at the Languages and Computer Sciences Department of Malaga University (Spain). His research interests include optimization of algorithms applied to bioinformatics, bioinformatics services integration, and parallelization of algorithms for many-core architectures. He works actively in the regional technological community, spreading the Java and Oracle platforms. He is also the author of many lectures and books related to these technologies. He is assessor and co-founder of the G2CREA Collaborative Innotechnologies Company, established in the Technology Park of Andalusia (Spain). He received his MS (1995) and Ph.D. (2000) degrees in Computer Science from Malaga University (Spain).