Long Read Alignment with Parallel MapReduce Cloud Platform

Embed Size (px)

Citation preview

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    1/15

    Hindawi Publishing CorporationISRN BiomathematicsVolume , Article ID ,pageshttp://dx.doi.org/.//

    Review ArticleAn Overview of Multiple Sequence Alignments and CloudComputing in Bioinformatics

    Jurate Daugelaite,1Aisling O Driscoll,2 and Roy D. Sleator1

    Department o Biological Sciences, Cork Institute o Technology, Rossa Avenue, Bishopstown, Cork, Ireland Department o Computing, Cork Institute o Technology, Rossa Avenue, Bishopstown, Cork, Ireland

    Correspondence should be addressed to Roy D. Sleator; [email protected]

    Received May ; Accepted June

    Academic Editors: M. Glavinovic and X.-Y. Lou

    Copyright Jurate Daugelaite et al. Tis is an open access article distributed under the Creative Commons AttributionLicense, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properlycited.

    Multiple sequence alignment (MSA) o DNA, RNA, and protein sequences is one o the most essential techniques in the elds omolecular biology, computational biology, and bioinormatics. Next-generation sequencing technologies are changing the biologylandscape, ooding the databases with massive amounts o raw sequence data. MSA o ever-increasing sequence data sets isbecoming a signicant bottleneck. In order to realise the promise o MSA or large-scale sequence data sets, it is necessary orexisting MSA algorithms to be run in a parallelised ashion with the sequence data distributed over a computing cluster orserver arm. Combining MSA algorithms with cloud computing technologies is thereore likely to improve the speed, quality,

    and capability or MSA to handle large numbers o sequences. In this review, multiple sequence alignments are discussed, with aspecic ocus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and thenext generation o cloud base MSA algorithms is introduced.

    1. Introduction

    Multiple sequence alignments (MSA) are an essential andwidely used computational procedure or biological sequenceanalysis in molecular biology, computational biology, andbioinormatics. MSA are completed where homologoussequences are compared in order to perorm phylogenetic

    reconstruction, protein secondary and tertiary structureanalysis, and protein unction prediction analysis []. Bio-logically good and accurate alignments can have signicantmeaning, showing relationships and homology between di-erent sequences, and can provide useul inormation, whichcan be used to urther identiy new members o proteinamilies. Te accuracy o MSA is o critical importance due tothe act that many bioinormatics techniques and proceduresare dependent on MSA results [].

    Due to MSA signicance, many MSA algorithms havebeen developed. Unortunately, constructing accurate mul-tiple sequence alignments is a computationally intense andbiologically complex task, andas such, no current MSAtool is

    likely to generate a biologically perect result. Tereore, thisarea o research is very active, aiming to develop a methodwhich can align thousands o sequences that are lengthyand produce high-quality alignments and in a reasonabletime [,]. Alignment speed and computational complexityare negatively affected when the number o sequences to bealigned increases. Te recent advances in high throughput

    sequencing technologies means that this sequence outputis growing at an exponential rate, the biology, landscapebeing punctuated by a number o large-scale projects suchas the Human Genome Project [], Genomes Project[], and Genome K Project []. Indeed, technologies suchas Roche/ [], Ilumina [], and SOLiD [] are capable oproducing Giga basepairs (Gbp) per machine per day [].Te entire worldwide second-generation sequencing capacitysurpasses Pbp per year (recorded in ) and is continuingto increase yearly by a actor o ve []. Other large-scaledata is emerging rom high-throughput technologies, suchas gene expression data sets, protein D structures, protein-protein interaction, and others, which are also generating

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    2/15

    ISRN Biomathematics

    huge sequence data sets. Te analysis and storage o thegrowing genomic data represents the central challenge incomputational biology today.

    As the protein alignment problem has been studied orseveral decades, studies have shown considerable progress inimproving the accuracy, quality, and speed o multiple align-

    ment tools, with manually rened alignments continuingto provide superior perormance to automated algorithms.However, more than three sequences o biologically relevantlength can be difficult and timeconsuming to align manually;thereore, computational algorithms are used as a matter ocourse []. Sequences can be aligned using their entire length(global alignment) or at specic regions (local alignment).Multiple sequence alignment or protein sequences is muchmore difficult than the DNA sequence equivalent (containingonly nucleotides) due to the act that there are differentamino acids. Global optimization techniques, developed inapplied mathematics and operations research, provide ageneric toolbox or solving complex optimization problems.Global optimization is now used on a daily basis, and itsapplication to the MSA problem has become a routine [].Local alignments are preerable; however, they can be chal-lenging to calculate due to the difficulty associated with theidentication o sequence regions o similarity. Te two majoraspects o importance or MSA tools or the user are biolog-ical accuracy and the computational complexity. Biologicalaccuracy concerns how close the multiple alignments are tothe true alignment and are the sequences aligning correctly,showing insertions, deletions, or gaps in the right positions.Computational complexity reers to the time, memory, andCPU requirements. Complexity is o increasing relevanceas a result o the increasing number o sequences neededto be aligned. Te complexity o a primal MSA tools wasalways (), where is complexity, is length o thesequence, and is the number o sequences to be aligned.Until recently, this was not a problem because was alwayssmaller than ; thereore, most algorithms concentrated onhow to deal with lengthy sequences rather than the numbero sequences, and now the situation has changed, where alot o alignments have larger than ; thereore, new andmore recent MSA algorithms are concentrating not only onthe length o sequences but also on the increasing number osequences [].

    A wide range o computational algorithms have beenapplied to the MSA problem, including slow, yet accu-rate, methods like dynamic programming and aster but

    less accurate heuristic or probabilistic methods. Dynamicprogramming (DP) is a mathematical and computationalmethod which reers to simpliying a complicated problemby subdividing it into smaller and simpler components ina repeated manner. Te dynamic programming techniquecan be applied to global alignments by using methodssuch as the Needleman-Wunsch algorithm [] and localalignments by using the Smith-Waterman algorithm []. Upto the mid-s, the traditionalmultiple sequence alignmentalgorithms were only best suited or two sequences, so whenit came to producing multiple sequence alignment with morethan two sequences, it was ound that completing the align-ment manually was aster than using traditional dynamic

    programming algorithms []. Dynamic programming algo-rithms are used or calculating pairwise alignments (twosequence alignments) with the time complexity o(). Intheory, this method could be extended to more than twosequences; however, in practice, it is too complex, becausethe time and space complexity becomes very large [].

    Tereore, producing multiple sequence alignment requiresthe use o more sophisticated methods than those usedin producing a pairwise alignment, as it is much morecomputationally complex. Finding a mathematically optimalmultiple alignment o a set o sequences can generally bedened as a complex optimization problem or NP-completeproblem as it must identiy an MSA with the highest scorerom the entire set o alignments; thereore, heuristic (bestguess) methods must be used.

    2. Multiple Sequence Alignment Algorithms

    Te most popular heuristic used rom which the major-ity o multiple sequence alignments are generated is thatdeveloped by Feng and Doolittle [], which they reerredto as progressive alignment [,]. Progressive alignmentworks by building the ull alignment progressively, rstlycompleting pairwise alignments using methods such as theNeedleman-Wunsch algorithm, Smith-Waterman algorithm,k-tuple algorithm [], or k-mer algorithm [], and thenthe sequences are clustered together to show the relationshipbetween them using methods such as mBed and k-means[]. Similarity scores are normally converted to distancescores and guide trees are constructed using these scores byguide tree building methods such as Neighbour-Joining (NJ)[] and Unweighted Pair Group Method with ArithmeticMean UPGMA []. Once the guide tree is built, the multiplesequence alignment is assembled by adding sequences to thealignment one by one according to the guide tree, that is,the most similar sequences added rst and then graduallyadding more distant sequences. Unortunately, this heuristichas a greedy nature; that is, it only looks at two sequences ata time and ignores the remaining data and thereore cannotguarantee an optimal solution. Also, i mistakes are made inthe initialstages o the alignment,they cannot be xed in laterstages, and the mistake will continue throughout the align-ment process with the problem worsening as the number osequences increases. Progressive alignment is the oundationprocedure o several popular alignment algorithms such as

    ClustalW [], ClustalOmega [],MAFF[], Kalign [],Probalign [], MUSCLE [], DIALIGN [], PRANK [],FSA [], -Coffee [, ], ProbCons [],and MSAProbs[]. Different methods or producing multiple sequencealignment exist, and their use depends on user preerencesand sequence length and type, as shown inable .

    An improved version o the progressive alignmentmethod was developed called iterative progressive algo-rithms. Tese algorithms work in a similar manner toprogressive alignment; however, this approach repeatedlyapplies dynamic programming to realign the initial sequencesin order to improve their overall alignment quality, also atthe same time adding new sequences to the growing MSA.

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    3/15

    ISRN Biomathematics

    : ypes o multiple sequence alignment and corresponding algorithms.

    ypes o MSA alignment MSA algorithms

    Pairwise alignment Needleman-Wunsch, k-mer, k-tuple, and Smith-Waterman algorithms.

    Progressive alignment Clustal Omega, ClustalW, MAFF, Kalign, Probalign, MUSCLE, Dialign, ProbCons, and MSAProbs.

    Iterative progressive alignment PRRP, MUSCLE, DIALIGN, SAGA, and -COFFEE.

    Homology search tools BLAS, PSI-BLAS, and FASA.Structure incorporating alignment D-COFFEE, EXPRESSO, and MICAlign.

    Moti alignment PHI-Blast, GLAM.

    Short-read alignment Bowtie, Maq, and SOAP.

    Te iteration benets the alignment by correcting any errorsproduced initially, thereore improving the overall accuracyo the alignment []. Iterative methods are able to give %% more accurate alignments; however, they are limitedto alignments o a ew hundred sequences only []. Temost used iterative alignment algorithms include PRRP [],MUSCLE [], Dialign [], SAGA [], and -COFFEE[,].

    Multiple sequence alignments can also be constructedby using already existing protein structural inormation. Itis believed that by incorporating structural inormation tothe alignment, the nal MSA accuracy can be increased;thereore, most structure-based MSA are o higher qualitythan those based on sequence alignment only. Te reasonor structure-based MSA being o better quality is not dueto a better algorithm but rather an effect o structuresevolutionary stability that is, structures evolve more slowlythan sequences []. Te most popular structure and basedMSA is D-COFFEE [], and others include EXPRESSO[] and MICAlign [].

    Moti discovery algorithms are another type o MSAalgorithms that are used. Tese methods are used to ndmotis in the long sequences; this process is viewed as aneedle in a haystack problem, due to the act that thealgorithm looks or a short stretch o amino acids (moti)in the long sequence. One o the most widely used toolsor searching or motis is PHI-Blast [] and Gapped LocalAlignments o Motis (GLAM) [].

    Short sequence alignment algorithms are also beginningto emerge, primarily due to advances in sequencing tech-nologies. Most genomic sequence projects use short readalignment algorithms such as Maq [], SOAP [], and the

    very ast Bowtie [] algorithms.

    3. Top Multiple SequenceAlignment Algorithms

    Te number o multiple sequence alignment algorithms isincreasing on almost monthly bases with-new algorithmspublished per month. Te computational complexity andaccuracy o alignments are constantly being improved; how-ever, there is no biologically perect solution as yet. ClustalW(one o the rst members o the Clustal amily afer ClustalV)is probably the most popular multiple sequence alignmentalgorithm, being incorporated into a number o so-called

    black box commercially available bioinormatics packagessuch DNASAR, while the recently developed Clustal Omegaalgorithm is the most accurate and most scalable MSAalgorithms currently available. ClustalW and Clustal Omegaare described later, and also a brie description is provided orthe -Coffee, Kalign, Ma, and MUSCLE multiple sequencealignment algorithms.

    .. ClustalW. ClustalW [] was introduced by Tompsonet al. in and quickly became the method o choice orproducing multiple sequence alignments as it presented adramatic increase in alignment quality, sensitivity, and speedin comparison with other algorithms. ClustalW incorporatesa novel position-specic scoring scheme and a weight-ing scheme or downweighting overrepresented sequencegroups, with the W representing weights. Firstly, thealgorithm perorms a pairwise alignment o all the sequences(nucleotide or amino acid) using the k-tuple method byWilbur and Lipman [] which is a ast, albeit approximate,method or the Needleman-Wunsch method [] which isknown as the ull dynamic programming method. Tesemethods calculate a matrix which shows the similarity oeach pair o sequences. Te similarity scores are convertedto distance scores, and then the algorithm uses the distancescores to produce a guide tree, using the Neighbour-Joining(NJ) method [] or guide tree construction. Te last stepo the algorithm is the construction o the multiple sequencealignment o all the sequences. Te MSA is constructedby progressively aligning the most closely related sequencesaccording to the guide tree previously produced by the NJmethod (seeFigure or an overview).

    ... Pairwise Alignment. Te k-tuple method [], a astheuristic best guess method, is used or pairwise alignmento all possible sequence pairs. Tis method is specicallyused when the number o sequences to be aligned is large.Te similarity scores are calculated as the number o k-tuplematches (which are runs o identical residues, usually or or protein residues or or nucleotide sequences) inthe alignment between a pair o sequences. Similarity scoreis calculated by dividing the number o matches by the sumo all paired residues o the two compared sequences. Fixedpenalties or every gap are subtracted rom the similarityscore with the similarity scores later converted to a distance

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    4/15

    ISRN Biomathematics

    Pairwise alignmentsby k-tuple method

    Guide treeconstruction by

    Neighbour-Joiningmethod

    Progressive alignment

    Amino acid ornucleic acid sequence

    MSA

    Distance scores

    Distance scores

    F : ClustalW algorithm, which works by taking an inputo amino acid or nucleic acid sequences, completing a pairwisealignment using the k-tuple method, guide tree construction usingthe Neighbour-Joining method, ollowedby a progressive alignmentto output a multiple sequence alignment.

    score by dividing the similarity score by and subtractingit rom . to provide the number o differences per site.

    Ten, all o the k-tuples between the sequences arelocated using a hash table. A dot matrix plot between the twosequences is produced with each k-tuple match representedas a dot. Te diagonals with the most matches in the plot areound and marked within a selected Window Size o eachtop diagonal. Tis sets the most likely region or similaritybetween the two sequences to occur. Te last stage o the k-tuple method is to nd the ull arrangement o all k-tuplematches by producing an optimal alignment similar to theNeedleman-Wunsch method but only using k-tuple matchesin the set window size, which gives the highest score. Te

    score is calculated as the number o exactly matching residuesin the alignment minus a gap penalty or every gap that wasintroduced.

    ... Guide Tree Construction. ClustalW produces a guidetree according to the Neighbor-Joining method. Te NJmethod is ofen reerred to as the star decomposition method[]. Te NJ method keeps track o nodes on a tree ratherthan a taxa (a taxonomic category or group, such as phy-lum, order, amily, genus, or species) or clusters o taxa.Te similarity scores are used rom the previous k-tuplemethod and stored in a matrix. A modied distance matrixis constructed in which the separation between each pair

    o nodes is adjusted by calculating an average value ordivergence rom all other nodes. Te tree is then built bylinking the least distant pair o nodes. When two nodes arelinked, their common ancestral node is added to the treeand the terminal nodes with their respective branches areremoved rom the tree. Tis process allows the conversiono the newly added common ancestor into a terminal nodetree o reduced size. At each stage in the process, two terminalnodes are replaced by onenew node. Te process is completedwhen two nodes remain separated by a single branch. Tetree produced by the NJ method is un-rooted and its branchlengths are proportional to divergence along each branch.Te root is placed at the position at which it can make the

    equal branch length on either side o the root. Te guide treeis then used to calculate weight or each sequence, whichdepends on thedistancerom branchto theroot. I a sequenceshares a common branch with another sequence, then thetwo or more sequences will share the weight calculated romthe shared branch, and the sequence lengths will be added

    together and divided by the number o sequences sharing thesame branch.

    ... Progressive Alignment. ClustalWs progressive align-ment uses a series o pairwise alignments to align sequencesby ollowing the branching order o the guide tree previouslyconstructed by the NJ method. Te procedure starts at thetips o the rooted tree proceeding towards the root. At eachstep, a ull dynamic programming algorithm is used with aresidue weight matrix (BLOSUM) and penalties or openingand extending gaps.

    .. Clustal Omega. Clustal Omega is the latest MSA algo-

    rithm rom the Clustal amily. Tis algorithm is used toalign protein sequences only (though nucleotide sequencesare likely to be introduced in time). Te accuracy o ClustalOmega on small numbers o sequences is similar to otherhigh-qualityaligners; however, on large sequence sets, ClustalOmega outperorms other MSA algorithms in terms o com-pletion time and overall alignment quality. Clustal Omega iscapable o aligning , sequences on a single processorin a ew hours []. Te Clustal Omega algorithm producesa multiple sequence alignment by rstly producing pairwisealignments using the k-tuple method. Ten, thesequences areclustered using the mBed method. Tis is ollowed by the k-means clustering method. Te guide tree is next constructed

    using the UPGMA method. Finally, the multiple sequencealignment is produced using the HHalign package, whichaligns two prole hidden Markov models (HMM) as showninFigure .

    ... Pairwise Alignment. Pairwise alignment o ClustalOmega is produced using the k-tuple method, the sametechnique as employed by ClustalW, described earlier.

    ... Sequence Clustering. Afer the similarity scores aredetermined rom the pairwise alignment, Clustal Omegaemploys the mBed method which has a complexity o( log). mBed works by emBedding each sequence in

    a space odimensions, where is proportional to log.Each sequence is then replaced by an element vector. Eachelement is the distance to one o reerence sequences.Tese vectors can then be clustered extremely quickly bymethods such as k-means or UPGMA [].

    ... k-Means Clustering. Clustal Omega uses the k-means++ clustering method by Arthur and Vassilvitskii[]. Te k-means method is a widely used clusteringtechnique which seeks to minimise the average squareddistance between points in the same cluster. Tis method is

    very simplistic and ast at clustering sequences. k-means++successully overcomes the problems o dening initial

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    5/15

    ISRN Biomathematics

    Pairwise alignmentsby k-tuple method

    Sequence clusteringby mBed method

    Sequence clusteringby k-means method

    Guide treeconstruction byUPGMA method

    Progressive alignmentby HHalign package

    Guidetree byUPGM

    A

    Guidetree byUPGM

    A

    Guidetree byUPGM

    A

    Amino acidsequences

    MSA

    Distance scores

    Distance scores

    Distance scores

    Distance scores

    Distance scores

    F : Clustal Omega algorithm, which worksby takingan inputo amino acid sequences, completing a pairwise alignment using thek-tuple method, sequence clustering using mBed method, and k-means method, guide tree construction using the UPGMA method,ollowed by a progressive alignment using HHalign package tooutput a multiple sequence alignment.

    cluster centres or k-means and improves the speed andaccuracy o the k-means method [].

    ... Guide Tree Construction. Clustal Omega uses theUPGMA method or sequence guide tree construction.UPGMA is a straightorward method o tree constructionwhich uses a sequential clustering algorithm in which localhomology between operational taxanomic units (OUs) isidentied in order o similarity. Te tree is constructed ina stepwise ashion. Pairs o OUs that are most similar arerst determined and then are treated as a new single OU.Subsequently, rom the new group o OUs, the pair withthe highest similarity is identied and clustered. Tis processcontinues until only two OUs remain [].

    ... Progressive Alignment. Clustal Omega uses theHHalign package by Johannes Soding [] orcompleting progressive alignments. Tis method alignstwo prole hidden Markov models, instead o a prole-prole comparison; this improves the sensitivity andalignment quality signicantly. All sequence-prole andsequence HMM comparison methods are based on thelog-odds score. Te log-odds score is a measure or howmuch more probable it is that a sequence is emitted by anHMM rather than by a random null model.

    .. T-Coffee. -Coffee, which stands or tree-based con-sistency objective unction or alignment evolution, is

    Weighting signal

    addition

    Extension

    Primary library

    Extension library

    Progressive

    alignment

    ClustalW library(global pairwise alignment)

    Lalign library(local pairwise alignment)

    A

    AC

    C

    B

    A

    CB

    AB

    AB

    B

    AC

    CB

    CB

    F : -COFFEE diagram. Steps involvedin producing multiplesequence alignment by -Coffee method.

    an iterative MSA algorithm. -Coffee provides a simple andexible means o producing multiple sequence alignmentsby using heterogeneous data sources which are providedto -Coffee via library o global and local pairwise align-ments. In the progressive alignment, pairwise alignmentsare completed rst in order to produce a distance matrix.Tis matrix is then used to produce a guide tree usingthe Neighbour-Joining method. Te tree is then used togroup the sequences together during the multiple sequencealignment process. Te closest two sequences on the tree arealigned rst using normal dynamic programming method.

    Te alignment uses weighting in the extended library asshown inFigure . Tis is done in order to align the residuesin two sequences. Te next two closest sequences suggestedby the guide tree or prealigned group o sequences arealways joined. Tis continues until all the sequences havebeen aligned. o align two groups o prealigned sequences,the scores rom the extended library are used; however, theaverage library scores in each column o existing alignmentare taken. -Coffee increases the accuracy o the alignments% in comparison to ClustalW; however, the algorithmpresents disadvantages such as weak scalability. -Coffee canonly align maximum sequences without loss o accuracy[].

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    6/15

    ISRN Biomathematics

    .. MAFFT. Another good quality, highly accurate mul-tiple sequence alignment is an algorithm called MAFF.MAFF uses two novel techniques; rstly, homologousregions are identied by the ast Fourier transorm (FF).In this method, the amino acid sequences are converted toa sequence composed o volume and polarity values o each

    amino acid residue. Secondly, a simplied scoring systemis introduced which reduces CPU time and increases theaccuracy o alignments. MAFF uses two-cycle heuristics,the progressive method (FF-NS-) and iterative renementmethod (FF-NS-i). In the (FF-NS-) method, low-qualityall-pairwise distances are rapidly calculated, a provisionalMSA is constructed, rened distances are calculated rom theMSA, and then the second method (FF-NS-i) is perormed.(FF-NS-i) is a one cycle progressive method; it is aster andless accurate than the FF-NS-. Part tree option is availableto alignments o, sequences, and this method allowsscalability [].

    .. Kalign. Kalign is yet another good quality multiplesequence alignment algorithm. Te algorithm ollows a strat-egy that is very similar to the standard progressive methodsor sequence alignments, such as pairwise distances whichare calculated rstly by using k-tuple method adopted romClustalW. Te guide tree is constructed using either UPGMAor Neighbour-Joining method, and progressive alignment iscompleted by ollowing the guide trees. In contrast to theexisting methods, what makes this algorithm different is theuse o Wu-Manber approximate string-matching algorithm.Tis method is used in the distance calculation and in thedynamic programing used to align the proles. Tis methodallows string matching with mismatches. Also, the distances

    between two strings are measured using Levenshtein editdistance.

    .. MUSCLE. MUSCLE stands or multiple sequence com-parison by log expectation. MUSCLE uses two distancemeasures, kmer distance or unaligned pairs o sequencesand the Kimura distance method or aligned pairs osequences. Guide trees are produced using UPGMA method.A progressive alignment is then constructed ollowing theorder o the guide tree. Tis process produces an initialmultiple sequence alignment. Te program carries out stagetwo which is completed in order to improve the progres-sive alignment. Te initial guide tree is reestimated using

    Kimura distance method, and this method is known to bemuch more accurate than kmer, and however it requires analignment. Once the distances are computed, the UPGMAmethod reclusters the sequences producing second guidetree. A progressive alignment is calculated ollowing secondtree, producing second multiple sequence alignment. A newmultiple sequence alignment is produced using both therst multiple sequence alignment and the second one. Newmultiple sequence alignment is produced by realigning thetwo proles. I the SP score is improved on the second MSA,then the new alignment is kept and the old is discarded;otherwise, it is deleted and the rst alignment is used[].

    4. Cloud Computing Technologies forthe MSA Problem

    In order to realise the promise o MSA or large-scalesequence data sets, it is necessary orexistingMSA algorithmsto be executed in a parallelised ashion with the sequence

    data distributed over a computing cluster or server arm.High perormance computing has become very importantin large-scale data processing. Te message passing interace(MPI) and graphics processing unit (GPU) are the primalprogramming APIs or parallel computing. In recent years,cloud computing has received signicant attention, thoughmany different denitions o the technology exist. As anofen mistakenly used analogy or the Internet or anythingonline, the cloud is a amiliar buzzword. Alternatively,some analysts tend to provide a very narrow denition ocloud computing as an updated version o utility computing[]. While such a denition is not inaccurate, it doesnot describe the whole picture. A more precise denitionis provided by the National Institute o Standards andechnology (NIS) who describe it as a pay-per-use modelo enabling available, convenient and on-demand networkaccess to a shared pool o congurable computing resourcesthat can be rapidly provisioned and released with minimalmanagement effort or service provider interaction []. Teterm cloud computing was cocoined by an Irish entrepreneurSean OSullivan, coounder o Avego Ltd., along with GeorgeFavaloro rom Boston, Massachusetts []. It should benoted that while cloud computing is a recent technology,some o the concepts behind cloud computing are not new,such as distributed systems, grid computing, and parallelisedprogramming. Cloud services remove the need or the userto be in the same physical location as the hardware that stores

    its data and/or applications. Te cloud can do both, own andstore the hardware and the sofware needed or a user torun their applications or processes. ruly, one o the biggestenablers o cloud computing is the virtualisation technology.Virtualisation is a layered approach or running multipleindependent virtual machines (VM) on a single physicalmachine, sharing the resources yet running on its ownoperating systems and applications []. Te concept o virtu-alisation canbe applied to devices, servers, operating systems,applications, and networks. Virtualisation is benecial due toproviding easy access to data, the ability to share applicationsrom central environment, and it reduces the cost associatedwith data backups, maintenance personnel, and sofware

    licensing []. A virtual machine (VM) is a piece o sofwarethat runs on a local machine emulating the properties o acomputer. Te emulator provides a virtual central processingunit (CPU), network card, and hard disk. Popular productssuch asVMware []andKVM[] provide virtual machinesto customers. Te differences between traditional and virtualserver models can be seen inFigure .

    Cloud computing is an inormation technology disci-pline, which provides computing, such as the necessarystorage space and processing power, on demand and as aservice. Such a rental model is ofen reerred to as Inras-tructure as a Service (IaaS) or utility computing. However,another aspect o cloud computing is Platorm as a Service

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    7/15

    ISRN Biomathematics

    ApplicationlApplicationi a

    lApplicationi a

    pApplicationnatiopic

    Application

    UnixVM

    Application

    UnixVM

    Application

    Unix

    VM

    Application

    xx

    pp icatio

    UnixVM

    Application

    LinuxVM

    Application

    LinuxVM

    Application

    Linux

    VM

    Application

    ux

    pp icatio

    LinuxVM

    Application

    WindowsVM

    Application

    WindowsVM

    Application

    Windows

    VM

    Application

    owsows

    WindowsVM

    Application

    Traditional server model Virtual server model

    Operating system Physical server operating system

    Physical server Physical server

    F : ransition rom traditional computing where applications interact with the hardware via one instance o the operating system (OS)to virtualised computing where multiple OS images share the hardware resources.

    : Cloud computing deployment models.

    Deploymentmodel

    Description

    Public

    A cloud inrastructure that is owned by a cloudprovider, who made resources such asinrastructure, sofware, and platorm available togeneral public or pay-per-use basis, viaInternet.

    PrivateA cloud which is owned and used by a singleorganisation. Tis model provides more securitymeasures in comparison to other models.

    Community A cloud which is shared amongst several users or

    organizations.

    Hybrid A cloud that is a combination o public,

    community, and private clouds.

    (PaaS) which allows users to build sofware applicationsby building on sofware libraries or development platorms

    already developed by the cloud provider. Another enablerincludes advances in Big Data technologies that haverealised the potential o distributed systems, grid computing,and parallelised programming enabling developers to ocuson solving the problem at hand rather than maintainingthe robustness o the distributed system and the parallelisedprogramming structure. Te negative consequences causedby a ailure o a single machine in a distributed system havebeen eliminated by improving the overall distributed systemsstructure []. Cloud computing provides our deploymentmodels. Selecting a deployment model depends on the users,organisations requirements according to their suitability, asshown inable .

    In order to use cloud computing services, one require-ment has to be met, which is Internet connection. Clouds areaccessed via the Internet which is o benet; it enables usersor organisations to access their stored data, to download orupload data at any given time or place through any devicewhich has wireless or wired Internet connection.

    Cloud computing is ofen considered to provide only

    rental o computing storage and power; however, cloudcomputing provides many service models according to anXaaS paradigm, representing X as a Service, Anything asa Service, or Everything as a Service. Te acronym reersto an increasing number o services that are provided overthe Internet rather than the local services. XaaS is the essenceo cloud computing. Te primary service models o cloudcomputing are Sofware as a service (SaaS), Inrastructure asa service (IaaS) and Platorm as a service (PaaS) as illustratedinFigure .

    .. Inrastructure as a Service (IaaS). An IaaS provider allowssubscribed users to completely outsource the storage and

    resources, such as hardware and sofware that the user mayrequire. IaaS companies provide offsite servers, storage, andalso networking hardware which users can rent and access ondemand. An IaaS service offers benets to users such as nomaintenance, no up-ront capital costs, / accessibility toapplications and data, and elastic inrastructure that allowsthe user to scale up and down on demand []. IaaS isofen reerred to as elastic computing as a consequence othis ability to scale up and down on demand. As seen inFigure , users maintain signicant management capabilitywhen it comes to this service model. Te user can create theirown VM, with a specic operating system and applications.Tis can be custom built or chosen rom an IaaS catalogue.

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    8/15

    ISRN Biomathematics

    IaaSinrastructure

    as a service

    PaaSplatorm as a

    service

    SaaSsofware as a

    service

    Applications

    Data

    Runtime

    O/S

    VM

    Storage

    Networking

    Servers

    Applications

    Data

    Runtime

    O/S

    VM

    Storage

    Networking

    Servers

    Applications

    Data

    Runtime

    O/S

    VM

    Storage

    Networking

    Servers

    Managedbyuser

    Managedbyvendor

    Managedbyvendor

    Managedbyvendor

    F : Cloud computing service models: Inrastructure as aService(IaaS), Platorm as a Service(PaaS), andSofwareas a Service(SaaS). Tis gure highlights the unctionality provided by thevendor to the user which largely depends on the users requirements.

    Some o the leading companies in the IaaS space includeAmazon Web Services (AWS) [], Microsof [], Rackspace[], and VMware []. Amazon Web Services (AWS) is theleading IaaS provider, widely recognized or providing themost reliable, scalable, cost-efficient, and user riendly web

    inrastructure. In , alone the company has earned billion based on providing public cloud services. Exampleso the services offered by AWS are Amazon Elastic ComputeCloud (EC), Amazon Simple Storage Service (S), AmazonSimpleDB, Amazon Simple Queue Service (SQS), AmazonSimple Notication Service (SNS), Amazon CloudFront, andAmazon Elastic MapReduce (EMR).

    Bioinormaticians use IaaS or building databases, storingdata, and in developing a pipeline or comparative analysison various genes and/or proteins. Tey can now also accessdatabases through AWS, who have started to provide bioin-ormatics data sets in their publically hosted datasets such asEnsembl [] and Genbank [].

    .. Sofware as a Service (SaaS). SaaS reers to cloud baseddelivery o sofware applications which are hosted by cloudproviders. SaaS eliminates the need to install sofware locallyon computers with the user instead accessing sofware viathe Internet. Te vendors own the applications and the usersmay pay a subscription ee to access them via a VM, whereall the applications are installed, without the necessity orthe user to have a physical copy o the sofware installedon their own device. SaaS providers run and maintainall necessary hardware and sofware. One o the top SaaSproviders, Salesorce.com, offers on-demand customer rela-tionship management (CRM) sofware solutions built on

    its own inrastructure. Salesorce.com does not sell licenceor this sofware, instead it charges a monthly subscriptionee starting rom per user per month and delivers thissofware directly to users via Internet []. Te Economistestimates that the market or SaaS is growing at % eachyear []. SaaS vendors have a complete management control

    o applications, O/S, Runtime, VM, and Servers as shown inFigure ; however, as a user is simply using the sofware anddoes not require control over the OS, and this model is notrestrictive and meets the users requirements.

    An example o SaaS used in bioinormatics is CloudBioLinux, which was developed at the J. Craig Venter Insti-tute. Cloud BioLinux is a publicly accessible Virtual Machine(VM) which offers an on-demand, cloud computingsolutionsor the bioinormatics eld. Users have access to a range oprecongured command line and graphical sofware appli-cations, documentation, and more than bioinormaticstools or applications suchas sequence alignments,clustering,tree construction, editing, and phylogeny. Cloud BioLinux isstored on Amazon EC and is reely available to its users [].

    .. Platorm as a Service (PaaS). PaaS providers allowsubscribed users to access the components that are requiredor the user to develop or operate applications. Tese PaaSsystems are web-based application development platorms,providing either end-to-end or partial environments orimplementing ull programs/algorithms online. Tis includestasks such as editing code, debugging, deployment, and run-time. As seen in Figure , users can build the application withthe vendors on-demand tools and collaborative developmentenvironment. Te benets o PaaS include the eliminationo complex evaluation, conguration, and management andcost reduction in buying, updating, and maintaining o allhardware and sofware needed or custom built applications[]. PaaS also saves time and resources, that is, no needto reinvent the wheel; developers simply build more com-plex systems using existing platorms. Some o the biggestPaaS platorms today are Google App Engine, MicrosofAzure, and MapReduce/Hadoop. Google App Engine wasrst released in and is used or developing and hostingweb applications. First created by Google in order to process

    vast amounts o data, MapReduce is a programming modeland an implementation or storing and processing large datasets. Using the MapReduce paradigm, the user species amap unction which analyses data with the reduce unction

    merging all the results associated with the values rom themap phase [].

    One such PaaS technology, Hadoop and Map/Reduce,driven by big data, distributes the data over commodityhardware and provides parallelised processing and analytics.MapReduce developed by Google is a general purpose, rela-tively easy-to-use parallel programming model that is perector carrying out analysis o large data sets on commodityhardware clusters. Additionally, Apache Hadoop is a sofwareramework that implements the distributed processing obig data sets across cluster arms based on the MapReducemodel. With the combination o MapReduce and HadoopDistributed File System (HDFS), Hadoop intends to enable

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    9/15

    ISRN Biomathematics

    : Related sofware and projects on MapReduce.

    Projects Description

    Avro A data serialization system. URL: http://avro.apache.org/

    Cassandra A highly scalable, consistent, distributed, and structured multimaster database. URL:

    http://cassandra.apache.org/

    Chukwa An open source data collection system or monitoring large distributed systems. URL:http://incubator.apache.org/chukwa/

    Dryad An inrastructure which allows the use o resources o a computer cluster or running data-parallel programs.

    URL: http://research.microsof.com/en-us/projects/dryad/

    Hadoop Common Te common utilities that support the other Hadoop subprojects. URL: http://hadoop.apache.org/

    Hadoop MapReduce A programming model and an associated implementation or processing and generating large data sets. URL:

    http://research.google.com/archive/mapreduce.html

    HaLoopA modied version o the Hadoop MapReduce ramework, which supports iterative applications by making thetask scheduler loop-aware and by adding various catching mechanisms. URL:https://code.google.com/p/haloop/

    HBase A Hadoop database, a distributed, scalable big data store. URL:http://hbase.apache.org/

    HDFS Hadoop Distributed File System is a distributed le system designed to run on commodity hardware. URL:

    http://hadoop.apache.org/docs/r../hds design.html

    Hive A data warehouse system or Hadoop that acilitates data summarization and ad hoc queries. URL:http://hive.apache.org/

    Mahout A scalable machine learning and data mining library.

    URL: http://mahout.apache.org/

    MapR A complete distribution or Apache Hadoop and HBase that includes Hive, Mahout, Pig, Cascading, and many

    other projects. URL: http://www.mapr.com/

    Pig A platorm or analysing large data sets that consists o high-level language or expressing data analysis

    programs. URL: http://pig.apache.org/

    Pregel A system or large-scale graph processing. URL:http://kowshik.github.com/JPregel/pregel paper.pd

    wister A support or iterative MapReduce computations.

    URL: http://www.iterativemapreduce.org/

    YARN Next Generation Apache Hadoop MapReduce Framework. URL:

    http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

    ZooKeeper A high-perormance manager or distributed applications. URL: http://hadoop.apache.org/zookeeper/

    reliable, scalable, and distributed computing. Hadoop []was initiated by Doug Cutting, who worked on the ApacheNutch project (Hadoop is named afer his sons toy, a stuffedyellow elephant). It has been designed to scale out rom aslittle as one server to thousands o machines each offeringlocal computation and storage. One o the biggest advantageso Hadoop is speed, being able to process data stored inbillions o records, overnight, a process which would havetaken several weeks to process []. Apache offers otherprojects similar to Hadoop, these include HBase, HaLoop,

    Chukwa, Cassandra, Hive, Mahout, Pig, ZooKeeper, andothers as outlined inable .

    Such technology provides a scalable and cost-efficientsolution to the big data challenge. Recent years have showna massive increase in the size o biological data sets and thegrowth o new, highly exible on-demand computing tech-nologies. Te data rom genomic, proteomic, and metage-nomic sequencing projects are increasing at exponentialrates, providing inormation or widening the overall insighto genomes and proteins; however, it is also introducingnew challenges such as need or increased storage space,higher power computation, and large data analysis. Cloudcomputing resources have the potential to aid in solving

    these problems, by offering a utility model o computing andstorage, such as almost unlimited storage capacity, anytimeusage, and cheap exible payment models. Effective use ocloud computing on large biological datasets requires dealingwith nontrivial problems o scale and robustness, sinceperormance-limiting actors can change substantially whena dataset grows. New computing paradigms are thus ofenneeded. Te use o cloud platorms also creates new oppor-tunities to make data widely available and share it amongstdifferent research laboratories by uploading data to the cloud.

    Cloud Web services such as Amazon Elastic Compute Cloud(EC) and Amazon Elastic MapReduce are commerciallyavailable, but there are also clouds that provide ree service;IBM/Google Cloud Computing University Initiative and theUnited States Department o Energys Magellan provide reeservices, so the users can upload their data by using a webinterace, and then they can perorm all o their operationson a remote client webpage.

    A case in point is Amazon Web Services (AWS) whichprovides a centralized repository o public data sets, includ-ing archives o GenBank, Ensembl, Genomes, ModelOrganism Encyclopedia o DNA Elements, Unigene, andInuenza Virus. As a matter o act, AWS contains multiple

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    10/15

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    11/15

    ISRN Biomathematics

    public datasets or a variety o scientic elds, such asbiology, astronomy, chemistry, climate, and economics [].All public datasets in AWS are delivered as services andthereore can be easily integrated into cloud-based applica-tions. Also, using cloud platorms would reduce duplicationand provide easy reproducibility by making the sequence

    datasets and computational methods easily available []. Bigdata technology algorithms are increasing on monthly bases,acilitating different unctional sequence analysis, as outlinedinable .

    .. Cloud-Based MSA-Next-Generation MSA. Te multi-ple sequence alignment algorithms certainly need to beimproved in order to be able to handle large amounts oDNA/RNA/proteinsequences and most importantly producemultiple sequence alignmentso highquality. At the moment,the only algorithm able to handle as much as ,sequences is Clustal Omega, which even at that takes aew hours on a single processor to produce an alignment

    o all the sequences. In a recent study by Sievers et al.,, standard automated multiple sequence alignmentpackages were compared with the main ocus being on howwell they scaled, aligning rom to , sequences.It was ound that most o the algorithms, or example, -Coffee, PSAlign, Prank, FSA, and Mummals, did not producealignments above sequences. Some o the algorithmsproduced alignment o max , sequences; these wereProbcons, MUSCLE, MAFF, ClustalW, and MSAProbs.Finally, the only MSA algorithmsthat completedalignmento, sequences were Clustal Omega, Kalign, and Part-ree.Furthermore, it was noted that the quality o the alignmentor each o the algorithms decreased progressively as thenumber o sequences increased []. Tis could possibly beexplained by the nature o progressive alignments, which areheuristic in nature, thereore introducing noise and mistakesat the start o the alignment. Te main concerns with scalingup and producing MSA o large sets o sequences are thecomputational complexity, the time it takes to produce thealignment and the accuracy o the nal alignment. Also, atpresent, there are no systematic benchmarks tests that canhandle testing alignments o massively increasing numbero sequences; thereore, new benchmarks must be developeddue to the act that new algorithms are created on monthlybases and soon will be able to align massive numbers osequences.

    By employing big data technologies and cloud comput-

    ing projects, the aim would be to develop an improvedversion o Clustal Omega which could produce alignmentso very large data sets and in a shorter time rame thanthe original Clustal Omega algorithm. Also, other popularmultiple sequence alignments could possibly be recoded,so it could complete MSA algorithm over a cluster omachines in a distributed, parallelised way by using theHadoop/MapReduce ramework. An example o multiplesequence alignment that is optimized in the cloud is theFASA algorithm published by Vijaykumar et al. in .Tis study presents an implementation o the FASA algo-rithm built on the Hadoop/MapReduce ramework and MPPDatabase. In this experiment it was observed that the time

    taken or sequence alignment increased as the number osequences also increased. However, the experiment alsoshowed that as the number o nodes increased the numbero alignments that was executed in parallel also increased,resulting in a time decrease or alignment completion [].Tis proves the theory o parallelization and the use o

    the cloud computing technologies or improving multi-ple sequence alignment tools. Parallel, distributed multiplesequence alignments in the cloud is likely our only realmeans o keeping pace with todays sequence tsunami andwill ultimately aid in the discovery o novel genes, entiremetabolic pathways, novel proteins and potentially medically

    valuable end-products rom the global metabolome [].Teimplication o such scalable worldwide analysis and sharingo data sets is enormous in that it will not only change livesbut may also save them.

    Conflict of Interests

    No potential conict o interests was disclosed.

    Acknowledgment

    Jurate Daugelaite is unded under the Embark Initiative by anIrish Research Council (IRC) Grant RS//. Aisling ODriscoll and Dr. Roy D. Sleator are Principal Investigators onClouDx-i an FP-PEOPLE--IAPP project.

    References

    [] C. Kemena and C. Notredame, Upcoming challenges ormultiple sequence alignment methods in the high-throughput

    era, Bioinormatics, vol. , no. , pp. , .[] R. C. Edgar and S. Batzoglou, Multiple sequence alignment,

    Current Opinion in Structural Biology, vol. , no. , pp. , .

    [] C. Notredame, Recent evolutions o multiple sequence align-ment algorithms, PLoS Computational Biology, vol. , no. ,article e, .

    [] Human Genome Project Inormation, ,http://web.ornl.gov/sci/techresources/Human Genome/home.shtml.

    [] Home, Genomes, ,http://www.genomes.org/.

    [] G. K. C. O.Scientists,GenomeK:a proposal toobtain whole-genome sequence or , vertebrate species, Journal oHeredity, vol. , no. , pp. , .

    [] Lie Sciences, a Roche Company, , http://www..com/.

    [] Illumina, Inc, ,https://ww w.illumina.com/.

    [] SOLiDM System, ,http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/next-generation-systems/solid--system.html?CID=FL- solid.

    [] H. Li and N. Homer, A survey o sequence alignment algo-rithms or next-generation sequencing, Briengs in Bioinor-matics, vol. , no. , pp. , .

    [] SourceForge.net: jnomics, , http://sourceorge.net/apps/mediawiki/jnomics/index.php?title=Jnomics.

    [] C. B. Do and K. Katoh, Protein multiple sequence alignment,Methods in Molecular Biology, vol. , pp. , .

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    12/15

    ISRN Biomathematics

    [] R. C. Edgar, MUSCLE: a multiple sequence alignment methodwith reduced time and space complexity,BMC Bioinormatics,vol. , article , .

    [] S. B. Needleman and C. D. Wunsch, A general method appli-cable to the search or similarities in the amino acid sequenceo two proteins,Journal o Molecular Biology, vol. , no. , pp., .

    [] . F. Smith and M. S. Waterman, Identication o commonmolecular subsequences,Journal o Molecular Biology, vol. ,no. , pp. , .

    [] I. M. Wallace, G. Blackshields, and D. G. Higgins, Multiplesequence alignments, Current Opinion in Structural Biology,vol. , no. , pp. , .

    [] K. Katoh and H. oh, Recent developments in the MAFFmultiple sequence alignment program, Briengs in Bioinor-matics, vol. , no. , pp. , .

    [] D.-F. Feng and R. F. Doolittle, Progressive sequence alignmentas a prerequisitetto correct phylogenetic trees, Journal oMolecular Evolution, vol. , no. , pp. , .

    [] W. J. Wilbur and D. J. Lipman, Rapid similarity searches o

    nucleic acid and protein data banks, Proceedings o the NationalAcademy o Sciences o the United States o America, vol. , no., pp. , .

    [] R. C. Edgar, MUSCLE: multiple sequence alignment with highaccuracy and high throughput,Nucleic Acids Research, vol. ,no. , pp. , .

    [] F. Sievers, A. Wilm, D. Dineen et al., Fast, scalable generationo high-quality protein multiple sequence alignments usingClustal Omega,Molecular Systems Biology, vol. , article ,.

    [] N. Saitou and M. Nei, Te neighbor-joining method: a newmethod or reconstructing phylogenetic trees,Molecular Biol-ogy and Evolution, vol. , no. , pp. , .

    [] I. Gronau and S. Moran, Optimal implementations o UPGMAand other common clustering algorithms, Inormation Process-ing Letters, vol. , no. , pp. , .

    [] J. D. Tompson, D. G. Higgins, and . J. Gibson, CLUSALW: improving the sensitivity o progressive multiple sequencealignment through sequence weighting, position-specic gappenalties and weight matrix choice, Nucleic Acids Research, vol., no. , pp. , .

    [] K. Katoh and D. M. Standley, MAFF multiple sequencealignment sofware version : improvements in perormanceand usability,Molecular Biology and Evolution, vol. , no. ,pp. , .

    [] . Lassmann and E. L. L. Sonnhammer, Kalignan accurateand ast multiple sequence alignment algorithm,BMC Bioin-

    ormatics, vol. , article , .

    [] U. Roshan and D. R. Livesay, Probalign: multiple sequencealignment using partition unction posterior probabilities,Bioinormatics , vol. , no. , pp. , .

    [] B. Morgenstern, DIALIGN: multiple DNA and proteinsequence alignment at BiBiServ, Nucleic Acids Research, vol. ,supplement , pp. WW, .

    [] A. Loytynoja and N. Goldman, Phylogeny-aware gap place-ment prevents errors in sequence alignment and evolutionaryanalysis,Science, vol. , no. , pp. , .

    [] R. K. Bradley, A. Roberts, M. Smoot et al., Fast statisticalalignment,PLoS Computational Biology, vol. , no. , ArticleID e, .

    [] P. Di ommaso, S. Moretti, I. Xenarios et al., -Coffee: aweb server or the multiple sequence alignment o protein andRNA sequences using structural inormation and homologyextension,Nucleic Acids Research, vol. , supplement , pp.WW, .

    [] C. Notredame, D. G. Higgins, and J. Heringa, -coffee: a novelmethod or ast and accurate multiple sequence alignment,Journal o Molecular Biology, vol. , no. , pp. , .

    [] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou,ProbCons: probabilistic consistency-based multiple sequencealignment,Genome Research, vol. , no. , pp. , .

    [] Y. Liu, B. Schmidt, and D. L. Maskell, MSAProbs: multiplesequence alignment based on pair hidden Markov models andpartition unction posterior probabilities, Bioinormatics, vol., no. , pp. , .

    [] D. W. Mount, Using iterative methods or global multiplesequence alignment,Cold Spring Harbor Protocols, vol. , no., Article ID pdb.top, .

    [] O. Gotoh, Optimal alignment between groups o sequencesand its application to multiple sequence alignment,Computer

    Applications in the Biosciences, vol. , no. , pp. , .[] C. Notredame and D. G. Higgins, SAGA: sequence alignment

    by genetic algorithm,Nucleic Acids Research, vol. , no. , pp., .

    [] J. D. Tompson, F. Plewniak, and O. Poch, A comprehensivecomparison o multiple sequence alignment programs,NucleicAcids Research, vol. , no. , pp. , .

    [] A. M. Lesk and C. Chothia, How different amino acidsequences determine similar protein structures: the structureand evolutionary dynamics o the globins,Journal o MolecularBiology, vol. , no. , pp. , .

    [] O. OSullivan, K. Suhre, C. Abergel, D. G. Higgins, andC. Notredame, DCoffee: combining protein sequences and

    structures within multiple sequence alignments, Journal oMolecular Biology, vol. , no. , pp. , .

    [] F. Armougom, S. Moretti, O. Poirot et al., Expresso: automaticincorporation o structural inormation in multiple sequencealignments using D-Coffee,Nucleic Acids Research, vol. ,supplement , pp. WW, .

    [] X. Xia, S. Zhang, Y. Su, and Z. Sun, MICAlign: a sequence-to-structure alignment tool integrating multiple sources oinormation in conditional random elds, Bioinormatics, vol., no. , pp. , .

    [] Z. Zhang, A. A. Schaffer, W. Miller et al., Protein sequencesimilarity searches using patterns as seeds, Nucleic AcidsResearch, vol. , no. , pp. , .

    [] M. C. Frith, N. F. W. Saunders, B. Kobe, and . L. Bailey,Discovering sequence motis with arbitrary insertions anddeletions,PLoS Computational Biology, vol. , no. , Article IDe, .

    [] H. Li, J. Ruan, and R. Durbin, Mapping short DNA sequenc-ing reads and calling variants using mapping quality scores,Genome Research, vol. , no. , pp. , .

    [] R. Li, Y. Li, K. Kristiansen, and J. Wang, SOAP: short oligonu-cleotide alignment program, Bioinormatics, vol. , no. , pp., .

    [] B. Langmead, C. rapnell, M. Pop, and S. L. Salzberg, Ultraastand memory-efficient alignment o short DNA sequences tothe human genome,Genome Biology, vol. , no. , article R,.

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    13/15

    ISRN Biomathematics

    [] A Denition o Te Cloud at Last?Web PerormanceWatch, ,http://blogs.keynote.com/the watch///or-as-long-as-ive-been-involved-in-writing-about-the-cloud-and-its-related-applications-ive-seen-the-basic-question.html.

    [] Virtualization is a key enabler o cloud computing, ,http://www.itnewsarica.com///virtualization-is-a-key-enabler-o-cloud-computing/ .

    [] D. Arthur and S. Vassilvitskii, k-means++: the advantages ocareul seeding, inProceedings o the th Annual ACM-SIAMSymposium on Discrete Algorithms, Society or Industrial andApplied Mathematics, .

    [] J. Soding, Protein homology detection by HMM-HMM com-parison, Bioinormatics, vol. , no. , pp. , .

    [] F. Sievers, D. Dineen, A. Wilm, and D. G. Higgins, Makingautomated multiplealignments o very large numbers o proteinsequences, Bioinormatics, vol. , no. , pp. , .

    [] K. Katoh, K. Misawa, K.-I. Kuma, and . Miyata, MAFF: anovel method or rapid multiple sequence alignment based onast Fourier transorm,Nucleic Acids Research, vol. , no. ,pp. , .

    [] What cloud computing really means, , http://www.inoworld.com/d/cloud-computing/what-cloud-computing-really-means-.

    [] MI credits Irish-based entrepreneur with co-coiningterm cloud computing, , http://www.siliconrepublic.com/cloud/item/-mit-credits-irish-based-ent.

    [] Te Benets O Data Center Virtualization For Businesses,Cloudweaks, , http://www.cloudtweaks.com///the-benets-o-data-center-virtualization-or-businesses/.

    [] VMware Virtualization Sofware or Desktops, Servers &Virtual Machines or Public and Private Cloud Solutions, ,http://www.vmware.com/.

    [] Main PageKVM, , http://www.linux-kvm.org/page/

    Main Page.[] Accelerating high-perormance computing applications

    using parallel computing, , http://embedded-computing.com/articles/accelerating-using-parallel-computing/.

    [] Cloud : What the heck do IaaS, PaaS and SaaS companiesdo? , http://venturebeat.com////cloud-iaas-paas-saas/.

    [] Amazon Web Services, Cloud Computing: Compute, Storage,Database, ,http://aws.amazon.com/.

    [] Microsof Home Page, Devices and Services, ,http://www.microsof.com/en-us/deault.aspx.

    [] Rackspace, ,http://www.rackspace.com/.

    [] Ensembl Genome Browser, , http://www.ensembl.org/

    index.html.[] GenBank Home, ,http://www.ncbi.nlm.nih.gov/genbank.

    [] CRMTe Enterprise Cloud Computing CompanySalesorce.com Europe, ,http://www.salesorce.com/eu/.

    [] V. Choudhary, Sofware as a service: implications or invest-ment in sofware development, in Proceedings o the thAnnual Hawaii International Conerence on System Sciences(HICSS ), January .

    [] G. Lawton, Developing sofware online with platorm-as-a-service technology,Computer, vol. , no. , pp. , .

    [] J. Deanand S. Ghemawat,MapReduce: simplieddata process-ing on large clusters,Communications o the ACM, vol. , no., pp. , .

    [] . Nguyen, W. Shi, andD. Ruden, CloudAligner:a ast andull-eatured MapReduce based tool or sequence mapping, BMCResearch Notes, vol. , article , .

    [] M. C. Schatz, CloudBurst: highly sensitive read mapping withMapReduce, Bioinormatics , vol. , no. , pp. , .

    [] L. Pireddu, S. Leo, and G. Zanetti, Seal: a distributedshort readmappingand duplicate removal tool, Bioinormatics , vol. , no., pp. , .

    [] Blastreduce: high perormance short read mapping withmapreduce, ,http://www.cbcb.umd.edu/sofware/blastre-duce.

    [] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg,Searching or SNPs with cloud computing, Genome Biology,vol. , no. , article R, .

    [] M. C. Schatz, D. Sommer, D. Kelley, and M. Pop, De novoassembly o large genomes using cloud computing, in Proceed-ings o the CSHL Biology o Genomes Conerence, .

    [] Y.-J. Chang, C. C. Chen, C. L. Chen, and J. M. Ho, A denovo next generation genomic sequence assembler based onstring graph and mapreduce cloud computing ramework,

    BMC Genomics, vol. , supplement , p. S, .[] B. Langmead, K. D. Hansen, and J. . Leek, Cloud-scale

    RNA-sequencing differential expression analysis with Myrna,Genome Biology, vol. , no. , p. R, .

    [] D. Hong, A. Rhie, S.-S. Park et al., FX: an RNA-seq analysis toolon the cloud, Bioinormatics, vol. , no. , pp. , .

    [] L. Jourdren, M. Bernard, M. A. Dillies, and S. Le Crom,Eoulsan: a cloud computing-based ramework acilitating highthroughput sequencing analyses, Bioinormatics, vol. , no. ,pp. , .

    [] M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemela, E.Korpelainen, and K. Heljanko, Hadoop-BAM: directly manip-ulating next generation sequencing data in the cloud,Bioinor-matics, vol. , no. , pp. , .

    [] B. D. OConnor, B. Merriman, and S. F. Nelson, SeqWare QueryEngine: storing and searchingsequence datain the cloud, BMCBioinormatics , vol. , supplement , article S, .

    [] A. McKenna, M. Hanna, E. Banks et al., Te genome analysistoolkit: a MapReduce ramework or analyzing next-generationDNA sequencing data, Genome Research, vol. , no. , pp., .

    [] S. J. Matthews and . L. Williams, MrsRF: an efficient MapRe-duce algorithm or analyzing large collections o evolutionarytrees, BMC Bioinormatics, vol. , supplement , article S,.

    [] M. E. Colosimo, M. W. Peterson, S. Mardis, and L. Hirschman,Nephele: genotyping via complete composition vectors and

    MapReduce, Source Code or Biology and Medicine, vol. ,article , .

    [] P. D. Vouzis and N. V. Sahinidis, GPU-BLAS: using graphicsprocessors to accelerate protein sequence alignment,Bioinor-matics, vol. , no. , pp. , .

    [] C.-M. Liu, . Wong, E. Wu et al., SOAP:ultra-ast GPU-basedparallel alignment tool or short reads, Bioinormatics, vol. ,no. , pp. , .

    [] S. Lewis, A. Csordas, S. Killcoyne et al., Hydra: a scalableproteomic search engine which utilizes the Hadoop distributedcomputing ramework, BMC Bioinormatics, vol. ,article ,.

    [] A. Matsunaga, M. sugawa, and J. Fortes, CloudBLAS: com-bining MapReduce and virtualization on distributed resources

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    14/15

    ISRN Biomathematics

    or bioinormatics applications, inProceedings o the th IEEEInternational Conerence on eScience (eScience ), pp. ,December .

    [] X. Feng, R. Grossman, and L. Stein, PeakRanger: a cloud-enabled peak calleror ChIP-seq data, BMC Bioinormatics, vol., article , .

    [] L. Zhang, S. Gu, Y. Liu, B. Wang, and F. Azuaje, Gene setanalysis in the cloud, Bioinormatics , vol. , no. , pp. ,.

    [] S. Leo, F. Santoni, and G. Zanetti, Biodoop: bioinormatics onhadoop, in Proceedingso the th International Conerence Par-allel Processing Workshops (ICPPW ), pp. , September.

    [] H. Huang, S. ata, and R. J. Prill, BlueSNP: R package orhighly scalable genome-wide association studies using Hadoopclusters, Bioinormatics, vol. , no. , pp. , .

    [] D. R. Kelley, M. C. Schatz, and S. L. Salzberg, Quake: quality-aware detection and correction o sequencing errors,GenomeBiology, vol. , no. , article R, .

    [] Apache Hadoop, , http://hadoop.apache.org/.

    [] How Hadoop Makes Short Work O Big DataForbes, ,http://www.orbes.com/sites/netapp////hadoop-big-data/.

    [] Public Data Sets on Amazon Web Services (AWS), ,http://aws.amazon.com/publicdatasets/.

    [] P. M. Kasson, Computational biology in the cloud: methodsandnew insights romcomputing at scale, in PacicSymposiumon Biocomputing. Pacic Symposium on Biocomputing, WorldScientic, .

    [] S. Vijayakumar, A. Bhargavi, U. Praseeda, and S. A. Ahamed,Optimizing sequence alignment in cloud using hadoop andMPP database, in Proceedings o the th IEEE InternationalConerence on Cloud Computing (CLOUD ), .

    [] R. D. Sleator, Proteins: orm and unction,Bioeng Bugs, vol. ,no. , pp. , .

  • 7/25/2019 Long Read Alignment with Parallel MapReduce Cloud Platform

    15/15

    Submit your manuscripts at

    http://www.hindawi.com