An approach to a metagenomic data processing workflow

Milko Krachunov (a,*), Dimitar Vassilev (b)

a Faculty of Mathematics and Informatics, University of Sofia “St. Kliment Ohridski”, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria
b Bioinformatics Group, AgroBio Institute, 8 Dragan Tsankov Blvd., 1164 Sofia, Bulgaria
* Corresponding author. Tel.: +359 885988001. E-mail addresses: [email protected] (M. Krachunov), [email protected] (D. Vassilev).

Journal of Computational Science (2013), http://dx.doi.org/10.1016/j.jocs.2013.08.003

Article history: Received 15 January 2013; received in revised form 16 June 2013; accepted 13 August 2013.

Keywords: Metagenomics; Error detection; NGS data analysis workflow

Abstract

Metagenomics is a rapidly growing field that has been greatly driven by the ongoing advancements in high-throughput sequencing technologies. As a result, both the data preparation and the subsequent in silico experiments pose unsolved technical and theoretical challenges: there are no well-established approaches, and new expertise and software are constantly emerging. The main focus of our project is the creation and evaluation of a novel error detection and correction approach to be used inside a metagenomic processing workflow. The approach, together with an indirect validation technique and the empirical results obtained so far, is described in detail in this paper. To aid the development and testing, we are also building a workflow execution system to run our experiments, designed to be extensible beyond the scope of error detection, which will be released as a free/open-source software package.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Problems in metagenomics

Metagenomics deals with the mixed genetic data found in samples collected from heterogeneous biological environments, ranging from soil to the insides of various macro-organisms. These microbial communities are still largely unexplored, presenting researchers with samples that contain a large number of organisms from a great variety of microbial species, a large portion of which are presently unknown.

Comparative analysis of these microbial communities is crucial for studies that explore issues ranging from human health [1] to bacterial and viral evolution [2]. Such studies can improve our understanding of the past of our biosphere as well as our ability to deal with potential future threats – as the most rapidly mutating agents, microbes can provide a lot of insight on evolution, and they are also a critical factor in unexpected disease outbreaks.

A researcher in the field of metagenomics has to deal with a variety of challenges [3,4]. As the field is new, there are as yet no well-established methods to approach it, and researchers often face unsolved technical or methodological problems. The datasets are large and heterogeneous, most of the microbial species comprising them have not been sequenced elsewhere, and given their rate of mutation it is unclear whether the present means of cataloguing genomes are a feasible way to simplify this task.

Due to the nature of the data obtained – which presently lacks any inherent reference points or a standard for validation – every study involves the computational challenges associated with high-throughput de novo sequencing. These are further exacerbated by the need to deal with a larger degree of uncertainty, a significantly larger amount of data, and the need to adapt the data processing to every particular experiment, often multiple times.

At present, all researchers have to deal with deficiencies in data quality and the limited capabilities of the software tools and processing methods. Their work involves time-consuming processing of huge datasets and a great deal of uncertainty about the correctness of both the input data and the results.

1.2. Our project and goal

Initially, our project began as an attempt to reduce the impact of errors on the quality of metagenomic studies by proposing a new error detection approach and comparing it with other approaches on metagenomic data. Soon it became clear, however, that obtaining a pristine metagenomic test data set that could give a definite confirmation of the advantage of one error detection method over another would be very difficult, if not impossible, because of the difficulty of taking the same sample again. To deal with this, we had to come up with roundabout approaches to indirectly estimate the number of false positives and false negatives that an error detection procedure suffers from. These approaches, however, are neither as reliable nor as easy as a direct measurement of the quality of a real dataset. As a result, they are dependent on the execution of a large number of computational experiments on a very large number of datasets.

An approach to a metagenomic data processing workflow

  • Upload
    dimitar

  • View
    215

  • Download
    3

Embed Size (px)

Citation preview

J

A

Ma

b

a

ARRAA

KMEN

1

1

srmrfa

ctowmc

veult

j

1h

ARTICLE IN PRESSG ModelOCS-222; No. of Pages 6

Journal of Computational Science xxx (2013) xxx–xxx

Contents lists available at ScienceDirect

Journal of Computational Science

journa l h om epage: www.elsev ier .com/ locate / jocs

n approach to a metagenomic data processing workflow

ilko Krachunova,∗, Dimitar Vassilevb

Faculty of Mathematics and Informatics, University of Sofia “St. Kliment Ohridski”, 5 James Bourchier Blvd., 1164 Sofia, BulgariaBioinformatics Group, AgroBio Institute, 8 Dragan Tsankov Blvd., 1164 Sofia, Bulgaria

r t i c l e i n f o

rticle history:eceived 15 January 2013eceived in revised form 16 June 2013ccepted 13 August 2013vailable online xxx

a b s t r a c t

Metagenomics is a rapidly growing field, which has been greatly driven by the ongoing advancementsin high-throughput sequencing technologies. As a result, both the data preparation and the subse-quent in silico experiments pose unsolved technical and theoretical challenges, as there are not anywell-established approaches, and new expertise and software are constantly emerging.

eywords:etagenomics

rror detectionGS data analysis workflow

Our project main focus is the creation and evaluation of a novel error detection and correction approachto be used inside a metagenomic processing workflow. The approach, together with an indirect validationtechnique and the already obtained empirical results, are described in detail in this paper. To aid thedevelopment and testing, we are also building a workflow execution system to run our experiments that isdesigned to be extensible beyond the scope of error detection which will be released as a free/open-sourcesoftware package.

. Introduction

.1. Problems in metagenomics

Metagenomics deals with the mixed genetic data found inamples collected from heterogeneous biological environments,anging from soil to the insides of various macro-organisms. Theseicrobial communities are still largely unexplored, presenting the

esearchers with samples containing a large number of organismsrom a great variety of microbial species, a large portion of whichre presently unknown.

Comparative analysis of these microbial communities is cru-ial for studies that explore issues ranging from human health [1]o bacterial and viral evolution [2]. They can have an impact onur understanding of the past of our biosphere as well as dealingith potential future threats – as the most rapidly mutating agents,icrobes can provide a lot of insight on evolution, and are also a

ritical factor in unexpected disease outbreaks.A researcher in the field of metagenomics has to deal with a

ariety of challenges [3,4]. As a new field there are yet no well-stablished methods to approach it, and they often have to facensolved technical or methodological problems. The datasets are

Please cite this article in press as: M. Krachunov, D. Vassilev, An appro(2013), http://dx.doi.org/10.1016/j.jocs.2013.08.003

arge and heterogeneous, most of the microbial species comprisinghem are not sequenced elsewhere, and with their rate of mutation

∗ Corresponding author. Tel.: +359 885988001.E-mail addresses: [email protected] (M. Krachunov),

[email protected] (D. Vassilev).

877-7503/$ – see front matter © 2013 Elsevier B.V. All rights reserved.ttp://dx.doi.org/10.1016/j.jocs.2013.08.003

© 2013 Elsevier B.V. All rights reserved.

it is unclear if the present means of cataloguing genomes can be afeasible approach to simplify this task.

Due to the nature of the data obtained – which presently lacksany inherent reference points or a standard for validation – everystudy involves the computational challenges associated with high-throughput de novo sequencing, which are further exacerbated bythe need to deal with a larger degree of uncertainty, significantlylarger amount of data and the need to adapt the data processing toevery particular experiment, often multiple times.

At present, all researchers have to deal with deficiencies inthe data quality and limited capabilities of the software toolsand processing methods. Their work involves time consumingprocessing of huge datasets and a great deal of uncertainty aboutthe correctness of the input data as well as the results.

1.2. Our project and goal

Initially, our project began as an attempt to reduce the impactof errors on the quality of the metagenomic studies by propos-ing a new error detection approach and comparing it with otherapproaches on metagenomic data. Soon it became clear, however,that obtaining a pristine metagenomic test data set, that can beused to give a definite confirmation of the advantage of one errordetection method over another, could prove to be very difficult ifnot impossible because of the difficulty in taking the same sam-ple again. To deal with this, we had to come up with roundabout

ach to a metagenomic data processing workflow, J. Comput. Sci.

approaches to indirectly estimate the number of false positivesand false negatives that an error detection procedure suffers from.These approaches, however, are neither as reliable nor as easy as adirect measurement of the quality of a real dataset. As a result, they


Fig. 1. An excerpt from the input datasets.

These experiments constitute a processing workflow that executes multiple genomic software packages, in which both the parameters and the procedure need to be varied. We came to the conclusion that building a tool for managing, running and distributing the genomic toolchain would greatly reduce the amount of manual work required to run any metagenomic experiment.

Thus our initial goal of building and validating error detection was extended to the larger project of developing a library for executing configurable genomic workflows capable of interfacing with arbitrary external tools.

2. Material and methods

2.1. The input data

16S rRNA is very attractive for metagenomic analysis, because it is highly conserved and thus largely similar across a great deal of species, while at the same time it contains hypervariable regions that are extremely helpful for identifying species and individual organisms and for finding their evolutionary relationships [5].

The sample datasets for our experiments contain short reads between 300 and 500 bases in length, divided into sets of tens of thousands of sequences – between 20,000 and 50,000 after filtering them by length (≥300 bp, ≤500 bp) and quality (throwing out ambiguous bases). All our sample datasets were sequenced using the 454 platform by Roche, which is very suitable for metagenomic experiments because it produces short reads of sufficient length.
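For illustration, this filtering step can be expressed in a few lines of Python. The sketch below assumes FASTA input and the Biopython SeqIO module; the file names and thresholds are chosen for the example rather than taken from our pipeline.

from Bio import SeqIO  # assumes Biopython is available

def keep_read(record, min_len=300, max_len=500):
    """Keep reads within the length bounds that contain no ambiguous bases."""
    seq = str(record.seq).upper()
    return min_len <= len(seq) <= max_len and set(seq) <= set("ACGT")

# "reads.fasta" is a placeholder name for one 454 sequencing run
filtered = [r for r in SeqIO.parse("reads.fasta", "fasta") if keep_read(r)]
SeqIO.write(filtered, "reads.filtered.fasta", "fasta")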

2.2. Data preparation using sequence alignment

One of the crucial steps in a typical metagenomic workflow is the sequence alignment, and any processing heavily relies on one's ability to do fast multiple sequence alignments of acceptable quality. If we look at any sample excerpt from our datasets, like the one in Fig. 1, we can easily notice that the sequences are displaced because of missing or extra bases. This makes it impossible to perform any meaningful column-wise analysis unless such displacements are accounted for by the sequence alignment.

A high-quality alignment is particularly important for the execution of the error detection approach proposed in the next section, which relies on column-wise comparison across multiple species. If alignment of the denoised data is desired, a modification of the error detection method, extended to perform correction, can be used on a preliminary alignment before the real alignment is executed.

Unfortunately for metagenomics, the datasets are much larger and far more varied than those found in regular genomics, and the approaches for alignment used in de novo sequencing, resequencing and sequence searching are no longer suitable [6–8].

Finding the globally optimal alignment for n sequences is an NP-complete problem. For any considerably sized dataset like the ones found in metagenomics, finding this optimum is a practical impossibility. Furthermore, inexact methods are more likely to produce bad results for heterogeneous data. This is in contrast with the highly similar data found in genomic studies, where the sequences are usually limited to a single species, or where there are means for obtaining representative sequences that the input data can be aligned against, which allows one to align fast without sacrificing quality.


In our experiments, we ran several alignment software packages intended for large datasets. We discovered that when we ran them with the stricter parameters intended for higher-quality results, they could not process our data in a reasonable time (the execution time was in the range of days), but when we ran them in a less accurate mode, they did not produce acceptable results.

To remedy this we used a surprisingly simple and straightforward approach. We performed a quick clustering of the dataset using the CD-HIT-454 software [9]. We aligned each cluster with a software solution and settings for a high-quality alignment; in particular we used MAFFT [10] and MUSCLE [11]. Then, we aligned the clusters against each other and combined them, in a manner similar to the one used in multiple sequence alignment using a guide tree.
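A minimal sketch of this cluster-then-align strategy is shown below. The external tool invocations are illustrative (command-line flags vary between versions of CD-HIT-454 and MAFFT), and the per-cluster FASTA files as well as the final merging step are assumed helpers rather than direct output of the tools.

import subprocess
from pathlib import Path

# 1. Quick clustering with CD-HIT-454 at an illustrative identity threshold.
subprocess.run(["cd-hit-454", "-i", "reads.filtered.fasta",
                "-o", "clusters", "-c", "0.98"], check=True)

# 2. High-quality alignment of each cluster, here with MAFFT; splitting the
#    CD-HIT .clstr membership file into per-cluster FASTA files is assumed.
for cluster in Path(".").glob("cluster_*.fasta"):
    with open(cluster.with_suffix(".aln.fasta"), "w") as out:
        subprocess.run(["mafft", "--auto", str(cluster)], stdout=out, check=True)

# 3. Align the cluster alignments against each other and combine them,
#    analogously to progressive alignment along a guide tree (not shown).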

Counter-intuitively, the alignment took significantly less time and was significantly superior in quality to the alignment we got when we ran MAFFT or MUSCLE directly. While the alignment of metagenomic datasets is evidently feasible, we did not find a straightforward solution and had to improvise, despite the fact that alignment is a very basic component of metagenomic processing.

Such makeshift solutions are not always obvious and can differ greatly in quality depending on how they are constructed; devising them can be greatly facilitated by software for building and launching preconfigured workflows. Such software would also allow a quicker comparison between the various options, as this would no longer need to be done by hand. One of our major goals is not simply to build a metagenomic workflow or pipeline that performs multiple sequence alignment, but to extend it to allow easier experimentation by allowing arbitrary combinations of software packages to perform this task.

2.3. Improving read quality by error detection and correction

One significant obstacle in metagenomic studies is the uncertainty about data correctness. Sequencing equipment produces a great deal of errors, which can be intermixed with meaningful differences of biological origin, such as mutations, and meaningless errors of biological origin, such as errors during amplification, all of which initially occur randomly.

The mutations are an important subject of evolutionary studies and can provide invaluable insight into the development and propagation of microbial species. Unlike the other two kinds of errors, mutations most often lead to an evolutionary dead-end that kills the organism. This makes the surviving mutations peculiar: they are an object of interest, carrying information about the species, while at the same time their survival provides an opportunity to distinguish them from actual errors.

A common approach to telling them apart is to use their frequency of appearance, which is not always reliable. It is common practice to throw out any reads suspected to have errors in them, but this can reduce the size of the dataset by an order of magnitude, even though most of the discarded information was correct.

Improving the means for detecting and correcting those errors, as well as proposing ways to utilise the information present in those often discarded sequences, can lead to a significant improvement in all metagenomic studies.

2.3.1. The naïve approach

The most obvious way to spot errors is simply to look for data that occurs rarely. This can be done by counting the frequency of occurrence of each base in each column. The bases that appear less often than a threshold established beforehand are considered errors.

The assumption behind this approach is that while mutations happen at a slower rate than errors, their numbers are multiplied by inheritance, as the surviving ones will span multiple generations, while the deadly ones are not a good target for studies. Moreover, the samples are usually made large enough to contain multiple samplings of the same organism, which can further increase the occurrences of a single mutation.

Fig. 2. Similarity-based error detection.

To mathematically express this naïve approach, we will define a score function that corresponds to the frequency. If $R$ is the set of reads and $n = |R|$ is their number, the frequency ("simple", "naïve") score for position $k$ in read $r$ is:

$$\mathrm{score}(r, k) = \sum_{\substack{p \in R\\ p \neq r}} \frac{[r_k = p_k]}{n - 1} = \frac{\sum_{p \in R,\; p \neq r} [r_k = p_k]}{\sum_{p \in R,\; p \neq r} 1} \tag{1}$$
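As a concrete illustration of formula (1) and its correction variant, here is a minimal Python sketch over an already aligned set of reads; the function and variable names are ours, and the threshold is only an example value.

from collections import Counter

def naive_score(reads, r, k):
    """Fraction of the other reads agreeing with read r at column k (Eq. (1))."""
    return sum(p[k] == r[k] for p in reads if p is not r) / (len(reads) - 1)

def naive_correct(reads, threshold):
    """Replace bases scoring below the threshold with the column-majority base."""
    corrected = []
    for r in reads:
        new = list(r)
        for k in range(len(r)):
            if naive_score(reads, r, k) < threshold:
                new[k] = Counter(p[k] for p in reads).most_common(1)[0][0]
        corrected.append("".join(new))
    return corrected

# Toy aligned reads; real input would come from the alignment step in Section 2.2.
print(naive_correct(["ACGT", "ACGT", "ACGA", "ACGT"], threshold=0.5))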

To extend this error detection approach to error correction, one simply replaces the erroneous bases with the ones that appear most often in their column. This approach would make a lot of sense if the dataset were a newly-sequenced genome of a single macro-organism. It is, in fact, mathematically identical to the established approach used in de novo sequencing – acquire enough reads to produce a reasonable coverage of the region in question and, for each position, select the base that occurs most often [12].

Unfortunately, this will not work when the dataset contains multiple distinct organisms. For each position, there are often multiple competing options that are correct at the same time. And there might be entire reads that occur less often than the threshold, which would be "corrected" in their entirety.

2.3.2. Similarity-based approach

To make the error detection and correction more suitable for the heterogeneous nature of metagenomic datasets, we propose an improved method that extends the one described above. In our proposal, we will still count the frequencies of each base, but we will do so while taking the context into account.

Our proposal will try to fulfil the following requirements (Fig. 2):

• A mismatch between two similar sequences is more important than a mismatch between two dissimilar reads. In more general terms, the importance of a mismatch should be proportional to the similarity between the reads.
• The similarity in the proximity of the mismatch is more important than the similarity away from it. In more general terms, the importance of the similarity at a given position should be inversely proportional to its distance from the mismatch.

The reason we want to count only mismatches in similar reads should be obvious – the probability of multiple errors occurring alongside each other is significantly lower than the probability of a lone error; if an entire region of the sequence is different, it is much more likely that it came from a more distant organism than that all its bases had been miraculously changed.

By taking the similarity into account we immediately gain an advantage over the naïve approach – if we find an entire region that has been replaced, we will not mistake it for an error, and if there are two distinct sets of sequences, we will correctly separate them when correcting errors in each of them.


The proximity requirement should also be evident, but it is important to remember that if two bases are close to each other they are much more likely to be functionally related, and also much more likely to simultaneously participate in a long insertion, deletion or horizontal gene transfer.

Algorithm 1. Similarity-based error detection

1. A window is created around each evaluated position.
2. For each two reads compared for a mismatch at that position, the similarity in the window is calculated. The positions right next to the evaluated one have a higher weight in the similarity score.
3. The mismatch score for that pair and position is incremented with the similarity score divided by the sum of all similarity scores in the position.

If we amend the formula from (1) with our extended approach, we get the following "similarity" score:

$$\mathrm{score}(r, k) = \frac{\sum_{p \in R,\; p \neq r} \mathrm{similarity}(r, p, k)\,[r_k = p_k]}{\sum_{p \in R,\; p \neq r} \mathrm{similarity}(r, p, k)} \tag{2}$$

The choice of a good similarity function is non-obvious. Our proposal uses an exponentially decreasing function which quickly becomes near-zero when moving away from the position of interest. This means that the window size becomes irrelevant after a certain point, as the importance quickly tends to zero – thus, the window size would not be a true parameter of the method.

Let $q$ be a parameter that has to be experimentally evaluated and let $w$ be the size of the window that we deemed suitable; we can then calculate the similarity as follows:

$$\mathrm{similarity}(r, p, k) = \frac{\sum_{i \in \mathrm{window}(r, k)} q^{|k - i|}\,[r_i = p_i]}{\sum_{i \in \mathrm{window}(r, k)} q^{|k - i|}} \tag{3}$$

$$\mathrm{window}(r, k) = \{\,i : \exists r_i\,\} \cap \bigl(\{k - w, \ldots, k - 1\} \cup \{k + 1, \ldots, k + w\}\bigr) \tag{4}$$
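The following sketch implements formulas (2)–(4) directly; the values of q and w are placeholders to be tuned experimentally, as discussed above, and the names are ours rather than those of our software.

def window(r, k, w):
    """Existing positions within w of column k, excluding k itself (Eq. (4))."""
    return [i for i in range(max(0, k - w), min(len(r), k + w + 1)) if i != k]

def similarity(r, p, k, q=0.5, w=8):
    """Exponentially weighted local agreement of reads r and p around k (Eq. (3))."""
    idx = window(r, k, w)
    weights = [q ** abs(k - i) for i in idx]
    agreeing = sum(wt for wt, i in zip(weights, idx) if r[i] == p[i])
    return agreeing / sum(weights)

def similarity_score(reads, r, k, q=0.5, w=8):
    """Similarity-weighted frequency score of position k in read r (Eq. (2))."""
    num = den = 0.0
    for p in reads:
        if p is not r:
            s = similarity(r, p, k, q, w)
            num += s * (r[k] == p[k])
            den += s
    return num / den if den else 0.0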

The time-complexity of our proposal is O(wn²). However, we have the ability to cache all the possibilities within a window. From a theoretical point of view this leads to a time-complexity of O(n · exp(w)), which would render the optimisation useless; but since we can work under a certain limit for the window size, and the variation per position is significantly lower than that of random data, caching can become practically faster for extremely large datasets.

We should also mention a curious side-effect of the similarity-based approach. Normally, when doing sequencing, one only uses sequences from one particular organism to confirm a sequence. In a metagenomic dataset, however, one is forced to utilise the information from different organisms, which is the main source of the additional noise in the dataset. While the local similarity should filter that noise and silence the foreign sequences, on rare occasions two different organisms can confirm each other in regions that are highly conserved, in cases where the coverage for each organism alone is below what is needed.

To extend the similarity-based approach to error correction, we can simply apply the score formula (2) to all potential replacement bases for the given position. This makes more sense than with the naïve approach, because the similarity score tends to group multiple possibilities together.


2.4. Validating error detection and correction

As mentioned earlier, it is practically very difficult to obtain a reliable test dataset that is confirmed to be a representative sample of a real metagenomic dataset. Any available error-free sets of sequencing runs would be distorted by whatever error detection or filtering was used to construct them. Sets constructed from databases of known micro-organisms will be biased when it comes to error detection, because of the high likelihood of missing rare mutations. We were also unable to find any established, freely available methods for the construction of simulated reads.

Because of the aforementioned issues, we chose to use two indirect approaches for performing the validation of our error detection and correction.

2.4.1. Validation through repeated application approach

It is difficult to reliably simulate sequencing runs, but it is simpler to simulate errors similar to the ones that the sequencing equipment produces. One can estimate the error patterns by sequencing sets of known sequences and comparing the sequenced ones to the known ones. Then the same error patterns can be generated inside other sequences that are available.

This provides an opportunity to indirectly validate an error detection approach and compare it against another. The goal of creating a better error detection approach can be restated as the goal of creating an error detection with fewer false positives for the same number of false negatives. In other words, the better error detection approach should be expected to flag less correct data as incorrect under the settings with which it properly identifies the same number of real errors.

We will call the description of those error patterns an error profile. Once the error profile has been estimated, one can apply the following procedure:

rocedure 1. Repeated application procedure

1. Obtain dataset 0e from the sequencing equipment.
2. Apply the error correction approach to dataset 0e, producing dataset 1.
3. Simulate errors in dataset 1 according to the error profile that has been estimated, producing dataset 1e.
4. Apply the error correction approach to dataset 1e, producing dataset 2.
5. Measure the differences between the datasets.

It might sound counter-productive to introduce errors into a dataset that contained errors and re-apply the error correction; however, this procedure yields a couple of useful statistical figures. If the error profile is correct, the comparison between datasets 1, 1e and 2 gives an accurate estimate of the false negatives of the approach: the number of missed simulated errors should be roughly the same as the number of missed real errors. At the same time, the difference between dataset 0e and dataset 1 gives the total number of corrections made.

These two figures can be used to indirectly measure and compare the false positives of two approaches. If the "improved" approach under validation leads to the same number of false negatives yet makes fewer corrections overall, this is a clear sign that the number of false positives has also decreased.
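A minimal sketch of Procedure 1 follows; correct() stands in for the error correction method under test, and the uniform substitution model in simulate_errors() is only a placeholder for a real error profile estimated from sequencing runs of known sequences.

import random

def simulate_errors(reads, rate=0.01, alphabet="ACGT"):
    """Inject substitution errors at a uniform rate (placeholder error profile)."""
    out = []
    for r in reads:
        out.append("".join(
            random.choice(alphabet.replace(c, "")) if random.random() < rate else c
            for c in r))
    return out

def diff_count(a, b):
    """Number of base positions at which two datasets disagree."""
    return sum(x != y for r, s in zip(a, b) for x, y in zip(r, s))

def repeated_application(dataset_0e, correct):
    """Procedure 1: correct, inject simulated errors, correct again, compare."""
    dataset_1 = correct(dataset_0e)           # step 2
    dataset_1e = simulate_errors(dataset_1)   # step 3
    dataset_2 = correct(dataset_1e)           # step 4
    total_corrections = diff_count(dataset_0e, dataset_1)
    false_negatives = sum(                    # simulated errors left uncorrected
        e != o and c == e
        for r1, r1e, r2 in zip(dataset_1, dataset_1e, dataset_2)
        for o, e, c in zip(r1, r1e, r2))
    return total_corrections, false_negatives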

2.4.2. Validation through subset approach

The reason it is difficult to obtain a test dataset in metagenomics is the difficulty of sequencing the exact same sample again. If one could do that, one would sequence the sample over and over again until the genomes of all the microbial species inside were known with certainty. Then it would be possible to correct most of the reads in a sequencing run using the fully sequenced genomes, and to use that to estimate the usefulness of the error correction.

Another way to state this is that if one has a large enough dataset, i.e. a very large number of sequencing runs, one could use it to validate the error correction methods on a small subset of it, i.e. one sequencing run. But even if a humongous dataset is not available, it is still true that a larger dataset gives higher reliability than a smaller one.

This makes it possible to apply the following procedure:

Procedure 2. Subset procedure

1. Obtain the largest usable dataset available.
2. Apply the error correction to it, and use that as a standard for validation.
3. Take a random subset of it that is one order of magnitude smaller than the whole set.
4. Apply the error correction to the subset and compare with the standard.

Both datasets are representative of a typical sequencing run, having the right distribution of errors as well as the right distribution of sequences. At the same time, the larger dataset is much more reliable than the smaller one, meaning that both error detection approaches will make more mistakes on the smaller one. If the large dataset is large enough – in other words, as large as needed for the two error correction approaches to make as few errors as possible – then the differences between the large and the small datasets can be used to estimate false positives and false negatives.
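Procedure 2 can be sketched in the same style; again, correct() is a stand-in for the method under test, and the subset fraction and seed are example values.

import random

def subset_validation(reads, correct, fraction=0.1, seed=42):
    """Procedure 2: correct the full dataset to form a validation standard,
    then correct a random subset one order of magnitude smaller and count
    disagreements between the two corrections."""
    standard = correct(reads)                                        # step 2
    rng = random.Random(seed)
    idx = rng.sample(range(len(reads)), int(len(reads) * fraction))  # step 3
    subset_corrected = correct([reads[i] for i in idx])              # step 4
    return sum(subset_corrected[j] != standard[i]
               for j, i in enumerate(idx))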

2.5. Building the processing workflow

The preparation, denoising and validation steps described thus far can be delivered by a parametric metagenomic workflow, which describes the computational experiment used for confirmation of the denoising procedure.

The availability of a tool for managing, running and distributing these workflows would greatly reduce the amount of manual work required to perform them, would facilitate distributing them across a computer cluster, and can also greatly help with the storage and management of intermediate and final results. An example of such an execution is shown in Fig. 3.

One of our final goals is to create a library that provides the means for performing precisely such executions in a simple enough manner. We have chosen to develop it as a configurable, network-capable and fully-asynchronous workflow library that provides a simple description language for the workflows as well as enough flexibility.

3. Results and discussion

3.1. Experimental results

On a test run of 4540 sequences, we performed both correction with the similarity-based approach and correction with the naïve approach. The naïve approach produced 673 corrections, while the similarity-based one produced only 607, or 66 fewer, which is a 10% decrease in the number of corrections.

We then simulated errors using our estimated error profile and ran the error correction approaches again. As illustrated in Fig. 4, in the set initially corrected by the naïve approach the false negatives were 34 for both approaches, while in the set initially corrected by the similarity-based approach the false negatives were 33 for both. In other words, the number of false negatives remained almost the same, and even saw a statistically insignificant decrease.


This result is consistent with our expectation of a decrease in the number of overall corrections for the same number of false negatives. The weighted approach was intended to filter out spurious corrections without eliminating valid ones. As the number of false negatives has been preserved, we have strong reason to suspect that the 66 additional corrections made by the naïve approach represent actual false positives.

Fig. 3. Example workflow.

During the second correction run we also counted the false positives for completeness. It should be noted that, unlike the false negatives, this count is not necessarily meaningful, as it is measured against simulated errors while the real errors are still unknown; as such, it serves mostly as an estimate of the number of unexpected extra corrections made during the second run.

There were virtually no such extra corrections on the set initially corrected with the naïve approach, as no corrections had been filtered out by similarity weights in the first run. The set initially corrected with the similarity-based approach had 67 additional corrections made by the naïve one – corresponding to the potential 66 false positives that we suspected earlier in the first run – while only 18 new corrections were made by the similarity-based approach.

These results were obtained on the largest cluster of our test datasets, with a threshold value derived from the known global error rate used for error simulation. The remaining test data yielded similar results, but they were less significant because of the smaller number of sequences. It is clear that more computational experiments are required for the comparison between the two approaches to be more reliable, which is what our workflow library is intended to facilitate. In particular, it allows us to process more datasets while varying any parameters freely, including the error threshold.

Fig. 4. Repeated application validation approach.

The results from running the subset approach on small datasets proved to be inconclusive, showing the need to use a larger dataset in a higher number of experiments with a highly variable choice of parameters (e.g. subset size), using the completed workflow library.

3.2. Metagenomic workflow software library

The first experiments we ran were evidence that a flexible yet efficient means to manage and execute procedures would be a very helpful contribution to the bioinformatics community. Any work in the area involves the execution of a mixture of software packages in a variety of orders and combinations, which often involves a lot of manual and unnecessary work that is unrelated to the actual study or software development.

To perform our initial tests we had to implement off-the-cuff programs to execute such workflows, including launching our aligners, performing the error corrections and running the validation. They are now being reworked into a more robust and general-purpose library together with a command-line tool.



We are now in the process of developing a Python software package that:

1. Can execute any required genomic processing task asynchronously, so that tasks can be easily distributed across multiple cores or machines. The interface is network-capable and based on the Twisted networking framework [13].
2. Has a simple modular design that allows for easy extensibility, scalability and improvements.
3. Has a common application programming interface (API) for every supported task.
4. Has modules that are easily replaceable or extensible. For example, it should be trivial to introduce a cluster execution plugin that replaces the internal aligner with the same aligner run on a cluster.
5. Provides access to the functionality through a Python API, a command-line interface, and a simple task mini-language that allows the execution of simple workflows based on the YAML markup language.

By using an asynchronous networking framework and creating interfaces fully adhering to its idioms, the tasks of scheduling, parallelising and distributing the jobs across networked computers become easy. It also becomes easier to run a variety of components in a variety of combinations while doing basic resource management. Such an approach would also facilitate the running and comparison of different error detection approaches with different data processing layers.

The processing components are split into generic operations (e.g. alignment), their more specific flavours (e.g. alignment by MUSCLE), and operation providers (e.g. MUSCLE running locally), which can be superseded by plugins (e.g. distributed execution). The workflow describes the operation used to generate each intermediate dataset from a number of required intermediate or input datasets. The workflow is translated into an implicit graph of dependencies. The scheduler solves the dependencies for each intermediate dataset and runs any required operations as processing resources become available.

All the software developed as part of this project will be released as free software under the X11 license, once the central components and the overall design of the rework have reached a stable stage that is conducive to decentralised development.

. Conclusions

During our experimental validation, the suggested method for improving error detection showed results consistent with our expectations of an improvement in quality. The workflow library and command-line tool will allow for the flexible execution of a large number of such experiments, which will provide a more reliable means of validation for this error detection approach, as well as for error detection approaches in metagenomic high-throughput sequencing in general.


References

[1] K. Nelson, B. White, Metagenomics and its applications to the study of the human microbiome, Metagenomics: Theory, Methods and Applications (2010) 171–182.
[2] D. Kristensen, A. Mushegian, V. Dolja, E. Koonin, New dimensions of the virus world discovered through metagenomics, Trends in Microbiology 18 (1) (2010) 11–19, doi:10.1016/j.tim.2009.11.003. http://www.biomedsearch.com/nih/New-dimensions-virus-world-discovered/19942437.html
[3] J. Wooley, A. Godzik, I. Friedberg, A primer on metagenomics, PLoS Computational Biology 6 (2) (2010) 289–290, doi:10.1371/journal.pcbi.1000667.
[4] T. Thomas, J. Gilbert, F. Meyer, Metagenomics – a guide from sampling to data analysis, Microbial Informatics and Experimentation 2 (1) (2012) 3, doi:10.1186/2042-5783-2-3. http://www.microbialinformaticsj.com/content/2/1/3
[5] W. Weisburg, S. Barns, D. Pelletier, D.J. Lane, 16S ribosomal DNA amplification for phylogenetic study, Journal of Bacteriology 173 (2) (1991) 697–703.
[6] W. Li, L. Fu, B. Niu, S. Wu, J. Wooley, Ultrafast clustering algorithms for metagenomic sequence analysis, Briefings in Bioinformatics 13 (6) (2012) 656–668, doi:10.1093/bib/bbs035.
[7] C.I. Hunter, A. Mitchell, P. Jones, C. McAnulla, S. Pesseat, M. Scheremetjew, S. Hunter, Metagenomic analysis: the challenge of the data bonanza, Briefings in Bioinformatics 13 (6) (2012) 743–746, doi:10.1093/bib/bbs020. http://bib.oxfordjournals.org/content/early/2012/09/07/bib.bbs020.full.pdf
[8] S. Mande, M. Mohammed, T. Ghosh, Classification of metagenomic sequences: methods and challenges, Briefings in Bioinformatics 13 (6) (2012) 669–681, doi:10.1093/bib/bbs054. http://bib.oxfordjournals.org/content/early/2012/09/07/bib.bbs054.full.pdf
[9] W. Li, A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (13) (2006) 1658–1659, doi:10.1093/bioinformatics/btl158. http://bioinformatics.oxfordjournals.org/content/22/13/1658.full.pdf
[10] K. Katoh, K. Kuma, H. Toh, T. Miyata, MAFFT version 5: improvement in accuracy of multiple sequence alignment, Nucleic Acids Research 33 (2) (2005) 511–518.
[11] R. Edgar, MUSCLE: a multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics 5 (1) (2004) 113, doi:10.1186/1471-2105-5-113.
[12] D. Zerbino, E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Research 18 (5) (2008) 821–829, doi:10.1101/gr.074492.107.
[13] M. Zadka, G. Lefkowitz, The Twisted network framework, 10th International Python Conference. https://twistedmatrix.com/users/glyph/ipc10/paper.html

Milko Krachunov studied Applied Mathematics at the University of Sofia in Bulgaria, and has completed a Master's degree in Artificial Intelligence there. He is now a Ph.D. student working on the analysis of metagenomic data from high-throughput sequencing in a team from the University of Sofia and the AgroBio Institute, Sofia. He is involved with the development of an error detection method for metagenomic datasets and a tool for automating custom sequencing workflows.

Dr. Dimitar Vassilev is Associate Professor at the AgroBio Institute in Sofia, Bulgaria, and leader of its Bioinformatics group. His research interests and projects are focused on the development of bioinformatics tools for the analysis of "-omics" data, such as the development of algorithms and software solutions for de novo next-generation sequencing analysis in metagenomics and in plant and human genomics and transcriptomics studies. Parallel computing applications in high-throughput sequencing data analysis are also within the research scope of the group.