Barcodes Genomes

Embed Size (px)

Citation preview

  • 8/3/2019 Barcodes Genomes

    1/29

    This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formattedPDF and full text (HTML) versions will be made available soon.

    Barcodes for genomes and applications

    BMC Bioinformatics 2008, 9:546 doi:10.1186/1471-2105-9-546

    Fengfeng Zhou ([email protected])Victor Olman ([email protected])

    Ying Xu ([email protected])

    ISSN 1471-2105

    Article type Research article

    Submission date 16 June 2008

    Acceptance date 17 December 2008

    Publication date 17 December 2008

    Article URL http://www.biomedcentral.com/1471-2105/9/546

    Like all articles in BMC journals, this peer-reviewed article was published immediately uponacceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright

    notice below).

    Articles in BMC journals are listed in PubMed and archived at PubMed Central.

    For information about publishing your research in BMC journals or any BioMed Central journal, go to

    http://www.biomedcentral.com/info/authors/

    BMC Bioinformatics

    2008 Zhou et al. , licensee BioMed Central Ltd.This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),

    which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    mailto:[email protected]:[email protected]:[email protected]://www.biomedcentral.com/1471-2105/9/546http://www.biomedcentral.com/info/authors/http://creativecommons.org/licenses/by/2.0http://creativecommons.org/licenses/by/2.0http://www.biomedcentral.com/info/authors/http://www.biomedcentral.com/1471-2105/9/546mailto:[email protected]:[email protected]:[email protected]
  • 8/3/2019 Barcodes Genomes

    2/29

    -1-

    Barcodesforgenomesandapplications

    FengfengZhou,VictorOlmanandYingXu*

    DepartmentofBiochemistryandMolecularBiologyandInstituteofBioinformatics,

    andBioEnergyScienceCenter(BESC),UniversityofGeorgia,Athens,GA30602,

    USA.

    Theseauthorscontributedequallytothiswork*Correspondingauthor

    Emailaddresses:

    FZ:[email protected]

    VO:[email protected]

    YX:[email protected]

  • 8/3/2019 Barcodes Genomes

    3/29

    -2-

    Abstract

    Background

    Eachgenomehasastabledistributionofthecombinedfrequencyforeachk-merand

    itsreversecomplementmeasuredinsequencefragmentsasshortas1000bpsacross

    thewholegenome,for1

  • 8/3/2019 Barcodes Genomes

    4/29

    -3-

    Background

    The challenges being faced in sorting out short genomic fragments generated by

    metagenome sequencing projects [1] pose a fundamental question: does each

    genome have a unique signature imprinted on its short sequence fragments so that

    fragmentsfromthesamegenomesinametagenomecanbeidentifiedaccurately?A

    positiveanswertothisquestioncouldhavesignificantimplicationstomanyimportant

    genomeandmetagenomeanalysisproblemssuchasidentificationofgeneticmaterial

    transferredfromotherorganisms[2]orthroughvirusinvasions[3,4],separationof

    short sequence fragments generated by metagenome sequencing into individual

    genomes[5]andphylogeneticanalysesofgenomes[6].

    Understandingthe intrinsic properties of genomesequences,either generalto allor

    specifictosomeclassesofgenomes,hasbeenthefocusofmanystudiesinthepast

    twodecades.EarlierworkincludesthediscoveryoftheperiodicitypropertyofDNA

    sequencesacrossbothprokaryoticandeukaryoticgenomes[7]andtherealizationthat

    codingsequencesfollowMarkovchainproperties[8-10].Karlinandcolleagueshave

    studiedvariousgenomepropertiesbasedonanalysesofk-merfrequencydistributions,

    and have observed that the di-nucleotide relative abundance, a normalized di-mer

    frequency with respect to the mono-mer frequencies, is generally stable across a

    genomemeasuredon50Kbase-pair(bp)fragments[11-13].Theyevensuggestedthat

    such normalized di-mer frequency distributions can possibly serve as signatures of

    genomes.

    Inthispaper,wepresentabarcodingschemeforallsequencedgenomes,andillustrate

    a number of interesting and useful properties of the barcodes, which we can take

    advantagetosolvechallenginggenomeanalysisproblems.Wehighlightthepowerof

    this barcoding scheme through addressing two application problems: metagenome

    binningproblemandidentificationofhorizontallytransferredgenes.

  • 8/3/2019 Barcodes Genomes

    5/29

    -4-

    Results

    Barcodesandtheirproperties

    We have calculatedthe barcodefor each sequenced prokaryotic genome, using the

    followingprocedure.Foreachgenome,wepartitionitssequenceintoaseriesofnon-

    overlappingandequal-sizedfragmentsofMbps;thenforeachk-mer(1

  • 8/3/2019 Barcodes Genomes

    6/29

    -5-

    givingrisetotheverticalbandswithconsistentgreylevelsacrosseachbarcode;(b)

    thesmallfractionofthefragmentswithclearlydifferent,abnormal,barcodes

    (horizontalstripesinthebarcodes)thantherestofthegenometypicallyrepresent2-3

    specialclassesofgenes(seediscussionlater);(c)multiplechromosomesofthesame

    organismsgenerallyhavehighlysimilarbarcodes(Figure2(a))buttheyeachhave

    theiruniquepatternsofabnormalfragments;and(d)thebarcodessimilaritiestendto

    begenerallyproportionaltothegenomesphylogeneticcloseness(Figure2(b)).

    Tounderstandwhyagenomicsequencehasthebarcodeproperty,wehaveexamined

    randomnucleotidesequencesgeneratedusingdifferentmodels,includingMarkov

    chainmodelsoforderfrom0to6.Weobservedthatbarcodesforrandomnucleotide

    sequencesgeneratedusingathird-orderMarkovchainmodelaretheclosesttothe

    barcodesofgenomicsequencesintermsoftheirappearances(Additionalfile1),and

    higherorderMarkovchainmodelsdonotseemtoaddmuchtothisproperty.Hence

    webelievethatthebarcodepropertyofprokaryoticgenomesismainlyduetothe

    third-orderMarkovchainpropertyofthecodingsequencesinthegenomes,which

    countfor80-90%ofatypicalprokaryoticgenome.Itisworthnotingthatbarcodesfor

    codingandnon-codingsequencesofthesamegenomearegenerallydifferentthough

    theyshareaweaklysimilarbackbonestructurewhileeachofthesetwoclassesof

    (composite)regionsgenerallyhashighlysimilarbarcodes(Figure3).

    Extensiontoothergenomes

    Inadditiontoprokaryotes,wehavealsocalculatedthebarcodesfortheotherclasses

    of sequenced genomes, namely eukaryotic, mitochondrial, plastid and plasmid

    genomes. For eukaryotes, we studied the barcodes for four key components in

    eukaryotic genomes, namely the (composite) regions of repetitive sequences,

    promotersequences(the1000-bpupstreamregionfromeachtranslationstart),coding

  • 8/3/2019 Barcodes Genomes

    7/29

    -6-

    regions and introns, respectively (Figure 3(b)-(e)). We observed that (i) different

    regions in a high-level eukaryotic genome (e.g., human) have similar backbone

    structuresintheirbarcodes,and(ii)thebarcodesforthefourtypesofregionshave

    increasinglyhighercomplexity,goingfromrepetitivesequencestocodingregionsto

    introns and promoter sequences. This is consistent with the belief that introns and

    promotersequencesareprobablythemostinformationrichamongthefourbecauseof

    thepossiblylargenumbersofregulatoryelementstheyencode.

    Thebarcodesofthemitochondrialgenomesaregenerallyuniquecomparedtothe

    barcodesoftheothergenomesastheyhaveadistinctoverallappearance(e.g.,Figure

    3(h)-(i)).Theirsimilarappearancemaybetheresultofallmitochondriaoriginating

    fromProteobacteria.Thebarcodesofallplasmidgenomesalsotendtohavesimilar

    characteristicsamongthemselves,possiblyduetobeingundersimilarselection

    pressurecausedbytheirfrequenttransferringamongcellcultures.Thebarcodesofall

    theplastidgenomesarealsogenerallyuniquecomparedtothebarcodesoftheothers

    (e.g.,Figure3(j)-(k)).Forexample,amajorityofthemeachconsistoftwodark

    horizontalbendstowardoneendintheirbarcodesalongthegenomeaxis,whose

    correspondinggenomicregionsconsistofRNAgenessuchasribosomalRNAsand

    tRNAs,plusribosomalproteins.Thefuzzierappearanceoftheplastidbarcodes

    indicatesthattheirk-merfrequenciesalongthegenomeaxisarenotasstableasinthe

    othergenomes.Theoverallsimilarappearancesoftheplastidbarcodesmaybedueto

    alloriginatingfromtheCyanobacteria.

    Oneinterestingquestionisdodifferentclassesofgenomeshavetheirunique

    characteristicsintheirbarcodes?Ouranswerisyes,basedontheirhighlyseparable

    distributions in the feature space defined by two particular features, as shown in

    Figure4,oneofwhichmeasurestheoverallfrequencyvariationforall4-mersacross

  • 8/3/2019 Barcodes Genomes

    8/29

    -7-

    thegenomesbarcode,andtheothermeasurestheoverallsimilaritylevelamongall

    theM-bpfragmentsofthegenome,eachconsideredasavectorof4-merfrequencies.

    While Figure 2(b) indicates that barcodes generally preserve sequence-level

    similarities, Figure 4 suggests that barcodes also capture a higher-level similarity

    beyondindividualgenomesequencesimilaritiesthroughthetexturesoftheirimages,

    which are the common and unique characteristics of different classes of genomes.

    Thispropertyindicatesthatbarcodesarenotjustasimplevisualizationtool,instead

    they havecapturedsome fairly basic informationabout genomes!Fromapplication

    point of view, we believe that this feature will prove to be useful to metagenome

    analyses as fragments from different classes of genomes such as eukaryotes,

    prokaryotes or different organelle genomes, have different characteristics in their

    barcodeimages.

    Identificationofabnormalsequencefragments

    Our procedure for identifying sequence fragments with abnormal barcodes in a

    genome employs a clustering strategy to divide all the sequence fragments in a

    genomeintotwogroups:(a)alargegroupoffragmentswiththeirbarcodesallsimilar

    toeachotherand(b)therest(seeMETHODSsection).

    Usingthisprocedure,wehaveidentified30,582abnormalfragments,covering

    30,889genesacrossall thecompleteprokaryoticgenomes.Specifically28,460such

    fragments are identified in the 542 bacterial genomes, covering 28,562 genes, and

    2,122suchfragmentsareidentifiedinthe46archaealgenomes,covering2,327genes.

    We found that the percentage of fragments with abnormal barcodes ranges from

    9.40% to 32.32% across all thebacterialgenomes,withtheaveragebeing 12.85%.

    Among the 46 sequenced archaeal genomes, the percentage of fragments with

    abnormal barcodes ranges from 9.86% to 23.14%, with the average being 13.58%.

    Further information can be found from Additional file 1. The detailed frequency

  • 8/3/2019 Barcodes Genomes

    9/29

    -8-

    informationforabnormalfragmentsacrossdifferentgenomesisinAdditionalfile2

    [15].

    While we found that it is generally more challenging to study the abnormal

    fragments in eukaryotes, we did apply the same procedure to different human

    chromosomes, and found that the percentage of abnormal fragments ranges from

    10.08%to31.32%,withtheaveragebeing12.10%.

    Wehaveanalyzedtheabnormalfragmentsacrosstheprokaryoticgenomes,and

    foundthefollowing:~30%oftheabnormalfragmentscanbeexplainedintermsof(a)

    horizontalgenetransfers,(b)phageinvasionsand(c)highlyexpressedgenes,based

    on PHX-PA [16, 17] and Prophinder [18],respectively.Amongthe genes that fall

    intothis30%,6.99%arehorizontallytransferredgenes,4.97%bacteriophagegenes

    and18.90%highlyexpressedgenes,basedontheabovetwopredictionprograms

    notethatthesenumbersdonotadduptoexactly30%sincethereareoverlapsamong

    them.The genesfalling into different categories are given inAdditional file 3[15].

    We have carried out an enrichment analysis of such elements in regions with

    abnormal versus normal barcodes. We found that the highly expressed genes are

    enriched in the abnormal fragments, with the enrichment ratio >1 across all the

    genomes and the average enrichment ratio being 1.90. Similar results hold for the

    horizontallytransferredgenesandbacteriophagegenes.Allthedetaileddatacanbe

    foundinAdditionalfile2

    We noted that our estimate of the percentages of foreign fragments in

    bacterial genomes (after deducting the highly expressed genes) is in general

    agreement with the previous estimates though different information and techniques

    areusedtoderivetheestimates[19].

  • 8/3/2019 Barcodes Genomes

    10/29

    -9-

    We do not yet have an explanation for the remaining ~70% of abnormal

    fragmentsinprokaryoticgenomes ,althoughwesuspectthattheymostlyfallintothe

    samethree categoriesone reason that wecould not explain themnow ispossibly

    duetothelimitedcoverageofthecurrentdatabasesforhorizontallytransferredgenes,

    bacteriophagegenomesandhighlyexpressedgenes.Webelievethatbyusingmore

    sophisticated computational procedures, one may be able to derive the level of

    abnormalityofafragmentsbarcodeinagenome,andpossiblylinksuchinformation

    towhensuchfragmentswerehorizontallytransferred[20].

    Binningmetagenomesequence

    Theabilitytosequenceamicrobialcommunityhasledtothesequencingofatleast

    7.04 Giga bps of metagenome sequences, already 2.22 times the total complete

    genome sequences accumulated in the past two decades [21]. These metagenome

    sequences have opened many doors to new research possibilities, and have posed

    some challenging problems.Onesuch problem is determining which fragments are

    from the same organisms in a large pool ofmetagenomicfragments [22], typically

    ~1000 bps in lengths after the initial assembly using the Sanger sequencing

    techniques.

    We have applied a clustering algorithm (see METHODS section) forbinning

    sequence fragments together based on their barcode similarities, and tested the

    clustering strategy on three sets of simulated metagenome data created by cutting

    actualbacterialgenomesintofragmentsandmixingthemtogether.Thethreetestsets

    consistofall sequencefragmentsfromthreesetsofgenomes,respectively,extracted

    fromtheGenBank.Thefirstsetconsistsof11genomesrandomlyselectedfromthe

    samegenusbutfrom11differentspecies(thegenushasonly11sequencedspecies)

    whilethelasttwosetseachconsistof30and100genomesrandomlyselectedfrom30

    and 100 different bacterial genera, respectively. The genome names are given in

    Additionalfile4[15].

  • 8/3/2019 Barcodes Genomes

    11/29

    -10-

    Toassessthebinningabilityofouralgorithmasafunctionofthefragmentsize,

    wehaveconsideredfragmentsizeM=1000,2000,5000and10000.Totestthelimit

    ofourbinningalgorithm,wehavealsoconsideredM=500.Foreachsetofgenomes,

    wepartitionedeachgenomeintofragmentsofsizeM,andthenmixedthefragments

    ofthesamelengthintoonepool.Wethencalculatedthebarcodeforeachfragment,

    anddida clusteringanalysis,assumingthatthenumberofgenomesineachpoolis

    known (this information is derivable from the 16S rRNAs). We have carried out

    binningpredictions,onedirectlyonthegeneratedfragmentsandoneonareducedset

    ofgeneratedfragments,inwhichweremove10%ofthefragmentsfromeachgenome

    whose barcodes are most different from the average barcode of the genome. The

    consideration is that each bacterial genome has ~13% of fragments with abnormal

    barcodesonaverage,whicharenotexpectedtobebinnedcorrectlywiththerestof

    theirhostgenome.Thiswaywecanmoreaccuratelyassessthebinningabilityofour

    algorithm.Table1givesthebinningresultsonthethreesetsofsyntheticmetagenome

    data,boththeoriginalsetandthereducedset.

    Fromthetable,wecanseethatthebinningaccuracy(intothecorrectgenomes)

    ishighforfragmentsizeM=1000andabove,atboththespeciesandthegenuslevel.

    Fromthetable,weseethatthereisadropinthebinningaccuracywhenthenumberof

    the underlying genomes is increased from 30 to 100. This indicates the increased

    complexityoftheproblemasafunctionofthenumberofunderlyinggenomes.

    Wehavecomparedourbinningperformancewiththepublishedresultsbythe

    bestavailablealgorithmPhyloPythia[5].Ourcomparisonindicatesthatouralgorithm

    gives consistently more accurate and more specific binning results across different

    fragmentsizes.Forexample,atthespecieslevel,ouralgorithmhasbetterthan50%

    accuracyonourtestsetwhenthefragmentsizeisatleast2000bpswhilenobinning

    results at the species level is given in McHardy et al. [5]. At the genus level, the

    accuracy (the average of binning specificity and sensitivity, extracted from Figure

  • 8/3/2019 Barcodes Genomes

    12/29

    -11-

    1(a)and(b)inMcHardyetal.[5])byPhyloPythiais45.5%forfragmentsize1000,

    56%for2000,74%for5000and82.5%for10000(nodataisprovidedfor500),all

    measuredin termsof putting fragments into the correctgenerawhileours isto the

    correctgenomesandwithmoreaccuratebinningresults.Itshouldbenotedthatthe

    testsetusedbyPhyloPythiaisdifferentthanours,whichmayaffecttheperformance

    statisticssomewhatthoughwesuspectthatwillbeinsignificant,consideringthesizes

    of the test sets. Another key difference between the two algorithms is that while

    PhyloPythia is a supervised learning algorithm, which requires a training set, our

    algorithmdoesnotrequireatrainingset,andhenceitismoregeneral.

    Onethingworthnotingisthataprokaryoticgenome,onaverage,has~13-14%

    ofabnormalsequencefragments, when the fragmentsize isM = 1,000, suggesting

    that the theoretical limit for binning accuracy should be no better than 86-87%.

    Similarlywe expect that thetheoretical limitsof binning accuracy for2,000, 5,000

    and 10,000 fragments-based binning should, in general, be no better than 87.36%,

    87.58%and88.4%,respectively.

    DiscussionandConclusion

    A natural question is do all nucleotide sequences have the barcode property like

    genomesequenceshave?Theanswerisno,basedonthelargenumberofrandomly

    generatedsequencesthatwehaveexamined.Figure1(e)showsatypicalbarcodeofa

    randomsequencegeneratedusingazeroth

    orderMarkovchainmodel.Wefoundthat

    noneofthesogeneratednucleotidesequenceshastheverticalbandstructuresasin

    genomes barcodes. More generally, barcodes for genomes and the randomly

    generatednucleotide sequences have different characteristics as shown in Figure 5.

    Thebarcodeanalysesinthispaperaremainlybasedondatafromprokaryotes.

    Thoughwehaveappliedthesamebarcodemodeltoeukaryotesandmadeinteresting

    observations,wesuspectthatthecurrentbarcodingschemeisrichenoughtocapture

  • 8/3/2019 Barcodes Genomes

    13/29

    -12-

    all the complexity of eukaryotes. Further studies along this direction are clearly

    needed.

    We believe that for many genome analysis problems, particularly for

    prokaryoticgenomes,thebarcodesprovideanatural,intuitive,information-richand

    unified framework for studying them. Further applications of this capability to

    numerousgenomeanalysisproblemscanbe envisioned,such asphylogenystudies,

    particularly for genomes without obvious marker genes such as viruses, more

    thorough examination of different types of genomic regions in eukaryotes, their

    structures andorganization, further studies of horizontalgene transfers,assistingin

    genome assembly of higher-order organisms (e.g.,populus which we are currently

    workingon)andpossiblymanymore.Webelievethatwehaveonlybegunexploring

    thetruepowerofthisnewcapabilityforgenomestudies.

    MethodsMappingfrequenciestogreylevels

    Thefrequencyofeach k-merismappedtoagreylevelasfollows.Wefirstcountthe

    frequencyofeachk-meracrossallprokaryoticgenomes,andsortthefrequencylist

    S[1:N(k)]intheincreasingorderofthefrequencieswithN(k)beingthenumberofk-

    mers.WethenfindanintegerL,L>0,andpartitionS[1:N(k)]intoLsub-listssothat

    the following function is minimized:=

    =

    Li

    i

    i SS1

    )( , where iS is the sum of all

    frequencies in the ith

    sub-list, S is the average of S, andL is a parameter to be

    determined bytheminimizationresult. ForM=1000and k=4,wefoundL=14

    gives the best valuefor the above objective function. The computedpartition of S

    givesamappingoffrequenciestothegreylevels.Notethatthismappingisgenome-

    independent so each grey level in the barcodes has the same meaning in different

    genomes.

  • 8/3/2019 Barcodes Genomes

    14/29

    -13-

    Barcodesimilaritycalculation

    We define the distance (or dissimilarity) between two barcodes based on their

    simplified representations, each of which is a matrix having the same number of

    columnsofthebarcodeandthenumberofgreylevels,L,usedinbarcodeimagesas

    the number of rows; each element in the matrix represents the frequency of the

    correspondinggreylevelacrosseachcolumninthebarcode.Fortwosuchmatrices

    M1andM2withKcolumnsandLrows,wedefinetheirbarcodedistanceas

    2

    2

    1 1

    1 )),(),(( jiMjiML

    i

    K

    j

    = =

    .

    ClearlythisisageneralizationoftheEuclideandistancebetweentwovectorsofthe

    averaged k-mer frequencies across each genome, widely used for genome

    comparisonsasintheworkofKarlinandcolleagues[13,16,23,24]andmanyothers.

    ThisisequivalenttothespecialcaseofourbarcodedistancewhenL=1.FigureS4in

    Additionalfile1providesacomparisonbetweenthetwodistances.

    Identificationofabnormalfragmentsinagenome

    We have used the following procedure to identify fragments in a genome with

    substantially different barcodes than the average barcode of the genome. The

    procedureconsistsoftwokeysteps.First,foreachk-mer,weselectthefragmentsin

    thegenomethathavethehighestorthelowestX%ofthisk-mersfrequencyamong

    all fragments, with X being a parameter. Then we sort all the fragments in the

    increasing order of the number of times they are selected in the first step, termed

    function F(p), withp being the index of a fragment. Let 0p be the fragment index

    havingthehighestsecond-orderderivativeof F(p).Weconsiderallfragmentspwith

    F(p)>F( 0p )tobethenon-nativefragmentsofthegenomeastheyhaveusedthemost

    numberofk-merswithfrequenciesthataresubstantiallydifferentthanthetypicalk-

    mer frequencies throughout the genome. We found that the abnormal fragment

  • 8/3/2019 Barcodes Genomes

    15/29

    -14-

    predictionisnotverysensitivetothedetailedvalueofXwithintherangefrom5to

    20.SowehavechosenX=10asthedefaultvalueofourprogram.

    The rationale for this procedure is that fragments with higher F(p) values

    represent fragments that have more abnormal k-mer frequencies compared to the

    average k-mer frequenciesin the genome, and hence are more probable tobenon-

    native fragments. By examining the curveof theF(p) function, wefoundthat it is

    convex with one sharp transition point 0p , indicating a transition point from the

    typicalfragmentstotheabnormalfragmentsinthegenome(seeAdditionalfile1).

    Hencewehaveusedthispointastheseparationpointbetweenthenormal(ornative)

    fragmentsandtheabnormalfragments.

    Metagenomebinningalgorithm

    Our binningprocedure startswithanapplicationof the CLUMPprogram [25] toa

    givenpooloffragments(notnecessarilyofthesamelengths)tobeclusteredbasedon

    theirbarcodesimilarities.AuniquefeatureofCLUMPisthatitisquiteaccuratein

    identifyingthecoreelementsofeachclusteraswehavepreviouslydemonstrated[25],

    though a weakness of the algorithm could be that it does not always handle the

    boundary elements accurately. Hence we have combined CLUMP with a K-means

    basedclusteringapproachthatweimplemented.Afteridentifyingtheinitialclusters

    formedbyCLUMPbasedonbarcodesimilarities,assumingthatweknowthenumber

    of clusters to be identified, we pick a seed from each predicted cluster randomly

    according to the density distribution of the cluster. Then we run the K-means

    algorithm,usingtheselectedseeds.Foreachpooloffragments,werunthistwo-step

    clustering algorithm multiple times, using a different set of seeds for each run. In

    decidingthenumberofruns,ourruleofthumbbasedonourexperienceworkingon

    themetagenomedataistouse500*(thenumberofclusters),.Foreachgivensetof

    seeds, we run the K-means algorithm 10000 iterations. Among all the computed

  • 8/3/2019 Barcodes Genomes

    16/29

    -15-

    clustering results for each pool, we choose the clustering result 1 2, , ..., KC C C that

    minimizesthefollowingfunctionasthefinalbinningresult:

    =

    K

    i CX i

    XiX1

    2)( ,

    where 1 2, , ..., KC C C isapartitionofagivenpoolofmetagenomicfragmentswitheach

    Cibeingasubsetofthepooland iX beingtheaverageofthebarcodesofallfragments

    in , 1,..., .iC i K=

    Authors'contributionsY.X.conceivedtheproject.F.Z.andV.O.analyzedthedataandperformedthe

    experiments.Y.X.supervisedthisprojectasP.I.andwrotethemanuscript.

    AcknowledgementsThisworkwas supportedby National Science Foundation (DBI-0354771, ITR-IIS-

    0407204, DBI-0542119, CCF0621700), a U.S. Department of Energy BioEnergy

    ResearchCentergrantfromtheOfficeofBiologicalandEnvironmentalResearchin

    the DOE Office of Science, and a Distinguished Scholar grant from the Georgia

    Cancer Coalition. We would like to thankthe two anonymous reviewers for their

    helpful comments on our work. We would also like to thank Ms Joan Yantko for

    preparingthemanuscript.

  • 8/3/2019 Barcodes Genomes

    17/29

    -16-

    References

    1. BackhedF,LeyRE,SonnenburgJL,PetersonDA,GordonJI:Host-

    bacterialmutualisminthehumanintestine.Science2005,

    307(5717):1915-1920.

    2. JainR,RiveraMC,LakeJA:Horizontalgenetransferamonggenomes:

    thecomplexityhypothesis.ProceedingsoftheNationalAcademyof

    SciencesoftheUnitedStatesofAmerica1999,96(7):3801-3806.

    3. FreyTK:Neurologicalaspectsofrubellavirusinfection.Intervirology

    1997,40(2-3):167-175.

    4. RybchinVN,SvarchevskyAN:TheplasmidprophageN15:alinearDNA

    withcovalentlyclosedends.MolMicrobiol1999,33(5):895-903.

    5. McHardyAC,MartinHG,TsirigosA,HugenholtzP,RigoutsosI:

    Accuratephylogeneticclassificationofvariable-lengthDNAfragments.

    NatMethods2007,4(1):63-72.

    6. YangE,BinW,PengJ,ZhangX,WangJ,YangJ,DongJ,ChuY,Zhang

    J,JinQ:ComparativegenomicsandphylogeneticanalysisofS.

    dysenteriaesubgroup.SciChinaCLifeSci2005,48(4):406-413.

    7. TrifonovEN,SussmanJL:ThepitchofchromatinDNAisreflectedinits

    nucleotidesequence.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica1980,77(7):3816-3820.

    8. BorodovskyM,SprizhitskiiY,GolovanovE,AleksandrovA:Statistical

    patternsinprimarystructuresoffunctionalregionsintheE.coligenome.

    I.Oligonucleotidefrequenciesanalysis.MolecularBiology1986,20:826-

    833.

    9. BorodovskyM,SprizhitskiiY,GolovanovE,AleksandrovA:Statistical

    patternsinprimarystructuresoffunctionalregionsintheE.coligenome.

    II.Non-homogeneousMarkovmodels.MolecularBiology1986,20:833-

    840.

    10. BorodovskyM,SprizhitskiiY,GolovanovE,AleksandrovA:Statistical

    patternsinprimarystructuresoffunctionalregionsintheE.coligenome.

    III.Computerrecognitionofcodingregions.MolecularBiology1986,

    20:1145-1150.

    11. KarlinS,BurgeC:Dinucleotiderelativeabundanceextremes:agenomic

    signature.TrendsGenet1995,11(7):283-290.

    12. KarlinS,ZhuZY,KarlinKD:Theextendedenvironmentof

    mononuclearmetalcentersinproteinstructures.Proceedingsofthe

  • 8/3/2019 Barcodes Genomes

    18/29

    -17-

    NationalAcademyofSciencesoftheUnitedStatesofAmerica1997,

    94(26):14225-14230.

    13. KarlinS,BrocchieriL,MrazekJ,CampbellAM,SpormannAM:A

    chimericprokaryoticancestryofmitochondriaandprimitiveeukaryotes.

    ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica1999,96(16):9190-9195.

    14. Computed_barcodes:[http://csbl.bmb.uga.edu/~ffzhou/BoDB/].

    15. Supplementary_material:[http://csbl.bmb.uga.edu/~ffzhou/BoDB/supp/].

    16. MrazekJ,BhayaD,GrossmanAR,KarlinS:Highlyexpressedandalien

    genesoftheSynechocystisgenome.NucleicAcidsRes2001,29(7):1590-

    1601.

    17. KarlinS,MrazekJ:Predictedhighlyexpressedgenesofdiverse

    prokaryoticgenomes.JBacteriol2000,182(18):5238-5250.

    18. Lima-MendezG,HeldenJV,ToussaintA,LeplaeR:Prophinder:a

    computationaltoolforprophagepredictioninpro-karyoticgenomes.

    Bioinformatics2008.

    19. OchmanH,LawrenceJG,GroismanEA:Lateralgenetransferandthe

    natureofbacterialinnovation.Nature2000,405(6784):299-304.

    20. LawrenceJG,OchmanH:Ameliorationofbacterialgenomes:ratesof

    changeandexchange.JMolEvol1997,44(4):383-397.

    21. LioliosK,MavromatisK,TavernarakisN,KyrpidesNC:TheGenomes

    OnLineDatabase(GOLD)in2007:statusofgenomicandmetagenomic

    projectsandtheirassociatedmetadata.NucleicAcidsRes2008,

    36(Databaseissue):D475-479.

    22. McHardyAC,RigoutsosI:What'sinthemix:phylogeneticclassification

    ofmetagenomesequencesamples.Currentopinioninmicrobiology2007,

    10(5):499-503.

    23. KarlinS,MrazekJ,MaJ,BrocchieriL:Predictedhighlyexpressedgenes

    inarchaealgenomes.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica2005,102(20):7303-7308.

    24. MrazekJ,KarlinS:Detectingaliengenesinbacterialgenomes.AnnNY

    AcadSci1999,870:314-329.

    25. OlmanV,MaoF,WuH,XuY:ParallelClusteringAlgorithmforLarge

    DataSetswithapplicationsinBioinformatics.IEEE/ACMTransactionson

    ComputationalBiologyandBioinformatics2007,Accepted.

    26. DeSantisTZ,Jr.,HugenholtzP,KellerK,BrodieEL,LarsenN,Piceno

    YM,PhanR,AndersenGL:NAST:amultiplesequencealignmentserver

  • 8/3/2019 Barcodes Genomes

    19/29

    -18-

    forcomparativeanalysisof16SrRNAgenes.NucleicAcidsRes2006,

    34(WebServerissue):W394-399.

    27. CormenTH,LeisersonCE,RivestRL,SteinC:Introductionto

    Algorithms,SecondEdition.Cambridge,MATheMITPress;2001.

  • 8/3/2019 Barcodes Genomes

    20/29

    -19-

    Figures

    Figure1Barcodesforfiveprokaryoticgenomes.

    (a)E.coliK-12;(b)E.coliO157;(c)chromosome1ofB.pseudomalleiK96243;(d)

    archaeanP.furiosusDSM3638;and(e)arandomnucleotidesequencegenerated

    usingazeroth

    orderMarkovchainmodel.Thex-axisforeachbarcodeisthelistofall

    4-mersarrangedinthealphabeticalorder,andthey-axisisthegenomeaxiswitheach

    pixelrepresentingafragmentofMbplong.

    Figure2Basicfeaturesofbarcodes.

    (a)Barcodedistancedistributionamongchromosomesfromthesameorganisms,

    acrossallprokaryoticandeukaryoticchromosomalgenomes.Thex-axisisthe

    barcodedistanceandthey-axisisthefrequencyofchromosomepairsofthesame

    organismhavingaparticularbarcodedistance.(b)Genomebarcodedistancesversus

    sequencesimilaritiesamongthecorresponding16SrRNAs(basedonthemultiple

    sequencealignmentgiveninDeSantisTZetal.[26]).They-axisrepresentsthe

    barcodedistance,andthex-axisisthesequenceidentityaxisbetweentwo16SrRNAs

    groupedintoninebins,wherethesequenceidentityiscalculatedastheaverage

    sequenceidentityoverall16Spairsineachbin.

    Figure3Barcodesofsomeorganisms.

    Barcodesof(a)Humanchromosome1(226.21Mbps);majorcomponentsofhuman

    chromosomeinacompositeform:(b)repetitivesequence,(c)promotersequence,(d)

    codingregionsand(e)introns;and(f)codingand(g)non-codingregionsofE.coli

    K-12.Onlya639-Kbpregionofeachsequencein(b)(g)isdisplayedsoeachpixel

    representsthesamesequencelength.639Kbpsisusedsincethisisthelengthofthe

    shortestregionamongthemall,i.e.thetotalnon-codingregionofE.coliK-12.

    Mitochondrialgenomebarcodesof(h)C.elegans(13794bps)and(i)Drosophila

    melanogaster(19517bps).Plastidgenomebarcodesof(j)aquaticplant

  • 8/3/2019 Barcodes Genomes

    21/29

    -20-

    Ceratophyllumdemersum(156252bps)and(k)landplantPopulustrichocarpa

    (157033bps).

    Figure4Barcodesinfeaturespace.Thex-axisistheaverageofvariationsofthe4-merfrequenciesacrossawhole

    genomeacrossall4-mers,andthey-axismeasuresthesimilaritylevelamongall

    1000-bppartitionedfragmentsofthegenome,eachrepresentedasa136-dimensional

    vectorof4-merfrequencies;Specifically,foreachgenome,webuildaminimum

    spanningtree[27]basedonthe4-merfrequencyvectorsforitssequencefragments

    andtheirdistances.They-axisistheaveragedweight(distance)ofalledgesinthe

    minimumspanningtree.Thegreendotsrepresentprokaryotes(586genomes),the

    blueonesforeukaryotes(83chromosomes),theredonesforplastids(101genomes

    withlengths>20,000bps),thebrownonesforplasmidsofprokaryoticgenomes(237

    plasmids>20,000bps)andtheblackformitochondria(120genomeswithlengths>

    20,000bps).

    Figure5Distributionofratiosbetweenbarcodevariationsofallprokaryoticgenomesandtheircorrespondingrandomlygeneratednucleotidesequences.

    Foreachgenome,acorrespondingrandomnucleotidesequenceisdefinedasa

    randomsequenceofthesamelengthandwiththesamemono-nucleotidefrequencies

    asthoseofthegenome,generatedusingazeroth

    orderMarkovchainmodel.The

    variationofabarcodeisdefinedasthestandarddeviationofthelistoftheaveraged

    frequenciesofallthek-mersalongthegenome.Thex-axisistheratioofthebarcode

    variationsbetweenagenomeandacorrespondingrandomsequence,andthey-axis

    representsthefrequencyofcaseswithaparticularvariationratio.

  • 8/3/2019 Barcodes Genomes

    22/29

    -21-

    Tables

    Table1Binningaccuraciesofourbarcode-basedclusteringalgorithm.

    11genomes 30genomes 100genomes

    Originalgenomes

    Filteredgenomes

    Originalgenomes

    Filteredgenomes

    Originalgenomes

    Filteredgenomes

    FS=500bps 71.10% 77.30% 51.6% 55.70% 40.50% 41.10%

    FS=1000bps 79.90% 85.90% 65.30% 70.30% 51.10% 52.60%

    FS=2000bps 86.30% 91.70% 74.80% 80.60% 61.00% 68.53%

    FS=5000bps 91.10% 98.10% 86.60% 93.20% 79.40% 81.90%

    FS=10000bps 95.80% 99.30% 91.90% 97.50% 86.60% 89.18%

    Thebinningaccuracyisdefinedas(predictionspecificity+predictionsensitivity)/2,

    andFSisforfragmentsize,whereboththe specificityandsensitivityaremeasuredin

    terms of putting the fragments into the correct bin corresponding to each genome,

    definedbythemajorityofthefragmentsinthebin.ThecolumnOriginalgenomes

    liststhebinning accuracy of ouralgorithmonall thenon-overlapping fragmentsin

    eachgroupof genomes,andthecolumnFilteredgenomesgivestheaccuracyafter

    removingthe10%fragmentswiththemostabnormalbarcodesfromeachgenome.

  • 8/3/2019 Barcodes Genomes

    23/29

    -22-

    AdditionalfilesAdditionalfile1

    Fileformat:DOC

    Title:Supplementarymaterial.

    Description:Supplementarymaterial1-3.

    Additionalfile2

    Fileformat:XLS

    Title:SupplementaryTable1.

    Description:HXishighlyexpressedgene,HTishorizontallytransferredgene,and

    PHisthephagegene.UnknownGeneconsistsofgeneswithinfragmentsofabnormal

    barcodesbutdonotbelongtotheabovethreecategories.

    Additionalfile3

    Fileformat:ZIPcompressedTXT

    Title:SupplementaryTable2.

    Description:Geneclassificationsofalltheprokaryoticgenomes.

    Additionalfile4

    Fileformat:XLS

    Title:SupplementaryTable3

    Description:Genomesusedinthebinningsection.

  • 8/3/2019 Barcodes Genomes

    24/29

  • 8/3/2019 Barcodes Genomes

    25/29

  • 8/3/2019 Barcodes Genomes

    26/29

  • 8/3/2019 Barcodes Genomes

    27/29

  • 8/3/2019 Barcodes Genomes

    28/29

  • 8/3/2019 Barcodes Genomes

    29/29

    Additional files provided with this submission:

    Additional file 1: barcode-suppl.doc, 737Khttp://www.biomedcentral.com/imedia/7848742892070350/supp1.docAdditional file 2: stable1.xls, 185K

    http://www.biomedcentral.com/imedia/5579318302405972/supp2.xlsAdditional file 3: stable2.zip, 4115Khttp://www.biomedcentral.com/imedia/2665914822070350/supp3.zipAdditional file 4: stable3.xls, 27Khttp://www.biomedcentral.com/imedia/7405639442070350/supp4.xls

    http://www.biomedcentral.com/imedia/7848742892070350/supp1.dochttp://www.biomedcentral.com/imedia/5579318302405972/supp2.xlshttp://www.biomedcentral.com/imedia/2665914822070350/supp3.ziphttp://www.biomedcentral.com/imedia/7405639442070350/supp4.xlshttp://www.biomedcentral.com/imedia/7405639442070350/supp4.xlshttp://www.biomedcentral.com/imedia/2665914822070350/supp3.ziphttp://www.biomedcentral.com/imedia/5579318302405972/supp2.xlshttp://www.biomedcentral.com/imedia/7848742892070350/supp1.doc